linux-arch.vger.kernel.org archive mirror
* [PATCH RESEND v3 0/2] skip redundant TLB sync IPIs
@ 2026-01-06 12:03 Lance Yang
  2026-01-06 12:03 ` [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized Lance Yang
  2026-01-06 12:03 ` [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI Lance Yang
  0 siblings, 2 replies; 12+ messages in thread
From: Lance Yang @ 2026-01-06 12:03 UTC (permalink / raw)
  To: akpm
  Cc: david, dave.hansen, dave.hansen, will, aneesh.kumar, npiggin,
	peterz, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
	ioworker0

Hi all,

When unsharing hugetlb PMD page tables or collapsing pages in khugepaged,
we send two IPIs: one for TLB invalidation, and another to synchronize
with concurrent GUP-fast walkers. However, if the TLB flush already
reached all CPUs via IPI, the second IPI is redundant. GUP-fast runs with IRQs
disabled, so when the TLB flush IPI completes, any concurrent GUP-fast
must have finished.

We now track whether IPIs were actually sent during TLB flush. We pass
the mmu_gather context through the flush path, and native_flush_tlb_multi()
sets a flag when sending IPIs. This works with PV and INVLPGB because only
native_flush_tlb_multi() sets the flag, regardless of what replaces
pv_ops.mmu.flush_tlb_multi or whether INVLPGB is available.
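
In rough pseudocode, the flow after this series is (all names as in
patch 1):

	native_flush_tlb_multi()
		/* sets tlb->tlb_flush_sent_ipi when flushing for freed tables */

	tlb_gather_remove_table_sync_one(tlb)
		/* flag set: clear it and return, no extra IPI */
		/* flag clear: send the sync IPI as before */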

David Hildenbrand did the initial implementation. I built on his work and
relied on off-list discussions to push it further - thanks a lot David!

v2 -> v3:
- Complete rewrite: use dynamic IPI tracking instead of static checks
  (per Dave Hansen, thanks!)
- Track IPIs via mmu_gather: native_flush_tlb_multi() sets flag when
  actually sending IPIs
- Motivation for skipping redundant IPIs explained by David:
  https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
- https://lore.kernel.org/linux-mm/20251229145245.85452-1-lance.yang@linux.dev/

v1 -> v2:
- Fix cover letter encoding to resolve send-email issues. Apologies for
  any email flood caused by the failed send attempts :(

RFC -> v1:
- Use a callback function in pv_mmu_ops instead of comparing function
  pointers (per David)
- Embed the check directly in tlb_remove_table_sync_one() instead of
  requiring every caller to check explicitly (per David)
- Move tlb_table_flush_implies_ipi_broadcast() outside of
  CONFIG_MMU_GATHER_RCU_TABLE_FREE to fix build error on architectures
  that don't enable this config.
  https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/
- https://lore.kernel.org/linux-mm/20251213080038.10917-1-lance.yang@linux.dev/

Lance Yang (2):
  mm/tlb: skip redundant IPI when TLB flush already synchronized
  mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI

 arch/x86/include/asm/tlb.h      |  3 ++-
 arch/x86/include/asm/tlbflush.h |  9 +++++----
 arch/x86/kernel/alternative.c   |  2 +-
 arch/x86/kernel/ldt.c           |  2 +-
 arch/x86/mm/tlb.c               | 22 +++++++++++++++------
 include/asm-generic/tlb.h       | 14 +++++++++-----
 include/linux/pgtable.h         | 13 +++++++++----
 mm/khugepaged.c                 |  9 +++------
 mm/mmu_gather.c                 | 26 ++++++++++++++++++-------
 mm/pgtable-generic.c            | 34 +++++++++++++++++++++++++++++++++
 10 files changed, 99 insertions(+), 35 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized
  2026-01-06 12:03 [PATCH RESEND v3 0/2] skip redundant TLB sync IPIs Lance Yang
@ 2026-01-06 12:03 ` Lance Yang
  2026-01-06 15:19   ` David Hildenbrand (Red Hat)
  2026-01-06 16:24   ` Dave Hansen
  2026-01-06 12:03 ` [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI Lance Yang
  1 sibling, 2 replies; 12+ messages in thread
From: Lance Yang @ 2026-01-06 12:03 UTC (permalink / raw)
  To: akpm
  Cc: david, dave.hansen, dave.hansen, will, aneesh.kumar, npiggin,
	peterz, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
	ioworker0, Lance Yang

From: Lance Yang <lance.yang@linux.dev>

When unsharing hugetlb PMD page tables, we currently send two IPIs: one
for TLB invalidation, and another to synchronize with concurrent GUP-fast
walkers via tlb_remove_table_sync_one().

However, if the TLB flush already sent IPIs to all CPUs (when freed_tables
or unshared_tables is true), the second IPI is redundant. GUP-fast runs
with IRQs disabled, so when the TLB flush IPI completes, any concurrent
GUP-fast must have finished.

To avoid the redundant IPI, we add a flag to mmu_gather to track whether
the TLB flush sent IPIs. We pass the mmu_gather pointer through the TLB
flush path via flush_tlb_info, so native_flush_tlb_multi() can set the
flag when it sends IPIs for freed_tables. We also set the flag for
local-only flushes, since disabling IRQs provides the same guarantee.
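
Roughly, the intended call flow on x86 is (all names as in the diff
below):

	tlb_flush_mmu_tlbonly(tlb)
	  tlb_flush(tlb)
	    flush_tlb_mm_range(..., tlb)
	      flush_tlb_multi() -> native_flush_tlb_multi()
	        /* sets tlb->tlb_flush_sent_ipi for freed/unshared tables */
	...
	tlb_gather_remove_table_sync_one(tlb)
	  /* clears the flag and skips the sync IPI if it was set */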

Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 arch/x86/include/asm/tlb.h      |  3 ++-
 arch/x86/include/asm/tlbflush.h |  9 +++++----
 arch/x86/kernel/alternative.c   |  2 +-
 arch/x86/kernel/ldt.c           |  2 +-
 arch/x86/mm/tlb.c               | 22 ++++++++++++++++------
 include/asm-generic/tlb.h       | 14 +++++++++-----
 mm/mmu_gather.c                 | 26 +++++++++++++++++++-------
 7 files changed, 53 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..c5950a92058c 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -20,7 +20,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		end = tlb->end;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+			   tlb->freed_tables || tlb->unshared_tables, tlb);
 }
 
 static inline void invlpg(unsigned long addr)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 00daedfefc1b..83c260c88b80 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -220,6 +220,7 @@ struct flush_tlb_info {
 	 *   will be zero.
 	 */
 	struct mm_struct	*mm;
+	struct mmu_gather	*tlb;
 	unsigned long		start;
 	unsigned long		end;
 	u64			new_tlb_gen;
@@ -305,23 +306,23 @@ static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; }
 #endif
 
 #define flush_tlb_mm(mm)						\
-		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true)
+		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true, NULL)
 
 #define flush_tlb_range(vma, start, end)				\
 	flush_tlb_mm_range((vma)->vm_mm, start, end,			\
 			   ((vma)->vm_flags & VM_HUGETLB)		\
 				? huge_page_shift(hstate_vma(vma))	\
-				: PAGE_SHIFT, true)
+				: PAGE_SHIFT, true, NULL)
 
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
-				bool freed_tables);
+				bool freed_tables, struct mmu_gather *tlb);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
-	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
+	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false, NULL);
 }
 
 static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 28518371d8bf..006f3705b616 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2572,7 +2572,7 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
 	 */
 	flush_tlb_mm_range(text_poke_mm, text_poke_mm_addr, text_poke_mm_addr +
 			   (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
-			   PAGE_SHIFT, false);
+			   PAGE_SHIFT, false, NULL);
 
 	if (func == text_poke_memcpy) {
 		/*
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 0f19ef355f5f..d8494706fec5 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -374,7 +374,7 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
 	}
 
 	va = (unsigned long)ldt_slot_va(ldt->slot);
-	flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, PAGE_SHIFT, false);
+	flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, PAGE_SHIFT, false, NULL);
 }
 
 #else /* !CONFIG_MITIGATION_PAGE_TABLE_ISOLATION */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f5b93e01e347..be45976c0d16 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1374,6 +1374,9 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	else
 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
 				(void *)info, 1, cpumask);
+
+	if (info->freed_tables && info->tlb)
+		info->tlb->tlb_flush_sent_ipi = true;
 }
 
 void flush_tlb_multi(const struct cpumask *cpumask,
@@ -1403,7 +1406,7 @@ static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
 static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 			unsigned long start, unsigned long end,
 			unsigned int stride_shift, bool freed_tables,
-			u64 new_tlb_gen)
+			u64 new_tlb_gen, struct mmu_gather *tlb)
 {
 	struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
 
@@ -1433,6 +1436,7 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 	info->new_tlb_gen	= new_tlb_gen;
 	info->initiating_cpu	= smp_processor_id();
 	info->trim_cpumask	= 0;
+	info->tlb		= tlb;
 
 	return info;
 }
@@ -1447,8 +1451,8 @@ static void put_flush_tlb_info(void)
 }
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, unsigned int stride_shift,
-				bool freed_tables)
+			unsigned long end, unsigned int stride_shift,
+			bool freed_tables, struct mmu_gather *tlb)
 {
 	struct flush_tlb_info *info;
 	int cpu = get_cpu();
@@ -1458,7 +1462,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	new_tlb_gen = inc_mm_tlb_gen(mm);
 
 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
-				  new_tlb_gen);
+				  new_tlb_gen, tlb);
 
 	/*
 	 * flush_tlb_multi() is not optimized for the common case in which only
@@ -1476,6 +1480,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 		local_irq_disable();
 		flush_tlb_func(info);
 		local_irq_enable();
+		/*
+		 * Only current CPU uses this mm, so we can treat this as
+		 * having synchronized with GUP-fast. No sync IPI needed.
+		 */
+		if (tlb && freed_tables)
+			tlb->tlb_flush_sent_ipi = true;
 	}
 
 	put_flush_tlb_info();
@@ -1553,7 +1563,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	guard(preempt)();
 
 	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
-				  TLB_GENERATION_INVALID);
+				  TLB_GENERATION_INVALID, NULL);
 
 	if (info->end == TLB_FLUSH_ALL)
 		kernel_tlb_flush_all(info);
@@ -1733,7 +1743,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	int cpu = get_cpu();
 
 	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
-				  TLB_GENERATION_INVALID);
+				  TLB_GENERATION_INVALID, NULL);
 	/*
 	 * flush_tlb_multi() is not optimized for the common case in which only
 	 * a local TLB flush is needed. Optimize this use-case by calling
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 3975f7d11553..cbbe008590ee 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -249,6 +249,7 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
 #define tlb_needs_table_invalidate() (true)
 #endif
 
+void tlb_gather_remove_table_sync_one(struct mmu_gather *tlb);
 void tlb_remove_table_sync_one(void);
 
 #else
@@ -257,6 +258,7 @@ void tlb_remove_table_sync_one(void);
 #error tlb_needs_table_invalidate() requires MMU_GATHER_RCU_TABLE_FREE
 #endif
 
+static inline void tlb_gather_remove_table_sync_one(struct mmu_gather *tlb) { }
 static inline void tlb_remove_table_sync_one(void) { }
 
 #endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
@@ -378,6 +380,12 @@ struct mmu_gather {
 	 */
 	unsigned int		fully_unshared_tables : 1;
 
+	/*
+	 * Did the TLB flush for freed/unshared tables send IPIs to all CPUs?
+	 * If true, we can skip the redundant IPI in tlb_remove_table_sync_one().
+	 */
+	unsigned int		tlb_flush_sent_ipi : 1;
+
 	unsigned int		batch_count;
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -833,13 +841,9 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
 	 *
 	 * We only perform this when we are the last sharer of a page table,
 	 * as the IPI will reach all CPUs: any GUP-fast.
-	 *
-	 * Note that on configs where tlb_remove_table_sync_one() is a NOP,
-	 * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
-	 * required IPIs already for us.
 	 */
 	if (tlb->fully_unshared_tables) {
-		tlb_remove_table_sync_one();
+		tlb_gather_remove_table_sync_one(tlb);
 		tlb->fully_unshared_tables = false;
 	}
 }
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 2faa23d7f8d4..da36de52b281 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -273,8 +273,14 @@ static void tlb_remove_table_smp_sync(void *arg)
 	/* Simply deliver the interrupt */
 }
 
-void tlb_remove_table_sync_one(void)
+void tlb_gather_remove_table_sync_one(struct mmu_gather *tlb)
 {
+	/* Skip the IPI if the TLB flush already synchronized with other CPUs */
+	if (tlb && tlb->tlb_flush_sent_ipi) {
+		tlb->tlb_flush_sent_ipi = false;
+		return;
+	}
+
 	/*
 	 * This isn't an RCU grace period and hence the page-tables cannot be
 	 * assumed to be actually RCU-freed.
@@ -285,6 +291,11 @@ void tlb_remove_table_sync_one(void)
 	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
 }
 
+void tlb_remove_table_sync_one(void)
+{
+	tlb_gather_remove_table_sync_one(NULL);
+}
+
 static void tlb_remove_table_rcu(struct rcu_head *head)
 {
 	__tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
@@ -328,7 +339,7 @@ static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
 	__tlb_remove_table(ptdesc);
 }
 
-static inline void __tlb_remove_table_one(void *table)
+static inline void __tlb_remove_table_one(void *table, struct mmu_gather *tlb)
 {
 	struct ptdesc *ptdesc;
 
@@ -336,16 +347,16 @@ static inline void __tlb_remove_table_one(void *table)
 	call_rcu(&ptdesc->pt_rcu_head, __tlb_remove_table_one_rcu);
 }
 #else
-static inline void __tlb_remove_table_one(void *table)
+static inline void __tlb_remove_table_one(void *table, struct mmu_gather *tlb)
 {
-	tlb_remove_table_sync_one();
+	tlb_gather_remove_table_sync_one(tlb);
 	__tlb_remove_table(table);
 }
 #endif /* CONFIG_PT_RECLAIM */
 
-static void tlb_remove_table_one(void *table)
+static void tlb_remove_table_one(void *table, struct mmu_gather *tlb)
 {
-	__tlb_remove_table_one(table);
+	__tlb_remove_table_one(table, tlb);
 }
 
 static void tlb_table_flush(struct mmu_gather *tlb)
@@ -367,7 +378,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT);
 		if (*batch == NULL) {
 			tlb_table_invalidate(tlb);
-			tlb_remove_table_one(table);
+			tlb_remove_table_one(table, tlb);
 			return;
 		}
 		(*batch)->nr = 0;
@@ -427,6 +438,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	tlb->vma_pfn = 0;
 
 	tlb->fully_unshared_tables = 0;
+	tlb->tlb_flush_sent_ipi = 0;
 	__tlb_reset_range(tlb);
 	inc_tlb_flush_pending(tlb->mm);
 }
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI
  2026-01-06 12:03 [PATCH RESEND v3 0/2] skip redundant TLB sync IPIs Lance Yang
  2026-01-06 12:03 ` [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized Lance Yang
@ 2026-01-06 12:03 ` Lance Yang
  2026-01-06 15:07   ` David Hildenbrand (Red Hat)
                     ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Lance Yang @ 2026-01-06 12:03 UTC (permalink / raw)
  To: akpm
  Cc: david, dave.hansen, dave.hansen, will, aneesh.kumar, npiggin,
	peterz, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
	ioworker0, Lance Yang

From: Lance Yang <lance.yang@linux.dev>

pmdp_collapse_flush() may already send IPIs to flush TLBs, and then
callers send another IPI via tlb_remove_table_sync_one() or
pmdp_get_lockless_sync() to synchronize with concurrent GUP-fast walkers.

However, since GUP-fast runs with IRQs disabled, the TLB flush IPI already
provides the necessary synchronization. We can avoid the redundant second
IPI.

Introduce pmdp_collapse_flush_sync() which combines flush and sync:

- For architectures using the generic pmdp_collapse_flush() implementation
  (e.g., x86): Use mmu_gather to track IPI sends. If the TLB flush sent
  an IPI, tlb_gather_remove_table_sync_one() will skip the redundant one.

- For architectures with custom pmdp_collapse_flush() (s390, riscv,
  powerpc): Fall back to calling pmdp_collapse_flush() followed by
  tlb_remove_table_sync_one(). No behavior change.

Update khugepaged to use pmdp_collapse_flush_sync() instead of separate
flush and sync calls. Remove the now-unused pmdp_get_lockless_sync() macro.
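
That is, the khugepaged call sites below go from, e.g.:

	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
	pmdp_get_lockless_sync();

to:

	pgt_pmd = pmdp_collapse_flush_sync(vma, haddr, pmd);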

Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 include/linux/pgtable.h | 13 +++++++++----
 mm/khugepaged.c         |  9 +++------
 mm/pgtable-generic.c    | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index eb8aacba3698..69e290dab450 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -755,7 +755,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 	return pmd;
 }
 #define pmdp_get_lockless pmdp_get_lockless
-#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
 #endif /* CONFIG_PGTABLE_LEVELS > 2 */
 #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
 
@@ -774,9 +773,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 {
 	return pmdp_get(pmdp);
 }
-static inline void pmdp_get_lockless_sync(void)
-{
-}
 #endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -1174,6 +1170,8 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 				 unsigned long address, pmd_t *pmdp);
+extern pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp);
 #else
 static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 					unsigned long address,
@@ -1182,6 +1180,13 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 	BUILD_BUG();
 	return *pmdp;
 }
+static inline pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
+					unsigned long address,
+					pmd_t *pmdp)
+{
+	BUILD_BUG();
+	return *pmdp;
+}
 #define pmdp_collapse_flush pmdp_collapse_flush
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9f790ec34400..0a98afc85c50 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1177,10 +1177,9 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * Parallel GUP-fast is fine since GUP-fast will back off when
 	 * it detects PMD is changed.
 	 */
-	_pmd = pmdp_collapse_flush(vma, address, pmd);
+	_pmd = pmdp_collapse_flush_sync(vma, address, pmd);
 	spin_unlock(pmd_ptl);
 	mmu_notifier_invalidate_range_end(&range);
-	tlb_remove_table_sync_one();
 
 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
 	if (pte) {
@@ -1663,8 +1662,7 @@ static enum scan_result try_collapse_pte_mapped_thp(struct mm_struct *mm, unsign
 			}
 		}
 	}
-	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
-	pmdp_get_lockless_sync();
+	pgt_pmd = pmdp_collapse_flush_sync(vma, haddr, pmd);
 	pte_unmap_unlock(start_pte, ptl);
 	if (ptl != pml)
 		spin_unlock(pml);
@@ -1817,8 +1815,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 * races against the prior checks.
 		 */
 		if (likely(file_backed_vma_is_retractable(vma))) {
-			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
-			pmdp_get_lockless_sync();
+			pgt_pmd = pmdp_collapse_flush_sync(vma, addr, pmd);
 			success = true;
 		}
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d3aec7a9926a..be2ee82e6fc4 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -233,6 +233,40 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return pmd;
 }
+
+pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, unsigned long address,
+			       pmd_t *pmdp)
+{
+	struct mmu_gather tlb;
+	pmd_t pmd;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	VM_BUG_ON(pmd_trans_huge(*pmdp));
+
+	tlb_gather_mmu(&tlb, vma->vm_mm);
+	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
+
+	flush_tlb_mm_range(vma->vm_mm, address, address + HPAGE_PMD_SIZE,
+			   PAGE_SHIFT, true, &tlb);
+
+	/*
+	 * Synchronize with GUP-fast. If the flush sent IPIs, skip the
+	 * redundant sync IPI.
+	 */
+	tlb_gather_remove_table_sync_one(&tlb);
+	tlb_finish_mmu(&tlb);
+	return pmd;
+}
+#else
+pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, unsigned long address,
+			       pmd_t *pmdp)
+{
+	pmd_t pmd;
+
+	pmd = pmdp_collapse_flush(vma, address, pmdp);
+	tlb_remove_table_sync_one();
+	return pmd;
+}
 #endif
 
 /* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI
  2026-01-06 12:03 ` [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI Lance Yang
@ 2026-01-06 15:07   ` David Hildenbrand (Red Hat)
  2026-01-06 15:41     ` Lance Yang
  2026-01-07  9:46   ` kernel test robot
  2026-01-07 10:52   ` kernel test robot
  2 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-06 15:07 UTC (permalink / raw)
  To: Lance Yang, akpm
  Cc: dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz,
	tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
	ioworker0

On 1/6/26 13:03, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
> 
> pmdp_collapse_flush() may already send IPIs to flush TLBs, and then
> callers send another IPI via tlb_remove_table_sync_one() or
> pmdp_get_lockless_sync() to synchronize with concurrent GUP-fast walkers.
> 
> However, since GUP-fast runs with IRQs disabled, the TLB flush IPI already
> provides the necessary synchronization. We can avoid the redundant second
> IPI.
> 
> Introduce pmdp_collapse_flush_sync() which combines flush and sync:
> 
> - For architectures using the generic pmdp_collapse_flush() implementation
>    (e.g., x86): Use mmu_gather to track IPI sends. If the TLB flush sent
>    an IPI, tlb_gather_remove_table_sync_one() will skip the redundant one.
> 
> - For architectures with custom pmdp_collapse_flush() (s390, riscv,
>    powerpc): Fall back to calling pmdp_collapse_flush() followed by
>    tlb_remove_table_sync_one(). No behavior change.
> 
> Update khugepaged to use pmdp_collapse_flush_sync() instead of separate
> flush and sync calls. Remove the now-unused pmdp_get_lockless_sync() macro.
> 
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>   include/linux/pgtable.h | 13 +++++++++----
>   mm/khugepaged.c         |  9 +++------
>   mm/pgtable-generic.c    | 34 ++++++++++++++++++++++++++++++++++
>   3 files changed, 46 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index eb8aacba3698..69e290dab450 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -755,7 +755,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>   	return pmd;
>   }
>   #define pmdp_get_lockless pmdp_get_lockless
> -#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
>   #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>   #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>   
> @@ -774,9 +773,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>   {
>   	return pmdp_get(pmdp);
>   }
> -static inline void pmdp_get_lockless_sync(void)
> -{
> -}
>   #endif
>   
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -1174,6 +1170,8 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm,
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
>   				 unsigned long address, pmd_t *pmdp);
> +extern pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp);
>   #else
>   static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
>   					unsigned long address,
> @@ -1182,6 +1180,13 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
>   	BUILD_BUG();
>   	return *pmdp;
>   }
> +static inline pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
> +					unsigned long address,
> +					pmd_t *pmdp)
> +{
> +	BUILD_BUG();
> +	return *pmdp;
> +}
>   #define pmdp_collapse_flush pmdp_collapse_flush
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>   #endif
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9f790ec34400..0a98afc85c50 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1177,10 +1177,9 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>   	 * Parallel GUP-fast is fine since GUP-fast will back off when
>   	 * it detects PMD is changed.
>   	 */
> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
> +	_pmd = pmdp_collapse_flush_sync(vma, address, pmd);
>   	spin_unlock(pmd_ptl);
>   	mmu_notifier_invalidate_range_end(&range);
> -	tlb_remove_table_sync_one();

Now you issue the IPI under PTL.

[...]

> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index d3aec7a9926a..be2ee82e6fc4 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -233,6 +233,40 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
>   	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>   	return pmd;
>   }
> +
> +pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, unsigned long address,
> +			       pmd_t *pmdp)
> +{
> +	struct mmu_gather tlb;
> +	pmd_t pmd;
> +
> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +	VM_BUG_ON(pmd_trans_huge(*pmdp));
> +
> +	tlb_gather_mmu(&tlb, vma->vm_mm);

Should we be using the new tlb_gather_mmu_vma(), and do we have to set 
the TLB pagesize to PMD?

-- 
Cheers

David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized
  2026-01-06 12:03 ` [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized Lance Yang
@ 2026-01-06 15:19   ` David Hildenbrand (Red Hat)
  2026-01-06 16:10     ` Lance Yang
  2026-01-06 16:24   ` Dave Hansen
  1 sibling, 1 reply; 12+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-06 15:19 UTC (permalink / raw)
  To: Lance Yang, akpm
  Cc: dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz,
	tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
	ioworker0

>   
>   static void tlb_table_flush(struct mmu_gather *tlb)
> @@ -367,7 +378,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
>   		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT);
>   		if (*batch == NULL) {
>   			tlb_table_invalidate(tlb);
> -			tlb_remove_table_one(table);
> +			tlb_remove_table_one(table, tlb);
>   			return;
>   		}
>   		(*batch)->nr = 0;
> @@ -427,6 +438,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
>   	tlb->vma_pfn = 0;
>   
>   	tlb->fully_unshared_tables = 0;
> +	tlb->tlb_flush_sent_ipi = 0;
>   	__tlb_reset_range(tlb);
>   	inc_tlb_flush_pending(tlb->mm);
>   }

But when would we have to reset tlb->tlb_flush_sent_ipi = 0 later? 
That's where it gets tricky. Just imagine the MMU gather gets reused later.

Also,

	+	if (info->freed_tables && info->tlb)
	+		info->tlb->tlb_flush_sent_ipi = true;

in native_flush_tlb_multi() misses the fact that we have different 
flushing types for removed/unshared tables vs. other flushes.

So this approach here certainly gets more complicated and error prone.

tlb_table_flush_implies_ipi_broadcast() was clearer in that regard: if 
you flushed the TLB after removing/unsharing tables, the IPI for 
handling page tables can be skipped. It's on the code flow to ensure that.

What could work is tracking "tlb_table_flush_sent_ipi" really when we 
are flushing the TLB for removed/unshared tables, and maybe resetting it 
... I don't know when from the top of my head.

v2 was simpler IMHO.

-- 
Cheers

David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI
  2026-01-06 15:07   ` David Hildenbrand (Red Hat)
@ 2026-01-06 15:41     ` Lance Yang
  0 siblings, 0 replies; 12+ messages in thread
From: Lance Yang @ 2026-01-06 15:41 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz,
	tglx, mingo, bp, x86, hpa, arnd, akpm, lorenzo.stoakes, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
	ioworker0



On 2026/1/6 23:07, David Hildenbrand (Red Hat) wrote:
> On 1/6/26 13:03, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> pmdp_collapse_flush() may already send IPIs to flush TLBs, and then
>> callers send another IPI via tlb_remove_table_sync_one() or
>> pmdp_get_lockless_sync() to synchronize with concurrent GUP-fast walkers.
>>
>> However, since GUP-fast runs with IRQs disabled, the TLB flush IPI 
>> already
>> provides the necessary synchronization. We can avoid the redundant second
>> IPI.
>>
>> Introduce pmdp_collapse_flush_sync() which combines flush and sync:
>>
>> - For architectures using the generic pmdp_collapse_flush() 
>> implementation
>>    (e.g., x86): Use mmu_gather to track IPI sends. If the TLB flush sent
>>    an IPI, tlb_gather_remove_table_sync_one() will skip the redundant 
>> one.
>>
>> - For architectures with custom pmdp_collapse_flush() (s390, riscv,
>>    powerpc): Fall back to calling pmdp_collapse_flush() followed by
>>    tlb_remove_table_sync_one(). No behavior change.
>>
>> Update khugepaged to use pmdp_collapse_flush_sync() instead of separate
>> flush and sync calls. Remove the now-unused pmdp_get_lockless_sync() 
>> macro.
>>
>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>>   include/linux/pgtable.h | 13 +++++++++----
>>   mm/khugepaged.c         |  9 +++------
>>   mm/pgtable-generic.c    | 34 ++++++++++++++++++++++++++++++++++
>>   3 files changed, 46 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index eb8aacba3698..69e290dab450 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -755,7 +755,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>       return pmd;
>>   }
>>   #define pmdp_get_lockless pmdp_get_lockless
>> -#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
>>   #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>>   #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>> @@ -774,9 +773,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>   {
>>       return pmdp_get(pmdp);
>>   }
>> -static inline void pmdp_get_lockless_sync(void)
>> -{
>> -}
>>   #endif
>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> @@ -1174,6 +1170,8 @@ static inline void pudp_set_wrprotect(struct 
>> mm_struct *mm,
>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>   extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
>>                    unsigned long address, pmd_t *pmdp);
>> +extern pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
>> +                 unsigned long address, pmd_t *pmdp);
>>   #else
>>   static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
>>                       unsigned long address,
>> @@ -1182,6 +1180,13 @@ static inline pmd_t pmdp_collapse_flush(struct 
>> vm_area_struct *vma,
>>       BUILD_BUG();
>>       return *pmdp;
>>   }
>> +static inline pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
>> +                    unsigned long address,
>> +                    pmd_t *pmdp)
>> +{
>> +    BUILD_BUG();
>> +    return *pmdp;
>> +}
>>   #define pmdp_collapse_flush pmdp_collapse_flush
>>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>   #endif
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 9f790ec34400..0a98afc85c50 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1177,10 +1177,9 @@ static enum scan_result 
>> collapse_huge_page(struct mm_struct *mm, unsigned long a
>>        * Parallel GUP-fast is fine since GUP-fast will back off when
>>        * it detects PMD is changed.
>>        */
>> -    _pmd = pmdp_collapse_flush(vma, address, pmd);
>> +    _pmd = pmdp_collapse_flush_sync(vma, address, pmd);
>>       spin_unlock(pmd_ptl);
>>       mmu_notifier_invalidate_range_end(&range);
>> -    tlb_remove_table_sync_one();
> 
> Now you issue the IPI under PTL.
We already send the TLB flush IPI under the PTL today, e.g. in
try_collapse_pte_mapped_thp():

	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
	pmdp_get_lockless_sync();
	pte_unmap_unlock(start_pte, ptl);

But anyway, we can do better by passing ptl in and unlocking
before the sync IPI ;)
> 
> [...]
> 
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index d3aec7a9926a..be2ee82e6fc4 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -233,6 +233,40 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct 
>> *vma, unsigned long address,
>>       flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>       return pmd;
>>   }
>> +
>> +pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, unsigned 
>> long address,
>> +                   pmd_t *pmdp)
>> +{
>> +    struct mmu_gather tlb;
>> +    pmd_t pmd;
>> +
>> +    VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> +    VM_BUG_ON(pmd_trans_huge(*pmdp));
>> +
>> +    tlb_gather_mmu(&tlb, vma->vm_mm);
> 
> Should we be using the new tlb_gather_mmu_vma(), and do we have to set 
> the TLB pagesize to PMD?

Yes, good point on tlb_gather_mmu_vma()!

So, the sequence will be:

	tlb_gather_mmu_vma(&tlb, vma);
	pmd = pmdp_huge_get_and_clear(...);
	flush_tlb_mm_range(..., &tlb);
	if (ptl)
		spin_unlock(ptl);
	tlb_gather_remove_table_sync_one(&tlb);
	tlb_finish_mmu(&tlb);

Thanks,
Lance

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized
  2026-01-06 15:19   ` David Hildenbrand (Red Hat)
@ 2026-01-06 16:10     ` Lance Yang
  2026-01-07  6:37       ` Lance Yang
  0 siblings, 1 reply; 12+ messages in thread
From: Lance Yang @ 2026-01-06 16:10 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz,
	tglx, mingo, bp, x86, hpa, arnd, akpm, lorenzo.stoakes, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
	ioworker0



On 2026/1/6 23:19, David Hildenbrand (Red Hat) wrote:
>>   static void tlb_table_flush(struct mmu_gather *tlb)
>> @@ -367,7 +378,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void 
>> *table)
>>           *batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT);
>>           if (*batch == NULL) {
>>               tlb_table_invalidate(tlb);
>> -            tlb_remove_table_one(table);
>> +            tlb_remove_table_one(table, tlb);
>>               return;
>>           }
>>           (*batch)->nr = 0;
>> @@ -427,6 +438,7 @@ static void __tlb_gather_mmu(struct mmu_gather 
>> *tlb, struct mm_struct *mm,
>>       tlb->vma_pfn = 0;
>>       tlb->fully_unshared_tables = 0;
>> +    tlb->tlb_flush_sent_ipi = 0;
>>       __tlb_reset_range(tlb);
>>       inc_tlb_flush_pending(tlb->mm);
>>   }
> 
> But when would we have to reset tlb->tlb_flush_sent_ipi = 0 later? 
> That's where it gets tricky. Just imagine the MMU gather gets reused later.
> 
> Also,
> 
>      +    if (info->freed_tables && info->tlb)
>      +        info->tlb->tlb_flush_sent_ipi = true;
> 
> in native_flush_tlb_multi() misses the fact that we have different 
> flushing types for removed/unshared tables vs. other flushes.
> 
> So this approach here certainly gets more complicated and error prone.

Agreed. Tracking the flag through the mmu_gather lifecycle does get
more complicated and error-prone ...

> 
> tlb_table_flush_implies_ipi_broadcast() was clearer in that regard: if 
> you flushed the TLB after removing/unsharing tables, the IPI for 
> handling page tables can be skipped. It's on the code flow to ensure that.

v2 was definitely simpler.

> 
> What could work is tracking "tlb_table_flush_sent_ipi" really when we 
> are flushing the TLB for removed/unshared tables, and maybe resetting 
> it ... off the top of my head, I don't know when.

Not sure what's the best way forward here :(

> 
> v2 was simpler IMHO.

The main concern Dave raised was that with PV hypercalls or when
INVLPGB is available, we can't tell from a static check whether IPIs
were actually sent.

Maybe that's acceptable, or we could find a simpler way to track that ...

Open to suggestions!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized
  2026-01-06 12:03 ` [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized Lance Yang
  2026-01-06 15:19   ` David Hildenbrand (Red Hat)
@ 2026-01-06 16:24   ` Dave Hansen
  2026-01-07  2:47     ` Lance Yang
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2026-01-06 16:24 UTC (permalink / raw)
  To: Lance Yang, akpm
  Cc: david, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx,
	mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301,
	riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0

On 1/6/26 04:03, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
> 
> When unsharing hugetlb PMD page tables, we currently send two IPIs: one
> for TLB invalidation, and another to synchronize with concurrent GUP-fast
> walkers via tlb_remove_table_sync_one().
> 
> However, if the TLB flush already sent IPIs to all CPUs (when freed_tables
> or unshared_tables is true), the second IPI is redundant. GUP-fast runs
> with IRQs disabled, so when the TLB flush IPI completes, any concurrent
> GUP-fast must have finished.
> 
> To avoid the redundant IPI, we add a flag to mmu_gather to track whether
> the TLB flush sent IPIs. We pass the mmu_gather pointer through the TLB
> flush path via flush_tlb_info, so native_flush_tlb_multi() can set the
> flag when it sends IPIs for freed_tables. We also set the flag for
> local-only flushes, since disabling IRQs provides the same guarantee.

The lack of imperative voice is killing me. :)

> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> index 866ea78ba156..c5950a92058c 100644
> --- a/arch/x86/include/asm/tlb.h
> +++ b/arch/x86/include/asm/tlb.h
> @@ -20,7 +20,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
>  		end = tlb->end;
>  	}
>  
> -	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
> +	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
> +			   tlb->freed_tables || tlb->unshared_tables, tlb);
>  }

I think this hunk sums up v3 pretty well. Where there was a single boolean, now there are two. To add to that, the structure that contains the booleans is itself being passed in. The boolean is still named 'freed_tables', and is going from:

	tlb->freed_tables

which is pretty obviously correct to:

	tlb->freed_tables || tlb->unshared_tables

which is _far_ from obviously correct.

I'm at a loss for why the patch wouldn't just do this:

-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb);

I suspect these were sent out in a bit of haste, which isn't the first time I've gotten that feeling with this series.

Could we slow down, please?

>  static inline void invlpg(unsigned long addr)
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 00daedfefc1b..83c260c88b80 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -220,6 +220,7 @@ struct flush_tlb_info {
>  	 *   will be zero.
>  	 */
>  	struct mm_struct	*mm;
> +	struct mmu_gather	*tlb;
>  	unsigned long		start;
>  	unsigned long		end;
>  	u64			new_tlb_gen;

This also gives me pause.

There is a *lot* of redundant information between 'struct mmu_gather' and 'struct flush_tlb_info'. There needs to at least be a description of what the relationship is and how these relate to each other. I would have naively thought that the right move here would be to pull the mmu_gather data out at one discrete time rather than store a pointer to it.
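
E.g., one way to read that, as a rough sketch (the field name is made up):

	/* get_flush_tlb_info(): copy the one bit the flush path needs ... */
	info->flush_for_page_tables = tlb && (tlb->freed_tables ||
					      tlb->unshared_tables);

... instead of carrying a 'struct mmu_gather *' around in flush_tlb_info.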

What I see here is, I suspect, the most expedient way to do it. I'd _certainly_ have done this myself if I was just hacking something together to play with as quickly as possible.

So, in the end, I don't hate the approach here (yet). But it is almost impossible to evaluate it because the series is taking some rather egregious shortcuts and is lacking any real semblance of a refactoring effort.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized
  2026-01-06 16:24   ` Dave Hansen
@ 2026-01-07  2:47     ` Lance Yang
  0 siblings, 0 replies; 12+ messages in thread
From: Lance Yang @ 2026-01-07  2:47 UTC (permalink / raw)
  To: Dave Hansen, david
  Cc: dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp,
	x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh,
	linux-arch, linux-mm, linux-kernel, ioworker0, akpm



On 2026/1/7 00:24, Dave Hansen wrote:
> On 1/6/26 04:03, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> When unsharing hugetlb PMD page tables, we currently send two IPIs: one
>> for TLB invalidation, and another to synchronize with concurrent GUP-fast
>> walkers via tlb_remove_table_sync_one().
>>
>> However, if the TLB flush already sent IPIs to all CPUs (when freed_tables
>> or unshared_tables is true), the second IPI is redundant. GUP-fast runs
>> with IRQs disabled, so when the TLB flush IPI completes, any concurrent
>> GUP-fast must have finished.
>>
>> To avoid the redundant IPI, we add a flag to mmu_gather to track whether
>> the TLB flush sent IPIs. We pass the mmu_gather pointer through the TLB
>> flush path via flush_tlb_info, so native_flush_tlb_multi() can set the
>> flag when it sends IPIs for freed_tables. We also set the flag for
>> local-only flushes, since disabling IRQs provides the same guarantee.
> 
> The lack of imperative voice is killing me. :)

Oops.

> 
>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>> index 866ea78ba156..c5950a92058c 100644
>> --- a/arch/x86/include/asm/tlb.h
>> +++ b/arch/x86/include/asm/tlb.h
>> @@ -20,7 +20,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
>>   		end = tlb->end;
>>   	}
>>   
>> -	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
>> +	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
>> +			   tlb->freed_tables || tlb->unshared_tables, tlb);
>>   }
> 
> I think this hunk sums up v3 pretty well. Where there was a single boolean, now there are two. To add to that, the structure that contains the booleans is itself being passed in. The boolean is still named 'freed_tables', and is going from:
> 
> 	tlb->freed_tables
> 
> which is pretty obviously correct to:
> 
> 	tlb->freed_tables || tlb->unshared_tables
> 
> which is _far_ from obviously correct.
> 
> I'm at a loss for why the patch wouldn't just do this:
> 
> -	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
> +	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb);
> 
> I suspect these were sent out in a bit of haste, which isn't the first time I've gotten that feeling with this series.
> 
> Could we slow down, please?

Sorry, I went too fast ...

> 
>>   static inline void invlpg(unsigned long addr)
>> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
>> index 00daedfefc1b..83c260c88b80 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -220,6 +220,7 @@ struct flush_tlb_info {
>>   	 *   will be zero.
>>   	 */
>>   	struct mm_struct	*mm;
>> +	struct mmu_gather	*tlb;
>>   	unsigned long		start;
>>   	unsigned long		end;
>>   	u64			new_tlb_gen;
> 
> This also gives me pause.
> 
> There is a *lot* of redundant information between 'struct mmu_gather' and 'struct flush_tlb_info'. There needs to at least be a description of what the relationship is and how these relate to each other. I would have naively thought that the right move here would be to pull the mmu_gather data out at one discrete time rather than store a pointer to it.
> 
> What I see here is, I suspect, the most expedient way to do it. I'd _certainly_ have done this myself if I was just hacking something together to play with as quickly as possible.
> 
> So, in the end, I don't hate the approach here (yet). But it is almost impossible to evaluate it because the series is taking some rather egregious shortcuts and is lacking any real semblance of a refactoring effort.

The flag lifetime issue David pointed out is real, and you're right
about the messy parameters :)

And, yeah, I need to think more about those. Maybe v3 can be fixed, or maybe
v2 is actually sufficient - it's conservative but safe (no false positives).

Will take more time, thanks!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized
  2026-01-06 16:10     ` Lance Yang
@ 2026-01-07  6:37       ` Lance Yang
  0 siblings, 0 replies; 12+ messages in thread
From: Lance Yang @ 2026-01-07  6:37 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz,
	tglx, mingo, bp, x86, hpa, arnd, akpm, lorenzo.stoakes, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
	ioworker0

Hi David,

On 2026/1/7 00:10, Lance Yang wrote:
[..]
>> What could work is tracking "tlb_table_flush_sent_ipi" really when we 
>> are flushing the TLB for removed/unshared tables, and maybe resetting 
>> it ... off the top of my head, I don't know when.
>

Seems like we could fix the flag-lifetime issue (the flag going stale when
the MMU gather gets reused) by splitting the flush and the reset. This
ensures the flag stays valid between the flush and the sync.

Now tlb_flush_unshared_tables() does:
   1) __tlb_flush_mmu_tlbonly() - flush only, keeps flags alive
   2) tlb_gather_remove_table_sync_one() - can check the flag
   3) __tlb_reset_range() - reset everything after sync

Something like this:

---8<---
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 3975f7d11553..a95b054dfcca 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -415,6 +415,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
  	tlb->cleared_puds = 0;
  	tlb->cleared_p4ds = 0;
  	tlb->unshared_tables = 0;
+	tlb->tlb_flush_sent_ipi = 0;
  	/*
  	 * Do not reset mmu_gather::vma_* fields here, we do not
  	 * call into tlb_start_vma() again to set them if there is an
@@ -492,7 +493,7 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma)
  	tlb->vma_pfn |= !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP));
  }

-static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+static inline void __tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
  {
  	/*
  	 * Anything calling __tlb_adjust_range() also sets at least one of
@@ -503,6 +504,11 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
  		return;

  	tlb_flush(tlb);
+}
+
+static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+{
+	__tlb_flush_mmu_tlbonly(tlb);
  	__tlb_reset_range(tlb);
  }

@@ -824,7 +830,7 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
  	 * flush the TLB for the unsharer now.
  	 */
  	if (tlb->unshared_tables)
-		tlb_flush_mmu_tlbonly(tlb);
+		__tlb_flush_mmu_tlbonly(tlb);

  	/*
  	 * Similarly, we must make sure that concurrent GUP-fast will not
@@ -834,14 +840,16 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
  	 * We only perform this when we are the last sharer of a page table,
  	 * as the IPI will reach all CPUs: any GUP-fast.
  	 *
-	 * Note that on configs where tlb_remove_table_sync_one() is a NOP,
-	 * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
-	 * required IPIs already for us.
+	 * Use tlb_gather_remove_table_sync_one() instead of
+	 * tlb_remove_table_sync_one() to skip the redundant IPI if the
+	 * TLB flush above already sent one.
  	 */
  	if (tlb->fully_unshared_tables) {
-		tlb_remove_table_sync_one();
+		tlb_gather_remove_table_sync_one(tlb);
  		tlb->fully_unshared_tables = false;
  	}
+
+	__tlb_reset_range(tlb);
  }
  #endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
---

For khugepaged, it should be fine - it uses a local mmu_gather that
doesn't get reused. The lifetime is simply:

   tlb_gather_mmu() → flush → sync → tlb_finish_mmu()

Let me know if this addresses your concern :)

[...]

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI
  2026-01-06 12:03 ` [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI Lance Yang
  2026-01-06 15:07   ` David Hildenbrand (Red Hat)
@ 2026-01-07  9:46   ` kernel test robot
  2026-01-07 10:52   ` kernel test robot
  2 siblings, 0 replies; 12+ messages in thread
From: kernel test robot @ 2026-01-07  9:46 UTC (permalink / raw)
  To: Lance Yang, akpm
  Cc: oe-kbuild-all, david, dave.hansen, dave.hansen, will,
	aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd,
	lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, shy828301, riel, jannh,
	linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang

Hi Lance,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20260107]
[cannot apply to tip/x86/core tip/x86/mm arnd-asm-generic/master linus/master v6.19-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lance-Yang/mm-tlb-skip-redundant-IPI-when-TLB-flush-already-synchronized/20260106-200505
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260106120303.38124-3-lance.yang%40linux.dev
patch subject: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI
config: s390-allnoconfig-bpf (https://download.01.org/0day-ci/archive/20260107/202601071005.oEsmtf0J-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260107/202601071005.oEsmtf0J-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601071005.oEsmtf0J-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/khugepaged.c: In function 'collapse_huge_page':
>> mm/khugepaged.c:1180:16: error: implicit declaration of function 'pmdp_collapse_flush_sync'; did you mean 'pmdp_collapse_flush'? [-Wimplicit-function-declaration]
    1180 |         _pmd = pmdp_collapse_flush_sync(vma, address, pmd);
         |                ^~~~~~~~~~~~~~~~~~~~~~~~
         |                pmdp_collapse_flush


vim +1180 mm/khugepaged.c

  1092	
  1093	static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
  1094						   int referenced, int unmapped,
  1095						   struct collapse_control *cc)
  1096	{
  1097		LIST_HEAD(compound_pagelist);
  1098		pmd_t *pmd, _pmd;
  1099		pte_t *pte;
  1100		pgtable_t pgtable;
  1101		struct folio *folio;
  1102		spinlock_t *pmd_ptl, *pte_ptl;
  1103		enum scan_result result = SCAN_FAIL;
  1104		struct vm_area_struct *vma;
  1105		struct mmu_notifier_range range;
  1106	
  1107		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
  1108	
  1109		/*
  1110		 * Before allocating the hugepage, release the mmap_lock read lock.
  1111		 * The allocation can take potentially a long time if it involves
  1112		 * sync compaction, and we do not need to hold the mmap_lock during
  1113		 * that. We will recheck the vma after taking it again in write mode.
  1114		 */
  1115		mmap_read_unlock(mm);
  1116	
  1117		result = alloc_charge_folio(&folio, mm, cc);
  1118		if (result != SCAN_SUCCEED)
  1119			goto out_nolock;
  1120	
  1121		mmap_read_lock(mm);
  1122		result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
  1123		if (result != SCAN_SUCCEED) {
  1124			mmap_read_unlock(mm);
  1125			goto out_nolock;
  1126		}
  1127	
  1128		result = find_pmd_or_thp_or_none(mm, address, &pmd);
  1129		if (result != SCAN_SUCCEED) {
  1130			mmap_read_unlock(mm);
  1131			goto out_nolock;
  1132		}
  1133	
  1134		if (unmapped) {
  1135			/*
  1136			 * __collapse_huge_page_swapin will return with mmap_lock
  1137			 * released when it fails. So we jump out_nolock directly in
  1138			 * that case.  Continuing to collapse causes inconsistency.
  1139			 */
  1140			result = __collapse_huge_page_swapin(mm, vma, address, pmd,
  1141							     referenced);
  1142			if (result != SCAN_SUCCEED)
  1143				goto out_nolock;
  1144		}
  1145	
  1146		mmap_read_unlock(mm);
  1147		/*
  1148		 * Prevent all access to pagetables with the exception of
  1149		 * gup_fast later handled by the ptep_clear_flush and the VM
  1150		 * handled by the anon_vma lock + PG_lock.
  1151		 *
  1152		 * UFFDIO_MOVE is prevented to race as well thanks to the
  1153		 * mmap_lock.
  1154		 */
  1155		mmap_write_lock(mm);
  1156		result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
  1157		if (result != SCAN_SUCCEED)
  1158			goto out_up_write;
  1159		/* check if the pmd is still valid */
  1160		vma_start_write(vma);
  1161		result = check_pmd_still_valid(mm, address, pmd);
  1162		if (result != SCAN_SUCCEED)
  1163			goto out_up_write;
  1164	
  1165		anon_vma_lock_write(vma->anon_vma);
  1166	
  1167		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
  1168					address + HPAGE_PMD_SIZE);
  1169		mmu_notifier_invalidate_range_start(&range);
  1170	
  1171		pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
  1172		/*
  1173		 * This removes any huge TLB entry from the CPU so we won't allow
  1174		 * huge and small TLB entries for the same virtual address to
  1175		 * avoid the risk of CPU bugs in that area.
  1176		 *
  1177		 * Parallel GUP-fast is fine since GUP-fast will back off when
  1178		 * it detects PMD is changed.
  1179		 */
> 1180		_pmd = pmdp_collapse_flush_sync(vma, address, pmd);
  1181		spin_unlock(pmd_ptl);
  1182		mmu_notifier_invalidate_range_end(&range);
  1183	
  1184		pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
  1185		if (pte) {
  1186			result = __collapse_huge_page_isolate(vma, address, pte, cc,
  1187							      &compound_pagelist);
  1188			spin_unlock(pte_ptl);
  1189		} else {
  1190			result = SCAN_NO_PTE_TABLE;
  1191		}
  1192	
  1193		if (unlikely(result != SCAN_SUCCEED)) {
  1194			if (pte)
  1195				pte_unmap(pte);
  1196			spin_lock(pmd_ptl);
  1197			BUG_ON(!pmd_none(*pmd));
  1198			/*
  1199			 * We can only use set_pmd_at when establishing
  1200			 * hugepmds and never for establishing regular pmds that
  1201			 * points to regular pagetables. Use pmd_populate for that
  1202			 */
  1203			pmd_populate(mm, pmd, pmd_pgtable(_pmd));
  1204			spin_unlock(pmd_ptl);
  1205			anon_vma_unlock_write(vma->anon_vma);
  1206			goto out_up_write;
  1207		}
  1208	
  1209		/*
  1210		 * All pages are isolated and locked so anon_vma rmap
  1211		 * can't run anymore.
  1212		 */
  1213		anon_vma_unlock_write(vma->anon_vma);
  1214	
  1215		result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
  1216						   vma, address, pte_ptl,
  1217						   &compound_pagelist);
  1218		pte_unmap(pte);
  1219		if (unlikely(result != SCAN_SUCCEED))
  1220			goto out_up_write;
  1221	
  1222		/*
  1223		 * The smp_wmb() inside __folio_mark_uptodate() ensures the
  1224		 * copy_huge_page writes become visible before the set_pmd_at()
  1225		 * write.
  1226		 */
  1227		__folio_mark_uptodate(folio);
  1228		pgtable = pmd_pgtable(_pmd);
  1229	
  1230		spin_lock(pmd_ptl);
  1231		BUG_ON(!pmd_none(*pmd));
  1232		pgtable_trans_huge_deposit(mm, pmd, pgtable);
  1233		map_anon_folio_pmd_nopf(folio, pmd, vma, address);
  1234		spin_unlock(pmd_ptl);
  1235	
  1236		folio = NULL;
  1237	
  1238		result = SCAN_SUCCEED;
  1239	out_up_write:
  1240		mmap_write_unlock(mm);
  1241	out_nolock:
  1242		if (folio)
  1243			folio_put(folio);
  1244		trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
  1245		return result;
  1246	}
  1247	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI
  2026-01-06 12:03 ` [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI Lance Yang
  2026-01-06 15:07   ` David Hildenbrand (Red Hat)
  2026-01-07  9:46   ` kernel test robot
@ 2026-01-07 10:52   ` kernel test robot
  2 siblings, 0 replies; 12+ messages in thread
From: kernel test robot @ 2026-01-07 10:52 UTC (permalink / raw)
  To: Lance Yang, akpm
  Cc: oe-kbuild-all, david, dave.hansen, dave.hansen, will,
	aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd,
	lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, shy828301, riel, jannh,
	linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang

Hi Lance,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20260107]
[cannot apply to tip/x86/core tip/x86/mm linus/master v6.16-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lance-Yang/mm-tlb-skip-redundant-IPI-when-TLB-flush-already-synchronized/20260106-200505
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20260106120303.38124-3-lance.yang%40linux.dev
patch subject: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI
config: riscv-allnoconfig-bpf (https://download.01.org/0day-ci/archive/20260107/202601071153.9k8Fm05X-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260107/202601071153.9k8Fm05X-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601071153.9k8Fm05X-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/khugepaged.c: In function 'collapse_huge_page':
   mm/khugepaged.c:1180:16: error: implicit declaration of function 'pmdp_collapse_flush_sync'; did you mean 'pmdp_collapse_flush'? [-Wimplicit-function-declaration]
    1180 |         _pmd = pmdp_collapse_flush_sync(vma, address, pmd);
         |                ^~~~~~~~~~~~~~~~~~~~~~~~
         |                pmdp_collapse_flush
>> mm/khugepaged.c:1180:16: error: incompatible types when assigning to type 'pmd_t' from type 'int'
   mm/khugepaged.c: In function 'try_collapse_pte_mapped_thp':
   mm/khugepaged.c:1665:19: error: incompatible types when assigning to type 'pmd_t' from type 'int'
    1665 |         pgt_pmd = pmdp_collapse_flush_sync(vma, haddr, pmd);
         |                   ^~~~~~~~~~~~~~~~~~~~~~~~
   mm/khugepaged.c: In function 'retract_page_tables':
   mm/khugepaged.c:1818:35: error: incompatible types when assigning to type 'pmd_t' from type 'int'
    1818 |                         pgt_pmd = pmdp_collapse_flush_sync(vma, addr, pmd);
         |                                   ^~~~~~~~~~~~~~~~~~~~~~~~
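
The follow-on "incompatible types when assigning to type 'pmd_t' from type
'int'" errors at all three call sites are a knock-on effect of the missing
prototype: with no declaration in scope, GCC falls back to the old implicit-int
rule and treats the call as returning int, which cannot be assigned to a pmd_t.
Making a declaration (or a generic fallback definition) visible to this
configuration would clear both classes of error. Below is a minimal sketch of
such a fallback, assuming the helper keeps the pmdp_collapse_flush() signature
and simply pairs the flush with the existing GUP-fast synchronization when no
IPI-tracking variant is provided (an assumption here, not the patch's actual
code):

	/*
	 * Minimal sketch only, not the patch's actual code: assumes the
	 * pmdp_collapse_flush() signature and falls back to the flush plus
	 * an explicit GUP-fast synchronization.
	 */
	static inline pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
						     unsigned long address,
						     pmd_t *pmdp)
	{
		pmd_t pmd = pmdp_collapse_flush(vma, address, pmdp);

		/* GUP-fast runs with IRQs disabled; the IPI here waits it out. */
		tlb_remove_table_sync_one();
		return pmd;
	}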


vim +1180 mm/khugepaged.c

  1092	
  1093	static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
  1094						   int referenced, int unmapped,
  1095						   struct collapse_control *cc)
  1096	{
  1097		LIST_HEAD(compound_pagelist);
  1098		pmd_t *pmd, _pmd;
  1099		pte_t *pte;
  1100		pgtable_t pgtable;
  1101		struct folio *folio;
  1102		spinlock_t *pmd_ptl, *pte_ptl;
  1103		enum scan_result result = SCAN_FAIL;
  1104		struct vm_area_struct *vma;
  1105		struct mmu_notifier_range range;
  1106	
  1107		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
  1108	
  1109		/*
  1110		 * Before allocating the hugepage, release the mmap_lock read lock.
  1111		 * The allocation can take potentially a long time if it involves
  1112		 * sync compaction, and we do not need to hold the mmap_lock during
  1113		 * that. We will recheck the vma after taking it again in write mode.
  1114		 */
  1115		mmap_read_unlock(mm);
  1116	
  1117		result = alloc_charge_folio(&folio, mm, cc);
  1118		if (result != SCAN_SUCCEED)
  1119			goto out_nolock;
  1120	
  1121		mmap_read_lock(mm);
  1122		result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
  1123		if (result != SCAN_SUCCEED) {
  1124			mmap_read_unlock(mm);
  1125			goto out_nolock;
  1126		}
  1127	
  1128		result = find_pmd_or_thp_or_none(mm, address, &pmd);
  1129		if (result != SCAN_SUCCEED) {
  1130			mmap_read_unlock(mm);
  1131			goto out_nolock;
  1132		}
  1133	
  1134		if (unmapped) {
  1135			/*
  1136			 * __collapse_huge_page_swapin will return with mmap_lock
  1137			 * released when it fails. So we jump out_nolock directly in
  1138			 * that case.  Continuing to collapse causes inconsistency.
  1139			 */
  1140			result = __collapse_huge_page_swapin(mm, vma, address, pmd,
  1141							     referenced);
  1142			if (result != SCAN_SUCCEED)
  1143				goto out_nolock;
  1144		}
  1145	
  1146		mmap_read_unlock(mm);
  1147		/*
  1148		 * Prevent all access to pagetables with the exception of
  1149		 * gup_fast later handled by the ptep_clear_flush and the VM
  1150		 * handled by the anon_vma lock + PG_lock.
  1151		 *
  1152		 * UFFDIO_MOVE is prevented to race as well thanks to the
  1153		 * mmap_lock.
  1154		 */
  1155		mmap_write_lock(mm);
  1156		result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
  1157		if (result != SCAN_SUCCEED)
  1158			goto out_up_write;
  1159		/* check if the pmd is still valid */
  1160		vma_start_write(vma);
  1161		result = check_pmd_still_valid(mm, address, pmd);
  1162		if (result != SCAN_SUCCEED)
  1163			goto out_up_write;
  1164	
  1165		anon_vma_lock_write(vma->anon_vma);
  1166	
  1167		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
  1168					address + HPAGE_PMD_SIZE);
  1169		mmu_notifier_invalidate_range_start(&range);
  1170	
  1171		pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
  1172		/*
  1173		 * This removes any huge TLB entry from the CPU so we won't allow
  1174		 * huge and small TLB entries for the same virtual address to
  1175		 * avoid the risk of CPU bugs in that area.
  1176		 *
  1177		 * Parallel GUP-fast is fine since GUP-fast will back off when
  1178		 * it detects PMD is changed.
  1179		 */
> 1180		_pmd = pmdp_collapse_flush_sync(vma, address, pmd);
  1181		spin_unlock(pmd_ptl);
  1182		mmu_notifier_invalidate_range_end(&range);
  1183	
  1184		pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
  1185		if (pte) {
  1186			result = __collapse_huge_page_isolate(vma, address, pte, cc,
  1187							      &compound_pagelist);
  1188			spin_unlock(pte_ptl);
  1189		} else {
  1190			result = SCAN_NO_PTE_TABLE;
  1191		}
  1192	
  1193		if (unlikely(result != SCAN_SUCCEED)) {
  1194			if (pte)
  1195				pte_unmap(pte);
  1196			spin_lock(pmd_ptl);
  1197			BUG_ON(!pmd_none(*pmd));
  1198			/*
  1199			 * We can only use set_pmd_at when establishing
  1200			 * hugepmds and never for establishing regular pmds that
  1201			 * points to regular pagetables. Use pmd_populate for that
  1202			 */
  1203			pmd_populate(mm, pmd, pmd_pgtable(_pmd));
  1204			spin_unlock(pmd_ptl);
  1205			anon_vma_unlock_write(vma->anon_vma);
  1206			goto out_up_write;
  1207		}
  1208	
  1209		/*
  1210		 * All pages are isolated and locked so anon_vma rmap
  1211		 * can't run anymore.
  1212		 */
  1213		anon_vma_unlock_write(vma->anon_vma);
  1214	
  1215		result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
  1216						   vma, address, pte_ptl,
  1217						   &compound_pagelist);
  1218		pte_unmap(pte);
  1219		if (unlikely(result != SCAN_SUCCEED))
  1220			goto out_up_write;
  1221	
  1222		/*
  1223		 * The smp_wmb() inside __folio_mark_uptodate() ensures the
  1224		 * copy_huge_page writes become visible before the set_pmd_at()
  1225		 * write.
  1226		 */
  1227		__folio_mark_uptodate(folio);
  1228		pgtable = pmd_pgtable(_pmd);
  1229	
  1230		spin_lock(pmd_ptl);
  1231		BUG_ON(!pmd_none(*pmd));
  1232		pgtable_trans_huge_deposit(mm, pmd, pgtable);
  1233		map_anon_folio_pmd_nopf(folio, pmd, vma, address);
  1234		spin_unlock(pmd_ptl);
  1235	
  1236		folio = NULL;
  1237	
  1238		result = SCAN_SUCCEED;
  1239	out_up_write:
  1240		mmap_write_unlock(mm);
  1241	out_nolock:
  1242		if (folio)
  1243			folio_put(folio);
  1244		trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
  1245		return result;
  1246	}
  1247	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-01-07 10:53 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-06 12:03 [PATCH RESEND v3 0/2] skip redundant TLB sync IPIs Lance Yang
2026-01-06 12:03 ` [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized Lance Yang
2026-01-06 15:19   ` David Hildenbrand (Red Hat)
2026-01-06 16:10     ` Lance Yang
2026-01-07  6:37       ` Lance Yang
2026-01-06 16:24   ` Dave Hansen
2026-01-07  2:47     ` Lance Yang
2026-01-06 12:03 ` [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI Lance Yang
2026-01-06 15:07   ` David Hildenbrand (Red Hat)
2026-01-06 15:41     ` Lance Yang
2026-01-07  9:46   ` kernel test robot
2026-01-07 10:52   ` kernel test robot

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).