* [PATCH RFC 0/3] skip redundant TLB sync IPIs
@ 2025-12-13 8:00 Lance Yang
2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Lance Yang @ 2025-12-13 8:00 UTC (permalink / raw)
To: akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 1334 bytes --]
Hi all,
When unsharing hugetlb PMD page tables or collapsing pages in khugepaged,
we send two IPIs: one for TLB invalidation, and another to synchronize
with concurrent GUP-fast walkers. However, if the TLB flush already
reaches all CPUs, the second IPI is redundant. GUP-fast runs with IRQs
disabled, so when the TLB flush IPI completes, any concurrent GUP-fast
must have finished.
This series introduces a way for architectures to indicate their TLB flush
already provides full synchronization, allowing the redundant IPI to be
skipped. For now, the optimization is implemented for x86 first and applied
to both hugetlb PMD unsharing and khugepaged. I will dig into other
architectures later, and folks maintaining other architectures are welcome
to help out.
David Hildenbrand did the initial implementation. I just built on his work
and relied on off-list discussions to push it further — thanks a lot David!
Lance Yang (3):
mm/tlb: allow architectures to skip redundant TLB sync IPIs
x86/mm: implement redundant IPI elimination for PMD unsharing
mm/khugepaged: skip redundant IPI in collapse_huge_page()
arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
mm/khugepaged.c | 7 ++++++-
3 files changed, 43 insertions(+), 3 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH RFC 1/3] mm/tlb: allow architectures to skip redundant TLB sync IPIs
2025-12-13 8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
@ 2025-12-13 8:00 ` Lance Yang
2025-12-15 5:48 ` Lance Yang
2025-12-13 8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
2 siblings, 1 reply; 16+ messages in thread
From: Lance Yang @ 2025-12-13 8:00 UTC (permalink / raw)
To: akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
Lance Yang
From: Lance Yang <lance.yang@linux.dev>
When unsharing hugetlb PMD page tables, we currently send two IPIs:
one for TLB invalidation, and another to synchronize with concurrent
GUP-fast walkers.
However, if the TLB flush already reaches all CPUs, the second IPI is
redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI
completes, any concurrent GUP-fast must have finished.
Add tlb_table_flush_implies_ipi_broadcast() to let architectures indicate
their TLB flush provides full synchronization, enabling the redundant IPI
to be skipped.
The default implementation returns false to maintain current behavior.
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 324a21f53b64..3f0add95604f 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
#define tlb_needs_table_invalidate() (true)
#endif
+/*
+ * Architectures can override if their TLB flush already broadcasts IPIs to all
+ * CPUs when freeing or unsharing page tables.
+ *
+ * Return true only when the flush guarantees:
+ * - IPIs reach all CPUs with potentially stale paging-structure cache entries
+ * - Synchronization with IRQ-disabled code like GUP-fast
+ */
+#ifndef tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+ return false;
+}
+#endif
+
void tlb_remove_table_sync_one(void);
#else
@@ -829,12 +844,17 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
* We only perform this when we are the last sharer of a page table,
* as the IPI will reach all CPUs: any GUP-fast.
*
+ * However, if the TLB flush already synchronized with other CPUs
+ * (indicated by tlb_table_flush_implies_ipi_broadcast()), we can skip
+ * the additional IPI.
+ *
* Note that on configs where tlb_remove_table_sync_one() is a NOP,
* the expectation is that the tlb_flush_mmu_tlbonly() would have issued
* required IPIs already for us.
*/
if (tlb->fully_unshared_tables) {
- tlb_remove_table_sync_one();
+ if (!tlb_table_flush_implies_ipi_broadcast())
+ tlb_remove_table_sync_one();
tlb->fully_unshared_tables = false;
}
}
--
2.49.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing
2025-12-13 8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
@ 2025-12-13 8:00 ` Lance Yang
2025-12-18 13:08 ` David Hildenbrand (Red Hat)
2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
2 siblings, 1 reply; 16+ messages in thread
From: Lance Yang @ 2025-12-13 8:00 UTC (permalink / raw)
To: akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
Lance Yang
From: Lance Yang <lance.yang@linux.dev>
Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:
flush_tlb_mm_range(..., freed_tables || unshared_tables);
Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
without paravirt or INVLPGB, the TLB flush IPI already provides necessary
synchronization, allowing the second IPI to be skipped. For paravirt with
non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..96602b7b7210 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,10 +5,24 @@
#define tlb_flush tlb_flush
static inline void tlb_flush(struct mmu_gather *tlb);
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
#include <asm-generic/tlb.h>
#include <linux/kernel.h>
#include <vdso/bits.h>
#include <vdso/page.h>
+#include <asm/paravirt.h>
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+#ifdef CONFIG_PARAVIRT
+ /* Paravirt may use hypercalls that don't send real IPIs. */
+ if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
+ return false;
+#endif
+ return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
+}
static inline void tlb_flush(struct mmu_gather *tlb)
{
@@ -20,7 +34,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
end = tlb->end;
}
- flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+ flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+ tlb->freed_tables || tlb->unshared_tables);
}
static inline void invlpg(unsigned long addr)
--
2.49.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
2025-12-13 8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
2025-12-13 8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
@ 2025-12-13 8:00 ` Lance Yang
2025-12-14 13:24 ` kernel test robot
` (2 more replies)
2 siblings, 3 replies; 16+ messages in thread
From: Lance Yang @ 2025-12-13 8:00 UTC (permalink / raw)
To: akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
Lance Yang
From: Lance Yang <lance.yang@linux.dev>
Similar to the hugetlb PMD unsharing optimization, skip the second IPI
in collapse_huge_page() when the TLB flush already provides necessary
synchronization.
Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
INVLPGB, all x86 systems started sending the second IPI. However, on
native x86 this is redundant:
- pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
CPUs to invalidate TLB entries
- GUP-fast runs with IRQs disabled, so when the flush IPI completes,
any concurrent GUP-fast must have finished
- tlb_remove_table_sync_one() provides no additional synchronization
On x86, skip the second IPI when running native (without paravirt) and
without INVLPGB. For paravirt with non-native flush_tlb_multi and for
INVLPGB, conservatively keep both IPIs.
Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
optimization.
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
mm/khugepaged.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386..06ea793a8190 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
_pmd = pmdp_collapse_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
mmu_notifier_invalidate_range_end(&range);
- tlb_remove_table_sync_one();
+ /*
+ * Skip the second IPI if the TLB flush above already synchronized
+ * with concurrent GUP-fast via broadcast IPIs.
+ */
+ if (!tlb_table_flush_implies_ipi_broadcast())
+ tlb_remove_table_sync_one();
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
--
2.49.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
@ 2025-12-14 13:24 ` kernel test robot
2025-12-14 13:56 ` kernel test robot
2025-12-18 13:13 ` David Hildenbrand (Red Hat)
2 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2025-12-14 13:24 UTC (permalink / raw)
To: Lance Yang; +Cc: oe-kbuild-all
Hi Lance,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20251212]
[cannot apply to tip/x86/core arnd-asm-generic/master linus/master v6.19-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Lance-Yang/mm-tlb-allow-architectures-to-skip-redundant-TLB-sync-IPIs/20251213-160431
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251213080038.10917-4-lance.yang%40linux.dev
patch subject: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
config: arc-randconfig-001-20251214 (https://download.01.org/0day-ci/archive/20251214/202512142156.cShiu6PU-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 12.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251214/202512142156.cShiu6PU-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/khugepaged.c: In function 'collapse_huge_page':
>> mm/khugepaged.c:1185:14: error: implicit declaration of function 'tlb_table_flush_implies_ipi_broadcast' [-Werror=implicit-function-declaration]
1185 | if (!tlb_table_flush_implies_ipi_broadcast())
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/tlb_table_flush_implies_ipi_broadcast +1185 mm/khugepaged.c
1090
1091 static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
1092 int referenced, int unmapped,
1093 struct collapse_control *cc)
1094 {
1095 LIST_HEAD(compound_pagelist);
1096 pmd_t *pmd, _pmd;
1097 pte_t *pte;
1098 pgtable_t pgtable;
1099 struct folio *folio;
1100 spinlock_t *pmd_ptl, *pte_ptl;
1101 int result = SCAN_FAIL;
1102 struct vm_area_struct *vma;
1103 struct mmu_notifier_range range;
1104
1105 VM_BUG_ON(address & ~HPAGE_PMD_MASK);
1106
1107 /*
1108 * Before allocating the hugepage, release the mmap_lock read lock.
1109 * The allocation can take potentially a long time if it involves
1110 * sync compaction, and we do not need to hold the mmap_lock during
1111 * that. We will recheck the vma after taking it again in write mode.
1112 */
1113 mmap_read_unlock(mm);
1114
1115 result = alloc_charge_folio(&folio, mm, cc);
1116 if (result != SCAN_SUCCEED)
1117 goto out_nolock;
1118
1119 mmap_read_lock(mm);
1120 result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
1121 if (result != SCAN_SUCCEED) {
1122 mmap_read_unlock(mm);
1123 goto out_nolock;
1124 }
1125
1126 result = find_pmd_or_thp_or_none(mm, address, &pmd);
1127 if (result != SCAN_SUCCEED) {
1128 mmap_read_unlock(mm);
1129 goto out_nolock;
1130 }
1131
1132 if (unmapped) {
1133 /*
1134 * __collapse_huge_page_swapin will return with mmap_lock
1135 * released when it fails. So we jump out_nolock directly in
1136 * that case. Continuing to collapse causes inconsistency.
1137 */
1138 result = __collapse_huge_page_swapin(mm, vma, address, pmd,
1139 referenced);
1140 if (result != SCAN_SUCCEED)
1141 goto out_nolock;
1142 }
1143
1144 mmap_read_unlock(mm);
1145 /*
1146 * Prevent all access to pagetables with the exception of
1147 * gup_fast later handled by the ptep_clear_flush and the VM
1148 * handled by the anon_vma lock + PG_lock.
1149 *
1150 * UFFDIO_MOVE is prevented to race as well thanks to the
1151 * mmap_lock.
1152 */
1153 mmap_write_lock(mm);
1154 result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
1155 if (result != SCAN_SUCCEED)
1156 goto out_up_write;
1157 /* check if the pmd is still valid */
1158 vma_start_write(vma);
1159 result = check_pmd_still_valid(mm, address, pmd);
1160 if (result != SCAN_SUCCEED)
1161 goto out_up_write;
1162
1163 anon_vma_lock_write(vma->anon_vma);
1164
1165 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
1166 address + HPAGE_PMD_SIZE);
1167 mmu_notifier_invalidate_range_start(&range);
1168
1169 pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
1170 /*
1171 * This removes any huge TLB entry from the CPU so we won't allow
1172 * huge and small TLB entries for the same virtual address to
1173 * avoid the risk of CPU bugs in that area.
1174 *
1175 * Parallel GUP-fast is fine since GUP-fast will back off when
1176 * it detects PMD is changed.
1177 */
1178 _pmd = pmdp_collapse_flush(vma, address, pmd);
1179 spin_unlock(pmd_ptl);
1180 mmu_notifier_invalidate_range_end(&range);
1181 /*
1182 * Skip the second IPI if the TLB flush above already synchronized
1183 * with concurrent GUP-fast via broadcast IPIs.
1184 */
> 1185 if (!tlb_table_flush_implies_ipi_broadcast())
1186 tlb_remove_table_sync_one();
1187
1188 pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
1189 if (pte) {
1190 result = __collapse_huge_page_isolate(vma, address, pte, cc,
1191 &compound_pagelist);
1192 spin_unlock(pte_ptl);
1193 } else {
1194 result = SCAN_NO_PTE_TABLE;
1195 }
1196
1197 if (unlikely(result != SCAN_SUCCEED)) {
1198 if (pte)
1199 pte_unmap(pte);
1200 spin_lock(pmd_ptl);
1201 BUG_ON(!pmd_none(*pmd));
1202 /*
1203 * We can only use set_pmd_at when establishing
1204 * hugepmds and never for establishing regular pmds that
1205 * points to regular pagetables. Use pmd_populate for that
1206 */
1207 pmd_populate(mm, pmd, pmd_pgtable(_pmd));
1208 spin_unlock(pmd_ptl);
1209 anon_vma_unlock_write(vma->anon_vma);
1210 goto out_up_write;
1211 }
1212
1213 /*
1214 * All pages are isolated and locked so anon_vma rmap
1215 * can't run anymore.
1216 */
1217 anon_vma_unlock_write(vma->anon_vma);
1218
1219 result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
1220 vma, address, pte_ptl,
1221 &compound_pagelist);
1222 pte_unmap(pte);
1223 if (unlikely(result != SCAN_SUCCEED))
1224 goto out_up_write;
1225
1226 /*
1227 * The smp_wmb() inside __folio_mark_uptodate() ensures the
1228 * copy_huge_page writes become visible before the set_pmd_at()
1229 * write.
1230 */
1231 __folio_mark_uptodate(folio);
1232 pgtable = pmd_pgtable(_pmd);
1233
1234 spin_lock(pmd_ptl);
1235 BUG_ON(!pmd_none(*pmd));
1236 pgtable_trans_huge_deposit(mm, pmd, pgtable);
1237 map_anon_folio_pmd_nopf(folio, pmd, vma, address);
1238 spin_unlock(pmd_ptl);
1239
1240 folio = NULL;
1241
1242 result = SCAN_SUCCEED;
1243 out_up_write:
1244 mmap_write_unlock(mm);
1245 out_nolock:
1246 if (folio)
1247 folio_put(folio);
1248 trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
1249 return result;
1250 }
1251
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
2025-12-14 13:24 ` kernel test robot
@ 2025-12-14 13:56 ` kernel test robot
2025-12-18 13:13 ` David Hildenbrand (Red Hat)
2 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2025-12-14 13:56 UTC (permalink / raw)
To: Lance Yang; +Cc: llvm, oe-kbuild-all
Hi Lance,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20251212]
[cannot apply to tip/x86/core arnd-asm-generic/master linus/master v6.19-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Lance-Yang/mm-tlb-allow-architectures-to-skip-redundant-TLB-sync-IPIs/20251213-160431
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251213080038.10917-4-lance.yang%40linux.dev
patch subject: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
config: loongarch-randconfig-001-20251214 (https://download.01.org/0day-ci/archive/20251214/202512142105.NXwq6dfP-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251214/202512142105.NXwq6dfP-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512142105.NXwq6dfP-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/khugepaged.c:1185:7: error: call to undeclared function 'tlb_table_flush_implies_ipi_broadcast'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
1185 | if (!tlb_table_flush_implies_ipi_broadcast())
| ^
1 error generated.
vim +/tlb_table_flush_implies_ipi_broadcast +1185 mm/khugepaged.c
1090
1091 static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
1092 int referenced, int unmapped,
1093 struct collapse_control *cc)
1094 {
1095 LIST_HEAD(compound_pagelist);
1096 pmd_t *pmd, _pmd;
1097 pte_t *pte;
1098 pgtable_t pgtable;
1099 struct folio *folio;
1100 spinlock_t *pmd_ptl, *pte_ptl;
1101 int result = SCAN_FAIL;
1102 struct vm_area_struct *vma;
1103 struct mmu_notifier_range range;
1104
1105 VM_BUG_ON(address & ~HPAGE_PMD_MASK);
1106
1107 /*
1108 * Before allocating the hugepage, release the mmap_lock read lock.
1109 * The allocation can take potentially a long time if it involves
1110 * sync compaction, and we do not need to hold the mmap_lock during
1111 * that. We will recheck the vma after taking it again in write mode.
1112 */
1113 mmap_read_unlock(mm);
1114
1115 result = alloc_charge_folio(&folio, mm, cc);
1116 if (result != SCAN_SUCCEED)
1117 goto out_nolock;
1118
1119 mmap_read_lock(mm);
1120 result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
1121 if (result != SCAN_SUCCEED) {
1122 mmap_read_unlock(mm);
1123 goto out_nolock;
1124 }
1125
1126 result = find_pmd_or_thp_or_none(mm, address, &pmd);
1127 if (result != SCAN_SUCCEED) {
1128 mmap_read_unlock(mm);
1129 goto out_nolock;
1130 }
1131
1132 if (unmapped) {
1133 /*
1134 * __collapse_huge_page_swapin will return with mmap_lock
1135 * released when it fails. So we jump out_nolock directly in
1136 * that case. Continuing to collapse causes inconsistency.
1137 */
1138 result = __collapse_huge_page_swapin(mm, vma, address, pmd,
1139 referenced);
1140 if (result != SCAN_SUCCEED)
1141 goto out_nolock;
1142 }
1143
1144 mmap_read_unlock(mm);
1145 /*
1146 * Prevent all access to pagetables with the exception of
1147 * gup_fast later handled by the ptep_clear_flush and the VM
1148 * handled by the anon_vma lock + PG_lock.
1149 *
1150 * UFFDIO_MOVE is prevented to race as well thanks to the
1151 * mmap_lock.
1152 */
1153 mmap_write_lock(mm);
1154 result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
1155 if (result != SCAN_SUCCEED)
1156 goto out_up_write;
1157 /* check if the pmd is still valid */
1158 vma_start_write(vma);
1159 result = check_pmd_still_valid(mm, address, pmd);
1160 if (result != SCAN_SUCCEED)
1161 goto out_up_write;
1162
1163 anon_vma_lock_write(vma->anon_vma);
1164
1165 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
1166 address + HPAGE_PMD_SIZE);
1167 mmu_notifier_invalidate_range_start(&range);
1168
1169 pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
1170 /*
1171 * This removes any huge TLB entry from the CPU so we won't allow
1172 * huge and small TLB entries for the same virtual address to
1173 * avoid the risk of CPU bugs in that area.
1174 *
1175 * Parallel GUP-fast is fine since GUP-fast will back off when
1176 * it detects PMD is changed.
1177 */
1178 _pmd = pmdp_collapse_flush(vma, address, pmd);
1179 spin_unlock(pmd_ptl);
1180 mmu_notifier_invalidate_range_end(&range);
1181 /*
1182 * Skip the second IPI if the TLB flush above already synchronized
1183 * with concurrent GUP-fast via broadcast IPIs.
1184 */
> 1185 if (!tlb_table_flush_implies_ipi_broadcast())
1186 tlb_remove_table_sync_one();
1187
1188 pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
1189 if (pte) {
1190 result = __collapse_huge_page_isolate(vma, address, pte, cc,
1191 &compound_pagelist);
1192 spin_unlock(pte_ptl);
1193 } else {
1194 result = SCAN_NO_PTE_TABLE;
1195 }
1196
1197 if (unlikely(result != SCAN_SUCCEED)) {
1198 if (pte)
1199 pte_unmap(pte);
1200 spin_lock(pmd_ptl);
1201 BUG_ON(!pmd_none(*pmd));
1202 /*
1203 * We can only use set_pmd_at when establishing
1204 * hugepmds and never for establishing regular pmds that
1205 * points to regular pagetables. Use pmd_populate for that
1206 */
1207 pmd_populate(mm, pmd, pmd_pgtable(_pmd));
1208 spin_unlock(pmd_ptl);
1209 anon_vma_unlock_write(vma->anon_vma);
1210 goto out_up_write;
1211 }
1212
1213 /*
1214 * All pages are isolated and locked so anon_vma rmap
1215 * can't run anymore.
1216 */
1217 anon_vma_unlock_write(vma->anon_vma);
1218
1219 result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
1220 vma, address, pte_ptl,
1221 &compound_pagelist);
1222 pte_unmap(pte);
1223 if (unlikely(result != SCAN_SUCCEED))
1224 goto out_up_write;
1225
1226 /*
1227 * The smp_wmb() inside __folio_mark_uptodate() ensures the
1228 * copy_huge_page writes become visible before the set_pmd_at()
1229 * write.
1230 */
1231 __folio_mark_uptodate(folio);
1232 pgtable = pmd_pgtable(_pmd);
1233
1234 spin_lock(pmd_ptl);
1235 BUG_ON(!pmd_none(*pmd));
1236 pgtable_trans_huge_deposit(mm, pmd, pgtable);
1237 map_anon_folio_pmd_nopf(folio, pmd, vma, address);
1238 spin_unlock(pmd_ptl);
1239
1240 folio = NULL;
1241
1242 result = SCAN_SUCCEED;
1243 out_up_write:
1244 mmap_write_unlock(mm);
1245 out_nolock:
1246 if (folio)
1247 folio_put(folio);
1248 trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
1249 return result;
1250 }
1251
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 1/3] mm/tlb: allow architectures to skip redundant TLB sync IPIs
2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
@ 2025-12-15 5:48 ` Lance Yang
2025-12-18 13:08 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 16+ messages in thread
From: Lance Yang @ 2025-12-15 5:48 UTC (permalink / raw)
To: akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel
On 2025/12/13 16:00, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> When unsharing hugetlb PMD page tables, we currently send two IPIs:
> one for TLB invalidation, and another to synchronize with concurrent
> GUP-fast walkers.
>
> However, if the TLB flush already reaches all CPUs, the second IPI is
> redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI
> completes, any concurrent GUP-fast must have finished.
>
> Add tlb_table_flush_implies_ipi_broadcast() to let architectures indicate
> their TLB flush provides full synchronization, enabling the redundant IPI
> to be skipped.
>
> The default implementation returns false to maintain current behavior.
>
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
> 1 file changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 324a21f53b64..3f0add95604f 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
> #define tlb_needs_table_invalidate() (true)
> #endif
>
> +/*
> + * Architectures can override if their TLB flush already broadcasts IPIs to all
> + * CPUs when freeing or unsharing page tables.
> + *
> + * Return true only when the flush guarantees:
> + * - IPIs reach all CPUs with potentially stale paging-structure cache entries
> + * - Synchronization with IRQ-disabled code like GUP-fast
> + */
> +#ifndef tlb_table_flush_implies_ipi_broadcast
> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
> +{
> + return false;
> +}
> +#endif
As the kernel test robot reported[1][2], the compiler is unhappy with
patch #3:
```
mm/khugepaged.c: In function 'collapse_huge_page':
>> >> mm/khugepaged.c:1185:14: error: implicit declaration of function 'tlb_table_flush_implies_ipi_broadcast' [-Werror=implicit-function-declaration]
1185 | if (!tlb_table_flush_implies_ipi_broadcast())
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
```
I'll move tlb_table_flush_implies_ipi_broadcast() outside of
CONFIG_MMU_GATHER_RCU_TABLE_FREE in next version, making the complier
happy on architectures that don't enable that config ;)
[1]
https://lore.kernel.org/oe-kbuild-all/202512142105.NXwq6dfP-lkp@intel.com/
[2]
https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/
> +
> void tlb_remove_table_sync_one(void);
>
> #else
> @@ -829,12 +844,17 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
> * We only perform this when we are the last sharer of a page table,
> * as the IPI will reach all CPUs: any GUP-fast.
> *
> + * However, if the TLB flush already synchronized with other CPUs
> + * (indicated by tlb_table_flush_implies_ipi_broadcast()), we can skip
> + * the additional IPI.
> + *
> * Note that on configs where tlb_remove_table_sync_one() is a NOP,
> * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
> * required IPIs already for us.
> */
> if (tlb->fully_unshared_tables) {
> - tlb_remove_table_sync_one();
> + if (!tlb_table_flush_implies_ipi_broadcast())
> + tlb_remove_table_sync_one();
> tlb->fully_unshared_tables = false;
> }
> }
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing
2025-12-13 8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
@ 2025-12-18 13:08 ` David Hildenbrand (Red Hat)
2025-12-22 3:19 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang
0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 13:08 UTC (permalink / raw)
To: Lance Yang, akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
riel, jannh, linux-arch, linux-mm, linux-kernel
On 12/13/25 09:00, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
> ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:
>
> flush_tlb_mm_range(..., freed_tables || unshared_tables);
>
> Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
> without paravirt or INVLPGB, the TLB flush IPI already provides necessary
> synchronization, allowing the second IPI to be skipped. For paravirt with
> non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.
>
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> index 866ea78ba156..96602b7b7210 100644
> --- a/arch/x86/include/asm/tlb.h
> +++ b/arch/x86/include/asm/tlb.h
> @@ -5,10 +5,24 @@
> #define tlb_flush tlb_flush
> static inline void tlb_flush(struct mmu_gather *tlb);
>
> +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
> +static inline bool tlb_table_flush_implies_ipi_broadcast(void);
> +
> #include <asm-generic/tlb.h>
> #include <linux/kernel.h>
> #include <vdso/bits.h>
> #include <vdso/page.h>
> +#include <asm/paravirt.h>
> +
> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
> +{
> +#ifdef CONFIG_PARAVIRT
> + /* Paravirt may use hypercalls that don't send real IPIs. */
> + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
> + return false;
> +#endif
> + return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
Right, here I was wondering whether we should have a new pv_ops callback
to indicate that instead.
pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast()
Or a simple boolean property that pv init code properly sets.
Something for x86 folks to give suggestions for. :)
--
Cheers
David
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 1/3] mm/tlb: allow architectures to skip redundant TLB sync IPIs
2025-12-15 5:48 ` Lance Yang
@ 2025-12-18 13:08 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 13:08 UTC (permalink / raw)
To: Lance Yang, akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
riel, jannh, linux-arch, linux-mm, linux-kernel
On 12/15/25 06:48, Lance Yang wrote:
>
>
> On 2025/12/13 16:00, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> When unsharing hugetlb PMD page tables, we currently send two IPIs:
>> one for TLB invalidation, and another to synchronize with concurrent
>> GUP-fast walkers.
>>
>> However, if the TLB flush already reaches all CPUs, the second IPI is
>> redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI
>> completes, any concurrent GUP-fast must have finished.
>>
>> Add tlb_table_flush_implies_ipi_broadcast() to let architectures indicate
>> their TLB flush provides full synchronization, enabling the redundant IPI
>> to be skipped.
>>
>> The default implementation returns false to maintain current behavior.
>>
>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
>> 1 file changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 324a21f53b64..3f0add95604f 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
>> #define tlb_needs_table_invalidate() (true)
>> #endif
>>
>> +/*
>> + * Architectures can override if their TLB flush already broadcasts IPIs to all
>> + * CPUs when freeing or unsharing page tables.
>> + *
>> + * Return true only when the flush guarantees:
>> + * - IPIs reach all CPUs with potentially stale paging-structure cache entries
>> + * - Synchronization with IRQ-disabled code like GUP-fast
>> + */
>> +#ifndef tlb_table_flush_implies_ipi_broadcast
>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
>> +{
>> + return false;
>> +}
>> +#endif
>
> As the kernel test robot reported[1][2], the compiler is unhappy with
> patch #3:
>
> ```
> mm/khugepaged.c: In function 'collapse_huge_page':
>>>>> mm/khugepaged.c:1185:14: error: implicit declaration of function 'tlb_table_flush_implies_ipi_broadcast' [-Werror=implicit-function-declaration]
> 1185 | if (!tlb_table_flush_implies_ipi_broadcast())
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> cc1: some warnings being treated as errors
> ```
>
> I'll move tlb_table_flush_implies_ipi_broadcast() outside of
> CONFIG_MMU_GATHER_RCU_TABLE_FREE in next version, making the complier
> happy on architectures that don't enable that config ;)
Yeah, that's probably cleanest.
--
Cheers
David
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
2025-12-14 13:24 ` kernel test robot
2025-12-14 13:56 ` kernel test robot
@ 2025-12-18 13:13 ` David Hildenbrand (Red Hat)
2025-12-18 14:35 ` Lance Yang
2 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 13:13 UTC (permalink / raw)
To: Lance Yang, akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
riel, jannh, linux-arch, linux-mm, linux-kernel
On 12/13/25 09:00, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
> in collapse_huge_page() when the TLB flush already provides necessary
> synchronization.
>
> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
> unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
>
> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
> INVLPGB, all x86 systems started sending the second IPI. However, on
> native x86 this is redundant:
>
> - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
> CPUs to invalidate TLB entries
>
> - GUP-fast runs with IRQs disabled, so when the flush IPI completes,
> any concurrent GUP-fast must have finished
>
> - tlb_remove_table_sync_one() provides no additional synchronization
>
> On x86, skip the second IPI when running native (without paravirt) and
> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
> INVLPGB, conservatively keep both IPIs.
>
> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
> optimization.
>
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> mm/khugepaged.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 97d1b2824386..06ea793a8190 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> _pmd = pmdp_collapse_flush(vma, address, pmd);
> spin_unlock(pmd_ptl);
> mmu_notifier_invalidate_range_end(&range);
> - tlb_remove_table_sync_one();
> + /*
> + * Skip the second IPI if the TLB flush above already synchronized
> + * with concurrent GUP-fast via broadcast IPIs.
> + */
> + if (!tlb_table_flush_implies_ipi_broadcast())
> + tlb_remove_table_sync_one();
We end up calling
flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-> flush_tlb_mm_range(freed_tables = true)
-> flush_tlb_multi(mm_cpumask(mm), info);
So freed_tables=true and we should be doing the right thing.
BTW, I was wondering whether we should embed that
tlb_table_flush_implies_ipi_broadcast() check in
tlb_remove_table_sync_one() instead.
It then relies on the caller to do the right thing (flush with
freed_tables=true or unshared_tables = true).
Thoughts?
--
Cheers
David
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
2025-12-18 13:13 ` David Hildenbrand (Red Hat)
@ 2025-12-18 14:35 ` Lance Yang
2025-12-19 8:25 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 16+ messages in thread
From: Lance Yang @ 2025-12-18 14:35 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
riel, jannh, linux-arch, linux-mm, linux-kernel, akpm
On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote:
> On 12/13/25 09:00, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
>> in collapse_huge_page() when the TLB flush already provides necessary
>> synchronization.
>>
>> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
>> unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
>> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
>> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
>>
>> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
>> INVLPGB, all x86 systems started sending the second IPI. However, on
>> native x86 this is redundant:
>>
>> - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
>> CPUs to invalidate TLB entries
>>
>> - GUP-fast runs with IRQs disabled, so when the flush IPI completes,
>> any concurrent GUP-fast must have finished
>>
>> - tlb_remove_table_sync_one() provides no additional synchronization
>>
>> On x86, skip the second IPI when running native (without paravirt) and
>> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
>> INVLPGB, conservatively keep both IPIs.
>>
>> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
>> optimization.
>>
>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> mm/khugepaged.c | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 97d1b2824386..06ea793a8190 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct
>> *mm, unsigned long address,
>> _pmd = pmdp_collapse_flush(vma, address, pmd);
>> spin_unlock(pmd_ptl);
>> mmu_notifier_invalidate_range_end(&range);
>> - tlb_remove_table_sync_one();
>> + /*
>> + * Skip the second IPI if the TLB flush above already synchronized
>> + * with concurrent GUP-fast via broadcast IPIs.
>> + */
>> + if (!tlb_table_flush_implies_ipi_broadcast())
>> + tlb_remove_table_sync_one();
>
> We end up calling
>
> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>
> -> flush_tlb_mm_range(freed_tables = true)
>
> -> flush_tlb_multi(mm_cpumask(mm), info);
>
> So freed_tables=true and we should be doing the right thing.
Yep ;)
> BTW, I was wondering whether we should embed that
> tlb_table_flush_implies_ipi_broadcast() check in
> tlb_remove_table_sync_one() instead.
> It then relies on the caller to do the right thing (flush with
> freed_tables=true or unshared_tables = true).
>
> Thoughts?
Good point! Let me check the other callers to ensure they
are all preceded by a flush with freed_tables=true (or unshared_tables).
Will get back to you with what I find :)
Cheers,
Lance
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
2025-12-18 14:35 ` Lance Yang
@ 2025-12-19 8:25 ` David Hildenbrand (Red Hat)
2025-12-21 10:43 ` Lance Yang
0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19 8:25 UTC (permalink / raw)
To: Lance Yang
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
riel, jannh, linux-arch, linux-mm, linux-kernel, akpm
On 12/18/25 15:35, Lance Yang wrote:
>
>
> On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote:
>> On 12/13/25 09:00, Lance Yang wrote:
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
>>> in collapse_huge_page() when the TLB flush already provides necessary
>>> synchronization.
>>>
>>> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
>>> unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
>>> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
>>> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
>>>
>>> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
>>> INVLPGB, all x86 systems started sending the second IPI. However, on
>>> native x86 this is redundant:
>>>
>>> - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
>>> CPUs to invalidate TLB entries
>>>
>>> - GUP-fast runs with IRQs disabled, so when the flush IPI completes,
>>> any concurrent GUP-fast must have finished
>>>
>>> - tlb_remove_table_sync_one() provides no additional synchronization
>>>
>>> On x86, skip the second IPI when running native (without paravirt) and
>>> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
>>> INVLPGB, conservatively keep both IPIs.
>>>
>>> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
>>> optimization.
>>>
>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>> ---
>>> mm/khugepaged.c | 7 ++++++-
>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 97d1b2824386..06ea793a8190 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct
>>> *mm, unsigned long address,
>>> _pmd = pmdp_collapse_flush(vma, address, pmd);
>>> spin_unlock(pmd_ptl);
>>> mmu_notifier_invalidate_range_end(&range);
>>> - tlb_remove_table_sync_one();
>>> + /*
>>> + * Skip the second IPI if the TLB flush above already synchronized
>>> + * with concurrent GUP-fast via broadcast IPIs.
>>> + */
>>> + if (!tlb_table_flush_implies_ipi_broadcast())
>>> + tlb_remove_table_sync_one();
>>
>> We end up calling
>>
>> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>
>> -> flush_tlb_mm_range(freed_tables = true)
>>
>> -> flush_tlb_multi(mm_cpumask(mm), info);
>>
>> So freed_tables=true and we should be doing the right thing.
>
> Yep ;)
>
>> BTW, I was wondering whether we should embed that
>> tlb_table_flush_implies_ipi_broadcast() check in
>> tlb_remove_table_sync_one() instead.
>> It then relies on the caller to do the right thing (flush with
>> freed_tables=true or unshared_tables = true).
>>
>> Thoughts?
>
> Good point! Let me check the other callers to ensure they
> are all preceded by a flush with freed_tables=true (or unshared_tables).
>
> Will get back to you with what I find :)
The use case in tlb_table_flush() is a bit confusing. But I would assume
that we have a TLB flush with remove_tables=true beforehand. Otherwise
we cannot possibly free the page table.
--
Cheers
David
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
2025-12-19 8:25 ` David Hildenbrand (Red Hat)
@ 2025-12-21 10:43 ` Lance Yang
0 siblings, 0 replies; 16+ messages in thread
From: Lance Yang @ 2025-12-21 10:43 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
riel, jannh, linux-arch, linux-mm, linux-kernel, akpm
On 2025/12/19 16:25, David Hildenbrand (Red Hat) wrote:
> On 12/18/25 15:35, Lance Yang wrote:
>>
>>
>> On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote:
>>> On 12/13/25 09:00, Lance Yang wrote:
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
>>>> in collapse_huge_page() when the TLB flush already provides necessary
>>>> synchronization.
>>>>
>>>> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
>>>> unconditional"), bare metal x86 didn't enable
>>>> MMU_GATHER_RCU_TABLE_FREE.
>>>> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
>>>> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
>>>>
>>>> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
>>>> INVLPGB, all x86 systems started sending the second IPI. However, on
>>>> native x86 this is redundant:
>>>>
>>>> - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to
>>>> all
>>>> CPUs to invalidate TLB entries
>>>>
>>>> - GUP-fast runs with IRQs disabled, so when the flush IPI
>>>> completes,
>>>> any concurrent GUP-fast must have finished
>>>>
>>>> - tlb_remove_table_sync_one() provides no additional
>>>> synchronization
>>>>
>>>> On x86, skip the second IPI when running native (without paravirt) and
>>>> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
>>>> INVLPGB, conservatively keep both IPIs.
>>>>
>>>> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the
>>>> hugetlb
>>>> optimization.
>>>>
>>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>> ---
>>>> mm/khugepaged.c | 7 ++++++-
>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 97d1b2824386..06ea793a8190 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct
>>>> *mm, unsigned long address,
>>>> _pmd = pmdp_collapse_flush(vma, address, pmd);
>>>> spin_unlock(pmd_ptl);
>>>> mmu_notifier_invalidate_range_end(&range);
>>>> - tlb_remove_table_sync_one();
>>>> + /*
>>>> + * Skip the second IPI if the TLB flush above already synchronized
>>>> + * with concurrent GUP-fast via broadcast IPIs.
>>>> + */
>>>> + if (!tlb_table_flush_implies_ipi_broadcast())
>>>> + tlb_remove_table_sync_one();
>>>
>>> We end up calling
>>>
>>> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>>
>>> -> flush_tlb_mm_range(freed_tables = true)
>>>
>>> -> flush_tlb_multi(mm_cpumask(mm), info);
>>>
>>> So freed_tables=true and we should be doing the right thing.
>>
>> Yep ;)
>>
>>> BTW, I was wondering whether we should embed that
>>> tlb_table_flush_implies_ipi_broadcast() check in
>>> tlb_remove_table_sync_one() instead.
>>> It then relies on the caller to do the right thing (flush with
>>> freed_tables=true or unshared_tables = true).
>>>
>>> Thoughts?
>>
>> Good point! Let me check the other callers to ensure they
>> are all preceded by a flush with freed_tables=true (or unshared_tables).
>>
>> Will get back to you with what I find :)
>
> The use case in tlb_table_flush() is a bit confusing. But I would assume
> that we have a TLB flush with remove_tables=true beforehand. Otherwise
> we cannot possibly free the page table.
Right! I assume you meant freed_tables=true (not remove_tables) ;)
Verified all callers have proper TLB flushes *beforehand*:
-> 1. mm/khugepaged.c:1188 (collapse_huge_page)
pmdp_collapse_flush(vma, address, pmd)
-> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE)
-> flush_tlb_mm_range(mm, ..., freed_tables = true)
-> flush_tlb_multi(mm_cpumask(mm), info)
So freed_tables=true and we should be doing the right thing :)
-> 2. include/asm-generic/tlb.h:861 (tlb_flush_unshared_tables)
tlb_flush_mmu_tlbonly(tlb)
-> tlb_flush(tlb)
-> flush_tlb_mm_range(mm, ..., unshared_tables = true)
-> flush_tlb_multi(mm_cpumask(mm), info)
unshared_tables=true (equivalent to freed_tables for sending IPIs).
-> 3. mm/mmu_gather.c:341 (__tlb_remove_table_one)
When we can't allocate a batch page in tlb_remove_table(), we do:
tlb_table_invalidate(tlb)
-> tlb_flush_mmu_tlbonly(tlb)
-> flush_tlb_mm_range(mm, ..., freed_tables = true)
-> flush_tlb_multi(mm_cpumask(mm), info)
Then:
tlb_remove_table_one(table)
-> __tlb_remove_table_one(table) // if !CONFIG_PT_RECLAIM
-> tlb_remove_table_sync_one()
freed_tables=true, and this should work too.
Why is tlb->freed_tables guaranteed? Because callers like pte_free_tlb()
(via free_pte_range) set freed_tables=true before calling __pte_free_tlb(),
which then calls tlb_remove_table(). As you mentioned, we cannot free page
tables without freed_tables=true.
Note that tlb_remove_table_sync_one() was a NOP on bare metal x86
(CONFIG_MMU_GATHER_RCU_TABLE_FREE=n) before commit a37259732a7d.
-> 4-5. mm/khugepaged.c:1683,1819 (pmdp_get_lockless_sync macro)
Same as #1.
So all callers satisfy the requirement! Will embed the check in v2.
Hopefully I didn't miss any callers ;)
Cheers,
Lance
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for
2025-12-18 13:08 ` David Hildenbrand (Red Hat)
@ 2025-12-22 3:19 ` Lance Yang
2025-12-23 9:44 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 16+ messages in thread
From: Lance Yang @ 2025-12-22 3:19 UTC (permalink / raw)
To: david
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang, bp,
dave.hansen, dev.jain, hpa, ioworker0, jannh, lance.yang,
linux-arch, linux-kernel, linux-mm, lorenzo.stoakes, mingo,
npache, npiggin, peterz, riel, ryan.roberts, shy828301, tglx,
will, x86, ziy
From: Lance Yang <lance.yang@linux.dev>
On Thu, 18 Dec 2025 14:08:07 +0100, David Hildenbrand (Red Hat) wrote:
> On 12/13/25 09:00, Lance Yang wrote:
> > From: Lance Yang <lance.yang@linux.dev>
> >
> > Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
> > ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:
> >
> > flush_tlb_mm_range(..., freed_tables || unshared_tables);
> >
> > Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
> > without paravirt or INVLPGB, the TLB flush IPI already provides necessary
> > synchronization, allowing the second IPI to be skipped. For paravirt with
> > non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.
> >
> > Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> > Signed-off-by: Lance Yang <lance.yang@linux.dev>
> > ---
> > arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
> > 1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> > index 866ea78ba156..96602b7b7210 100644
> > --- a/arch/x86/include/asm/tlb.h
> > +++ b/arch/x86/include/asm/tlb.h
> > @@ -5,10 +5,24 @@
> > #define tlb_flush tlb_flush
> > static inline void tlb_flush(struct mmu_gather *tlb);
> >
> > +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
> > +static inline bool tlb_table_flush_implies_ipi_broadcast(void);
> > +
> > #include <asm-generic/tlb.h>
> > #include <linux/kernel.h>
> > #include <vdso/bits.h>
> > #include <vdso/page.h>
> > +#include <asm/paravirt.h>
> > +
> > +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
> > +{
> > +#ifdef CONFIG_PARAVIRT
> > + /* Paravirt may use hypercalls that don't send real IPIs. */
> > + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
> > + return false;
> > +#endif
> > + return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
>
> Right, here I was wondering whether we should have a new pv_ops callback
> to indicate that instead.
>
> pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast()
>
> Or a simple boolean property that pv init code properly sets.
Cool!
>
> Something for x86 folks to give suggestions for. :)
I prefer to use a boolean property instead of comparing function pointers.
Something like this:
----8<----
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index cfcb60468b01..90e9da33f2c7 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -243,4 +243,5 @@ void hyperv_setup_mmu_ops(void)
pr_info("Using hypercall for remote TLB flush\n");
pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
+ pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false;
}
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 3502939415ad..f9756df6f3f6 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -133,6 +133,19 @@ struct pv_mmu_ops {
void (*flush_tlb_multi)(const struct cpumask *cpus,
const struct flush_tlb_info *info);
+ /*
+ * Indicates whether TLB flush IPIs provide sufficient synchronization
+ * for GUP-fast when freeing or unsharing page tables.
+ *
+ * Set to true only when the TLB flush guarantees:
+ * - IPIs reach all CPUs with potentially stale paging-structure caches
+ * - Synchronization with IRQ-disabled code like GUP-fast
+ *
+ * Paravirt implementations that use hypercalls (which may not send
+ * real IPIs) should set this to false.
+ */
+ bool tlb_flush_implies_ipi_broadcast;
+
/* Hook for intercepting the destruction of an mm_struct. */
void (*exit_mmap)(struct mm_struct *mm);
void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 96602b7b7210..9d20ad4786cc 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -18,7 +18,7 @@ static inline bool tlb_table_flush_implies_ipi_broadcast(void)
{
#ifdef CONFIG_PARAVIRT
/* Paravirt may use hypercalls that don't send real IPIs. */
- if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
+ if (!pv_ops.mmu.tlb_flush_implies_ipi_broadcast)
return false;
#endif
return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index df78ddee0abb..aaea83100105 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -843,6 +843,7 @@ static void __init kvm_guest_init(void)
#ifdef CONFIG_SMP
if (pv_tlb_flush_supported()) {
pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
+ pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false;
pr_info("KVM setup pv remote TLB flush\n");
}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ab3e172dcc69..625fe93e138a 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -173,6 +173,7 @@ struct paravirt_patch_template pv_ops = {
.mmu.flush_tlb_kernel = native_flush_tlb_global,
.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
.mmu.flush_tlb_multi = native_flush_tlb_multi,
+ .mmu.tlb_flush_implies_ipi_broadcast = true,
.mmu.exit_mmap = paravirt_nop,
.mmu.notify_page_enc_status_changed = paravirt_nop,
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 7a35c3393df4..06eb80cfb4da 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2185,6 +2185,7 @@ static const typeof(pv_ops) xen_mmu_ops __initconst = {
.flush_tlb_kernel = xen_flush_tlb,
.flush_tlb_one_user = xen_flush_tlb_one_user,
.flush_tlb_multi = xen_flush_tlb_multi,
+ .tlb_flush_implies_ipi_broadcast = false,
.pgd_alloc = xen_pgd_alloc,
.pgd_free = xen_pgd_free,
---
Native x86 sets it to true, paravirt guests (Xen/KVM/Hyper-V) set it to
false. Making the intent explicit :)
Hopefully x86 folks can give me some suggestions!
Thanks,
Lance
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for
2025-12-22 3:19 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang
@ 2025-12-23 9:44 ` David Hildenbrand (Red Hat)
2025-12-23 11:13 ` Lance Yang
0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-23 9:44 UTC (permalink / raw)
To: Lance Yang
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang, bp,
dave.hansen, dev.jain, hpa, jannh, lance.yang, linux-arch,
linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin,
peterz, riel, ryan.roberts, shy828301, tglx, will, x86, ziy
On 12/22/25 04:19, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
>
> On Thu, 18 Dec 2025 14:08:07 +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/13/25 09:00, Lance Yang wrote:
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
>>> ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:
>>>
>>> flush_tlb_mm_range(..., freed_tables || unshared_tables);
>>>
>>> Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
>>> without paravirt or INVLPGB, the TLB flush IPI already provides necessary
>>> synchronization, allowing the second IPI to be skipped. For paravirt with
>>> non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.
>>>
>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>> ---
>>> arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
>>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>>> index 866ea78ba156..96602b7b7210 100644
>>> --- a/arch/x86/include/asm/tlb.h
>>> +++ b/arch/x86/include/asm/tlb.h
>>> @@ -5,10 +5,24 @@
>>> #define tlb_flush tlb_flush
>>> static inline void tlb_flush(struct mmu_gather *tlb);
>>>
>>> +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void);
>>> +
>>> #include <asm-generic/tlb.h>
>>> #include <linux/kernel.h>
>>> #include <vdso/bits.h>
>>> #include <vdso/page.h>
>>> +#include <asm/paravirt.h>
>>> +
>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
>>> +{
>>> +#ifdef CONFIG_PARAVIRT
>>> + /* Paravirt may use hypercalls that don't send real IPIs. */
>>> + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
>>> + return false;
>>> +#endif
>>> + return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
>>
>> Right, here I was wondering whether we should have a new pv_ops callback
>> to indicate that instead.
>>
>> pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast()
>>
>> Or a simple boolean property that pv init code properly sets.
>
> Cool!
>
>>
>> Something for x86 folks to give suggestions for. :)
>
> I prefer to use a boolean property instead of comparing function pointers.
> Something like this:
>
> ----8<----
> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
> index cfcb60468b01..90e9da33f2c7 100644
> --- a/arch/x86/hyperv/mmu.c
> +++ b/arch/x86/hyperv/mmu.c
> @@ -243,4 +243,5 @@ void hyperv_setup_mmu_ops(void)
>
> pr_info("Using hypercall for remote TLB flush\n");
> pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
> + pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false;
> }
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 3502939415ad..f9756df6f3f6 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -133,6 +133,19 @@ struct pv_mmu_ops {
> void (*flush_tlb_multi)(const struct cpumask *cpus,
> const struct flush_tlb_info *info);
>
> + /*
> + * Indicates whether TLB flush IPIs provide sufficient synchronization
> + * for GUP-fast when freeing or unsharing page tables.
> + *
> + * Set to true only when the TLB flush guarantees:
> + * - IPIs reach all CPUs with potentially stale paging-structure caches
> + * - Synchronization with IRQ-disabled code like GUP-fast
> + *
> + * Paravirt implementations that use hypercalls (which may not send
> + * real IPIs) should set this to false.
> + */
> + bool tlb_flush_implies_ipi_broadcast;
> +
> /* Hook for intercepting the destruction of an mm_struct. */
> void (*exit_mmap)(struct mm_struct *mm);
> void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> index 96602b7b7210..9d20ad4786cc 100644
> --- a/arch/x86/include/asm/tlb.h
> +++ b/arch/x86/include/asm/tlb.h
> @@ -18,7 +18,7 @@ static inline bool tlb_table_flush_implies_ipi_broadcast(void)
> {
> #ifdef CONFIG_PARAVIRT
> /* Paravirt may use hypercalls that don't send real IPIs. */
> - if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
> + if (!pv_ops.mmu.tlb_flush_implies_ipi_broadcast)
> return false;
> #endif
> return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
I'd have thought that the X86_FEATURE_INVLPGB heck should then also be
taken care of by whoever sets tlb_flush_implies_ipi_broadcast.
--
Cheers
David
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for
2025-12-23 9:44 ` David Hildenbrand (Red Hat)
@ 2025-12-23 11:13 ` Lance Yang
0 siblings, 0 replies; 16+ messages in thread
From: Lance Yang @ 2025-12-23 11:13 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang, bp,
dave.hansen, dev.jain, hpa, jannh, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, peterz, riel,
ryan.roberts, shy828301, tglx, will, x86, ziy, Lance Yang
On 2025/12/23 17:44, David Hildenbrand (Red Hat) wrote:
> On 12/22/25 04:19, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>>
>> On Thu, 18 Dec 2025 14:08:07 +0100, David Hildenbrand (Red Hat) wrote:
>>> On 12/13/25 09:00, Lance Yang wrote:
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
>>>> ensure lazy-TLB CPUs receive IPIs and flush their paging-structure
>>>> caches:
>>>>
>>>> flush_tlb_mm_range(..., freed_tables || unshared_tables);
>>>>
>>>> Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native
>>>> x86
>>>> without paravirt or INVLPGB, the TLB flush IPI already provides
>>>> necessary
>>>> synchronization, allowing the second IPI to be skipped. For paravirt
>>>> with
>>>> non-native flush_tlb_multi and for INVLPGB, conservatively keep both
>>>> IPIs.
>>>>
>>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>> ---
>>>> arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
>>>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>>>> index 866ea78ba156..96602b7b7210 100644
>>>> --- a/arch/x86/include/asm/tlb.h
>>>> +++ b/arch/x86/include/asm/tlb.h
>>>> @@ -5,10 +5,24 @@
>>>> #define tlb_flush tlb_flush
>>>> static inline void tlb_flush(struct mmu_gather *tlb);
>>>> +#define tlb_table_flush_implies_ipi_broadcast
>>>> tlb_table_flush_implies_ipi_broadcast
>>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void);
>>>> +
>>>> #include <asm-generic/tlb.h>
>>>> #include <linux/kernel.h>
>>>> #include <vdso/bits.h>
>>>> #include <vdso/page.h>
>>>> +#include <asm/paravirt.h>
>>>> +
>>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
>>>> +{
>>>> +#ifdef CONFIG_PARAVIRT
>>>> + /* Paravirt may use hypercalls that don't send real IPIs. */
>>>> + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
>>>> + return false;
>>>> +#endif
>>>> + return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
>>>
>>> Right, here I was wondering whether we should have a new pv_ops callback
>>> to indicate that instead.
>>>
>>> pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast()
>>>
>>> Or a simple boolean property that pv init code properly sets.
>>
>> Cool!
>>
>>>
>>> Something for x86 folks to give suggestions for. :)
>>
>> I prefer to use a boolean property instead of comparing function
>> pointers.
>> Something like this:
>>
>> ----8<----
>> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
>> index cfcb60468b01..90e9da33f2c7 100644
>> --- a/arch/x86/hyperv/mmu.c
>> +++ b/arch/x86/hyperv/mmu.c
>> @@ -243,4 +243,5 @@ void hyperv_setup_mmu_ops(void)
>>
>> pr_info("Using hypercall for remote TLB flush\n");
>> pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
>> + pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false;
>> }
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/
>> asm/paravirt_types.h
>> index 3502939415ad..f9756df6f3f6 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -133,6 +133,19 @@ struct pv_mmu_ops {
>> void (*flush_tlb_multi)(const struct cpumask *cpus,
>> const struct flush_tlb_info *info);
>>
>> + /*
>> + * Indicates whether TLB flush IPIs provide sufficient
>> synchronization
>> + * for GUP-fast when freeing or unsharing page tables.
>> + *
>> + * Set to true only when the TLB flush guarantees:
>> + * - IPIs reach all CPUs with potentially stale paging-structure
>> caches
>> + * - Synchronization with IRQ-disabled code like GUP-fast
>> + *
>> + * Paravirt implementations that use hypercalls (which may not send
>> + * real IPIs) should set this to false.
>> + */
>> + bool tlb_flush_implies_ipi_broadcast;
>> +
>> /* Hook for intercepting the destruction of an mm_struct. */
>> void (*exit_mmap)(struct mm_struct *mm);
>> void (*notify_page_enc_status_changed)(unsigned long pfn, int
>> npages, bool enc);
>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>> index 96602b7b7210..9d20ad4786cc 100644
>> --- a/arch/x86/include/asm/tlb.h
>> +++ b/arch/x86/include/asm/tlb.h
>> @@ -18,7 +18,7 @@ static inline bool
>> tlb_table_flush_implies_ipi_broadcast(void)
>> {
>> #ifdef CONFIG_PARAVIRT
>> /* Paravirt may use hypercalls that don't send real IPIs. */
>> - if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
>> + if (!pv_ops.mmu.tlb_flush_implies_ipi_broadcast)
>> return false;
>> #endif
>> return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
>
> I'd have thought that the X86_FEATURE_INVLPGB heck should then also be
> taken care of by whoever sets tlb_flush_implies_ipi_broadcast.
Makes sense!
Let's have the INVLPGB check happen at setup time, not at use time :P
Cheers,
Lance
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2025-12-23 11:13 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-13 8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
2025-12-15 5:48 ` Lance Yang
2025-12-18 13:08 ` David Hildenbrand (Red Hat)
2025-12-13 8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
2025-12-18 13:08 ` David Hildenbrand (Red Hat)
2025-12-22 3:19 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang
2025-12-23 9:44 ` David Hildenbrand (Red Hat)
2025-12-23 11:13 ` Lance Yang
2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
2025-12-14 13:24 ` kernel test robot
2025-12-14 13:56 ` kernel test robot
2025-12-18 13:13 ` David Hildenbrand (Red Hat)
2025-12-18 14:35 ` Lance Yang
2025-12-19 8:25 ` David Hildenbrand (Red Hat)
2025-12-21 10:43 ` Lance Yang
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.