sparclinux.vger.kernel.org archive mirror
* [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements
@ 2025-05-30 14:04 Ryan Roberts
  2025-05-30 14:04 ` [RFC PATCH v1 1/6] fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in pagemap_scan_pmd_entry() Ryan Roberts
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Ryan Roberts @ 2025-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm

Hi All,

I recently added support for lazy mmu mode on arm64. The series is now in
Linus's tree so should be in v6.16-rc1. But during testing in linux-next we
found some ugly corners (unexpected nesting). I was able to fix those issues by
making the arm64 implementation more permissive (like the other arches). But
this is quite fragile IMHO. So I'd rather fix the root cause and ensure that
lazy mmu mode never nests, and more importantly, that code never makes pgtable
modifications expecting them to be immediate, not knowing that it's actually in
lazy mmu mode so the changes get deferred.

The first 2 patches are unrelated, very obvious bug fixes. They don't affect
arm64 because arm64 only uses lazy mmu for kernel mappings. But I noticed them
during code review and think they should be fixed.

The next 3 patches are aimed at solving the nesting issue.

And the final patch is reverting the "permissive" fix I did for arm64, which is
no longer needed after the previous 3 patches.

I've labelled this RFC for now because it depends on the arm64 lazy mmu patches
in Linus's master, so it won't apply to mm-unstable. But I'm keen to get
review, and since I'm touching various arches and modifying some core mm
stuff, I expect review to take a while, so I thought I'd beat the rush and get
a first version out early.

I've build-tested all the affected arches. And I've run mm selftests for the
arm64 build, with no issues (with DEBUG_PAGEALLOC and KFENCE enabled).

Applies against Linus's master branch (f66bc387efbe).

Thanks,
Ryan


Ryan Roberts (6):
  fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in
    pagemap_scan_pmd_entry()
  mm: Fix pte update and tlb maintenance ordering in
    migrate_vma_collect_pmd()
  mm: Avoid calling page allocator from apply_to_page_range()
  mm: Introduce arch_in_lazy_mmu_mode()
  mm: Avoid calling page allocator while in lazy mmu mode
  Revert "arm64/mm: Permit lazy_mmu_mode to be nested"

 arch/arm64/include/asm/pgtable.h              | 22 ++++----
 .../include/asm/book3s/64/tlbflush-hash.h     | 15 ++++++
 arch/sparc/include/asm/tlbflush_64.h          |  1 +
 arch/sparc/mm/tlb.c                           | 12 +++++
 arch/x86/include/asm/paravirt.h               |  5 ++
 arch/x86/include/asm/paravirt_types.h         |  1 +
 arch/x86/kernel/paravirt.c                    |  6 +++
 arch/x86/xen/mmu_pv.c                         |  6 +++
 fs/proc/task_mmu.c                            |  3 +-
 include/asm-generic/tlb.h                     |  2 +
 include/linux/mm.h                            |  6 +++
 include/linux/pgtable.h                       |  1 +
 kernel/bpf/arena.c                            |  6 +--
 mm/kasan/shadow.c                             |  2 +-
 mm/memory.c                                   | 54 ++++++++++++++-----
 mm/migrate_device.c                           |  3 +-
 mm/mmu_gather.c                               | 15 ++++++
 17 files changed, 128 insertions(+), 32 deletions(-)

--
2.43.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH v1 1/6] fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in pagemap_scan_pmd_entry()
  2025-05-30 14:04 [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements Ryan Roberts
@ 2025-05-30 14:04 ` Ryan Roberts
  2025-05-30 16:26   ` Jann Horn
  2025-05-30 14:04 ` [RFC PATCH v1 2/6] mm: Fix pte update and tlb maintenance ordering in migrate_vma_collect_pmd() Ryan Roberts
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Ryan Roberts @ 2025-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm

pagemap_scan_pmd_entry() was previously modifying ptes while in lazy mmu
mode, then performing tlb maintenance for the modified ptes, then
leaving lazy mmu mode. But any pte modifications during lazy mmu mode
may be deferred until arch_leave_lazy_mmu_mode(), inverting the required
ordering between pte modification and tlb maintenance.

Let's fix that by leaving lazy mmu mode, forcing all the pte updates to be
actioned, before doing the tlb maintenance.
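
Schematically, the required order is (an illustrative sketch, not the literal
pagemap_scan_pmd_entry() code):

  arch_enter_lazy_mmu_mode();
  /* pte updates made here may only be queued, not yet visible */
  arch_leave_lazy_mmu_mode();         /* queued pte updates are actioned */
  flush_tlb_range(vma, start, addr);  /* flush only after the updates land */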

This is a theoretical bug discovered during code review.

Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 fs/proc/task_mmu.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 994cde10e3f4..361f3ffd9a0c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2557,10 +2557,9 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 	}
 
 flush_and_return:
+	arch_leave_lazy_mmu_mode();
 	if (flush_end)
 		flush_tlb_range(vma, start, addr);
-
-	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(start_pte, ptl);
 
 	cond_resched();
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH v1 2/6] mm: Fix pte update and tlb maintenance ordering in migrate_vma_collect_pmd()
  2025-05-30 14:04 [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements Ryan Roberts
  2025-05-30 14:04 ` [RFC PATCH v1 1/6] fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in pagemap_scan_pmd_entry() Ryan Roberts
@ 2025-05-30 14:04 ` Ryan Roberts
  2025-05-30 14:04 ` [RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range() Ryan Roberts
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Ryan Roberts @ 2025-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm

migrate_vma_collect_pmd() was previously modifying ptes while in lazy
mmu mode, then performing tlb maintenance for the modified ptes, then
leaving lazy mmu mode. But any pte modifications during lazy mmu mode
may be deferred until arch_leave_lazy_mmu_mode(), inverting the required
ordering between pte modification and tlb maintenance.

Let's fix that by leaving lazy mmu mode (forcing all the pte updates to be
actioned) before doing the tlb maintenance.

This is a theoretical bug discovered during code review.

Fixes: 60bae7370896 ("mm/migrate_device.c: flush TLB while holding PTL")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/migrate_device.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 3158afe7eb23..fc73a940c112 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -283,11 +283,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		migrate->src[migrate->npages++] = mpfn;
 	}
 
+	arch_leave_lazy_mmu_mode();
+
 	/* Only flush the TLB if we actually modified any entries */
 	if (unmapped)
 		flush_tlb_range(walk->vma, start, end);
 
-	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(ptep - 1, ptl);
 
 	return 0;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range()
  2025-05-30 14:04 [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements Ryan Roberts
  2025-05-30 14:04 ` [RFC PATCH v1 1/6] fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in pagemap_scan_pmd_entry() Ryan Roberts
  2025-05-30 14:04 ` [RFC PATCH v1 2/6] mm: Fix pte update and tlb maintenance ordering in migrate_vma_collect_pmd() Ryan Roberts
@ 2025-05-30 14:04 ` Ryan Roberts
  2025-05-30 16:23   ` Liam R. Howlett
  2025-05-30 14:04 ` [RFC PATCH v1 4/6] mm: Introduce arch_in_lazy_mmu_mode() Ryan Roberts
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Ryan Roberts @ 2025-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm

Lazy mmu mode applies to the current task and permits pte modifications
to be deferred and updated at a later time in a batch to improve
performance. apply_to_page_range() calls its callback in lazy mmu mode
and some of those callbacks call into the page allocator to either
allocate or free pages.

This is problematic with CONFIG_DEBUG_PAGEALLOC because
debug_pagealloc_[un]map_pages() calls the arch implementation of
__kernel_map_pages() which must modify the ptes for the linear map.

There are two possibilities at this point:

 - If the arch implementation modifies the ptes directly without first
   entering lazy mmu mode, the pte modifications may get deferred until
   the existing lazy mmu mode is exited. This could result in taking
   spurious faults for example.

 - If the arch implementation enters a nested lazy mmu mode before
   modification of the ptes (many arches use apply_to_page_range()),
   then the linear map updates will definitely be applied upon leaving
   the inner lazy mmu mode. But because lazy mmu mode does not support
   nesting, the remainder of the outer user is no longer in lazy mmu
   mode and the optimization opportunity is lost.

So let's just ensure that the page allocator is never called from within
lazy mmu mode. New "_nolazy" variants of apply_to_page_range() and
apply_to_existing_page_range() are introduced which don't enter lazy mmu
mode. Then users which need to call into the page allocator within their
callback are updated to use the _nolazy variants.
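
For example, the bpf arena free path (updated below) has a callback that frees
pages, so it now uses the new variant. This is just a sketch of the call; the
real code lives in kernel/bpf/arena.c:

  /* existing_page_cb() may call into the page allocator to free pages */
  apply_to_existing_page_range_nolazy(&init_mm,
                                      bpf_arena_get_kern_vm_start(arena),
                                      KERN_VM_SZ - GUARD_SZ,
                                      existing_page_cb, NULL);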

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/mm.h |  6 ++++++
 kernel/bpf/arena.c |  6 +++---
 mm/kasan/shadow.c  |  2 +-
 mm/memory.c        | 54 +++++++++++++++++++++++++++++++++++-----------
 4 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e51dba8398f7..11cae6ce04ff 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3743,9 +3743,15 @@ static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
 typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
 			       unsigned long size, pte_fn_t fn, void *data);
+extern int apply_to_page_range_nolazy(struct mm_struct *mm,
+				      unsigned long address, unsigned long size,
+				      pte_fn_t fn, void *data);
 extern int apply_to_existing_page_range(struct mm_struct *mm,
 				   unsigned long address, unsigned long size,
 				   pte_fn_t fn, void *data);
+extern int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
+				   unsigned long address, unsigned long size,
+				   pte_fn_t fn, void *data);
 
 #ifdef CONFIG_PAGE_POISONING
 extern void __kernel_poison_pages(struct page *page, int numpages);
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 0d56cea71602..ca833cfeefb7 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -187,10 +187,10 @@ static void arena_map_free(struct bpf_map *map)
 	/*
 	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
 	 * It unmaps everything from vmalloc area and clears pgtables.
-	 * Call apply_to_existing_page_range() first to find populated ptes and
-	 * free those pages.
+	 * Call apply_to_existing_page_range_nolazy() first to find populated
+	 * ptes and free those pages.
 	 */
-	apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
+	apply_to_existing_page_range_nolazy(&init_mm, bpf_arena_get_kern_vm_start(arena),
 				     KERN_VM_SZ - GUARD_SZ, existing_page_cb, NULL);
 	free_vm_area(arena->kern_vm);
 	range_tree_destroy(&arena->rt);
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index d2c70cd2afb1..2325c5166c3a 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -590,7 +590,7 @@ void kasan_release_vmalloc(unsigned long start, unsigned long end,
 
 
 		if (flags & KASAN_VMALLOC_PAGE_RANGE)
-			apply_to_existing_page_range(&init_mm,
+			apply_to_existing_page_range_nolazy(&init_mm,
 					     (unsigned long)shadow_start,
 					     size, kasan_depopulate_vmalloc_pte,
 					     NULL);
diff --git a/mm/memory.c b/mm/memory.c
index 49199410805c..24436074ce48 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2913,7 +2913,7 @@ EXPORT_SYMBOL(vm_iomap_memory);
 static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				     unsigned long addr, unsigned long end,
 				     pte_fn_t fn, void *data, bool create,
-				     pgtbl_mod_mask *mask)
+				     pgtbl_mod_mask *mask, bool lazy_mmu)
 {
 	pte_t *pte, *mapped_pte;
 	int err = 0;
@@ -2933,7 +2933,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			return -EINVAL;
 	}
 
-	arch_enter_lazy_mmu_mode();
+	if (lazy_mmu)
+		arch_enter_lazy_mmu_mode();
 
 	if (fn) {
 		do {
@@ -2946,7 +2947,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	}
 	*mask |= PGTBL_PTE_MODIFIED;
 
-	arch_leave_lazy_mmu_mode();
+	if (lazy_mmu)
+		arch_leave_lazy_mmu_mode();
 
 	if (mm != &init_mm)
 		pte_unmap_unlock(mapped_pte, ptl);
@@ -2956,7 +2958,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 				     unsigned long addr, unsigned long end,
 				     pte_fn_t fn, void *data, bool create,
-				     pgtbl_mod_mask *mask)
+				     pgtbl_mod_mask *mask, bool lazy_mmu)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -2983,7 +2985,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 			pmd_clear_bad(pmd);
 		}
 		err = apply_to_pte_range(mm, pmd, addr, next,
-					 fn, data, create, mask);
+					 fn, data, create, mask, lazy_mmu);
 		if (err)
 			break;
 	} while (pmd++, addr = next, addr != end);
@@ -2994,7 +2996,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 				     unsigned long addr, unsigned long end,
 				     pte_fn_t fn, void *data, bool create,
-				     pgtbl_mod_mask *mask)
+				     pgtbl_mod_mask *mask, bool lazy_mmu)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -3019,7 +3021,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 			pud_clear_bad(pud);
 		}
 		err = apply_to_pmd_range(mm, pud, addr, next,
-					 fn, data, create, mask);
+					 fn, data, create, mask, lazy_mmu);
 		if (err)
 			break;
 	} while (pud++, addr = next, addr != end);
@@ -3030,7 +3032,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 				     unsigned long addr, unsigned long end,
 				     pte_fn_t fn, void *data, bool create,
-				     pgtbl_mod_mask *mask)
+				     pgtbl_mod_mask *mask, bool lazy_mmu)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -3055,7 +3057,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 			p4d_clear_bad(p4d);
 		}
 		err = apply_to_pud_range(mm, p4d, addr, next,
-					 fn, data, create, mask);
+					 fn, data, create, mask, lazy_mmu);
 		if (err)
 			break;
 	} while (p4d++, addr = next, addr != end);
@@ -3065,7 +3067,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 
 static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 				 unsigned long size, pte_fn_t fn,
-				 void *data, bool create)
+				 void *data, bool create, bool lazy_mmu)
 {
 	pgd_t *pgd;
 	unsigned long start = addr, next;
@@ -3091,7 +3093,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 			pgd_clear_bad(pgd);
 		}
 		err = apply_to_p4d_range(mm, pgd, addr, next,
-					 fn, data, create, &mask);
+					 fn, data, create, &mask, lazy_mmu);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
@@ -3105,11 +3107,14 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 /*
  * Scan a region of virtual memory, filling in page tables as necessary
  * and calling a provided function on each leaf page table.
+ *
+ * fn() is called in lazy mmu mode. As a result, the callback must be careful
+ * not to perform memory allocation.
  */
 int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 			unsigned long size, pte_fn_t fn, void *data)
 {
-	return __apply_to_page_range(mm, addr, size, fn, data, true);
+	return __apply_to_page_range(mm, addr, size, fn, data, true, true);
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
@@ -3117,13 +3122,36 @@ EXPORT_SYMBOL_GPL(apply_to_page_range);
  * Scan a region of virtual memory, calling a provided function on
  * each leaf page table where it exists.
  *
+ * fn() is called in lazy mmu mode. As a result, the callback must be careful
+ * not to perform memory allocation.
+ *
  * Unlike apply_to_page_range, this does _not_ fill in page tables
  * where they are absent.
  */
 int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
 				 unsigned long size, pte_fn_t fn, void *data)
 {
-	return __apply_to_page_range(mm, addr, size, fn, data, false);
+	return __apply_to_page_range(mm, addr, size, fn, data, false, true);
+}
+
+/*
+ * As per apply_to_page_range() but fn() is not called in lazy mmu mode.
+ */
+int apply_to_page_range_nolazy(struct mm_struct *mm, unsigned long addr,
+			       unsigned long size, pte_fn_t fn, void *data)
+{
+	return __apply_to_page_range(mm, addr, size, fn, data, true, false);
+}
+
+/*
+ * As per apply_to_existing_page_range() but fn() is not called in lazy mmu
+ * mode.
+ */
+int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
+					unsigned long addr, unsigned long size,
+					pte_fn_t fn, void *data)
+{
+	return __apply_to_page_range(mm, addr, size, fn, data, false, false);
 }
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH v1 4/6] mm: Introduce arch_in_lazy_mmu_mode()
  2025-05-30 14:04 [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements Ryan Roberts
                   ` (2 preceding siblings ...)
  2025-05-30 14:04 ` [RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range() Ryan Roberts
@ 2025-05-30 14:04 ` Ryan Roberts
  2025-05-30 14:04 ` [RFC PATCH v1 5/6] mm: Avoid calling page allocator while in lazy mmu mode Ryan Roberts
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Ryan Roberts @ 2025-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm

Introduce new arch_in_lazy_mmu_mode() API, which returns true if the
calling context is currently in lazy mmu mode or false otherwise. Each
arch that supports lazy mmu mode must provide an implementation of this
API.

The API will shortly be used to prevent accidental lazy mmu mode nesting
when performing an allocation, and will additionally be used to ensure
pte modification vs tlb flushing order does not get inadvertently
swapped.
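
The intended usage pattern looks roughly like this (a sketch; the real caller
is added in a later patch in this series):

  bool lazy_mmu = arch_in_lazy_mmu_mode();

  if (lazy_mmu)
          arch_leave_lazy_mmu_mode();     /* apply any deferred pte updates */

  /* safe to call the page allocator: no longer in lazy mmu mode */
  batch = (void *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);

  if (lazy_mmu)
          arch_enter_lazy_mmu_mode();     /* resume batching for the caller */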

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h                  |  8 ++++++++
 .../powerpc/include/asm/book3s/64/tlbflush-hash.h | 15 +++++++++++++++
 arch/sparc/include/asm/tlbflush_64.h              |  1 +
 arch/sparc/mm/tlb.c                               | 12 ++++++++++++
 arch/x86/include/asm/paravirt.h                   |  5 +++++
 arch/x86/include/asm/paravirt_types.h             |  1 +
 arch/x86/kernel/paravirt.c                        |  6 ++++++
 arch/x86/xen/mmu_pv.c                             |  6 ++++++
 include/linux/pgtable.h                           |  1 +
 9 files changed, 55 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5285757ee0c1..add75dee49f5 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -119,6 +119,14 @@ static inline void arch_leave_lazy_mmu_mode(void)
 	clear_thread_flag(TIF_LAZY_MMU);
 }
 
+static inline bool arch_in_lazy_mmu_mode(void)
+{
+	if (in_interrupt())
+		return false;
+
+	return test_thread_flag(TIF_LAZY_MMU);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
 
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 146287d9580f..4123a9da32cc 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -57,6 +57,21 @@ static inline void arch_leave_lazy_mmu_mode(void)
 
 #define arch_flush_lazy_mmu_mode()      do {} while (0)
 
+static inline bool arch_in_lazy_mmu_mode(void)
+{
+	struct ppc64_tlb_batch *batch;
+	bool active;
+
+	if (radix_enabled())
+		return false;
+
+	batch = get_cpu_ptr(&ppc64_tlb_batch);
+	active = batch->active;
+	put_cpu_ptr(&ppc64_tlb_batch);
+
+	return active;
+}
+
 extern void hash__tlbiel_all(unsigned int action);
 
 extern void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize,
diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index 8b8cdaa69272..204bc957df9e 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -45,6 +45,7 @@ void flush_tlb_pending(void);
 void arch_enter_lazy_mmu_mode(void);
 void arch_leave_lazy_mmu_mode(void);
 #define arch_flush_lazy_mmu_mode()      do {} while (0)
+bool arch_in_lazy_mmu_mode(void);
 
 /* Local cpu only.  */
 void __flush_tlb_all(void);
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index a35ddcca5e76..83ab4ba4f4fb 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -69,6 +69,18 @@ void arch_leave_lazy_mmu_mode(void)
 	preempt_enable();
 }
 
+bool arch_in_lazy_mmu_mode(void)
+{
+	struct tlb_batch *tb;
+	bool active;
+
+	tb = get_cpu_ptr(&tlb_batch);
+	active = tb->active;
+	put_cpu_ptr(&tlb_batch);
+
+	return active;
+}
+
 static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr,
 			      bool exec, unsigned int hugepage_shift)
 {
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index b5e59a7ba0d0..c7ea3ccb8a41 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -542,6 +542,11 @@ static inline void arch_flush_lazy_mmu_mode(void)
 	PVOP_VCALL0(mmu.lazy_mode.flush);
 }
 
+static inline bool arch_in_lazy_mmu_mode(void)
+{
+	return PVOP_CALL0(bool, mmu.lazy_mode.in);
+}
+
 static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 				phys_addr_t phys, pgprot_t flags)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 37a8627d8277..41001ca9d010 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -46,6 +46,7 @@ struct pv_lazy_ops {
 	void (*enter)(void);
 	void (*leave)(void);
 	void (*flush)(void);
+	bool (*in)(void);
 } __no_randomize_layout;
 #endif
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ab3e172dcc69..9af1a04a47fd 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -106,6 +106,11 @@ static noinstr void pv_native_set_debugreg(int regno, unsigned long val)
 {
 	native_set_debugreg(regno, val);
 }
+
+static noinstr bool paravirt_retfalse(void)
+{
+	return false;
+}
 #endif
 
 struct pv_info pv_info = {
@@ -228,6 +233,7 @@ struct paravirt_patch_template pv_ops = {
 		.enter		= paravirt_nop,
 		.leave		= paravirt_nop,
 		.flush		= paravirt_nop,
+		.in		= paravirt_retfalse,
 	},
 
 	.mmu.set_fixmap		= native_set_fixmap,
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2a4a8deaf612..74f7a8537911 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2147,6 +2147,11 @@ static void xen_flush_lazy_mmu(void)
 	preempt_enable();
 }
 
+static bool xen_in_lazy_mmu(void)
+{
+	return xen_get_lazy_mode() == XEN_LAZY_MMU;
+}
+
 static void __init xen_post_allocator_init(void)
 {
 	pv_ops.mmu.set_pte = xen_set_pte;
@@ -2230,6 +2235,7 @@ static const typeof(pv_ops) xen_mmu_ops __initconst = {
 			.enter = xen_enter_lazy_mmu,
 			.leave = xen_leave_lazy_mmu,
 			.flush = xen_flush_lazy_mmu,
+			.in = xen_in_lazy_mmu,
 		},
 
 		.set_fixmap = xen_set_fixmap,
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c92..580d9971f435 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -235,6 +235,7 @@ static inline int pmd_dirty(pmd_t pmd)
 #define arch_enter_lazy_mmu_mode()	do {} while (0)
 #define arch_leave_lazy_mmu_mode()	do {} while (0)
 #define arch_flush_lazy_mmu_mode()	do {} while (0)
+#define arch_in_lazy_mmu_mode()		false
 #endif
 
 #ifndef pte_batch_hint
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH v1 5/6] mm: Avoid calling page allocator while in lazy mmu mode
  2025-05-30 14:04 [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements Ryan Roberts
                   ` (3 preceding siblings ...)
  2025-05-30 14:04 ` [RFC PATCH v1 4/6] mm: Introduce arch_in_lazy_mmu_mode() Ryan Roberts
@ 2025-05-30 14:04 ` Ryan Roberts
  2025-05-30 14:04 ` [RFC PATCH v1 6/6] Revert "arm64/mm: Permit lazy_mmu_mode to be nested" Ryan Roberts
  2025-05-30 14:47 ` [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements Lorenzo Stoakes
  6 siblings, 0 replies; 17+ messages in thread
From: Ryan Roberts @ 2025-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm

Lazy mmu mode applies to the current task and permits pte modifications
to be deferred and updated at a later time in a batch to improve
performance. tlb_next_batch() is called in lazy mmu mode as follows:

zap_pte_range
  arch_enter_lazy_mmu_mode
  do_zap_pte_range
    zap_present_ptes
      zap_present_folio_ptes
        __tlb_remove_folio_pages
          __tlb_remove_folio_pages_size
            tlb_next_batch
  arch_leave_lazy_mmu_mode

tlb_next_batch() may call into the page allocator which is problematic
with CONFIG_DEBUG_PAGEALLOC because debug_pagealloc_[un]map_pages()
calls the arch implementation of __kernel_map_pages() which must modify
the ptes for the linear map.

There are two possibilities at this point:

- If the arch implementation modifies the ptes directly without first
  entering lazy mmu mode, the pte modifications may get deferred until
  the existing lazy mmu mode is exited. This could result in taking
  spurious faults for example.

- If the arch implementation enters a nested lazy mmu mode before
  modification of the ptes (many arches use apply_to_page_range()),
  then the linear map updates will definitely be applied upon leaving
  the inner lazy mmu mode. But because lazy mmu mode does not support
  nesting, the remainder of the outer user is no longer in lazy mmu
  mode and the optimization opportunity is lost.

So let's just ensure that the page allocator is never called from within
lazy mmu mode. Use the new arch_in_lazy_mmu_mode() API to check if we
are in lazy mmu mode, and if so, when calling into the page allocator,
temporarily leave lazy mmu mode.

Given this new API, we can also add VM_WARN_ON() checks to ensure that
we have exited lazy mmu mode where required, so that the PTEs are
actually updated prior to tlb flushing.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/asm-generic/tlb.h |  2 ++
 mm/mmu_gather.c           | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 88a42973fa47..84fb269b78a5 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -469,6 +469,8 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma)
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
+	VM_WARN_ON(arch_in_lazy_mmu_mode());
+
 	/*
 	 * Anything calling __tlb_adjust_range() also sets at least one of
 	 * these bits.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index db7ba4a725d6..0bd1e69b048b 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -18,6 +18,7 @@
 static bool tlb_next_batch(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
+	bool lazy_mmu;
 
 	/* Limit batching if we have delayed rmaps pending */
 	if (tlb->delayed_rmap && tlb->active != &tlb->local)
@@ -32,7 +33,15 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 	if (tlb->batch_count == MAX_GATHER_BATCH_COUNT)
 		return false;
 
+	lazy_mmu = arch_in_lazy_mmu_mode();
+	if (lazy_mmu)
+		arch_leave_lazy_mmu_mode();
+
 	batch = (void *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
+
+	if (lazy_mmu)
+		arch_enter_lazy_mmu_mode();
+
 	if (!batch)
 		return false;
 
@@ -145,6 +154,8 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
 
+	VM_WARN_ON(arch_in_lazy_mmu_mode());
+
 	for (batch = &tlb->local; batch && batch->nr; batch = batch->next)
 		__tlb_batch_free_encoded_pages(batch);
 	tlb->active = &tlb->local;
@@ -154,6 +165,8 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch, *next;
 
+	VM_WARN_ON(arch_in_lazy_mmu_mode());
+
 	for (batch = tlb->local.next; batch; batch = next) {
 		next = batch->next;
 		free_pages((unsigned long)batch, 0);
@@ -363,6 +376,8 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
 	struct mmu_table_batch **batch = &tlb->batch;
 
+	VM_WARN_ON(arch_in_lazy_mmu_mode());
+
 	if (*batch == NULL) {
 		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
 		if (*batch == NULL) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH v1 6/6] Revert "arm64/mm: Permit lazy_mmu_mode to be nested"
  2025-05-30 14:04 [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements Ryan Roberts
                   ` (4 preceding siblings ...)
  2025-05-30 14:04 ` [RFC PATCH v1 5/6] mm: Avoid calling page allocator while in lazy mmu mode Ryan Roberts
@ 2025-05-30 14:04 ` Ryan Roberts
  2025-05-30 14:47 ` [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements Lorenzo Stoakes
  6 siblings, 0 replies; 17+ messages in thread
From: Ryan Roberts @ 2025-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm

Commit 491344301b25 ("arm64/mm: Permit lazy_mmu_mode to be nested") made
the arm64 implementation of lazy_mmu_mode tolerant to nesting. But
subsequent commits have fixed the core code to ensure that lazy_mmu_mode
never gets nested (as originally intended). Therefore we can revert this
commit and reinstate the VM_WARN() that fires if nesting is detected in
future.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index add75dee49f5..dcf0adbeb803 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -83,21 +83,11 @@ static inline void queue_pte_barriers(void)
 #define  __HAVE_ARCH_ENTER_LAZY_MMU_MODE
 static inline void arch_enter_lazy_mmu_mode(void)
 {
-	/*
-	 * lazy_mmu_mode is not supposed to permit nesting. But in practice this
-	 * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
-	 * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
-	 * permissions on the linear map with apply_to_page_range(), which
-	 * re-enters lazy_mmu_mode. So we tolerate nesting in our
-	 * implementation. The first call to arch_leave_lazy_mmu_mode() will
-	 * flush and clear the flag such that the remainder of the work in the
-	 * outer nest behaves as if outside of lazy mmu mode. This is safe and
-	 * keeps tracking simple.
-	 */
-
 	if (in_interrupt())
 		return;
 
+	VM_WARN_ON(test_thread_flag(TIF_LAZY_MMU));
+
 	set_thread_flag(TIF_LAZY_MMU);
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements
  2025-05-30 14:04 [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements Ryan Roberts
                   ` (5 preceding siblings ...)
  2025-05-30 14:04 ` [RFC PATCH v1 6/6] Revert "arm64/mm: Permit lazy_mmu_mode to be nested" Ryan Roberts
@ 2025-05-30 14:47 ` Lorenzo Stoakes
  2025-05-30 15:55   ` Ryan Roberts
  6 siblings, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-05-30 14:47 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm, Jann Horn

+cc Jann who is a specialist in all things page table-y and especially scary
edge cases :)

On Fri, May 30, 2025 at 03:04:38PM +0100, Ryan Roberts wrote:
> Hi All,
>
> I recently added support for lazy mmu mode on arm64. The series is now in
> Linus's tree so should be in v6.16-rc1. But during testing in linux-next we
> found some ugly corners (unexpected nesting). I was able to fix those issues by
> making the arm64 implementation more permissive (like the other arches). But
> this is quite fragile IMHO. So I'd rather fix the root cause and ensure that
> lazy mmu mode never nests, and more importantly, that code never makes pgtable
> modifications expecting them to be immediate, not knowing that it's actually in
> lazy mmu mode so the changes get deferred.

When you say fragile, are you confident it _works_ but perhaps not quite as well
as you want? Or are you concerned this might be broken upstream in any way?

I am thinking specifically about the proposed use in Dev's new series [0] and
obviously hoping (and assuming in fact) that it's the former :)

[0]: https://lore.kernel.org/linux-mm/20250530090407.19237-1-dev.jain@arm.com/

>
> The first 2 patches are unrelated, very obvious bug fixes. They don't affect
> arm64 because arm64 only uses lazy mmu for kernel mappings. But I noticed them
> during code review and think they should be fixed.
>
> The next 3 patches are aimed at solving the nesting issue.
>
> And the final patch is reverting the "permissive" fix I did for arm64, which is
> no longer needed after the previous 3 patches.
>
> I've labelled this RFC for now because it depends on the arm64 lazy mmu patches
> in Linus's master, so it won't apply to mm-unstable. But I'm keen to get
> review, and since I'm touching various arches and modifying some core mm
> stuff, I expect review to take a while, so I thought I'd beat the rush and get
> a first version out early.
>
> I've build-tested all the affected arches. And I've run mm selftests for the
> arm64 build, with no issues (with DEBUG_PAGEALLOC and KFENCE enabled).
>
> Applies against Linus's master branch (f66bc387efbe).
>
> Thanks,
> Ryan
>
>
> Ryan Roberts (6):
>   fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in
>     pagemap_scan_pmd_entry()
>   mm: Fix pte update and tlb maintenance ordering in
>     migrate_vma_collect_pmd()
>   mm: Avoid calling page allocator from apply_to_page_range()
>   mm: Introduce arch_in_lazy_mmu_mode()
>   mm: Avoid calling page allocator while in lazy mmu mode
>   Revert "arm64/mm: Permit lazy_mmu_mode to be nested"
>
>  arch/arm64/include/asm/pgtable.h              | 22 ++++----
>  .../include/asm/book3s/64/tlbflush-hash.h     | 15 ++++++
>  arch/sparc/include/asm/tlbflush_64.h          |  1 +
>  arch/sparc/mm/tlb.c                           | 12 +++++
>  arch/x86/include/asm/paravirt.h               |  5 ++
>  arch/x86/include/asm/paravirt_types.h         |  1 +
>  arch/x86/kernel/paravirt.c                    |  6 +++
>  arch/x86/xen/mmu_pv.c                         |  6 +++
>  fs/proc/task_mmu.c                            |  3 +-
>  include/asm-generic/tlb.h                     |  2 +
>  include/linux/mm.h                            |  6 +++
>  include/linux/pgtable.h                       |  1 +
>  kernel/bpf/arena.c                            |  6 +--
>  mm/kasan/shadow.c                             |  2 +-
>  mm/memory.c                                   | 54 ++++++++++++++-----
>  mm/migrate_device.c                           |  3 +-
>  mm/mmu_gather.c                               | 15 ++++++
>  17 files changed, 128 insertions(+), 32 deletions(-)
>
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements
  2025-05-30 14:47 ` [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements Lorenzo Stoakes
@ 2025-05-30 15:55   ` Ryan Roberts
  2025-05-31  7:46     ` Mike Rapoport
  0 siblings, 1 reply; 17+ messages in thread
From: Ryan Roberts @ 2025-05-30 15:55 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm, Jann Horn

On 30/05/2025 15:47, Lorenzo Stoakes wrote:
> +cc Jann who is a specialist in all things page table-y and especially scary
> edge cases :)
> 
> On Fri, May 30, 2025 at 03:04:38PM +0100, Ryan Roberts wrote:
>> Hi All,
>>
>> I recently added support for lazy mmu mode on arm64. The series is now in
>> Linus's tree so should be in v6.16-rc1. But during testing in linux-next we
>> found some ugly corners (unexpected nesting). I was able to fix those issues by
>> making the arm64 implementation more permissive (like the other arches). But
>> this is quite fragile IMHO. So I'd rather fix the root cause and ensure that
>> lazy mmu mode never nests, and more importantly, that code never makes pgtable
>> modifications expecting them to be immediate, not knowing that it's actually in
>> lazy mmu mode so the changes get deferred.
> 
> When you say fragile, are you confident it _works_ but perhaps not quite as well
> as you want? Or are you concerned this might be broken upstream in any way?

I'm confident that it _works_ for arm64 as it is, upstream. But if Dev's series
were to go in _without_ the lazy_mmu bracketing in some manner, then it would
be broken if the config includes CONFIG_DEBUG_PAGEALLOC.

There's a lot more explanation in the later patches as to how it can be broken,
but for arm64, the situation is currently like this, because our implementation
of __change_memory_common() uses apply_to_page_range() which implicitly starts
an inner lazy_mmu_mode. We enter multiple times, but we exit on the first call
to exit. Everything works correctly but it's not optimal because C is no longer
deferred:

arch_enter_lazy_mmu_mode()                        << outer lazy mmu region
  <do some pte changes (A)>
  alloc_pages()
    debug_pagealloc_map_pages()
      __kernel_map_pages()
        __change_memory_common()
          arch_enter_lazy_mmu_mode()              << inner lazy mmu region
            <change kernel pte to make valid (B)>
          arch_leave_lazy_mmu_mode()              << exit; complete A + B
    clear_page()
  <do some more pte changes (C)>                  << no longer in lazy mode
arch_leave_lazy_mmu_mode()                        << nop

An alternative implementation would not add the nested lazy mmu mode, so we end
up with this:

arch_enter_lazy_mmu_mode()                        << outer lazy mmu region
  <do some pte changes (A)>
  alloc_pages()
    debug_pagealloc_map_pages()
      __kernel_map_pages()
        __change_memory_common()
            <change kernel pte to make valid (B)> << deferred due to lazy mmu
    clear_page()                                  << BANG! B has not been actioned
  <do some more pte changes (C)>
arch_leave_lazy_mmu_mode()

This is clearly a much worse outcome. It's not happening today but it could in
future. That's why I'm claiming it's fragile. It's much better (IMHO) to
disallow calling the page allocator when in lazy mmu mode.

I won't speak for other arches; there may be more or less potential impact for them.

> 
> I am thinking specifically about the proposed use in Dev's new series [0] and
> obviously hoping (and assuming in fact) that it's the former :)

Dev's changes aren't directly related to this, but if a version was accepted
that didn't include the lazy mmu mode, that would cause non-obvious issues.

Hope that helps?

Thanks,
Ryan

> 
> [0]: https://lore.kernel.org/linux-mm/20250530090407.19237-1-dev.jain@arm.com/
> 
>>
>> The first 2 patches are unrelated, very obvious bug fixes. They don't affect
>> arm64 because arm64 only uses lazy mmu for kernel mappings. But I noticed them
>> during code review and think they should be fixed.
>>
>> The next 3 patches are aimed at solving the nesting issue.
>>
>> And the final patch is reverting the "permissive" fix I did for arm64, which is
>> no longer needed after the previous 3 patches.
>>
>> I've labelled this RFC for now because it depends on the arm64 lazy mmu patches
>> in Linus's master, so it won't apply to mm-unstable. But I'm keen to get
>> review, and since I'm touching various arches and modifying some core mm
>> stuff, I expect review to take a while, so I thought I'd beat the rush and get
>> a first version out early.
>>
>> I've build-tested all the affected arches. And I've run mm selftests for the
>> arm64 build, with no issues (with DEBUG_PAGEALLOC and KFENCE enabled).
>>
>> Applies against Linus's master branch (f66bc387efbe).
>>
>> Thanks,
>> Ryan
>>
>>
>> Ryan Roberts (6):
>>   fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in
>>     pagemap_scan_pmd_entry()
>>   mm: Fix pte update and tlb maintenance ordering in
>>     migrate_vma_collect_pmd()
>>   mm: Avoid calling page allocator from apply_to_page_range()
>>   mm: Introduce arch_in_lazy_mmu_mode()
>>   mm: Avoid calling page allocator while in lazy mmu mode
>>   Revert "arm64/mm: Permit lazy_mmu_mode to be nested"
>>
>>  arch/arm64/include/asm/pgtable.h              | 22 ++++----
>>  .../include/asm/book3s/64/tlbflush-hash.h     | 15 ++++++
>>  arch/sparc/include/asm/tlbflush_64.h          |  1 +
>>  arch/sparc/mm/tlb.c                           | 12 +++++
>>  arch/x86/include/asm/paravirt.h               |  5 ++
>>  arch/x86/include/asm/paravirt_types.h         |  1 +
>>  arch/x86/kernel/paravirt.c                    |  6 +++
>>  arch/x86/xen/mmu_pv.c                         |  6 +++
>>  fs/proc/task_mmu.c                            |  3 +-
>>  include/asm-generic/tlb.h                     |  2 +
>>  include/linux/mm.h                            |  6 +++
>>  include/linux/pgtable.h                       |  1 +
>>  kernel/bpf/arena.c                            |  6 +--
>>  mm/kasan/shadow.c                             |  2 +-
>>  mm/memory.c                                   | 54 ++++++++++++++-----
>>  mm/migrate_device.c                           |  3 +-
>>  mm/mmu_gather.c                               | 15 ++++++
>>  17 files changed, 128 insertions(+), 32 deletions(-)
>>
>> --
>> 2.43.0
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range()
  2025-05-30 14:04 ` [RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range() Ryan Roberts
@ 2025-05-30 16:23   ` Liam R. Howlett
  2025-05-30 16:50     ` Ryan Roberts
  0 siblings, 1 reply; 17+ messages in thread
From: Liam R. Howlett @ 2025-05-30 16:23 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm

* Ryan Roberts <ryan.roberts@arm.com> [250530 10:05]:
> Lazy mmu mode applies to the current task and permits pte modifications
> to be deferred and updated at a later time in a batch to improve
> performance. apply_to_page_range() calls its callback in lazy mmu mode
> and some of those callbacks call into the page allocator to either
> allocate or free pages.
> 
> This is problematic with CONFIG_DEBUG_PAGEALLOC because
> debug_pagealloc_[un]map_pages() calls the arch implementation of
> __kernel_map_pages() which must modify the ptes for the linear map.
> 
> There are two possibilities at this point:
> 
>  - If the arch implementation modifies the ptes directly without first
>    entering lazy mmu mode, the pte modifications may get deferred until
>    the existing lazy mmu mode is exited. This could result in taking
>    spurious faults for example.
> 
>  - If the arch implementation enters a nested lazy mmu mode before
>    modification of the ptes (many arches use apply_to_page_range()),
>    then the linear map updates will definitely be applied upon leaving
>    the inner lazy mmu mode. But because lazy mmu mode does not support
>    nesting, the remainder of the outer user is no longer in lazy mmu
>    mode and the optimization opportunity is lost.
> 
> So let's just ensure that the page allocator is never called from within
> lazy mmu mode. New "_nolazy" variants of apply_to_page_range() and
> apply_to_existing_page_range() are introduced which don't enter lazy mmu
> mode. Then users which need to call into the page allocator within their
> callback are updated to use the _nolazy variants.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/mm.h |  6 ++++++
>  kernel/bpf/arena.c |  6 +++---
>  mm/kasan/shadow.c  |  2 +-
>  mm/memory.c        | 54 +++++++++++++++++++++++++++++++++++-----------
>  4 files changed, 51 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e51dba8398f7..11cae6ce04ff 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3743,9 +3743,15 @@ static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
>  typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
>  			       unsigned long size, pte_fn_t fn, void *data);
> +extern int apply_to_page_range_nolazy(struct mm_struct *mm,
> +				      unsigned long address, unsigned long size,
> +				      pte_fn_t fn, void *data);

We are removing externs as things are edited, so probably drop them
here.

>  extern int apply_to_existing_page_range(struct mm_struct *mm,
>  				   unsigned long address, unsigned long size,
>  				   pte_fn_t fn, void *data);
> +extern int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
> +				   unsigned long address, unsigned long size,
> +				   pte_fn_t fn, void *data);
>  
>  #ifdef CONFIG_PAGE_POISONING
>  extern void __kernel_poison_pages(struct page *page, int numpages);
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 0d56cea71602..ca833cfeefb7 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -187,10 +187,10 @@ static void arena_map_free(struct bpf_map *map)
>  	/*
>  	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
>  	 * It unmaps everything from vmalloc area and clears pgtables.
> -	 * Call apply_to_existing_page_range() first to find populated ptes and
> -	 * free those pages.
> +	 * Call apply_to_existing_page_range_nolazy() first to find populated
> +	 * ptes and free those pages.
>  	 */
> -	apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> +	apply_to_existing_page_range_nolazy(&init_mm, bpf_arena_get_kern_vm_start(arena),
>  				     KERN_VM_SZ - GUARD_SZ, existing_page_cb, NULL);
>  	free_vm_area(arena->kern_vm);
>  	range_tree_destroy(&arena->rt);
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index d2c70cd2afb1..2325c5166c3a 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -590,7 +590,7 @@ void kasan_release_vmalloc(unsigned long start, unsigned long end,
>  
>  
>  		if (flags & KASAN_VMALLOC_PAGE_RANGE)
> -			apply_to_existing_page_range(&init_mm,
> +			apply_to_existing_page_range_nolazy(&init_mm,
>  					     (unsigned long)shadow_start,
>  					     size, kasan_depopulate_vmalloc_pte,
>  					     NULL);
> diff --git a/mm/memory.c b/mm/memory.c
> index 49199410805c..24436074ce48 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2913,7 +2913,7 @@ EXPORT_SYMBOL(vm_iomap_memory);
>  static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  				     unsigned long addr, unsigned long end,
>  				     pte_fn_t fn, void *data, bool create,
> -				     pgtbl_mod_mask *mask)
> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
>  {
>  	pte_t *pte, *mapped_pte;
>  	int err = 0;
> @@ -2933,7 +2933,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  			return -EINVAL;
>  	}
>  
> -	arch_enter_lazy_mmu_mode();
> +	if (lazy_mmu)
> +		arch_enter_lazy_mmu_mode();
>  
>  	if (fn) {
>  		do {
> @@ -2946,7 +2947,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  	}
>  	*mask |= PGTBL_PTE_MODIFIED;
>  
> -	arch_leave_lazy_mmu_mode();
> +	if (lazy_mmu)
> +		arch_leave_lazy_mmu_mode();
>  
>  	if (mm != &init_mm)
>  		pte_unmap_unlock(mapped_pte, ptl);
> @@ -2956,7 +2958,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>  				     unsigned long addr, unsigned long end,
>  				     pte_fn_t fn, void *data, bool create,
> -				     pgtbl_mod_mask *mask)
> +				     pgtbl_mod_mask *mask, bool lazy_mmu)

I am having a hard time understanding why other lazy mmus were more
self-contained, but arm has added arguments to generic code?

>  {
>  	pmd_t *pmd;
>  	unsigned long next;
> @@ -2983,7 +2985,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>  			pmd_clear_bad(pmd);
>  		}
>  		err = apply_to_pte_range(mm, pmd, addr, next,
> -					 fn, data, create, mask);
> +					 fn, data, create, mask, lazy_mmu);
>  		if (err)
>  			break;
>  	} while (pmd++, addr = next, addr != end);
> @@ -2994,7 +2996,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>  static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>  				     unsigned long addr, unsigned long end,
>  				     pte_fn_t fn, void *data, bool create,
> -				     pgtbl_mod_mask *mask)
> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
>  {
>  	pud_t *pud;
>  	unsigned long next;
> @@ -3019,7 +3021,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>  			pud_clear_bad(pud);
>  		}
>  		err = apply_to_pmd_range(mm, pud, addr, next,
> -					 fn, data, create, mask);
> +					 fn, data, create, mask, lazy_mmu);
>  		if (err)
>  			break;
>  	} while (pud++, addr = next, addr != end);
> @@ -3030,7 +3032,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>  static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>  				     unsigned long addr, unsigned long end,
>  				     pte_fn_t fn, void *data, bool create,
> -				     pgtbl_mod_mask *mask)
> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
>  {
>  	p4d_t *p4d;
>  	unsigned long next;
> @@ -3055,7 +3057,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>  			p4d_clear_bad(p4d);
>  		}
>  		err = apply_to_pud_range(mm, p4d, addr, next,
> -					 fn, data, create, mask);
> +					 fn, data, create, mask, lazy_mmu);
>  		if (err)
>  			break;
>  	} while (p4d++, addr = next, addr != end);
> @@ -3065,7 +3067,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>  
>  static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  				 unsigned long size, pte_fn_t fn,
> -				 void *data, bool create)
> +				 void *data, bool create, bool lazy_mmu)
>  {
>  	pgd_t *pgd;
>  	unsigned long start = addr, next;
> @@ -3091,7 +3093,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  			pgd_clear_bad(pgd);
>  		}
>  		err = apply_to_p4d_range(mm, pgd, addr, next,
> -					 fn, data, create, &mask);
> +					 fn, data, create, &mask, lazy_mmu);

This is annoying.  We now have a bool, bool passed through with mask
inserted in the middle.

>  		if (err)
>  			break;
>  	} while (pgd++, addr = next, addr != end);
> @@ -3105,11 +3107,14 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  /*
>   * Scan a region of virtual memory, filling in page tables as necessary
>   * and calling a provided function on each leaf page table.
> + *
> + * fn() is called in lazy mmu mode. As a result, the callback must be careful
> + * not to perform memory allocation.
>   */
>  int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  			unsigned long size, pte_fn_t fn, void *data)
>  {
> -	return __apply_to_page_range(mm, addr, size, fn, data, true);
> +	return __apply_to_page_range(mm, addr, size, fn, data, true, true);

Please add something here to tell me what false, true is.

>  }
>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>  
> @@ -3117,13 +3122,36 @@ EXPORT_SYMBOL_GPL(apply_to_page_range);
>   * Scan a region of virtual memory, calling a provided function on
>   * each leaf page table where it exists.
>   *
> + * fn() is called in lazy mmu mode. As a result, the callback must be careful
> + * not to perform memory allocation.
> + *
>   * Unlike apply_to_page_range, this does _not_ fill in page tables
>   * where they are absent.
>   */
>  int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
>  				 unsigned long size, pte_fn_t fn, void *data)
>  {
> -	return __apply_to_page_range(mm, addr, size, fn, data, false);
> +	return __apply_to_page_range(mm, addr, size, fn, data, false, true);

every..

> +}
> +
> +/*
> + * As per apply_to_page_range() but fn() is not called in lazy mmu mode.
> + */
> +int apply_to_page_range_nolazy(struct mm_struct *mm, unsigned long addr,
> +			       unsigned long size, pte_fn_t fn, void *data)
> +{
> +	return __apply_to_page_range(mm, addr, size, fn, data, true, false);

one...

> +}
> +
> +/*
> + * As per apply_to_existing_page_range() but fn() is not called in lazy mmu
> + * mode.
> + */
> +int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
> +					unsigned long addr, unsigned long size,
> +					pte_fn_t fn, void *data)
> +{
> +	return __apply_to_page_range(mm, addr, size, fn, data, false, false);

adds confusion :)


These wrappers are terrible for readability and annoying for argument
lists too.

Could we do something like the pgtbl_mod_mask or zap_details and pass
through a struct or one unsigned int for create and lazy_mmu?

At least we'd have better self-documenting code in the wrappers.. and if
we ever need a third boolean, we could avoid multiplying the wrappers
again.
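
Something like this, maybe (struct name and layout are made up here, and
completely untested):

struct apply_range_ctl {
	bool create;
	bool lazy_mmu;
};

int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
			unsigned long size, pte_fn_t fn, void *data)
{
	const struct apply_range_ctl ctl = {
		.create		= true,
		.lazy_mmu	= true,
	};

	return __apply_to_page_range(mm, addr, size, fn, data, &ctl);
}

with the pte/pmd/pud/p4d helpers just passing &ctl through next to mask, and
apply_to_pte_range() testing ctl->lazy_mmu around
arch_enter/leave_lazy_mmu_mode().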

WDYT?

Cheers,
Liam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 1/6] fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in pagemap_scan_pmd_entry()
  2025-05-30 14:04 ` [RFC PATCH v1 1/6] fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in pagemap_scan_pmd_entry() Ryan Roberts
@ 2025-05-30 16:26   ` Jann Horn
  2025-05-30 16:45     ` Ryan Roberts
  0 siblings, 1 reply; 17+ messages in thread
From: Jann Horn @ 2025-05-30 16:26 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm, Andy Lutomirski

On Fri, May 30, 2025 at 4:04 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> pagemap_scan_pmd_entry() was previously modifying ptes while in lazy mmu
> mode, then performing tlb maintenance for the modified ptes, then
> leaving lazy mmu mode. But any pte modifications during lazy mmu mode
> may be deferred until arch_leave_lazy_mmu_mode(), inverting the required
> ordering between pte modification and tlb maintenance.
>
> Let's fix that by leaving lazy mmu mode, forcing all the pte updates to be
> actioned, before doing the tlb maintenance.
>
> This is a theoretical bug discovered during code review.
>
> Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")

Hmm... isn't lazy mmu mode supposed to also delay TLB flushes, and
preserve the ordering of PTE modifications and TLB flushes?

Looking at the existing implementations of lazy MMU:

 - In Xen PV implementation of lazy MMU, I see that TLB flush
hypercalls are delayed as well (xen_flush_tlb(),
xen_flush_tlb_one_user() and xen_flush_tlb_multi() all use
xen_mc_issue(XEN_LAZY_MMU) which delays issuing if lazymmu is active).
 - The sparc version also seems to delay TLB flushes, and sparc's
arch_leave_lazy_mmu_mode() seems to do TLB flushes via
flush_tlb_pending() if necessary.
 - powerpc's arch_leave_lazy_mmu_mode() also seems to do TLB flushes.

Am I missing something?

If arm64 requires different semantics compared to all existing
implementations and doesn't delay TLB flushes for lazy mmu mode, I
think the "Fixes" tag should point to your addition of lazy mmu
support for arm64.

> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  fs/proc/task_mmu.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 994cde10e3f4..361f3ffd9a0c 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -2557,10 +2557,9 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>         }
>
>  flush_and_return:
> +       arch_leave_lazy_mmu_mode();
>         if (flush_end)
>                 flush_tlb_range(vma, start, addr);
> -
> -       arch_leave_lazy_mmu_mode();

I think this ordering was probably intentional, because doing it this
way around allows Xen PV to avoid one more hypercall, because the TLB
flush can be batched together with the page table changes?


>         pte_unmap_unlock(start_pte, ptl);
>
>         cond_resched();
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 1/6] fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in pagemap_scan_pmd_entry()
  2025-05-30 16:26   ` Jann Horn
@ 2025-05-30 16:45     ` Ryan Roberts
  2025-05-30 16:48       ` Jann Horn
  0 siblings, 1 reply; 17+ messages in thread
From: Ryan Roberts @ 2025-05-30 16:45 UTC (permalink / raw)
  To: Jann Horn
  Cc: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm, Andy Lutomirski

On 30/05/2025 17:26, Jann Horn wrote:
> On Fri, May 30, 2025 at 4:04 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> pagemap_scan_pmd_entry() was previously modifying ptes while in lazy mmu
>> mode, then performing tlb maintenance for the modified ptes, then
>> leaving lazy mmu mode. But any pte modifications during lazy mmu mode
>> may be deferred until arch_leave_lazy_mmu_mode(), inverting the required
>> ordering between pte modification and tlb maintenance.
>>
>> Let's fix that by leaving lazy mmu mode, forcing all the pte updates to be
>> actioned, before doing the tlb maintenance.
>>
>> This is a theoretical bug discovered during code review.
>>
>> Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
> 
> Hmm... isn't lazy mmu mode supposed to also delay TLB flushes, and
> preserve the ordering of PTE modifications and TLB flushes?
> 
> Looking at the existing implementations of lazy MMU:
> 
>  - In Xen PV implementation of lazy MMU, I see that TLB flush
> hypercalls are delayed as well (xen_flush_tlb(),
> xen_flush_tlb_one_user() and xen_flush_tlb_multi() all use
> xen_mc_issue(XEN_LAZY_MMU) which delays issuing if lazymmu is active).
>  - The sparc version also seems to delay TLB flushes, and sparc's
> arch_leave_lazy_mmu_mode() seems to do TLB flushes via
> flush_tlb_pending() if necessary.
>  - powerpc's arch_leave_lazy_mmu_mode() also seems to do TLB flushes.
> 
> Am I missing something?

I doubt it. I suspect this was just my misunderstanding then. I hadn't
appreciated that lazy mmu is also guaranteed to maintain flush ordering; it's
chronically under-documented. Sorry for the noise here. On that basis, I expect
the first 2 patches can definitely be dropped.

> 
> If arm64 requires different semantics compared to all existing
> implementations and doesn't delay TLB flushes for lazy mmu mode, I
> think the "Fixes" tag should point to your addition of lazy mmu
> support for arm64.

arm64 doesn't require different semantics. arm64 is using lazy mmu in a very
limited manner and it can already tolerate the current code.

I just spotted this during code review and was trying to be a good citizen.
Thanks for setting me straight!

Thanks,
Ryan

> 
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  fs/proc/task_mmu.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 994cde10e3f4..361f3ffd9a0c 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -2557,10 +2557,9 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>         }
>>
>>  flush_and_return:
>> +       arch_leave_lazy_mmu_mode();
>>         if (flush_end)
>>                 flush_tlb_range(vma, start, addr);
>> -
>> -       arch_leave_lazy_mmu_mode();
> 
> I think this ordering was probably intentional, because doing it this
> way around allows Xen PV to avoid one more hypercall, because the TLB
> flush can be batched together with the page table changes?
> 
> 
>>         pte_unmap_unlock(start_pte, ptl);
>>
>>         cond_resched();
>> --
>> 2.43.0
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 1/6] fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in pagemap_scan_pmd_entry()
  2025-05-30 16:45     ` Ryan Roberts
@ 2025-05-30 16:48       ` Jann Horn
  0 siblings, 0 replies; 17+ messages in thread
From: Jann Horn @ 2025-05-30 16:48 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm, Andy Lutomirski

On Fri, May 30, 2025 at 6:45 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> On 30/05/2025 17:26, Jann Horn wrote:
> > On Fri, May 30, 2025 at 4:04 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >> pagemap_scan_pmd_entry() was previously modifying ptes while in lazy mmu
> >> mode, then performing tlb maintenance for the modified ptes, then
> >> leaving lazy mmu mode. But any pte modifications during lazy mmu mode
> >> may be deferred until arch_leave_lazy_mmu_mode(), inverting the required
> >> ordering between pte modification and tlb maintenance.
> >>
> >> Let's fix that by leaving lazy mmu mode, forcing all the pte updates to be
> >> actioned, before doing the tlb maintenance.
> >>
> >> This is a theoretical bug discovered during code review.
> >>
> >> Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
> >
> > Hmm... isn't lazy mmu mode supposed to also delay TLB flushes, and
> > preserve the ordering of PTE modifications and TLB flushes?
> >
> > Looking at the existing implementations of lazy MMU:
> >
> >  - In Xen PV implementation of lazy MMU, I see that TLB flush
> > hypercalls are delayed as well (xen_flush_tlb(),
> > xen_flush_tlb_one_user() and xen_flush_tlb_multi() all use
> > xen_mc_issue(XEN_LAZY_MMU) which delays issuing if lazymmu is active).
> >  - The sparc version also seems to delay TLB flushes, and sparc's
> > arch_leave_lazy_mmu_mode() seems to do TLB flushes via
> > flush_tlb_pending() if necessary.
> >  - powerpc's arch_leave_lazy_mmu_mode() also seems to do TLB flushes.
> >
> > Am I missing something?
>
> I doubt it. I suspect this was just my misunderstanding then. I hadn't
> appreciated that lazy mmu is also guaranteed to maintain flush ordering; it's
> chronically under-documented. Sorry for the noise here. On that basis, I expect
> the first 2 patches can definitely be dropped.

Yeah looking at this code I agree that it could use significantly more
verbose comments on the API contract.
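
Something roughly along these lines next to the generic definitions, maybe
(wording is just a strawman):

/*
 * arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode()
 *
 * Between enter and leave, PTE modifications made by the current task may be
 * queued and only applied at arch_leave_lazy_mmu_mode(). TLB flushes issued
 * in that window may be deferred as well, but the relative ordering of PTE
 * updates and TLB flushes is preserved. Nesting is not supported.
 */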

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range()
  2025-05-30 16:23   ` Liam R. Howlett
@ 2025-05-30 16:50     ` Ryan Roberts
  2025-05-30 19:08       ` Liam R. Howlett
  0 siblings, 1 reply; 17+ messages in thread
From: Ryan Roberts @ 2025-05-30 16:50 UTC (permalink / raw)
  To: Liam R. Howlett, Catalin Marinas, Will Deacon,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, David S. Miller, Andreas Larsson, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Boris Ostrovsky,
	Aneesh Kumar K.V, Andrew Morton, Peter Zijlstra, Arnd Bergmann,
	David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Alexei Starovoitov, Andrey Ryabinin, linux-arm-kernel,
	linux-kernel, linuxppc-dev, sparclinux, virtualization, xen-devel,
	linux-mm

On 30/05/2025 17:23, Liam R. Howlett wrote:
> * Ryan Roberts <ryan.roberts@arm.com> [250530 10:05]:
>> Lazy mmu mode applies to the current task and permits pte modifications
>> to be deferred and updated at a later time in a batch to improve
>> performance. apply_to_page_range() calls its callback in lazy mmu mode
>> and some of those callbacks call into the page allocator to either
>> allocate or free pages.
>>
>> This is problematic with CONFIG_DEBUG_PAGEALLOC because
>> debug_pagealloc_[un]map_pages() calls the arch implementation of
>> __kernel_map_pages() which must modify the ptes for the linear map.
>>
>> There are two possibilities at this point:
>>
>>  - If the arch implementation modifies the ptes directly without first
>>    entering lazy mmu mode, the pte modifications may get deferred until
>>    the existing lazy mmu mode is exited. This could result in taking
>>    spurious faults for example.
>>
>>  - If the arch implementation enters a nested lazy mmu mode before
>>    modification of the ptes (many arches use apply_to_page_range()),
>>    then the linear map updates will definitely be applied upon leaving
>>    the inner lazy mmu mode. But because lazy mmu mode does not support
>>    nesting, the remainder of the outer user is no longer in lazy mmu
>>    mode and the optimization opportunity is lost.
>>
>> So let's just ensure that the page allocator is never called from within
>> lazy mmu mode. New "_nolazy" variants of apply_to_page_range() and
>> apply_to_existing_page_range() are introduced which don't enter lazy mmu
>> mode. Then users which need to call into the page allocator within their
>> callback are updated to use the _nolazy variants.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/mm.h |  6 ++++++
>>  kernel/bpf/arena.c |  6 +++---
>>  mm/kasan/shadow.c  |  2 +-
>>  mm/memory.c        | 54 +++++++++++++++++++++++++++++++++++-----------
>>  4 files changed, 51 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index e51dba8398f7..11cae6ce04ff 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3743,9 +3743,15 @@ static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
>>  typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
>>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
>>  			       unsigned long size, pte_fn_t fn, void *data);
>> +extern int apply_to_page_range_nolazy(struct mm_struct *mm,
>> +				      unsigned long address, unsigned long size,
>> +				      pte_fn_t fn, void *data);
> 
> We are removing externs as things are edited, so probably drop them
> here.

ACK

> 
>>  extern int apply_to_existing_page_range(struct mm_struct *mm,
>>  				   unsigned long address, unsigned long size,
>>  				   pte_fn_t fn, void *data);
>> +extern int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
>> +				   unsigned long address, unsigned long size,
>> +				   pte_fn_t fn, void *data);
>>  
>>  #ifdef CONFIG_PAGE_POISONING
>>  extern void __kernel_poison_pages(struct page *page, int numpages);
>> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
>> index 0d56cea71602..ca833cfeefb7 100644
>> --- a/kernel/bpf/arena.c
>> +++ b/kernel/bpf/arena.c
>> @@ -187,10 +187,10 @@ static void arena_map_free(struct bpf_map *map)
>>  	/*
>>  	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
>>  	 * It unmaps everything from vmalloc area and clears pgtables.
>> -	 * Call apply_to_existing_page_range() first to find populated ptes and
>> -	 * free those pages.
>> +	 * Call apply_to_existing_page_range_nolazy() first to find populated
>> +	 * ptes and free those pages.
>>  	 */
>> -	apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
>> +	apply_to_existing_page_range_nolazy(&init_mm, bpf_arena_get_kern_vm_start(arena),
>>  				     KERN_VM_SZ - GUARD_SZ, existing_page_cb, NULL);
>>  	free_vm_area(arena->kern_vm);
>>  	range_tree_destroy(&arena->rt);
>> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
>> index d2c70cd2afb1..2325c5166c3a 100644
>> --- a/mm/kasan/shadow.c
>> +++ b/mm/kasan/shadow.c
>> @@ -590,7 +590,7 @@ void kasan_release_vmalloc(unsigned long start, unsigned long end,
>>  
>>  
>>  		if (flags & KASAN_VMALLOC_PAGE_RANGE)
>> -			apply_to_existing_page_range(&init_mm,
>> +			apply_to_existing_page_range_nolazy(&init_mm,
>>  					     (unsigned long)shadow_start,
>>  					     size, kasan_depopulate_vmalloc_pte,
>>  					     NULL);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 49199410805c..24436074ce48 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2913,7 +2913,7 @@ EXPORT_SYMBOL(vm_iomap_memory);
>>  static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>>  				     unsigned long addr, unsigned long end,
>>  				     pte_fn_t fn, void *data, bool create,
>> -				     pgtbl_mod_mask *mask)
>> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
>>  {
>>  	pte_t *pte, *mapped_pte;
>>  	int err = 0;
>> @@ -2933,7 +2933,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>>  			return -EINVAL;
>>  	}
>>  
>> -	arch_enter_lazy_mmu_mode();
>> +	if (lazy_mmu)
>> +		arch_enter_lazy_mmu_mode();
>>  
>>  	if (fn) {
>>  		do {
>> @@ -2946,7 +2947,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>>  	}
>>  	*mask |= PGTBL_PTE_MODIFIED;
>>  
>> -	arch_leave_lazy_mmu_mode();
>> +	if (lazy_mmu)
>> +		arch_leave_lazy_mmu_mode();
>>  
>>  	if (mm != &init_mm)
>>  		pte_unmap_unlock(mapped_pte, ptl);
>> @@ -2956,7 +2958,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>>  static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>>  				     unsigned long addr, unsigned long end,
>>  				     pte_fn_t fn, void *data, bool create,
>> -				     pgtbl_mod_mask *mask)
>> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
> 
> I am having a hard time understanding why other lazy mmus were more
> self-contained, but arm has added arguments to generic code?

They are not; as I explain in the cover letter, arm64 can work with the code as
it is today, but IMHO it is very fragile and this is an attempt to
reduce the fragility (for all).

> 
>>  {
>>  	pmd_t *pmd;
>>  	unsigned long next;
>> @@ -2983,7 +2985,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>>  			pmd_clear_bad(pmd);
>>  		}
>>  		err = apply_to_pte_range(mm, pmd, addr, next,
>> -					 fn, data, create, mask);
>> +					 fn, data, create, mask, lazy_mmu);
>>  		if (err)
>>  			break;
>>  	} while (pmd++, addr = next, addr != end);
>> @@ -2994,7 +2996,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>>  static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>>  				     unsigned long addr, unsigned long end,
>>  				     pte_fn_t fn, void *data, bool create,
>> -				     pgtbl_mod_mask *mask)
>> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
>>  {
>>  	pud_t *pud;
>>  	unsigned long next;
>> @@ -3019,7 +3021,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>>  			pud_clear_bad(pud);
>>  		}
>>  		err = apply_to_pmd_range(mm, pud, addr, next,
>> -					 fn, data, create, mask);
>> +					 fn, data, create, mask, lazy_mmu);
>>  		if (err)
>>  			break;
>>  	} while (pud++, addr = next, addr != end);
>> @@ -3030,7 +3032,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>>  static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>>  				     unsigned long addr, unsigned long end,
>>  				     pte_fn_t fn, void *data, bool create,
>> -				     pgtbl_mod_mask *mask)
>> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
>>  {
>>  	p4d_t *p4d;
>>  	unsigned long next;
>> @@ -3055,7 +3057,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>>  			p4d_clear_bad(p4d);
>>  		}
>>  		err = apply_to_pud_range(mm, p4d, addr, next,
>> -					 fn, data, create, mask);
>> +					 fn, data, create, mask, lazy_mmu);
>>  		if (err)
>>  			break;
>>  	} while (p4d++, addr = next, addr != end);
>> @@ -3065,7 +3067,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>>  
>>  static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>>  				 unsigned long size, pte_fn_t fn,
>> -				 void *data, bool create)
>> +				 void *data, bool create, bool lazy_mmu)
>>  {
>>  	pgd_t *pgd;
>>  	unsigned long start = addr, next;
>> @@ -3091,7 +3093,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>>  			pgd_clear_bad(pgd);
>>  		}
>>  		err = apply_to_p4d_range(mm, pgd, addr, next,
>> -					 fn, data, create, &mask);
>> +					 fn, data, create, &mask, lazy_mmu);
> 
> This is annoying.  We now have a bool, bool passed through with mask
> inserted in the middle.
> 
>>  		if (err)
>>  			break;
>>  	} while (pgd++, addr = next, addr != end);
>> @@ -3105,11 +3107,14 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>>  /*
>>   * Scan a region of virtual memory, filling in page tables as necessary
>>   * and calling a provided function on each leaf page table.
>> + *
>> + * fn() is called in lazy mmu mode. As a result, the callback must be careful
>> + * not to perform memory allocation.
>>   */
>>  int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>>  			unsigned long size, pte_fn_t fn, void *data)
>>  {
>> -	return __apply_to_page_range(mm, addr, size, fn, data, true);
>> +	return __apply_to_page_range(mm, addr, size, fn, data, true, true);
> 
> Please add something here to tell me what false, true is.
> 
>>  }
>>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>>  
>> @@ -3117,13 +3122,36 @@ EXPORT_SYMBOL_GPL(apply_to_page_range);
>>   * Scan a region of virtual memory, calling a provided function on
>>   * each leaf page table where it exists.
>>   *
>> + * fn() is called in lazy mmu mode. As a result, the callback must be careful
>> + * not to perform memory allocation.
>> + *
>>   * Unlike apply_to_page_range, this does _not_ fill in page tables
>>   * where they are absent.
>>   */
>>  int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
>>  				 unsigned long size, pte_fn_t fn, void *data)
>>  {
>> -	return __apply_to_page_range(mm, addr, size, fn, data, false);
>> +	return __apply_to_page_range(mm, addr, size, fn, data, false, true);
> 
> every..
> 
>> +}
>> +
>> +/*
>> + * As per apply_to_page_range() but fn() is not called in lazy mmu mode.
>> + */
>> +int apply_to_page_range_nolazy(struct mm_struct *mm, unsigned long addr,
>> +			       unsigned long size, pte_fn_t fn, void *data)
>> +{
>> +	return __apply_to_page_range(mm, addr, size, fn, data, true, false);
> 
> one...
> 
>> +}
>> +
>> +/*
>> + * As per apply_to_existing_page_range() but fn() is not called in lazy mmu
>> + * mode.
>> + */
>> +int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
>> +					unsigned long addr, unsigned long size,
>> +					pte_fn_t fn, void *data)
>> +{
>> +	return __apply_to_page_range(mm, addr, size, fn, data, false, false);
> 
> adds confusion :)
> 
> 
> These wrappers are terrible for readability and annoying for argument
> lists too.

Agreed.

> 
> Could we do something like the pgtbl_mod_mask or zap_details and pass
> through a struct or one unsigned int for create and lazy_mmu?

Or just create some enum flags?

> 
> At least we'd have better self-documenting code in the wrappers.. and if
> we ever need a third boolean, we could avoid multiplying the wrappers
> again.
> 
> WDYT?

I'm happy with either approach. I was expecting more consternation about the idea
of being able to disable lazy mode though, so perhaps I'll wait and see if any
arrives. If it doesn't... flags?
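
A rough sketch of the flags version, untested and with made-up names:

enum apply_range_flags {
	APPLY_RANGE_CREATE	= (1 << 0),	/* fill in missing page tables */
	APPLY_RANGE_LAZY_MMU	= (1 << 1),	/* call fn() in lazy mmu mode */
};

int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
			unsigned long size, pte_fn_t fn, void *data)
{
	return __apply_to_page_range(mm, addr, size, fn, data,
				     APPLY_RANGE_CREATE | APPLY_RANGE_LAZY_MMU);
}

int apply_to_existing_page_range_nolazy(struct mm_struct *mm, unsigned long addr,
					unsigned long size, pte_fn_t fn, void *data)
{
	return __apply_to_page_range(mm, addr, size, fn, data, 0);
}

The internal helpers would then take an unsigned int flags instead of the two
bools and test APPLY_RANGE_LAZY_MMU around arch_enter/leave_lazy_mmu_mode().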

> 
> Cheers,
> Liam


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range()
  2025-05-30 16:50     ` Ryan Roberts
@ 2025-05-30 19:08       ` Liam R. Howlett
  0 siblings, 0 replies; 17+ messages in thread
From: Liam R. Howlett @ 2025-05-30 19:08 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	David S. Miller, Andreas Larsson, Juergen Gross, Ajay Kaher,
	Alexey Makhalov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Boris Ostrovsky, Aneesh Kumar K.V,
	Andrew Morton, Peter Zijlstra, Arnd Bergmann, David Hildenbrand,
	Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm

* Ryan Roberts <ryan.roberts@arm.com> [250530 12:50]:
...

> > 
> > 
> > These wrappers are terrible for readability and annoying for argument
> > lists too.
> 
> Agreed.
> 
> > 
> > Could we do something like the pgtbl_mod_mask or zap_details and pass
> > through a struct or one unsigned int for create and lazy_mmu?
> 
> Or just create some enum flags?
> 
> > 
> > At least we'd have better self-documenting code in the wrappers.. and if
> > we ever need a third boolean, we could avoid multiplying the wrappers
> > again.
> > 
> > WDYT?
> 
> I'm happy with either approach. I was expecting more consternation about the idea
> of being able to disable lazy mode though, so perhaps I'll wait and see if any
> arrives. If it doesn't... flags?

Yes, that works as well.  Please use pmd_flags or anything more
descriptive than just 'flags' :)

I wonder which approach is best in asm instructions and self-documenting
code.

Regards,
Liam


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements
  2025-05-30 15:55   ` Ryan Roberts
@ 2025-05-31  7:46     ` Mike Rapoport
  2025-06-02 10:31       ` Ryan Roberts
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Rapoport @ 2025-05-31  7:46 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Lorenzo Stoakes, Catalin Marinas, Will Deacon,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, David S. Miller, Andreas Larsson, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Boris Ostrovsky,
	Aneesh Kumar K.V, Andrew Morton, Peter Zijlstra, Arnd Bergmann,
	David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm, Jann Horn

Hi Ryan,

On Fri, May 30, 2025 at 04:55:36PM +0100, Ryan Roberts wrote:
> On 30/05/2025 15:47, Lorenzo Stoakes wrote:
> > +cc Jann who is a specialist in all things page table-y and especially scary
> > edge cases :)
> > 
> > On Fri, May 30, 2025 at 03:04:38PM +0100, Ryan Roberts wrote:
> >> Hi All,
> >>
> >> I recently added support for lazy mmu mode on arm64. The series is now in
> >> Linus's tree so should be in v6.16-rc1. But during testing in linux-next we
> >> found some ugly corners (unexpected nesting). I was able to fix those issues by
> >> making the arm64 implementation more permissive (like the other arches). But
> >> this is quite fragile IMHO. So I'd rather fix the root cause and ensure that
> >> lazy mmu mode never nests, and more importantly, that code never makes pgtable
> >> modifications expecting them to be immediate, not knowing that it's actually in
> >> lazy mmu mode so the changes get deferred.
> > 
> > When you say fragile, are you confident it _works_ but perhaps not quite as well
> > as you want? Or are you concerned this might be broken upstream in any way?
> 
> I'm confident that it _works_ for arm64 as it is, upstream. But if Dev's series
> were to go in _without_ the lazy_mmu bracketing in some manner, then it would
> be broken if the config includes CONFIG_DEBUG_PAGEALLOC.
> 
> There's a lot more explanation in the later patches as to how it can be broken,
> but for arm64, the situation is currently like this, because our implementation
> of __change_memory_common() uses apply_to_page_range() which implicitly starts
> an inner lazy_mmu_mode. We enter multiple times, but we exit on the first call
> to exit. Everything works correctly but it's not optimal because C is no longer
> deferred:
> 
> arch_enter_lazy_mmu_mode()                        << outer lazy mmu region
>   <do some pte changes (A)>
>   alloc_pages()
>     debug_pagealloc_map_pages()
>       __kernel_map_pages()
>         __change_memory_common()
>           arch_enter_lazy_mmu_mode()              << inner lazy mmu region
>             <change kernel pte to make valid (B)>
>           arch_leave_lazy_mmu_mode()              << exit; complete A + B
>     clear_page()
>   <do some more pte changes (C)>                  << no longer in lazy mode
> arch_leave_lazy_mmu_mode()                        << nop
> 
> An alternative implementation would not add the nested lazy mmu mode, so we end
> up with this:
> 
> arch_enter_lazy_mmu_mode()                        << outer lazy mmu region
>   <do some pte changes (A)>
>   alloc_pages()
>     debug_pagealloc_map_pages()
>       __kernel_map_pages()
>         __change_memory_common()
>             <change kernel pte to make valid (B)> << deferred due to lazy mmu
>     clear_page()                                  << BANG! B has not been actioned
>   <do some more pte changes (C)>
> arch_leave_lazy_mmu_mode()
> 
> This is clearly a much worse outcome. It's not happening today but it could in
> future. That's why I'm claiming it's fragile. It's much better (IMHO) to
> disallow calling the page allocator when in lazy mmu mode.

First, I think it should be handled completely inside arch/arm64. Page
allocation works in lazy mmu mode on other architectures; there's no reason it
should change because of the way arm64 implements lazy mmu.

Second, DEBUG_PAGEALLOC already implies that performance is bad: for it to
be useful the kernel should be mapped with base pages, and there's a map/unmap
for every page allocation, so optimizing a few pte changes (C in your
example) won't matter much.

If there's a potential correctness issue with Dev's patches, it should be
dealt with as a part of those patches with the necessary updates of how
lazy mmu is implemented on arm64 and used in pageattr.c.

And it seems to me that adding something along the lines below to
__kernel_map_pages() would solve the DEBUG_PAGEALLOC issue:

void __kernel_map_pages(struct page *page, int numpages, int enable)
{
	unsigned long flags;
	bool lazy_mmu = false;

	if (!can_set_direct_map())
		return;

	flags = read_thread_flags();
	if (flags & BIT(TIF_LAZY_MMU))
		lazy_mmu = true;

	set_memory_valid((unsigned long)page_address(page), numpages, enable);

	if (lazy_mmu)
		set_thread_flag(TIF_LAZY_MMU);
}

> Thanks,
> Ryan

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements
  2025-05-31  7:46     ` Mike Rapoport
@ 2025-06-02 10:31       ` Ryan Roberts
  0 siblings, 0 replies; 17+ messages in thread
From: Ryan Roberts @ 2025-06-02 10:31 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Lorenzo Stoakes, Catalin Marinas, Will Deacon,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, David S. Miller, Andreas Larsson, Juergen Gross,
	Ajay Kaher, Alexey Makhalov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Boris Ostrovsky,
	Aneesh Kumar K.V, Andrew Morton, Peter Zijlstra, Arnd Bergmann,
	David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Alexei Starovoitov,
	Andrey Ryabinin, linux-arm-kernel, linux-kernel, linuxppc-dev,
	sparclinux, virtualization, xen-devel, linux-mm, Jann Horn

On 31/05/2025 08:46, Mike Rapoport wrote:
> Hi Ryan,
> 
> On Fri, May 30, 2025 at 04:55:36PM +0100, Ryan Roberts wrote:
>> On 30/05/2025 15:47, Lorenzo Stoakes wrote:
>>> +cc Jann who is a specialist in all things page table-y and especially scary
>>> edge cases :)
>>>
>>> On Fri, May 30, 2025 at 03:04:38PM +0100, Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>> I recently added support for lazy mmu mode on arm64. The series is now in
>>>> Linus's tree so should be in v6.16-rc1. But during testing in linux-next we
>>>> found some ugly corners (unexpected nesting). I was able to fix those issues by
>>>> making the arm64 implementation more permissive (like the other arches). But
>>>> this is quite fragile IMHO. So I'd rather fix the root cause and ensure that
>>>> lazy mmu mode never nests, and more importantly, that code never makes pgtable
>>>> modifications expecting them to be immediate, not knowing that it's actually in
>>>> lazy mmu mode so the changes get deferred.
>>>
>>> When you say fragile, are you confident it _works_ but perhaps not quite as well
>>> as you want? Or are you concerned this might be broken upstream in any way?
>>
>> I'm confident that it _works_ for arm64 as it is, upstream. But if Dev's series
>> were to go in _without_ the lazy_mmu bracketing in some manner, then it would
>> be broken if the config includes CONFIG_DEBUG_PAGEALLOC.
>>
>> There's a lot more explanation in the later patches as to how it can be broken,
>> but for arm64, the situation is currently like this, because our implementation
>> of __change_memory_common() uses apply_to_page_range() which implicitly starts
>> an inner lazy_mmu_mode. We enter multiple times, but we exit on the first call
>> to exit. Everything works correctly but it's not optimal because C is no longer
>> deferred:
>>
>> arch_enter_lazy_mmu_mode()                        << outer lazy mmu region
>>   <do some pte changes (A)>
>>   alloc_pages()
>>     debug_pagealloc_map_pages()
>>       __kernel_map_pages()
>>         __change_memory_common()
>>           arch_enter_lazy_mmu_mode()              << inner lazy mmu region
>>             <change kernel pte to make valid (B)>
>>           arch_leave_lazy_mmu_mode()              << exit; complete A + B
>>     clear_page()
>>   <do some more pte changes (C)>                  << no longer in lazy mode
>> arch_leave_lazy_mmu_mode()                        << nop
>>
>> An alternative implementation would not add the nested lazy mmu mode, so we end
>> up with this:
>>
>> arch_enter_lazy_mmu_mode()                        << outer lazy mmu region
>>   <do some pte changes (A)>
>>   alloc_pages()
>>     debug_pagealloc_map_pages()
>>       __kernel_map_pages()
>>         __change_memory_common()
>>             <change kernel pte to make valid (B)> << deferred due to lazy mmu
>>     clear_page()                                  << BANG! B has not been actioned
>>   <do some more pte changes (C)>
>> arch_leave_lazy_mmu_mode()
>>
>> This is clearly a much worse outcome. It's not happening today but it could in
>> future. That's why I'm claiming it's fragile. It's much better (IMHO) to
>> disallow calling the page allocator when in lazy mmu mode.
> 
> First, I think it should be handled completely inside arch/arm64. Page
> allocation works in lazy mmu mode on other architectures; there's no reason it
> should change because of the way arm64 implements lazy mmu.
> 
> Second, DEBUG_PAGEALLOC already implies that performance is bad: for it to
> be useful the kernel should be mapped with base pages, and there's a map/unmap
> for every page allocation, so optimizing a few pte changes (C in your
> example) won't matter much.
> 
> If there's a potential correctness issue with Dev's patches, it should be
> dealt with as a part of those patches with the necessary updates of how
> lazy mmu is implemented on arm64 and used in pageattr.c.
> 
> And it seems to me that adding something along the lines below to
> __kernel_map_pages() would solve the DEBUG_PAGEALLOC issue:
> 
> void __kernel_map_pages(struct page *page, int numpages, int enable)
> {
> 	unsigned long flags;
> 	bool lazy_mmu = false;
> 
> 	if (!can_set_direct_map())
> 		return;
> 
> 	flags = read_thread_flags();
> 	if (flags & BIT(TIF_LAZY_MMU))
> 		lazy_mmu = true;
> 
> 	set_memory_valid((unsigned long)page_address(page), numpages, enable);
> 
> 	if (lazy_mmu)
> 		set_thread_flag(TIF_LAZY_MMU);
> }

Hi Mike,

I've thought about this for a bit, and concluded that you are totally right.
This is a much smaller, arm64-contained patch. Sorry for the noise here, and
thanks for the suggestion.

Thanks,
Ryan


> 
>> Thanks,
>> Ryan
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-06-02 10:31 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-30 14:04 [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements Ryan Roberts
2025-05-30 14:04 ` [RFC PATCH v1 1/6] fs/proc/task_mmu: Fix pte update and tlb maintenance ordering in pagemap_scan_pmd_entry() Ryan Roberts
2025-05-30 16:26   ` Jann Horn
2025-05-30 16:45     ` Ryan Roberts
2025-05-30 16:48       ` Jann Horn
2025-05-30 14:04 ` [RFC PATCH v1 2/6] mm: Fix pte update and tlb maintenance ordering in migrate_vma_collect_pmd() Ryan Roberts
2025-05-30 14:04 ` [RFC PATCH v1 3/6] mm: Avoid calling page allocator from apply_to_page_range() Ryan Roberts
2025-05-30 16:23   ` Liam R. Howlett
2025-05-30 16:50     ` Ryan Roberts
2025-05-30 19:08       ` Liam R. Howlett
2025-05-30 14:04 ` [RFC PATCH v1 4/6] mm: Introduce arch_in_lazy_mmu_mode() Ryan Roberts
2025-05-30 14:04 ` [RFC PATCH v1 5/6] mm: Avoid calling page allocator while in lazy mmu mode Ryan Roberts
2025-05-30 14:04 ` [RFC PATCH v1 6/6] Revert "arm64/mm: Permit lazy_mmu_mode to be nested" Ryan Roberts
2025-05-30 14:47 ` [RFC PATCH v1 0/6] Lazy mmu mode fixes and improvements Lorenzo Stoakes
2025-05-30 15:55   ` Ryan Roberts
2025-05-31  7:46     ` Mike Rapoport
2025-06-02 10:31       ` Ryan Roberts
