linux-mm.kvack.org archive mirror
* [PATCH v3 00/13] Nesting support for lazy MMU mode
@ 2025-10-15  8:27 Kevin Brodsky
  2025-10-15  8:27 ` [PATCH v3 01/13] powerpc/64s: Do not re-activate batched TLB flush Kevin Brodsky
                   ` (12 more replies)
  0 siblings, 13 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

When the lazy MMU mode was introduced eons ago, it wasn't made clear
whether such a sequence was legal:

	arch_enter_lazy_mmu_mode()
	...
		arch_enter_lazy_mmu_mode()
		...
		arch_leave_lazy_mmu_mode()
	...
	arch_leave_lazy_mmu_mode()

It seems fair to say that nested calls to
arch_{enter,leave}_lazy_mmu_mode() were not expected, and most
architectures never explicitly supported nesting.

Ryan Roberts' series from March [1] attempted to prevent nesting from
ever occurring, and mostly succeeded. Unfortunately, a corner case
(DEBUG_PAGEALLOC) may still cause nesting to occur on arm64. Ryan
proposed [2] to address that corner case at the generic level but this
approach received pushback; [3] then attempted to solve the issue on
arm64 only, but it was deemed too fragile.

It is generally difficult to guarantee that lazy_mmu sections don't
nest, because callers of various standard mm functions do not know
whether those functions use lazy_mmu themselves. This series therefore
performs a U-turn and adds support for nested lazy_mmu sections on all
architectures.

v3 is a full rewrite of the series based on the feedback from David
Hildenbrand on v2. Nesting is now handled using a counter in task_struct
(patch 7), like other APIs such as pagefault_{disable,enable}().
This is fully handled in a new generic layer in <linux/pgtable.h>; the
existing arch_* API remains unchanged. A new pair of calls,
lazy_mmu_mode_{pause,resume}(), is also introduced to allow functions
that are called with the lazy MMU mode enabled to temporarily pause it,
regardless of nesting.
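
For illustration, the sequence at the top of this letter would now be
written as follows (a rough sketch; the exact semantics of each call
are described in patch 7):

	lazy_mmu_mode_enable()		-> arch_enter()
	...
		lazy_mmu_mode_enable()		-> (no arch call)
		...
		lazy_mmu_mode_disable()		-> arch_flush()
	...
	lazy_mmu_mode_disable()		-> arch_leave()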

An arch now opts in to using the lazy MMU mode by selecting
CONFIG_ARCH_LAZY_MMU; this is more appropriate now that we have a
generic API, especially with state conditionally added to task_struct.
The overall approach is very close to what David proposed on v2 [4].

Unlike in v1/v2, no special provision is made for architectures to
save/restore extra state when entering/leaving the mode. Based on the
discussions so far, this does not seem to be required - an arch can
store any relevant state in thread_struct during arch_enter() and
restore it in arch_leave(). Nesting is not a concern as these functions
are only called at the top level, not in nested sections.
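
As a purely hypothetical illustration (the thread_struct field and the
read/write helpers below are invented for this example, they are not
part of the series), an arch needing to preserve extra state could do:

	static inline void arch_enter_lazy_mmu_mode(void)
	{
		/* hypothetical: stash arch-specific state in thread_struct */
		current->thread.lazy_mmu_saved = read_arch_state();
	}

	static inline void arch_leave_lazy_mmu_mode(void)
	{
		/* hypothetical: restore it when the outermost section ends */
		write_arch_state(current->thread.lazy_mmu_saved);
	}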

The introduction of a generic layer, and the tracking of the lazy MMU
state in task_struct, also make it possible to streamline the arch
callbacks - this series removes 72 lines from arch/.

Patch overview:

* Patch 1: cleanup - avoids having to deal with the powerpc
  context-switching code

* Patches 2-4: prepare arch_flush_lazy_mmu_mode() to be called from the
  generic layer (patch 7)

* Patches 5-6: new API + CONFIG_ARCH_LAZY_MMU

* Patch 7: nesting support

* Patches 8-13: move as much handling as possible to the generic layer

This series has been tested by running the mm kselftests on arm64 with
DEBUG_VM, DEBUG_PAGEALLOC and KFENCE. It was also build-tested on other
architectures (with and without XEN_PV on x86).

- Kevin

[1] https://lore.kernel.org/all/20250303141542.3371656-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20250530140446.2387131-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20250606135654.178300-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/all/ef343405-c394-4763-a79f-21381f217b6c@redhat.com/
---
Changelog

v2..v3:

- Full rewrite; dropped all Acked-by/Reviewed-by.
- Rebased on v6.18-rc1.

v2: https://lore.kernel.org/all/20250908073931.4159362-1-kevin.brodsky@arm.com/

v1..v2:
- Rebased on mm-unstable.
- Patch 2: handled new calls to enter()/leave(), clarified how the "flush"
  pattern (leave() followed by enter()) is handled.
- Patch 5,6: removed unnecessary local variable [Alexander Gordeev's
  suggestion].
- Added Mike Rapoport's Acked-by.

v1: https://lore.kernel.org/all/20250904125736.3918646-1-kevin.brodsky@arm.com/
---
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: x86@kernel.org
---
Alexander Gordeev (1):
  powerpc/64s: Do not re-activate batched TLB flush

Kevin Brodsky (12):
  x86/xen: simplify flush_lazy_mmu()
  powerpc/mm: implement arch_flush_lazy_mmu_mode()
  sparc/mm: implement arch_flush_lazy_mmu_mode()
  mm: introduce CONFIG_ARCH_LAZY_MMU
  mm: introduce generic lazy_mmu helpers
  mm: enable lazy_mmu sections to nest
  arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
  powerpc/mm: replace batch->active with in_lazy_mmu_mode()
  sparc/mm: replace batch->active with in_lazy_mmu_mode()
  x86/xen: use lazy_mmu_state when context-switching
  mm: bail out of lazy_mmu_mode_* in interrupt context
  mm: introduce arch_wants_lazy_mmu_mode()

 arch/arm64/Kconfig                            |   1 +
 arch/arm64/include/asm/pgtable.h              |  46 +------
 arch/arm64/include/asm/thread_info.h          |   3 +-
 arch/arm64/mm/mmu.c                           |   4 +-
 arch/arm64/mm/pageattr.c                      |   4 +-
 .../include/asm/book3s/64/tlbflush-hash.h     |  25 ++--
 arch/powerpc/include/asm/thread_info.h        |   2 -
 arch/powerpc/kernel/process.c                 |  25 ----
 arch/powerpc/mm/book3s64/hash_tlb.c           |  10 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       |   4 +-
 arch/powerpc/platforms/Kconfig.cputype        |   1 +
 arch/sparc/Kconfig                            |   1 +
 arch/sparc/include/asm/tlbflush_64.h          |   5 +-
 arch/sparc/mm/tlb.c                           |  14 +--
 arch/x86/Kconfig                              |   1 +
 arch/x86/boot/compressed/misc.h               |   1 +
 arch/x86/boot/startup/sme.c                   |   1 +
 arch/x86/include/asm/paravirt.h               |   1 -
 arch/x86/include/asm/pgtable.h                |   3 +-
 arch/x86/include/asm/thread_info.h            |   4 +-
 arch/x86/xen/enlighten_pv.c                   |   3 +-
 arch/x86/xen/mmu_pv.c                         |   9 +-
 fs/proc/task_mmu.c                            |   4 +-
 include/linux/mm_types_task.h                 |   5 +
 include/linux/pgtable.h                       | 114 +++++++++++++++++-
 include/linux/sched.h                         |  19 +++
 mm/Kconfig                                    |   3 +
 mm/kasan/shadow.c                             |   8 +-
 mm/madvise.c                                  |  18 +--
 mm/memory.c                                   |  16 +--
 mm/migrate_device.c                           |   4 +-
 mm/mprotect.c                                 |   4 +-
 mm/mremap.c                                   |   4 +-
 mm/userfaultfd.c                              |   4 +-
 mm/vmalloc.c                                  |  12 +-
 mm/vmscan.c                                   |  12 +-
 36 files changed, 226 insertions(+), 169 deletions(-)


base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
-- 
2.47.0



^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v3 01/13] powerpc/64s: Do not re-activate batched TLB flush
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-15  8:27 ` [PATCH v3 02/13] x86/xen: simplify flush_lazy_mmu() Kevin Brodsky
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

From: Alexander Gordeev <agordeev@linux.ibm.com>

Since commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash
lazy mmu mode") a task cannot be preempted while in lazy MMU mode.
Therefore, the batch re-activation code is never called, so remove it.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
This patch was originally posted as part of [1]; the series was not
taken but this patch remains relevant.

[1] https://lore.kernel.org/all/cover.1749747752.git.agordeev@linux.ibm.com/
---
 arch/powerpc/include/asm/thread_info.h |  2 --
 arch/powerpc/kernel/process.c          | 25 -------------------------
 2 files changed, 27 deletions(-)

diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index b0f200aba2b3..97f35f9b1a96 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -154,12 +154,10 @@ void arch_setup_new_exec(void);
 /* Don't move TLF_NAPPING without adjusting the code in entry_32.S */
 #define TLF_NAPPING		0	/* idle thread enabled NAP mode */
 #define TLF_SLEEPING		1	/* suspend code enabled SLEEP mode */
-#define TLF_LAZY_MMU		3	/* tlb_batch is active */
 #define TLF_RUNLATCH		4	/* Is the runlatch enabled? */
 
 #define _TLF_NAPPING		(1 << TLF_NAPPING)
 #define _TLF_SLEEPING		(1 << TLF_SLEEPING)
-#define _TLF_LAZY_MMU		(1 << TLF_LAZY_MMU)
 #define _TLF_RUNLATCH		(1 << TLF_RUNLATCH)
 
 #ifndef __ASSEMBLER__
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index eb23966ac0a9..9237dcbeee4a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1281,9 +1281,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
 {
 	struct thread_struct *new_thread, *old_thread;
 	struct task_struct *last;
-#ifdef CONFIG_PPC_64S_HASH_MMU
-	struct ppc64_tlb_batch *batch;
-#endif
 
 	new_thread = &new->thread;
 	old_thread = &current->thread;
@@ -1291,14 +1288,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
 	WARN_ON(!irqs_disabled());
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
-	batch = this_cpu_ptr(&ppc64_tlb_batch);
-	if (batch->active) {
-		current_thread_info()->local_flags |= _TLF_LAZY_MMU;
-		if (batch->index)
-			__flush_tlb_pending(batch);
-		batch->active = 0;
-	}
-
 	/*
 	 * On POWER9 the copy-paste buffer can only paste into
 	 * foreign real addresses, so unprivileged processes can not
@@ -1369,20 +1358,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
 	 */
 
 #ifdef CONFIG_PPC_BOOK3S_64
-#ifdef CONFIG_PPC_64S_HASH_MMU
-	/*
-	 * This applies to a process that was context switched while inside
-	 * arch_enter_lazy_mmu_mode(), to re-activate the batch that was
-	 * deactivated above, before _switch(). This will never be the case
-	 * for new tasks.
-	 */
-	if (current_thread_info()->local_flags & _TLF_LAZY_MMU) {
-		current_thread_info()->local_flags &= ~_TLF_LAZY_MMU;
-		batch = this_cpu_ptr(&ppc64_tlb_batch);
-		batch->active = 1;
-	}
-#endif
-
 	/*
 	 * Math facilities are masked out of the child MSR in copy_thread.
 	 * A new task does not need to restore_math because it will
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v3 02/13] x86/xen: simplify flush_lazy_mmu()
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
  2025-10-15  8:27 ` [PATCH v3 01/13] powerpc/64s: Do not re-activate batched TLB flush Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-15 16:52   ` Dave Hansen
  2025-10-15  8:27 ` [PATCH v3 03/13] powerpc/mm: implement arch_flush_lazy_mmu_mode() Kevin Brodsky
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

arch_flush_lazy_mmu_mode() is called when outstanding batched
pgtable operations must be completed immediately. There should,
however, be no need to leave and re-enter the lazy MMU mode
completely. The only part of that sequence that we really need is
xen_mc_flush(); call it directly.

While at it, we can also avoid preempt_disable() if we are not
in lazy MMU mode - xen_get_lazy_mode() should tolerate preemption.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/x86/xen/mmu_pv.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2a4a8deaf612..dcb7b0989c32 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2137,14 +2137,11 @@ static void xen_enter_lazy_mmu(void)
 
 static void xen_flush_lazy_mmu(void)
 {
-	preempt_disable();
-
 	if (xen_get_lazy_mode() == XEN_LAZY_MMU) {
-		arch_leave_lazy_mmu_mode();
-		arch_enter_lazy_mmu_mode();
+		preempt_disable();
+		xen_mc_flush();
+		preempt_enable();
 	}
-
-	preempt_enable();
 }
 
 static void __init xen_post_allocator_init(void)
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v3 03/13] powerpc/mm: implement arch_flush_lazy_mmu_mode()
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
  2025-10-15  8:27 ` [PATCH v3 01/13] powerpc/64s: Do not re-activate batched TLB flush Kevin Brodsky
  2025-10-15  8:27 ` [PATCH v3 02/13] x86/xen: simplify flush_lazy_mmu() Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-23 19:36   ` David Hildenbrand
  2025-10-15  8:27 ` [PATCH v3 04/13] sparc/mm: " Kevin Brodsky
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

Upcoming changes to the lazy_mmu API will cause
arch_flush_lazy_mmu_mode() to be called when leaving a nested
lazy_mmu section.

Move the relevant logic from arch_leave_lazy_mmu_mode() to
arch_flush_lazy_mmu_mode() and have the former call the latter.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 .../powerpc/include/asm/book3s/64/tlbflush-hash.h | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 146287d9580f..7704dbe8e88d 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -41,6 +41,16 @@ static inline void arch_enter_lazy_mmu_mode(void)
 	batch->active = 1;
 }
 
+static inline void arch_flush_lazy_mmu_mode(void)
+{
+	struct ppc64_tlb_batch *batch;
+
+	batch = this_cpu_ptr(&ppc64_tlb_batch);
+
+	if (batch->index)
+		__flush_tlb_pending(batch);
+}
+
 static inline void arch_leave_lazy_mmu_mode(void)
 {
 	struct ppc64_tlb_batch *batch;
@@ -49,14 +59,11 @@ static inline void arch_leave_lazy_mmu_mode(void)
 		return;
 	batch = this_cpu_ptr(&ppc64_tlb_batch);
 
-	if (batch->index)
-		__flush_tlb_pending(batch);
+	arch_flush_lazy_mmu_mode();
 	batch->active = 0;
 	preempt_enable();
 }
 
-#define arch_flush_lazy_mmu_mode()      do {} while (0)
-
 extern void hash__tlbiel_all(unsigned int action);
 
 extern void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize,
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v3 04/13] sparc/mm: implement arch_flush_lazy_mmu_mode()
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
                   ` (2 preceding siblings ...)
  2025-10-15  8:27 ` [PATCH v3 03/13] powerpc/mm: implement arch_flush_lazy_mmu_mode() Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-23 19:37   ` David Hildenbrand
  2025-10-15  8:27 ` [PATCH v3 05/13] mm: introduce CONFIG_ARCH_LAZY_MMU Kevin Brodsky
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

Upcoming changes to the lazy_mmu API will cause
arch_flush_lazy_mmu_mode() to be called when leaving a nested
lazy_mmu section.

Move the relevant logic from arch_leave_lazy_mmu_mode() to
arch_flush_lazy_mmu_mode() and have the former call the latter.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/sparc/include/asm/tlbflush_64.h | 2 +-
 arch/sparc/mm/tlb.c                  | 9 ++++++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index 8b8cdaa69272..925bb5d7a4e1 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -43,8 +43,8 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 void flush_tlb_pending(void);
 void arch_enter_lazy_mmu_mode(void);
+void arch_flush_lazy_mmu_mode(void);
 void arch_leave_lazy_mmu_mode(void);
-#define arch_flush_lazy_mmu_mode()      do {} while (0)
 
 /* Local cpu only.  */
 void __flush_tlb_all(void);
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index a35ddcca5e76..7b5dfcdb1243 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -59,12 +59,19 @@ void arch_enter_lazy_mmu_mode(void)
 	tb->active = 1;
 }
 
-void arch_leave_lazy_mmu_mode(void)
+void arch_flush_lazy_mmu_mode(void)
 {
 	struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
 
 	if (tb->tlb_nr)
 		flush_tlb_pending();
+}
+
+void arch_leave_lazy_mmu_mode(void)
+{
+	struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
+
+	arch_flush_lazy_mmu_mode();
 	tb->active = 0;
 	preempt_enable();
 }
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v3 05/13] mm: introduce CONFIG_ARCH_LAZY_MMU
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
                   ` (3 preceding siblings ...)
  2025-10-15  8:27 ` [PATCH v3 04/13] sparc/mm: " Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-18  9:52   ` Mike Rapoport
  2025-10-15  8:27 ` [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers Kevin Brodsky
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

Architectures currently opt in for implementing lazy_mmu helpers by
defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE.

In preparation for introducing a generic lazy_mmu layer that will
require storage in task_struct, let's switch to a cleaner approach:
instead of defining a macro, select a CONFIG option.

This patch introduces CONFIG_ARCH_LAZY_MMU and has each arch select
it when it implements lazy_mmu helpers.
__HAVE_ARCH_ENTER_LAZY_MMU_MODE is removed and <linux/pgtable.h>
relies on the new CONFIG instead.

On x86, lazy_mmu helpers are only implemented if PARAVIRT_XXL is
selected. This creates some complications in arch/x86/boot/, because
a few files manually undefine PARAVIRT* options. As a result
<asm/paravirt.h> does not define the lazy_mmu helpers, but this
breaks the build as <linux/pgtable.h> only defines them if
!CONFIG_ARCH_LAZY_MMU. There does not seem to be a clean way out of
this - let's just undefine that new CONFIG too.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/Kconfig                                 | 1 +
 arch/arm64/include/asm/pgtable.h                   | 1 -
 arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 2 --
 arch/powerpc/platforms/Kconfig.cputype             | 1 +
 arch/sparc/Kconfig                                 | 1 +
 arch/sparc/include/asm/tlbflush_64.h               | 2 --
 arch/x86/Kconfig                                   | 1 +
 arch/x86/boot/compressed/misc.h                    | 1 +
 arch/x86/boot/startup/sme.c                        | 1 +
 arch/x86/include/asm/paravirt.h                    | 1 -
 include/linux/pgtable.h                            | 2 +-
 mm/Kconfig                                         | 3 +++
 12 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6663ffd23f25..12d47a5f5e56 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -122,6 +122,7 @@ config ARM64
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
 	select ARCH_HAS_UBSAN
+	select ARCH_LAZY_MMU
 	select ARM_AMBA
 	select ARM_ARCH_TIMER
 	select ARM_GIC
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index aa89c2e67ebc..e3cbb10288c4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -80,7 +80,6 @@ static inline void queue_pte_barriers(void)
 	}
 }
 
-#define  __HAVE_ARCH_ENTER_LAZY_MMU_MODE
 static inline void arch_enter_lazy_mmu_mode(void)
 {
 	/*
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 7704dbe8e88d..623a8a8b2d0e 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -24,8 +24,6 @@ DECLARE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch);
 
 extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch);
 
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-
 static inline void arch_enter_lazy_mmu_mode(void)
 {
 	struct ppc64_tlb_batch *batch;
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 7b527d18aa5e..a5e06aaf19cd 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -93,6 +93,7 @@ config PPC_BOOK3S_64
 	select IRQ_WORK
 	select PPC_64S_HASH_MMU if !PPC_RADIX_MMU
 	select KASAN_VMALLOC if KASAN
+	select ARCH_LAZY_MMU
 
 config PPC_BOOK3E_64
 	bool "Embedded processors"
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index a630d373e645..59f17996a353 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -112,6 +112,7 @@ config SPARC64
 	select NEED_PER_CPU_PAGE_FIRST_CHUNK
 	select ARCH_SUPPORTS_SCHED_SMT if SMP
 	select ARCH_SUPPORTS_SCHED_MC  if SMP
+	select ARCH_LAZY_MMU
 
 config ARCH_PROC_KCORE_TEXT
 	def_bool y
diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index 925bb5d7a4e1..4e1036728e2f 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -39,8 +39,6 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-
 void flush_tlb_pending(void);
 void arch_enter_lazy_mmu_mode(void);
 void arch_flush_lazy_mmu_mode(void);
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fa3b616af03a..85de037cad8c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -804,6 +804,7 @@ config PARAVIRT
 config PARAVIRT_XXL
 	bool
 	depends on X86_64
+	select ARCH_LAZY_MMU
 
 config PARAVIRT_DEBUG
 	bool "paravirt-ops debugging"
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index db1048621ea2..80b3b79a1001 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -11,6 +11,7 @@
 #undef CONFIG_PARAVIRT
 #undef CONFIG_PARAVIRT_XXL
 #undef CONFIG_PARAVIRT_SPINLOCKS
+#undef CONFIG_ARCH_LAZY_MMU
 #undef CONFIG_KASAN
 #undef CONFIG_KASAN_GENERIC
 
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index e7ea65f3f1d6..af74d09b68bc 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -24,6 +24,7 @@
 #undef CONFIG_PARAVIRT
 #undef CONFIG_PARAVIRT_XXL
 #undef CONFIG_PARAVIRT_SPINLOCKS
+#undef CONFIG_ARCH_LAZY_MMU
 
 /*
  * This code runs before CPU feature bits are set. By default, the
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index b5e59a7ba0d0..13f9cd31c8f8 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -526,7 +526,6 @@ static inline void arch_end_context_switch(struct task_struct *next)
 	PVOP_VCALL1(cpu.end_context_switch, next);
 }
 
-#define  __HAVE_ARCH_ENTER_LAZY_MMU_MODE
 static inline void arch_enter_lazy_mmu_mode(void)
 {
 	PVOP_VCALL0(mmu.lazy_mode.enter);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 32e8457ad535..124d5fa2975f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -231,7 +231,7 @@ static inline int pmd_dirty(pmd_t pmd)
  * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
  * and the mode cannot be used in interrupt context.
  */
-#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+#ifndef CONFIG_ARCH_LAZY_MMU
 static inline void arch_enter_lazy_mmu_mode(void) {}
 static inline void arch_leave_lazy_mmu_mode(void) {}
 static inline void arch_flush_lazy_mmu_mode(void) {}
diff --git a/mm/Kconfig b/mm/Kconfig
index 0e26f4fc8717..2fdcb42ca1a1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1372,6 +1372,9 @@ config PT_RECLAIM
 config FIND_NORMAL_PAGE
 	def_bool n
 
+config ARCH_LAZY_MMU
+	bool
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
                   ` (4 preceding siblings ...)
  2025-10-15  8:27 ` [PATCH v3 05/13] mm: introduce CONFIG_ARCH_LAZY_MMU Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-17 15:54   ` Alexander Gordeev
  2025-10-23 19:52   ` David Hildenbrand
  2025-10-15  8:27 ` [PATCH v3 07/13] mm: enable lazy_mmu sections to nest Kevin Brodsky
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

The implementation of the lazy MMU mode is currently entirely
arch-specific; core code directly calls arch helpers:
arch_{enter,leave}_lazy_mmu_mode().

We are about to introduce support for nested lazy MMU sections.
As things stand we'd have to duplicate that logic in every arch
implementing lazy_mmu - adding to a fair amount of logic
already duplicated across lazy_mmu implementations.

This patch therefore introduces a new generic layer that calls the
existing arch_* helpers. Two pairs of calls are introduced:

* lazy_mmu_mode_enable() ... lazy_mmu_mode_disable()
    This is the standard case where the mode is enabled for a given
    block of code by surrounding it with enable() and disable()
    calls.

* lazy_mmu_mode_pause() ... lazy_mmu_mode_resume()
    This is for situations where the mode is temporarily disabled
    by first calling pause() and then resume() (e.g. to prevent any
    batching from occurring in a critical section).

The documentation in <linux/pgtable.h> will be updated in a
subsequent patch.
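
As a schematic example of the second pattern (mirroring the kasan
conversion further down in this patch):

	/* called with the lazy MMU mode enabled */
	lazy_mmu_mode_pause();
	/* page table updates take immediate effect here - no batching */
	...
	lazy_mmu_mode_resume();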

No functional change should be introduced at this stage.
The implementation of enable()/resume() and disable()/pause() is
currently identical, but nesting support will change that.

Most of the call sites have been updated using the following
Coccinelle script:

@@
@@
{
...
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
...
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
...
}

@@
@@
{
...
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_pause();
...
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_resume();
...
}

A couple of cases are noteworthy:

* madvise_*_pte_range() call arch_leave() in multiple paths, some
  followed by an immediate exit/rescheduling and some followed by a
  conditional exit. These functions assume that they are called
  with lazy MMU disabled and we cannot simply use pause()/resume()
  to address that. This patch leaves the situation unchanged by
  calling enable()/disable() in all cases.

* x86/Xen is currently the only case where explicit handling is
  required for lazy MMU when context-switching. This is purely an
  implementation detail and using the generic lazy_mmu_mode_*
  functions would cause trouble when nesting support is introduced,
  because the generic functions must be called from the current task.
  For that reason we still use arch_leave() and arch_enter() there.

Note: x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few
places, but only defines it if PARAVIRT_XXL is selected, and we are
removing the fallback in <linux/pgtable.h>. Add a new fallback
definition to <asm/pgtable.h> to keep things building.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/mm/mmu.c                     |  4 ++--
 arch/arm64/mm/pageattr.c                |  4 ++--
 arch/powerpc/mm/book3s64/hash_tlb.c     |  8 +++----
 arch/powerpc/mm/book3s64/subpage_prot.c |  4 ++--
 arch/x86/include/asm/pgtable.h          |  3 ++-
 fs/proc/task_mmu.c                      |  4 ++--
 include/linux/pgtable.h                 | 29 +++++++++++++++++++++----
 mm/kasan/shadow.c                       |  8 +++----
 mm/madvise.c                            | 18 +++++++--------
 mm/memory.c                             | 16 +++++++-------
 mm/migrate_device.c                     |  4 ++--
 mm/mprotect.c                           |  4 ++--
 mm/mremap.c                             |  4 ++--
 mm/userfaultfd.c                        |  4 ++--
 mm/vmalloc.c                            | 12 +++++-----
 mm/vmscan.c                             | 12 +++++-----
 16 files changed, 80 insertions(+), 58 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b8d37eb037fc..d9c8e94f140f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -731,7 +731,7 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
 		return -EINVAL;
 
 	mutex_lock(&pgtable_split_lock);
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	/*
 	 * The split_kernel_leaf_mapping_locked() may sleep, it is not a
@@ -753,7 +753,7 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
 			ret = split_kernel_leaf_mapping_locked(end);
 	}
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	mutex_unlock(&pgtable_split_lock);
 	return ret;
 }
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 5135f2d66958..e4059f13c4ed 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -110,7 +110,7 @@ static int update_range_prot(unsigned long start, unsigned long size,
 	if (WARN_ON_ONCE(ret))
 		return ret;
 
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	/*
 	 * The caller must ensure that the range we are operating on does not
@@ -119,7 +119,7 @@ static int update_range_prot(unsigned long start, unsigned long size,
 	 */
 	ret = walk_kernel_page_table_range_lockless(start, start + size,
 						    &pageattr_ops, NULL, &data);
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 
 	return ret;
 }
diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index 21fcad97ae80..787f7a0e27f0 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -205,7 +205,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
 	 * way to do things but is fine for our needs here.
 	 */
 	local_irq_save(flags);
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 	for (; start < end; start += PAGE_SIZE) {
 		pte_t *ptep = find_init_mm_pte(start, &hugepage_shift);
 		unsigned long pte;
@@ -217,7 +217,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
 			continue;
 		hpte_need_flush(&init_mm, start, ptep, pte, hugepage_shift);
 	}
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	local_irq_restore(flags);
 }
 
@@ -237,7 +237,7 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
 	 * way to do things but is fine for our needs here.
 	 */
 	local_irq_save(flags);
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 	start_pte = pte_offset_map(pmd, addr);
 	if (!start_pte)
 		goto out;
@@ -249,6 +249,6 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
 	}
 	pte_unmap(start_pte);
 out:
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	local_irq_restore(flags);
 }
diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index ec98e526167e..07c47673bba2 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -73,13 +73,13 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return;
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 	for (; npages > 0; --npages) {
 		pte_update(mm, addr, pte, 0, 0, 0);
 		addr += PAGE_SIZE;
 		++pte;
 	}
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	pte_unmap_unlock(pte - 1, ptl);
 }
 
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e33df3da6980..14fd672bc9b2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -117,7 +117,8 @@ extern pmdval_t early_pmd_flags;
 #define pte_val(x)	native_pte_val(x)
 #define __pte(x)	native_make_pte(x)
 
-#define arch_end_context_switch(prev)	do {} while(0)
+#define arch_end_context_switch(prev)	do {} while (0)
+#define arch_flush_lazy_mmu_mode()	do {} while (0)
 #endif	/* CONFIG_PARAVIRT_XXL */
 
 static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc35a0543f01..d16ba1d32169 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2703,7 +2703,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 		return 0;
 	}
 
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	if ((p->arg.flags & PM_SCAN_WP_MATCHING) && !p->vec_out) {
 		/* Fast path for performing exclusive WP */
@@ -2773,7 +2773,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 	if (flush_end)
 		flush_tlb_range(vma, start, addr);
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	pte_unmap_unlock(start_pte, ptl);
 
 	cond_resched();
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 124d5fa2975f..194b2c3e7576 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -231,10 +231,31 @@ static inline int pmd_dirty(pmd_t pmd)
  * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
  * and the mode cannot be used in interrupt context.
  */
-#ifndef CONFIG_ARCH_LAZY_MMU
-static inline void arch_enter_lazy_mmu_mode(void) {}
-static inline void arch_leave_lazy_mmu_mode(void) {}
-static inline void arch_flush_lazy_mmu_mode(void) {}
+#ifdef CONFIG_ARCH_LAZY_MMU
+static inline void lazy_mmu_mode_enable(void)
+{
+	arch_enter_lazy_mmu_mode();
+}
+
+static inline void lazy_mmu_mode_disable(void)
+{
+	arch_leave_lazy_mmu_mode();
+}
+
+static inline void lazy_mmu_mode_pause(void)
+{
+	arch_leave_lazy_mmu_mode();
+}
+
+static inline void lazy_mmu_mode_resume(void)
+{
+	arch_enter_lazy_mmu_mode();
+}
+#else
+static inline void lazy_mmu_mode_enable(void) {}
+static inline void lazy_mmu_mode_disable(void) {}
+static inline void lazy_mmu_mode_pause(void) {}
+static inline void lazy_mmu_mode_resume(void) {}
 #endif
 
 #ifndef pte_batch_hint
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 5d2a876035d6..c49b029d3593 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -305,7 +305,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 	pte_t pte;
 	int index;
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_pause();
 
 	index = PFN_DOWN(addr - data->start);
 	page = data->pages[index];
@@ -319,7 +319,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 	}
 	spin_unlock(&init_mm.page_table_lock);
 
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_resume();
 
 	return 0;
 }
@@ -482,7 +482,7 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 	pte_t pte;
 	int none;
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_pause();
 
 	spin_lock(&init_mm.page_table_lock);
 	pte = ptep_get(ptep);
@@ -494,7 +494,7 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 	if (likely(!none))
 		__free_page(pfn_to_page(pte_pfn(pte)));
 
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_resume();
 
 	return 0;
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..536026772160 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -455,7 +455,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	if (!start_pte)
 		return 0;
 	flush_tlb_batched_pending(mm);
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
 		nr = 1;
 		ptent = ptep_get(pte);
@@ -463,7 +463,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		if (++batch_count == SWAP_CLUSTER_MAX) {
 			batch_count = 0;
 			if (need_resched()) {
-				arch_leave_lazy_mmu_mode();
+				lazy_mmu_mode_disable();
 				pte_unmap_unlock(start_pte, ptl);
 				cond_resched();
 				goto restart;
@@ -499,7 +499,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 				if (!folio_trylock(folio))
 					continue;
 				folio_get(folio);
-				arch_leave_lazy_mmu_mode();
+				lazy_mmu_mode_disable();
 				pte_unmap_unlock(start_pte, ptl);
 				start_pte = NULL;
 				err = split_folio(folio);
@@ -510,7 +510,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 				if (!start_pte)
 					break;
 				flush_tlb_batched_pending(mm);
-				arch_enter_lazy_mmu_mode();
+				lazy_mmu_mode_enable();
 				if (!err)
 					nr = 0;
 				continue;
@@ -558,7 +558,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	}
 
 	if (start_pte) {
-		arch_leave_lazy_mmu_mode();
+		lazy_mmu_mode_disable();
 		pte_unmap_unlock(start_pte, ptl);
 	}
 	if (pageout)
@@ -677,7 +677,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	if (!start_pte)
 		return 0;
 	flush_tlb_batched_pending(mm);
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 	for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
 		nr = 1;
 		ptent = ptep_get(pte);
@@ -727,7 +727,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				if (!folio_trylock(folio))
 					continue;
 				folio_get(folio);
-				arch_leave_lazy_mmu_mode();
+				lazy_mmu_mode_disable();
 				pte_unmap_unlock(start_pte, ptl);
 				start_pte = NULL;
 				err = split_folio(folio);
@@ -738,7 +738,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				if (!start_pte)
 					break;
 				flush_tlb_batched_pending(mm);
-				arch_enter_lazy_mmu_mode();
+				lazy_mmu_mode_enable();
 				if (!err)
 					nr = 0;
 				continue;
@@ -778,7 +778,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	if (nr_swap)
 		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
 	if (start_pte) {
-		arch_leave_lazy_mmu_mode();
+		lazy_mmu_mode_disable();
 		pte_unmap_unlock(start_pte, ptl);
 	}
 	cond_resched();
diff --git a/mm/memory.c b/mm/memory.c
index 74b45e258323..2d662dee5ae7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1254,7 +1254,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 	orig_src_pte = src_pte;
 	orig_dst_pte = dst_pte;
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	do {
 		nr = 1;
@@ -1323,7 +1323,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	} while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr,
 		 addr != end);
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	pte_unmap_unlock(orig_src_pte, src_ptl);
 	add_mm_rss_vec(dst_mm, rss);
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
@@ -1842,7 +1842,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		return addr;
 
 	flush_tlb_batched_pending(mm);
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 	do {
 		bool any_skipped = false;
 
@@ -1874,7 +1874,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
 
 	add_mm_rss_vec(mm, rss);
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 
 	/* Do the actual TLB flush before dropping ptl */
 	if (force_flush) {
@@ -2817,7 +2817,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 	do {
 		BUG_ON(!pte_none(ptep_get(pte)));
 		if (!pfn_modify_allowed(pfn, prot)) {
@@ -2827,7 +2827,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	pte_unmap_unlock(mapped_pte, ptl);
 	return err;
 }
@@ -3134,7 +3134,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			return -EINVAL;
 	}
 
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	if (fn) {
 		do {
@@ -3147,7 +3147,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	}
 	*mask |= PGTBL_PTE_MODIFIED;
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 
 	if (mm != &init_mm)
 		pte_unmap_unlock(mapped_pte, ptl);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index abd9f6850db6..dcdc46b96cc7 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -110,7 +110,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	if (!ptep)
 		goto again;
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	for (; addr < end; addr += PAGE_SIZE, ptep++) {
 		struct dev_pagemap *pgmap;
@@ -287,7 +287,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	if (unmapped)
 		flush_tlb_range(walk->vma, start, end);
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	pte_unmap_unlock(ptep - 1, ptl);
 
 	return 0;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 113b48985834..bcb183a6fd2f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -293,7 +293,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 		target_node = numa_node_id();
 
 	flush_tlb_batched_pending(vma->vm_mm);
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 	do {
 		nr_ptes = 1;
 		oldpte = ptep_get(pte);
@@ -439,7 +439,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 			}
 		}
 	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	pte_unmap_unlock(pte - 1, ptl);
 
 	return pages;
diff --git a/mm/mremap.c b/mm/mremap.c
index 35de0a7b910e..1e216007160d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -256,7 +256,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	if (new_ptl != old_ptl)
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
 	flush_tlb_batched_pending(vma->vm_mm);
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
 		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
@@ -301,7 +301,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		}
 	}
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	if (force_flush)
 		flush_tlb_range(vma, old_end - len, old_end);
 	if (new_ptl != old_ptl)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af61b95c89e4..e01f7813e15c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1100,7 +1100,7 @@ static long move_present_ptes(struct mm_struct *mm,
 	/* It's safe to drop the reference now as the page-table is holding one. */
 	folio_put(*first_src_folio);
 	*first_src_folio = NULL;
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	while (true) {
 		orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1138,7 +1138,7 @@ static long move_present_ptes(struct mm_struct *mm,
 			break;
 	}
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	if (src_addr > src_start)
 		flush_tlb_range(src_vma, src_start, src_addr);
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 798b2ed21e46..b9940590a40d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -105,7 +105,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	if (!pte)
 		return -ENOMEM;
 
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	do {
 		if (unlikely(!pte_none(ptep_get(pte)))) {
@@ -131,7 +131,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		pfn++;
 	} while (pte += PFN_DOWN(size), addr += size, addr != end);
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	*mask |= PGTBL_PTE_MODIFIED;
 	return 0;
 }
@@ -359,7 +359,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	unsigned long size = PAGE_SIZE;
 
 	pte = pte_offset_kernel(pmd, addr);
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	do {
 #ifdef CONFIG_HUGETLB_PAGE
@@ -378,7 +378,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		WARN_ON(!pte_none(ptent) && !pte_present(ptent));
 	} while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	*mask |= PGTBL_PTE_MODIFIED;
 }
 
@@ -526,7 +526,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	if (!pte)
 		return -ENOMEM;
 
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	do {
 		struct page *page = pages[*nr];
@@ -548,7 +548,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
 		(*nr)++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	*mask |= PGTBL_PTE_MODIFIED;
 
 	return err;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b2fc8b626d3d..7d2d87069530 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3551,7 +3551,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		return false;
 	}
 
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 restart:
 	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
 		unsigned long pfn;
@@ -3592,7 +3592,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
 		goto restart;
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	pte_unmap_unlock(pte, ptl);
 
 	return suitable_to_scan(total, young);
@@ -3633,7 +3633,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 	if (!spin_trylock(ptl))
 		goto done;
 
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	do {
 		unsigned long pfn;
@@ -3680,7 +3680,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 
 	walk_update_folio(walk, last, gen, dirty);
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 	spin_unlock(ptl);
 done:
 	*first = -1;
@@ -4279,7 +4279,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		}
 	}
 
-	arch_enter_lazy_mmu_mode();
+	lazy_mmu_mode_enable();
 
 	pte -= (addr - start) / PAGE_SIZE;
 
@@ -4313,7 +4313,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 
 	walk_update_folio(walk, last, gen, dirty);
 
-	arch_leave_lazy_mmu_mode();
+	lazy_mmu_mode_disable();
 
 	/* feedback from rmap walkers to page table walkers */
 	if (mm_state && suitable_to_scan(i, young))
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v3 07/13] mm: enable lazy_mmu sections to nest
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
                   ` (5 preceding siblings ...)
  2025-10-15  8:27 ` [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-23 20:00   ` David Hildenbrand
  2025-10-15  8:27 ` [PATCH v3 08/13] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode() Kevin Brodsky
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

Despite recent efforts to prevent lazy_mmu sections from nesting, it
remains difficult to ensure that nesting never occurs - and in fact it
does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC).
Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
made nesting tolerable on arm64, but without truly supporting it:
the inner call to leave() disables the batching optimisation before
the outer section ends.

This patch actually enables lazy_mmu sections to nest by tracking
the nesting level in task_struct, in a similar fashion to e.g.
pagefault_{enable,disable}(). This is fully handled by the generic
lazy_mmu helpers that were recently introduced.

lazy_mmu sections were not initially intended to nest, so we need to
clarify the semantics w.r.t. the arch_*_lazy_mmu_mode() callbacks.
This patch takes the following approach:

* The outermost calls to lazy_mmu_mode_{enable,disable}() trigger
  calls to arch_{enter,leave}_lazy_mmu_mode() - this is unchanged.

* Nested calls to lazy_mmu_mode_{enable,disable}() are not forwarded
  to the arch via arch_{enter,leave} - lazy MMU remains enabled so
  the assumption is that these callbacks are not relevant. However,
  existing code may rely on a call to disable() to flush any batched
  state, regardless of nesting. arch_flush_lazy_mmu_mode() is
  therefore called in that situation.

A separate interface was recently introduced to temporarily pause
the lazy MMU mode: lazy_mmu_mode_{pause,resume}(). pause() fully
exits the mode *regardless of the nesting level*, and resume()
restores the mode at the same nesting level.

Whether the mode is actually enabled or not at any point is tracked
by a separate "enabled" field in task_struct; this makes it possible
to check invariants in the generic API, and to expose a new
in_lazy_mmu_mode() helper to replace the various ways architectures
currently track whether the mode is enabled (this will be done in
later patches).

In summary (count/enabled represent the values *after* the call):

lazy_mmu_mode_enable()		-> arch_enter()	    count=1 enabled=1
    lazy_mmu_mode_enable()	-> ø		    count=2 enabled=1
	lazy_mmu_mode_pause()	-> arch_leave()     count=2 enabled=0
	lazy_mmu_mode_resume()	-> arch_enter()     count=2 enabled=1
    lazy_mmu_mode_disable()	-> arch_flush()     count=1 enabled=1
lazy_mmu_mode_disable()		-> arch_leave()     count=0 enabled=0

Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
headers included by <linux/pgtable.h> to use it.
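
Conceptually, the helper amounts to reading the new field - roughly:

	/* sketch - see the <linux/sched.h> hunk for the actual definition */
	static inline bool in_lazy_mmu_mode(void)
	{
		return current->lazy_mmu_state.enabled;
	}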

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
Alexander Gordeev suggested that a future optimisation may need
lazy_mmu_mode_{pause,resume}() to call distinct arch callbacks [1]. For
now arch_{leave,enter}() are called directly, but introducing new arch
callbacks should be straightforward.

[1] https://lore.kernel.org/all/5a0818bb-75d4-47df-925c-0102f7d598f4-agordeev@linux.ibm.com/
---
 arch/arm64/include/asm/pgtable.h | 12 ------
 include/linux/mm_types_task.h    |  5 +++
 include/linux/pgtable.h          | 69 ++++++++++++++++++++++++++++++--
 include/linux/sched.h            | 16 ++++++++
 4 files changed, 86 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e3cbb10288c4..f15ca4d62f09 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -82,18 +82,6 @@ static inline void queue_pte_barriers(void)
 
 static inline void arch_enter_lazy_mmu_mode(void)
 {
-	/*
-	 * lazy_mmu_mode is not supposed to permit nesting. But in practice this
-	 * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
-	 * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
-	 * permissions on the linear map with apply_to_page_range(), which
-	 * re-enters lazy_mmu_mode. So we tolerate nesting in our
-	 * implementation. The first call to arch_leave_lazy_mmu_mode() will
-	 * flush and clear the flag such that the remainder of the work in the
-	 * outer nest behaves as if outside of lazy mmu mode. This is safe and
-	 * keeps tracking simple.
-	 */
-
 	if (in_interrupt())
 		return;
 
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index a82aa80c0ba4..2ff83b85fef0 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -88,4 +88,9 @@ struct tlbflush_unmap_batch {
 #endif
 };
 
+struct lazy_mmu_state {
+	u8 count;
+	bool enabled;
+};
+
 #endif /* _LINUX_MM_TYPES_TASK_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 194b2c3e7576..269225a733de 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -228,28 +228,89 @@ static inline int pmd_dirty(pmd_t pmd)
  * of the lazy mode. So the implementation must assume preemption may be enabled
  * and cpu migration is possible; it must take steps to be robust against this.
  * (In practice, for user PTE updates, the appropriate page table lock(s) are
- * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
- * and the mode cannot be used in interrupt context.
+ * held, but for kernel PTE updates, no lock is held). The mode cannot be used
+ * in interrupt context.
+ *
+ * The lazy MMU mode is enabled for a given block of code using:
+ *
+ *   lazy_mmu_mode_enable();
+ *   <code>
+ *   lazy_mmu_mode_disable();
+ *
+ * Nesting is permitted: <code> may itself use an enable()/disable() pair.
+ * A nested call to enable() has no functional effect; however disable() causes
+ * any batched architectural state to be flushed regardless of nesting. After a
+ * call to disable(), the caller can therefore rely on all previous page table
+ * modifications to have taken effect, but the lazy MMU mode may still be
+ * enabled.
+ *
+ * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
+ * This can be done using:
+ *
+ *   lazy_mmu_mode_pause();
+ *   <code>
+ *   lazy_mmu_mode_resume();
+ *
+ * This sequence must only be used if the lazy MMU mode is already enabled.
+ * pause() ensures that the mode is exited regardless of the nesting level;
+ * resume() re-enters the mode at the same nesting level. <code> must not modify
+ * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
+ * helpers).
+ *
+ * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
+ * currently enabled.
  */
 #ifdef CONFIG_ARCH_LAZY_MMU
 static inline void lazy_mmu_mode_enable(void)
 {
-	arch_enter_lazy_mmu_mode();
+	struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+	VM_BUG_ON(state->count == U8_MAX);
+	/* enable() must not be called while paused */
+	VM_WARN_ON(state->count > 0 && !state->enabled);
+
+	if (state->count == 0) {
+		arch_enter_lazy_mmu_mode();
+		state->enabled = true;
+	}
+	++state->count;
 }
 
 static inline void lazy_mmu_mode_disable(void)
 {
-	arch_leave_lazy_mmu_mode();
+	struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+	VM_BUG_ON(state->count == 0);
+	VM_WARN_ON(!state->enabled);
+
+	--state->count;
+	if (state->count == 0) {
+		state->enabled = false;
+		arch_leave_lazy_mmu_mode();
+	} else {
+		/* Exiting a nested section */
+		arch_flush_lazy_mmu_mode();
+	}
 }
 
 static inline void lazy_mmu_mode_pause(void)
 {
+	struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+	VM_WARN_ON(state->count == 0 || !state->enabled);
+
+	state->enabled = false;
 	arch_leave_lazy_mmu_mode();
 }
 
 static inline void lazy_mmu_mode_resume(void)
 {
+	struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+	VM_WARN_ON(state->count == 0 || state->enabled);
+
 	arch_enter_lazy_mmu_mode();
+	state->enabled = true;
 }
 #else
 static inline void lazy_mmu_mode_enable(void) {}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cbb7340c5866..2862d8bf2160 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1441,6 +1441,10 @@ struct task_struct {
 
 	struct page_frag		task_frag;
 
+#ifdef CONFIG_ARCH_LAZY_MMU
+	struct lazy_mmu_state		lazy_mmu_state;
+#endif
+
 #ifdef CONFIG_TASK_DELAY_ACCT
 	struct task_delay_info		*delays;
 #endif
@@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
 	return task_index_to_char(task_state_index(tsk));
 }
 
+#ifdef CONFIG_ARCH_LAZY_MMU
+static inline bool in_lazy_mmu_mode(void)
+{
+	return current->lazy_mmu_state.enabled;
+}
+#else
+static inline bool in_lazy_mmu_mode(void)
+{
+	return false;
+}
+#endif
+
 extern struct pid *cad_pid;
 
 /*
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v3 08/13] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
                   ` (6 preceding siblings ...)
  2025-10-15  8:27 ` [PATCH v3 07/13] mm: enable lazy_mmu sections to nest Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-15  8:27 ` [PATCH v3 09/13] powerpc/mm: replace batch->active " Kevin Brodsky
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

The generic lazy_mmu layer now tracks whether a task is in lazy MMU
mode. As a result we no longer need a TIF flag for that purpose -
let's use the new in_lazy_mmu_mode() helper instead.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/pgtable.h     | 16 +++-------------
 arch/arm64/include/asm/thread_info.h |  3 +--
 2 files changed, 4 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f15ca4d62f09..944e512767db 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -62,30 +62,21 @@ static inline void emit_pte_barriers(void)
 
 static inline void queue_pte_barriers(void)
 {
-	unsigned long flags;
-
 	if (in_interrupt()) {
 		emit_pte_barriers();
 		return;
 	}
 
-	flags = read_thread_flags();
-
-	if (flags & BIT(TIF_LAZY_MMU)) {
-		/* Avoid the atomic op if already set. */
-		if (!(flags & BIT(TIF_LAZY_MMU_PENDING)))
-			set_thread_flag(TIF_LAZY_MMU_PENDING);
-	} else {
+	if (in_lazy_mmu_mode())
+		test_and_set_thread_flag(TIF_LAZY_MMU_PENDING);
+	else
 		emit_pte_barriers();
-	}
 }
 
 static inline void arch_enter_lazy_mmu_mode(void)
 {
 	if (in_interrupt())
 		return;
-
-	set_thread_flag(TIF_LAZY_MMU);
 }
 
 static inline void arch_flush_lazy_mmu_mode(void)
@@ -103,7 +94,6 @@ static inline void arch_leave_lazy_mmu_mode(void)
 		return;
 
 	arch_flush_lazy_mmu_mode();
-	clear_thread_flag(TIF_LAZY_MMU);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index f241b8601ebd..4ff8da0767d9 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -84,8 +84,7 @@ void arch_setup_new_exec(void);
 #define TIF_SME_VL_INHERIT	28	/* Inherit SME vl_onexec across exec */
 #define TIF_KERNEL_FPSTATE	29	/* Task is in a kernel mode FPSIMD section */
 #define TIF_TSC_SIGSEGV		30	/* SIGSEGV on counter-timer access */
-#define TIF_LAZY_MMU		31	/* Task in lazy mmu mode */
-#define TIF_LAZY_MMU_PENDING	32	/* Ops pending for lazy mmu mode exit */
+#define TIF_LAZY_MMU_PENDING	31	/* Ops pending for lazy mmu mode exit */
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v3 09/13] powerpc/mm: replace batch->active with in_lazy_mmu_mode()
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
                   ` (7 preceding siblings ...)
  2025-10-15  8:27 ` [PATCH v3 08/13] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode() Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-23 20:02   ` David Hildenbrand
  2025-10-15  8:27 ` [PATCH v3 10/13] sparc/mm: " Kevin Brodsky
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

The generic lazy_mmu layer now tracks whether a task is in lazy MMU
mode. As a result we no longer need to track whether the per-CPU TLB
batch struct is active - we know it is if in_lazy_mmu_mode() returns
true.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 9 ---------
 arch/powerpc/mm/book3s64/hash_tlb.c                | 2 +-
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 623a8a8b2d0e..bbc54690d374 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -12,7 +12,6 @@
 #define PPC64_TLB_BATCH_NR 192
 
 struct ppc64_tlb_batch {
-	int			active;
 	unsigned long		index;
 	struct mm_struct	*mm;
 	real_pte_t		pte[PPC64_TLB_BATCH_NR];
@@ -26,8 +25,6 @@ extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch);
 
 static inline void arch_enter_lazy_mmu_mode(void)
 {
-	struct ppc64_tlb_batch *batch;
-
 	if (radix_enabled())
 		return;
 	/*
@@ -35,8 +32,6 @@ static inline void arch_enter_lazy_mmu_mode(void)
 	 * operating on kernel page tables.
 	 */
 	preempt_disable();
-	batch = this_cpu_ptr(&ppc64_tlb_batch);
-	batch->active = 1;
 }
 
 static inline void arch_flush_lazy_mmu_mode(void)
@@ -51,14 +46,10 @@ static inline void arch_flush_lazy_mmu_mode(void)
 
 static inline void arch_leave_lazy_mmu_mode(void)
 {
-	struct ppc64_tlb_batch *batch;
-
 	if (radix_enabled())
 		return;
-	batch = this_cpu_ptr(&ppc64_tlb_batch);
 
 	arch_flush_lazy_mmu_mode();
-	batch->active = 0;
 	preempt_enable();
 }
 
diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index 787f7a0e27f0..72b83f582b6d 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -100,7 +100,7 @@ void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
 	 * Check if we have an active batch on this CPU. If not, just
 	 * flush now and return.
 	 */
-	if (!batch->active) {
+	if (!in_lazy_mmu_mode()) {
 		flush_hash_page(vpn, rpte, psize, ssize, mm_is_thread_local(mm));
 		put_cpu_var(ppc64_tlb_batch);
 		return;
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v3 10/13] sparc/mm: replace batch->active with in_lazy_mmu_mode()
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
                   ` (8 preceding siblings ...)
  2025-10-15  8:27 ` [PATCH v3 09/13] powerpc/mm: replace batch->active " Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-23 20:03   ` David Hildenbrand
  2025-10-15  8:27 ` [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

The generic lazy_mmu layer now tracks whether a task is in lazy MMU
mode. As a result we no longer need to track whether the per-CPU TLB
batch struct is active - we know it is if in_lazy_mmu_mode() returns
true.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/sparc/include/asm/tlbflush_64.h | 1 -
 arch/sparc/mm/tlb.c                  | 9 +--------
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index 4e1036728e2f..6133306ba59a 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -12,7 +12,6 @@ struct tlb_batch {
 	unsigned int hugepage_shift;
 	struct mm_struct *mm;
 	unsigned long tlb_nr;
-	unsigned long active;
 	unsigned long vaddrs[TLB_BATCH_NR];
 };
 
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 7b5dfcdb1243..879e22c86e5c 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -52,11 +52,7 @@ void flush_tlb_pending(void)
 
 void arch_enter_lazy_mmu_mode(void)
 {
-	struct tlb_batch *tb;
-
 	preempt_disable();
-	tb = this_cpu_ptr(&tlb_batch);
-	tb->active = 1;
 }
 
 void arch_flush_lazy_mmu_mode(void)
@@ -69,10 +65,7 @@ void arch_flush_lazy_mmu_mode(void)
 
 void arch_leave_lazy_mmu_mode(void)
 {
-	struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
-
 	arch_flush_lazy_mmu_mode();
-	tb->active = 0;
 	preempt_enable();
 }
 
@@ -93,7 +86,7 @@ static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr,
 		nr = 0;
 	}
 
-	if (!tb->active) {
+	if (!in_lazy_mmu_mode()) {
 		flush_tsb_user_page(mm, vaddr, hugepage_shift);
 		global_flush_tlb_page(mm, vaddr);
 		goto out;
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
                   ` (9 preceding siblings ...)
  2025-10-15  8:27 ` [PATCH v3 10/13] sparc/mm: " Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-23 20:06   ` David Hildenbrand
  2025-10-15  8:27 ` [PATCH v3 12/13] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
  2025-10-15  8:27 ` [PATCH v3 13/13] mm: introduce arch_wants_lazy_mmu_mode() Kevin Brodsky
  12 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

We currently set a TIF flag when scheduling out a task that is in
lazy MMU mode, so that the mode can be restored when the task is
scheduled again.

The generic lazy_mmu layer now tracks whether a task is in lazy MMU
mode in task_struct::lazy_mmu_state. We can therefore check that
state when switching to the new task, instead of using a separate
TIF flag.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/x86/include/asm/thread_info.h | 4 +---
 arch/x86/xen/enlighten_pv.c        | 3 +--
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e71e0e8362ed..0067684afb5b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -100,8 +100,7 @@ struct thread_info {
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_SINGLESTEP		25	/* reenable singlestep on user return*/
 #define TIF_BLOCKSTEP		26	/* set when we want DEBUGCTLMSR_BTF */
-#define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
-#define TIF_ADDR32		28	/* 32-bit address space on 64 bits */
+#define TIF_ADDR32		27	/* 32-bit address space on 64 bits */
 
 #define _TIF_SSBD		BIT(TIF_SSBD)
 #define _TIF_SPEC_IB		BIT(TIF_SPEC_IB)
@@ -114,7 +113,6 @@ struct thread_info {
 #define _TIF_FORCED_TF		BIT(TIF_FORCED_TF)
 #define _TIF_BLOCKSTEP		BIT(TIF_BLOCKSTEP)
 #define _TIF_SINGLESTEP		BIT(TIF_SINGLESTEP)
-#define _TIF_LAZY_MMU_UPDATES	BIT(TIF_LAZY_MMU_UPDATES)
 #define _TIF_ADDR32		BIT(TIF_ADDR32)
 
 /* flags to check in __switch_to() */
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 4806cc28d7ca..9fabe83e7546 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -426,7 +426,6 @@ static void xen_start_context_switch(struct task_struct *prev)
 
 	if (this_cpu_read(xen_lazy_mode) == XEN_LAZY_MMU) {
 		arch_leave_lazy_mmu_mode();
-		set_ti_thread_flag(task_thread_info(prev), TIF_LAZY_MMU_UPDATES);
 	}
 	enter_lazy(XEN_LAZY_CPU);
 }
@@ -437,7 +436,7 @@ static void xen_end_context_switch(struct task_struct *next)
 
 	xen_mc_flush();
 	leave_lazy(XEN_LAZY_CPU);
-	if (test_and_clear_ti_thread_flag(task_thread_info(next), TIF_LAZY_MMU_UPDATES))
+	if (next->lazy_mmu_state.enabled)
 		arch_enter_lazy_mmu_mode();
 }
 
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v3 12/13] mm: bail out of lazy_mmu_mode_* in interrupt context
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
                   ` (10 preceding siblings ...)
  2025-10-15  8:27 ` [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-23 20:08   ` David Hildenbrand
  2025-10-15  8:27 ` [PATCH v3 13/13] mm: introduce arch_wants_lazy_mmu_mode() Kevin Brodsky
  12 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

The lazy MMU mode cannot be used in interrupt context. This is
documented in <linux/pgtable.h>, but isn't consistently handled
across architectures.

arm64 ensures that calls to lazy_mmu_mode_* have no effect in
interrupt context, because such calls do occur in certain
configurations - see commit b81c688426a9 ("arm64/mm: Disable barrier
batching in interrupt contexts"). Other architectures do not check for
this situation, most likely because it hasn't occurred there so far.

Both arm64 and x86/Xen also ensure that any lazy MMU optimisation is
disabled while in interrupt context (see queue_pte_barriers() and
xen_get_lazy_mode() respectively).

Let's handle this in the new generic lazy_mmu layer, in the same
fashion as arm64: bail out of lazy_mmu_mode_* if in_interrupt(), and
have in_lazy_mmu_mode() return false to disable any optimisation.
Also remove the arm64 handling that is now redundant; x86/Xen has
its own internal tracking so it is left unchanged.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 17 +----------------
 include/linux/pgtable.h          | 16 ++++++++++++++--
 include/linux/sched.h            |  3 +++
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 944e512767db..a37f417c30be 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -62,37 +62,22 @@ static inline void emit_pte_barriers(void)
 
 static inline void queue_pte_barriers(void)
 {
-	if (in_interrupt()) {
-		emit_pte_barriers();
-		return;
-	}
-
 	if (in_lazy_mmu_mode())
 		test_and_set_thread_flag(TIF_LAZY_MMU_PENDING);
 	else
 		emit_pte_barriers();
 }
 
-static inline void arch_enter_lazy_mmu_mode(void)
-{
-	if (in_interrupt())
-		return;
-}
+static inline void arch_enter_lazy_mmu_mode(void) {}
 
 static inline void arch_flush_lazy_mmu_mode(void)
 {
-	if (in_interrupt())
-		return;
-
 	if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
 		emit_pte_barriers();
 }
 
 static inline void arch_leave_lazy_mmu_mode(void)
 {
-	if (in_interrupt())
-		return;
-
 	arch_flush_lazy_mmu_mode();
 }
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 269225a733de..718c9c788114 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -228,8 +228,8 @@ static inline int pmd_dirty(pmd_t pmd)
  * of the lazy mode. So the implementation must assume preemption may be enabled
  * and cpu migration is possible; it must take steps to be robust against this.
  * (In practice, for user PTE updates, the appropriate page table lock(s) are
- * held, but for kernel PTE updates, no lock is held). The mode cannot be used
- * in interrupt context.
+ * held, but for kernel PTE updates, no lock is held). The mode is disabled
+ * in interrupt context and calls to the lazy_mmu API have no effect.
  *
  * The lazy MMU mode is enabled for a given block of code using:
  *
@@ -265,6 +265,9 @@ static inline void lazy_mmu_mode_enable(void)
 {
 	struct lazy_mmu_state *state = &current->lazy_mmu_state;
 
+	if (in_interrupt())
+		return;
+
 	VM_BUG_ON(state->count == U8_MAX);
 	/* enable() must not be called while paused */
 	VM_WARN_ON(state->count > 0 && !state->enabled);
@@ -280,6 +283,9 @@ static inline void lazy_mmu_mode_disable(void)
 {
 	struct lazy_mmu_state *state = &current->lazy_mmu_state;
 
+	if (in_interrupt())
+		return;
+
 	VM_BUG_ON(state->count == 0);
 	VM_WARN_ON(!state->enabled);
 
@@ -297,6 +303,9 @@ static inline void lazy_mmu_mode_pause(void)
 {
 	struct lazy_mmu_state *state = &current->lazy_mmu_state;
 
+	if (in_interrupt())
+		return;
+
 	VM_WARN_ON(state->count == 0 || !state->enabled);
 
 	state->enabled = false;
@@ -307,6 +316,9 @@ static inline void lazy_mmu_mode_resume(void)
 {
 	struct lazy_mmu_state *state = &current->lazy_mmu_state;
 
+	if (in_interrupt())
+		return;
+
 	VM_WARN_ON(state->count == 0 || state->enabled);
 
 	arch_enter_lazy_mmu_mode();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2862d8bf2160..beb3e6cfddd9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1731,6 +1731,9 @@ static inline char task_state_to_char(struct task_struct *tsk)
 #ifdef CONFIG_ARCH_LAZY_MMU
 static inline bool in_lazy_mmu_mode(void)
 {
+	if (in_interrupt())
+		return false;
+
 	return current->lazy_mmu_state.enabled;
 }
 #else
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v3 13/13] mm: introduce arch_wants_lazy_mmu_mode()
  2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
                   ` (11 preceding siblings ...)
  2025-10-15  8:27 ` [PATCH v3 12/13] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
@ 2025-10-15  8:27 ` Kevin Brodsky
  2025-10-23 20:10   ` David Hildenbrand
  12 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-15  8:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

powerpc decides at runtime whether the lazy MMU mode should be used.

To avoid the overhead associated with managing
task_struct::lazy_mmu_state if the mode isn't used, introduce
arch_wants_lazy_mmu_mode() and bail out of lazy_mmu_mode_* if it
returns false. Add a default definition returning true, and an
appropriate implementation for powerpc.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
This patch seemed like a good idea to start with, but now I'm not so
sure that the churn added to the generic layer is worth it.

It provides a minor optimisation for just powerpc. x86 with XEN_PV also
chooses at runtime whether to implement lazy_mmu helpers or not, but
it doesn't fit this API so neatly and isn't handled here.
---
 .../include/asm/book3s/64/tlbflush-hash.h        | 11 ++++++-----
 include/linux/pgtable.h                          | 16 ++++++++++++----
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index bbc54690d374..a91b354cf87c 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -23,10 +23,14 @@ DECLARE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch);
 
 extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch);
 
+#define arch_wants_lazy_mmu_mode arch_wants_lazy_mmu_mode
+static inline bool arch_wants_lazy_mmu_mode(void)
+{
+	return !radix_enabled();
+}
+
 static inline void arch_enter_lazy_mmu_mode(void)
 {
-	if (radix_enabled())
-		return;
 	/*
 	 * apply_to_page_range can call us this preempt enabled when
 	 * operating on kernel page tables.
@@ -46,9 +50,6 @@ static inline void arch_flush_lazy_mmu_mode(void)
 
 static inline void arch_leave_lazy_mmu_mode(void)
 {
-	if (radix_enabled())
-		return;
-
 	arch_flush_lazy_mmu_mode();
 	preempt_enable();
 }
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 718c9c788114..db4f388d2a16 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -261,11 +261,19 @@ static inline int pmd_dirty(pmd_t pmd)
  * currently enabled.
  */
 #ifdef CONFIG_ARCH_LAZY_MMU
+
+#ifndef arch_wants_lazy_mmu_mode
+static inline bool arch_wants_lazy_mmu_mode(void)
+{
+	return true;
+}
+#endif
+
 static inline void lazy_mmu_mode_enable(void)
 {
 	struct lazy_mmu_state *state = &current->lazy_mmu_state;
 
-	if (in_interrupt())
+	if (!arch_wants_lazy_mmu_mode() || in_interrupt())
 		return;
 
 	VM_BUG_ON(state->count == U8_MAX);
@@ -283,7 +291,7 @@ static inline void lazy_mmu_mode_disable(void)
 {
 	struct lazy_mmu_state *state = &current->lazy_mmu_state;
 
-	if (in_interrupt())
+	if (!arch_wants_lazy_mmu_mode() || in_interrupt())
 		return;
 
 	VM_BUG_ON(state->count == 0);
@@ -303,7 +311,7 @@ static inline void lazy_mmu_mode_pause(void)
 {
 	struct lazy_mmu_state *state = &current->lazy_mmu_state;
 
-	if (in_interrupt())
+	if (!arch_wants_lazy_mmu_mode() || in_interrupt())
 		return;
 
 	VM_WARN_ON(state->count == 0 || !state->enabled);
@@ -316,7 +324,7 @@ static inline void lazy_mmu_mode_resume(void)
 {
 	struct lazy_mmu_state *state = &current->lazy_mmu_state;
 
-	if (in_interrupt())
+	if (!arch_wants_lazy_mmu_mode() || in_interrupt())
 		return;
 
 	VM_WARN_ON(state->count == 0 || state->enabled);
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 02/13] x86/xen: simplify flush_lazy_mmu()
  2025-10-15  8:27 ` [PATCH v3 02/13] x86/xen: simplify flush_lazy_mmu() Kevin Brodsky
@ 2025-10-15 16:52   ` Dave Hansen
  2025-10-16  7:32     ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Hansen @ 2025-10-15 16:52 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 10/15/25 01:27, Kevin Brodsky wrote:
> While at it, we can also avoid preempt_disable() if we are not
> in lazy MMU mode - xen_get_lazy_mode() should tolerate preemption.
...
>  static void xen_flush_lazy_mmu(void)
>  {
> -	preempt_disable();
> -
>  	if (xen_get_lazy_mode() == XEN_LAZY_MMU) {
> -		arch_leave_lazy_mmu_mode();
> -		arch_enter_lazy_mmu_mode();
> +		preempt_disable();
> +		xen_mc_flush();
> +		preempt_enable();
>  	}

But xen_get_lazy_mode() does:

	this_cpu_read(xen_lazy_mode);

Couldn't preemption end up doing the 'xen_lazy_mode' read and the
xen_mc_flush() on different CPUs?

That seems like a problem. Is there a reason it's safe?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 02/13] x86/xen: simplify flush_lazy_mmu()
  2025-10-15 16:52   ` Dave Hansen
@ 2025-10-16  7:32     ` Kevin Brodsky
  0 siblings, 0 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-16  7:32 UTC (permalink / raw)
  To: Dave Hansen, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 15/10/2025 18:52, Dave Hansen wrote:
> On 10/15/25 01:27, Kevin Brodsky wrote:
>> While at it, we can also avoid preempt_disable() if we are not
>> in lazy MMU mode - xen_get_lazy_mode() should tolerate preemption.
> ...
>>  static void xen_flush_lazy_mmu(void)
>>  {
>> -	preempt_disable();
>> -
>>  	if (xen_get_lazy_mode() == XEN_LAZY_MMU) {
>> -		arch_leave_lazy_mmu_mode();
>> -		arch_enter_lazy_mmu_mode();
>> +		preempt_disable();
>> +		xen_mc_flush();
>> +		preempt_enable();
>>  	}
> But xen_get_lazy_mode() does:
>
> 	this_cpu_read(xen_lazy_mode);
>
> Couldn't preemption end up doing the 'xen_lazy_mode' read and the
> xen_mc_flush() on different CPUs?
>
> That seems like a problem. Is there a reason it's safe?

You're right, I was thinking in terms of tasks, but xen_mc_flush() does
operate on the current CPU (and it's called when context-switching).
Will restore the original order in v4.
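
i.e. something like this (sketch only of what v4 could look like):

	static void xen_flush_lazy_mmu(void)
	{
		preempt_disable();

		if (xen_get_lazy_mode() == XEN_LAZY_MMU)
			xen_mc_flush();

		preempt_enable();
	}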

- Kevin



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers
  2025-10-15  8:27 ` [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers Kevin Brodsky
@ 2025-10-17 15:54   ` Alexander Gordeev
  2025-10-20 10:32     ` Kevin Brodsky
  2025-10-23 19:52   ` David Hildenbrand
  1 sibling, 1 reply; 58+ messages in thread
From: Alexander Gordeev @ 2025-10-17 15:54 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On Wed, Oct 15, 2025 at 09:27:20AM +0100, Kevin Brodsky wrote:

Hi Kevin,

...
> * lazy_mmu_mode_pause() ... lazy_mmu_mode_resume()
>     This is for situations where the mode is temporarily disabled
>     by first calling pause() and then resume() (e.g. to prevent any
>     batching from occurring in a critical section).
...
> +static inline void lazy_mmu_mode_pause(void)
> +{
> +	arch_leave_lazy_mmu_mode();

I think it should have been arch_pause_lazy_mmu_mode(), which defaults
to arch_leave_lazy_mmu_mode(), as we discussed in v2:

https://lore.kernel.org/linux-mm/d407a381-099b-4ec6-a20e-aeff4f3d750f@arm.com/#t
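
e.g. with a default along these lines in <linux/pgtable.h> (sketch):

	#ifndef arch_pause_lazy_mmu_mode
	static inline void arch_pause_lazy_mmu_mode(void)
	{
		arch_leave_lazy_mmu_mode();
	}
	#endif

and lazy_mmu_mode_pause() calling arch_pause_lazy_mmu_mode() instead.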

> +}
> +
> +static inline void lazy_mmu_mode_resume(void)
> +{
> +	arch_enter_lazy_mmu_mode();
> +}

Thanks!


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 05/13] mm: introduce CONFIG_ARCH_LAZY_MMU
  2025-10-15  8:27 ` [PATCH v3 05/13] mm: introduce CONFIG_ARCH_LAZY_MMU Kevin Brodsky
@ 2025-10-18  9:52   ` Mike Rapoport
  2025-10-20 10:37     ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: Mike Rapoport @ 2025-10-18  9:52 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On Wed, Oct 15, 2025 at 09:27:19AM +0100, Kevin Brodsky wrote:
> Architectures currently opt in for implementing lazy_mmu helpers by
> defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE.
> 
> In preparation for introducing a generic lazy_mmu layer that will
> require storage in task_struct, let's switch to a cleaner approach:
> instead of defining a macro, select a CONFIG option.
> 
> This patch introduces CONFIG_ARCH_LAZY_MMU and has each arch select
> it when it implements lazy_mmu helpers.
> __HAVE_ARCH_ENTER_LAZY_MMU_MODE is removed and <linux/pgtable.h>
> relies on the new CONFIG instead.
> 
> On x86, lazy_mmu helpers are only implemented if PARAVIRT_XXL is
> selected. This creates some complications in arch/x86/boot/, because
> a few files manually undefine PARAVIRT* options. As a result
> <asm/paravirt.h> does not define the lazy_mmu helpers, but this
> breaks the build as <linux/pgtable.h> only defines them if
> !CONFIG_ARCH_LAZY_MMU. There does not seem to be a clean way out of
> this - let's just undefine that new CONFIG too.
> 
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---

...

> @@ -231,7 +231,7 @@ static inline int pmd_dirty(pmd_t pmd)
>   * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
>   * and the mode cannot be used in interrupt context.
>   */
> -#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> +#ifndef CONFIG_ARCH_LAZY_MMU
>  static inline void arch_enter_lazy_mmu_mode(void) {}
>  static inline void arch_leave_lazy_mmu_mode(void) {}
>  static inline void arch_flush_lazy_mmu_mode(void) {}
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0e26f4fc8717..2fdcb42ca1a1 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1372,6 +1372,9 @@ config PT_RECLAIM
>  config FIND_NORMAL_PAGE
>  	def_bool n
>  
> +config ARCH_LAZY_MMU
> +	bool
> +

I think a better name would be ARCH_HAS_LAZY_MMU, and the config option
fits better in arch/Kconfig.
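
e.g. (sketch):

	config ARCH_HAS_LAZY_MMU
		bool

with architectures doing "select ARCH_HAS_LAZY_MMU".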

>  source "mm/damon/Kconfig"
>  
>  endmenu
> -- 
> 2.47.0
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers
  2025-10-17 15:54   ` Alexander Gordeev
@ 2025-10-20 10:32     ` Kevin Brodsky
  0 siblings, 0 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-20 10:32 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
	Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 17/10/2025 17:54, Alexander Gordeev wrote:
> On Wed, Oct 15, 2025 at 09:27:20AM +0100, Kevin Brodsky wrote:
>
> Hi Kevin,
>
> ...
>> * lazy_mmu_mode_pause() ... lazy_mmu_mode_resume()
>>     This is for situations where the mode is temporarily disabled
>>     by first calling pause() and then resume() (e.g. to prevent any
>>     batching from occurring in a critical section).
> ...
>> +static inline void lazy_mmu_mode_pause(void)
>> +{
>> +	arch_leave_lazy_mmu_mode();
> I think it should have been arch_pause_lazy_mmu_mode(), wich defaults
> to  arch_leave_lazy_mmu_mode(), as we discussed in v2:
>
> https://lore.kernel.org/linux-mm/d407a381-099b-4ec6-a20e-aeff4f3d750f@arm.com/#t

See my comment on patch 7 - these new arch callbacks can easily be
introduced later; I don't see much point in introducing them now if they
default to leave/enter on every architecture.

- Kevin


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 05/13] mm: introduce CONFIG_ARCH_LAZY_MMU
  2025-10-18  9:52   ` Mike Rapoport
@ 2025-10-20 10:37     ` Kevin Brodsky
  2025-10-23 19:38       ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-20 10:37 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
	H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Michael Ellerman, Michal Hocko, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 18/10/2025 11:52, Mike Rapoport wrote:
>> @@ -231,7 +231,7 @@ static inline int pmd_dirty(pmd_t pmd)
>>   * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
>>   * and the mode cannot be used in interrupt context.
>>   */
>> -#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
>> +#ifndef CONFIG_ARCH_LAZY_MMU
>>  static inline void arch_enter_lazy_mmu_mode(void) {}
>>  static inline void arch_leave_lazy_mmu_mode(void) {}
>>  static inline void arch_flush_lazy_mmu_mode(void) {}
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 0e26f4fc8717..2fdcb42ca1a1 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1372,6 +1372,9 @@ config PT_RECLAIM
>>  config FIND_NORMAL_PAGE
>>  	def_bool n
>>  
>> +config ARCH_LAZY_MMU
>> +	bool
>> +
> I think a better name would be ARCH_HAS_LAZY_MMU and the config option fits
> better to arch/Kconfig.

Sounds fine by me - I'm inclined to make it slightly longer still,
ARCH_HAS_LAZY_MMU_MODE, to avoid making "LAZY_MMU" sound like some HW
feature.

- Kevin


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 03/13] powerpc/mm: implement arch_flush_lazy_mmu_mode()
  2025-10-15  8:27 ` [PATCH v3 03/13] powerpc/mm: implement arch_flush_lazy_mmu_mode() Kevin Brodsky
@ 2025-10-23 19:36   ` David Hildenbrand
  2025-10-24 12:09     ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-23 19:36 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 15.10.25 10:27, Kevin Brodsky wrote:
> Upcoming changes to the lazy_mmu API will cause
> arch_flush_lazy_mmu_mode() to be called when leaving a nested
> lazy_mmu section.
> 
> Move the relevant logic from arch_leave_lazy_mmu_mode() to
> arch_flush_lazy_mmu_mode() and have the former call the latter.
> 
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
>   .../powerpc/include/asm/book3s/64/tlbflush-hash.h | 15 +++++++++++----
>   1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> index 146287d9580f..7704dbe8e88d 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> @@ -41,6 +41,16 @@ static inline void arch_enter_lazy_mmu_mode(void)
>   	batch->active = 1;
>   }
>   
> +static inline void arch_flush_lazy_mmu_mode(void)
> +{
> +	struct ppc64_tlb_batch *batch;
> +
> +	batch = this_cpu_ptr(&ppc64_tlb_batch);

The downside is the double this_cpu_ptr() now on the 
arch_leave_lazy_mmu_mode() path.

You could add a helper function called by both, or simply leave
arch_leave_lazy_mmu_mode() alone and replicate the two statements here
in arch_flush_lazy_mmu_mode().

I would do just that :)
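
i.e. something like (untested sketch, radix_enabled() handling elided):

	static inline void arch_flush_lazy_mmu_mode(void)
	{
		struct ppc64_tlb_batch *batch = this_cpu_ptr(&ppc64_tlb_batch);

		if (batch->index)
			__flush_tlb_pending(batch);
	}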

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 04/13] sparc/mm: implement arch_flush_lazy_mmu_mode()
  2025-10-15  8:27 ` [PATCH v3 04/13] sparc/mm: " Kevin Brodsky
@ 2025-10-23 19:37   ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-10-23 19:37 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 15.10.25 10:27, Kevin Brodsky wrote:
> Upcoming changes to the lazy_mmu API will cause
> arch_flush_lazy_mmu_mode() to be called when leaving a nested
> lazy_mmu section.
> 
> Move the relevant logic from arch_leave_lazy_mmu_mode() to
> arch_flush_lazy_mmu_mode() and have the former call the latter.
> 
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
>   arch/sparc/include/asm/tlbflush_64.h | 2 +-
>   arch/sparc/mm/tlb.c                  | 9 ++++++++-
>   2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
> index 8b8cdaa69272..925bb5d7a4e1 100644
> --- a/arch/sparc/include/asm/tlbflush_64.h
> +++ b/arch/sparc/include/asm/tlbflush_64.h
> @@ -43,8 +43,8 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end);
>   
>   void flush_tlb_pending(void);
>   void arch_enter_lazy_mmu_mode(void);
> +void arch_flush_lazy_mmu_mode(void);
>   void arch_leave_lazy_mmu_mode(void);
> -#define arch_flush_lazy_mmu_mode()      do {} while (0)
>   
>   /* Local cpu only.  */
>   void __flush_tlb_all(void);
> diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
> index a35ddcca5e76..7b5dfcdb1243 100644
> --- a/arch/sparc/mm/tlb.c
> +++ b/arch/sparc/mm/tlb.c
> @@ -59,12 +59,19 @@ void arch_enter_lazy_mmu_mode(void)
>   	tb->active = 1;
>   }
>   
> -void arch_leave_lazy_mmu_mode(void)
> +void arch_flush_lazy_mmu_mode(void)
>   {
>   	struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
>   
>   	if (tb->tlb_nr)
>   		flush_tlb_pending();
> +}
> +
> +void arch_leave_lazy_mmu_mode(void)
> +{
> +	struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
> +

Just like for ppc, this now does a double this_cpu_ptr(). I'd similarly
just replicate the two statements.

> +	arch_flush_lazy_mmu_mode();
>   	tb->active = 0;
>   	preempt_enable();
>   }


-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 05/13] mm: introduce CONFIG_ARCH_LAZY_MMU
  2025-10-20 10:37     ` Kevin Brodsky
@ 2025-10-23 19:38       ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-10-23 19:38 UTC (permalink / raw)
  To: Kevin Brodsky, Mike Rapoport
  Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson,
	Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
	Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
	Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
	xen-devel, x86

On 20.10.25 12:37, Kevin Brodsky wrote:
> On 18/10/2025 11:52, Mike Rapoport wrote:
>>> @@ -231,7 +231,7 @@ static inline int pmd_dirty(pmd_t pmd)
>>>    * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
>>>    * and the mode cannot be used in interrupt context.
>>>    */
>>> -#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
>>> +#ifndef CONFIG_ARCH_LAZY_MMU
>>>   static inline void arch_enter_lazy_mmu_mode(void) {}
>>>   static inline void arch_leave_lazy_mmu_mode(void) {}
>>>   static inline void arch_flush_lazy_mmu_mode(void) {}
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 0e26f4fc8717..2fdcb42ca1a1 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -1372,6 +1372,9 @@ config PT_RECLAIM
>>>   config FIND_NORMAL_PAGE
>>>   	def_bool n
>>>   
>>> +config ARCH_LAZY_MMU
>>> +	bool
>>> +
>> I think a better name would be ARCH_HAS_LAZY_MMU and the config option fits
>> better to arch/Kconfig.
> 
> Sounds fine by me - I'm inclined to make it slightly longer still,
> ARCH_HAS_LAZY_MMU_MODE, to avoid making "LAZY_MMU" sound like some HW
> feature.

LGTM.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers
  2025-10-15  8:27 ` [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers Kevin Brodsky
  2025-10-17 15:54   ` Alexander Gordeev
@ 2025-10-23 19:52   ` David Hildenbrand
  2025-10-24 12:13     ` Kevin Brodsky
  1 sibling, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-23 19:52 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 15.10.25 10:27, Kevin Brodsky wrote:
> The implementation of the lazy MMU mode is currently entirely
> arch-specific; core code directly calls arch helpers:
> arch_{enter,leave}_lazy_mmu_mode().
> 
> We are about to introduce support for nested lazy MMU sections.
> As things stand we'd have to duplicate that logic in every arch
> implementing lazy_mmu - adding to a fair amount of logic
> already duplicated across lazy_mmu implementations.
> 
> This patch therefore introduces a new generic layer that calls the
> existing arch_* helpers. Two pair of calls are introduced:
> 
> * lazy_mmu_mode_enable() ... lazy_mmu_mode_disable()
>      This is the standard case where the mode is enabled for a given
>      block of code by surrounding it with enable() and disable()
>      calls.
> 
> * lazy_mmu_mode_pause() ... lazy_mmu_mode_resume()
>      This is for situations where the mode is temporarily disabled
>      by first calling pause() and then resume() (e.g. to prevent any
>      batching from occurring in a critical section).
> 
> The documentation in <linux/pgtable.h> will be updated in a
> subsequent patch.
> 
> No functional change should be introduced at this stage.
> The implementation of enable()/resume() and disable()/pause() is
> currently identical, but nesting support will change that.
> 
> Most of the call sites have been updated using the following
> Coccinelle script:
> 
> @@
> @@
> {
> ...
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> ...
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> ...
> }
> 
> @@
> @@
> {
> ...
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_pause();
> ...
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_resume();
> ...
> }
> 
> A couple of cases are noteworthy:
> 
> * madvise_*_pte_range() call arch_leave() in multiple paths, some
>    followed by an immediate exit/rescheduling and some followed by a
>    conditional exit. These functions assume that they are called
>    with lazy MMU disabled and we cannot simply use pause()/resume()
>    to address that. This patch leaves the situation unchanged by
>    calling enable()/disable() in all cases.

I'm confused; the function simply:

(a) enables lazy mmu
(b) does something on the page table
(c) disables lazy mmu
(d) does something expensive (split folio -> take sleepable locks,
     flushes tlb)
(e) goes back to (a)

Why would we use enable/disable instead?

> 
> * x86/Xen is currently the only case where explicit handling is
>    required for lazy MMU when context-switching. This is purely an
>    implementation detail and using the generic lazy_mmu_mode_*
>    functions would cause trouble when nesting support is introduced,
>    because the generic functions must be called from the current task.
>    For that reason we still use arch_leave() and arch_enter() there.

How does this interact with patch #11?

> 
> Note: x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few
> places, but only defines it if PARAVIRT_XXL is selected, and we are
> removing the fallback in <linux/pgtable.h>. Add a new fallback
> definition to <asm/pgtable.h> to keep things building.

I can see a call in __kernel_map_pages() and 
arch_kmap_local_post_map()/arch_kmap_local_post_unmap().

I guess that is ... harmless/irrelevant in the context of this series?

[...]


-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 07/13] mm: enable lazy_mmu sections to nest
  2025-10-15  8:27 ` [PATCH v3 07/13] mm: enable lazy_mmu sections to nest Kevin Brodsky
@ 2025-10-23 20:00   ` David Hildenbrand
  2025-10-24 12:16     ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-23 20:00 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

[...]


> 
> In summary (count/enabled represent the values *after* the call):
> 
> lazy_mmu_mode_enable()		-> arch_enter()	    count=1 enabled=1
>      lazy_mmu_mode_enable()	-> ø		    count=2 enabled=1
> 	lazy_mmu_mode_pause()	-> arch_leave()     count=2 enabled=0

The arch_leave..() is expected to do a flush itself, correct?

> 	lazy_mmu_mode_resume()	-> arch_enter()     count=2 enabled=1
>      lazy_mmu_mode_disable()	-> arch_flush()     count=1 enabled=1
> lazy_mmu_mode_disable()		-> arch_leave()     count=0 enabled=0
> 
> Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
> headers included by <linux/pgtable.h> to use it.
> 
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> Alexander Gordeev suggested that a future optimisation may need
> lazy_mmu_mode_{pause,resume}() to call distinct arch callbacks [1]. For
> now arch_{leave,enter}() are called directly, but introducing new arch
> callbacks should be straightforward.
> 
> [1] https://lore.kernel.org/all/5a0818bb-75d4-47df-925c-0102f7d598f4-agordeev@linux.ibm.com/
> ---

[...]

>   
> +struct lazy_mmu_state {
> +	u8 count;

I would have called this "enabled_count" or "nesting_level".

> +	bool enabled;

"enabled" is a bit confusing when we have lazy_mmu_mode_enable().

I'd have called this "active".

> +};
> +
>   #endif /* _LINUX_MM_TYPES_TASK_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 194b2c3e7576..269225a733de 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -228,28 +228,89 @@ static inline int pmd_dirty(pmd_t pmd)
>    * of the lazy mode. So the implementation must assume preemption may be enabled
>    * and cpu migration is possible; it must take steps to be robust against this.
>    * (In practice, for user PTE updates, the appropriate page table lock(s) are
> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> - * and the mode cannot be used in interrupt context.
> + * held, but for kernel PTE updates, no lock is held). The mode cannot be used
> + * in interrupt context.
> + *
> + * The lazy MMU mode is enabled for a given block of code using:
> + *
> + *   lazy_mmu_mode_enable();
> + *   <code>
> + *   lazy_mmu_mode_disable();
> + *
> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
> + * A nested call to enable() has no functional effect; however disable() causes
> + * any batched architectural state to be flushed regardless of nesting. After a
> + * call to disable(), the caller can therefore rely on all previous page table
> + * modifications to have taken effect, but the lazy MMU mode may still be
> + * enabled.
> + *
> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
> + * This can be done using:
> + *
> + *   lazy_mmu_mode_pause();
> + *   <code>
> + *   lazy_mmu_mode_resume();
> + *
> + * This sequence must only be used if the lazy MMU mode is already enabled.
> + * pause() ensures that the mode is exited regardless of the nesting level;
> + * resume() re-enters the mode at the same nesting level. <code> must not modify
> + * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
> + * helpers).
> + *
> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> + * currently enabled.
>    */
>   #ifdef CONFIG_ARCH_LAZY_MMU
>   static inline void lazy_mmu_mode_enable(void)
>   {
> -	arch_enter_lazy_mmu_mode();
> +	struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> +	VM_BUG_ON(state->count == U8_MAX);

No VM_BUG_ON() please.

> +	/* enable() must not be called while paused */
> +	VM_WARN_ON(state->count > 0 && !state->enabled);
> +
> +	if (state->count == 0) {
> +		arch_enter_lazy_mmu_mode();
> +		state->enabled = true;
> +	}
> +	++state->count;

Can do

if (state->count++ == 0) {

>   }
>   
>   static inline void lazy_mmu_mode_disable(void)
>   {
> -	arch_leave_lazy_mmu_mode();
> +	struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> +	VM_BUG_ON(state->count == 0);

Dito.

> +	VM_WARN_ON(!state->enabled);
> +
> +	--state->count;
> +	if (state->count == 0) {

Can do

if (--state->count == 0) {

> +		state->enabled = false;
> +		arch_leave_lazy_mmu_mode();
> +	} else {
> +		/* Exiting a nested section */
> +		arch_flush_lazy_mmu_mode();
> +	}
>   }
>   
>   static inline void lazy_mmu_mode_pause(void)
>   {
> +	struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> +	VM_WARN_ON(state->count == 0 || !state->enabled);
> +
> +	state->enabled = false;
>   	arch_leave_lazy_mmu_mode();
>   }
>   
>   static inline void lazy_mmu_mode_resume(void)
>   {
> +	struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> +	VM_WARN_ON(state->count == 0 || state->enabled);
> +
>   	arch_enter_lazy_mmu_mode();
> +	state->enabled = true;
>   }
>   #else
>   static inline void lazy_mmu_mode_enable(void) {}
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index cbb7340c5866..2862d8bf2160 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1441,6 +1441,10 @@ struct task_struct {
>   
>   	struct page_frag		task_frag;
>   
> +#ifdef CONFIG_ARCH_LAZY_MMU
> +	struct lazy_mmu_state		lazy_mmu_state;
> +#endif
> +
>   #ifdef CONFIG_TASK_DELAY_ACCT
>   	struct task_delay_info		*delays;
>   #endif
> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
>   	return task_index_to_char(task_state_index(tsk));
>   }
>   
> +#ifdef CONFIG_ARCH_LAZY_MMU
> +static inline bool in_lazy_mmu_mode(void)

So these functions will reveal the actual arch state, not whether
enable() was called.

As I can see in later patches, in interrupt context they also
return "not in lazy mmu mode".

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 09/13] powerpc/mm: replace batch->active with in_lazy_mmu_mode()
  2025-10-15  8:27 ` [PATCH v3 09/13] powerpc/mm: replace batch->active " Kevin Brodsky
@ 2025-10-23 20:02   ` David Hildenbrand
  2025-10-24 12:16     ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-23 20:02 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 15.10.25 10:27, Kevin Brodsky wrote:
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode. As a result we no longer need to track whether the per-CPU TLB
> batch struct is active - we know it is if in_lazy_mmu_mode() returns
> true.

It's worth adding that disabling preemption while the mode is enabled
makes sure that we cannot reschedule while in lazy MMU mode, i.e. while
the per-CPU TLB batch structure is active.
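
As a sketch of that dependency (simplified, not the actual powerpc code;
it assumes arch_enter/leave keep disabling/enabling preemption as they do
today, and the helper name below is made up):

static inline void arch_enter_lazy_mmu_mode(void)
{
	/* held until arch_leave_lazy_mmu_mode(): no reschedule/migration */
	preempt_disable();
	/* ... activate this CPU's ppc64_tlb_batch ... */
}

static inline void arch_leave_lazy_mmu_mode(void)
{
	/* ... flush this CPU's ppc64_tlb_batch ... */
	preempt_enable();
}

/* at a batching site, instead of checking batch->active: */
static inline bool tlb_batch_active(void)
{
	/*
	 * If the current task is in lazy MMU mode, it cannot have been
	 * scheduled out since arch_enter_lazy_mmu_mode(), so this CPU's
	 * ppc64_tlb_batch is the one it activated.
	 */
	return in_lazy_mmu_mode();
}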


-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 10/13] sparc/mm: replace batch->active with in_lazy_mmu_mode()
  2025-10-15  8:27 ` [PATCH v3 10/13] sparc/mm: " Kevin Brodsky
@ 2025-10-23 20:03   ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-10-23 20:03 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 15.10.25 10:27, Kevin Brodsky wrote:
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode. As a result we no longer need to track whether the per-CPU TLB
> batch struct is active - we know it is if in_lazy_mmu_mode() returns
> true.

Same here, document the dependency on disabled preemption.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-15  8:27 ` [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
@ 2025-10-23 20:06   ` David Hildenbrand
  2025-10-24 14:47     ` David Woodhouse
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-23 20:06 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 15.10.25 10:27, Kevin Brodsky wrote:
> We currently set a TIF flag when scheduling out a task that is in
> lazy MMU mode, in order to restore it when the task is scheduled
> again.
> 
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode in task_struct::lazy_mmu_state. We can therefore check that
> state when switching to the new task, instead of using a separate
> TIF flag.
> 
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---


Looks ok to me, but I hope we get some confirmation from x86 / xen folks.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 12/13] mm: bail out of lazy_mmu_mode_* in interrupt context
  2025-10-15  8:27 ` [PATCH v3 12/13] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
@ 2025-10-23 20:08   ` David Hildenbrand
  2025-10-24 12:17     ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-23 20:08 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 15.10.25 10:27, Kevin Brodsky wrote:
> The lazy MMU mode cannot be used in interrupt context. This is
> documented in <linux/pgtable.h>, but isn't consistently handled
> across architectures.
> 
> arm64 ensures that calls to lazy_mmu_mode_* have no effect in
> interrupt context, because such calls do occur in certain
> configurations - see commit b81c688426a9 ("arm64/mm: Disable barrier
> batching in interrupt contexts"). Other architectures do not check
> this situation, most likely because it hasn't occurred so far.
> 
> Both arm64 and x86/Xen also ensure that any lazy MMU optimisation is
> disabled while in interrupt mode (see queue_pte_barriers() and
> xen_get_lazy_mode() respectively).
> 
> Let's handle this in the new generic lazy_mmu layer, in the same
> fashion as arm64: bail out of lazy_mmu_mode_* if in_interrupt(), and
> have in_lazy_mmu_mode() return false to disable any optimisation.
> Also remove the arm64 handling that is now redundant; x86/Xen has
> its own internal tracking so it is left unchanged.
> 
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
>   arch/arm64/include/asm/pgtable.h | 17 +----------------
>   include/linux/pgtable.h          | 16 ++++++++++++++--
>   include/linux/sched.h            |  3 +++
>   3 files changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 944e512767db..a37f417c30be 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -62,37 +62,22 @@ static inline void emit_pte_barriers(void)
>   
>   static inline void queue_pte_barriers(void)
>   {
> -	if (in_interrupt()) {
> -		emit_pte_barriers();
> -		return;
> -	}
> -

That took me a while. I guess this works because in_lazy_mmu_mode() == 0 
in interrupt context, so we keep calling emit_pte_barriers?
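
(Presumably the remaining logic boils down to something like the sketch
below - not quoting the patch, and assuming the TIF_LAZY_MMU_PENDING
deferral is kept as-is:)

static inline void queue_pte_barriers(void)
{
	/*
	 * in_lazy_mmu_mode() returns false in interrupt context (and
	 * whenever the mode is not enabled), so barriers are still
	 * emitted immediately there, just like with the removed
	 * in_interrupt() check.
	 */
	if (in_lazy_mmu_mode())
		set_thread_flag(TIF_LAZY_MMU_PENDING);
	else
		emit_pte_barriers();
}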


-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 13/13] mm: introduce arch_wants_lazy_mmu_mode()
  2025-10-15  8:27 ` [PATCH v3 13/13] mm: introduce arch_wants_lazy_mmu_mode() Kevin Brodsky
@ 2025-10-23 20:10   ` David Hildenbrand
  2025-10-24 12:17     ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-23 20:10 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 15.10.25 10:27, Kevin Brodsky wrote:
> powerpc decides at runtime whether the lazy MMU mode should be used.
> 
> To avoid the overhead associated with managing
> task_struct::lazy_mmu_state if the mode isn't used, introduce
> arch_wants_lazy_mmu_mode() and bail out of lazy_mmu_mode_* if it
> returns false. Add a default definition returning true, and an
> appropriate implementation for powerpc.
> 
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> This patch seemed like a good idea to start with, but now I'm not so
> sure that the churn added to the generic layer is worth it.

Exactly my thoughts :)

I think we need evidence that this is really worth it for optimizing out 
basically a counter update on powerpc.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 03/13] powerpc/mm: implement arch_flush_lazy_mmu_mode()
  2025-10-23 19:36   ` David Hildenbrand
@ 2025-10-24 12:09     ` Kevin Brodsky
  2025-10-24 14:42       ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-24 12:09 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 23/10/2025 21:36, David Hildenbrand wrote:
> On 15.10.25 10:27, Kevin Brodsky wrote:
>> [...]
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>> b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>> index 146287d9580f..7704dbe8e88d 100644
>> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>> @@ -41,6 +41,16 @@ static inline void arch_enter_lazy_mmu_mode(void)
>>       batch->active = 1;
>>   }
>>   +static inline void arch_flush_lazy_mmu_mode(void)
>> +{
>> +    struct ppc64_tlb_batch *batch;
>> +
>> +    batch = this_cpu_ptr(&ppc64_tlb_batch);
>
> The downside is the double this_cpu_ptr() now on the
> arch_leave_lazy_mmu_mode() path.

This is only temporary, patch 9 removes it from arch_enter(). I don't
think having a redundant this_cpu_ptr() for a few commits is really a
concern?

Same idea for patch 4/10.

- Kevin


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers
  2025-10-23 19:52   ` David Hildenbrand
@ 2025-10-24 12:13     ` Kevin Brodsky
  2025-10-24 13:27       ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-24 12:13 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 23/10/2025 21:52, David Hildenbrand wrote:
> On 15.10.25 10:27, Kevin Brodsky wrote:
>> [...]
>>
>> * madvise_*_pte_range() call arch_leave() in multiple paths, some
>>    followed by an immediate exit/rescheduling and some followed by a
>>    conditional exit. These functions assume that they are called
>>    with lazy MMU disabled and we cannot simply use pause()/resume()
>>    to address that. This patch leaves the situation unchanged by
>>    calling enable()/disable() in all cases.
>
> I'm confused, the function simply does
>
> (a) enables lazy mmu
> (b) does something on the page table
> (c) disables lazy mmu
> (d) does something expensive (split folio -> take sleepable locks,
>     flushes tlb)
> (e) go to (a)

That step is conditional: we exit right away if pte_offset_map_lock()
fails. The fundamental issue is that pause() must always be matched with
resume(), but as those functions are structured today there is nowhere a
pause() could be placed such that it is always matched by a resume().

Alternatively it should be possible to pause(), unconditionally resume()
after the expensive operations are done and then leave() right away in
case of failure. It requires restructuring and might look a bit strange,
but can be done if you think it's justified.
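
In pseudo-kernel-C, that alternative would be roughly (the helper names
and the failure condition are made up for illustration):

	lazy_mmu_mode_pause();
	do_expensive_work();		/* split folio, flush TLB, cond_resched()... */
	lazy_mmu_mode_resume();		/* unconditionally pairs with the pause() */

	if (!retake_pte_map_lock()) {	/* hypothetical failure path */
		lazy_mmu_mode_disable();	/* leave the section for real */
		return 0;
	}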

>
> Why would we use enable/disable instead?
>
>>
>> * x86/Xen is currently the only case where explicit handling is
>>    required for lazy MMU when context-switching. This is purely an
>>    implementation detail and using the generic lazy_mmu_mode_*
>>    functions would cause trouble when nesting support is introduced,
>>    because the generic functions must be called from the current task.
>>    For that reason we still use arch_leave() and arch_enter() there.
>
> How does this interact with patch #11? 

It is a requirement for patch 11, in fact. If we called disable() when
switching out a task, then lazy_mmu_state.enabled would (most likely) be
false when scheduling it again.

By calling the arch_* helpers when context-switching, we ensure
lazy_mmu_state remains unchanged. This is consistent with what happens
on all other architectures (which don't do anything about lazy_mmu when
context-switching). lazy_mmu_state is the lazy MMU status *when the task
is scheduled*, and should be preserved on a context-switch.
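
Conceptually (sketch only, not the actual Xen context-switch hooks):

	/* scheduling out 'prev' (still the task that entered the mode) */
	if (prev->lazy_mmu_state.enabled)
		arch_leave_lazy_mmu_mode();	/* flush, but ->enabled stays set */

	/* scheduling 'next' back in */
	if (next->lazy_mmu_state.enabled)
		arch_enter_lazy_mmu_mode();	/* restore batching for 'next' */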

>
>>
>> Note: x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few
>> places, but only defines it if PARAVIRT_XXL is selected, and we are
>> removing the fallback in <linux/pgtable.h>. Add a new fallback
>> definition to <asm/pgtable.h> to keep things building.
>
> I can see a call in __kernel_map_pages() and
> arch_kmap_local_post_map()/arch_kmap_local_post_unmap().
>
> I guess that is ... harmless/irrelevant in the context of this series?

It should be. arch_flush_lazy_mmu_mode() was only used by x86 before
this series; we're adding new calls to it from the generic layer, but
existing x86 calls shouldn't be affected.

- Kevin


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 07/13] mm: enable lazy_mmu sections to nest
  2025-10-23 20:00   ` David Hildenbrand
@ 2025-10-24 12:16     ` Kevin Brodsky
  2025-10-24 13:23       ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-24 12:16 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 23/10/2025 22:00, David Hildenbrand wrote:
> [...]
>
>
>>
>> In summary (count/enabled represent the values *after* the call):
>>
>> lazy_mmu_mode_enable()           -> arch_enter()  count=1 enabled=1
>>     lazy_mmu_mode_enable()       -> ø             count=2 enabled=1
>>     lazy_mmu_mode_pause()        -> arch_leave()  count=2 enabled=0
>
> The arch_leave..() is expected to do a flush itself, correct?

Correct, that's unchanged.

>
>>     lazy_mmu_mode_resume()       -> arch_enter()  count=2 enabled=1
>>     lazy_mmu_mode_disable()      -> arch_flush()  count=1 enabled=1
>> lazy_mmu_mode_disable()          -> arch_leave()  count=0 enabled=0
>>
>> Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
>> headers included by <linux/pgtable.h> to use it.
>>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>> Alexander Gordeev suggested that a future optimisation may need
>> lazy_mmu_mode_{pause,resume}() to call distinct arch callbacks [1]. For
>> now arch_{leave,enter}() are called directly, but introducing new arch
>> callbacks should be straightforward.
>>
>> [1]
>> https://lore.kernel.org/all/5a0818bb-75d4-47df-925c-0102f7d598f4-agordeev@linux.ibm.com/
>> ---
>
> [...]
>
>>   +struct lazy_mmu_state {
>> +    u8 count;
>
> I would have called this "enabled_count" or "nesting_level".

Might as well be explicit and say nesting_level, yes :)

>
>> +    bool enabled;
>
> "enabled" is a bit confusing when we have lazy_mmu_mode_enable().

Agreed, hadn't realised that.

> I'd have called this "active".

Sounds good, that also matches batch->active on powerpc/sparc.

>
>> +};
>> +
>>   #endif /* _LINUX_MM_TYPES_TASK_H */
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 194b2c3e7576..269225a733de 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -228,28 +228,89 @@ static inline int pmd_dirty(pmd_t pmd)
>>    * of the lazy mode. So the implementation must assume preemption
>> may be enabled
>>    * and cpu migration is possible; it must take steps to be robust
>> against this.
>>    * (In practice, for user PTE updates, the appropriate page table
>> lock(s) are
>> - * held, but for kernel PTE updates, no lock is held). Nesting is
>> not permitted
>> - * and the mode cannot be used in interrupt context.
>> + * held, but for kernel PTE updates, no lock is held). The mode
>> cannot be used
>> + * in interrupt context.
>> + *
>> + * The lazy MMU mode is enabled for a given block of code using:
>> + *
>> + *   lazy_mmu_mode_enable();
>> + *   <code>
>> + *   lazy_mmu_mode_disable();
>> + *
>> + * Nesting is permitted: <code> may itself use an enable()/disable()
>> pair.
>> + * A nested call to enable() has no functional effect; however
>> disable() causes
>> + * any batched architectural state to be flushed regardless of
>> nesting. After a
>> + * call to disable(), the caller can therefore rely on all previous
>> page table
>> + * modifications to have taken effect, but the lazy MMU mode may
>> still be
>> + * enabled.
>> + *
>> + * In certain cases, it may be desirable to temporarily pause the
>> lazy MMU mode.
>> + * This can be done using:
>> + *
>> + *   lazy_mmu_mode_pause();
>> + *   <code>
>> + *   lazy_mmu_mode_resume();
>> + *
>> + * This sequence must only be used if the lazy MMU mode is already
>> enabled.
>> + * pause() ensures that the mode is exited regardless of the nesting
>> level;
>> + * resume() re-enters the mode at the same nesting level. <code>
>> must not modify
>> + * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
>> + * helpers).
>> + *
>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>> + * currently enabled.
>>    */
>>   #ifdef CONFIG_ARCH_LAZY_MMU
>>   static inline void lazy_mmu_mode_enable(void)
>>   {
>> -    arch_enter_lazy_mmu_mode();
>> +    struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> +    VM_BUG_ON(state->count == U8_MAX);
>
> No VM_BUG_ON() please.

I did wonder if this would be acceptable!

What should we do in case of underflow/overflow then? Saturate or just
let it wrap around? If an overflow occurs we're probably in some
infinite recursion and we'll crash anyway, but an underflow is likely
due to a double disable() and saturating would probably allow us to
recover.

>
>> +    /* enable() must not be called while paused */
>> +    VM_WARN_ON(state->count > 0 && !state->enabled);
>> +
>> +    if (state->count == 0) {
>> +        arch_enter_lazy_mmu_mode();
>> +        state->enabled = true;
>> +    }
>> +    ++state->count;
>
> Can do
>
> if (state->count++ == 0) {

My idea here was to have exactly the reverse order between enable() and
disable(), so that arch_enter() is called before lazy_mmu_state is
updated, and arch_leave() afterwards. arch_* probably shouldn't rely on
this (or care), but I liked the symmetry.

>
>>   }
>>     static inline void lazy_mmu_mode_disable(void)
>>   {
>> -    arch_leave_lazy_mmu_mode();
>> +    struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> +    VM_BUG_ON(state->count == 0);
>
> Dito.
>
>> +    VM_WARN_ON(!state->enabled);
>> +
>> +    --state->count;
>> +    if (state->count == 0) {
>
> Can do
>
> if (--state->count == 0) {
>
>> +        state->enabled = false;
>> +        arch_leave_lazy_mmu_mode();
>> +    } else {
>> +        /* Exiting a nested section */
>> +        arch_flush_lazy_mmu_mode();
>> +    }
>>   }
>>     static inline void lazy_mmu_mode_pause(void)
>>   {
>> +    struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> +    VM_WARN_ON(state->count == 0 || !state->enabled);
>> +
>> +    state->enabled = false;
>>       arch_leave_lazy_mmu_mode();
>>   }
>>     static inline void lazy_mmu_mode_resume(void)
>>   {
>> +    struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> +    VM_WARN_ON(state->count == 0 || state->enabled);
>> +
>>       arch_enter_lazy_mmu_mode();
>> +    state->enabled = true;
>>   }
>>   #else
>>   static inline void lazy_mmu_mode_enable(void) {}
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index cbb7340c5866..2862d8bf2160 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1441,6 +1441,10 @@ struct task_struct {
>>         struct page_frag        task_frag;
>>   +#ifdef CONFIG_ARCH_LAZY_MMU
>> +    struct lazy_mmu_state        lazy_mmu_state;
>> +#endif
>> +
>>   #ifdef CONFIG_TASK_DELAY_ACCT
>>       struct task_delay_info        *delays;
>>   #endif
>> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct
>> task_struct *tsk)
>>       return task_index_to_char(task_state_index(tsk));
>>   }
>>   +#ifdef CONFIG_ARCH_LAZY_MMU
>> +static inline bool in_lazy_mmu_mode(void)
>
> So these functions will reveal the actual arch state, not whether
> enable() was called.
>
> As I can see in later patches, in interrupt context they also
> return "not in lazy mmu mode".

Yes - the idea is that a task is in lazy MMU mode if it enabled it and
is in process context. The mode is never enabled in interrupt context.
This has always been the intention, but it wasn't formalised until patch
12 (except on arm64).

- Kevin


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 09/13] powerpc/mm: replace batch->active with in_lazy_mmu_mode()
  2025-10-23 20:02   ` David Hildenbrand
@ 2025-10-24 12:16     ` Kevin Brodsky
  0 siblings, 0 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-24 12:16 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 23/10/2025 22:02, David Hildenbrand wrote:
> On 15.10.25 10:27, Kevin Brodsky wrote:
>> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
>> mode. As a result we no longer need to track whether the per-CPU TLB
>> batch struct is active - we know it is if in_lazy_mmu_mode() returns
>> true.
>
> It's worth adding that disabling preemption while the mode is enabled
> makes sure that we cannot reschedule while in lazy MMU mode, i.e. while
> the per-CPU TLB batch structure is active.

Yes good point, otherwise this change doesn't make sense. I'll add that.

- Kevin


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 12/13] mm: bail out of lazy_mmu_mode_* in interrupt context
  2025-10-23 20:08   ` David Hildenbrand
@ 2025-10-24 12:17     ` Kevin Brodsky
  0 siblings, 0 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-24 12:17 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 23/10/2025 22:08, David Hildenbrand wrote:
> On 15.10.25 10:27, Kevin Brodsky wrote:
>> The lazy MMU mode cannot be used in interrupt context. This is
>> documented in <linux/pgtable.h>, but isn't consistently handled
>> across architectures.
>>
>> arm64 ensures that calls to lazy_mmu_mode_* have no effect in
>> interrupt context, because such calls do occur in certain
>> configurations - see commit b81c688426a9 ("arm64/mm: Disable barrier
>> batching in interrupt contexts"). Other architectures do not check
>> this situation, most likely because it hasn't occurred so far.
>>
>> Both arm64 and x86/Xen also ensure that any lazy MMU optimisation is
>> disabled while in interrupt mode (see queue_pte_barriers() and
>> xen_get_lazy_mode() respectively).
>>
>> Let's handle this in the new generic lazy_mmu layer, in the same
>> fashion as arm64: bail out of lazy_mmu_mode_* if in_interrupt(), and
>> have in_lazy_mmu_mode() return false to disable any optimisation.
>> Also remove the arm64 handling that is now redundant; x86/Xen has
>> its own internal tracking so it is left unchanged.
>>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>>   arch/arm64/include/asm/pgtable.h | 17 +----------------
>>   include/linux/pgtable.h          | 16 ++++++++++++++--
>>   include/linux/sched.h            |  3 +++
>>   3 files changed, 18 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h
>> b/arch/arm64/include/asm/pgtable.h
>> index 944e512767db..a37f417c30be 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -62,37 +62,22 @@ static inline void emit_pte_barriers(void)
>>     static inline void queue_pte_barriers(void)
>>   {
>> -    if (in_interrupt()) {
>> -        emit_pte_barriers();
>> -        return;
>> -    }
>> -
>
> That took me a while. I guess this works because in_lazy_mmu_mode() ==
> 0 in interrupt context, so we keep calling emit_pte_barriers?

Yes exactly.

- Kevin


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 13/13] mm: introduce arch_wants_lazy_mmu_mode()
  2025-10-23 20:10   ` David Hildenbrand
@ 2025-10-24 12:17     ` Kevin Brodsky
  0 siblings, 0 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-24 12:17 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 23/10/2025 22:10, David Hildenbrand wrote:
> On 15.10.25 10:27, Kevin Brodsky wrote:
>> powerpc decides at runtime whether the lazy MMU mode should be used.
>>
>> To avoid the overhead associated with managing
>> task_struct::lazy_mmu_state if the mode isn't used, introduce
>> arch_wants_lazy_mmu_mode() and bail out of lazy_mmu_mode_* if it
>> returns false. Add a default definition returning true, and an
>> appropriate implementation for powerpc.
>>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>> This patch seemed like a good idea to start with, but now I'm not so
>> sure that the churn added to the generic layer is worth it.
>
> Exactly my thoughts :)
>
> I think we need evidence that this is really worth it for optimizing
> out basically a counter update on powerpc.

Ack, I'll drop that patch in v4 unless someone sees a better
justification for it.

- Kevin


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 07/13] mm: enable lazy_mmu sections to nest
  2025-10-24 12:16     ` Kevin Brodsky
@ 2025-10-24 13:23       ` David Hildenbrand
  2025-10-24 14:33         ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-24 13:23 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

>>> + * currently enabled.
>>>     */
>>>    #ifdef CONFIG_ARCH_LAZY_MMU
>>>    static inline void lazy_mmu_mode_enable(void)
>>>    {
>>> -    arch_enter_lazy_mmu_mode();
>>> +    struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>> +
>>> +    VM_BUG_ON(state->count == U8_MAX);
>>
>> No VM_BUG_ON() please.
> 
> I did wonder if this would be acceptable!

Use VM_WARN_ON_ONCE() and let early testing find any such issues.

VM_* is active in debug kernels only either way! :)

If you'd want to handle this in production kernels you'd need

if (WARN_ON_ONCE()) {
	/* Try to recover */
}

And that seems unnecessary/overly-complicated for something that should 
never happen, and if it happens, can be found early during testing.
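
(Spelled out for the underflow case, using the disable() from this patch
- i.e. the kind of handling that doesn't seem worth carrying in
production:)

static inline void lazy_mmu_mode_disable(void)
{
	struct lazy_mmu_state *state = &current->lazy_mmu_state;

	/* recover from a double disable(): warn once and saturate at 0 */
	if (WARN_ON_ONCE(state->count == 0))
		return;

	if (--state->count == 0) {
		state->enabled = false;
		arch_leave_lazy_mmu_mode();
	} else {
		/* Exiting a nested section */
		arch_flush_lazy_mmu_mode();
	}
}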

> 
> What should we do in case of underflow/overflow then? Saturate or just
> let it wrap around? If an overflow occurs we're probably in some
> infinite recursion and we'll crash anyway, but an underflow is likely
> due to a double disable() and saturating would probably allow to recover.
> 
>>
>>> +    /* enable() must not be called while paused */
>>> +    VM_WARN_ON(state->count > 0 && !state->enabled);
>>> +
>>> +    if (state->count == 0) {
>>> +        arch_enter_lazy_mmu_mode();
>>> +        state->enabled = true;
>>> +    }
>>> +    ++state->count;
>>
>> Can do
>>
>> if (state->count++ == 0) {
> 
> My idea here was to have exactly the reverse order between enable() and
> disable(), so that arch_enter() is called before lazy_mmu_state is
> updated, and arch_leave() afterwards. arch_* probably shouldn't rely on
> this (or care), but I liked the symmetry.

I see, but really the arch callback should never have to care about that
value -- unless something is messed up :)

[...]

>>> +static inline bool in_lazy_mmu_mode(void)
>>
>> So these functions will reveal the actual arch state, not whether
>> enable() was called.
>>
>> As I can see in later patches, in interrupt context they also
>> return "not in lazy mmu mode".
> 
> Yes - the idea is that a task is in lazy MMU mode if it enabled it and
> is in process context. The mode is never enabled in interrupt context.
> This has always been the intention, but it wasn't formalised until patch
> 12 (except on arm64).

Okay, thanks for clarifying.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers
  2025-10-24 12:13     ` Kevin Brodsky
@ 2025-10-24 13:27       ` David Hildenbrand
  2025-10-24 14:32         ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-24 13:27 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 24.10.25 14:13, Kevin Brodsky wrote:
> On 23/10/2025 21:52, David Hildenbrand wrote:
>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>> [...]
>>>
>>> * madvise_*_pte_range() call arch_leave() in multiple paths, some
>>>     followed by an immediate exit/rescheduling and some followed by a
>>>     conditional exit. These functions assume that they are called
>>>     with lazy MMU disabled and we cannot simply use pause()/resume()
>>>     to address that. This patch leaves the situation unchanged by
>>>     calling enable()/disable() in all cases.
>>
>> I'm confused, the function simply does
>>
>> (a) enables lazy mmu
>> (b) does something on the page table
>> (c) disables lazy mmu
>> (d) does something expensive (split folio -> take sleepable locks,
>>      flushes tlb)
>> (e) go to (a)
> 
> That step is conditional: we exit right away if pte_offset_map_lock()
> fails. The fundamental issue is that pause() must always be matched with
> resume(), but as those functions look today there is no situation where
> a pause() would always be matched with a resume().

We have matching enable/disable, so my question is rather "why" you are 
even thinking about using pause/resume?

What would be the benefit of that? If there is no benefit then just drop 
this from the patch description as it's more confusing than just ... 
doing what the existing code does :)

>>
>> Why would we use enable/disable instead?
>>
>>>
>>> * x86/Xen is currently the only case where explicit handling is
>>>     required for lazy MMU when context-switching. This is purely an
>>>     implementation detail and using the generic lazy_mmu_mode_*
>>>     functions would cause trouble when nesting support is introduced,
>>>     because the generic functions must be called from the current task.
>>>     For that reason we still use arch_leave() and arch_enter() there.
>>
>> How does this interact with patch #11?
> 
> It is a requirement for patch 11, in fact. If we called disable() when
> switching out a task, then lazy_mmu_state.enabled would (most likely) be
> false when scheduling it again.
> 
> By calling the arch_* helpers when context-switching, we ensure
> lazy_mmu_state remains unchanged. This is consistent with what happens
> on all other architectures (which don't do anything about lazy_mmu when
> context-switching). lazy_mmu_state is the lazy MMU status *when the task
> is scheduled*, and should be preserved on a context-switch.

Okay, thanks for clarifying. That whole XEN stuff here is rather horrible.

> 
>>
>>>
>>> Note: x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few
>>> places, but only defines it if PARAVIRT_XXL is selected, and we are
>>> removing the fallback in <linux/pgtable.h>. Add a new fallback
>>> definition to <asm/pgtable.h> to keep things building.
>>
>> I can see a call in __kernel_map_pages() and
>> arch_kmap_local_post_map()/arch_kmap_local_post_unmap().
>>
>> I guess that is ... harmless/irrelevant in the context of this series?
> 
> It should be. arch_flush_lazy_mmu_mode() was only used by x86 before
> this series; we're adding new calls to it from the generic layer, but
> existing x86 calls shouldn't be affected.

Okay, I'd like to understand the rules when arch_flush_lazy_mmu_mode() 
would actually be required in such arch code, but that's outside of the 
scope of your patch series.


-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers
  2025-10-24 13:27       ` David Hildenbrand
@ 2025-10-24 14:32         ` Kevin Brodsky
  2025-10-27 16:24           ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-24 14:32 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 24/10/2025 15:27, David Hildenbrand wrote:
> On 24.10.25 14:13, Kevin Brodsky wrote:
>> On 23/10/2025 21:52, David Hildenbrand wrote:
>>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>>> [...]
>>>>
>>>> * madvise_*_pte_range() call arch_leave() in multiple paths, some
>>>>     followed by an immediate exit/rescheduling and some followed by a
>>>>     conditional exit. These functions assume that they are called
>>>>     with lazy MMU disabled and we cannot simply use pause()/resume()
>>>>     to address that. This patch leaves the situation unchanged by
>>>>     calling enable()/disable() in all cases.
>>>
>>> I'm confused, the function simply does
>>>
>>> (a) enables lazy mmu
>>> (b) does something on the page table
>>> (c) disables lazy mmu
>>> (d) does something expensive (split folio -> take sleepable locks,
>>>      flushes tlb)
>>> (e) go to (a)
>>
>> That step is conditional: we exit right away if pte_offset_map_lock()
>> fails. The fundamental issue is that pause() must always be matched with
>> resume(), but as those functions are structured today there is nowhere a
>> pause() could be placed such that it is always matched by a resume().
>
> We have matching enable/disable, so my question is rather "why" you are
> even thinking about using pause/resume?
>
> What would be the benefit of that? If there is no benefit then just
> drop this from the patch description as it's more confusing than just
> ... doing what the existing code does :)

Ah sorry I misunderstood, I guess you originally meant: why would we use
pause()/resume()?

The issue is the one I mentioned in the commit message: using
enable()/disable(), we assume that the functions are called with the
lazy MMU mode disabled. Consider:

  lazy_mmu_mode_enable()
  madvise_cold_or_pageout_pte_range():
    lazy_mmu_mode_enable()
    ...
    if (need_resched()) {
      lazy_mmu_mode_disable()
      cond_resched() // lazy MMU still enabled
    }

This will explode on architectures that do not allow sleeping while in
lazy MMU mode. I'm not saying this is an actual problem - I don't see
why those functions would be called with lazy MMU mode enabled. But it
does go against the notion that nesting works everywhere.

>
>>>
>>> Why would we use enable/disable instead?
>>>
>>>>
>>>> * x86/Xen is currently the only case where explicit handling is
>>>>     required for lazy MMU when context-switching. This is purely an
>>>>     implementation detail and using the generic lazy_mmu_mode_*
>>>>     functions would cause trouble when nesting support is introduced,
>>>>     because the generic functions must be called from the current
>>>> task.
>>>>     For that reason we still use arch_leave() and arch_enter() there.
>>>
>>> How does this interact with patch #11?
>>
>> It is a requirement for patch 11, in fact. If we called disable() when
>> switching out a task, then lazy_mmu_state.enabled would (most likely) be
>> false when scheduling it again.
>>
>> By calling the arch_* helpers when context-switching, we ensure
>> lazy_mmu_state remains unchanged. This is consistent with what happens
>> on all other architectures (which don't do anything about lazy_mmu when
>> context-switching). lazy_mmu_state is the lazy MMU status *when the task
>> is scheduled*, and should be preserved on a context-switch.
>
> Okay, thanks for clarifying. That whole XEN stuff here is rather horrible.

Can't say I disagree... I tried to simplify it further, but the
Xen-specific "LAZY_CPU" mode makes it just too difficult.

>
>>
>>>
>>>>
>>>> Note: x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few
>>>> places, but only defines it if PARAVIRT_XXL is selected, and we are
>>>> removing the fallback in <linux/pgtable.h>. Add a new fallback
>>>> definition to <asm/pgtable.h> to keep things building.
>>>
>>> I can see a call in __kernel_map_pages() and
>>> arch_kmap_local_post_map()/arch_kmap_local_post_unmap().
>>>
>>> I guess that is ... harmless/irrelevant in the context of this series?
>>
>> It should be. arch_flush_lazy_mmu_mode() was only used by x86 before
>> this series; we're adding new calls to it from the generic layer, but
>> existing x86 calls shouldn't be affected.
>
> Okay, I'd like to understand the rules when arch_flush_lazy_mmu_mode()
> would actually be required in such arch code, but that's outside of
> the scope of your patch series. 

Not too sure either. A little archaeology shows that the calls were
added by [1][2]. Chances are [1] is no longer relevant since lazy_mmu
isn't directly used in copy_page_range(). 

I think [2] is still required - __kernel_map_pages() can be called while
lazy MMU is already enabled, and AIUI the mapping changes should take
effect by the time __kernel_map_pages() returns. On arm64 we shouldn't
have this problem by virtue of __kernel_map_pages() using lazy_mmu
itself, meaning that the nested call to disable() will trigger a flush.
(This case is in fact the original motivation for supporting nesting.)

- Kevin

[1]
https://lore.kernel.org/all/1319573279-13867-2-git-send-email-konrad.wilk@oracle.com/
[2]
https://lore.kernel.org/all/1365703192-2089-1-git-send-email-boris.ostrovsky@oracle.com/



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 07/13] mm: enable lazy_mmu sections to nest
  2025-10-24 13:23       ` David Hildenbrand
@ 2025-10-24 14:33         ` Kevin Brodsky
  0 siblings, 0 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-24 14:33 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 24/10/2025 15:23, David Hildenbrand wrote:
>>>> + * currently enabled.
>>>>     */
>>>>    #ifdef CONFIG_ARCH_LAZY_MMU
>>>>    static inline void lazy_mmu_mode_enable(void)
>>>>    {
>>>> -    arch_enter_lazy_mmu_mode();
>>>> +    struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>> +
>>>> +    VM_BUG_ON(state->count == U8_MAX);
>>>
>>> No VM_BUG_ON() please.
>>
>> I did wonder if this would be acceptable!
>
> Use VM_WARN_ON_ONCE() and let early testing find any such issues.
>
> VM_* is active in debug kernels only either way! :)

That was my intention - I don't think the checking overhead is justified
in production.

>
> If you'd want to handle this in production kernels you'd need
>
> if (WARN_ON_ONCE()) {
>     /* Try to recover */
> }
>
> And that seems unnecessary/overly-complicated for something that
> should never happen, and if it happens, can be found early during testing.

Got it. Then I guess I'll go for a VM_WARN_ON_ONCE() (because indeed
once the overflow/underflow occurs it'll go wrong on every
enable/disable pair).

>
>>
>> What should we do in case of underflow/overflow then? Saturate or just
>> let it wrap around? If an overflow occurs we're probably in some
>> infinite recursion and we'll crash anyway, but an underflow is likely
>> due to a double disable() and saturating would probably allow us to
>> recover.
>>
>>>
>>>> +    /* enable() must not be called while paused */
>>>> +    VM_WARN_ON(state->count > 0 && !state->enabled);
>>>> +
>>>> +    if (state->count == 0) {
>>>> +        arch_enter_lazy_mmu_mode();
>>>> +        state->enabled = true;
>>>> +    }
>>>> +    ++state->count;
>>>
>>> Can do
>>>
>>> if (state->count++ == 0) {
>>
>> My idea here was to have exactly the reverse order between enable() and
>> disable(), so that arch_enter() is called before lazy_mmu_state is
>> updated, and arch_leave() afterwards. arch_* probably shouldn't rely on
>> this (or care), but I liked the symmetry.
>
> I see, but really the arch callback should never have to care about that
> value -- unless something is messed up :)

Fair enough, then I can fold those increments/decrements ;)

- Kevin


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 03/13] powerpc/mm: implement arch_flush_lazy_mmu_mode()
  2025-10-24 12:09     ` Kevin Brodsky
@ 2025-10-24 14:42       ` David Hildenbrand
  2025-10-24 14:54         ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-24 14:42 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 24.10.25 14:09, Kevin Brodsky wrote:
> On 23/10/2025 21:36, David Hildenbrand wrote:
>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>> [...]
>>>
>>> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>>> b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>>> index 146287d9580f..7704dbe8e88d 100644
>>> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>>> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>>> @@ -41,6 +41,16 @@ static inline void arch_enter_lazy_mmu_mode(void)
>>>        batch->active = 1;
>>>    }
>>>    +static inline void arch_flush_lazy_mmu_mode(void)
>>> +{
>>> +    struct ppc64_tlb_batch *batch;
>>> +
>>> +    batch = this_cpu_ptr(&ppc64_tlb_batch);
>>
>> The downside is the double this_cpu_ptr() now on the
>> arch_leave_lazy_mmu_mode() path.
> 
> This is only temporary, patch 9 removes it from arch_enter(). I don't
> think having a redundant this_cpu_ptr() for a few commits is really a
> concern?

Oh, right. Consider mentioning in the patch description

"Note that follow-up patches will remove the double this_cpu_ptr() on 
the arch_leave_lazy_mmu_mode() path again."

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-23 20:06   ` David Hildenbrand
@ 2025-10-24 14:47     ` David Woodhouse
  2025-10-24 14:51       ` David Hildenbrand
  2025-10-24 15:05       ` Kevin Brodsky
  0 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2025-10-24 14:47 UTC (permalink / raw)
  To: David Hildenbrand, Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

[-- Attachment #1: Type: text/plain, Size: 850 bytes --]

On Thu, 2025-10-23 at 22:06 +0200, David Hildenbrand wrote:
> On 15.10.25 10:27, Kevin Brodsky wrote:
> > We currently set a TIF flag when scheduling out a task that is in
> > lazy MMU mode, in order to restore it when the task is scheduled
> > again.
> > 
> > The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> > mode in task_struct::lazy_mmu_state. We can therefore check that
> > state when switching to the new task, instead of using a separate
> > TIF flag.
> > 
> > Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> > ---
> 
> 
> Looks ok to me, but I hope we get some confirmation from x86 / xen
> folks.


I know tglx has shouted at me in the past for precisely this reminder,
but you know you can test Xen guests under QEMU/KVM now and don't need
to actually run Xen? Has this been boot tested?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-24 14:47     ` David Woodhouse
@ 2025-10-24 14:51       ` David Hildenbrand
  2025-10-24 15:13         ` David Woodhouse
  2025-10-24 22:52         ` Demi Marie Obenour
  2025-10-24 15:05       ` Kevin Brodsky
  1 sibling, 2 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-10-24 14:51 UTC (permalink / raw)
  To: David Woodhouse, Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 24.10.25 16:47, David Woodhouse wrote:
> On Thu, 2025-10-23 at 22:06 +0200, David Hildenbrand wrote:
>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>> We currently set a TIF flag when scheduling out a task that is in
>>> lazy MMU mode, in order to restore it when the task is scheduled
>>> again.
>>>
>>> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
>>> mode in task_struct::lazy_mmu_state. We can therefore check that
>>> state when switching to the new task, instead of using a separate
>>> TIF flag.
>>>
>>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>>> ---
>>
>>
>> Looks ok to me, but I hope we get some confirmation from x86 / xen
>> folks.
> 
> 
> I know tglx has shouted at me in the past for precisely this reminder,
> but you know you can test Xen guests under QEMU/KVM now and don't need
> to actually run Xen? Has this been boot tested?

And after that, boot-testing sparc as well? :D

If it's easy, why not. But other people should not suffer for all the 
XEN hacks we keep dragging along.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 03/13] powerpc/mm: implement arch_flush_lazy_mmu_mode()
  2025-10-24 14:42       ` David Hildenbrand
@ 2025-10-24 14:54         ` Kevin Brodsky
  0 siblings, 0 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-24 14:54 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 24/10/2025 16:42, David Hildenbrand wrote:
> On 24.10.25 14:09, Kevin Brodsky wrote:
>> On 23/10/2025 21:36, David Hildenbrand wrote:
>>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>>> [...]
>>>>
>>>> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>>>> b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>>>> index 146287d9580f..7704dbe8e88d 100644
>>>> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>>>> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>>>> @@ -41,6 +41,16 @@ static inline void arch_enter_lazy_mmu_mode(void)
>>>>        batch->active = 1;
>>>>    }
>>>>    +static inline void arch_flush_lazy_mmu_mode(void)
>>>> +{
>>>> +    struct ppc64_tlb_batch *batch;
>>>> +
>>>> +    batch = this_cpu_ptr(&ppc64_tlb_batch);
>>>
>>> The downside is the double this_cpu_ptr() now on the
>>> arch_leave_lazy_mmu_mode() path.
>>
>> This is only temporary, patch 9 removes it from arch_enter(). I don't
>> think having a redundant this_cpu_ptr() for a few commits is really a
>> concern?
>
> Oh, right. Consider mentioning in the patch description
>
> "Note that follow-up patches will remove the double this_cpu_ptr() on
> the arch_leave_lazy_mmu_mode() path again." 

Sounds good, will do.

- Kevin


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-24 14:47     ` David Woodhouse
  2025-10-24 14:51       ` David Hildenbrand
@ 2025-10-24 15:05       ` Kevin Brodsky
  2025-10-24 15:17         ` David Woodhouse
  1 sibling, 1 reply; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-24 15:05 UTC (permalink / raw)
  To: David Woodhouse, David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 24/10/2025 16:47, David Woodhouse wrote:
> On Thu, 2025-10-23 at 22:06 +0200, David Hildenbrand wrote:
>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>> We currently set a TIF flag when scheduling out a task that is in
>>> lazy MMU mode, in order to restore it when the task is scheduled
>>> again.
>>>
>>> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
>>> mode in task_struct::lazy_mmu_state. We can therefore check that
>>> state when switching to the new task, instead of using a separate
>>> TIF flag.
>>>
>>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>>> ---
>>
>> Looks ok to me, but I hope we get some confirmation from x86 / xen
>> folks.
>
> I know tglx has shouted at me in the past for precisely this reminder,
> but you know you can test Xen guests under QEMU/KVM now and don't need
> to actually run Xen? Has this been boot tested?

I considered boot-testing a Xen guest (considering the Xen-specific
changes in this series), but having no idea how to go about it I quickly
gave up... Happy to follow instructions :)

- Kevin
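
To make the change under discussion easier to picture, here is a
schematic sketch (not the actual patch) of the Xen PV context-switch
hook re-enabling batching for the incoming task from the generic
per-task state instead of a TIF flag. xen_end_context_switch(),
xen_mc_flush() and paravirt_end_context_switch() are the existing
enlighten_pv.c hooks; the in_lazy_mmu_mode(task) helper and its
signature are assumptions based on the series.

static void xen_end_context_switch(struct task_struct *next)
{
	xen_mc_flush();
	paravirt_end_context_switch(next);

	/*
	 * Previously: test_and_clear a TIF flag set when 'next' was
	 * scheduled out in lazy MMU mode. With this series, the per-task
	 * lazy_mmu_state tells us directly whether batching must be
	 * re-enabled.
	 */
	if (in_lazy_mmu_mode(next))
		arch_enter_lazy_mmu_mode();
}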


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-24 14:51       ` David Hildenbrand
@ 2025-10-24 15:13         ` David Woodhouse
  2025-10-24 15:16           ` David Hildenbrand
  2025-10-24 15:38           ` John Paul Adrian Glaubitz
  2025-10-24 22:52         ` Demi Marie Obenour
  1 sibling, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2025-10-24 15:13 UTC (permalink / raw)
  To: David Hildenbrand, Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On Fri, 2025-10-24 at 16:51 +0200, David Hildenbrand wrote:
> On 24.10.25 16:47, David Woodhouse wrote:
> > On Thu, 2025-10-23 at 22:06 +0200, David Hildenbrand wrote:
> > > On 15.10.25 10:27, Kevin Brodsky wrote:
> > > > We currently set a TIF flag when scheduling out a task that is in
> > > > lazy MMU mode, in order to restore it when the task is scheduled
> > > > again.
> > > > 
> > > > The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> > > > mode in task_struct::lazy_mmu_state. We can therefore check that
> > > > state when switching to the new task, instead of using a separate
> > > > TIF flag.
> > > > 
> > > > Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> > > > ---
> > > 
> > > 
> > > Looks ok to me, but I hope we get some confirmation from x86 / xen
> > > folks.
> > 
> > 
> > I know tglx has shouted at me in the past for precisely this reminder,
> > but you know you can test Xen guests under QEMU/KVM now and don't need
> > to actually run Xen? Has this been boot tested?
> 
> And after that, boot-testing sparc as well? :D

Also not that hard in QEMU, I believe. Although I do have some SPARC
boxes in the shed...



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-24 15:13         ` David Woodhouse
@ 2025-10-24 15:16           ` David Hildenbrand
  2025-10-24 15:38           ` John Paul Adrian Glaubitz
  1 sibling, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-10-24 15:16 UTC (permalink / raw)
  To: David Woodhouse, Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 24.10.25 17:13, David Woodhouse wrote:
> On Fri, 2025-10-24 at 16:51 +0200, David Hildenbrand wrote:
>> On 24.10.25 16:47, David Woodhouse wrote:
>>> On Thu, 2025-10-23 at 22:06 +0200, David Hildenbrand wrote:
>>>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>>>> We currently set a TIF flag when scheduling out a task that is in
>>>>> lazy MMU mode, in order to restore it when the task is scheduled
>>>>> again.
>>>>>
>>>>> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
>>>>> mode in task_struct::lazy_mmu_state. We can therefore check that
>>>>> state when switching to the new task, instead of using a separate
>>>>> TIF flag.
>>>>>
>>>>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>>>>> ---
>>>>
>>>>
>>>> Looks ok to me, but I hope we get some confirmation from x86 / xen
>>>> folks.
>>>
>>>
>>> I know tglx has shouted at me in the past for precisely this reminder,
>>> but you know you can test Xen guests under QEMU/KVM now and don't need
>>> to actually run Xen? Has this been boot tested?
>>
>> And after that, boot-testing sparc as well? :D
> 
> Also not that hard in QEMU, I believe. Although I do have some SPARC
> boxes in the shed...

Yeah, I once went through the pain of getting a sparc64 system booting 
in QEMU with a distro (was it debian?) that was 7 years old or so.

Fantastic experience.

Only took me 2 days IIRC. Absolutely worth it to not break upstream 
kernels on a museum piece.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-24 15:05       ` Kevin Brodsky
@ 2025-10-24 15:17         ` David Woodhouse
  2025-10-27 13:38           ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: David Woodhouse @ 2025-10-24 15:17 UTC (permalink / raw)
  To: Kevin Brodsky, David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On Fri, 2025-10-24 at 17:05 +0200, Kevin Brodsky wrote:
> On 24/10/2025 16:47, David Woodhouse wrote:
> > On Thu, 2025-10-23 at 22:06 +0200, David Hildenbrand wrote:
> > > On 15.10.25 10:27, Kevin Brodsky wrote:
> > > > We currently set a TIF flag when scheduling out a task that is in
> > > > lazy MMU mode, in order to restore it when the task is scheduled
> > > > again.
> > > > 
> > > > The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> > > > mode in task_struct::lazy_mmu_state. We can therefore check that
> > > > state when switching to the new task, instead of using a separate
> > > > TIF flag.
> > > > 
> > > > Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> > > > ---
> > > 
> > > Looks ok to me, but I hope we get some confirmation from x86 / xen
> > > folks.
> > 
> > I know tglx has shouted at me in the past for precisely this reminder,
> > but you know you can test Xen guests under QEMU/KVM now and don't need
> > to actually run Xen? Has this been boot tested?
> 
> I considered boot-testing a Xen guest (considering the Xen-specific
> changes in this series), but having no idea how to go about it I quickly
> gave up... Happy to follow instructions :)

https://qemu-project.gitlab.io/qemu/system/i386/xen.html covers booting
Xen HVM guests, and near the bottom PV guests too (for which you do
need a copy of Xen to run in QEMU with '--kernel xen', and your
distro's build should suffice for that).

Let me know if you have any trouble. Here's a sample command line which
works here...

qemu-system-x86_64 -display none --accel kvm,xen-version=0x40011,kernel-irqchip=split -drive file=/var/lib/libvirt/images/fedora28.qcow2,if=xen -kernel ~/git/linux-2.6/arch/x86/boot/bzImage -append "root=/dev/xvda1 console=ttyS0" -serial mon:stdio


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-24 15:13         ` David Woodhouse
  2025-10-24 15:16           ` David Hildenbrand
@ 2025-10-24 15:38           ` John Paul Adrian Glaubitz
  2025-10-24 15:47             ` David Hildenbrand
  1 sibling, 1 reply; 58+ messages in thread
From: John Paul Adrian Glaubitz @ 2025-10-24 15:38 UTC (permalink / raw)
  To: David Woodhouse, David Hildenbrand, Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On Fri, 2025-10-24 at 16:13 +0100, David Woodhouse wrote:
> On Fri, 2025-10-24 at 16:51 +0200, David Hildenbrand wrote:
> > On 24.10.25 16:47, David Woodhouse wrote:
> > > On Thu, 2025-10-23 at 22:06 +0200, David Hildenbrand wrote:
> > > > On 15.10.25 10:27, Kevin Brodsky wrote:
> > > > > We currently set a TIF flag when scheduling out a task that is in
> > > > > lazy MMU mode, in order to restore it when the task is scheduled
> > > > > again.
> > > > > 
> > > > > The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> > > > > mode in task_struct::lazy_mmu_state. We can therefore check that
> > > > > state when switching to the new task, instead of using a separate
> > > > > TIF flag.
> > > > > 
> > > > > Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> > > > > ---
> > > > 
> > > > 
> > > > Looks ok to me, but I hope we get some confirmation from x86 / xen
> > > > folks.
> > > 
> > > 
> > > I know tglx has shouted at me in the past for precisely this reminder,
> > > but you know you can test Xen guests under QEMU/KVM now and don't need
> > > to actually run Xen? Has this been boot tested?
> > 
> > And after that, boot-testing sparc as well? :D
> 
> Also not that hard in QEMU, I believe. Although I do have some SPARC
> boxes in the shed...

Please have people test kernel changes on SPARC on real hardware. QEMU does not
emulate sun4v, for example, and therefore testing in QEMU does not cover all
of SPARC hardware.

There are plenty of people on the debian-sparc, gentoo-sparc and sparclinux
LKML mailing lists that can test kernel patches for SPARC. If SPARC-relevant
changes need to be tested, please ask there and don't bury such things in a
deeply nested thread in a discussion which doesn't even have SPARC in the
mail subject.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-24 15:38           ` John Paul Adrian Glaubitz
@ 2025-10-24 15:47             ` David Hildenbrand
  2025-10-24 15:51               ` John Paul Adrian Glaubitz
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-24 15:47 UTC (permalink / raw)
  To: John Paul Adrian Glaubitz, David Woodhouse, Kevin Brodsky,
	linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 24.10.25 17:38, John Paul Adrian Glaubitz wrote:
> On Fri, 2025-10-24 at 16:13 +0100, David Woodhouse wrote:
>> On Fri, 2025-10-24 at 16:51 +0200, David Hildenbrand wrote:
>>> On 24.10.25 16:47, David Woodhouse wrote:
>>>> On Thu, 2025-10-23 at 22:06 +0200, David Hildenbrand wrote:
>>>>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>>>>> We currently set a TIF flag when scheduling out a task that is in
>>>>>> lazy MMU mode, in order to restore it when the task is scheduled
>>>>>> again.
>>>>>>
>>>>>> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
>>>>>> mode in task_struct::lazy_mmu_state. We can therefore check that
>>>>>> state when switching to the new task, instead of using a separate
>>>>>> TIF flag.
>>>>>>
>>>>>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>>>>>> ---
>>>>>
>>>>>
>>>>> Looks ok to me, but I hope we get some confirmation from x86 / xen
>>>>> folks.
>>>>
>>>>
>>>> I know tglx has shouted at me in the past for precisely this reminder,
>>>> but you know you can test Xen guests under QEMU/KVM now and don't need
>>>> to actually run Xen? Has this been boot tested?
>>>
>>> And after that, boot-testing sparc as well? :D
>>
>> Also not that hard in QEMU, I believe. Although I do have some SPARC
>> boxes in the shed...
> 
> Please have people test kernel changes on SPARC on real hardware. QEMU does not
> emulate sun4v, for example, and therefore testing in QEMU does not cover all
> of SPARC hardware.
> 
> There are plenty of people on the debian-sparc, gentoo-sparc and sparclinux
> LKML mailing lists that can test kernel patches for SPARC. If SPARC-relevant
> changes need to be tested, please ask there and don't bury such things in a
> deeply nested thread in a discussion which doesn't even have SPARC in the
> mail subject.

Hi Adrian,

out of curiosity, do people monitor sparclinux@ for changes to actively 
offer testing when required -- like would it be sufficient to CC 
relevant maintainers+list (like done here) and raise in the cover letter 
that some testing help would be appreciated?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-24 15:47             ` David Hildenbrand
@ 2025-10-24 15:51               ` John Paul Adrian Glaubitz
  2025-10-27 12:38                 ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: John Paul Adrian Glaubitz @ 2025-10-24 15:51 UTC (permalink / raw)
  To: David Hildenbrand, David Woodhouse, Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

Hi David,

On Fri, 2025-10-24 at 17:47 +0200, David Hildenbrand wrote:
> > Please have people test kernel changes on SPARC on real hardware. QEMU does not
> > emulate sun4v, for example, and therefore testing in QEMU does not cover all
> > of SPARC hardware.
> > 
> > There are plenty of people on the debian-sparc, gentoo-sparc and sparclinux
> > LKML mailing lists that can test kernel patches for SPARC. If SPARC-relevant
> > changes need to be tested, please ask there and don't bury such things in a
> > deeply nested thread in a discussion which doesn't even have SPARC in the
> > mail subject.
> 
> out of curiosity, do people monitor sparclinux@ for changes to actively 
> offer testing when required -- like would it be sufficient to CC 
> relevant maintainers+list (like done here) and raise in the cover letter 
> that some testing help would be appreciated?

Yes, that's definitely the case. But it should be obvious from the subject
of the mail that the change affects SPARC, as not everyone can read every
mail they're receiving through mailing lists.

I'm trying to keep up, but since I'm on mailing lists for many different architectures,
mails can slip through the cracks.

For people that want to test changes on SPARC regularly, I can also offer accounts
on SPARC test machines running on a Solaris LDOM (logical domain) on a SPARC T4.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-24 14:51       ` David Hildenbrand
  2025-10-24 15:13         ` David Woodhouse
@ 2025-10-24 22:52         ` Demi Marie Obenour
  2025-10-27 12:29           ` David Hildenbrand
  1 sibling, 1 reply; 58+ messages in thread
From: Demi Marie Obenour @ 2025-10-24 22:52 UTC (permalink / raw)
  To: David Hildenbrand, David Woodhouse, Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86


On 10/24/25 10:51, David Hildenbrand wrote:
> On 24.10.25 16:47, David Woodhouse wrote:
>> On Thu, 2025-10-23 at 22:06 +0200, David Hildenbrand wrote:
>>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>>> We currently set a TIF flag when scheduling out a task that is in
>>>> lazy MMU mode, in order to restore it when the task is scheduled
>>>> again.
>>>>
>>>> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
>>>> mode in task_struct::lazy_mmu_state. We can therefore check that
>>>> state when switching to the new task, instead of using a separate
>>>> TIF flag.
>>>>
>>>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>>>> ---
>>>
>>>
>>> Looks ok to me, but I hope we get some confirmation from x86 / xen
>>> folks.
>>
>>
>> I know tglx has shouted at me in the past for precisely this reminder,
>> but you know you can test Xen guests under QEMU/KVM now and don't need
>> to actually run Xen? Has this been boot tested?
> 
> And after that, boot-testing sparc as well? :D
> 
> If it's easy, why not. But other people should not suffer for all the 
> XEN hacks we keep dragging along.

Which hacks?  Serious question.  Is this just for Xen PV or is HVM
also affected?
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-24 22:52         ` Demi Marie Obenour
@ 2025-10-27 12:29           ` David Hildenbrand
  2025-10-27 13:32             ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-27 12:29 UTC (permalink / raw)
  To: Demi Marie Obenour, David Woodhouse, Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 25.10.25 00:52, Demi Marie Obenour wrote:
> On 10/24/25 10:51, David Hildenbrand wrote:
>> On 24.10.25 16:47, David Woodhouse wrote:
>>> On Thu, 2025-10-23 at 22:06 +0200, David Hildenbrand wrote:
>>>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>>>> We currently set a TIF flag when scheduling out a task that is in
>>>>> lazy MMU mode, in order to restore it when the task is scheduled
>>>>> again.
>>>>>
>>>>> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
>>>>> mode in task_struct::lazy_mmu_state. We can therefore check that
>>>>> state when switching to the new task, instead of using a separate
>>>>> TIF flag.
>>>>>
>>>>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>>>>> ---
>>>>
>>>>
>>>> Looks ok to me, but I hope we get some confirmation from x86 / xen
>>>> folks.
>>>
>>>
>>> I know tglx has shouted at me in the past for precisely this reminder,
>>> but you know you can test Xen guests under QEMU/KVM now and don't need
>>> to actually run Xen? Has this been boot tested?
>>
>> And after that, boot-testing sparc as well? :D
>>
>> If it's easy, why not. But other people should not suffer for all the
>> XEN hacks we keep dragging along.
> 
> Which hacks?  Serious question.  Is this just for Xen PV or is HVM
> also affected?

In the context of this series, XEN_LAZY_MMU.

Your question regarding PV/HVM emphasizes my point: how is a submitter
supposed to know which XEN combinations to test (and how to test them),
so as not to accidentally break something here.

We really need guidance+help from the XEN folks here.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-24 15:51               ` John Paul Adrian Glaubitz
@ 2025-10-27 12:38                 ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-10-27 12:38 UTC (permalink / raw)
  To: John Paul Adrian Glaubitz, David Woodhouse, Kevin Brodsky,
	linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 24.10.25 17:51, John Paul Adrian Glaubitz wrote:
> Hi David,

Hi,

> 
> On Fri, 2025-10-24 at 17:47 +0200, David Hildenbrand wrote:
>>> Please have people test kernel changes on SPARC on real hardware. QEMU does not
>>> emulate sun4v, for example, and therefore testing in QEMU does not cover all
>>> of SPARC hardware.
>>>
>>> There are plenty of people on the debian-sparc, gentoo-sparc and sparclinux
>>> LKML mailing lists that can test kernel patches for SPARC. If SPARC-relevant
>>> changes need to be tested, please ask there and don't bury such things in a
>>> deeply nested thread in a discussion which doesn't even have SPARC in the
>>> mail subject.
>>
>> out of curiosity, do people monitor sparclinux@ for changes to actively
>> offer testing when required -- like would it be sufficient to CC
>> relevant maintainers+list (like done here) and raise in the cover letter
>> that some testing help would be appreciated?
> 
> Yes, that's definitely the case. But it should be obvious from the subject
> of the mail that the change affects SPARC, as not everyone can read every
> mail they're receiving through mailing lists.

Agreed. One would hope that people only CC the sparc mailing list + 
maintainers when there is actually something relevant in there.

Also, it would be nice if someone (e.g., the maintainer or reviewers) 
could monitor the list, spot when there is testing demand, and CC the 
right people.

I guess one problem might be that nobody is getting paid to work on 
sparc (I'm happy to be wrong on that one :) ).

Regarding sparc, I'll keep in mind that we might have to write a 
separate mail to the list to get some help with testing.

> 
> I'm trying to keep up, but since I'm on mailing lists for many different architectures,
> mails can slip through the cracks.

Yeah, that's understandable.

> 
> For people that want to test changes on SPARC regularly, I can also offer accounts
> on SPARC test machines running on a Solaris LDOM (logical domain) on a SPARC T4.

For example, I do have an s390x machine in an IBM cloud where I can test 
stuff. But I worked on s390x before, so I know how to test and what to 
test, and how to troubleshoot.

On sparc I'd unfortunately have a hard time even understanding whether a 
simple boot test on some machine would actually trigger what I want to 
test :(

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-27 12:29           ` David Hildenbrand
@ 2025-10-27 13:32             ` Kevin Brodsky
  0 siblings, 0 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-27 13:32 UTC (permalink / raw)
  To: David Hildenbrand, Demi Marie Obenour, David Woodhouse, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 27/10/2025 13:29, David Hildenbrand wrote:
> On 25.10.25 00:52, Demi Marie Obenour wrote:
>> On 10/24/25 10:51, David Hildenbrand wrote:
>>> On 24.10.25 16:47, David Woodhouse wrote:
>>>> On Thu, 2025-10-23 at 22:06 +0200, David Hildenbrand wrote:
>>>>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>>>>> We currently set a TIF flag when scheduling out a task that is in
>>>>>> lazy MMU mode, in order to restore it when the task is scheduled
>>>>>> again.
>>>>>>
>>>>>> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
>>>>>> mode in task_struct::lazy_mmu_state. We can therefore check that
>>>>>> state when switching to the new task, instead of using a separate
>>>>>> TIF flag.
>>>>>>
>>>>>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>>>>>> ---
>>>>>
>>>>>
>>>>> Looks ok to me, but I hope we get some confirmation from x86 / xen
>>>>> folks.
>>>>
>>>>
>>>> I know tglx has shouted at me in the past for precisely this reminder,
>>>> but you know you can test Xen guests under QEMU/KVM now and don't need
>>>> to actually run Xen? Has this been boot tested?
>>>
>>> And after that, boot-testing sparc as well? :D
>>>
>>> If it's easy, why not. But other people should not suffer for all the
>>> XEN hacks we keep dragging along.
>>
>> Which hacks?  Serious question.  Is this just for Xen PV or is HVM
>> also affected?
>
> In the context of this series, XEN_LAZY_MMU.

FWIW in that particular case it's relatively easy to tell this is
specific to Xen PV (this is only used in mmu_pv.c and enlighten_pv.c).
Knowing what to test is certainly not obvious in general, though.

- Kevin

>
> Your question regarding PV/HVM emphasizes my point: how is a submitter
> supposed to know which XEN combinations to test (and how to test them),
> so as not to accidentally break something here.
>
> We really need guidance+help from the XEN folks here.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching
  2025-10-24 15:17         ` David Woodhouse
@ 2025-10-27 13:38           ` Kevin Brodsky
  0 siblings, 0 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-27 13:38 UTC (permalink / raw)
  To: David Woodhouse, David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 24/10/2025 17:17, David Woodhouse wrote:
> On Fri, 2025-10-24 at 17:05 +0200, Kevin Brodsky wrote:
>> On 24/10/2025 16:47, David Woodhouse wrote:
>>> On Thu, 2025-10-23 at 22:06 +0200, David Hildenbrand wrote:
>>>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>>>> We currently set a TIF flag when scheduling out a task that is in
>>>>> lazy MMU mode, in order to restore it when the task is scheduled
>>>>> again.
>>>>>
>>>>> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
>>>>> mode in task_struct::lazy_mmu_state. We can therefore check that
>>>>> state when switching to the new task, instead of using a separate
>>>>> TIF flag.
>>>>>
>>>>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>>>>> ---
>>>> Looks ok to me, but I hope we get some confirmation from x86 / xen
>>>> folks.
>>> I know tglx has shouted at me in the past for precisely this reminder,
>>> but you know you can test Xen guests under QEMU/KVM now and don't need
>>> to actually run Xen? Has this been boot tested?
>> I considered boot-testing a Xen guest (considering the Xen-specific
>> changes in this series), but having no idea how to go about it I quickly
>> gave up... Happy to follow instructions :)
> https://qemu-project.gitlab.io/qemu/system/i386/xen.html covers booting
> Xen HVM guests, and near the bottom PV guests too (for which you do
> need a copy of Xen to run in QEMU with '--kernel xen', and your
> distro's build should suffice for that).
>
> Let me know if you have any trouble. Here's a sample command line which
> works here...
>
> qemu-system-x86_64 -display none --accel kvm,xen-version=0x40011,kernel-irqchip=split -drive file=/var/lib/libvirt/images/fedora28.qcow2,if=xen -kernel ~/git/linux-2.6/arch/x86/boot/bzImage -append "root=/dev/xvda1 console=ttyS0" -serial mon:stdio

Thanks, this is helpful! Unfortunately lazy_mmu is only used in the PV
case, so I'd need to run a PV guest. And the distro I'm using (Arch
Linux) does not have a Xen package :/ It can be built from source from
the AUR but that looks rather involved. Are there some prebuilt binaries
I could grab and just point QEMU to?

- Kevin


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers
  2025-10-24 14:32         ` Kevin Brodsky
@ 2025-10-27 16:24           ` David Hildenbrand
  2025-10-28 10:34             ` Kevin Brodsky
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-10-27 16:24 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 24.10.25 16:32, Kevin Brodsky wrote:
> On 24/10/2025 15:27, David Hildenbrand wrote:
>> On 24.10.25 14:13, Kevin Brodsky wrote:
>>> On 23/10/2025 21:52, David Hildenbrand wrote:
>>>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>>>> [...]
>>>>>
>>>>> * madvise_*_pte_range() call arch_leave() in multiple paths, some
>>>>>      followed by an immediate exit/rescheduling and some followed by a
>>>>>      conditional exit. These functions assume that they are called
>>>>>      with lazy MMU disabled and we cannot simply use pause()/resume()
>>>>>      to address that. This patch leaves the situation unchanged by
>>>>>      calling enable()/disable() in all cases.
>>>>
>>>> I'm confused, the function simply does
>>>>
>>>> (a) enables lazy mmu
>>>> (b) does something on the page table
>>>> (c) disables lazy mmu
>>>> (d) does something expensive (split folio -> take sleepable locks,
>>>>       flushes tlb)
>>>> (e) go to (a)
>>>
>>> That step is conditional: we exit right away if pte_offset_map_lock()
>>> fails. The fundamental issue is that pause() must always be matched with
>>> resume(), but as those functions look today there is no situation where
>>> a pause() would always be matched with a resume().
>>
>> We have matches enable/disable, so my question is rather "why" you are
>> even thinking about using pause/resume?
>>
>> What would be the benefit of that? If there is no benefit then just
>> drop this from the patch description as it's more confusing than just
>> ... doing what the existing code does :)
> 
> Ah sorry I misunderstood, I guess you originally meant: why would we use
> pause()/resume()?
> 
> The issue is the one I mentioned in the commit message: using
> enable()/disable(), we assume that the functions are called with lazy
> MMU mode disabled. Consider:
> 
>    lazy_mmu_mode_enable()
>    madvise_cold_or_pageout_pte_range():
>      lazy_mmu_mode_enable()
>      ...
>      if (need_resched()) {
>        lazy_mmu_mode_disable()
>        cond_resched() // lazy MMU still enabled
>      }
> 
> This will explode on architectures that do not allow sleeping while in
> lazy MMU mode. I'm not saying this is an actual problem - I don't see
> why those functions would be called with lazy MMU mode enabled. But it
> does go against the notion that nesting works everywhere.

I would tackle it from a different direction: if code calls with lazy 
MMU enabled into random other code that might sleep, that caller would 
be wrong.

It's not about changing functions like this to use pause/resume.

Maybe the rule is simple: if you enable the lazy MMU, don't call any 
functions that might sleep.

Maybe we could support that later by handling it before/after sleeping, 
if ever required?

Or am I missing something regarding your point on pause()/resume()?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers
  2025-10-27 16:24           ` David Hildenbrand
@ 2025-10-28 10:34             ` Kevin Brodsky
  0 siblings, 0 replies; 58+ messages in thread
From: Kevin Brodsky @ 2025-10-28 10:34 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
	Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
	Christophe Leroy, Dave Hansen, David S. Miller, H. Peter Anvin,
	Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett,
	Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
	Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
	linuxppc-dev, sparclinux, xen-devel, x86

On 27/10/2025 17:24, David Hildenbrand wrote:
> On 24.10.25 16:32, Kevin Brodsky wrote:
>> On 24/10/2025 15:27, David Hildenbrand wrote:
>>> On 24.10.25 14:13, Kevin Brodsky wrote:
>>>> On 23/10/2025 21:52, David Hildenbrand wrote:
>>>>> On 15.10.25 10:27, Kevin Brodsky wrote:
>>>>>> [...]
>>>>>>
>>>>>> * madvise_*_pte_range() call arch_leave() in multiple paths, some
>>>>>>      followed by an immediate exit/rescheduling and some followed
>>>>>> by a
>>>>>>      conditional exit. These functions assume that they are called
>>>>>>      with lazy MMU disabled and we cannot simply use
>>>>>> pause()/resume()
>>>>>>      to address that. This patch leaves the situation unchanged by
>>>>>>      calling enable()/disable() in all cases.
>>>>>
>>>>> I'm confused, the function simply does
>>>>>
>>>>> (a) enables lazy mmu
>>>>> (b) does something on the page table
>>>>> (c) disables lazy mmu
>>>>> (d) does something expensive (split folio -> take sleepable locks,
>>>>>       flushes tlb)
>>>>> (e) go to (a)
>>>>
>>>> That step is conditional: we exit right away if pte_offset_map_lock()
>>>> fails. The fundamental issue is that pause() must always be matched
>>>> with
>>>> resume(), but as those functions look today there is no situation
>>>> where
>>>> a pause() would always be matched with a resume().
>>>
>>> We have matches enable/disable, so my question is rather "why" you are
>>> even thinking about using pause/resume?
>>>
>>> What would be the benefit of that? If there is no benefit then just
>>> drop this from the patch description as it's more confusing than just
>>> ... doing what the existing code does :)
>>
>> Ah sorry I misunderstood, I guess you originally meant: why would we use
>> pause()/resume()?
>>
>> The issue is the one I mentioned in the commit message: using
>> enable()/disable(), we assume that the functions are called with lazy
>> MMU mode disabled. Consider:
>>
>>    lazy_mmu_mode_enable()
>>    madvise_cold_or_pageout_pte_range():
>>      lazy_mmu_mode_enable()
>>      ...
>>      if (need_resched()) {
>>        lazy_mmu_mode_disable()
>>        cond_resched() // lazy MMU still enabled
>>      }
>>
>> This will explode on architectures that do not allow sleeping while in
>> lazy MMU mode. I'm not saying this is an actual problem - I don't see
>> why those functions would be called with lazy MMU mode enabled. But it
>> does go against the notion that nesting works everywhere.
>
> I would tackle it from a different direction: if code calls with lazy
> MMU enabled into random other code that might sleep, that caller would
> be wrong.
>
> It's not about changing functions like this to use pause/resume.
>
> Maybe the rule is simple: if you enable the lazy MMU, don't call any
> functions that might sleep.

You're right, this is a requirement for lazy MMU. Calling enable() then
disable() means returning to the original state, and if the function
sleeps at that point then the caller must not itself enable lazy MMU.

I mixed up that case with the original motivation for pause()/resume(),
which is to temporarily pause any batching. This is considered an
implementation detail and the caller isn't expected to be aware of it,
hence the need for that use-case to work regardless of nesting.

> Maybe we could support that later by handling it before/after
> sleeping, if ever required?

Indeed, pause()/resume() could be used to allow functions that sleep to
be called with lazy MMU enabled. But that's only a hypothetical use-case
for now.

> Or am I missing something regarding your point on pause()/resume()?

Doesn't sound like it :) I'll remove that paragraph from the (already
long) commit message. Thanks!

- Kevin
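
To make that hypothetical use-case concrete, here is a minimal sketch
using the lazy_mmu_mode_* helpers from this series; set_pte_at() is the
usual generic helper, and helper_that_might_sleep() is made up purely
for illustration.

static void example_batched_update(struct mm_struct *mm, unsigned long addr,
				   pte_t *ptep, pte_t pte)
{
	lazy_mmu_mode_enable();
	set_pte_at(mm, addr, ptep, pte);	/* may be batched */

	lazy_mmu_mode_pause();		/* flush and leave, whatever the nesting level */
	helper_that_might_sleep();	/* sleeping is safe while paused */
	lazy_mmu_mode_resume();		/* back to the same nesting level */

	lazy_mmu_mode_disable();
}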


^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2025-10-28 10:34 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-15  8:27 [PATCH v3 00/13] Nesting support for lazy MMU mode Kevin Brodsky
2025-10-15  8:27 ` [PATCH v3 01/13] powerpc/64s: Do not re-activate batched TLB flush Kevin Brodsky
2025-10-15  8:27 ` [PATCH v3 02/13] x86/xen: simplify flush_lazy_mmu() Kevin Brodsky
2025-10-15 16:52   ` Dave Hansen
2025-10-16  7:32     ` Kevin Brodsky
2025-10-15  8:27 ` [PATCH v3 03/13] powerpc/mm: implement arch_flush_lazy_mmu_mode() Kevin Brodsky
2025-10-23 19:36   ` David Hildenbrand
2025-10-24 12:09     ` Kevin Brodsky
2025-10-24 14:42       ` David Hildenbrand
2025-10-24 14:54         ` Kevin Brodsky
2025-10-15  8:27 ` [PATCH v3 04/13] sparc/mm: " Kevin Brodsky
2025-10-23 19:37   ` David Hildenbrand
2025-10-15  8:27 ` [PATCH v3 05/13] mm: introduce CONFIG_ARCH_LAZY_MMU Kevin Brodsky
2025-10-18  9:52   ` Mike Rapoport
2025-10-20 10:37     ` Kevin Brodsky
2025-10-23 19:38       ` David Hildenbrand
2025-10-15  8:27 ` [PATCH v3 06/13] mm: introduce generic lazy_mmu helpers Kevin Brodsky
2025-10-17 15:54   ` Alexander Gordeev
2025-10-20 10:32     ` Kevin Brodsky
2025-10-23 19:52   ` David Hildenbrand
2025-10-24 12:13     ` Kevin Brodsky
2025-10-24 13:27       ` David Hildenbrand
2025-10-24 14:32         ` Kevin Brodsky
2025-10-27 16:24           ` David Hildenbrand
2025-10-28 10:34             ` Kevin Brodsky
2025-10-15  8:27 ` [PATCH v3 07/13] mm: enable lazy_mmu sections to nest Kevin Brodsky
2025-10-23 20:00   ` David Hildenbrand
2025-10-24 12:16     ` Kevin Brodsky
2025-10-24 13:23       ` David Hildenbrand
2025-10-24 14:33         ` Kevin Brodsky
2025-10-15  8:27 ` [PATCH v3 08/13] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode() Kevin Brodsky
2025-10-15  8:27 ` [PATCH v3 09/13] powerpc/mm: replace batch->active " Kevin Brodsky
2025-10-23 20:02   ` David Hildenbrand
2025-10-24 12:16     ` Kevin Brodsky
2025-10-15  8:27 ` [PATCH v3 10/13] sparc/mm: " Kevin Brodsky
2025-10-23 20:03   ` David Hildenbrand
2025-10-15  8:27 ` [PATCH v3 11/13] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
2025-10-23 20:06   ` David Hildenbrand
2025-10-24 14:47     ` David Woodhouse
2025-10-24 14:51       ` David Hildenbrand
2025-10-24 15:13         ` David Woodhouse
2025-10-24 15:16           ` David Hildenbrand
2025-10-24 15:38           ` John Paul Adrian Glaubitz
2025-10-24 15:47             ` David Hildenbrand
2025-10-24 15:51               ` John Paul Adrian Glaubitz
2025-10-27 12:38                 ` David Hildenbrand
2025-10-24 22:52         ` Demi Marie Obenour
2025-10-27 12:29           ` David Hildenbrand
2025-10-27 13:32             ` Kevin Brodsky
2025-10-24 15:05       ` Kevin Brodsky
2025-10-24 15:17         ` David Woodhouse
2025-10-27 13:38           ` Kevin Brodsky
2025-10-15  8:27 ` [PATCH v3 12/13] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
2025-10-23 20:08   ` David Hildenbrand
2025-10-24 12:17     ` Kevin Brodsky
2025-10-15  8:27 ` [PATCH v3 13/13] mm: introduce arch_wants_lazy_mmu_mode() Kevin Brodsky
2025-10-23 20:10   ` David Hildenbrand
2025-10-24 12:17     ` Kevin Brodsky
