* [PATCH v4 00/12] Nesting support for lazy MMU mode
From: Kevin Brodsky @ 2025-10-29 10:08 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
When the lazy MMU mode was introduced eons ago, it wasn't made clear
whether such a sequence was legal:
arch_enter_lazy_mmu_mode()
...
arch_enter_lazy_mmu_mode()
...
arch_leave_lazy_mmu_mode()
...
arch_leave_lazy_mmu_mode()
It seems fair to say that nested calls to
arch_{enter,leave}_lazy_mmu_mode() were not expected, and most
architectures never explicitly supported nesting.
Nesting does in fact occur in certain configurations, and avoiding it
has proved difficult. This series therefore enables lazy_mmu sections to
nest, on all architectures.
Nesting is handled using a counter in task_struct (patch 7), like other
stateless APIs such as pagefault_{disable,enable}(). This is fully
handled in a new generic layer in <linux/pgtable.h>; the arch_* API
remains unchanged. A new pair of calls, lazy_mmu_mode_{pause,resume}(),
is also introduced to allow functions that are called with the lazy MMU
mode enabled to temporarily pause it, regardless of nesting.
An arch now opts in to the lazy MMU mode by selecting
CONFIG_ARCH_HAS_LAZY_MMU_MODE; this is more appropriate now that we have
a generic API, especially with state conditionally added to task_struct.
---
Background: Ryan Roberts' series from March [1] attempted to prevent
nesting from ever occurring, and mostly succeeded. Unfortunately, a
corner case (DEBUG_PAGEALLOC) may still cause nesting to occur on arm64.
Ryan proposed [2] to address that corner case at the generic level but
this approach received pushback; [3] then attempted to solve the issue
on arm64 only, but it was deemed too fragile.
It feels generally difficult to guarantee that lazy_mmu sections don't
nest, because callers of various standard mm functions do not know
whether those functions use lazy_mmu themselves.
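For instance, this is roughly how nesting arises in the DEBUG_PAGEALLOC
corner case on arm64 (simplified sketch, based on the arm64 comment that
patch 7 removes):
zap_pte_range()
    arch_enter_lazy_mmu_mode()
    ...
    /* a page allocation changes linear map permissions */
    apply_to_page_range()
        arch_enter_lazy_mmu_mode()    /* nested section */
        ...
        arch_leave_lazy_mmu_mode()
    ...
    arch_leave_lazy_mmu_mode()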
The overall approach in v3/v4 is very close to what David Hildenbrand
proposed on v2 [4].
Unlike in v1/v2, no special provision is made for architectures to
save/restore extra state when entering/leaving the mode. Based on the
discussions so far, this does not seem to be required - an arch can
store any relevant state in thread_struct during arch_enter() and
restore it in arch_leave(). Nesting is not a concern as these functions
are only called at the top level, not in nested sections.
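Should an arch need that, the idea is roughly the following (purely
illustrative sketch - the thread_struct field and the foo() helpers are
made up):
void arch_enter_lazy_mmu_mode(void)
{
        /* save whatever must be restored when leaving the mode */
        current->thread.lazy_mmu_saved_foo = read_foo();
}
void arch_leave_lazy_mmu_mode(void)
{
        /* restore the state saved by arch_enter() */
        write_foo(current->thread.lazy_mmu_saved_foo);
}
Since these callbacks only run at the top level, a single save slot in
thread_struct is enough.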
The introduction of a generic layer, and the tracking of the lazy MMU
state in task_struct, also make it possible to streamline the arch
callbacks - this series removes 72 lines from arch/.
Patch overview:
* Patch 1: cleanup - avoids having to deal with the powerpc
context-switching code
* Patches 2-4: prepare arch_flush_lazy_mmu_mode() to be called from the
generic layer (patch 7)
* Patches 5-6: new API + CONFIG_ARCH_HAS_LAZY_MMU_MODE
* Patch 7: nesting support
* Patches 8-12: move as much handling as possible to the generic layer
This series has been tested by running the mm kselftests on arm64 with
DEBUG_VM, DEBUG_PAGEALLOC and KFENCE. It was also build-tested on other
architectures (with and without XEN_PV on x86).
- Kevin
[1] https://lore.kernel.org/all/20250303141542.3371656-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20250530140446.2387131-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20250606135654.178300-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/all/ef343405-c394-4763-a79f-21381f217b6c@redhat.com/
---
Changelog
v3..v4:
- Patch 2: restored ordering of preempt_{disable,enable}() [Dave Hansen]
- Patch 5 onwards: s/ARCH_LAZY_MMU/ARCH_HAS_LAZY_MMU_MODE/ [Mike Rapoport]
- Patch 7: renamed lazy_mmu_state members, removed VM_BUG_ON(),
reordered writes to lazy_mmu_state members [David Hildenbrand]
- Dropped patch 13 as it doesn't seem justified [David H]
- Various improvements to commit messages [David H]
v3: https://lore.kernel.org/all/20251015082727.2395128-1-kevin.brodsky@arm.com/
v2..v3:
- Full rewrite; dropped all Acked-by/Reviewed-by.
- Rebased on v6.18-rc1.
v2: https://lore.kernel.org/all/20250908073931.4159362-1-kevin.brodsky@arm.com/
v1..v2:
- Rebased on mm-unstable.
- Patch 2: handled new calls to enter()/leave(), clarified how the "flush"
pattern (leave() followed by enter()) is handled.
- Patch 5,6: removed unnecessary local variable [Alexander Gordeev's
suggestion].
- Added Mike Rapoport's Acked-by.
v1: https://lore.kernel.org/all/20250904125736.3918646-1-kevin.brodsky@arm.com/
---
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: x86@kernel.org
---
Alexander Gordeev (1):
powerpc/64s: Do not re-activate batched TLB flush
Kevin Brodsky (11):
x86/xen: simplify flush_lazy_mmu()
powerpc/mm: implement arch_flush_lazy_mmu_mode()
sparc/mm: implement arch_flush_lazy_mmu_mode()
mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE
mm: introduce generic lazy_mmu helpers
mm: enable lazy_mmu sections to nest
arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
powerpc/mm: replace batch->active with in_lazy_mmu_mode()
sparc/mm: replace batch->active with in_lazy_mmu_mode()
x86/xen: use lazy_mmu_state when context-switching
mm: bail out of lazy_mmu_mode_* in interrupt context
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable.h | 46 +-------
arch/arm64/include/asm/thread_info.h | 3 +-
arch/arm64/mm/mmu.c | 4 +-
arch/arm64/mm/pageattr.c | 4 +-
.../include/asm/book3s/64/tlbflush-hash.h | 22 ++--
arch/powerpc/include/asm/thread_info.h | 2 -
arch/powerpc/kernel/process.c | 25 -----
arch/powerpc/mm/book3s64/hash_tlb.c | 10 +-
arch/powerpc/mm/book3s64/subpage_prot.c | 4 +-
arch/powerpc/platforms/Kconfig.cputype | 1 +
arch/sparc/Kconfig | 1 +
arch/sparc/include/asm/tlbflush_64.h | 5 +-
arch/sparc/mm/tlb.c | 14 +--
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/misc.h | 1 +
arch/x86/boot/startup/sme.c | 1 +
arch/x86/include/asm/paravirt.h | 1 -
arch/x86/include/asm/pgtable.h | 3 +-
arch/x86/include/asm/thread_info.h | 4 +-
arch/x86/xen/enlighten_pv.c | 3 +-
arch/x86/xen/mmu_pv.c | 6 +-
fs/proc/task_mmu.c | 4 +-
include/linux/mm_types_task.h | 5 +
include/linux/pgtable.h | 104 +++++++++++++++++-
include/linux/sched.h | 19 ++++
mm/Kconfig | 3 +
mm/kasan/shadow.c | 8 +-
mm/madvise.c | 18 +--
mm/memory.c | 16 +--
mm/migrate_device.c | 4 +-
mm/mprotect.c | 4 +-
mm/mremap.c | 4 +-
mm/userfaultfd.c | 4 +-
mm/vmalloc.c | 12 +-
mm/vmscan.c | 12 +-
36 files changed, 213 insertions(+), 166 deletions(-)
base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa
--
2.47.0
* [PATCH v4 01/12] powerpc/64s: Do not re-activate batched TLB flush
From: Kevin Brodsky @ 2025-10-29 10:08 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
From: Alexander Gordeev <agordeev@linux.ibm.com>
Since commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash
lazy mmu mode") a task can not be preempted while in lazy MMU mode.
Therefore, the batch re-activation code is never called, so remove it.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/powerpc/include/asm/thread_info.h | 2 --
arch/powerpc/kernel/process.c | 25 -------------------------
2 files changed, 27 deletions(-)
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index b0f200aba2b3..97f35f9b1a96 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -154,12 +154,10 @@ void arch_setup_new_exec(void);
/* Don't move TLF_NAPPING without adjusting the code in entry_32.S */
#define TLF_NAPPING 0 /* idle thread enabled NAP mode */
#define TLF_SLEEPING 1 /* suspend code enabled SLEEP mode */
-#define TLF_LAZY_MMU 3 /* tlb_batch is active */
#define TLF_RUNLATCH 4 /* Is the runlatch enabled? */
#define _TLF_NAPPING (1 << TLF_NAPPING)
#define _TLF_SLEEPING (1 << TLF_SLEEPING)
-#define _TLF_LAZY_MMU (1 << TLF_LAZY_MMU)
#define _TLF_RUNLATCH (1 << TLF_RUNLATCH)
#ifndef __ASSEMBLER__
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index eb23966ac0a9..9237dcbeee4a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1281,9 +1281,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
{
struct thread_struct *new_thread, *old_thread;
struct task_struct *last;
-#ifdef CONFIG_PPC_64S_HASH_MMU
- struct ppc64_tlb_batch *batch;
-#endif
new_thread = &new->thread;
old_thread = &current->thread;
@@ -1291,14 +1288,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
WARN_ON(!irqs_disabled());
#ifdef CONFIG_PPC_64S_HASH_MMU
- batch = this_cpu_ptr(&ppc64_tlb_batch);
- if (batch->active) {
- current_thread_info()->local_flags |= _TLF_LAZY_MMU;
- if (batch->index)
- __flush_tlb_pending(batch);
- batch->active = 0;
- }
-
/*
* On POWER9 the copy-paste buffer can only paste into
* foreign real addresses, so unprivileged processes can not
@@ -1369,20 +1358,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
*/
#ifdef CONFIG_PPC_BOOK3S_64
-#ifdef CONFIG_PPC_64S_HASH_MMU
- /*
- * This applies to a process that was context switched while inside
- * arch_enter_lazy_mmu_mode(), to re-activate the batch that was
- * deactivated above, before _switch(). This will never be the case
- * for new tasks.
- */
- if (current_thread_info()->local_flags & _TLF_LAZY_MMU) {
- current_thread_info()->local_flags &= ~_TLF_LAZY_MMU;
- batch = this_cpu_ptr(&ppc64_tlb_batch);
- batch->active = 1;
- }
-#endif
-
/*
* Math facilities are masked out of the child MSR in copy_thread.
* A new task does not need to restore_math because it will
--
2.47.0
* [PATCH v4 02/12] x86/xen: simplify flush_lazy_mmu()
From: Kevin Brodsky @ 2025-10-29 10:08 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
arch_flush_lazy_mmu_mode() is called when outstanding batched
pgtable operations must be completed immediately. There should
however be no need to fully leave and re-enter the lazy MMU mode. The
only part of that sequence that we really need is xen_mc_flush();
call it directly.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/x86/xen/mmu_pv.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2a4a8deaf612..7a35c3393df4 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2139,10 +2139,8 @@ static void xen_flush_lazy_mmu(void)
{
preempt_disable();
- if (xen_get_lazy_mode() == XEN_LAZY_MMU) {
- arch_leave_lazy_mmu_mode();
- arch_enter_lazy_mmu_mode();
- }
+ if (xen_get_lazy_mode() == XEN_LAZY_MMU)
+ xen_mc_flush();
preempt_enable();
}
--
2.47.0
* [PATCH v4 03/12] powerpc/mm: implement arch_flush_lazy_mmu_mode()
From: Kevin Brodsky @ 2025-10-29 10:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
Upcoming changes to the lazy_mmu API will cause
arch_flush_lazy_mmu_mode() to be called when leaving a nested
lazy_mmu section.
Move the relevant logic from arch_leave_lazy_mmu_mode() to
arch_flush_lazy_mmu_mode() and have the former call the latter.
Note: the additional this_cpu_ptr() on the
arch_leave_lazy_mmu_mode() path will be removed in a subsequent
patch.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
.../powerpc/include/asm/book3s/64/tlbflush-hash.h | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 146287d9580f..7704dbe8e88d 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -41,6 +41,16 @@ static inline void arch_enter_lazy_mmu_mode(void)
batch->active = 1;
}
+static inline void arch_flush_lazy_mmu_mode(void)
+{
+ struct ppc64_tlb_batch *batch;
+
+ batch = this_cpu_ptr(&ppc64_tlb_batch);
+
+ if (batch->index)
+ __flush_tlb_pending(batch);
+}
+
static inline void arch_leave_lazy_mmu_mode(void)
{
struct ppc64_tlb_batch *batch;
@@ -49,14 +59,11 @@ static inline void arch_leave_lazy_mmu_mode(void)
return;
batch = this_cpu_ptr(&ppc64_tlb_batch);
- if (batch->index)
- __flush_tlb_pending(batch);
+ arch_flush_lazy_mmu_mode();
batch->active = 0;
preempt_enable();
}
-#define arch_flush_lazy_mmu_mode() do {} while (0)
-
extern void hash__tlbiel_all(unsigned int action);
extern void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize,
--
2.47.0
* [PATCH v4 04/12] sparc/mm: implement arch_flush_lazy_mmu_mode()
From: Kevin Brodsky @ 2025-10-29 10:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
Upcoming changes to the lazy_mmu API will cause
arch_flush_lazy_mmu_mode() to be called when leaving a nested
lazy_mmu section.
Move the relevant logic from arch_leave_lazy_mmu_mode() to
arch_flush_lazy_mmu_mode() and have the former call the latter.
Note: the additional this_cpu_ptr() on the
arch_leave_lazy_mmu_mode() path will be removed in a subsequent
patch.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/sparc/include/asm/tlbflush_64.h | 2 +-
arch/sparc/mm/tlb.c | 9 ++++++++-
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index 8b8cdaa69272..925bb5d7a4e1 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -43,8 +43,8 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end);
void flush_tlb_pending(void);
void arch_enter_lazy_mmu_mode(void);
+void arch_flush_lazy_mmu_mode(void);
void arch_leave_lazy_mmu_mode(void);
-#define arch_flush_lazy_mmu_mode() do {} while (0)
/* Local cpu only. */
void __flush_tlb_all(void);
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index a35ddcca5e76..7b5dfcdb1243 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -59,12 +59,19 @@ void arch_enter_lazy_mmu_mode(void)
tb->active = 1;
}
-void arch_leave_lazy_mmu_mode(void)
+void arch_flush_lazy_mmu_mode(void)
{
struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
if (tb->tlb_nr)
flush_tlb_pending();
+}
+
+void arch_leave_lazy_mmu_mode(void)
+{
+ struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
+
+ arch_flush_lazy_mmu_mode();
tb->active = 0;
preempt_enable();
}
--
2.47.0
* [PATCH v4 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE
From: Kevin Brodsky @ 2025-10-29 10:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
Architectures currently opt in for implementing lazy_mmu helpers by
defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE.
In preparation for introducing a generic lazy_mmu layer that will
require storage in task_struct, let's switch to a cleaner approach:
instead of defining a macro, select a CONFIG option.
This patch introduces CONFIG_ARCH_HAS_LAZY_MMU_MODE and has each
arch select it when it implements lazy_mmu helpers.
__HAVE_ARCH_ENTER_LAZY_MMU_MODE is removed and <linux/pgtable.h>
relies on the new CONFIG instead.
On x86, lazy_mmu helpers are only implemented if PARAVIRT_XXL is
selected. This creates some complications in arch/x86/boot/, because
a few files manually undefine PARAVIRT* options. As a result
<asm/paravirt.h> does not define the lazy_mmu helpers, which breaks
the build because <linux/pgtable.h> only provides the fallback
definitions if !CONFIG_ARCH_HAS_LAZY_MMU_MODE. There does not seem to
be a clean way out of this - let's just undefine that new CONFIG too.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable.h | 1 -
arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 2 --
arch/powerpc/platforms/Kconfig.cputype | 1 +
arch/sparc/Kconfig | 1 +
arch/sparc/include/asm/tlbflush_64.h | 2 --
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/misc.h | 1 +
arch/x86/boot/startup/sme.c | 1 +
arch/x86/include/asm/paravirt.h | 1 -
include/linux/pgtable.h | 2 +-
mm/Kconfig | 3 +++
12 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6663ffd23f25..e6bf5c7311b5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -122,6 +122,7 @@ config ARM64
select ARCH_WANTS_NO_INSTR
select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
select ARCH_HAS_UBSAN
+ select ARCH_HAS_LAZY_MMU_MODE
select ARM_AMBA
select ARM_ARCH_TIMER
select ARM_GIC
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0944e296dd4a..54f8d6bb6f22 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -80,7 +80,6 @@ static inline void queue_pte_barriers(void)
}
}
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
static inline void arch_enter_lazy_mmu_mode(void)
{
/*
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 7704dbe8e88d..623a8a8b2d0e 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -24,8 +24,6 @@ DECLARE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch);
extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch);
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-
static inline void arch_enter_lazy_mmu_mode(void)
{
struct ppc64_tlb_batch *batch;
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 7b527d18aa5e..2942d57cf59c 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -93,6 +93,7 @@ config PPC_BOOK3S_64
select IRQ_WORK
select PPC_64S_HASH_MMU if !PPC_RADIX_MMU
select KASAN_VMALLOC if KASAN
+ select ARCH_HAS_LAZY_MMU_MODE
config PPC_BOOK3E_64
bool "Embedded processors"
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index a630d373e645..2bad14744ca4 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -112,6 +112,7 @@ config SPARC64
select NEED_PER_CPU_PAGE_FIRST_CHUNK
select ARCH_SUPPORTS_SCHED_SMT if SMP
select ARCH_SUPPORTS_SCHED_MC if SMP
+ select ARCH_HAS_LAZY_MMU_MODE
config ARCH_PROC_KCORE_TEXT
def_bool y
diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index 925bb5d7a4e1..4e1036728e2f 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -39,8 +39,6 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
void flush_tlb_kernel_range(unsigned long start, unsigned long end);
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-
void flush_tlb_pending(void);
void arch_enter_lazy_mmu_mode(void);
void arch_flush_lazy_mmu_mode(void);
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fa3b616af03a..ef4332d720ab 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -804,6 +804,7 @@ config PARAVIRT
config PARAVIRT_XXL
bool
depends on X86_64
+ select ARCH_HAS_LAZY_MMU_MODE
config PARAVIRT_DEBUG
bool "paravirt-ops debugging"
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index db1048621ea2..cdd7f692d9ee 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -11,6 +11,7 @@
#undef CONFIG_PARAVIRT
#undef CONFIG_PARAVIRT_XXL
#undef CONFIG_PARAVIRT_SPINLOCKS
+#undef CONFIG_ARCH_HAS_LAZY_MMU_MODE
#undef CONFIG_KASAN
#undef CONFIG_KASAN_GENERIC
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index e7ea65f3f1d6..b76a7c95dfe1 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -24,6 +24,7 @@
#undef CONFIG_PARAVIRT
#undef CONFIG_PARAVIRT_XXL
#undef CONFIG_PARAVIRT_SPINLOCKS
+#undef CONFIG_ARCH_HAS_LAZY_MMU_MODE
/*
* This code runs before CPU feature bits are set. By default, the
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index b5e59a7ba0d0..13f9cd31c8f8 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -526,7 +526,6 @@ static inline void arch_end_context_switch(struct task_struct *next)
PVOP_VCALL1(cpu.end_context_switch, next);
}
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
static inline void arch_enter_lazy_mmu_mode(void)
{
PVOP_VCALL0(mmu.lazy_mode.enter);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 32e8457ad535..9894366e768b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -231,7 +231,7 @@ static inline int pmd_dirty(pmd_t pmd)
* held, but for kernel PTE updates, no lock is held). Nesting is not permitted
* and the mode cannot be used in interrupt context.
*/
-#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+#ifndef CONFIG_ARCH_HAS_LAZY_MMU_MODE
static inline void arch_enter_lazy_mmu_mode(void) {}
static inline void arch_leave_lazy_mmu_mode(void) {}
static inline void arch_flush_lazy_mmu_mode(void) {}
diff --git a/mm/Kconfig b/mm/Kconfig
index 0e26f4fc8717..5480c9a1bfb2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1372,6 +1372,9 @@ config PT_RECLAIM
config FIND_NORMAL_PAGE
def_bool n
+config ARCH_HAS_LAZY_MMU_MODE
+ bool
+
source "mm/damon/Kconfig"
endmenu
--
2.47.0
* [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
From: Kevin Brodsky @ 2025-10-29 10:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
The implementation of the lazy MMU mode is currently entirely
arch-specific; core code directly calls arch helpers:
arch_{enter,leave}_lazy_mmu_mode().
We are about to introduce support for nested lazy MMU sections.
As things stand we'd have to duplicate that logic in every arch
implementing lazy_mmu - adding to a fair amount of logic
already duplicated across lazy_mmu implementations.
This patch therefore introduces a new generic layer that calls the
existing arch_* helpers. Two pairs of calls are introduced:
* lazy_mmu_mode_enable() ... lazy_mmu_mode_disable()
This is the standard case where the mode is enabled for a given
block of code by surrounding it with enable() and disable()
calls.
* lazy_mmu_mode_pause() ... lazy_mmu_mode_resume()
This is for situations where the mode needs to be temporarily paused
by calling pause() and later re-enabled with resume() (e.g. to prevent
any batching from occurring in a critical section).
The documentation in <linux/pgtable.h> will be updated in a
subsequent patch.
No functional change should be introduced at this stage.
The implementation of enable()/resume() and disable()/pause() is
currently identical, but nesting support will change that.
Most of the call sites have been updated using the following
Coccinelle script:
@@
@@
{
...
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
...
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
...
}
@@
@@
{
...
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_pause();
...
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_resume();
...
}
A couple of notes regarding x86:
* Xen is currently the only case where explicit handling is required
for lazy MMU when context-switching. This is purely an
implementation detail and using the generic lazy_mmu_mode_*
functions would cause trouble when nesting support is introduced,
because the generic functions must be called from the current task.
For that reason we still use arch_leave() and arch_enter() there.
* x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few
places, but only defines it if PARAVIRT_XXL is selected, and we
are removing the fallback in <linux/pgtable.h>. Add a new fallback
definition to <asm/pgtable.h> to keep things building.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/arm64/mm/mmu.c | 4 ++--
arch/arm64/mm/pageattr.c | 4 ++--
arch/powerpc/mm/book3s64/hash_tlb.c | 8 +++----
arch/powerpc/mm/book3s64/subpage_prot.c | 4 ++--
arch/x86/include/asm/pgtable.h | 3 ++-
fs/proc/task_mmu.c | 4 ++--
include/linux/pgtable.h | 29 +++++++++++++++++++++----
mm/kasan/shadow.c | 8 +++----
mm/madvise.c | 18 +++++++--------
mm/memory.c | 16 +++++++-------
mm/migrate_device.c | 4 ++--
mm/mprotect.c | 4 ++--
mm/mremap.c | 4 ++--
mm/userfaultfd.c | 4 ++--
mm/vmalloc.c | 12 +++++-----
mm/vmscan.c | 12 +++++-----
16 files changed, 80 insertions(+), 58 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b8d37eb037fc..d9c8e94f140f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -731,7 +731,7 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
return -EINVAL;
mutex_lock(&pgtable_split_lock);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
/*
* The split_kernel_leaf_mapping_locked() may sleep, it is not a
@@ -753,7 +753,7 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
ret = split_kernel_leaf_mapping_locked(end);
}
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
mutex_unlock(&pgtable_split_lock);
return ret;
}
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 5135f2d66958..e4059f13c4ed 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -110,7 +110,7 @@ static int update_range_prot(unsigned long start, unsigned long size,
if (WARN_ON_ONCE(ret))
return ret;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
/*
* The caller must ensure that the range we are operating on does not
@@ -119,7 +119,7 @@ static int update_range_prot(unsigned long start, unsigned long size,
*/
ret = walk_kernel_page_table_range_lockless(start, start + size,
&pageattr_ops, NULL, &data);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
return ret;
}
diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index 21fcad97ae80..787f7a0e27f0 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -205,7 +205,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
* way to do things but is fine for our needs here.
*/
local_irq_save(flags);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
for (; start < end; start += PAGE_SIZE) {
pte_t *ptep = find_init_mm_pte(start, &hugepage_shift);
unsigned long pte;
@@ -217,7 +217,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
continue;
hpte_need_flush(&init_mm, start, ptep, pte, hugepage_shift);
}
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
local_irq_restore(flags);
}
@@ -237,7 +237,7 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
* way to do things but is fine for our needs here.
*/
local_irq_save(flags);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
start_pte = pte_offset_map(pmd, addr);
if (!start_pte)
goto out;
@@ -249,6 +249,6 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
}
pte_unmap(start_pte);
out:
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
local_irq_restore(flags);
}
diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index ec98e526167e..07c47673bba2 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -73,13 +73,13 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte)
return;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
for (; npages > 0; --npages) {
pte_update(mm, addr, pte, 0, 0, 0);
addr += PAGE_SIZE;
++pte;
}
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(pte - 1, ptl);
}
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e33df3da6980..14fd672bc9b2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -117,7 +117,8 @@ extern pmdval_t early_pmd_flags;
#define pte_val(x) native_pte_val(x)
#define __pte(x) native_make_pte(x)
-#define arch_end_context_switch(prev) do {} while(0)
+#define arch_end_context_switch(prev) do {} while (0)
+#define arch_flush_lazy_mmu_mode() do {} while (0)
#endif /* CONFIG_PARAVIRT_XXL */
static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc35a0543f01..d16ba1d32169 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2703,7 +2703,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
return 0;
}
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
if ((p->arg.flags & PM_SCAN_WP_MATCHING) && !p->vec_out) {
/* Fast path for performing exclusive WP */
@@ -2773,7 +2773,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
if (flush_end)
flush_tlb_range(vma, start, addr);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(start_pte, ptl);
cond_resched();
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 9894366e768b..b5fdf32c437f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -231,10 +231,31 @@ static inline int pmd_dirty(pmd_t pmd)
* held, but for kernel PTE updates, no lock is held). Nesting is not permitted
* and the mode cannot be used in interrupt context.
*/
-#ifndef CONFIG_ARCH_HAS_LAZY_MMU_MODE
-static inline void arch_enter_lazy_mmu_mode(void) {}
-static inline void arch_leave_lazy_mmu_mode(void) {}
-static inline void arch_flush_lazy_mmu_mode(void) {}
+#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
+static inline void lazy_mmu_mode_enable(void)
+{
+ arch_enter_lazy_mmu_mode();
+}
+
+static inline void lazy_mmu_mode_disable(void)
+{
+ arch_leave_lazy_mmu_mode();
+}
+
+static inline void lazy_mmu_mode_pause(void)
+{
+ arch_leave_lazy_mmu_mode();
+}
+
+static inline void lazy_mmu_mode_resume(void)
+{
+ arch_enter_lazy_mmu_mode();
+}
+#else
+static inline void lazy_mmu_mode_enable(void) {}
+static inline void lazy_mmu_mode_disable(void) {}
+static inline void lazy_mmu_mode_pause(void) {}
+static inline void lazy_mmu_mode_resume(void) {}
#endif
#ifndef pte_batch_hint
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 5d2a876035d6..c49b029d3593 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -305,7 +305,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
pte_t pte;
int index;
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_pause();
index = PFN_DOWN(addr - data->start);
page = data->pages[index];
@@ -319,7 +319,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
}
spin_unlock(&init_mm.page_table_lock);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_resume();
return 0;
}
@@ -482,7 +482,7 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
pte_t pte;
int none;
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_pause();
spin_lock(&init_mm.page_table_lock);
pte = ptep_get(ptep);
@@ -494,7 +494,7 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
if (likely(!none))
__free_page(pfn_to_page(pte_pfn(pte)));
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_resume();
return 0;
}
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..536026772160 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -455,7 +455,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (!start_pte)
return 0;
flush_tlb_batched_pending(mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
nr = 1;
ptent = ptep_get(pte);
@@ -463,7 +463,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (++batch_count == SWAP_CLUSTER_MAX) {
batch_count = 0;
if (need_resched()) {
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(start_pte, ptl);
cond_resched();
goto restart;
@@ -499,7 +499,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (!folio_trylock(folio))
continue;
folio_get(folio);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(start_pte, ptl);
start_pte = NULL;
err = split_folio(folio);
@@ -510,7 +510,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (!start_pte)
break;
flush_tlb_batched_pending(mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
if (!err)
nr = 0;
continue;
@@ -558,7 +558,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
}
if (start_pte) {
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(start_pte, ptl);
}
if (pageout)
@@ -677,7 +677,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (!start_pte)
return 0;
flush_tlb_batched_pending(mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
nr = 1;
ptent = ptep_get(pte);
@@ -727,7 +727,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (!folio_trylock(folio))
continue;
folio_get(folio);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(start_pte, ptl);
start_pte = NULL;
err = split_folio(folio);
@@ -738,7 +738,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (!start_pte)
break;
flush_tlb_batched_pending(mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
if (!err)
nr = 0;
continue;
@@ -778,7 +778,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (nr_swap)
add_mm_counter(mm, MM_SWAPENTS, nr_swap);
if (start_pte) {
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(start_pte, ptl);
}
cond_resched();
diff --git a/mm/memory.c b/mm/memory.c
index 74b45e258323..2d662dee5ae7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1254,7 +1254,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
orig_src_pte = src_pte;
orig_dst_pte = dst_pte;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
nr = 1;
@@ -1323,7 +1323,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
} while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr,
addr != end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(orig_src_pte, src_ptl);
add_mm_rss_vec(dst_mm, rss);
pte_unmap_unlock(orig_dst_pte, dst_ptl);
@@ -1842,7 +1842,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
return addr;
flush_tlb_batched_pending(mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
bool any_skipped = false;
@@ -1874,7 +1874,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
add_mm_rss_vec(mm, rss);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
/* Do the actual TLB flush before dropping ptl */
if (force_flush) {
@@ -2817,7 +2817,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
return -ENOMEM;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
BUG_ON(!pte_none(ptep_get(pte)));
if (!pfn_modify_allowed(pfn, prot)) {
@@ -2827,7 +2827,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
pfn++;
} while (pte++, addr += PAGE_SIZE, addr != end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(mapped_pte, ptl);
return err;
}
@@ -3134,7 +3134,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
return -EINVAL;
}
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
if (fn) {
do {
@@ -3147,7 +3147,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
}
*mask |= PGTBL_PTE_MODIFIED;
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
if (mm != &init_mm)
pte_unmap_unlock(mapped_pte, ptl);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index abd9f6850db6..dcdc46b96cc7 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -110,7 +110,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
if (!ptep)
goto again;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
for (; addr < end; addr += PAGE_SIZE, ptep++) {
struct dev_pagemap *pgmap;
@@ -287,7 +287,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (unmapped)
flush_tlb_range(walk->vma, start, end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(ptep - 1, ptl);
return 0;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 113b48985834..bcb183a6fd2f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -293,7 +293,7 @@ static long change_pte_range(struct mmu_gather *tlb,
target_node = numa_node_id();
flush_tlb_batched_pending(vma->vm_mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
nr_ptes = 1;
oldpte = ptep_get(pte);
@@ -439,7 +439,7 @@ static long change_pte_range(struct mmu_gather *tlb,
}
}
} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(pte - 1, ptl);
return pages;
diff --git a/mm/mremap.c b/mm/mremap.c
index bd7314898ec5..a2e2cd8f279a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -256,7 +256,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
flush_tlb_batched_pending(vma->vm_mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
@@ -301,7 +301,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
}
}
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
if (force_flush)
flush_tlb_range(vma, old_end - len, old_end);
if (new_ptl != old_ptl)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af61b95c89e4..e01f7813e15c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1100,7 +1100,7 @@ static long move_present_ptes(struct mm_struct *mm,
/* It's safe to drop the reference now as the page-table is holding one. */
folio_put(*first_src_folio);
*first_src_folio = NULL;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
while (true) {
orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1138,7 +1138,7 @@ static long move_present_ptes(struct mm_struct *mm,
break;
}
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
if (src_addr > src_start)
flush_tlb_range(src_vma, src_start, src_addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 798b2ed21e46..b9940590a40d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -105,7 +105,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
if (!pte)
return -ENOMEM;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
if (unlikely(!pte_none(ptep_get(pte)))) {
@@ -131,7 +131,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pfn++;
} while (pte += PFN_DOWN(size), addr += size, addr != end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
*mask |= PGTBL_PTE_MODIFIED;
return 0;
}
@@ -359,7 +359,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
unsigned long size = PAGE_SIZE;
pte = pte_offset_kernel(pmd, addr);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
#ifdef CONFIG_HUGETLB_PAGE
@@ -378,7 +378,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
WARN_ON(!pte_none(ptent) && !pte_present(ptent));
} while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
*mask |= PGTBL_PTE_MODIFIED;
}
@@ -526,7 +526,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
if (!pte)
return -ENOMEM;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
struct page *page = pages[*nr];
@@ -548,7 +548,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
(*nr)++;
} while (pte++, addr += PAGE_SIZE, addr != end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
*mask |= PGTBL_PTE_MODIFIED;
return err;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b2fc8b626d3d..7d2d87069530 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3551,7 +3551,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
return false;
}
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
restart:
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
unsigned long pfn;
@@ -3592,7 +3592,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
goto restart;
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(pte, ptl);
return suitable_to_scan(total, young);
@@ -3633,7 +3633,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
if (!spin_trylock(ptl))
goto done;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
unsigned long pfn;
@@ -3680,7 +3680,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
walk_update_folio(walk, last, gen, dirty);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
spin_unlock(ptl);
done:
*first = -1;
@@ -4279,7 +4279,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
}
}
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
pte -= (addr - start) / PAGE_SIZE;
@@ -4313,7 +4313,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
walk_update_folio(walk, last, gen, dirty);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
/* feedback from rmap walkers to page table walkers */
if (mm_state && suitable_to_scan(i, young))
--
2.47.0
* [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
From: Kevin Brodsky @ 2025-10-29 10:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
Despite recent efforts to prevent lazy_mmu sections from nesting, it
remains difficult to ensure that it never occurs - and in fact it
does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC).
Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
made nesting tolerable on arm64, but without truly supporting it:
the inner call to leave() disables the batching optimisation before
the outer section ends.
This patch actually enables lazy_mmu sections to nest by tracking
the nesting level in task_struct, in a similar fashion to e.g.
pagefault_{enable,disable}(). This is fully handled by the generic
lazy_mmu helpers that were recently introduced.
lazy_mmu sections were not initially intended to nest, so we need to
clarify the semantics w.r.t. the arch_*_lazy_mmu_mode() callbacks.
This patch takes the following approach:
* The outermost calls to lazy_mmu_mode_{enable,disable}() trigger
calls to arch_{enter,leave}_lazy_mmu_mode() - this is unchanged.
* Nested calls to lazy_mmu_mode_{enable,disable}() are not forwarded
to the arch via arch_{enter,leave} - lazy MMU remains enabled so
the assumption is that these callbacks are not relevant. However,
existing code may rely on a call to disable() to flush any batched
state, regardless of nesting. arch_flush_lazy_mmu_mode() is
therefore called in that situation.
A separate interface was recently introduced to temporarily pause
the lazy MMU mode: lazy_mmu_mode_{pause,resume}(). pause() fully
exits the mode *regardless of the nesting level*, and resume()
restores the mode at the same nesting level.
Whether the mode is actually enabled or not at any point is tracked
by a separate "active" field in task_struct; this makes it possible
to check invariants in the generic API, and to expose a new
in_lazy_mmu_mode() helper to replace the various ways arch's
currently track whether the mode is enabled (this will be done in
later patches).
In summary (nesting/active represent the values *after* the call):
lazy_mmu_mode_enable() -> arch_enter() nesting=1 active=1
lazy_mmu_mode_enable() -> ø nesting=2 active=1
lazy_mmu_mode_pause() -> arch_leave() nesting=2 active=0
lazy_mmu_mode_resume() -> arch_enter() nesting=2 active=1
lazy_mmu_mode_disable() -> arch_flush() nesting=1 active=1
lazy_mmu_mode_disable() -> arch_leave() nesting=0 active=0
Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
headers included by <linux/pgtable.h> to use it.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/arm64/include/asm/pgtable.h | 12 ------
include/linux/mm_types_task.h | 5 +++
include/linux/pgtable.h | 67 ++++++++++++++++++++++++++++++--
include/linux/sched.h | 16 ++++++++
4 files changed, 84 insertions(+), 16 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 54f8d6bb6f22..535435248923 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -82,18 +82,6 @@ static inline void queue_pte_barriers(void)
static inline void arch_enter_lazy_mmu_mode(void)
{
- /*
- * lazy_mmu_mode is not supposed to permit nesting. But in practice this
- * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
- * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
- * permissions on the linear map with apply_to_page_range(), which
- * re-enters lazy_mmu_mode. So we tolerate nesting in our
- * implementation. The first call to arch_leave_lazy_mmu_mode() will
- * flush and clear the flag such that the remainder of the work in the
- * outer nest behaves as if outside of lazy mmu mode. This is safe and
- * keeps tracking simple.
- */
-
if (in_interrupt())
return;
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index a82aa80c0ba4..632d404f8191 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -88,4 +88,9 @@ struct tlbflush_unmap_batch {
#endif
};
+struct lazy_mmu_state {
+ u8 nesting_level;
+ bool active;
+};
+
#endif /* _LINUX_MM_TYPES_TASK_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b5fdf32c437f..e6064e00b22d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -228,27 +228,86 @@ static inline int pmd_dirty(pmd_t pmd)
* of the lazy mode. So the implementation must assume preemption may be enabled
* and cpu migration is possible; it must take steps to be robust against this.
* (In practice, for user PTE updates, the appropriate page table lock(s) are
- * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
- * and the mode cannot be used in interrupt context.
+ * held, but for kernel PTE updates, no lock is held). The mode cannot be used
+ * in interrupt context.
+ *
+ * The lazy MMU mode is enabled for a given block of code using:
+ *
+ * lazy_mmu_mode_enable();
+ * <code>
+ * lazy_mmu_mode_disable();
+ *
+ * Nesting is permitted: <code> may itself use an enable()/disable() pair.
+ * A nested call to enable() has no functional effect; however disable() causes
+ * any batched architectural state to be flushed regardless of nesting. After a
+ * call to disable(), the caller can therefore rely on all previous page table
+ * modifications to have taken effect, but the lazy MMU mode may still be
+ * enabled.
+ *
+ * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
+ * This can be done using:
+ *
+ * lazy_mmu_mode_pause();
+ * <code>
+ * lazy_mmu_mode_resume();
+ *
+ * This sequence must only be used if the lazy MMU mode is already enabled.
+ * pause() ensures that the mode is exited regardless of the nesting level;
+ * resume() re-enters the mode at the same nesting level. <code> must not modify
+ * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
+ * helpers).
+ *
+ * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
+ * currently enabled.
*/
#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
static inline void lazy_mmu_mode_enable(void)
{
- arch_enter_lazy_mmu_mode();
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
+ /* enable() must not be called while paused */
+ VM_WARN_ON(state->nesting_level > 0 && !state->active);
+
+ if (state->nesting_level++ == 0) {
+ state->active = true;
+ arch_enter_lazy_mmu_mode();
+ }
}
static inline void lazy_mmu_mode_disable(void)
{
- arch_leave_lazy_mmu_mode();
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ VM_WARN_ON_ONCE(state->nesting_level == 0);
+ VM_WARN_ON(!state->active);
+
+ if (--state->nesting_level == 0) {
+ state->active = false;
+ arch_leave_lazy_mmu_mode();
+ } else {
+ /* Exiting a nested section */
+ arch_flush_lazy_mmu_mode();
+ }
}
static inline void lazy_mmu_mode_pause(void)
{
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ VM_WARN_ON(state->nesting_level == 0 || !state->active);
+
+ state->active = false;
arch_leave_lazy_mmu_mode();
}
static inline void lazy_mmu_mode_resume(void)
{
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ VM_WARN_ON(state->nesting_level == 0 || state->active);
+
+ state->active = true;
arch_enter_lazy_mmu_mode();
}
#else
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cbb7340c5866..11566d973f42 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1441,6 +1441,10 @@ struct task_struct {
struct page_frag task_frag;
+#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
+ struct lazy_mmu_state lazy_mmu_state;
+#endif
+
#ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info *delays;
#endif
@@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
return task_index_to_char(task_state_index(tsk));
}
+#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
+static inline bool in_lazy_mmu_mode(void)
+{
+ return current->lazy_mmu_state.active;
+}
+#else
+static inline bool in_lazy_mmu_mode(void)
+{
+ return false;
+}
+#endif
+
extern struct pid *cad_pid;
/*
--
2.47.0
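For reference, here is a minimal usage sketch of the semantics documented in
this patch (example() and update_ptes() are hypothetical names used purely for
illustration, not part of the series):

static void example(void)
{
	lazy_mmu_mode_enable();		/* nesting_level 0 -> 1: arch_enter() */
	update_ptes();			/* updates may be batched */

	lazy_mmu_mode_enable();		/* nesting_level 1 -> 2: no functional effect */
	update_ptes();
	lazy_mmu_mode_disable();	/* nested exit: arch_flush() only */

	lazy_mmu_mode_pause();		/* mode exited regardless of nesting level */
	update_ptes();			/* applied immediately */
	lazy_mmu_mode_resume();		/* mode re-entered at the same nesting level */

	lazy_mmu_mode_disable();	/* nesting_level 1 -> 0: arch_leave() */
}

Only the outermost disable() actually leaves the mode; a nested disable()
still guarantees that all previously batched updates have taken effect.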
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v4 08/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
2025-10-29 10:08 [PATCH v4 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (6 preceding siblings ...)
2025-10-29 10:09 ` [PATCH v4 07/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
@ 2025-10-29 10:09 ` Kevin Brodsky
2025-11-03 16:03 ` David Hildenbrand
2025-11-07 15:28 ` Ryan Roberts
2025-10-29 10:09 ` [PATCH v4 09/12] powerpc/mm: replace batch->active " Kevin Brodsky
` (3 subsequent siblings)
11 siblings, 2 replies; 62+ messages in thread
From: Kevin Brodsky @ 2025-10-29 10:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
The generic lazy_mmu layer now tracks whether a task is in lazy MMU
mode. As a result we no longer need a TIF flag for that purpose -
let's use the new in_lazy_mmu_mode() helper instead.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/arm64/include/asm/pgtable.h | 16 +++-------------
arch/arm64/include/asm/thread_info.h | 3 +--
2 files changed, 4 insertions(+), 15 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 535435248923..61ca88f94551 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -62,30 +62,21 @@ static inline void emit_pte_barriers(void)
static inline void queue_pte_barriers(void)
{
- unsigned long flags;
-
if (in_interrupt()) {
emit_pte_barriers();
return;
}
- flags = read_thread_flags();
-
- if (flags & BIT(TIF_LAZY_MMU)) {
- /* Avoid the atomic op if already set. */
- if (!(flags & BIT(TIF_LAZY_MMU_PENDING)))
- set_thread_flag(TIF_LAZY_MMU_PENDING);
- } else {
+ if (in_lazy_mmu_mode())
+ test_and_set_thread_flag(TIF_LAZY_MMU_PENDING);
+ else
emit_pte_barriers();
- }
}
static inline void arch_enter_lazy_mmu_mode(void)
{
if (in_interrupt())
return;
-
- set_thread_flag(TIF_LAZY_MMU);
}
static inline void arch_flush_lazy_mmu_mode(void)
@@ -103,7 +94,6 @@ static inline void arch_leave_lazy_mmu_mode(void)
return;
arch_flush_lazy_mmu_mode();
- clear_thread_flag(TIF_LAZY_MMU);
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index f241b8601ebd..4ff8da0767d9 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -84,8 +84,7 @@ void arch_setup_new_exec(void);
#define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
#define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
#define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */
-#define TIF_LAZY_MMU 31 /* Task in lazy mmu mode */
-#define TIF_LAZY_MMU_PENDING 32 /* Ops pending for lazy mmu mode exit */
+#define TIF_LAZY_MMU_PENDING 31 /* Ops pending for lazy mmu mode exit */
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
--
2.47.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v4 09/12] powerpc/mm: replace batch->active with in_lazy_mmu_mode()
2025-10-29 10:08 [PATCH v4 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (7 preceding siblings ...)
2025-10-29 10:09 ` [PATCH v4 08/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode() Kevin Brodsky
@ 2025-10-29 10:09 ` Kevin Brodsky
2025-11-03 16:05 ` David Hildenbrand
2025-11-05 9:40 ` Ritesh Harjani
2025-10-29 10:09 ` [PATCH v4 10/12] sparc/mm: " Kevin Brodsky
` (2 subsequent siblings)
11 siblings, 2 replies; 62+ messages in thread
From: Kevin Brodsky @ 2025-10-29 10:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
A per-CPU batch struct is activated when entering lazy MMU mode; its
lifetime is the same as the lazy MMU section (it is deactivated when
leaving the mode). Preemption is disabled in that interval to ensure
that the per-CPU reference remains valid.
The generic lazy_mmu layer now tracks whether a task is in lazy MMU
mode. We can therefore use the generic helper in_lazy_mmu_mode()
to tell whether a batch struct is active instead of tracking it
explicitly.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 9 ---------
arch/powerpc/mm/book3s64/hash_tlb.c | 2 +-
2 files changed, 1 insertion(+), 10 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 623a8a8b2d0e..bbc54690d374 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -12,7 +12,6 @@
#define PPC64_TLB_BATCH_NR 192
struct ppc64_tlb_batch {
- int active;
unsigned long index;
struct mm_struct *mm;
real_pte_t pte[PPC64_TLB_BATCH_NR];
@@ -26,8 +25,6 @@ extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch);
static inline void arch_enter_lazy_mmu_mode(void)
{
- struct ppc64_tlb_batch *batch;
-
if (radix_enabled())
return;
/*
@@ -35,8 +32,6 @@ static inline void arch_enter_lazy_mmu_mode(void)
* operating on kernel page tables.
*/
preempt_disable();
- batch = this_cpu_ptr(&ppc64_tlb_batch);
- batch->active = 1;
}
static inline void arch_flush_lazy_mmu_mode(void)
@@ -51,14 +46,10 @@ static inline void arch_flush_lazy_mmu_mode(void)
static inline void arch_leave_lazy_mmu_mode(void)
{
- struct ppc64_tlb_batch *batch;
-
if (radix_enabled())
return;
- batch = this_cpu_ptr(&ppc64_tlb_batch);
arch_flush_lazy_mmu_mode();
- batch->active = 0;
preempt_enable();
}
diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index 787f7a0e27f0..72b83f582b6d 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -100,7 +100,7 @@ void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
* Check if we have an active batch on this CPU. If not, just
* flush now and return.
*/
- if (!batch->active) {
+ if (!in_lazy_mmu_mode()) {
flush_hash_page(vpn, rpte, psize, ssize, mm_is_thread_local(mm));
put_cpu_var(ppc64_tlb_batch);
return;
--
2.47.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v4 10/12] sparc/mm: replace batch->active with in_lazy_mmu_mode()
2025-10-29 10:08 [PATCH v4 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (8 preceding siblings ...)
2025-10-29 10:09 ` [PATCH v4 09/12] powerpc/mm: replace batch->active " Kevin Brodsky
@ 2025-10-29 10:09 ` Kevin Brodsky
2025-11-03 16:11 ` David Hildenbrand (Red Hat)
2025-10-29 10:09 ` [PATCH v4 11/12] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
2025-10-29 10:09 ` [PATCH v4 12/12] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
11 siblings, 1 reply; 62+ messages in thread
From: Kevin Brodsky @ 2025-10-29 10:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
A per-CPU batch struct is activated when entering lazy MMU mode; its
lifetime is the same as the lazy MMU section (it is deactivated when
leaving the mode). Preemption is disabled in that interval to ensure
that the per-CPU reference remains valid.
The generic lazy_mmu layer now tracks whether a task is in lazy MMU
mode. We can therefore use the generic helper in_lazy_mmu_mode()
to tell whether a batch struct is active instead of tracking it
explicitly.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/sparc/include/asm/tlbflush_64.h | 1 -
arch/sparc/mm/tlb.c | 9 +--------
2 files changed, 1 insertion(+), 9 deletions(-)
diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index 4e1036728e2f..6133306ba59a 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -12,7 +12,6 @@ struct tlb_batch {
unsigned int hugepage_shift;
struct mm_struct *mm;
unsigned long tlb_nr;
- unsigned long active;
unsigned long vaddrs[TLB_BATCH_NR];
};
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 7b5dfcdb1243..879e22c86e5c 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -52,11 +52,7 @@ void flush_tlb_pending(void)
void arch_enter_lazy_mmu_mode(void)
{
- struct tlb_batch *tb;
-
preempt_disable();
- tb = this_cpu_ptr(&tlb_batch);
- tb->active = 1;
}
void arch_flush_lazy_mmu_mode(void)
@@ -69,10 +65,7 @@ void arch_flush_lazy_mmu_mode(void)
void arch_leave_lazy_mmu_mode(void)
{
- struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
-
arch_flush_lazy_mmu_mode();
- tb->active = 0;
preempt_enable();
}
@@ -93,7 +86,7 @@ static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr,
nr = 0;
}
- if (!tb->active) {
+ if (!in_lazy_mmu_mode()) {
flush_tsb_user_page(mm, vaddr, hugepage_shift);
global_flush_tlb_page(mm, vaddr);
goto out;
--
2.47.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v4 11/12] x86/xen: use lazy_mmu_state when context-switching
2025-10-29 10:08 [PATCH v4 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (9 preceding siblings ...)
2025-10-29 10:09 ` [PATCH v4 10/12] sparc/mm: " Kevin Brodsky
@ 2025-10-29 10:09 ` Kevin Brodsky
2025-11-03 16:15 ` David Hildenbrand (Red Hat)
2025-10-29 10:09 ` [PATCH v4 12/12] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
11 siblings, 1 reply; 62+ messages in thread
From: Kevin Brodsky @ 2025-10-29 10:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
We currently set a TIF flag when scheduling out a task that is in
lazy MMU mode, in order to restore it when the task is scheduled
again.
The generic lazy_mmu layer now tracks whether a task is in lazy MMU
mode in task_struct::lazy_mmu_state. We can therefore check that
state when switching to the new task, instead of using a separate
TIF flag.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/x86/include/asm/thread_info.h | 4 +---
arch/x86/xen/enlighten_pv.c | 3 +--
2 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e71e0e8362ed..0067684afb5b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -100,8 +100,7 @@ struct thread_info {
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_SINGLESTEP 25 /* reenable singlestep on user return*/
#define TIF_BLOCKSTEP 26 /* set when we want DEBUGCTLMSR_BTF */
-#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
-#define TIF_ADDR32 28 /* 32-bit address space on 64 bits */
+#define TIF_ADDR32 27 /* 32-bit address space on 64 bits */
#define _TIF_SSBD BIT(TIF_SSBD)
#define _TIF_SPEC_IB BIT(TIF_SPEC_IB)
@@ -114,7 +113,6 @@ struct thread_info {
#define _TIF_FORCED_TF BIT(TIF_FORCED_TF)
#define _TIF_BLOCKSTEP BIT(TIF_BLOCKSTEP)
#define _TIF_SINGLESTEP BIT(TIF_SINGLESTEP)
-#define _TIF_LAZY_MMU_UPDATES BIT(TIF_LAZY_MMU_UPDATES)
#define _TIF_ADDR32 BIT(TIF_ADDR32)
/* flags to check in __switch_to() */
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 4806cc28d7ca..f40f5999352e 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -426,7 +426,6 @@ static void xen_start_context_switch(struct task_struct *prev)
if (this_cpu_read(xen_lazy_mode) == XEN_LAZY_MMU) {
arch_leave_lazy_mmu_mode();
- set_ti_thread_flag(task_thread_info(prev), TIF_LAZY_MMU_UPDATES);
}
enter_lazy(XEN_LAZY_CPU);
}
@@ -437,7 +436,7 @@ static void xen_end_context_switch(struct task_struct *next)
xen_mc_flush();
leave_lazy(XEN_LAZY_CPU);
- if (test_and_clear_ti_thread_flag(task_thread_info(next), TIF_LAZY_MMU_UPDATES))
+ if (next->lazy_mmu_state.active)
arch_enter_lazy_mmu_mode();
}
--
2.47.0
^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH v4 12/12] mm: bail out of lazy_mmu_mode_* in interrupt context
2025-10-29 10:08 [PATCH v4 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (10 preceding siblings ...)
2025-10-29 10:09 ` [PATCH v4 11/12] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
@ 2025-10-29 10:09 ` Kevin Brodsky
2025-11-07 15:42 ` Ryan Roberts
11 siblings, 1 reply; 62+ messages in thread
From: Kevin Brodsky @ 2025-10-29 10:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
The lazy MMU mode cannot be used in interrupt context. This is
documented in <linux/pgtable.h>, but isn't consistently handled
across architectures.
arm64 ensures that calls to lazy_mmu_mode_* have no effect in
interrupt context, because such calls do occur in certain
configurations - see commit b81c688426a9 ("arm64/mm: Disable barrier
batching in interrupt contexts"). Other architectures do not check
for this situation, most likely because it hasn't occurred so far.
Both arm64 and x86/Xen also ensure that any lazy MMU optimisation is
disabled while in interrupt mode (see queue_pte_barriers() and
xen_get_lazy_mode() respectively).
Let's handle this in the new generic lazy_mmu layer, in the same
fashion as arm64: bail out of lazy_mmu_mode_* if in_interrupt(), and
have in_lazy_mmu_mode() return false to disable any optimisation.
Also remove the arm64 handling that is now redundant; x86/Xen has
its own internal tracking so it is left unchanged.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/arm64/include/asm/pgtable.h | 17 +----------------
include/linux/pgtable.h | 16 ++++++++++++++--
include/linux/sched.h | 3 +++
3 files changed, 18 insertions(+), 18 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 61ca88f94551..96987a49e83b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -62,37 +62,22 @@ static inline void emit_pte_barriers(void)
static inline void queue_pte_barriers(void)
{
- if (in_interrupt()) {
- emit_pte_barriers();
- return;
- }
-
if (in_lazy_mmu_mode())
test_and_set_thread_flag(TIF_LAZY_MMU_PENDING);
else
emit_pte_barriers();
}
-static inline void arch_enter_lazy_mmu_mode(void)
-{
- if (in_interrupt())
- return;
-}
+static inline void arch_enter_lazy_mmu_mode(void) {}
static inline void arch_flush_lazy_mmu_mode(void)
{
- if (in_interrupt())
- return;
-
if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
emit_pte_barriers();
}
static inline void arch_leave_lazy_mmu_mode(void)
{
- if (in_interrupt())
- return;
-
arch_flush_lazy_mmu_mode();
}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e6064e00b22d..e6069ce4ec83 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -228,8 +228,8 @@ static inline int pmd_dirty(pmd_t pmd)
* of the lazy mode. So the implementation must assume preemption may be enabled
* and cpu migration is possible; it must take steps to be robust against this.
* (In practice, for user PTE updates, the appropriate page table lock(s) are
- * held, but for kernel PTE updates, no lock is held). The mode cannot be used
- * in interrupt context.
+ * held, but for kernel PTE updates, no lock is held). The mode is disabled
+ * in interrupt context and calls to the lazy_mmu API have no effect.
*
* The lazy MMU mode is enabled for a given block of code using:
*
@@ -265,6 +265,9 @@ static inline void lazy_mmu_mode_enable(void)
{
struct lazy_mmu_state *state = &current->lazy_mmu_state;
+ if (in_interrupt())
+ return;
+
VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
/* enable() must not be called while paused */
VM_WARN_ON(state->nesting_level > 0 && !state->active);
@@ -279,6 +282,9 @@ static inline void lazy_mmu_mode_disable(void)
{
struct lazy_mmu_state *state = &current->lazy_mmu_state;
+ if (in_interrupt())
+ return;
+
VM_WARN_ON_ONCE(state->nesting_level == 0);
VM_WARN_ON(!state->active);
@@ -295,6 +301,9 @@ static inline void lazy_mmu_mode_pause(void)
{
struct lazy_mmu_state *state = &current->lazy_mmu_state;
+ if (in_interrupt())
+ return;
+
VM_WARN_ON(state->nesting_level == 0 || !state->active);
state->active = false;
@@ -305,6 +314,9 @@ static inline void lazy_mmu_mode_resume(void)
{
struct lazy_mmu_state *state = &current->lazy_mmu_state;
+ if (in_interrupt())
+ return;
+
VM_WARN_ON(state->nesting_level == 0 || state->active);
state->active = true;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 11566d973f42..bb873016ffcf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1731,6 +1731,9 @@ static inline char task_state_to_char(struct task_struct *tsk)
#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
static inline bool in_lazy_mmu_mode(void)
{
+ if (in_interrupt())
+ return false;
+
return current->lazy_mmu_state.active;
}
#else
--
2.47.0
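For reference, a minimal sketch of what this allows at call sites
(maybe_batch_update(), queue_for_later() and apply_now() are hypothetical
names used purely for illustration): a helper that batches work no longer
needs its own in_interrupt() check, since in_lazy_mmu_mode() already returns
false in interrupt context.

static void maybe_batch_update(void)
{
	if (in_lazy_mmu_mode())
		queue_for_later();	/* task context with lazy MMU active */
	else
		apply_now();		/* immediate path, also taken in interrupt context */
}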
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-10-29 10:09 ` [PATCH v4 07/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
@ 2025-10-29 16:41 ` Alexander Gordeev
2025-10-30 10:28 ` Kevin Brodsky
2025-11-01 12:22 ` David Hildenbrand
` (2 subsequent siblings)
3 siblings, 1 reply; 62+ messages in thread
From: Alexander Gordeev @ 2025-10-29 16:41 UTC (permalink / raw)
To: Kevin Brodsky
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On Wed, Oct 29, 2025 at 10:09:04AM +0000, Kevin Brodsky wrote:
Hi Kevin,
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return current->lazy_mmu_state.active;
Wouldn't (nesting_level > 0) be a more correct check?
Otherwise, it returns false while in paused mode.
Maybe check both nesting_level and active, and also introduce
in_lazy_mmu_paused_mode() right away to avoid any confusion?
> +}
> +#else
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return false;
> +}
> +#endif
> +
> extern struct pid *cad_pid;
>
> /*
Thanks!
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-10-29 16:41 ` Alexander Gordeev
@ 2025-10-30 10:28 ` Kevin Brodsky
2025-10-30 16:34 ` Alexander Gordeev
0 siblings, 1 reply; 62+ messages in thread
From: Kevin Brodsky @ 2025-10-30 10:28 UTC (permalink / raw)
To: Alexander Gordeev
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 29/10/2025 17:41, Alexander Gordeev wrote:
> On Wed, Oct 29, 2025 at 10:09:04AM +0000, Kevin Brodsky wrote:
>
> Hi Kevin,
>
>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>> +static inline bool in_lazy_mmu_mode(void)
>> +{
>> + return current->lazy_mmu_state.active;
> Wouldn't (nesting_level > 0) be a more correct check?
> Otherwise, it returns false while in paused mode.
That's exactly the intention. Lazy MMU is disabled while paused. The
users of that helper want to know if lazy MMU is currently enabled (to
decide whether to batch updates for instance); whether this is because
we are paused or because we are not in any lazy_mmu section
(nesting_level == 0) makes no difference.
> Maybe check both nesting_level and active, and also introduce
> in_lazy_mmu_paused_mode() right away to avoid any confusion?
Can you think of any situation where a caller would specifically want to
know that lazy MMU is paused?
- Kevin
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-10-30 10:28 ` Kevin Brodsky
@ 2025-10-30 16:34 ` Alexander Gordeev
0 siblings, 0 replies; 62+ messages in thread
From: Alexander Gordeev @ 2025-10-30 16:34 UTC (permalink / raw)
To: Kevin Brodsky
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On Thu, Oct 30, 2025 at 11:28:53AM +0100, Kevin Brodsky wrote:
> On 29/10/2025 17:41, Alexander Gordeev wrote:
> > On Wed, Oct 29, 2025 at 10:09:04AM +0000, Kevin Brodsky wrote:
> >
> > Hi Kevin,
> >
> >> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> >> +static inline bool in_lazy_mmu_mode(void)
> >> +{
> >> + return current->lazy_mmu_state.active;
> > Wouldn't (nesting_level > 0) be a more correct check?
> > Otherwise, it returns false while in paused mode.
>
> That's exactly the intention. Lazy MMU is disabled while paused. The
> users of that helper want to know if lazy MMU is currently enabled (to
> decide whether to batch updates for instance); whether this is because
> we are paused or because we are not in any lazy_mmu section
> (nesting_level == 0) makes no difference.
>
> > Maybe check both nesting_level and active, and also introduce
> > in_lazy_mmu_paused_mode() right away to avoid any confusion?
>
> Can you think of any situation where a caller would specifically want to
> know that lazy MMU is paused?
I thought I did, but in_lazy_mmu_mode() alone works just fine,
as you described (at least for now).
> - Kevin
Thanks!
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 01/12] powerpc/64s: Do not re-activate batched TLB flush
2025-10-29 10:08 ` [PATCH v4 01/12] powerpc/64s: Do not re-activate batched TLB flush Kevin Brodsky
@ 2025-11-01 12:05 ` David Hildenbrand
2025-11-05 2:46 ` Ritesh Harjani
2025-11-07 12:25 ` Ryan Roberts
2 siblings, 0 replies; 62+ messages in thread
From: David Hildenbrand @ 2025-11-01 12:05 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 29.10.25 11:08, Kevin Brodsky wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
>
> Since commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash
> lazy mmu mode") a task can not be preempted while in lazy MMU mode.
> Therefore, the batch re-activation code is never called, so remove it.
>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 02/12] x86/xen: simplify flush_lazy_mmu()
2025-10-29 10:08 ` [PATCH v4 02/12] x86/xen: simplify flush_lazy_mmu() Kevin Brodsky
@ 2025-11-01 12:14 ` David Hildenbrand
2025-11-03 18:06 ` Kevin Brodsky
2025-11-07 12:31 ` Ryan Roberts
2025-11-07 15:45 ` Jürgen Groß
2 siblings, 1 reply; 62+ messages in thread
From: David Hildenbrand @ 2025-11-01 12:14 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 29.10.25 11:08, Kevin Brodsky wrote:
> arch_flush_lazy_mmu_mode() is called when outstanding batched
> pgtable operations must be completed immediately. There should
> however be no need to leave and re-enter lazy MMU completely. The
> only part of that sequence that we really need is xen_mc_flush();
> call it directly.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/x86/xen/mmu_pv.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index 2a4a8deaf612..7a35c3393df4 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -2139,10 +2139,8 @@ static void xen_flush_lazy_mmu(void)
> {
> preempt_disable();
>
> - if (xen_get_lazy_mode() == XEN_LAZY_MMU) {
> - arch_leave_lazy_mmu_mode();
> - arch_enter_lazy_mmu_mode();
> - }
> + if (xen_get_lazy_mode() == XEN_LAZY_MMU)
> + xen_mc_flush();
>
> preempt_enable();
> }
Looks like that was moved to XEN code in
commit a4a7644c15096f57f92252dd6e1046bf269c87d8
Author: Juergen Gross <jgross@suse.com>
Date: Wed Sep 13 13:38:27 2023 +0200
x86/xen: move paravirt lazy code
And essentially the previous implementation lived in
arch/x86/kernel/paravirt.c:paravirt_flush_lazy_mmu(void) in an
implementation-agnostic way:
void paravirt_flush_lazy_mmu(void)
{
preempt_disable();
if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_MMU) {
arch_leave_lazy_mmu_mode();
arch_enter_lazy_mmu_mode();
}
preempt_enable();
}
So indeed, I assume just doing the flush here is sufficient.
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 03/12] powerpc/mm: implement arch_flush_lazy_mmu_mode()
2025-10-29 10:09 ` [PATCH v4 03/12] powerpc/mm: implement arch_flush_lazy_mmu_mode() Kevin Brodsky
@ 2025-11-01 12:14 ` David Hildenbrand
2025-11-05 3:15 ` Ritesh Harjani
1 sibling, 0 replies; 62+ messages in thread
From: David Hildenbrand @ 2025-11-01 12:14 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 29.10.25 11:09, Kevin Brodsky wrote:
> Upcoming changes to the lazy_mmu API will cause
> arch_flush_lazy_mmu_mode() to be called when leaving a nested
> lazy_mmu section.
>
> Move the relevant logic from arch_leave_lazy_mmu_mode() to
> arch_flush_lazy_mmu_mode() and have the former call the latter.
>
> Note: the additional this_cpu_ptr() on the
> arch_leave_lazy_mmu_mode() path will be removed in a subsequent
> patch.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 04/12] sparc/mm: implement arch_flush_lazy_mmu_mode()
2025-10-29 10:09 ` [PATCH v4 04/12] sparc/mm: " Kevin Brodsky
@ 2025-11-01 12:14 ` David Hildenbrand
0 siblings, 0 replies; 62+ messages in thread
From: David Hildenbrand @ 2025-11-01 12:14 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 29.10.25 11:09, Kevin Brodsky wrote:
> Upcoming changes to the lazy_mmu API will cause
> arch_flush_lazy_mmu_mode() to be called when leaving a nested
> lazy_mmu section.
>
> Move the relevant logic from arch_leave_lazy_mmu_mode() to
> arch_flush_lazy_mmu_mode() and have the former call the latter.
>
> Note: the additional this_cpu_ptr() on the
> arch_leave_lazy_mmu_mode() path will be removed in a subsequent
> patch.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE
2025-10-29 10:09 ` [PATCH v4 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE Kevin Brodsky
@ 2025-11-01 12:16 ` David Hildenbrand
2025-11-05 4:40 ` Ritesh Harjani
2025-11-07 13:56 ` Ryan Roberts
2 siblings, 0 replies; 62+ messages in thread
From: David Hildenbrand @ 2025-11-01 12:16 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 29.10.25 11:09, Kevin Brodsky wrote:
> Architectures currently opt in for implementing lazy_mmu helpers by
> defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE.
>
> In preparation for introducing a generic lazy_mmu layer that will
> require storage in task_struct, let's switch to a cleaner approach:
> instead of defining a macro, select a CONFIG option.
>
> This patch introduces CONFIG_ARCH_HAS_LAZY_MMU_MODE and has each
> arch select it when it implements lazy_mmu helpers.
> __HAVE_ARCH_ENTER_LAZY_MMU_MODE is removed and <linux/pgtable.h>
> relies on the new CONFIG instead.
>
> On x86, lazy_mmu helpers are only implemented if PARAVIRT_XXL is
> selected. This creates some complications in arch/x86/boot/, because
> a few files manually undefine PARAVIRT* options. As a result
> <asm/paravirt.h> does not define the lazy_mmu helpers, but this
> breaks the build as <linux/pgtable.h> only defines them if
> !CONFIG_ARCH_HAS_LAZY_MMU_MODE. There does not seem to be a clean
> way out of this - let's just undefine that new CONFIG too.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
2025-10-29 10:09 ` [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers Kevin Brodsky
@ 2025-11-01 12:18 ` David Hildenbrand
2025-11-07 14:26 ` Ryan Roberts
1 sibling, 0 replies; 62+ messages in thread
From: David Hildenbrand @ 2025-11-01 12:18 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 29.10.25 11:09, Kevin Brodsky wrote:
> The implementation of the lazy MMU mode is currently entirely
> arch-specific; core code directly calls arch helpers:
> arch_{enter,leave}_lazy_mmu_mode().
>
> We are about to introduce support for nested lazy MMU sections.
> As things stand we'd have to duplicate that logic in every arch
> implementing lazy_mmu - adding to a fair amount of logic
> already duplicated across lazy_mmu implementations.
>
> This patch therefore introduces a new generic layer that calls the
> existing arch_* helpers. Two pairs of calls are introduced:
>
> * lazy_mmu_mode_enable() ... lazy_mmu_mode_disable()
> This is the standard case where the mode is enabled for a given
> block of code by surrounding it with enable() and disable()
> calls.
>
> * lazy_mmu_mode_pause() ... lazy_mmu_mode_resume()
> This is for situations where the mode is temporarily disabled
> by first calling pause() and then resume() (e.g. to prevent any
> batching from occurring in a critical section).
>
> The documentation in <linux/pgtable.h> will be updated in a
> subsequent patch.
>
> No functional change should be introduced at this stage.
> The implementation of enable()/resume() and disable()/pause() is
> currently identical, but nesting support will change that.
>
> Most of the call sites have been updated using the following
> Coccinelle script:
>
> @@
> @@
> {
> ...
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> ...
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> ...
> }
>
> @@
> @@
> {
> ...
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_pause();
> ...
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_resume();
> ...
> }
>
> A couple of notes regarding x86:
>
> * Xen is currently the only case where explicit handling is required
> for lazy MMU when context-switching. This is purely an
> implementation detail and using the generic lazy_mmu_mode_*
> functions would cause trouble when nesting support is introduced,
> because the generic functions must be called from the current task.
> For that reason we still use arch_leave() and arch_enter() there.
>
> * x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few
> places, but only defines it if PARAVIRT_XXL is selected, and we
> are removing the fallback in <linux/pgtable.h>. Add a new fallback
> definition to <asm/pgtable.h> to keep things building.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-10-29 10:09 ` [PATCH v4 07/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
2025-10-29 16:41 ` Alexander Gordeev
@ 2025-11-01 12:22 ` David Hildenbrand
2025-11-03 18:08 ` Kevin Brodsky
2025-11-05 8:49 ` Ritesh Harjani
2025-11-07 14:59 ` Ryan Roberts
3 siblings, 1 reply; 62+ messages in thread
From: David Hildenbrand @ 2025-11-01 12:22 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
> static inline void lazy_mmu_mode_pause(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
> +
> + state->active = false;
> arch_leave_lazy_mmu_mode();
Just one question:
Don't we want to allow for pause/resume when not enabled? Would seem
valid to me, because pause/resume code should actually not worry about
that, right?
if (!state->nesting_level) {
VM_WARN_ON(state->active);
return;
}
VM_WARN_ON(!state->active);
state->active = false;
arch_leave_lazy_mmu_mode();
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 08/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
2025-10-29 10:09 ` [PATCH v4 08/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode() Kevin Brodsky
@ 2025-11-03 16:03 ` David Hildenbrand
2025-11-03 18:25 ` Kevin Brodsky
2025-11-07 15:28 ` Ryan Roberts
1 sibling, 1 reply; 62+ messages in thread
From: David Hildenbrand @ 2025-11-03 16:03 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 29.10.25 11:09, Kevin Brodsky wrote:
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode. As a result we no longer need a TIF flag for that purpose -
> let's use the new in_lazy_mmu_mode() helper instead.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 16 +++-------------
> arch/arm64/include/asm/thread_info.h | 3 +--
> 2 files changed, 4 insertions(+), 15 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 535435248923..61ca88f94551 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -62,30 +62,21 @@ static inline void emit_pte_barriers(void)
>
> static inline void queue_pte_barriers(void)
> {
> - unsigned long flags;
> -
> if (in_interrupt()) {
> emit_pte_barriers();
> return;
> }
>
> - flags = read_thread_flags();
> -
> - if (flags & BIT(TIF_LAZY_MMU)) {
> - /* Avoid the atomic op if already set. */
> - if (!(flags & BIT(TIF_LAZY_MMU_PENDING)))
> - set_thread_flag(TIF_LAZY_MMU_PENDING);
> - } else {
> + if (in_lazy_mmu_mode())
> + test_and_set_thread_flag(TIF_LAZY_MMU_PENDING);
You likely don't want a test_and_set here, which would do a
test_and_set_bit() -- an atomic rmw.
You only want to avoid the atomic write if already set.
So keep the current
/* Avoid the atomic op if already set. */
if (!(flags & BIT(TIF_LAZY_MMU_PENDING)))
set_thread_flag(TIF_LAZY_MMU_PENDING);
--
Cheers
David
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 09/12] powerpc/mm: replace batch->active with in_lazy_mmu_mode()
2025-10-29 10:09 ` [PATCH v4 09/12] powerpc/mm: replace batch->active " Kevin Brodsky
@ 2025-11-03 16:05 ` David Hildenbrand
2025-11-04 11:33 ` Kevin Brodsky
2025-11-05 9:40 ` Ritesh Harjani
1 sibling, 1 reply; 62+ messages in thread
From: David Hildenbrand @ 2025-11-03 16:05 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 29.10.25 11:09, Kevin Brodsky wrote:
> A per-CPU batch struct is activated when entering lazy MMU mode; its
> lifetime is the same as the lazy MMU section (it is deactivated when
> leaving the mode). Preemption is disabled in that interval to ensure
> that the per-CPU reference remains valid.
>
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode. We can therefore use the generic helper in_lazy_mmu_mode()
> to tell whether a batch struct is active instead of tracking it
> explicitly.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
I suspect you were not able to test this on real HW. Some help from the
ppc folks would be appreciated.
LGTM, but the interaction with pause/resume adds a bit of complication
on top.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 10/12] sparc/mm: replace batch->active with in_lazy_mmu_mode()
2025-10-29 10:09 ` [PATCH v4 10/12] sparc/mm: " Kevin Brodsky
@ 2025-11-03 16:11 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 62+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-03 16:11 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 29.10.25 11:09, Kevin Brodsky wrote:
> A per-CPU batch struct is activated when entering lazy MMU mode; its
> lifetime is the same as the lazy MMU section (it is deactivated when
> leaving the mode). Preemption is disabled in that interval to ensure
> that the per-CPU reference remains valid.
>
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode. We can therefore use the generic helper in_lazy_mmu_mode()
> to tell whether a batch struct is active instead of tracking it
> explicitly.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/sparc/include/asm/tlbflush_64.h | 1 -
> arch/sparc/mm/tlb.c | 9 +--------
> 2 files changed, 1 insertion(+), 9 deletions(-)
>
> diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
> index 4e1036728e2f..6133306ba59a 100644
> --- a/arch/sparc/include/asm/tlbflush_64.h
> +++ b/arch/sparc/include/asm/tlbflush_64.h
> @@ -12,7 +12,6 @@ struct tlb_batch {
> unsigned int hugepage_shift;
> struct mm_struct *mm;
> unsigned long tlb_nr;
> - unsigned long active;
> unsigned long vaddrs[TLB_BATCH_NR];
> };
>
> diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
> index 7b5dfcdb1243..879e22c86e5c 100644
> --- a/arch/sparc/mm/tlb.c
> +++ b/arch/sparc/mm/tlb.c
> @@ -52,11 +52,7 @@ void flush_tlb_pending(void)
>
> void arch_enter_lazy_mmu_mode(void)
> {
> - struct tlb_batch *tb;
> -
> preempt_disable();
> - tb = this_cpu_ptr(&tlb_batch);
> - tb->active = 1;
> }
>
> void arch_flush_lazy_mmu_mode(void)
> @@ -69,10 +65,7 @@ void arch_flush_lazy_mmu_mode(void)
>
> void arch_leave_lazy_mmu_mode(void)
> {
> - struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
> -
> arch_flush_lazy_mmu_mode();
> - tb->active = 0;
> preempt_enable();
> }
>
> @@ -93,7 +86,7 @@ static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr,
> nr = 0;
> }
>
> - if (!tb->active) {
> + if (!in_lazy_mmu_mode()) {
> flush_tsb_user_page(mm, vaddr, hugepage_shift);
> global_flush_tlb_page(mm, vaddr);
> goto out;
(messing up my transition to the new email address, as Thunderbird still
defaults to my old one on mails received through RH servers)
Did we get this tested with some help from sparc64 folks?
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
--
Cheers
David
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 11/12] x86/xen: use lazy_mmu_state when context-switching
2025-10-29 10:09 ` [PATCH v4 11/12] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
@ 2025-11-03 16:15 ` David Hildenbrand (Red Hat)
2025-11-03 18:29 ` Kevin Brodsky
0 siblings, 1 reply; 62+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-03 16:15 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 29.10.25 11:09, Kevin Brodsky wrote:
> We currently set a TIF flag when scheduling out a task that is in
> lazy MMU mode, in order to restore it when the task is scheduled
> again.
>
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode in task_struct::lazy_mmu_state. We can therefore check that
> state when switching to the new task, instead of using a separate
> TIF flag.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/x86/include/asm/thread_info.h | 4 +---
> arch/x86/xen/enlighten_pv.c | 3 +--
> 2 files changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index e71e0e8362ed..0067684afb5b 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -100,8 +100,7 @@ struct thread_info {
> #define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
> #define TIF_SINGLESTEP 25 /* reenable singlestep on user return*/
> #define TIF_BLOCKSTEP 26 /* set when we want DEBUGCTLMSR_BTF */
> -#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
> -#define TIF_ADDR32 28 /* 32-bit address space on 64 bits */
> +#define TIF_ADDR32 27 /* 32-bit address space on 64 bits */
>
> #define _TIF_SSBD BIT(TIF_SSBD)
> #define _TIF_SPEC_IB BIT(TIF_SPEC_IB)
> @@ -114,7 +113,6 @@ struct thread_info {
> #define _TIF_FORCED_TF BIT(TIF_FORCED_TF)
> #define _TIF_BLOCKSTEP BIT(TIF_BLOCKSTEP)
> #define _TIF_SINGLESTEP BIT(TIF_SINGLESTEP)
> -#define _TIF_LAZY_MMU_UPDATES BIT(TIF_LAZY_MMU_UPDATES)
> #define _TIF_ADDR32 BIT(TIF_ADDR32)
>
> /* flags to check in __switch_to() */
> diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
> index 4806cc28d7ca..f40f5999352e 100644
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -426,7 +426,6 @@ static void xen_start_context_switch(struct task_struct *prev)
>
> if (this_cpu_read(xen_lazy_mode) == XEN_LAZY_MMU) {
> arch_leave_lazy_mmu_mode();
> - set_ti_thread_flag(task_thread_info(prev), TIF_LAZY_MMU_UPDATES);
> }
> enter_lazy(XEN_LAZY_CPU);
> }
> @@ -437,7 +436,7 @@ static void xen_end_context_switch(struct task_struct *next)
>
> xen_mc_flush();
> leave_lazy(XEN_LAZY_CPU);
> - if (test_and_clear_ti_thread_flag(task_thread_info(next), TIF_LAZY_MMU_UPDATES))
> + if (next->lazy_mmu_state.active)
This is nasty. If in_lazy_mmu_mode() is not sufficient, we will want to
have a separate helper that makes it clear what the difference between
both variants is.
--
Cheers
David
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 02/12] x86/xen: simplify flush_lazy_mmu()
2025-11-01 12:14 ` David Hildenbrand
@ 2025-11-03 18:06 ` Kevin Brodsky
0 siblings, 0 replies; 62+ messages in thread
From: Kevin Brodsky @ 2025-11-03 18:06 UTC (permalink / raw)
To: David Hildenbrand, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 01/11/2025 12:14, David Hildenbrand wrote:
> On 29.10.25 11:08, Kevin Brodsky wrote:
>> arch_flush_lazy_mmu_mode() is called when outstanding batched
>> pgtable operations must be completed immediately. There should
>> however be no need to leave and re-enter lazy MMU completely. The
>> only part of that sequence that we really need is xen_mc_flush();
>> call it directly.
>>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>> arch/x86/xen/mmu_pv.c | 6 ++----
>> 1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>> index 2a4a8deaf612..7a35c3393df4 100644
>> --- a/arch/x86/xen/mmu_pv.c
>> +++ b/arch/x86/xen/mmu_pv.c
>> @@ -2139,10 +2139,8 @@ static void xen_flush_lazy_mmu(void)
>> {
>> preempt_disable();
>> - if (xen_get_lazy_mode() == XEN_LAZY_MMU) {
>> - arch_leave_lazy_mmu_mode();
>> - arch_enter_lazy_mmu_mode();
>> - }
>> + if (xen_get_lazy_mode() == XEN_LAZY_MMU)
>> + xen_mc_flush();
>> preempt_enable();
>> }
>
> Looks like that was moved to XEN code in
>
> commit a4a7644c15096f57f92252dd6e1046bf269c87d8
> Author: Juergen Gross <jgross@suse.com>
> Date: Wed Sep 13 13:38:27 2023 +0200
>
> x86/xen: move paravirt lazy code
>
>
> And essentially the previous implementation lived in
> arch/x86/kernel/paravirt.c:paravirt_flush_lazy_mmu(void) in an
> implementation-agnostic way:
>
> void paravirt_flush_lazy_mmu(void)
> {
> preempt_disable();
>
> if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_MMU) {
> arch_leave_lazy_mmu_mode();
> arch_enter_lazy_mmu_mode();
> }
>
> preempt_enable();
> }
Indeed, I saw that too. Calling the generic leave/enter functions made
some sense at that point, but now that the implementation is
Xen-specific we can directly call xen_mc_flush().
>
> So indeed, I assume just doing the flush here is sufficient.
>
> Reviewed-by: David Hildenbrand <david@redhat.com>
Thanks for the review!
- Kevin
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-11-01 12:22 ` David Hildenbrand
@ 2025-11-03 18:08 ` Kevin Brodsky
0 siblings, 0 replies; 62+ messages in thread
From: Kevin Brodsky @ 2025-11-03 18:08 UTC (permalink / raw)
To: David Hildenbrand, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 01/11/2025 12:22, David Hildenbrand wrote:
>
>> static inline void lazy_mmu_mode_pause(void)
>> {
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
>> +
>> + state->active = false;
>> arch_leave_lazy_mmu_mode();
>
> Just one question:
>
> Don't we want to allow for pause/resume when not enabled? Would seem
> valid to me, because pause/resume code should actually not worry about
> that, right?
This does sound sensible, thanks for the suggestion. The initial goal
was to allow functions that know they're called with lazy MMU enabled to
be able to pause it temporarily if they need batching disabled. But we
could generalise this to: if you know batching would break things, then
you can preemptively add a pause/resume pair, and it won't do anything
unless you're called with lazy MMU enabled.
I also like this because it removes an invalid usage situation - now as
long as you have balanced enable/disable and pause/resume calls, you're
good. Will make that change in v5.
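For completeness, the matching resume() under that scheme might look roughly
like this (only a sketch of the v5 direction, not actual code):

static inline void lazy_mmu_mode_resume(void)
{
	struct lazy_mmu_state *state = &current->lazy_mmu_state;

	if (!state->nesting_level) {
		/* the earlier pause() was a no-op outside any lazy_mmu section */
		VM_WARN_ON(state->active);
		return;
	}
	VM_WARN_ON(state->active);
	state->active = true;
	arch_enter_lazy_mmu_mode();
}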
- Kevin
>
> if (!state->nesting_level) {
> VM_WARN_ON(state->active);
> return;
> }
> VM_WARN_ON(!state->active);
> state->active = false;
> arch_leave_lazy_mmu_mode();
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 08/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
2025-11-03 16:03 ` David Hildenbrand
@ 2025-11-03 18:25 ` Kevin Brodsky
0 siblings, 0 replies; 62+ messages in thread
From: Kevin Brodsky @ 2025-11-03 18:25 UTC (permalink / raw)
To: David Hildenbrand, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 03/11/2025 16:03, David Hildenbrand wrote:
> On 29.10.25 11:09, Kevin Brodsky wrote:
>> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
>> mode. As a result we no longer need a TIF flag for that purpose -
>> let's use the new in_lazy_mmu_mode() helper instead.
>>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 16 +++-------------
>> arch/arm64/include/asm/thread_info.h | 3 +--
>> 2 files changed, 4 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h
>> b/arch/arm64/include/asm/pgtable.h
>> index 535435248923..61ca88f94551 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -62,30 +62,21 @@ static inline void emit_pte_barriers(void)
>> static inline void queue_pte_barriers(void)
>> {
>> - unsigned long flags;
>> -
>> if (in_interrupt()) {
>> emit_pte_barriers();
>> return;
>> }
>> - flags = read_thread_flags();
>> -
>> - if (flags & BIT(TIF_LAZY_MMU)) {
>> - /* Avoid the atomic op if already set. */
>> - if (!(flags & BIT(TIF_LAZY_MMU_PENDING)))
>> - set_thread_flag(TIF_LAZY_MMU_PENDING);
>> - } else {
>> + if (in_lazy_mmu_mode())
>> + test_and_set_thread_flag(TIF_LAZY_MMU_PENDING);
>
> You likely don't want a test_and_set here, which would do a
> test_and_set_bit() -- an atomic rmw.
Ah yes good point, the new version would do an atomic RMW in all cases.
Simpler code but also slower :/
>
> You only want to avoid the atomic write if already set.
>
> So keep the current
>
> /* Avoid the atomic op if already set. */
> if (!(flags & BIT(TIF_LAZY_MMU_PENDING)))
> set_thread_flag(TIF_LAZY_MMU_PENDING);
Pretty much, since we're now only considering one flag we can simplify
it to:
if (!test_thread_flag(TIF_LAZY_MMU_PENDING))
set_thread_flag(TIF_LAZY_MMU_PENDING);
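For completeness, the whole helper would then look something like this
(sketch - assuming I'm reconstructing the surrounding context of the
hunk correctly):

	static inline void queue_pte_barriers(void)
	{
		if (in_interrupt()) {
			emit_pte_barriers();
			return;
		}

		if (in_lazy_mmu_mode()) {
			/* Avoid the atomic op if already set. */
			if (!test_thread_flag(TIF_LAZY_MMU_PENDING))
				set_thread_flag(TIF_LAZY_MMU_PENDING);
		} else {
			emit_pte_barriers();
		}
	}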
- Kevin
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 11/12] x86/xen: use lazy_mmu_state when context-switching
2025-11-03 16:15 ` David Hildenbrand (Red Hat)
@ 2025-11-03 18:29 ` Kevin Brodsky
2025-11-03 19:23 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 62+ messages in thread
From: Kevin Brodsky @ 2025-11-03 18:29 UTC (permalink / raw)
To: David Hildenbrand (Red Hat), linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 03/11/2025 16:15, David Hildenbrand (Red Hat) wrote:
> On 29.10.25 11:09, Kevin Brodsky wrote:
>> [...]
>>
>> @@ -437,7 +436,7 @@ static void xen_end_context_switch(struct
>> task_struct *next)
>> xen_mc_flush();
>> leave_lazy(XEN_LAZY_CPU);
>> - if (test_and_clear_ti_thread_flag(task_thread_info(next),
>> TIF_LAZY_MMU_UPDATES))
>> + if (next->lazy_mmu_state.active)
>
> This is nasty. If in_lazy_mmu_mode() is not sufficient, we will want
> to have a separate helper that makes it clear what the difference
> between both variants is.
in_lazy_mmu_mode() operates on current, but here we're operating on a
different task. The difference is more fundamental than just passing a
task_struct * or not: in_lazy_mmu_mode() is about whether we're
currently in lazy MMU mode, i.e. not paused and not in interrupt
context. A task that isn't scheduled is never in lazy MMU mode -
lazy_mmu_state.active is just the saved state to be restored when
scheduled again.
My point here is that we could have a helper for this use-case, but it
should not be used in other situations (at least not on current). Maybe
__task_lazy_mmu_active(task)? I do wonder if accessing lazy_mmu_state
directly isn't expressing the intention well enough though (checking the
saved state).
- Kevin
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 11/12] x86/xen: use lazy_mmu_state when context-switching
2025-11-03 18:29 ` Kevin Brodsky
@ 2025-11-03 19:23 ` David Hildenbrand (Red Hat)
2025-11-04 11:28 ` Kevin Brodsky
0 siblings, 1 reply; 62+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-03 19:23 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 03.11.25 19:29, Kevin Brodsky wrote:
> On 03/11/2025 16:15, David Hildenbrand (Red Hat) wrote:
>> On 29.10.25 11:09, Kevin Brodsky wrote:
>>> [...]
>>>
>>> @@ -437,7 +436,7 @@ static void xen_end_context_switch(struct
>>> task_struct *next)
>>> xen_mc_flush();
>>> leave_lazy(XEN_LAZY_CPU);
>>> - if (test_and_clear_ti_thread_flag(task_thread_info(next),
>>> TIF_LAZY_MMU_UPDATES))
>>> + if (next->lazy_mmu_state.active)
>>
>> This is nasty. If in_lazy_mmu_mode() is not sufficient, we will want
>> to have a separate helper that makes it clear what the difference
>> between both variants is.
>
> in_lazy_mmu_mode() operates on current, but here we're operating on a
> different task. The difference is more fundamental than just passing a
> task_struct * or not: in_lazy_mmu_mode() is about whether we're
> currently in lazy MMU mode, i.e. not paused and not in interrupt
> context. A task that isn't scheduled is never in lazy MMU mode -
> lazy_mmu_state.active is just the saved state to be restored when
> scheduled again.
>
> My point here is that we could have a helper for this use-case, but it
> should not be used in other situations (at least not on current). Maybe
> __task_lazy_mmu_active(task)? I do wonder if accessing lazy_mmu_state
> directly isn't expressing the intention well enough though (checking the
> saved state).
Likely there should be a
/**
* task_lazy_mmu_active - test whether the lazy-mmu mode is active for a
* task
* @task: ...
*
* The lazy-mmu mode is active if a task has lazy-mmu mode enabled and
* currently not paused.
*/
static inline bool task_lazy_mmu_active(struct task_struct *task)
{
return task->lazy_mmu_state.active;
}
/**
* in_lazy_mmu_mode() - test whether current is in lazy-mmu mode
*
* Test whether the current task is in lazy-mmu mode: whether the
* interrupts are enabled and the lazy-mmu mode is active for the
* current task.
*/
static inline bool in_lazy_mmu_mode(void)
{
+ if (in_interrupt())
+ return false;
+
return task_lazy_mmu_active(current);
}
Something like that. Maybe we can find better terminology.
--
Cheers
David
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 11/12] x86/xen: use lazy_mmu_state when context-switching
2025-11-03 19:23 ` David Hildenbrand (Red Hat)
@ 2025-11-04 11:28 ` Kevin Brodsky
0 siblings, 0 replies; 62+ messages in thread
From: Kevin Brodsky @ 2025-11-04 11:28 UTC (permalink / raw)
To: David Hildenbrand (Red Hat), linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 03/11/2025 19:23, David Hildenbrand (Red Hat) wrote:
> On 03.11.25 19:29, Kevin Brodsky wrote:
>> On 03/11/2025 16:15, David Hildenbrand (Red Hat) wrote:
>>> On 29.10.25 11:09, Kevin Brodsky wrote:
>>>> [...]
>>>>
>>>> @@ -437,7 +436,7 @@ static void xen_end_context_switch(struct
>>>> task_struct *next)
>>>> xen_mc_flush();
>>>> leave_lazy(XEN_LAZY_CPU);
>>>> - if (test_and_clear_ti_thread_flag(task_thread_info(next),
>>>> TIF_LAZY_MMU_UPDATES))
>>>> + if (next->lazy_mmu_state.active)
>>>
>>> This is nasty. If in_lazy_mmu_mode() is not sufficient, we will want
>>> to have a separate helper that makes it clear what the difference
>>> between both variants is.
>>
>> in_lazy_mmu_mode() operates on current, but here we're operating on a
>> different task. The difference is more fundamental than just passing a
>> task_struct * or not: in_lazy_mmu_mode() is about whether we're
>> currently in lazy MMU mode, i.e. not paused and not in interrupt
>> context. A task that isn't scheduled is never in lazy MMU mode -
>> lazy_mmu_state.active is just the saved state to be restored when
>> scheduled again.
>>
>> My point here is that we could have a helper for this use-case, but it
>> should not be used in other situations (at least not on current). Maybe
>> __task_lazy_mmu_active(task)? I do wonder if accessing lazy_mmu_state
>> directly isn't expressing the intention well enough though (checking the
>> saved state).
>
>
> Likely there should be a
>
> /**
> * task_lazy_mmu_active - test whether the lazy-mmu mode is active for a
> * task
> * @task: ...
> *
> * The lazy-mmu mode is active if a task has lazy-mmu mode enabled and
> * currently not paused.
> */
> static inline bool task_lazy_mmu_active(struct task_struct *task)
> {
> return task->lazy_mmu_state.active;
> }
>
> /**
> * in_lazy_mmu_mode() - test whether current is in lazy-mmu mode
> *
> * Test whether the current task is in lazy-mmu mode: whether the
> * interrupts are enabled and the lazy-mmu mode is active for the
> * current task.
> */
> static inline bool in_lazy_mmu_mode(void)
> {
> + if (in_interrupt())
> + return false;
> +
> return task_lazy_mmu_active(current);
> }
>
>
> Something like that. Maybe we can find better terminology.
That's probably the clearest, yes - will make the change. I can't think
of more self-documenting names; spelling out the difference in the
comments is likely the best we can do.
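In other words, the hunk quoted above would simply become (sketch):

	-	if (next->lazy_mmu_state.active)
	+	if (task_lazy_mmu_active(next))

with the rest of xen_end_context_switch() unchanged.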
- Kevin
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 09/12] powerpc/mm: replace batch->active with in_lazy_mmu_mode()
2025-11-03 16:05 ` David Hildenbrand
@ 2025-11-04 11:33 ` Kevin Brodsky
0 siblings, 0 replies; 62+ messages in thread
From: Kevin Brodsky @ 2025-11-04 11:33 UTC (permalink / raw)
To: David Hildenbrand, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 03/11/2025 16:05, David Hildenbrand wrote:
> On 29.10.25 11:09, Kevin Brodsky wrote:
>> A per-CPU batch struct is activated when entering lazy MMU mode; its
>> lifetime is the same as the lazy MMU section (it is deactivated when
>> leaving the mode). Preemption is disabled in that interval to ensure
>> that the per-CPU reference remains valid.
>>
>> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
>> mode. We can therefore use the generic helper in_lazy_mmu_mode()
>> to tell whether a batch struct is active instead of tracking it
>> explicitly.
>>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>
> I suspect you were not able to test this on real HW. Some help from
> the ppc folks would be appreciated.
Indeed, it would be nice to get some testing on ppc HW that actually
uses lazy MMU (!radix_enabled()).
>
> LGTM, but the interaction with pause/resume adds a bit of complication
> on top.
Does it? This series doesn't change when arch_enter() and arch_leave()
are called, so batch->active and in_lazy_mmu_mode() should coincide.
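To illustrate the point - hypothetical assertion, not part of the
series: anywhere the hash code used to look at batch->active inside a
lazy MMU section, the following equivalence is expected to hold.

	/* The per-CPU batch is activated in arch_enter_lazy_mmu_mode() and
	 * deactivated in arch_leave_lazy_mmu_mode(), i.e. exactly while the
	 * generic layer reports in_lazy_mmu_mode().
	 */
	VM_WARN_ON(batch->active != in_lazy_mmu_mode());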
- Kevin
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 01/12] powerpc/64s: Do not re-activate batched TLB flush
2025-10-29 10:08 ` [PATCH v4 01/12] powerpc/64s: Do not re-activate batched TLB flush Kevin Brodsky
2025-11-01 12:05 ` David Hildenbrand
@ 2025-11-05 2:46 ` Ritesh Harjani
2025-11-06 10:29 ` Kevin Brodsky
2025-11-07 12:25 ` Ryan Roberts
2 siblings, 1 reply; 62+ messages in thread
From: Ritesh Harjani @ 2025-11-05 2:46 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86, Venkat Rao Bagalkote
Kevin Brodsky <kevin.brodsky@arm.com> writes:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
>
> Since commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash
> lazy mmu mode") a task can not be preempted while in lazy MMU mode.
> Therefore, the batch re-activation code is never called, so remove it.
>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/powerpc/include/asm/thread_info.h | 2 --
> arch/powerpc/kernel/process.c | 25 -------------------------
> 2 files changed, 27 deletions(-)
>
Since the commit referenced above disables preemption in
arch_enter_lazy_mmu(), the expectation is that we will never be
context-switched while in lazy_mmu mode; hence the code changes in
switch_to() around __flush_tlb_pending() should ideally never be called.
With this analysis the patch looks good to me. I will give the entire
patch series a try on Power HW with the Hash MMU too (which uses lazy
MMU) and let you know the results!
For this patch please feel free to add:
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
CC: Venkat who also runs CI on linux Power HW for upstream testing :)
-ritesh
> diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
> index b0f200aba2b3..97f35f9b1a96 100644
> --- a/arch/powerpc/include/asm/thread_info.h
> +++ b/arch/powerpc/include/asm/thread_info.h
> @@ -154,12 +154,10 @@ void arch_setup_new_exec(void);
> /* Don't move TLF_NAPPING without adjusting the code in entry_32.S */
> #define TLF_NAPPING 0 /* idle thread enabled NAP mode */
> #define TLF_SLEEPING 1 /* suspend code enabled SLEEP mode */
> -#define TLF_LAZY_MMU 3 /* tlb_batch is active */
> #define TLF_RUNLATCH 4 /* Is the runlatch enabled? */
>
> #define _TLF_NAPPING (1 << TLF_NAPPING)
> #define _TLF_SLEEPING (1 << TLF_SLEEPING)
> -#define _TLF_LAZY_MMU (1 << TLF_LAZY_MMU)
> #define _TLF_RUNLATCH (1 << TLF_RUNLATCH)
>
> #ifndef __ASSEMBLER__
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index eb23966ac0a9..9237dcbeee4a 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1281,9 +1281,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
> {
> struct thread_struct *new_thread, *old_thread;
> struct task_struct *last;
> -#ifdef CONFIG_PPC_64S_HASH_MMU
> - struct ppc64_tlb_batch *batch;
> -#endif
>
> new_thread = &new->thread;
> old_thread = &current->thread;
> @@ -1291,14 +1288,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
> WARN_ON(!irqs_disabled());
>
> #ifdef CONFIG_PPC_64S_HASH_MMU
> - batch = this_cpu_ptr(&ppc64_tlb_batch);
> - if (batch->active) {
> - current_thread_info()->local_flags |= _TLF_LAZY_MMU;
> - if (batch->index)
> - __flush_tlb_pending(batch);
> - batch->active = 0;
> - }
> -
> /*
> * On POWER9 the copy-paste buffer can only paste into
> * foreign real addresses, so unprivileged processes can not
> @@ -1369,20 +1358,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
> */
>
> #ifdef CONFIG_PPC_BOOK3S_64
> -#ifdef CONFIG_PPC_64S_HASH_MMU
> - /*
> - * This applies to a process that was context switched while inside
> - * arch_enter_lazy_mmu_mode(), to re-activate the batch that was
> - * deactivated above, before _switch(). This will never be the case
> - * for new tasks.
> - */
> - if (current_thread_info()->local_flags & _TLF_LAZY_MMU) {
> - current_thread_info()->local_flags &= ~_TLF_LAZY_MMU;
> - batch = this_cpu_ptr(&ppc64_tlb_batch);
> - batch->active = 1;
> - }
> -#endif
> -
> /*
> * Math facilities are masked out of the child MSR in copy_thread.
> * A new task does not need to restore_math because it will
> --
> 2.47.0
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 03/12] powerpc/mm: implement arch_flush_lazy_mmu_mode()
2025-10-29 10:09 ` [PATCH v4 03/12] powerpc/mm: implement arch_flush_lazy_mmu_mode() Kevin Brodsky
2025-11-01 12:14 ` David Hildenbrand
@ 2025-11-05 3:15 ` Ritesh Harjani
2025-11-05 9:49 ` Ritesh Harjani
1 sibling, 1 reply; 62+ messages in thread
From: Ritesh Harjani @ 2025-11-05 3:15 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
Kevin Brodsky <kevin.brodsky@arm.com> writes:
> Upcoming changes to the lazy_mmu API will cause
> arch_flush_lazy_mmu_mode() to be called when leaving a nested
> lazy_mmu section.
>
> Move the relevant logic from arch_leave_lazy_mmu_mode() to
> arch_flush_lazy_mmu_mode() and have the former call the latter.
>
> Note: the additional this_cpu_ptr() on the
> arch_leave_lazy_mmu_mode() path will be removed in a subsequent
> patch.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> .../powerpc/include/asm/book3s/64/tlbflush-hash.h | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> index 146287d9580f..7704dbe8e88d 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> @@ -41,6 +41,16 @@ static inline void arch_enter_lazy_mmu_mode(void)
> batch->active = 1;
> }
>
> +static inline void arch_flush_lazy_mmu_mode(void)
> +{
> + struct ppc64_tlb_batch *batch;
> +
> + batch = this_cpu_ptr(&ppc64_tlb_batch);
> +
> + if (batch->index)
> + __flush_tlb_pending(batch);
> +}
> +
This looks a bit scary, since arch_flush_lazy_mmu_mode() gets called
from several places in later patches.
Although I think arch_flush_lazy_mmu_mode() will only ever be called in
the nested lazy MMU case, right?
Do you think we can add a VM_BUG_ON(radix_enabled()) above, to make
sure this never gets called in the radix_enabled() case?
I am still going over the patch series, but while reviewing this I
wanted to get your opinion.
Oh wait... there is no way of knowing the return value from
arch_enter_lazy_mmu_mode(). I think you need a similar check to return
early from arch_flush_lazy_mmu_mode() too, if radix_enabled() is true.
-ritesh
> static inline void arch_leave_lazy_mmu_mode(void)
> {
> struct ppc64_tlb_batch *batch;
> @@ -49,14 +59,11 @@ static inline void arch_leave_lazy_mmu_mode(void)
> return;
> batch = this_cpu_ptr(&ppc64_tlb_batch);
>
> - if (batch->index)
> - __flush_tlb_pending(batch);
> + arch_flush_lazy_mmu_mode();
> batch->active = 0;
> preempt_enable();
> }
>
> -#define arch_flush_lazy_mmu_mode() do {} while (0)
> -
> extern void hash__tlbiel_all(unsigned int action);
>
> extern void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize,
> --
> 2.47.0
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE
2025-10-29 10:09 ` [PATCH v4 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE Kevin Brodsky
2025-11-01 12:16 ` David Hildenbrand
@ 2025-11-05 4:40 ` Ritesh Harjani
2025-11-06 10:33 ` Kevin Brodsky
2025-11-07 13:56 ` Ryan Roberts
2 siblings, 1 reply; 62+ messages in thread
From: Ritesh Harjani @ 2025-11-05 4:40 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
Kevin Brodsky <kevin.brodsky@arm.com> writes:
> Architectures currently opt in for implementing lazy_mmu helpers by
> defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE.
>
> In preparation for introducing a generic lazy_mmu layer that will
> require storage in task_struct, let's switch to a cleaner approach:
> instead of defining a macro, select a CONFIG option.
>
> This patch introduces CONFIG_ARCH_HAS_LAZY_MMU_MODE and has each
> arch select it when it implements lazy_mmu helpers.
> __HAVE_ARCH_ENTER_LAZY_MMU_MODE is removed and <linux/pgtable.h>
> relies on the new CONFIG instead.
>
> On x86, lazy_mmu helpers are only implemented if PARAVIRT_XXL is
> selected. This creates some complications in arch/x86/boot/, because
> a few files manually undefine PARAVIRT* options. As a result
> <asm/paravirt.h> does not define the lazy_mmu helpers, but this
> breaks the build as <linux/pgtable.h> only defines them if
> !CONFIG_ARCH_HAS_LAZY_MMU_MODE. There does not seem to be a clean
> way out of this - let's just undefine that new CONFIG too.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/pgtable.h | 1 -
> arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 2 --
> arch/powerpc/platforms/Kconfig.cputype | 1 +
> arch/sparc/Kconfig | 1 +
> arch/sparc/include/asm/tlbflush_64.h | 2 --
> arch/x86/Kconfig | 1 +
> arch/x86/boot/compressed/misc.h | 1 +
> arch/x86/boot/startup/sme.c | 1 +
> arch/x86/include/asm/paravirt.h | 1 -
> include/linux/pgtable.h | 2 +-
> mm/Kconfig | 3 +++
> 12 files changed, 10 insertions(+), 7 deletions(-)
Maybe we can add this to ... ?
Documentation/features/vm/lazy_mmu/arch-support.txt
#
# Feature name: lazy_mmu mode
# Kconfig: ARCH_HAS_LAZY_MMU_MODE
# description: arch supports arch_{enter|flush|leave}_lazy_mmu_mode()
#
-----------------------
| arch |status|
-----------------------
| arm64: | ok |
| powerpc: | ok |
| sparc: | ok |
| x86: | ok |
-----------------------
As for this patch, the changes around the config part are mostly
straightforward. This looks good to me. Please feel free to add:
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
-ritesh
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-10-29 10:09 ` [PATCH v4 07/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
2025-10-29 16:41 ` Alexander Gordeev
2025-11-01 12:22 ` David Hildenbrand
@ 2025-11-05 8:49 ` Ritesh Harjani
2025-11-05 16:12 ` Alexander Gordeev
2025-11-07 14:59 ` Ryan Roberts
3 siblings, 1 reply; 62+ messages in thread
From: Ritesh Harjani @ 2025-11-05 8:49 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
Kevin Brodsky <kevin.brodsky@arm.com> writes:
> Despite recent efforts to prevent lazy_mmu sections from nesting, it
> remains difficult to ensure that it never occurs - and in fact it
> does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC).
> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
> made nesting tolerable on arm64, but without truly supporting it:
> the inner call to leave() disables the batching optimisation before
> the outer section ends.
>
> This patch actually enables lazy_mmu sections to nest by tracking
> the nesting level in task_struct, in a similar fashion to e.g.
> pagefault_{enable,disable}(). This is fully handled by the generic
> lazy_mmu helpers that were recently introduced.
>
> lazy_mmu sections were not initially intended to nest, so we need to
> clarify the semantics w.r.t. the arch_*_lazy_mmu_mode() callbacks.
> This patch takes the following approach:
>
> * The outermost calls to lazy_mmu_mode_{enable,disable}() trigger
> calls to arch_{enter,leave}_lazy_mmu_mode() - this is unchanged.
>
> * Nested calls to lazy_mmu_mode_{enable,disable}() are not forwarded
> to the arch via arch_{enter,leave} - lazy MMU remains enabled so
> the assumption is that these callbacks are not relevant. However,
> existing code may rely on a call to disable() to flush any batched
> state, regardless of nesting. arch_flush_lazy_mmu_mode() is
> therefore called in that situation.
>
> A separate interface was recently introduced to temporarily pause
> the lazy MMU mode: lazy_mmu_mode_{pause,resume}(). pause() fully
> exits the mode *regardless of the nesting level*, and resume()
> restores the mode at the same nesting level.
>
> Whether the mode is actually enabled or not at any point is tracked
> by a separate "active" field in task_struct; this makes it possible
> to check invariants in the generic API, and to expose a new
> in_lazy_mmu_mode() helper to replace the various ways arch's
> currently track whether the mode is enabled (this will be done in
> later patches).
>
> In summary (nesting/active represent the values *after* the call):
>
> lazy_mmu_mode_enable() -> arch_enter() nesting=1 active=1
> lazy_mmu_mode_enable() -> ø nesting=2 active=1
> lazy_mmu_mode_pause() -> arch_leave() nesting=2 active=0
> lazy_mmu_mode_resume() -> arch_enter() nesting=2 active=1
> lazy_mmu_mode_disable() -> arch_flush() nesting=1 active=1
> lazy_mmu_mode_disable() -> arch_leave() nesting=0 active=0
>
> Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
> headers included by <linux/pgtable.h> to use it.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 12 ------
> include/linux/mm_types_task.h | 5 +++
> include/linux/pgtable.h | 67 ++++++++++++++++++++++++++++++--
> include/linux/sched.h | 16 ++++++++
> 4 files changed, 84 insertions(+), 16 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 54f8d6bb6f22..535435248923 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -82,18 +82,6 @@ static inline void queue_pte_barriers(void)
>
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> - /*
> - * lazy_mmu_mode is not supposed to permit nesting. But in practice this
> - * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
> - * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
> - * permissions on the linear map with apply_to_page_range(), which
> - * re-enters lazy_mmu_mode. So we tolerate nesting in our
> - * implementation. The first call to arch_leave_lazy_mmu_mode() will
> - * flush and clear the flag such that the remainder of the work in the
> - * outer nest behaves as if outside of lazy mmu mode. This is safe and
> - * keeps tracking simple.
> - */
> -
> if (in_interrupt())
> return;
>
> diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
> index a82aa80c0ba4..632d404f8191 100644
> --- a/include/linux/mm_types_task.h
> +++ b/include/linux/mm_types_task.h
> @@ -88,4 +88,9 @@ struct tlbflush_unmap_batch {
> #endif
> };
>
> +struct lazy_mmu_state {
> + u8 nesting_level;
> + bool active;
> +};
> +
> #endif /* _LINUX_MM_TYPES_TASK_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b5fdf32c437f..e6064e00b22d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -228,27 +228,86 @@ static inline int pmd_dirty(pmd_t pmd)
> * of the lazy mode. So the implementation must assume preemption may be enabled
> * and cpu migration is possible; it must take steps to be robust against this.
> * (In practice, for user PTE updates, the appropriate page table lock(s) are
> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> - * and the mode cannot be used in interrupt context.
> + * held, but for kernel PTE updates, no lock is held). The mode cannot be used
> + * in interrupt context.
> + *
> + * The lazy MMU mode is enabled for a given block of code using:
> + *
> + * lazy_mmu_mode_enable();
> + * <code>
> + * lazy_mmu_mode_disable();
> + *
> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
> + * A nested call to enable() has no functional effect; however disable() causes
> + * any batched architectural state to be flushed regardless of nesting. After a
> + * call to disable(), the caller can therefore rely on all previous page table
> + * modifications to have taken effect, but the lazy MMU mode may still be
> + * enabled.
> + *
> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
> + * This can be done using:
> + *
> + * lazy_mmu_mode_pause();
> + * <code>
> + * lazy_mmu_mode_resume();
> + *
> + * This sequence must only be used if the lazy MMU mode is already enabled.
> + * pause() ensures that the mode is exited regardless of the nesting level;
> + * resume() re-enters the mode at the same nesting level. <code> must not modify
> + * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
> + * helpers).
> + *
> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> + * currently enabled.
> */
> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> static inline void lazy_mmu_mode_enable(void)
> {
> - arch_enter_lazy_mmu_mode();
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> + /* enable() must not be called while paused */
> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> +
> + if (state->nesting_level++ == 0) {
> + state->active = true;
> + arch_enter_lazy_mmu_mode();
> + }
> }
Some architectures disable preemption in their
arch_enter_lazy_mmu_mode(). So shouldn't state->active = true happen
after arch_enter_lazy_mmu_mode() has disabled preemption? i.e.
static inline void lazy_mmu_mode_enable(void)
{
- arch_enter_lazy_mmu_mode();
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
+ /* enable() must not be called while paused */
+ VM_WARN_ON(state->nesting_level > 0 && !state->active);
+
+ if (state->nesting_level++ == 0) {
+ arch_enter_lazy_mmu_mode();
+ state->active = true;
+ }
}
... I think it makes more sense to set the state to active after the
arch_* call, right?
>
> static inline void lazy_mmu_mode_disable(void)
> {
> - arch_leave_lazy_mmu_mode();
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON_ONCE(state->nesting_level == 0);
> + VM_WARN_ON(!state->active);
> +
> + if (--state->nesting_level == 0) {
> + state->active = false;
> + arch_leave_lazy_mmu_mode();
> + } else {
> + /* Exiting a nested section */
> + arch_flush_lazy_mmu_mode();
> + }
> }
This looks ok though.
>
> static inline void lazy_mmu_mode_pause(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
> +
> + state->active = false;
> arch_leave_lazy_mmu_mode();
> }
>
> static inline void lazy_mmu_mode_resume(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON(state->nesting_level == 0 || state->active);
> +
> + state->active = true;
> arch_enter_lazy_mmu_mode();
> }
Ditto.
-ritesh
> #else
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index cbb7340c5866..11566d973f42 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1441,6 +1441,10 @@ struct task_struct {
>
> struct page_frag task_frag;
>
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> + struct lazy_mmu_state lazy_mmu_state;
> +#endif
> +
> #ifdef CONFIG_TASK_DELAY_ACCT
> struct task_delay_info *delays;
> #endif
> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
> return task_index_to_char(task_state_index(tsk));
> }
>
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return current->lazy_mmu_state.active;
> +}
> +#else
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return false;
> +}
> +#endif
> +
> extern struct pid *cad_pid;
>
> /*
> --
> 2.47.0
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 09/12] powerpc/mm: replace batch->active with in_lazy_mmu_mode()
2025-10-29 10:09 ` [PATCH v4 09/12] powerpc/mm: replace batch->active " Kevin Brodsky
2025-11-03 16:05 ` David Hildenbrand
@ 2025-11-05 9:40 ` Ritesh Harjani
1 sibling, 0 replies; 62+ messages in thread
From: Ritesh Harjani @ 2025-11-05 9:40 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
Kevin Brodsky <kevin.brodsky@arm.com> writes:
> A per-CPU batch struct is activated when entering lazy MMU mode; its
> lifetime is the same as the lazy MMU section (it is deactivated when
> leaving the mode). Preemption is disabled in that interval to ensure
> that the per-CPU reference remains valid.
>
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode. We can therefore use the generic helper in_lazy_mmu_mode()
> to tell whether a batch struct is active instead of tracking it
> explicitly.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 9 ---------
> arch/powerpc/mm/book3s64/hash_tlb.c | 2 +-
> 2 files changed, 1 insertion(+), 10 deletions(-)
>
This looks good to me.
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 03/12] powerpc/mm: implement arch_flush_lazy_mmu_mode()
2025-11-05 3:15 ` Ritesh Harjani
@ 2025-11-05 9:49 ` Ritesh Harjani
2025-11-06 10:31 ` Kevin Brodsky
0 siblings, 1 reply; 62+ messages in thread
From: Ritesh Harjani @ 2025-11-05 9:49 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> Kevin Brodsky <kevin.brodsky@arm.com> writes:
>
>> Upcoming changes to the lazy_mmu API will cause
>> arch_flush_lazy_mmu_mode() to be called when leaving a nested
>> lazy_mmu section.
>>
>> Move the relevant logic from arch_leave_lazy_mmu_mode() to
>> arch_flush_lazy_mmu_mode() and have the former call the latter.
>>
>> Note: the additional this_cpu_ptr() on the
>> arch_leave_lazy_mmu_mode() path will be removed in a subsequent
>> patch.
>>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>> .../powerpc/include/asm/book3s/64/tlbflush-hash.h | 15 +++++++++++----
>> 1 file changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>> index 146287d9580f..7704dbe8e88d 100644
>> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>> @@ -41,6 +41,16 @@ static inline void arch_enter_lazy_mmu_mode(void)
>> batch->active = 1;
>> }
>>
>> +static inline void arch_flush_lazy_mmu_mode(void)
>> +{
>> + struct ppc64_tlb_batch *batch;
>> +
>> + batch = this_cpu_ptr(&ppc64_tlb_batch);
>> +
>> + if (batch->index)
>> + __flush_tlb_pending(batch);
>> +}
>> +
>
> This looks a bit scary since arch_flush_lazy_mmu_mode() is getting
> called from several of the places in later patches().
>
> Although I think arch_flush_lazy_mmu_mode() will only always be called
> in nested lazy mmu case right?
>
> Do you think we can add a VM_BUG_ON(radix_enabled()); in above to make
> sure the above never gets called in radix_enabled() case.
>
> I am still going over the patch series, but while reviewing this I
> wanted to take your opinion.
>
> Ohh wait.. There is no way of knowing the return value from
> arch_enter_lazy_mmu_mode().. I think you might need a similar check to
> return from arch_flush_lazy_mmu_mode() too, if radix_enabled() is true.
>
Now that I have gone through this series, it seems plausible that,
since lazy MMU mode supports nesting, arch_flush_lazy_mmu_mode() can
get called while lazy MMU is active due to nesting.
That means we should add the radix_enabled() check I was talking about
above, i.e.
@@ -38,6 +38,9 @@ static inline void arch_flush_lazy_mmu_mode(void)
{
struct ppc64_tlb_batch *batch;
+ if (radix_enabled())
+ return;
+
batch = this_cpu_ptr(&ppc64_tlb_batch);
if (batch->index)
Correct? Even without it I don't think there should be a problem,
because batch->index is only valid with hash, but I still think we can
add the above check so that we don't have to call this_cpu_ptr() to
check batch->index whenever a flush is requested.
-ritesh
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-11-05 8:49 ` Ritesh Harjani
@ 2025-11-05 16:12 ` Alexander Gordeev
2025-11-06 10:51 ` Kevin Brodsky
2025-11-06 16:32 ` Ritesh Harjani
0 siblings, 2 replies; 62+ messages in thread
From: Alexander Gordeev @ 2025-11-05 16:12 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Kevin Brodsky, linux-mm, linux-kernel, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On Wed, Nov 05, 2025 at 02:19:03PM +0530, Ritesh Harjani wrote:
> > + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> > + * currently enabled.
> > */
> > #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> > static inline void lazy_mmu_mode_enable(void)
> > {
> > - arch_enter_lazy_mmu_mode();
> > + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> > +
> > + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> > + /* enable() must not be called while paused */
> > + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> > +
> > + if (state->nesting_level++ == 0) {
> > + state->active = true;
> > + arch_enter_lazy_mmu_mode();
> > + }
> > }
>
> Some architectures disables preemption in their
> arch_enter_lazy_mmu_mode(). So shouldn't the state->active = true should
> happen after arch_enter_lazy_mmu_mode() has disabled preemption()? i.e.
Do you have some scenario in mind that could cause an issue?
IOW, what could go wrong if the process is scheduled to another
CPU before preempt_disable() is called?
> static inline void lazy_mmu_mode_enable(void)
> {
> - arch_enter_lazy_mmu_mode();
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> + /* enable() must not be called while paused */
> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> +
> + if (state->nesting_level++ == 0) {
> + arch_enter_lazy_mmu_mode();
> + state->active = true;
> + }
> }
>
> ... I think it make more sense to enable the state after the arch_**
> call right.
But then in_lazy_mmu_mode() would return false if called from
arch_enter_lazy_mmu_mode(). Not a big problem, but still...
> > static inline void lazy_mmu_mode_disable(void)
> > {
> > - arch_leave_lazy_mmu_mode();
> > + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> > +
> > + VM_WARN_ON_ONCE(state->nesting_level == 0);
> > + VM_WARN_ON(!state->active);
> > +
> > + if (--state->nesting_level == 0) {
> > + state->active = false;
> > + arch_leave_lazy_mmu_mode();
> > + } else {
> > + /* Exiting a nested section */
> > + arch_flush_lazy_mmu_mode();
> > + }
> > }
>
> This looks ok though.
>
> >
> > static inline void lazy_mmu_mode_pause(void)
> > {
> > + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> > +
> > + VM_WARN_ON(state->nesting_level == 0 || !state->active);
> > +
> > + state->active = false;
> > arch_leave_lazy_mmu_mode();
> > }
> >
> > static inline void lazy_mmu_mode_resume(void)
> > {
> > + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> > +
> > + VM_WARN_ON(state->nesting_level == 0 || state->active);
> > +
> > + state->active = true;
> > arch_enter_lazy_mmu_mode();
> > }
>
> Ditto.
>
> -ritesh
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 01/12] powerpc/64s: Do not re-activate batched TLB flush
2025-11-05 2:46 ` Ritesh Harjani
@ 2025-11-06 10:29 ` Kevin Brodsky
2025-11-08 0:35 ` Ritesh Harjani
0 siblings, 1 reply; 62+ messages in thread
From: Kevin Brodsky @ 2025-11-06 10:29 UTC (permalink / raw)
To: Ritesh Harjani (IBM), linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86, Venkat Rao Bagalkote
On 05/11/2025 02:46, Ritesh Harjani (IBM) wrote:
> Kevin Brodsky <kevin.brodsky@arm.com> writes:
>
>> From: Alexander Gordeev <agordeev@linux.ibm.com>
>>
>> Since commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash
>> lazy mmu mode") a task can not be preempted while in lazy MMU mode.
>> Therefore, the batch re-activation code is never called, so remove it.
>>
>> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>> arch/powerpc/include/asm/thread_info.h | 2 --
>> arch/powerpc/kernel/process.c | 25 -------------------------
>> 2 files changed, 27 deletions(-)
>>
> Since the commit referenced in above disables the preemption in
> arch_enter_lazy_mmu(), so the expectation is that we will never be
> context switched while in lazy_mmu, hence the code changes in
> switch_to() around __flush_tlb_pending() should ideally never be called.
Correct, that's the idea.
> With this analysis - the patch looks good to me. I will give this entire
> patch series a try on Power HW with Hash mmu too (which uses lazy mmu and
> let you know the results of that)!
That'd be very appreciated, thanks a lot!
> For this patch please feel free to add:
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>
>
> CC: Venkat who also runs CI on linux Power HW for upstream testing :)
Ack, will Cc you both in the next version.
- Kevin
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 03/12] powerpc/mm: implement arch_flush_lazy_mmu_mode()
2025-11-05 9:49 ` Ritesh Harjani
@ 2025-11-06 10:31 ` Kevin Brodsky
0 siblings, 0 replies; 62+ messages in thread
From: Kevin Brodsky @ 2025-11-06 10:31 UTC (permalink / raw)
To: Ritesh Harjani (IBM), linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 05/11/2025 09:49, Ritesh Harjani (IBM) wrote:
> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
>
>> Kevin Brodsky <kevin.brodsky@arm.com> writes:
>>
>>> Upcoming changes to the lazy_mmu API will cause
>>> arch_flush_lazy_mmu_mode() to be called when leaving a nested
>>> lazy_mmu section.
>>>
>>> Move the relevant logic from arch_leave_lazy_mmu_mode() to
>>> arch_flush_lazy_mmu_mode() and have the former call the latter.
>>>
>>> Note: the additional this_cpu_ptr() on the
>>> arch_leave_lazy_mmu_mode() path will be removed in a subsequent
>>> patch.
>>>
>>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>>> ---
>>> .../powerpc/include/asm/book3s/64/tlbflush-hash.h | 15 +++++++++++----
>>> 1 file changed, 11 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>>> index 146287d9580f..7704dbe8e88d 100644
>>> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>>> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
>>> @@ -41,6 +41,16 @@ static inline void arch_enter_lazy_mmu_mode(void)
>>> batch->active = 1;
>>> }
>>>
>>> +static inline void arch_flush_lazy_mmu_mode(void)
>>> +{
>>> + struct ppc64_tlb_batch *batch;
>>> +
>>> + batch = this_cpu_ptr(&ppc64_tlb_batch);
>>> +
>>> + if (batch->index)
>>> + __flush_tlb_pending(batch);
>>> +}
>>> +
>> This looks a bit scary since arch_flush_lazy_mmu_mode() is getting
>> called from several of the places in later patches().
>>
>> Although I think arch_flush_lazy_mmu_mode() will only always be called
>> in nested lazy mmu case right?
>>
>> Do you think we can add a VM_BUG_ON(radix_enabled()); in above to make
>> sure the above never gets called in radix_enabled() case.
>>
>> I am still going over the patch series, but while reviewing this I
>> wanted to take your opinion.
>>
>> Ohh wait.. There is no way of knowing the return value from
>> arch_enter_lazy_mmu_mode().. I think you might need a similar check to
>> return from arch_flush_lazy_mmu_mode() too, if radix_enabled() is true.
>>
> Now that I have gone through this series, it seems plaussible that since
> lazy mmu mode supports nesting, arch_flush_lazy_mmu_mode() can get
> called while the lazy mmu is active due to nesting..
>
> That means we should add the radix_enabled() check as I was talking in
> above i.e.
>
> @@ -38,6 +38,9 @@ static inline void arch_flush_lazy_mmu_mode(void)
> {
> struct ppc64_tlb_batch *batch;
>
> + if (radix_enabled())
> + return;
> +
> batch = this_cpu_ptr(&ppc64_tlb_batch);
>
> if (batch->index)
>
> Correct? Although otherwise also I don't think it should be a problem
> because batch->index is only valid during hash, but I still think we can
> add above check so that we don't have to call this_cpu_ptr() to check
> for batch->index whenever flush is being called.
You're right! I missed this because v3 had an extra patch (13) that
turned all the lazy_mmu_mode_* into no-ops if radix_enabled(). The
optimisation didn't seem to be worth the noise so I dropped it, but it
does mean that arch_flush() will now be called in the nested case
regardless of radix_enabled().
Will fix in v5, thanks!
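Concretely, something like your earlier suggestion - sketch of what I
intend for v5:

	static inline void arch_flush_lazy_mmu_mode(void)
	{
		struct ppc64_tlb_batch *batch;

		/* Nothing is batched with the radix MMU */
		if (radix_enabled())
			return;

		batch = this_cpu_ptr(&ppc64_tlb_batch);

		if (batch->index)
			__flush_tlb_pending(batch);
	}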
- Kevin
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE
2025-11-05 4:40 ` Ritesh Harjani
@ 2025-11-06 10:33 ` Kevin Brodsky
0 siblings, 0 replies; 62+ messages in thread
From: Kevin Brodsky @ 2025-11-06 10:33 UTC (permalink / raw)
To: Ritesh Harjani (IBM), linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 05/11/2025 04:40, Ritesh Harjani (IBM) wrote:
> Kevin Brodsky <kevin.brodsky@arm.com> writes:
>
>> Architectures currently opt in for implementing lazy_mmu helpers by
>> defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE.
>>
>> In preparation for introducing a generic lazy_mmu layer that will
>> require storage in task_struct, let's switch to a cleaner approach:
>> instead of defining a macro, select a CONFIG option.
>>
>> This patch introduces CONFIG_ARCH_HAS_LAZY_MMU_MODE and has each
>> arch select it when it implements lazy_mmu helpers.
>> __HAVE_ARCH_ENTER_LAZY_MMU_MODE is removed and <linux/pgtable.h>
>> relies on the new CONFIG instead.
>>
>> On x86, lazy_mmu helpers are only implemented if PARAVIRT_XXL is
>> selected. This creates some complications in arch/x86/boot/, because
>> a few files manually undefine PARAVIRT* options. As a result
>> <asm/paravirt.h> does not define the lazy_mmu helpers, but this
>> breaks the build as <linux/pgtable.h> only defines them if
>> !CONFIG_ARCH_HAS_LAZY_MMU_MODE. There does not seem to be a clean
>> way out of this - let's just undefine that new CONFIG too.
>>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>> arch/arm64/Kconfig | 1 +
>> arch/arm64/include/asm/pgtable.h | 1 -
>> arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 2 --
>> arch/powerpc/platforms/Kconfig.cputype | 1 +
>> arch/sparc/Kconfig | 1 +
>> arch/sparc/include/asm/tlbflush_64.h | 2 --
>> arch/x86/Kconfig | 1 +
>> arch/x86/boot/compressed/misc.h | 1 +
>> arch/x86/boot/startup/sme.c | 1 +
>> arch/x86/include/asm/paravirt.h | 1 -
>> include/linux/pgtable.h | 2 +-
>> mm/Kconfig | 3 +++
>> 12 files changed, 10 insertions(+), 7 deletions(-)
> Maybe we can add this to ... ?
>
> Documentation/features/vm/lazy_mmu/arch-support.txt
>
> #
> # Feature name: lazy_mmu mode
> # Kconfig: ARCH_HAS_LAZY_MMU_MODE
> # description: arch supports arch_{enter|flush|leave}_lazy_mmu_mode()
> #
> -----------------------
> | arch |status|
> -----------------------
> | arm64: | ok |
> | powerpc: | ok |
> | sparc: | ok |
> | x86: | ok |
> -----------------------
That's an interesting idea but I'm not sure it really makes sense for
lazy MMU? AFAIU these arch-support.txt files are meant to help identify
which generic features an arch has support for. Lazy MMU isn't really a
feature though, in the sense that what it does is entirely defined by
the arch. This patch does introduce a generic layer, but ultimately it
remains a collection of arch hooks.
> As for this patch, the changes are mostly straight forward around the
> configs part. This looks good to me. Please feel free to add:
>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Thanks for the review!
- Kevin
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-11-05 16:12 ` Alexander Gordeev
@ 2025-11-06 10:51 ` Kevin Brodsky
2025-11-06 15:33 ` Alexander Gordeev
2025-11-06 16:32 ` Ritesh Harjani
1 sibling, 1 reply; 62+ messages in thread
From: Kevin Brodsky @ 2025-11-06 10:51 UTC (permalink / raw)
To: Alexander Gordeev, Ritesh Harjani
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 05/11/2025 16:12, Alexander Gordeev wrote:
> On Wed, Nov 05, 2025 at 02:19:03PM +0530, Ritesh Harjani wrote:
>>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>>> + * currently enabled.
>>> */
>>> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>> static inline void lazy_mmu_mode_enable(void)
>>> {
>>> - arch_enter_lazy_mmu_mode();
> >>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>> +
>>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>>> + /* enable() must not be called while paused */
>>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>>> +
>>> + if (state->nesting_level++ == 0) {
>>> + state->active = true;
>>> + arch_enter_lazy_mmu_mode();
>>> + }
>>> }
>> Some architectures disables preemption in their
>> arch_enter_lazy_mmu_mode(). So shouldn't the state->active = true should
>> happen after arch_enter_lazy_mmu_mode() has disabled preemption()? i.e.
> Do you have some scenario in mind that could cause an issue?
> IOW, what could go wrong if the process is scheduled to another
> CPU before preempt_disable() is called?
I'm not sure I understand the issue either.
>> static inline void lazy_mmu_mode_enable(void)
>> {
>> - arch_enter_lazy_mmu_mode();
> >> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>> + /* enable() must not be called while paused */
>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>> +
>> + if (state->nesting_level++ == 0) {
>> + arch_enter_lazy_mmu_mode();
>> + state->active = true;
>> + }
>> }
>>
>> ... I think it make more sense to enable the state after the arch_**
>> call right.
> But then in_lazy_mmu_mode() would return false if called from
> arch_enter_lazy_mmu_mode(). Not big problem, but still..
The ordering of nesting_level/active was the way you expected in v3, but
the conclusion of the discussion with David H [1] is that it doesn't
really matter so I simplified the ordering in v4 - the arch hooks
shouldn't call in_lazy_mmu_mode() or inspect lazy_mmu_state.
arch_enter()/arch_leave() shouldn't need it anyway since they're called
once per outer section (not in nested sections). arch_flush() could
potentially do something different when nested, but that seems unlikely.
- Kevin
[1]
https://lore.kernel.org/all/af4414b6-617c-4dc8-bddc-3ea00d1f6f3b@redhat.com/
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-11-06 10:51 ` Kevin Brodsky
@ 2025-11-06 15:33 ` Alexander Gordeev
2025-11-07 10:16 ` Kevin Brodsky
0 siblings, 1 reply; 62+ messages in thread
From: Alexander Gordeev @ 2025-11-06 15:33 UTC (permalink / raw)
To: Kevin Brodsky
Cc: Ritesh Harjani, linux-mm, linux-kernel, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On Thu, Nov 06, 2025 at 10:51:43AM +0000, Kevin Brodsky wrote:
> On 05/11/2025 16:12, Alexander Gordeev wrote:
> > On Wed, Nov 05, 2025 at 02:19:03PM +0530, Ritesh Harjani wrote:
> >>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> >>> + * currently enabled.
> >>> */
> >>> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> >>> static inline void lazy_mmu_mode_enable(void)
> >>> {
> >>> - arch_enter_lazy_mmu_mode();
> >>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> >>> +
> >>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> >>> + /* enable() must not be called while paused */
> >>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> >>> +
> >>> + if (state->nesting_level++ == 0) {
> >>> + state->active = true;
> >>> + arch_enter_lazy_mmu_mode();
> >>> + }
> >>> }
> >> Some architectures disable preemption in their
> >> arch_enter_lazy_mmu_mode(). So shouldn't the state->active = true
> >> happen after arch_enter_lazy_mmu_mode() has disabled preemption? i.e.
> > Do you have some scenario in mind that could cause an issue?
> > IOW, what could go wrong if the process is scheduled to another
> > CPU before preempt_disable() is called?
>
> I'm not sure I understand the issue either.
>
> >> static inline void lazy_mmu_mode_enable(void)
> >> {
> >> - arch_enter_lazy_mmu_mode();
> >> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> >> +
> >> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> >> + /* enable() must not be called while paused */
> >> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> >> +
> >> + if (state->nesting_level++ == 0) {
> >> + arch_enter_lazy_mmu_mode();
> >> + state->active = true;
> >> + }
> >> }
> >>
> >> ... I think it makes more sense to enable the state after the arch_**
> >> call, right?
> > But then in_lazy_mmu_mode() would return false if called from
> > arch_enter_lazy_mmu_mode(). Not a big problem, but still...
>
> The ordering of nesting_level/active was the way you expected in v3, but
> the conclusion of the discussion with David H [1] is that it doesn't
> really matter so I simplified the ordering in v4 - the arch hooks
> shouldn't call in_lazy_mmu_mode() or inspect lazy_mmu_state.
> arch_enter()/arch_leave() shouldn't need it anyway since they're called
> once per outer section (not in nested sections). arch_flush() could
> potentially do something different when nested, but that seems unlikely.
>
> - Kevin
>
> [1]
> https://lore.kernel.org/all/af4414b6-617c-4dc8-bddc-3ea00d1f6f3b@redhat.com/
I might be misunderstanding this conversation, but it looked to me like a
discussion about the lazy_mmu_state::nesting_level value, not lazy_mmu_state::active.
I do use the in_lazy_mmu_mode() (lazy_mmu_state::active) check from the arch
callbacks. Here is an example (and likely the only case so far) where it hits:
static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
                                      void *_data)
{
        lazy_mmu_mode_pause();
        ...
        if (likely(pte_none(ptep_get(ptep)))) {
                /* Here set_pte() checks whether we are in lazy_mmu mode */
                set_pte_at(&init_mm, addr, ptep, pte); <--- calls set_pte()
                data->pages[index] = NULL;
        }
        ...
        lazy_mmu_mode_resume();
        ...
}
So without the in_lazy_mmu_mode() check above, the arch-specific set_pte()
implementation takes the wrong branch, which ends up in:
[ 394.503134] Call Trace:
[ 394.503137] [<00007fffe01333f4>] dump_stack_lvl+0xbc/0xf0
[ 394.503143] [<00007fffe010298c>] vpanic+0x1cc/0x418
[ 394.503149] [<00007fffe0102c7a>] panic+0xa2/0xa8
[ 394.503154] [<00007fffe01e7a8a>] check_panic_on_warn+0x8a/0xb0
[ 394.503160] [<00007fffe082d122>] end_report+0x72/0x110
[ 394.503166] [<00007fffe082d3e6>] kasan_report+0xc6/0x100
[ 394.503171] [<00007fffe01b9556>] ipte_batch_ptep_get+0x146/0x150
[ 394.503176] [<00007fffe0830096>] kasan_populate_vmalloc_pte+0xe6/0x1e0
[ 394.503183] [<00007fffe0718050>] apply_to_pte_range+0x1a0/0x570
[ 394.503189] [<00007fffe07260fa>] __apply_to_page_range+0x3ca/0x8f0
[ 394.503195] [<00007fffe0726648>] apply_to_page_range+0x28/0x40
[ 394.503201] [<00007fffe082fe34>] __kasan_populate_vmalloc+0x324/0x340
[ 394.503207] [<00007fffe076954e>] alloc_vmap_area+0x31e/0xbf0
[ 394.503213] [<00007fffe0770106>] __get_vm_area_node+0x1a6/0x2d0
[ 394.503218] [<00007fffe07716fa>] __vmalloc_node_range_noprof+0xba/0x260
[ 394.503224] [<00007fffe0771970>] __vmalloc_node_noprof+0xd0/0x110
[ 394.503229] [<00007fffe0771a22>] vmalloc_noprof+0x32/0x40
[ 394.503234] [<00007fff604eaa42>] full_fit_alloc_test+0xb2/0x3e0 [test_vmalloc]
[ 394.503241] [<00007fff604eb478>] test_func+0x488/0x760 [test_vmalloc]
[ 394.503247] [<00007fffe025ad68>] kthread+0x368/0x630
[ 394.503253] [<00007fffe01391e0>] __ret_from_fork+0xd0/0x490
[ 394.503259] [<00007fffe24e468a>] ret_from_fork+0xa/0x30
I could have cached lazy_mmu_state::active as arch-specific data
and checked it, but then what is the point of generalizing it?
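For reference, the shape of that check is roughly the following. Illustrative
only, not the actual s390 set_pte(); arch_batch_pte_update() is a made-up
placeholder for the batching machinery:

/* Illustrative sketch only; not the real arch implementation */
static inline void set_pte_sketch(pte_t *ptep, pte_t pte)
{
        if (in_lazy_mmu_mode())
                arch_batch_pte_update(ptep, pte);	/* hypothetical batching helper */
        else
                WRITE_ONCE(*ptep, pte);			/* apply the update immediately */
}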
Thanks!
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-11-05 16:12 ` Alexander Gordeev
2025-11-06 10:51 ` Kevin Brodsky
@ 2025-11-06 16:32 ` Ritesh Harjani
2025-11-06 17:01 ` Ritesh Harjani
2025-11-07 11:13 ` Kevin Brodsky
1 sibling, 2 replies; 62+ messages in thread
From: Ritesh Harjani @ 2025-11-06 16:32 UTC (permalink / raw)
To: Alexander Gordeev
Cc: Kevin Brodsky, linux-mm, linux-kernel, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
Alexander Gordeev <agordeev@linux.ibm.com> writes:
> On Wed, Nov 05, 2025 at 02:19:03PM +0530, Ritesh Harjani wrote:
>> > + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>> > + * currently enabled.
>> > */
>> > #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>> > static inline void lazy_mmu_mode_enable(void)
>> > {
>> > - arch_enter_lazy_mmu_mode();
>> > + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> > +
>> > + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>> > + /* enable() must not be called while paused */
>> > + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>> > +
>> > + if (state->nesting_level++ == 0) {
>> > + state->active = true;
>> > + arch_enter_lazy_mmu_mode();
>> > + }
>> > }
>>
>> Some architectures disable preemption in their
>> arch_enter_lazy_mmu_mode(). So shouldn't the state->active = true
>> happen after arch_enter_lazy_mmu_mode() has disabled preemption? i.e.
>
> Do you have some scenario in mind that could cause an issue?
>
No, not really. But it is a deviation from what the previous arch hooks were
expecting. Although, thinking this through, I don't have any use case
where this could be a problem.
But let me revisit some of the code paths in the ppc64 lazy MMU handling...
Looking at the arch-specific use case, I see we always do get_cpu_var()
to access the per-CPU batch array, which disables preemption before
accessing the per-CPU structure. This per-CPU structure is where we
batch PTE updates...
For example:

arch_enter_lazy_mmu_mode()
  hpte_need_flush()
    get_cpu_var()                 // this takes care of preempt_disable()
    adds vpns to per-cpu batch[i]
    put_cpu_var()                 // ... and of the matching preempt_enable()
arch_leave_lazy_mmu_mode()
> IOW, what could go wrong if the process is scheduled to another
> CPU before preempt_disable() is called?
So from the above, I don't think your sequence of updating
state->active = true
before calling the arch_enter hook should be a problem.
Based on the above, this looks mostly OK to me.
-ritesh
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-11-06 16:32 ` Ritesh Harjani
@ 2025-11-06 17:01 ` Ritesh Harjani
2025-11-07 11:13 ` Kevin Brodsky
1 sibling, 0 replies; 62+ messages in thread
From: Ritesh Harjani @ 2025-11-06 17:01 UTC (permalink / raw)
To: Alexander Gordeev
Cc: Kevin Brodsky, linux-mm, linux-kernel, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> For example:
>
> arch_enter_lazy_mmu_mode()
>   hpte_need_flush()
>     get_cpu_var()                 // this takes care of preempt_disable()
>     adds vpns to per-cpu batch[i]
>     put_cpu_var()                 // ... and of the matching preempt_enable()
> arch_leave_lazy_mmu_mode()
>
Sorry, here is a more accurate call sequence for the previous email.

caller()...
  arch_enter_lazy_mmu_mode()
  ptep_xxx_()
    pte_update()
      hpte_need_flush()
        get_cpu_var()                 // this takes care of preempt_disable()
        adds vpns to per-cpu batch[i]
        put_cpu_var()                 // ... and of the matching preempt_enable()
  arch_leave_lazy_mmu_mode()
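The batching pattern itself looks roughly like this. A sketch only; the
field names and the batch-full threshold are assumptions, not the actual
hpte_need_flush() body:

static void hpte_batch_sketch(unsigned long vpn)
{
        /* get_cpu_var() implies preempt_disable() */
        struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);

        if (batch->active) {
                batch->vpn[batch->index++] = vpn;	/* field names assumed */
                if (batch->index == PPC64_TLB_BATCH_NR)	/* threshold name assumed */
                        __flush_tlb_pending(batch);
        } else {
                /* not batching: the entry would be flushed immediately (omitted) */
        }

        /* put_cpu_var() implies preempt_enable() */
        put_cpu_var(ppc64_tlb_batch);
}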
-ritesh
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-11-06 15:33 ` Alexander Gordeev
@ 2025-11-07 10:16 ` Kevin Brodsky
0 siblings, 0 replies; 62+ messages in thread
From: Kevin Brodsky @ 2025-11-07 10:16 UTC (permalink / raw)
To: Alexander Gordeev
Cc: Ritesh Harjani, linux-mm, linux-kernel, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 06/11/2025 15:33, Alexander Gordeev wrote:
>> [...]
>>>> static inline void lazy_mmu_mode_enable(void)
>>>> {
>>>> - arch_enter_lazy_mmu_mode();
>>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>> +
>>>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>>>> + /* enable() must not be called while paused */
>>>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>>>> +
>>>> + if (state->nesting_level++ == 0) {
>>>> + arch_enter_lazy_mmu_mode();
>>>> + state->active = true;
>>>> + }
>>>> }
>>>>
>>>> ... I think it makes more sense to enable the state after the arch_**
>>>> call, right?
>>> But then in_lazy_mmu_mode() would return false if called from
>>> arch_enter_lazy_mmu_mode(). Not a big problem, but still...
>> The ordering of nesting_level/active was the way you expected in v3, but
>> the conclusion of the discussion with David H [1] is that it doesn't
>> really matter so I simplified the ordering in v4 - the arch hooks
>> shouldn't call in_lazy_mmu_mode() or inspect lazy_mmu_state.
>> arch_enter()/arch_leave() shouldn't need it anyway since they're called
>> once per outer section (not in nested sections). arch_flush() could
>> potentially do something different when nested, but that seems unlikely.
>>
>> - Kevin
>>
>> [1]
>> https://lore.kernel.org/all/af4414b6-617c-4dc8-bddc-3ea00d1f6f3b@redhat.com/
> I might be misunderstanding this conversation, but it looked to me like a
> discussion about the lazy_mmu_state::nesting_level value, not lazy_mmu_state::active.
>
> I do use the in_lazy_mmu_mode() (lazy_mmu_state::active) check from the arch
> callbacks. Here is an example (and likely the only case so far) where it hits:
Sorry, I didn't mean arch callbacks in general; I meant the ones called
from lazy_mmu_mode_*, that is arch_*_lazy_mmu_mode.
Patch 8 also makes use of in_lazy_mmu_mode() in set_pte() et al. on arm64.
- Kevin
> static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
>                                       void *_data)
> {
>         lazy_mmu_mode_pause();
>         ...
>         if (likely(pte_none(ptep_get(ptep)))) {
>
>                 /* Here set_pte() checks whether we are in lazy_mmu mode */
>                 set_pte_at(&init_mm, addr, ptep, pte); <--- calls set_pte()
>                 data->pages[index] = NULL;
>         }
>         ...
>         lazy_mmu_mode_resume();
>         ...
> }
>
> So without the in_lazy_mmu_mode() check above, the arch-specific set_pte()
> implementation takes the wrong branch, which ends up in:
>
> [ 394.503134] Call Trace:
> [ 394.503137] [<00007fffe01333f4>] dump_stack_lvl+0xbc/0xf0
> [ 394.503143] [<00007fffe010298c>] vpanic+0x1cc/0x418
> [ 394.503149] [<00007fffe0102c7a>] panic+0xa2/0xa8
> [ 394.503154] [<00007fffe01e7a8a>] check_panic_on_warn+0x8a/0xb0
> [ 394.503160] [<00007fffe082d122>] end_report+0x72/0x110
> [ 394.503166] [<00007fffe082d3e6>] kasan_report+0xc6/0x100
> [ 394.503171] [<00007fffe01b9556>] ipte_batch_ptep_get+0x146/0x150
> [ 394.503176] [<00007fffe0830096>] kasan_populate_vmalloc_pte+0xe6/0x1e0
> [ 394.503183] [<00007fffe0718050>] apply_to_pte_range+0x1a0/0x570
> [ 394.503189] [<00007fffe07260fa>] __apply_to_page_range+0x3ca/0x8f0
> [ 394.503195] [<00007fffe0726648>] apply_to_page_range+0x28/0x40
> [ 394.503201] [<00007fffe082fe34>] __kasan_populate_vmalloc+0x324/0x340
> [ 394.503207] [<00007fffe076954e>] alloc_vmap_area+0x31e/0xbf0
> [ 394.503213] [<00007fffe0770106>] __get_vm_area_node+0x1a6/0x2d0
> [ 394.503218] [<00007fffe07716fa>] __vmalloc_node_range_noprof+0xba/0x260
> [ 394.503224] [<00007fffe0771970>] __vmalloc_node_noprof+0xd0/0x110
> [ 394.503229] [<00007fffe0771a22>] vmalloc_noprof+0x32/0x40
> [ 394.503234] [<00007fff604eaa42>] full_fit_alloc_test+0xb2/0x3e0 [test_vmalloc]
> [ 394.503241] [<00007fff604eb478>] test_func+0x488/0x760 [test_vmalloc]
> [ 394.503247] [<00007fffe025ad68>] kthread+0x368/0x630
> [ 394.503253] [<00007fffe01391e0>] __ret_from_fork+0xd0/0x490
> [ 394.503259] [<00007fffe24e468a>] ret_from_fork+0xa/0x30
>
> I could have cached lazy_mmu_state::active as arch-specific data
> and checked it, but then what is the point of generalizing it?
>
> Thanks!
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-11-06 16:32 ` Ritesh Harjani
2025-11-06 17:01 ` Ritesh Harjani
@ 2025-11-07 11:13 ` Kevin Brodsky
1 sibling, 0 replies; 62+ messages in thread
From: Kevin Brodsky @ 2025-11-07 11:13 UTC (permalink / raw)
To: Ritesh Harjani (IBM), Alexander Gordeev
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 06/11/2025 16:32, Ritesh Harjani (IBM) wrote:
> Alexander Gordeev <agordeev@linux.ibm.com> writes:
>
>> On Wed, Nov 05, 2025 at 02:19:03PM +0530, Ritesh Harjani wrote:
>>>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>>>> + * currently enabled.
>>>> */
>>>> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>>> static inline void lazy_mmu_mode_enable(void)
>>>> {
>>>> - arch_enter_lazy_mmu_mode();
>>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>> +
>>>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>>>> + /* enable() must not be called while paused */
>>>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>>>> +
>>>> + if (state->nesting_level++ == 0) {
>>>> + state->active = true;
>>>> + arch_enter_lazy_mmu_mode();
>>>> + }
>>>> }
>>> Some architectures disable preemption in their
>>> arch_enter_lazy_mmu_mode(). So shouldn't the state->active = true
>>> happen after arch_enter_lazy_mmu_mode() has disabled preemption? i.e.
>> Do you have some scenario in mind that could cause an issue?
>>
> No, not really. But it is a deviation from what the previous arch hooks were
> expecting. Although, thinking this through, I don't have any use case
> where this could be a problem.
Which arch hook expectations are you referring to?
> But let me revisit some of the code paths in the ppc64 lazy MMU handling...
>
> Looking at the arch-specific use case, I see we always do get_cpu_var()
> to access the per-CPU batch array, which disables preemption before
> accessing the per-CPU structure. This per-CPU structure is where we
> batch PTE updates...
arch_enter() disables preemption, so accesses to per-CPU variables
anywhere in the section shouldn't be an issue either way.
The bigger picture (regarding patch 9) is that what in_lazy_mmu_mode()
returns is based on the current task's state (not a per-CPU variable),
and it is always false in interrupt context. As a result, whether preemption
is disabled or not should make no difference; only program order matters.
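i.e. conceptually something like this (not necessarily the exact helper from
the series):

static inline bool in_lazy_mmu_mode(void)
{
        /* the mode is tracked per task and never considered active in interrupt */
        if (in_interrupt())
                return false;

        return current->lazy_mmu_state.active;
}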
- Kevin
> For example:
>
> arch_enter_lazy_mmu_mode()
>   hpte_need_flush()
>     get_cpu_var()                 // this takes care of preempt_disable()
>     adds vpns to per-cpu batch[i]
>     put_cpu_var()                 // ... and of the matching preempt_enable()
> arch_leave_lazy_mmu_mode()
>
>> IOW, what could go wrong if the process is scheduled to another
>> CPU before preempt_disable() is called?
> So from the above, I don't think your sequence of updating
> state->active = true
> before calling the arch_enter hook should be a problem.
> Based on the above, this looks mostly OK to me.
>
> -ritesh
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 01/12] powerpc/64s: Do not re-activate batched TLB flush
2025-10-29 10:08 ` [PATCH v4 01/12] powerpc/64s: Do not re-activate batched TLB flush Kevin Brodsky
2025-11-01 12:05 ` David Hildenbrand
2025-11-05 2:46 ` Ritesh Harjani
@ 2025-11-07 12:25 ` Ryan Roberts
2025-11-07 12:28 ` Ryan Roberts
2 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-11-07 12:25 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 29/10/2025 10:08, Kevin Brodsky wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
>
> Since commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash
> lazy mmu mode") a task can not be preempted while in lazy MMU mode.
> Therefore, the batch re-activation code is never called, so remove it.
>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/powerpc/include/asm/thread_info.h | 2 --
> arch/powerpc/kernel/process.c | 25 -------------------------
> 2 files changed, 27 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
> index b0f200aba2b3..97f35f9b1a96 100644
> --- a/arch/powerpc/include/asm/thread_info.h
> +++ b/arch/powerpc/include/asm/thread_info.h
> @@ -154,12 +154,10 @@ void arch_setup_new_exec(void);
> /* Don't move TLF_NAPPING without adjusting the code in entry_32.S */
> #define TLF_NAPPING 0 /* idle thread enabled NAP mode */
> #define TLF_SLEEPING 1 /* suspend code enabled SLEEP mode */
> -#define TLF_LAZY_MMU 3 /* tlb_batch is active */
> #define TLF_RUNLATCH 4 /* Is the runlatch enabled? */
>
> #define _TLF_NAPPING (1 << TLF_NAPPING)
> #define _TLF_SLEEPING (1 << TLF_SLEEPING)
> -#define _TLF_LAZY_MMU (1 << TLF_LAZY_MMU)
> #define _TLF_RUNLATCH (1 << TLF_RUNLATCH)
>
> #ifndef __ASSEMBLER__
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index eb23966ac0a9..9237dcbeee4a 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1281,9 +1281,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
> {
> struct thread_struct *new_thread, *old_thread;
> struct task_struct *last;
> -#ifdef CONFIG_PPC_64S_HASH_MMU
> - struct ppc64_tlb_batch *batch;
> -#endif
>
> new_thread = &new->thread;
> old_thread = &current->thread;
> @@ -1291,14 +1288,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
> WARN_ON(!irqs_disabled());
>
> #ifdef CONFIG_PPC_64S_HASH_MMU
> - batch = this_cpu_ptr(&ppc64_tlb_batch);
> - if (batch->active) {
> - current_thread_info()->local_flags |= _TLF_LAZY_MMU;
> - if (batch->index)
> - __flush_tlb_pending(batch);
> - batch->active = 0;
> - }
> -
> /*
> * On POWER9 the copy-paste buffer can only paste into
> * foreign real addresses, so unprivileged processes can not
> @@ -1369,20 +1358,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
> */
>
> #ifdef CONFIG_PPC_BOOK3S_64
> -#ifdef CONFIG_PPC_64S_HASH_MMU
> - /*
> - * This applies to a process that was context switched while inside
> - * arch_enter_lazy_mmu_mode(), to re-activate the batch that was
> - * deactivated above, before _switch(). This will never be the case
> - * for new tasks.
> - */
> - if (current_thread_info()->local_flags & _TLF_LAZY_MMU) {
> - current_thread_info()->local_flags &= ~_TLF_LAZY_MMU;
> - batch = this_cpu_ptr(&ppc64_tlb_batch);
> - batch->active = 1;
> - }
> -#endif
> -
> /*
> * Math facilities are masked out of the child MSR in copy_thread.
> * A new task does not need to restore_math because it will
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 01/12] powerpc/64s: Do not re-activate batched TLB flush
2025-11-07 12:25 ` Ryan Roberts
@ 2025-11-07 12:28 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-11-07 12:28 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 07/11/2025 12:25, Ryan Roberts wrote:
> On 29/10/2025 10:08, Kevin Brodsky wrote:
>> From: Alexander Gordeev <agordeev@linux.ibm.com>
>>
>> Since commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash
>> lazy mmu mode") a task can not be preempted while in lazy MMU mode.
>> Therefore, the batch re-activation code is never called, so remove it.
>>
>> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
I should also add that, as far as I can tell, this was dead code because the
powerpc implementation disables preemption in a lazy MMU region. It would
probably be preferable to understand why the preemption-disabling approach was
added in the first place; perhaps it would be better to remove that and keep
this code. But given that you are not changing any current behaviour and this is
removing dead code, that's probably something for the ppc folks to look into
another day.
Thanks,
Ryan
>
>> ---
>> arch/powerpc/include/asm/thread_info.h | 2 --
>> arch/powerpc/kernel/process.c | 25 -------------------------
>> 2 files changed, 27 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
>> index b0f200aba2b3..97f35f9b1a96 100644
>> --- a/arch/powerpc/include/asm/thread_info.h
>> +++ b/arch/powerpc/include/asm/thread_info.h
>> @@ -154,12 +154,10 @@ void arch_setup_new_exec(void);
>> /* Don't move TLF_NAPPING without adjusting the code in entry_32.S */
>> #define TLF_NAPPING 0 /* idle thread enabled NAP mode */
>> #define TLF_SLEEPING 1 /* suspend code enabled SLEEP mode */
>> -#define TLF_LAZY_MMU 3 /* tlb_batch is active */
>> #define TLF_RUNLATCH 4 /* Is the runlatch enabled? */
>>
>> #define _TLF_NAPPING (1 << TLF_NAPPING)
>> #define _TLF_SLEEPING (1 << TLF_SLEEPING)
>> -#define _TLF_LAZY_MMU (1 << TLF_LAZY_MMU)
>> #define _TLF_RUNLATCH (1 << TLF_RUNLATCH)
>>
>> #ifndef __ASSEMBLER__
>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>> index eb23966ac0a9..9237dcbeee4a 100644
>> --- a/arch/powerpc/kernel/process.c
>> +++ b/arch/powerpc/kernel/process.c
>> @@ -1281,9 +1281,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
>> {
>> struct thread_struct *new_thread, *old_thread;
>> struct task_struct *last;
>> -#ifdef CONFIG_PPC_64S_HASH_MMU
>> - struct ppc64_tlb_batch *batch;
>> -#endif
>>
>> new_thread = &new->thread;
>> old_thread = &current->thread;
>> @@ -1291,14 +1288,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
>> WARN_ON(!irqs_disabled());
>>
>> #ifdef CONFIG_PPC_64S_HASH_MMU
>> - batch = this_cpu_ptr(&ppc64_tlb_batch);
>> - if (batch->active) {
>> - current_thread_info()->local_flags |= _TLF_LAZY_MMU;
>> - if (batch->index)
>> - __flush_tlb_pending(batch);
>> - batch->active = 0;
>> - }
>> -
>> /*
>> * On POWER9 the copy-paste buffer can only paste into
>> * foreign real addresses, so unprivileged processes can not
>> @@ -1369,20 +1358,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
>> */
>>
>> #ifdef CONFIG_PPC_BOOK3S_64
>> -#ifdef CONFIG_PPC_64S_HASH_MMU
>> - /*
>> - * This applies to a process that was context switched while inside
>> - * arch_enter_lazy_mmu_mode(), to re-activate the batch that was
>> - * deactivated above, before _switch(). This will never be the case
>> - * for new tasks.
>> - */
>> - if (current_thread_info()->local_flags & _TLF_LAZY_MMU) {
>> - current_thread_info()->local_flags &= ~_TLF_LAZY_MMU;
>> - batch = this_cpu_ptr(&ppc64_tlb_batch);
>> - batch->active = 1;
>> - }
>> -#endif
>> -
>> /*
>> * Math facilities are masked out of the child MSR in copy_thread.
>> * A new task does not need to restore_math because it will
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 02/12] x86/xen: simplify flush_lazy_mmu()
2025-10-29 10:08 ` [PATCH v4 02/12] x86/xen: simplify flush_lazy_mmu() Kevin Brodsky
2025-11-01 12:14 ` David Hildenbrand
@ 2025-11-07 12:31 ` Ryan Roberts
2025-11-07 15:45 ` Jürgen Groß
2 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-11-07 12:31 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 29/10/2025 10:08, Kevin Brodsky wrote:
> arch_flush_lazy_mmu_mode() is called when outstanding batched
> pgtable operations must be completed immediately. There should
> however be no need to leave and re-enter lazy MMU completely. The
> only part of that sequence that we really need is xen_mc_flush();
> call it directly.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
This looks functionally equivalent to me, so:
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
But I don't think this tidy-up is strictly necessary for your series to work?
(Perhaps I'll change my mind on that as I go through it.)
> ---
> arch/x86/xen/mmu_pv.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index 2a4a8deaf612..7a35c3393df4 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -2139,10 +2139,8 @@ static void xen_flush_lazy_mmu(void)
> {
> preempt_disable();
>
> - if (xen_get_lazy_mode() == XEN_LAZY_MMU) {
> - arch_leave_lazy_mmu_mode();
> - arch_enter_lazy_mmu_mode();
> - }
> + if (xen_get_lazy_mode() == XEN_LAZY_MMU)
> + xen_mc_flush();
>
> preempt_enable();
> }
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE
2025-10-29 10:09 ` [PATCH v4 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE Kevin Brodsky
2025-11-01 12:16 ` David Hildenbrand
2025-11-05 4:40 ` Ritesh Harjani
@ 2025-11-07 13:56 ` Ryan Roberts
2 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-11-07 13:56 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 29/10/2025 10:09, Kevin Brodsky wrote:
> Architectures currently opt in for implementing lazy_mmu helpers by
> defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE.
>
> In preparation for introducing a generic lazy_mmu layer that will
> require storage in task_struct, let's switch to a cleaner approach:
> instead of defining a macro, select a CONFIG option.
>
> This patch introduces CONFIG_ARCH_HAS_LAZY_MMU_MODE and has each
> arch select it when it implements lazy_mmu helpers.
> __HAVE_ARCH_ENTER_LAZY_MMU_MODE is removed and <linux/pgtable.h>
> relies on the new CONFIG instead.
>
> On x86, lazy_mmu helpers are only implemented if PARAVIRT_XXL is
> selected. This creates some complications in arch/x86/boot/, because
> a few files manually undefine PARAVIRT* options. As a result
> <asm/paravirt.h> does not define the lazy_mmu helpers, but this
> breaks the build as <linux/pgtable.h> only defines them if
> !CONFIG_ARCH_HAS_LAZY_MMU_MODE. There does not seem to be a clean
> way out of this - let's just undefine that new CONFIG too.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/pgtable.h | 1 -
> arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 2 --
> arch/powerpc/platforms/Kconfig.cputype | 1 +
> arch/sparc/Kconfig | 1 +
> arch/sparc/include/asm/tlbflush_64.h | 2 --
> arch/x86/Kconfig | 1 +
> arch/x86/boot/compressed/misc.h | 1 +
> arch/x86/boot/startup/sme.c | 1 +
> arch/x86/include/asm/paravirt.h | 1 -
> include/linux/pgtable.h | 2 +-
> mm/Kconfig | 3 +++
> 12 files changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 6663ffd23f25..e6bf5c7311b5 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -122,6 +122,7 @@ config ARM64
> select ARCH_WANTS_NO_INSTR
> select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
> select ARCH_HAS_UBSAN
> + select ARCH_HAS_LAZY_MMU_MODE
nit: This list is mostly in alphabetical order. Further up the list there are a
lot of ARCH_HAS_* entries. Perhaps move it to the correct position in that lot?
Then ARCH_HAS_UBSAN stays out of order on its own.
Otherwise, all looks reasonable to me:
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> select ARM_AMBA
> select ARM_ARCH_TIMER
> select ARM_GIC
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0944e296dd4a..54f8d6bb6f22 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -80,7 +80,6 @@ static inline void queue_pte_barriers(void)
> }
> }
>
> -#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> /*
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> index 7704dbe8e88d..623a8a8b2d0e 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> @@ -24,8 +24,6 @@ DECLARE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch);
>
> extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch);
>
> -#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> -
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> struct ppc64_tlb_batch *batch;
> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> index 7b527d18aa5e..2942d57cf59c 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -93,6 +93,7 @@ config PPC_BOOK3S_64
> select IRQ_WORK
> select PPC_64S_HASH_MMU if !PPC_RADIX_MMU
> select KASAN_VMALLOC if KASAN
> + select ARCH_HAS_LAZY_MMU_MODE
>
> config PPC_BOOK3E_64
> bool "Embedded processors"
> diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
> index a630d373e645..2bad14744ca4 100644
> --- a/arch/sparc/Kconfig
> +++ b/arch/sparc/Kconfig
> @@ -112,6 +112,7 @@ config SPARC64
> select NEED_PER_CPU_PAGE_FIRST_CHUNK
> select ARCH_SUPPORTS_SCHED_SMT if SMP
> select ARCH_SUPPORTS_SCHED_MC if SMP
> + select ARCH_HAS_LAZY_MMU_MODE
>
> config ARCH_PROC_KCORE_TEXT
> def_bool y
> diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
> index 925bb5d7a4e1..4e1036728e2f 100644
> --- a/arch/sparc/include/asm/tlbflush_64.h
> +++ b/arch/sparc/include/asm/tlbflush_64.h
> @@ -39,8 +39,6 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
>
> void flush_tlb_kernel_range(unsigned long start, unsigned long end);
>
> -#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> -
> void flush_tlb_pending(void);
> void arch_enter_lazy_mmu_mode(void);
> void arch_flush_lazy_mmu_mode(void);
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index fa3b616af03a..ef4332d720ab 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -804,6 +804,7 @@ config PARAVIRT
> config PARAVIRT_XXL
> bool
> depends on X86_64
> + select ARCH_HAS_LAZY_MMU_MODE
>
> config PARAVIRT_DEBUG
> bool "paravirt-ops debugging"
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index db1048621ea2..cdd7f692d9ee 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -11,6 +11,7 @@
> #undef CONFIG_PARAVIRT
> #undef CONFIG_PARAVIRT_XXL
> #undef CONFIG_PARAVIRT_SPINLOCKS
> +#undef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> #undef CONFIG_KASAN
> #undef CONFIG_KASAN_GENERIC
>
> diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
> index e7ea65f3f1d6..b76a7c95dfe1 100644
> --- a/arch/x86/boot/startup/sme.c
> +++ b/arch/x86/boot/startup/sme.c
> @@ -24,6 +24,7 @@
> #undef CONFIG_PARAVIRT
> #undef CONFIG_PARAVIRT_XXL
> #undef CONFIG_PARAVIRT_SPINLOCKS
> +#undef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>
> /*
> * This code runs before CPU feature bits are set. By default, the
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index b5e59a7ba0d0..13f9cd31c8f8 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -526,7 +526,6 @@ static inline void arch_end_context_switch(struct task_struct *next)
> PVOP_VCALL1(cpu.end_context_switch, next);
> }
>
> -#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> PVOP_VCALL0(mmu.lazy_mode.enter);
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 32e8457ad535..9894366e768b 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -231,7 +231,7 @@ static inline int pmd_dirty(pmd_t pmd)
> * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> * and the mode cannot be used in interrupt context.
> */
> -#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> +#ifndef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> static inline void arch_enter_lazy_mmu_mode(void) {}
> static inline void arch_leave_lazy_mmu_mode(void) {}
> static inline void arch_flush_lazy_mmu_mode(void) {}
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0e26f4fc8717..5480c9a1bfb2 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1372,6 +1372,9 @@ config PT_RECLAIM
> config FIND_NORMAL_PAGE
> def_bool n
>
> +config ARCH_HAS_LAZY_MMU_MODE
> + bool
> +
> source "mm/damon/Kconfig"
>
> endmenu
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
2025-10-29 10:09 ` [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers Kevin Brodsky
2025-11-01 12:18 ` David Hildenbrand
@ 2025-11-07 14:26 ` Ryan Roberts
2025-11-07 14:34 ` David Hildenbrand (Red Hat)
1 sibling, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-11-07 14:26 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 29/10/2025 10:09, Kevin Brodsky wrote:
> The implementation of the lazy MMU mode is currently entirely
> arch-specific; core code directly calls arch helpers:
> arch_{enter,leave}_lazy_mmu_mode().
>
> We are about to introduce support for nested lazy MMU sections.
> As things stand we'd have to duplicate that logic in every arch
> implementing lazy_mmu - adding to a fair amount of logic
> already duplicated across lazy_mmu implementations.
>
> This patch therefore introduces a new generic layer that calls the
> existing arch_* helpers. Two pairs of calls are introduced:
>
> * lazy_mmu_mode_enable() ... lazy_mmu_mode_disable()
> This is the standard case where the mode is enabled for a given
> block of code by surrounding it with enable() and disable()
> calls.
>
> * lazy_mmu_mode_pause() ... lazy_mmu_mode_resume()
> This is for situations where the mode is temporarily disabled
> by first calling pause() and then resume() (e.g. to prevent any
> batching from occurring in a critical section).
>
> The documentation in <linux/pgtable.h> will be updated in a
> subsequent patch.
>
> No functional change should be introduced at this stage.
> The implementation of enable()/resume() and disable()/pause() is
> currently identical, but nesting support will change that.
>
> Most of the call sites have been updated using the following
> Coccinelle script:
>
> @@
> @@
> {
> ...
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> ...
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> ...
> }
>
> @@
> @@
> {
> ...
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_pause();
> ...
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_resume();
> ...
> }
>
> A couple of notes regarding x86:
>
> * Xen is currently the only case where explicit handling is required
> for lazy MMU when context-switching. This is purely an
> implementation detail and using the generic lazy_mmu_mode_*
> functions would cause trouble when nesting support is introduced,
> because the generic functions must be called from the current task.
> For that reason we still use arch_leave() and arch_enter() there.
>
> * x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few
> places, but only defines it if PARAVIRT_XXL is selected, and we
> are removing the fallback in <linux/pgtable.h>. Add a new fallback
> definition to <asm/pgtable.h> to keep things building.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/mm/mmu.c | 4 ++--
> arch/arm64/mm/pageattr.c | 4 ++--
> arch/powerpc/mm/book3s64/hash_tlb.c | 8 +++----
> arch/powerpc/mm/book3s64/subpage_prot.c | 4 ++--
> arch/x86/include/asm/pgtable.h | 3 ++-
> fs/proc/task_mmu.c | 4 ++--
> include/linux/pgtable.h | 29 +++++++++++++++++++++----
> mm/kasan/shadow.c | 8 +++----
> mm/madvise.c | 18 +++++++--------
> mm/memory.c | 16 +++++++-------
> mm/migrate_device.c | 4 ++--
> mm/mprotect.c | 4 ++--
> mm/mremap.c | 4 ++--
> mm/userfaultfd.c | 4 ++--
> mm/vmalloc.c | 12 +++++-----
> mm/vmscan.c | 12 +++++-----
> 16 files changed, 80 insertions(+), 58 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index b8d37eb037fc..d9c8e94f140f 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -731,7 +731,7 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
> return -EINVAL;
>
> mutex_lock(&pgtable_split_lock);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> /*
> * The split_kernel_leaf_mapping_locked() may sleep, it is not a
This is a bit unfortunate, IMHO. The rest of this comment explains that although
you're not supposed to sleep inside lazy mmu mode, it's fine for arm64's
implementation. But we are no longer calling arm64's implementation; we are
calling a generic function, which does who knows what.
I think it all still works, but we are no longer containing our assumptions in
arm64 code. We are relying on implementation details of generic code.
> @@ -753,7 +753,7 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
> ret = split_kernel_leaf_mapping_locked(end);
> }
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> mutex_unlock(&pgtable_split_lock);
> return ret;
> }
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 5135f2d66958..e4059f13c4ed 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -110,7 +110,7 @@ static int update_range_prot(unsigned long start, unsigned long size,
> if (WARN_ON_ONCE(ret))
> return ret;
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> /*
> * The caller must ensure that the range we are operating on does not
> @@ -119,7 +119,7 @@ static int update_range_prot(unsigned long start, unsigned long size,
> */
> ret = walk_kernel_page_table_range_lockless(start, start + size,
> &pageattr_ops, NULL, &data);
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
>
> return ret;
> }
> diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
> index 21fcad97ae80..787f7a0e27f0 100644
> --- a/arch/powerpc/mm/book3s64/hash_tlb.c
> +++ b/arch/powerpc/mm/book3s64/hash_tlb.c
> @@ -205,7 +205,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
> * way to do things but is fine for our needs here.
> */
> local_irq_save(flags);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> for (; start < end; start += PAGE_SIZE) {
> pte_t *ptep = find_init_mm_pte(start, &hugepage_shift);
> unsigned long pte;
> @@ -217,7 +217,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
> continue;
> hpte_need_flush(&init_mm, start, ptep, pte, hugepage_shift);
> }
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> local_irq_restore(flags);
> }
>
> @@ -237,7 +237,7 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
> * way to do things but is fine for our needs here.
> */
> local_irq_save(flags);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> start_pte = pte_offset_map(pmd, addr);
> if (!start_pte)
> goto out;
> @@ -249,6 +249,6 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
> }
> pte_unmap(start_pte);
> out:
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> local_irq_restore(flags);
> }
> diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
> index ec98e526167e..07c47673bba2 100644
> --- a/arch/powerpc/mm/book3s64/subpage_prot.c
> +++ b/arch/powerpc/mm/book3s64/subpage_prot.c
> @@ -73,13 +73,13 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
> pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> if (!pte)
> return;
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> for (; npages > 0; --npages) {
> pte_update(mm, addr, pte, 0, 0, 0);
> addr += PAGE_SIZE;
> ++pte;
> }
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(pte - 1, ptl);
> }
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index e33df3da6980..14fd672bc9b2 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -117,7 +117,8 @@ extern pmdval_t early_pmd_flags;
> #define pte_val(x) native_pte_val(x)
> #define __pte(x) native_make_pte(x)
>
> -#define arch_end_context_switch(prev) do {} while(0)
> +#define arch_end_context_switch(prev) do {} while (0)
> +#define arch_flush_lazy_mmu_mode() do {} while (0)
Andrew converted over the default version of this (which you have removed with
this commit) to be static inline instead of the do/while guff. Perhaps you
should try to preserve that improvement here?
See Commit d02ac836e4d6 ("include/linux/pgtable.h: convert
arch_enter_lazy_mmu_mode() and friends to static inlines")
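i.e. something like this (a sketch of the suggestion only, not a concrete patch):

static inline void arch_flush_lazy_mmu_mode(void) {}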
> #endif /* CONFIG_PARAVIRT_XXL */
>
> static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index fc35a0543f01..d16ba1d32169 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -2703,7 +2703,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> return 0;
> }
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> if ((p->arg.flags & PM_SCAN_WP_MATCHING) && !p->vec_out) {
> /* Fast path for performing exclusive WP */
> @@ -2773,7 +2773,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> if (flush_end)
> flush_tlb_range(vma, start, addr);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(start_pte, ptl);
>
> cond_resched();
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 9894366e768b..b5fdf32c437f 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -231,10 +231,31 @@ static inline int pmd_dirty(pmd_t pmd)
> * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> * and the mode cannot be used in interrupt context.
> */
> -#ifndef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> -static inline void arch_enter_lazy_mmu_mode(void) {}
> -static inline void arch_leave_lazy_mmu_mode(void) {}
> -static inline void arch_flush_lazy_mmu_mode(void) {}
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> +static inline void lazy_mmu_mode_enable(void)
> +{
> + arch_enter_lazy_mmu_mode();
> +}
> +
> +static inline void lazy_mmu_mode_disable(void)
> +{
> + arch_leave_lazy_mmu_mode();
> +}
> +
> +static inline void lazy_mmu_mode_pause(void)
> +{
> + arch_leave_lazy_mmu_mode();
> +}
> +
> +static inline void lazy_mmu_mode_resume(void)
> +{
> + arch_enter_lazy_mmu_mode();
> +}
It would be good to add documentation blocks for each of these.
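e.g. something along these lines (wording purely illustrative):

/**
 * lazy_mmu_mode_pause() - temporarily pause the lazy MMU mode.
 *
 * Outstanding batched operations are completed, and PTE updates are not
 * batched again until lazy_mmu_mode_resume() is called.
 */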
> +#else
> +static inline void lazy_mmu_mode_enable(void) {}
> +static inline void lazy_mmu_mode_disable(void) {}
> +static inline void lazy_mmu_mode_pause(void) {}
> +static inline void lazy_mmu_mode_resume(void) {}
> #endif
>
> #ifndef pte_batch_hint
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index 5d2a876035d6..c49b029d3593 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -305,7 +305,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> pte_t pte;
> int index;
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_pause();
I wonder if there really are use cases that *require* pause/resume? I think
these kasan cases could be correctly implemented using a new nesting level instead?
Are there cases where the effects really need to be immediate, or do the effects
just need to be visible by the time you get to where the resume is?
If the latter, that could just be turned into a nested disable (e.g. a flush).
In this case there is only 1 PTE write, so no benefit, but I wonder if other
cases may have more PTE writes that could then still be batched. It would be
nice to simplify the API by removing pause/resume if we can?
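One possible reading of that, purely illustrative and assuming a nested
lazy_mmu_mode_disable() is allowed to flush the pending batch:

static int kasan_populate_vmalloc_pte_sketch(pte_t *ptep, unsigned long addr,
                                             void *_data)
{
        /* open a nested section instead of pausing the outer one */
        lazy_mmu_mode_enable();

        /* ... same population logic as today; PTE writes may still be batched ... */

        /* effects of the writes become visible no later than here */
        lazy_mmu_mode_disable();
        return 0;
}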
Thanks,
Ryan
>
> index = PFN_DOWN(addr - data->start);
> page = data->pages[index];
> @@ -319,7 +319,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> }
> spin_unlock(&init_mm.page_table_lock);
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_resume();
>
> return 0;
> }
> @@ -482,7 +482,7 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> pte_t pte;
> int none;
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_pause();
>
> spin_lock(&init_mm.page_table_lock);
> pte = ptep_get(ptep);
> @@ -494,7 +494,7 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> if (likely(!none))
> __free_page(pfn_to_page(pte_pfn(pte)));
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_resume();
>
> return 0;
> }
> diff --git a/mm/madvise.c b/mm/madvise.c
> index fb1c86e630b6..536026772160 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -455,7 +455,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (!start_pte)
> return 0;
> flush_tlb_batched_pending(mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> nr = 1;
> ptent = ptep_get(pte);
> @@ -463,7 +463,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (++batch_count == SWAP_CLUSTER_MAX) {
> batch_count = 0;
> if (need_resched()) {
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(start_pte, ptl);
> cond_resched();
> goto restart;
> @@ -499,7 +499,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (!folio_trylock(folio))
> continue;
> folio_get(folio);
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(start_pte, ptl);
> start_pte = NULL;
> err = split_folio(folio);
> @@ -510,7 +510,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (!start_pte)
> break;
> flush_tlb_batched_pending(mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> if (!err)
> nr = 0;
> continue;
> @@ -558,7 +558,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> }
>
> if (start_pte) {
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(start_pte, ptl);
> }
> if (pageout)
> @@ -677,7 +677,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> if (!start_pte)
> return 0;
> flush_tlb_batched_pending(mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
> nr = 1;
> ptent = ptep_get(pte);
> @@ -727,7 +727,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> if (!folio_trylock(folio))
> continue;
> folio_get(folio);
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(start_pte, ptl);
> start_pte = NULL;
> err = split_folio(folio);
> @@ -738,7 +738,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> if (!start_pte)
> break;
> flush_tlb_batched_pending(mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> if (!err)
> nr = 0;
> continue;
> @@ -778,7 +778,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> if (nr_swap)
> add_mm_counter(mm, MM_SWAPENTS, nr_swap);
> if (start_pte) {
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(start_pte, ptl);
> }
> cond_resched();
> diff --git a/mm/memory.c b/mm/memory.c
> index 74b45e258323..2d662dee5ae7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1254,7 +1254,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> orig_src_pte = src_pte;
> orig_dst_pte = dst_pte;
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> do {
> nr = 1;
> @@ -1323,7 +1323,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> } while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr,
> addr != end);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(orig_src_pte, src_ptl);
> add_mm_rss_vec(dst_mm, rss);
> pte_unmap_unlock(orig_dst_pte, dst_ptl);
> @@ -1842,7 +1842,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> return addr;
>
> flush_tlb_batched_pending(mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> do {
> bool any_skipped = false;
>
> @@ -1874,7 +1874,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
>
> add_mm_rss_vec(mm, rss);
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
>
> /* Do the actual TLB flush before dropping ptl */
> if (force_flush) {
> @@ -2817,7 +2817,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
> mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> if (!pte)
> return -ENOMEM;
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> do {
> BUG_ON(!pte_none(ptep_get(pte)));
> if (!pfn_modify_allowed(pfn, prot)) {
> @@ -2827,7 +2827,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
> set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
> pfn++;
> } while (pte++, addr += PAGE_SIZE, addr != end);
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(mapped_pte, ptl);
> return err;
> }
> @@ -3134,7 +3134,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> return -EINVAL;
> }
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> if (fn) {
> do {
> @@ -3147,7 +3147,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> }
> *mask |= PGTBL_PTE_MODIFIED;
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
>
> if (mm != &init_mm)
> pte_unmap_unlock(mapped_pte, ptl);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index abd9f6850db6..dcdc46b96cc7 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -110,7 +110,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> if (!ptep)
> goto again;
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> for (; addr < end; addr += PAGE_SIZE, ptep++) {
> struct dev_pagemap *pgmap;
> @@ -287,7 +287,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> if (unmapped)
> flush_tlb_range(walk->vma, start, end);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(ptep - 1, ptl);
>
> return 0;
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 113b48985834..bcb183a6fd2f 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -293,7 +293,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> target_node = numa_node_id();
>
> flush_tlb_batched_pending(vma->vm_mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> do {
> nr_ptes = 1;
> oldpte = ptep_get(pte);
> @@ -439,7 +439,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> }
> }
> } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(pte - 1, ptl);
>
> return pages;
> diff --git a/mm/mremap.c b/mm/mremap.c
> index bd7314898ec5..a2e2cd8f279a 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -256,7 +256,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> if (new_ptl != old_ptl)
> spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> flush_tlb_batched_pending(vma->vm_mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
> new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
> @@ -301,7 +301,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> }
> }
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> if (force_flush)
> flush_tlb_range(vma, old_end - len, old_end);
> if (new_ptl != old_ptl)
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index af61b95c89e4..e01f7813e15c 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1100,7 +1100,7 @@ static long move_present_ptes(struct mm_struct *mm,
> /* It's safe to drop the reference now as the page-table is holding one. */
> folio_put(*first_src_folio);
> *first_src_folio = NULL;
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> while (true) {
> orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1138,7 +1138,7 @@ static long move_present_ptes(struct mm_struct *mm,
> break;
> }
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> if (src_addr > src_start)
> flush_tlb_range(src_vma, src_start, src_addr);
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 798b2ed21e46..b9940590a40d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -105,7 +105,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> if (!pte)
> return -ENOMEM;
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> do {
> if (unlikely(!pte_none(ptep_get(pte)))) {
> @@ -131,7 +131,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> pfn++;
> } while (pte += PFN_DOWN(size), addr += size, addr != end);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> *mask |= PGTBL_PTE_MODIFIED;
> return 0;
> }
> @@ -359,7 +359,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> unsigned long size = PAGE_SIZE;
>
> pte = pte_offset_kernel(pmd, addr);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> do {
> #ifdef CONFIG_HUGETLB_PAGE
> @@ -378,7 +378,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> WARN_ON(!pte_none(ptent) && !pte_present(ptent));
> } while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> *mask |= PGTBL_PTE_MODIFIED;
> }
>
> @@ -526,7 +526,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> if (!pte)
> return -ENOMEM;
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> do {
> struct page *page = pages[*nr];
> @@ -548,7 +548,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> (*nr)++;
> } while (pte++, addr += PAGE_SIZE, addr != end);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> *mask |= PGTBL_PTE_MODIFIED;
>
> return err;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b2fc8b626d3d..7d2d87069530 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3551,7 +3551,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> return false;
> }
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> restart:
> for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
> unsigned long pfn;
> @@ -3592,7 +3592,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
> goto restart;
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(pte, ptl);
>
> return suitable_to_scan(total, young);
> @@ -3633,7 +3633,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
> if (!spin_trylock(ptl))
> goto done;
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> do {
> unsigned long pfn;
> @@ -3680,7 +3680,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
>
> walk_update_folio(walk, last, gen, dirty);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> spin_unlock(ptl);
> done:
> *first = -1;
> @@ -4279,7 +4279,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> }
> }
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> pte -= (addr - start) / PAGE_SIZE;
>
> @@ -4313,7 +4313,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>
> walk_update_folio(walk, last, gen, dirty);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
>
> /* feedback from rmap walkers to page table walkers */
> if (mm_state && suitable_to_scan(i, young))
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
2025-11-07 14:26 ` Ryan Roberts
@ 2025-11-07 14:34 ` David Hildenbrand (Red Hat)
2025-11-07 15:22 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-07 14:34 UTC (permalink / raw)
To: Ryan Roberts, Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
>> #ifndef pte_batch_hint
>> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
>> index 5d2a876035d6..c49b029d3593 100644
>> --- a/mm/kasan/shadow.c
>> +++ b/mm/kasan/shadow.c
>> @@ -305,7 +305,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
>> pte_t pte;
>> int index;
>>
>> - arch_leave_lazy_mmu_mode();
>> + lazy_mmu_mode_pause();
>
> I wonder if there really are use cases that *require* pause/resume? I think
> these kasan cases could be correctly implemented using a new nest level instead?
> Are there cases where the effects really need to be immediate or do the effects
> just need to be visible when you get to where the resume is?
>
> If the latter, that could just be turned into a nested disable (e.g. a flush).
> In this case, there is only 1 PTE write so no benefit, but I wonder if other
> cases may have more PTE writes that could then still be batched. It would be
> nice to simplify the API by removing pause/resume if we can?
It has clear semantics - clearer than a nested disable, IMHO.
Maybe you can elaborate on how you would change ("simplify") the API in
that regard? What would the API look like?
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
2025-10-29 10:09 ` [PATCH v4 07/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
` (2 preceding siblings ...)
2025-11-05 8:49 ` Ritesh Harjani
@ 2025-11-07 14:59 ` Ryan Roberts
3 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-11-07 14:59 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 29/10/2025 10:09, Kevin Brodsky wrote:
> Despite recent efforts to prevent lazy_mmu sections from nesting, it
> remains difficult to ensure that it never occurs - and in fact it
> does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC).
> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
> made nesting tolerable on arm64, but without truly supporting it:
> the inner call to leave() disables the batching optimisation before
> the outer section ends.
>
> This patch actually enables lazy_mmu sections to nest by tracking
> the nesting level in task_struct, in a similar fashion to e.g.
> pagefault_{enable,disable}(). This is fully handled by the generic
> lazy_mmu helpers that were recently introduced.
>
> lazy_mmu sections were not initially intended to nest, so we need to
> clarify the semantics w.r.t. the arch_*_lazy_mmu_mode() callbacks.
> This patch takes the following approach:
>
> * The outermost calls to lazy_mmu_mode_{enable,disable}() trigger
> calls to arch_{enter,leave}_lazy_mmu_mode() - this is unchanged.
>
> * Nested calls to lazy_mmu_mode_{enable,disable}() are not forwarded
> to the arch via arch_{enter,leave} - lazy MMU remains enabled so
> the assumption is that these callbacks are not relevant. However,
> existing code may rely on a call to disable() to flush any batched
> state, regardless of nesting. arch_flush_lazy_mmu_mode() is
> therefore called in that situation.
>
> A separate interface was recently introduced to temporarily pause
> the lazy MMU mode: lazy_mmu_mode_{pause,resume}(). pause() fully
> exits the mode *regardless of the nesting level*, and resume()
> restores the mode at the same nesting level.
>
> Whether the mode is actually enabled or not at any point is tracked
> by a separate "active" field in task_struct; this makes it possible
> to check invariants in the generic API, and to expose a new
> in_lazy_mmu_mode() helper to replace the various ways arch's
> currently track whether the mode is enabled (this will be done in
> later patches).
>
> In summary (nesting/active represent the values *after* the call):
>
> lazy_mmu_mode_enable() -> arch_enter() nesting=1 active=1
> lazy_mmu_mode_enable() -> ø nesting=2 active=1
> lazy_mmu_mode_pause() -> arch_leave() nesting=2 active=0
> lazy_mmu_mode_resume() -> arch_enter() nesting=2 active=1
> lazy_mmu_mode_disable() -> arch_flush() nesting=1 active=1
> lazy_mmu_mode_disable() -> arch_leave() nesting=0 active=0
>
> Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
> headers included by <linux/pgtable.h> to use it.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 12 ------
> include/linux/mm_types_task.h | 5 +++
> include/linux/pgtable.h | 67 ++++++++++++++++++++++++++++++--
> include/linux/sched.h | 16 ++++++++
> 4 files changed, 84 insertions(+), 16 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 54f8d6bb6f22..535435248923 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -82,18 +82,6 @@ static inline void queue_pte_barriers(void)
>
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> - /*
> - * lazy_mmu_mode is not supposed to permit nesting. But in practice this
> - * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
> - * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
> - * permissions on the linear map with apply_to_page_range(), which
> - * re-enters lazy_mmu_mode. So we tolerate nesting in our
> - * implementation. The first call to arch_leave_lazy_mmu_mode() will
> - * flush and clear the flag such that the remainder of the work in the
> - * outer nest behaves as if outside of lazy mmu mode. This is safe and
> - * keeps tracking simple.
> - */
> -
> if (in_interrupt())
> return;
>
> diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
> index a82aa80c0ba4..632d404f8191 100644
> --- a/include/linux/mm_types_task.h
> +++ b/include/linux/mm_types_task.h
> @@ -88,4 +88,9 @@ struct tlbflush_unmap_batch {
> #endif
> };
>
> +struct lazy_mmu_state {
> + u8 nesting_level;
> + bool active;
> +};
> +
> #endif /* _LINUX_MM_TYPES_TASK_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b5fdf32c437f..e6064e00b22d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -228,27 +228,86 @@ static inline int pmd_dirty(pmd_t pmd)
> * of the lazy mode. So the implementation must assume preemption may be enabled
> * and cpu migration is possible; it must take steps to be robust against this.
> * (In practice, for user PTE updates, the appropriate page table lock(s) are
> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> - * and the mode cannot be used in interrupt context.
> + * held, but for kernel PTE updates, no lock is held). The mode cannot be used
> + * in interrupt context.
"The mode cannot be used in interrupt context"; except it is for arm64. KFENCE
and/or DEBUG_PAGEALLOC will request the arch to change linear map permissions,
which will enter lazy mmu (now using the new generic API). This can happen in
softirq context.
> + *
> + * The lazy MMU mode is enabled for a given block of code using:
> + *
> + * lazy_mmu_mode_enable();
> + * <code>
> + * lazy_mmu_mode_disable();
> + *
> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
> + * A nested call to enable() has no functional effect; however disable() causes
> + * any batched architectural state to be flushed regardless of nesting. After a
> + * call to disable(), the caller can therefore rely on all previous page table
> + * modifications to have taken effect, but the lazy MMU mode may still be
> + * enabled.
> + *
> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
> + * This can be done using:
> + *
> + * lazy_mmu_mode_pause();
> + * <code>
> + * lazy_mmu_mode_resume();
> + *
> + * This sequence must only be used if the lazy MMU mode is already enabled.
> + * pause() ensures that the mode is exited regardless of the nesting level;
> + * resume() re-enters the mode at the same nesting level. <code> must not modify
> + * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
> + * helpers).
> + *
> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> + * currently enabled.
> */
Nice documentation!
> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> static inline void lazy_mmu_mode_enable(void)
> {
> - arch_enter_lazy_mmu_mode();
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> + /* enable() must not be called while paused */
> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> +
> + if (state->nesting_level++ == 0) {
Hmm... for the arm64 case of calling this in an interrupt, is it safe?
If a task calling this function gets interrupted here, nesting_level==1
but active==false. The interrupt handler then calls this function and
increments nesting_level from 1 to 2, but arch_enter_lazy_mmu_mode() is
never called for it.
More dangerously (I think), when the interrupt handler calls
lazy_mmu_mode_disable(), it will end up calling arch_flush_lazy_mmu_mode(),
which could be an issue because, as far as the arch is concerned, it is not
in lazy mode.
The current arm64 implementation works because setting and clearing the thread
flags is atomic.
Perhaps you need to disable preemption around the if block?
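For illustration, something along these lines - just an untested sketch of that
suggestion (VM_WARN_* checks omitted); note it only serialises against
preemption/migration, it does not mask the interrupt itself:

        static inline void lazy_mmu_mode_enable(void)
        {
                struct lazy_mmu_state *state = &current->lazy_mmu_state;

                /* Avoid being preempted/migrated in the middle of the update */
                preempt_disable();
                if (state->nesting_level++ == 0) {
                        state->active = true;
                        arch_enter_lazy_mmu_mode();
                }
                preempt_enable();
        }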
> + state->active = true;
> + arch_enter_lazy_mmu_mode();
> + }
> }
>
> static inline void lazy_mmu_mode_disable(void)
> {
> - arch_leave_lazy_mmu_mode();
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON_ONCE(state->nesting_level == 0);
> + VM_WARN_ON(!state->active);
> +
> + if (--state->nesting_level == 0) {
> + state->active = false;
> + arch_leave_lazy_mmu_mode();
> + } else {
> + /* Exiting a nested section */
> + arch_flush_lazy_mmu_mode();
> + }
> }
>
> static inline void lazy_mmu_mode_pause(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
nit: do you need the first condition? I think when nesting_level==0, we expect
to be !active?
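If so, the check could presumably collapse to just:

        VM_WARN_ON(!state->active);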
> +
> + state->active = false;
> arch_leave_lazy_mmu_mode();
> }
>
> static inline void lazy_mmu_mode_resume(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON(state->nesting_level == 0 || state->active);
Similar argument?
> +
> + state->active = true;
> arch_enter_lazy_mmu_mode();
> }
> #else
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index cbb7340c5866..11566d973f42 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1441,6 +1441,10 @@ struct task_struct {
>
> struct page_frag task_frag;
>
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> + struct lazy_mmu_state lazy_mmu_state;
> +#endif
> +
> #ifdef CONFIG_TASK_DELAY_ACCT
> struct task_delay_info *delays;
> #endif
> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
> return task_index_to_char(task_state_index(tsk));
> }
>
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return current->lazy_mmu_state.active;
> +}
> +#else
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return false;
Just pointing out that this isn't really a correct implementation:
  lazy_mmu_mode_enable()
  ASSERT(in_lazy_mmu_mode())   << triggers for arches without lazy mmu
  lazy_mmu_mode_disable()
Although it probably doesn't matter in practice?
Thanks,
Ryan
> +}
> +#endif
> +
> extern struct pid *cad_pid;
>
> /*
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
2025-11-07 14:34 ` David Hildenbrand (Red Hat)
@ 2025-11-07 15:22 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-11-07 15:22 UTC (permalink / raw)
To: David Hildenbrand (Red Hat), Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 07/11/2025 14:34, David Hildenbrand (Red Hat) wrote:
>>> #ifndef pte_batch_hint
>>> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
>>> index 5d2a876035d6..c49b029d3593 100644
>>> --- a/mm/kasan/shadow.c
>>> +++ b/mm/kasan/shadow.c
>>> @@ -305,7 +305,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep,
>>> unsigned long addr,
>>> pte_t pte;
>>> int index;
>>> - arch_leave_lazy_mmu_mode();
>>> + lazy_mmu_mode_pause();
>>
>> I wonder if there really are use cases that *require* pause/resume? I think
>> these kasan cases could be correctly implemented using a new nest level instead?
>> Are there cases where the effects really need to be immediate or do the effects
>> just need to be visible when you get to where the resume is?
>>
>> If the latter, that could just be turned into a nested disable (e.g. a flush).
>> In this case, there is only 1 PTE write so no benefit, but I wonder if other
>> cases may have more PTE writes that could then still be batched. It would be
>> nice to simplify the API by removing pause/resume if we can?
>
> It has clear semantics, clearer than some nest-disable IMHO.
>
> Maybe you can elaborate how you would change ("simplify") the API in that
> regard? What would the API look like?
By "simplify", I just meant: can we remove lazy_mmu_mode_pause() and
lazy_mmu_mode_resume()?
We currently have:
  apply_to_page_range
    lazy_mmu_mode_enable()
    kasan_populate_vmalloc_pte()
      lazy_mmu_mode_pause()
      <code>
      lazy_mmu_mode_resume()
    lazy_mmu_mode_disable()
Where <code> is setting ptes. But if <code> doesn't need the effects to be
visible until lazy_mmu_mode_resume(), then you could replace the block with:
  apply_to_page_range
    lazy_mmu_mode_enable()
    kasan_populate_vmalloc_pte()
      lazy_mmu_mode_enable()
      <code>
      lazy_mmu_mode_disable()
    lazy_mmu_mode_disable()
However, looking at this more closely, I'm not really clear on why we need *any*
special attention to lazy mmu inside of kasan_populate_vmalloc_pte() and
kasan_depopulate_vmalloc_pte().
I *think* that the original concern was that we were doing ptep_get(ptep) inside
of a lazy_mmu block? So we need to flush so that the getter returns the most
recent value? But given we have never written to that particular ptep while in
the lazy mmu block, there is surely no hazard in the first place?
apply_to_existing_page_range() will only call kasan_depopulate_vmalloc_pte()
once per pte, right? So given we read the ptep before writing it, there should
be no hazard? If so we can remove pause/resume.
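To spell out the hazard being discussed - a purely hypothetical pattern, not
the actual kasan code:

        lazy_mmu_mode_enable();
        ...
        set_pte_at(mm, addr, ptep, pte_a);      /* may be batched, not yet visible */
        old = ptep_get(ptep);                   /* may miss pte_a unless flushed first */
        ...
        lazy_mmu_mode_disable();

A pte that hasn't been written earlier in the same lazy_mmu section - as in the
kasan callbacks - has no pending update to miss, so reading it is fine.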
Thanks,
Ryan
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 08/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
2025-10-29 10:09 ` [PATCH v4 08/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode() Kevin Brodsky
2025-11-03 16:03 ` David Hildenbrand
@ 2025-11-07 15:28 ` Ryan Roberts
1 sibling, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-11-07 15:28 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 29/10/2025 10:09, Kevin Brodsky wrote:
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode. As a result we no longer need a TIF flag for that purpose -
> let's use the new in_lazy_mmu_mode() helper instead.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 16 +++-------------
> arch/arm64/include/asm/thread_info.h | 3 +--
> 2 files changed, 4 insertions(+), 15 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 535435248923..61ca88f94551 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -62,30 +62,21 @@ static inline void emit_pte_barriers(void)
>
> static inline void queue_pte_barriers(void)
> {
> - unsigned long flags;
> -
> if (in_interrupt()) {
> emit_pte_barriers();
> return;
> }
>
> - flags = read_thread_flags();
> -
> - if (flags & BIT(TIF_LAZY_MMU)) {
> - /* Avoid the atomic op if already set. */
> - if (!(flags & BIT(TIF_LAZY_MMU_PENDING)))
> - set_thread_flag(TIF_LAZY_MMU_PENDING);
> - } else {
> + if (in_lazy_mmu_mode())
> + test_and_set_thread_flag(TIF_LAZY_MMU_PENDING);
This removes the optimization to only do the atomic set operation if the bit is
not already set. I think that should remain.
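Something like this ought to keep it, I think (untested sketch):

        if (in_lazy_mmu_mode()) {
                /* Avoid the atomic op if already set. */
                if (!test_thread_flag(TIF_LAZY_MMU_PENDING))
                        set_thread_flag(TIF_LAZY_MMU_PENDING);
        } else {
                emit_pte_barriers();
        }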
> + else
> emit_pte_barriers();
> - }
> }
>
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> if (in_interrupt())
> return;
Why are you keeping this test? Surely it can go?
> -
> - set_thread_flag(TIF_LAZY_MMU);
> }
>
> static inline void arch_flush_lazy_mmu_mode(void)
> @@ -103,7 +94,6 @@ static inline void arch_leave_lazy_mmu_mode(void)
> return;
>
> arch_flush_lazy_mmu_mode();
> - clear_thread_flag(TIF_LAZY_MMU);
> }
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index f241b8601ebd..4ff8da0767d9 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -84,8 +84,7 @@ void arch_setup_new_exec(void);
> #define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
> #define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
> #define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */
> -#define TIF_LAZY_MMU 31 /* Task in lazy mmu mode */
> -#define TIF_LAZY_MMU_PENDING 32 /* Ops pending for lazy mmu mode exit */
> +#define TIF_LAZY_MMU_PENDING 31 /* Ops pending for lazy mmu mode exit */
>
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 12/12] mm: bail out of lazy_mmu_mode_* in interrupt context
2025-10-29 10:09 ` [PATCH v4 12/12] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
@ 2025-11-07 15:42 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-11-07 15:42 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86
On 29/10/2025 10:09, Kevin Brodsky wrote:
> The lazy MMU mode cannot be used in interrupt context. This is
> documented in <linux/pgtable.h>, but isn't consistently handled
> across architectures.
>
> arm64 ensures that calls to lazy_mmu_mode_* have no effect in
> interrupt context, because such calls do occur in certain
> configurations - see commit b81c688426a9 ("arm64/mm: Disable barrier
> batching in interrupt contexts"). Other architectures do not check
> this situation, most likely because it hasn't occurred so far.
>
> Both arm64 and x86/Xen also ensure that any lazy MMU optimisation is
> disabled while in interrupt mode (see queue_pte_barriers() and
> xen_get_lazy_mode() respectively).
>
> Let's handle this in the new generic lazy_mmu layer, in the same
> fashion as arm64: bail out of lazy_mmu_mode_* if in_interrupt(), and
> have in_lazy_mmu_mode() return false to disable any optimisation.
> Also remove the arm64 handling that is now redundant; x86/Xen has
> its own internal tracking so it is left unchanged.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 17 +----------------
> include/linux/pgtable.h | 16 ++++++++++++++--
> include/linux/sched.h | 3 +++
> 3 files changed, 18 insertions(+), 18 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 61ca88f94551..96987a49e83b 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -62,37 +62,22 @@ static inline void emit_pte_barriers(void)
>
> static inline void queue_pte_barriers(void)
> {
> - if (in_interrupt()) {
> - emit_pte_barriers();
> - return;
> - }
> -
> if (in_lazy_mmu_mode())
> test_and_set_thread_flag(TIF_LAZY_MMU_PENDING);
> else
> emit_pte_barriers();
> }
>
> -static inline void arch_enter_lazy_mmu_mode(void)
> -{
> - if (in_interrupt())
> - return;
> -}
> +static inline void arch_enter_lazy_mmu_mode(void) {}
>
> static inline void arch_flush_lazy_mmu_mode(void)
> {
> - if (in_interrupt())
> - return;
> -
> if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
> emit_pte_barriers();
> }
>
> static inline void arch_leave_lazy_mmu_mode(void)
> {
> - if (in_interrupt())
> - return;
> -
> arch_flush_lazy_mmu_mode();
> }
Ahh ok, by the time you get to the final state, I think most of my
comments/concerns are resolved. Certainly this now looks safe for the interrupt
case, whereas I think the intermediate state, when you initially introduce
nesting, is broken. So perhaps you want to look at how to rework the series to
avoid that.
Thanks,
Ryan
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e6064e00b22d..e6069ce4ec83 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -228,8 +228,8 @@ static inline int pmd_dirty(pmd_t pmd)
> * of the lazy mode. So the implementation must assume preemption may be enabled
> * and cpu migration is possible; it must take steps to be robust against this.
> * (In practice, for user PTE updates, the appropriate page table lock(s) are
> - * held, but for kernel PTE updates, no lock is held). The mode cannot be used
> - * in interrupt context.
> + * held, but for kernel PTE updates, no lock is held). The mode is disabled
> + * in interrupt context and calls to the lazy_mmu API have no effect.
> *
> * The lazy MMU mode is enabled for a given block of code using:
> *
> @@ -265,6 +265,9 @@ static inline void lazy_mmu_mode_enable(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>
> + if (in_interrupt())
> + return;
> +
> VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> /* enable() must not be called while paused */
> VM_WARN_ON(state->nesting_level > 0 && !state->active);
> @@ -279,6 +282,9 @@ static inline void lazy_mmu_mode_disable(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>
> + if (in_interrupt())
> + return;
> +
> VM_WARN_ON_ONCE(state->nesting_level == 0);
> VM_WARN_ON(!state->active);
>
> @@ -295,6 +301,9 @@ static inline void lazy_mmu_mode_pause(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>
> + if (in_interrupt())
> + return;
> +
> VM_WARN_ON(state->nesting_level == 0 || !state->active);
>
> state->active = false;
> @@ -305,6 +314,9 @@ static inline void lazy_mmu_mode_resume(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>
> + if (in_interrupt())
> + return;
> +
> VM_WARN_ON(state->nesting_level == 0 || state->active);
>
> state->active = true;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 11566d973f42..bb873016ffcf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1731,6 +1731,9 @@ static inline char task_state_to_char(struct task_struct *tsk)
> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> static inline bool in_lazy_mmu_mode(void)
> {
> + if (in_interrupt())
> + return false;
> +
> return current->lazy_mmu_state.active;
> }
> #else
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 02/12] x86/xen: simplify flush_lazy_mmu()
2025-10-29 10:08 ` [PATCH v4 02/12] x86/xen: simplify flush_lazy_mmu() Kevin Brodsky
2025-11-01 12:14 ` David Hildenbrand
2025-11-07 12:31 ` Ryan Roberts
@ 2025-11-07 15:45 ` Jürgen Groß
2 siblings, 0 replies; 62+ messages in thread
From: Jürgen Groß @ 2025-11-07 15:45 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 29.10.25 11:08, Kevin Brodsky wrote:
> arch_flush_lazy_mmu_mode() is called when outstanding batched
> pgtable operations must be completed immediately. There should
> however be no need to leave and re-enter lazy MMU completely. The
> only part of that sequence that we really need is xen_mc_flush();
> call it directly.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Juergen
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 01/12] powerpc/64s: Do not re-activate batched TLB flush
2025-11-06 10:29 ` Kevin Brodsky
@ 2025-11-08 0:35 ` Ritesh Harjani
0 siblings, 0 replies; 62+ messages in thread
From: Ritesh Harjani @ 2025-11-08 0:35 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller,
David Woodhouse, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon,
Yeoreum Yun, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel, x86, Venkat Rao Bagalkote
Kevin Brodsky <kevin.brodsky@arm.com> writes:
> On 05/11/2025 02:46, Ritesh Harjani (IBM) wrote:
>> Kevin Brodsky <kevin.brodsky@arm.com> writes:
>>
>>> From: Alexander Gordeev <agordeev@linux.ibm.com>
>>>
>>> Since commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash
>>> lazy mmu mode") a task can not be preempted while in lazy MMU mode.
>>> Therefore, the batch re-activation code is never called, so remove it.
>>>
>>> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
>>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>>> ---
>>> arch/powerpc/include/asm/thread_info.h | 2 --
>>> arch/powerpc/kernel/process.c | 25 -------------------------
>>> 2 files changed, 27 deletions(-)
>>>
>> Since the commit referenced above disables preemption in
>> arch_enter_lazy_mmu(), the expectation is that we will never be
>> context-switched while in lazy_mmu, hence the code changes in
>> switch_to() around __flush_tlb_pending() should ideally never be called.
>
> Correct, that's the idea.
>
>> With this analysis, the patch looks good to me. I will give this entire
>> patch series a try on Power HW with Hash MMU too (which uses lazy mmu)
>> and let you know the results!
>
> That'd be very appreciated, thanks a lot!
>
I did give this patch series a run on Power10 with Hash MMU. I ran the
following stress-ng tests and didn't observe any issues (no kernel warnings) so far.
stress-ng --all 0 -t 60s --perf -v --verify \
    --tlb-shootdown 0 \
    --fault 0 \
    --userfaultfd 0 \
    --fork 0 \
    --exec 0 \
    --memfd 0 \
    --numa 0 \
    --pkey 0 \
    --remap 0 \
    --vm 0 \
    --rmap 0 \
    -x swap,pagemove
(Note that not all options shown here will work with --verify.)
Let me know what else I can run for validation.
Do you know of any specific tests for validating the lazy MMU feature?
>> For this patch please feel free to add:
>> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>
>>
>> CC: Venkat who also runs CI on linux Power HW for upstream testing :)
>
> Ack, will Cc you both in the next version.
Sure. Thanks!
-ritesh
^ permalink raw reply [flat|nested] 62+ messages in thread
end of thread, other threads:[~2025-11-08 2:33 UTC | newest]
Thread overview: 62+ messages
2025-10-29 10:08 [PATCH v4 00/12] Nesting support for lazy MMU mode Kevin Brodsky
2025-10-29 10:08 ` [PATCH v4 01/12] powerpc/64s: Do not re-activate batched TLB flush Kevin Brodsky
2025-11-01 12:05 ` David Hildenbrand
2025-11-05 2:46 ` Ritesh Harjani
2025-11-06 10:29 ` Kevin Brodsky
2025-11-08 0:35 ` Ritesh Harjani
2025-11-07 12:25 ` Ryan Roberts
2025-11-07 12:28 ` Ryan Roberts
2025-10-29 10:08 ` [PATCH v4 02/12] x86/xen: simplify flush_lazy_mmu() Kevin Brodsky
2025-11-01 12:14 ` David Hildenbrand
2025-11-03 18:06 ` Kevin Brodsky
2025-11-07 12:31 ` Ryan Roberts
2025-11-07 15:45 ` Jürgen Groß
2025-10-29 10:09 ` [PATCH v4 03/12] powerpc/mm: implement arch_flush_lazy_mmu_mode() Kevin Brodsky
2025-11-01 12:14 ` David Hildenbrand
2025-11-05 3:15 ` Ritesh Harjani
2025-11-05 9:49 ` Ritesh Harjani
2025-11-06 10:31 ` Kevin Brodsky
2025-10-29 10:09 ` [PATCH v4 04/12] sparc/mm: " Kevin Brodsky
2025-11-01 12:14 ` David Hildenbrand
2025-10-29 10:09 ` [PATCH v4 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE Kevin Brodsky
2025-11-01 12:16 ` David Hildenbrand
2025-11-05 4:40 ` Ritesh Harjani
2025-11-06 10:33 ` Kevin Brodsky
2025-11-07 13:56 ` Ryan Roberts
2025-10-29 10:09 ` [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers Kevin Brodsky
2025-11-01 12:18 ` David Hildenbrand
2025-11-07 14:26 ` Ryan Roberts
2025-11-07 14:34 ` David Hildenbrand (Red Hat)
2025-11-07 15:22 ` Ryan Roberts
2025-10-29 10:09 ` [PATCH v4 07/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
2025-10-29 16:41 ` Alexander Gordeev
2025-10-30 10:28 ` Kevin Brodsky
2025-10-30 16:34 ` Alexander Gordeev
2025-11-01 12:22 ` David Hildenbrand
2025-11-03 18:08 ` Kevin Brodsky
2025-11-05 8:49 ` Ritesh Harjani
2025-11-05 16:12 ` Alexander Gordeev
2025-11-06 10:51 ` Kevin Brodsky
2025-11-06 15:33 ` Alexander Gordeev
2025-11-07 10:16 ` Kevin Brodsky
2025-11-06 16:32 ` Ritesh Harjani
2025-11-06 17:01 ` Ritesh Harjani
2025-11-07 11:13 ` Kevin Brodsky
2025-11-07 14:59 ` Ryan Roberts
2025-10-29 10:09 ` [PATCH v4 08/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode() Kevin Brodsky
2025-11-03 16:03 ` David Hildenbrand
2025-11-03 18:25 ` Kevin Brodsky
2025-11-07 15:28 ` Ryan Roberts
2025-10-29 10:09 ` [PATCH v4 09/12] powerpc/mm: replace batch->active " Kevin Brodsky
2025-11-03 16:05 ` David Hildenbrand
2025-11-04 11:33 ` Kevin Brodsky
2025-11-05 9:40 ` Ritesh Harjani
2025-10-29 10:09 ` [PATCH v4 10/12] sparc/mm: " Kevin Brodsky
2025-11-03 16:11 ` David Hildenbrand (Red Hat)
2025-10-29 10:09 ` [PATCH v4 11/12] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
2025-11-03 16:15 ` David Hildenbrand (Red Hat)
2025-11-03 18:29 ` Kevin Brodsky
2025-11-03 19:23 ` David Hildenbrand (Red Hat)
2025-11-04 11:28 ` Kevin Brodsky
2025-10-29 10:09 ` [PATCH v4 12/12] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
2025-11-07 15:42 ` Ryan Roberts