* [PATCH 0/6] mm: Consolidate lazy MMU mode context
@ 2025-06-12 17:36 Alexander Gordeev
  2025-06-12 17:36 ` [PATCH 1/6] mm: Cleanup apply_to_pte_range() routine Alexander Gordeev
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Alexander Gordeev @ 2025-06-12 17:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, sparclinux, xen-devel, linuxppc-dev,
	linux-s390, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
	Juergen Gross, Jeremy Fitzhardinge, Ryan Roberts

Hi All,

This series consolidates the arch_enter|leave|flush_lazy_mmu_mode()
calling context and protects not only user, but also kernel mappings
with a lock before entering the lazy MMU mode.

For kernels that are not fully preemptible (i.e. not Real-Time) this
simplifies the semantics: while the mode is active the code can assume
it is executing in atomic context. That paves the way for cleanups, such
as those suggested for sparc and powerpc, and hopefully brings a bit more
clarity into what commit 691ee97e1a9d ("mm: fix lazy mmu docs and usage")
labeled "a bit of a mess (to put it politely)".
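
For illustration, a minimal sketch of the calling context this series
converges on for kernel mappings. The helper below is hypothetical; the
locking and lazy MMU calls are the existing kernel API, used the same way
as in the vmalloc/apply_to_pte_range changes in this series:

/*
 * Sketch only: kernel PTE updates take init_mm.page_table_lock around
 * the lazy MMU section, so on non-PREEMPT_RT kernels the section runs
 * with preemption disabled (i.e. in atomic context).
 */
static void sketch_map_kernel_pages(pte_t *pte, unsigned long addr,
				    unsigned long end,
				    struct page **pages, pgprot_t prot)
{
	int i = 0;

	spin_lock(&init_mm.page_table_lock);
	arch_enter_lazy_mmu_mode();	/* arch may start batching updates */

	do {
		set_pte_at(&init_mm, addr, pte, mk_pte(pages[i++], prot));
	} while (pte++, addr += PAGE_SIZE, addr != end);

	arch_leave_lazy_mmu_mode();	/* arch flushes any batched updates */
	spin_unlock(&init_mm.page_table_lock);
}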

Fully preemptible (Real-Time) kernels could probably also be brought
into the fold, but I am not sure about the implications (to say the least).

This series is a continuation of [1] and [2], with commit b6ea95a34cbd
("kasan: avoid sleepable page allocation from atomic context") already
upstream.

I dared to keep Nicholas Piggin's R-b on patch 3, but dropped it from
patch 2 due to the new bits.

Apart from the optional sparc (untested) and powerpc (compile-tested)
updates, this series is a prerequisite for the implementation of the
lazy MMU mode on s390.

1. https://lore.kernel.org/linux-mm/cover.1744037648.git.agordeev@linux.ibm.com/#r
2. https://lore.kernel.org/linux-mm/cover.1744128123.git.agordeev@linux.ibm.com/#r

Thanks!

Alexander Gordeev (6):
  mm: Cleanup apply_to_pte_range() routine
  mm: Lock kernel page tables before entering lazy MMU mode
  mm/debug: Detect wrong arch_enter_lazy_mmu_mode() contexts
  sparc/mm: Do not disable preemption in lazy MMU mode
  powerpc/64s: Do not disable preemption in lazy MMU mode
  powerpc/64s: Do not re-activate batched TLB flush

 .../include/asm/book3s/64/tlbflush-hash.h     | 13 ++++----
 arch/powerpc/include/asm/thread_info.h        |  2 --
 arch/powerpc/kernel/process.c                 | 25 --------------
 arch/sparc/include/asm/tlbflush_64.h          |  2 +-
 arch/sparc/mm/tlb.c                           | 12 ++++---
 include/linux/pgtable.h                       | 32 +++++++++++++-----
 mm/kasan/shadow.c                             |  5 ---
 mm/memory.c                                   | 33 ++++++++++++-------
 mm/vmalloc.c                                  |  6 ++++
 9 files changed, 65 insertions(+), 65 deletions(-)

-- 
2.48.1



* [PATCH 1/6] mm: Cleanup apply_to_pte_range() routine
  2025-06-12 17:36 [PATCH 0/6] mm: Consolidate lazy MMU mode context Alexander Gordeev
@ 2025-06-12 17:36 ` Alexander Gordeev
  2025-06-12 17:36 ` [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode Alexander Gordeev
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Alexander Gordeev @ 2025-06-12 17:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, sparclinux, xen-devel, linuxppc-dev,
	linux-s390, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
	Juergen Gross, Jeremy Fitzhardinge, Ryan Roberts

Reverse the 'create' vs 'mm == &init_mm' conditions and move the
page table mask modification out of the atomic context.
This is a prerequisite for locking the kernel page tables.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 mm/memory.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8eba595056fe..71b3d3f98999 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3006,24 +3006,28 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				     pte_fn_t fn, void *data, bool create,
 				     pgtbl_mod_mask *mask)
 {
+	int err = create ? -ENOMEM : -EINVAL;
 	pte_t *pte, *mapped_pte;
-	int err = 0;
 	spinlock_t *ptl;
 
-	if (create) {
-		mapped_pte = pte = (mm == &init_mm) ?
-			pte_alloc_kernel_track(pmd, addr, mask) :
-			pte_alloc_map_lock(mm, pmd, addr, &ptl);
+	if (mm == &init_mm) {
+		if (create)
+			pte = pte_alloc_kernel_track(pmd, addr, mask);
+		else
+			pte = pte_offset_kernel(pmd, addr);
 		if (!pte)
-			return -ENOMEM;
+			return err;
 	} else {
-		mapped_pte = pte = (mm == &init_mm) ?
-			pte_offset_kernel(pmd, addr) :
-			pte_offset_map_lock(mm, pmd, addr, &ptl);
+		if (create)
+			pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
+		else
+			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 		if (!pte)
-			return -EINVAL;
+			return err;
+		mapped_pte = pte;
 	}
 
+	err = 0;
 	arch_enter_lazy_mmu_mode();
 
 	if (fn) {
@@ -3035,12 +3039,14 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			}
 		} while (pte++, addr += PAGE_SIZE, addr != end);
 	}
-	*mask |= PGTBL_PTE_MODIFIED;
 
 	arch_leave_lazy_mmu_mode();
 
 	if (mm != &init_mm)
 		pte_unmap_unlock(mapped_pte, ptl);
+
+	*mask |= PGTBL_PTE_MODIFIED;
+
 	return err;
 }
 
-- 
2.48.1



* [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode
  2025-06-12 17:36 [PATCH 0/6] mm: Consolidate lazy MMU mode context Alexander Gordeev
  2025-06-12 17:36 ` [PATCH 1/6] mm: Cleanup apply_to_pte_range() routine Alexander Gordeev
@ 2025-06-12 17:36 ` Alexander Gordeev
  2025-06-13  8:26   ` Ryan Roberts
                     ` (2 more replies)
  2025-06-12 17:36 ` [PATCH 3/6] mm/debug: Detect wrong arch_enter_lazy_mmu_mode() contexts Alexander Gordeev
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 12+ messages in thread
From: Alexander Gordeev @ 2025-06-12 17:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, sparclinux, xen-devel, linuxppc-dev,
	linux-s390, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
	Juergen Gross, Jeremy Fitzhardinge, Ryan Roberts

As a follow-up to commit 691ee97e1a9d ("mm: fix lazy mmu docs and
usage") take a step forward and protect not only user, but also
kernel mappings with a lock before entering the lazy MMU mode. With
that the semantics of the arch_enter|leave_lazy_mmu_mode() callbacks
are consolidated, which allows further simplifications.

The effect of this consolidation is that kernels which are not fully
preemptible (i.e. not Real-Time) cannot context switch while the lazy
MMU mode is active - which is easier to comprehend.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 include/linux/pgtable.h | 12 ++++++------
 mm/kasan/shadow.c       |  5 -----
 mm/memory.c             |  5 ++++-
 mm/vmalloc.c            |  6 ++++++
 4 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 0b6e1f781d86..33bf2b13c219 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -224,12 +224,12 @@ static inline int pmd_dirty(pmd_t pmd)
  * a raw PTE pointer after it has been modified are not guaranteed to be
  * up to date.
  *
- * In the general case, no lock is guaranteed to be held between entry and exit
- * of the lazy mode. So the implementation must assume preemption may be enabled
- * and cpu migration is possible; it must take steps to be robust against this.
- * (In practice, for user PTE updates, the appropriate page table lock(s) are
- * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
- * and the mode cannot be used in interrupt context.
+ * For PREEMPT_RT kernels implementation must assume that preemption may
+ * be enabled and cpu migration is possible between entry and exit of the
+ * lazy MMU mode; it must take steps to be robust against this. There is
+ * no such assumption for non-PREEMPT_RT kernels, since both kernel and
+ * user page tables are protected with a spinlock while in lazy MMU mode.
+ * Nesting is not permitted and the mode cannot be used in interrupt context.
  */
 #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
 #define arch_enter_lazy_mmu_mode()	do {} while (0)
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index d2c70cd2afb1..45115bd770a9 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -313,12 +313,10 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 	__memset(page_to_virt(page), KASAN_VMALLOC_INVALID, PAGE_SIZE);
 	pte = pfn_pte(page_to_pfn(page), PAGE_KERNEL);
 
-	spin_lock(&init_mm.page_table_lock);
 	if (likely(pte_none(ptep_get(ptep)))) {
 		set_pte_at(&init_mm, addr, ptep, pte);
 		data->pages[index] = NULL;
 	}
-	spin_unlock(&init_mm.page_table_lock);
 
 	return 0;
 }
@@ -465,13 +463,10 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 
 	page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
 
-	spin_lock(&init_mm.page_table_lock);
-
 	if (likely(!pte_none(ptep_get(ptep)))) {
 		pte_clear(&init_mm, addr, ptep);
 		free_page(page);
 	}
-	spin_unlock(&init_mm.page_table_lock);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 71b3d3f98999..1ddc532b1f13 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3017,6 +3017,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			pte = pte_offset_kernel(pmd, addr);
 		if (!pte)
 			return err;
+		spin_lock(&init_mm.page_table_lock);
 	} else {
 		if (create)
 			pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
@@ -3042,7 +3043,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 	arch_leave_lazy_mmu_mode();
 
-	if (mm != &init_mm)
+	if (mm == &init_mm)
+		spin_unlock(&init_mm.page_table_lock);
+	else
 		pte_unmap_unlock(mapped_pte, ptl);
 
 	*mask |= PGTBL_PTE_MODIFIED;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ab986dd09b6a..57b11000ae36 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -105,6 +105,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	if (!pte)
 		return -ENOMEM;
 
+	spin_lock(&init_mm.page_table_lock);
 	arch_enter_lazy_mmu_mode();
 
 	do {
@@ -132,6 +133,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	} while (pte += PFN_DOWN(size), addr += size, addr != end);
 
 	arch_leave_lazy_mmu_mode();
+	spin_unlock(&init_mm.page_table_lock);
 	*mask |= PGTBL_PTE_MODIFIED;
 	return 0;
 }
@@ -359,6 +361,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	unsigned long size = PAGE_SIZE;
 
 	pte = pte_offset_kernel(pmd, addr);
+	spin_lock(&init_mm.page_table_lock);
 	arch_enter_lazy_mmu_mode();
 
 	do {
@@ -379,6 +382,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	} while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
 
 	arch_leave_lazy_mmu_mode();
+	spin_unlock(&init_mm.page_table_lock);
 	*mask |= PGTBL_PTE_MODIFIED;
 }
 
@@ -525,6 +529,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	if (!pte)
 		return -ENOMEM;
 
+	spin_lock(&init_mm.page_table_lock);
 	arch_enter_lazy_mmu_mode();
 
 	do {
@@ -542,6 +547,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
 	arch_leave_lazy_mmu_mode();
+	spin_unlock(&init_mm.page_table_lock);
 	*mask |= PGTBL_PTE_MODIFIED;
 	return 0;
 }
-- 
2.48.1



* [PATCH 3/6] mm/debug: Detect wrong arch_enter_lazy_mmu_mode() contexts
  2025-06-12 17:36 [PATCH 0/6] mm: Consolidate lazy MMU mode context Alexander Gordeev
  2025-06-12 17:36 ` [PATCH 1/6] mm: Cleanup apply_to_pte_range() routine Alexander Gordeev
  2025-06-12 17:36 ` [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode Alexander Gordeev
@ 2025-06-12 17:36 ` Alexander Gordeev
  2025-06-12 17:36 ` [PATCH 4/6] sparc/mm: Do not disable preemption in lazy MMU mode Alexander Gordeev
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Alexander Gordeev @ 2025-06-12 17:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, sparclinux, xen-devel, linuxppc-dev,
	linux-s390, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
	Juergen Gross, Jeremy Fitzhardinge, Ryan Roberts

Make the default arch_enter|leave|flush_lazy_mmu_mode() callbacks
complain about enabled preemption to detect wrong contexts. That
could help prevent misuse of the complicated lazy MMU mode
semantics, such as the one that was solved with commit b9ef323ea168
("powerpc/64s: Disable preemption in hash lazy mmu mode").

Skip fully preemptible kernels, since in that case taking the
page table lock does not disable preemption, so the described
check would be wrong.

Most platforms do not implement the lazy MMU mode callbacks, so to
avoid a performance impact the complaint is only enabled when the
CONFIG_DEBUG_VM option is set.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 include/linux/pgtable.h | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 33bf2b13c219..0cb8abdc58a8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -232,9 +232,23 @@ static inline int pmd_dirty(pmd_t pmd)
  * Nesting is not permitted and the mode cannot be used in interrupt context.
  */
 #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-#define arch_enter_lazy_mmu_mode()	do {} while (0)
-#define arch_leave_lazy_mmu_mode()	do {} while (0)
-#define arch_flush_lazy_mmu_mode()	do {} while (0)
+static inline void arch_enter_lazy_mmu_mode(void)
+{
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		VM_WARN_ON_ONCE(preemptible());
+}
+
+static inline void arch_leave_lazy_mmu_mode(void)
+{
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		VM_WARN_ON_ONCE(preemptible());
+}
+
+static inline void arch_flush_lazy_mmu_mode(void)
+{
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		VM_WARN_ON_ONCE(preemptible());
+}
 #endif
 
 #ifndef pte_batch_hint
-- 
2.48.1



* [PATCH 4/6] sparc/mm: Do not disable preemption in lazy MMU mode
  2025-06-12 17:36 [PATCH 0/6] mm: Consolidate lazy MMU mode context Alexander Gordeev
                   ` (2 preceding siblings ...)
  2025-06-12 17:36 ` [PATCH 3/6] mm/debug: Detect wrong arch_enter_lazy_mmu_mode() contexts Alexander Gordeev
@ 2025-06-12 17:36 ` Alexander Gordeev
  2025-06-13  8:40   ` Ryan Roberts
  2025-06-12 17:36 ` [PATCH 5/6] powerpc/64s: " Alexander Gordeev
  2025-06-12 17:36 ` [PATCH 6/6] powerpc/64s: Do not re-activate batched TLB flush Alexander Gordeev
  5 siblings, 1 reply; 12+ messages in thread
From: Alexander Gordeev @ 2025-06-12 17:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, sparclinux, xen-devel, linuxppc-dev,
	linux-s390, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
	Juergen Gross, Jeremy Fitzhardinge, Ryan Roberts

Commit a1d416bf9faf ("sparc/mm: disable preemption in lazy mmu mode")
is no longer necessary, since the lazy MMU mode is now entered with a
spinlock held and sparc does not support Real-Time. Thus, upon entering
the lazy mode preemption is already disabled.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 arch/sparc/include/asm/tlbflush_64.h |  2 +-
 arch/sparc/mm/tlb.c                  | 12 ++++++++----
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index 8b8cdaa69272..a6d8068fb211 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -44,7 +44,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 void flush_tlb_pending(void);
 void arch_enter_lazy_mmu_mode(void);
 void arch_leave_lazy_mmu_mode(void);
-#define arch_flush_lazy_mmu_mode()      do {} while (0)
+void arch_flush_lazy_mmu_mode(void);
 
 /* Local cpu only.  */
 void __flush_tlb_all(void);
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index a35ddcca5e76..e46dfd5f2583 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -52,10 +52,9 @@ void flush_tlb_pending(void)
 
 void arch_enter_lazy_mmu_mode(void)
 {
-	struct tlb_batch *tb;
+	struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
 
-	preempt_disable();
-	tb = this_cpu_ptr(&tlb_batch);
+	VM_WARN_ON_ONCE(preemptible());
 	tb->active = 1;
 }
 
@@ -63,10 +62,15 @@ void arch_leave_lazy_mmu_mode(void)
 {
 	struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
 
+	VM_WARN_ON_ONCE(preemptible());
 	if (tb->tlb_nr)
 		flush_tlb_pending();
 	tb->active = 0;
-	preempt_enable();
+}
+
+void arch_flush_lazy_mmu_mode(void)
+{
+	VM_WARN_ON_ONCE(preemptible());
 }
 
 static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr,
-- 
2.48.1



* [PATCH 5/6] powerpc/64s: Do not disable preemption in lazy MMU mode
  2025-06-12 17:36 [PATCH 0/6] mm: Consolidate lazy MMU mode context Alexander Gordeev
                   ` (3 preceding siblings ...)
  2025-06-12 17:36 ` [PATCH 4/6] sparc/mm: Do not disable preemption in lazy MMU mode Alexander Gordeev
@ 2025-06-12 17:36 ` Alexander Gordeev
  2025-06-12 17:36 ` [PATCH 6/6] powerpc/64s: Do not re-activate batched TLB flush Alexander Gordeev
  5 siblings, 0 replies; 12+ messages in thread
From: Alexander Gordeev @ 2025-06-12 17:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, sparclinux, xen-devel, linuxppc-dev,
	linux-s390, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
	Juergen Gross, Jeremy Fitzhardinge, Ryan Roberts

Commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash lazy
mmu mode") is no longer necessary, since the lazy MMU mode is now
entered with a spinlock held and powerpc does not support Real-Time.
Thus, upon entering the lazy mode preemption is already disabled.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 146287d9580f..aeac22b576c8 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -30,13 +30,9 @@ static inline void arch_enter_lazy_mmu_mode(void)
 {
 	struct ppc64_tlb_batch *batch;
 
+	VM_WARN_ON_ONCE(preemptible());
 	if (radix_enabled())
 		return;
-	/*
-	 * apply_to_page_range can call us this preempt enabled when
-	 * operating on kernel page tables.
-	 */
-	preempt_disable();
 	batch = this_cpu_ptr(&ppc64_tlb_batch);
 	batch->active = 1;
 }
@@ -45,6 +41,7 @@ static inline void arch_leave_lazy_mmu_mode(void)
 {
 	struct ppc64_tlb_batch *batch;
 
+	VM_WARN_ON_ONCE(preemptible());
 	if (radix_enabled())
 		return;
 	batch = this_cpu_ptr(&ppc64_tlb_batch);
@@ -52,10 +49,12 @@ static inline void arch_leave_lazy_mmu_mode(void)
 	if (batch->index)
 		__flush_tlb_pending(batch);
 	batch->active = 0;
-	preempt_enable();
 }
 
-#define arch_flush_lazy_mmu_mode()      do {} while (0)
+static inline void arch_flush_lazy_mmu_mode(void)
+{
+	VM_WARN_ON_ONCE(preemptible());
+}
 
 extern void hash__tlbiel_all(unsigned int action);
 
-- 
2.48.1



* [PATCH 6/6] powerpc/64s: Do not re-activate batched TLB flush
  2025-06-12 17:36 [PATCH 0/6] mm: Consolidate lazy MMU mode context Alexander Gordeev
                   ` (4 preceding siblings ...)
  2025-06-12 17:36 ` [PATCH 5/6] powerpc/64s: " Alexander Gordeev
@ 2025-06-12 17:36 ` Alexander Gordeev
  5 siblings, 0 replies; 12+ messages in thread
From: Alexander Gordeev @ 2025-06-12 17:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, sparclinux, xen-devel, linuxppc-dev,
	linux-s390, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
	Juergen Gross, Jeremy Fitzhardinge, Ryan Roberts

Since commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash
lazy mmu mode") a task can not be preempted while in lazy MMU mode.
Therefore, the batch re-activation code is never called, so remove it.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 arch/powerpc/include/asm/thread_info.h |  2 --
 arch/powerpc/kernel/process.c          | 25 -------------------------
 2 files changed, 27 deletions(-)

diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 2785c7462ebf..092118a68862 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -154,12 +154,10 @@ void arch_setup_new_exec(void);
 /* Don't move TLF_NAPPING without adjusting the code in entry_32.S */
 #define TLF_NAPPING		0	/* idle thread enabled NAP mode */
 #define TLF_SLEEPING		1	/* suspend code enabled SLEEP mode */
-#define TLF_LAZY_MMU		3	/* tlb_batch is active */
 #define TLF_RUNLATCH		4	/* Is the runlatch enabled? */
 
 #define _TLF_NAPPING		(1 << TLF_NAPPING)
 #define _TLF_SLEEPING		(1 << TLF_SLEEPING)
-#define _TLF_LAZY_MMU		(1 << TLF_LAZY_MMU)
 #define _TLF_RUNLATCH		(1 << TLF_RUNLATCH)
 
 #ifndef __ASSEMBLY__
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 855e09886503..edb59a447149 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1281,9 +1281,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
 {
 	struct thread_struct *new_thread, *old_thread;
 	struct task_struct *last;
-#ifdef CONFIG_PPC_64S_HASH_MMU
-	struct ppc64_tlb_batch *batch;
-#endif
 
 	new_thread = &new->thread;
 	old_thread = &current->thread;
@@ -1291,14 +1288,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
 	WARN_ON(!irqs_disabled());
 
 #ifdef CONFIG_PPC_64S_HASH_MMU
-	batch = this_cpu_ptr(&ppc64_tlb_batch);
-	if (batch->active) {
-		current_thread_info()->local_flags |= _TLF_LAZY_MMU;
-		if (batch->index)
-			__flush_tlb_pending(batch);
-		batch->active = 0;
-	}
-
 	/*
 	 * On POWER9 the copy-paste buffer can only paste into
 	 * foreign real addresses, so unprivileged processes can not
@@ -1369,20 +1358,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
 	 */
 
 #ifdef CONFIG_PPC_BOOK3S_64
-#ifdef CONFIG_PPC_64S_HASH_MMU
-	/*
-	 * This applies to a process that was context switched while inside
-	 * arch_enter_lazy_mmu_mode(), to re-activate the batch that was
-	 * deactivated above, before _switch(). This will never be the case
-	 * for new tasks.
-	 */
-	if (current_thread_info()->local_flags & _TLF_LAZY_MMU) {
-		current_thread_info()->local_flags &= ~_TLF_LAZY_MMU;
-		batch = this_cpu_ptr(&ppc64_tlb_batch);
-		batch->active = 1;
-	}
-#endif
-
 	/*
 	 * Math facilities are masked out of the child MSR in copy_thread.
 	 * A new task does not need to restore_math because it will
-- 
2.48.1



* Re: [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode
  2025-06-12 17:36 ` [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode Alexander Gordeev
@ 2025-06-13  8:26   ` Ryan Roberts
  2025-06-13 10:32   ` Uladzislau Rezki
  2025-06-18 17:32   ` Dan Carpenter
  2 siblings, 0 replies; 12+ messages in thread
From: Ryan Roberts @ 2025-06-13  8:26 UTC (permalink / raw)
  To: Alexander Gordeev, Andrew Morton
  Cc: linux-kernel, linux-mm, sparclinux, xen-devel, linuxppc-dev,
	linux-s390, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
	Juergen Gross, Jeremy Fitzhardinge

On 12/06/2025 18:36, Alexander Gordeev wrote:
> As a follow-up to commit 691ee97e1a9d ("mm: fix lazy mmu docs and
> usage") take a step forward and protect with a lock not only user,
> but also kernel mappings before entering the lazy MMU mode. With
> that the semantics of arch_enter|leave_lazy_mmu_mode() callbacks
> is consolidated, which allows further simplifications.
> 
> The effect of this consolidation is not fully preemptible (Real-Time)
> kernels can not enter the context switch while the lazy MMU mode is
> active - which is easier to comprehend.
> 
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> ---
>  include/linux/pgtable.h | 12 ++++++------
>  mm/kasan/shadow.c       |  5 -----
>  mm/memory.c             |  5 ++++-
>  mm/vmalloc.c            |  6 ++++++
>  4 files changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 0b6e1f781d86..33bf2b13c219 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -224,12 +224,12 @@ static inline int pmd_dirty(pmd_t pmd)
>   * a raw PTE pointer after it has been modified are not guaranteed to be
>   * up to date.
>   *
> - * In the general case, no lock is guaranteed to be held between entry and exit
> - * of the lazy mode. So the implementation must assume preemption may be enabled
> - * and cpu migration is possible; it must take steps to be robust against this.
> - * (In practice, for user PTE updates, the appropriate page table lock(s) are
> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> - * and the mode cannot be used in interrupt context.
> + * For PREEMPT_RT kernels implementation must assume that preemption may
> + * be enabled and cpu migration is possible between entry and exit of the
> + * lazy MMU mode; it must take steps to be robust against this. There is
> + * no such assumption for non-PREEMPT_RT kernels, since both kernel and
> + * user page tables are protected with a spinlock while in lazy MMU mode.
> + * Nesting is not permitted and the mode cannot be used in interrupt context.

While I agree that the spec for lazy mmu mode is not well defined, and I
welcome changes to clarify and unify the implementations across arches, I
think this is a step in the wrong direction.

First, the major one: you are serializing kernel pgtable operations that
don't need to be serialized. This, surely, can only lead to performance loss?
vmalloc could previously (mostly) run in parallel; the only part that was
serialized was the allocation of the VA space. Once that's done, operations
on the VA space can be done in parallel because each is only operating on the
area it allocated. With your change I think all pte operations are serialised
with the single init_mm.page_table_lock.

Additionally, some arches (inc arm64) use apply_to_page_range() to modify the
permissions of regions of kernel VA space. Again, we used to be able to modify
multiple regions in parallel, but you are now serializing this for no good reason.

Secondly, the lazy mmu handler still needs to handle the
preemption-while-in-lazy-mmu case because, as you mention, it can still be
preempted for PREEMPT_RT kernels where the spin lock is converted to a sleepable
lock.

So I think the handler needs to either explicitly disable preemption (as powerpc
and sparc do) or handle it by plugging into the arch-specific context switch
code (as x86 does) or only maintain per-task state in the first place (as arm64
does).

Thanks,
Ryan
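
For reference, a sketch of option (1) above - essentially what sparc does
today (see patch 4 of this series, which removes it): the handler itself
disables preemption instead of relying on a caller-held spinlock. Assembled
here from the quoted sparc code purely for illustration:

void arch_enter_lazy_mmu_mode(void)
{
	struct tlb_batch *tb;

	preempt_disable();		/* pin the task to this CPU */
	tb = this_cpu_ptr(&tlb_batch);
	tb->active = 1;			/* start batching TLB flushes */
}

void arch_leave_lazy_mmu_mode(void)
{
	struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);

	if (tb->tlb_nr)
		flush_tlb_pending();	/* flush whatever was batched */
	tb->active = 0;
	preempt_enable();
}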

>   */
>  #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
>  #define arch_enter_lazy_mmu_mode()	do {} while (0)
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index d2c70cd2afb1..45115bd770a9 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -313,12 +313,10 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
>  	__memset(page_to_virt(page), KASAN_VMALLOC_INVALID, PAGE_SIZE);
>  	pte = pfn_pte(page_to_pfn(page), PAGE_KERNEL);
>  
> -	spin_lock(&init_mm.page_table_lock);
>  	if (likely(pte_none(ptep_get(ptep)))) {
>  		set_pte_at(&init_mm, addr, ptep, pte);
>  		data->pages[index] = NULL;
>  	}
> -	spin_unlock(&init_mm.page_table_lock);
>  
>  	return 0;
>  }
> @@ -465,13 +463,10 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
>  
>  	page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
>  
> -	spin_lock(&init_mm.page_table_lock);
> -
>  	if (likely(!pte_none(ptep_get(ptep)))) {
>  		pte_clear(&init_mm, addr, ptep);
>  		free_page(page);
>  	}
> -	spin_unlock(&init_mm.page_table_lock);
>  
>  	return 0;
>  }
> diff --git a/mm/memory.c b/mm/memory.c
> index 71b3d3f98999..1ddc532b1f13 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3017,6 +3017,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  			pte = pte_offset_kernel(pmd, addr);
>  		if (!pte)
>  			return err;
> +		spin_lock(&init_mm.page_table_lock);
>  	} else {
>  		if (create)
>  			pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> @@ -3042,7 +3043,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  
>  	arch_leave_lazy_mmu_mode();
>  
> -	if (mm != &init_mm)
> +	if (mm == &init_mm)
> +		spin_unlock(&init_mm.page_table_lock);
> +	else
>  		pte_unmap_unlock(mapped_pte, ptl);
>  
>  	*mask |= PGTBL_PTE_MODIFIED;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ab986dd09b6a..57b11000ae36 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -105,6 +105,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	if (!pte)
>  		return -ENOMEM;
>  
> +	spin_lock(&init_mm.page_table_lock);
>  	arch_enter_lazy_mmu_mode();
>  
>  	do {
> @@ -132,6 +133,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	} while (pte += PFN_DOWN(size), addr += size, addr != end);
>  
>  	arch_leave_lazy_mmu_mode();
> +	spin_unlock(&init_mm.page_table_lock);
>  	*mask |= PGTBL_PTE_MODIFIED;
>  	return 0;
>  }
> @@ -359,6 +361,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	unsigned long size = PAGE_SIZE;
>  
>  	pte = pte_offset_kernel(pmd, addr);
> +	spin_lock(&init_mm.page_table_lock);
>  	arch_enter_lazy_mmu_mode();
>  
>  	do {
> @@ -379,6 +382,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	} while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
>  
>  	arch_leave_lazy_mmu_mode();
> +	spin_unlock(&init_mm.page_table_lock);
>  	*mask |= PGTBL_PTE_MODIFIED;
>  }
>  
> @@ -525,6 +529,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  	if (!pte)
>  		return -ENOMEM;
>  
> +	spin_lock(&init_mm.page_table_lock);
>  	arch_enter_lazy_mmu_mode();
>  
>  	do {
> @@ -542,6 +547,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  
>  	arch_leave_lazy_mmu_mode();
> +	spin_unlock(&init_mm.page_table_lock);
>  	*mask |= PGTBL_PTE_MODIFIED;
>  	return 0;
>  }



* Re: [PATCH 4/6] sparc/mm: Do not disable preemption in lazy MMU mode
  2025-06-12 17:36 ` [PATCH 4/6] sparc/mm: Do not disable preemption in lazy MMU mode Alexander Gordeev
@ 2025-06-13  8:40   ` Ryan Roberts
  0 siblings, 0 replies; 12+ messages in thread
From: Ryan Roberts @ 2025-06-13  8:40 UTC (permalink / raw)
  To: Alexander Gordeev, Andrew Morton
  Cc: linux-kernel, linux-mm, sparclinux, xen-devel, linuxppc-dev,
	linux-s390, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
	Juergen Gross, Jeremy Fitzhardinge

On 12/06/2025 18:36, Alexander Gordeev wrote:
> Commit a1d416bf9faf ("sparc/mm: disable preemption in lazy mmu mode")
> is not necessary anymore, since the lazy MMU mode is entered with a
> spinlock held and sparc does not support Real-Time. Thus, upon entering
> the lazy mode the preemption is already disabled.

Surely, given that Sparc knows it doesn't support PREEMPT_RT, it is better for
its implementation to explicitly disable preemption rather than rely on the
spinlock to do it, since the spinlock penalizes other arches unnecessarily? It
also prevents multiple CPUs from updating (different areas of) kernel pgtables
in parallel. The property Sparc needs is for the task to stay on the same CPU
without interruption, right? Same goes for powerpc.

> 
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> ---
>  arch/sparc/include/asm/tlbflush_64.h |  2 +-
>  arch/sparc/mm/tlb.c                  | 12 ++++++++----
>  2 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
> index 8b8cdaa69272..a6d8068fb211 100644
> --- a/arch/sparc/include/asm/tlbflush_64.h
> +++ b/arch/sparc/include/asm/tlbflush_64.h
> @@ -44,7 +44,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end);
>  void flush_tlb_pending(void);
>  void arch_enter_lazy_mmu_mode(void);
>  void arch_leave_lazy_mmu_mode(void);
> -#define arch_flush_lazy_mmu_mode()      do {} while (0)
> +void arch_flush_lazy_mmu_mode(void);
>  
>  /* Local cpu only.  */
>  void __flush_tlb_all(void);
> diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
> index a35ddcca5e76..e46dfd5f2583 100644
> --- a/arch/sparc/mm/tlb.c
> +++ b/arch/sparc/mm/tlb.c
> @@ -52,10 +52,9 @@ void flush_tlb_pending(void)
>  
>  void arch_enter_lazy_mmu_mode(void)
>  {
> -	struct tlb_batch *tb;
> +	struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
>  
> -	preempt_disable();
> -	tb = this_cpu_ptr(&tlb_batch);
> +	VM_WARN_ON_ONCE(preemptible());
>  	tb->active = 1;
>  }
>  
> @@ -63,10 +62,15 @@ void arch_leave_lazy_mmu_mode(void)
>  {
>  	struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
>  
> +	VM_WARN_ON_ONCE(preemptible());
>  	if (tb->tlb_nr)
>  		flush_tlb_pending();
>  	tb->active = 0;
> -	preempt_enable();
> +}
> +
> +void arch_flush_lazy_mmu_mode(void)
> +{
> +	VM_WARN_ON_ONCE(preemptible());
>  }
>  
>  static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr,



* Re: [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode
  2025-06-12 17:36 ` [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode Alexander Gordeev
  2025-06-13  8:26   ` Ryan Roberts
@ 2025-06-13 10:32   ` Uladzislau Rezki
  2025-06-18 17:32   ` Dan Carpenter
  2 siblings, 0 replies; 12+ messages in thread
From: Uladzislau Rezki @ 2025-06-13 10:32 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Andrew Morton, linux-kernel, linux-mm, sparclinux, xen-devel,
	linuxppc-dev, linux-s390, Hugh Dickins, Nicholas Piggin,
	Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge, Ryan Roberts

On Thu, Jun 12, 2025 at 07:36:09PM +0200, Alexander Gordeev wrote:
> As a follow-up to commit 691ee97e1a9d ("mm: fix lazy mmu docs and
> usage") take a step forward and protect with a lock not only user,
> but also kernel mappings before entering the lazy MMU mode. With
> that the semantics of arch_enter|leave_lazy_mmu_mode() callbacks
> is consolidated, which allows further simplifications.
> 
> The effect of this consolidation is not fully preemptible (Real-Time)
> kernels can not enter the context switch while the lazy MMU mode is
> active - which is easier to comprehend.
> 
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> ---
>  include/linux/pgtable.h | 12 ++++++------
>  mm/kasan/shadow.c       |  5 -----
>  mm/memory.c             |  5 ++++-
>  mm/vmalloc.c            |  6 ++++++
>  4 files changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 0b6e1f781d86..33bf2b13c219 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -224,12 +224,12 @@ static inline int pmd_dirty(pmd_t pmd)
>   * a raw PTE pointer after it has been modified are not guaranteed to be
>   * up to date.
>   *
> - * In the general case, no lock is guaranteed to be held between entry and exit
> - * of the lazy mode. So the implementation must assume preemption may be enabled
> - * and cpu migration is possible; it must take steps to be robust against this.
> - * (In practice, for user PTE updates, the appropriate page table lock(s) are
> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> - * and the mode cannot be used in interrupt context.
> + * For PREEMPT_RT kernels implementation must assume that preemption may
> + * be enabled and cpu migration is possible between entry and exit of the
> + * lazy MMU mode; it must take steps to be robust against this. There is
> + * no such assumption for non-PREEMPT_RT kernels, since both kernel and
> + * user page tables are protected with a spinlock while in lazy MMU mode.
> + * Nesting is not permitted and the mode cannot be used in interrupt context.
>   */
>  #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
>  #define arch_enter_lazy_mmu_mode()	do {} while (0)
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index d2c70cd2afb1..45115bd770a9 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -313,12 +313,10 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
>  	__memset(page_to_virt(page), KASAN_VMALLOC_INVALID, PAGE_SIZE);
>  	pte = pfn_pte(page_to_pfn(page), PAGE_KERNEL);
>  
> -	spin_lock(&init_mm.page_table_lock);
>  	if (likely(pte_none(ptep_get(ptep)))) {
>  		set_pte_at(&init_mm, addr, ptep, pte);
>  		data->pages[index] = NULL;
>  	}
> -	spin_unlock(&init_mm.page_table_lock);
>  
>  	return 0;
>  }
> @@ -465,13 +463,10 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
>  
>  	page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
>  
> -	spin_lock(&init_mm.page_table_lock);
> -
>  	if (likely(!pte_none(ptep_get(ptep)))) {
>  		pte_clear(&init_mm, addr, ptep);
>  		free_page(page);
>  	}
> -	spin_unlock(&init_mm.page_table_lock);
>  
>  	return 0;
>  }
> diff --git a/mm/memory.c b/mm/memory.c
> index 71b3d3f98999..1ddc532b1f13 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3017,6 +3017,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  			pte = pte_offset_kernel(pmd, addr);
>  		if (!pte)
>  			return err;
> +		spin_lock(&init_mm.page_table_lock);
>  	} else {
>  		if (create)
>  			pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> @@ -3042,7 +3043,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  
>  	arch_leave_lazy_mmu_mode();
>  
> -	if (mm != &init_mm)
> +	if (mm == &init_mm)
> +		spin_unlock(&init_mm.page_table_lock);
> +	else
>  		pte_unmap_unlock(mapped_pte, ptl);
>  
>  	*mask |= PGTBL_PTE_MODIFIED;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ab986dd09b6a..57b11000ae36 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -105,6 +105,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	if (!pte)
>  		return -ENOMEM;
>  
> +	spin_lock(&init_mm.page_table_lock);
>
This is not good. We introduce another bottleneck.

--
Uladzislau Rezki


* Re: [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode
  2025-06-12 17:36 ` [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode Alexander Gordeev
  2025-06-13  8:26   ` Ryan Roberts
  2025-06-13 10:32   ` Uladzislau Rezki
@ 2025-06-18 17:32   ` Dan Carpenter
  2025-06-19  9:53     ` Uladzislau Rezki
  2 siblings, 1 reply; 12+ messages in thread
From: Dan Carpenter @ 2025-06-18 17:32 UTC (permalink / raw)
  To: oe-kbuild, Alexander Gordeev, Andrew Morton
  Cc: lkp, oe-kbuild-all, Linux Memory Management List, linux-kernel,
	sparclinux, xen-devel, linuxppc-dev, linux-s390, Hugh Dickins,
	Nicholas Piggin, Guenter Roeck, Juergen Gross,
	Jeremy Fitzhardinge, Ryan Roberts

Hi Alexander,

kernel test robot noticed the following build warnings:

url:    https://github.com/intel-lab-lkp/linux/commits/Alexander-Gordeev/mm-Cleanup-apply_to_pte_range-routine/20250613-013835
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/7bd3a45dbc375dc2c15cebae09cb2bb972d6039f.1749747752.git.agordeev%40linux.ibm.com
patch subject: [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode
config: x86_64-randconfig-161-20250613 (https://download.01.org/0day-ci/archive/20250613/202506132017.T1l1l6ME-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202506132017.T1l1l6ME-lkp@intel.com/

smatch warnings:
mm/vmalloc.c:552 vmap_pages_pte_range() warn: inconsistent returns 'global &init_mm.page_table_lock'.

vim +552 mm/vmalloc.c

0a264884046f1ab Nicholas Piggin   2021-04-29  517  static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
2ba3e6947aed9bb Joerg Roedel      2020-06-01  518  		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
2ba3e6947aed9bb Joerg Roedel      2020-06-01  519  		pgtbl_mod_mask *mask)
^1da177e4c3f415 Linus Torvalds    2005-04-16  520  {
^1da177e4c3f415 Linus Torvalds    2005-04-16  521  	pte_t *pte;
^1da177e4c3f415 Linus Torvalds    2005-04-16  522  
db64fe02258f150 Nicholas Piggin   2008-10-18  523  	/*
db64fe02258f150 Nicholas Piggin   2008-10-18  524  	 * nr is a running index into the array which helps higher level
db64fe02258f150 Nicholas Piggin   2008-10-18  525  	 * callers keep track of where we're up to.
db64fe02258f150 Nicholas Piggin   2008-10-18  526  	 */
db64fe02258f150 Nicholas Piggin   2008-10-18  527  
2ba3e6947aed9bb Joerg Roedel      2020-06-01  528  	pte = pte_alloc_kernel_track(pmd, addr, mask);
^1da177e4c3f415 Linus Torvalds    2005-04-16  529  	if (!pte)
^1da177e4c3f415 Linus Torvalds    2005-04-16  530  		return -ENOMEM;
44562c71e2cfc9e Ryan Roberts      2025-04-22  531  
dac0cc793368851 Alexander Gordeev 2025-06-12  532  	spin_lock(&init_mm.page_table_lock);
44562c71e2cfc9e Ryan Roberts      2025-04-22  533  	arch_enter_lazy_mmu_mode();
44562c71e2cfc9e Ryan Roberts      2025-04-22  534  
^1da177e4c3f415 Linus Torvalds    2005-04-16  535  	do {
db64fe02258f150 Nicholas Piggin   2008-10-18  536  		struct page *page = pages[*nr];
db64fe02258f150 Nicholas Piggin   2008-10-18  537  
c33c794828f2121 Ryan Roberts      2023-06-12  538  		if (WARN_ON(!pte_none(ptep_get(pte))))
db64fe02258f150 Nicholas Piggin   2008-10-18  539  			return -EBUSY;
db64fe02258f150 Nicholas Piggin   2008-10-18  540  		if (WARN_ON(!page))
^1da177e4c3f415 Linus Torvalds    2005-04-16  541  			return -ENOMEM;
4fcdcc12915c707 Yury Norov        2022-04-28  542  		if (WARN_ON(!pfn_valid(page_to_pfn(page))))
4fcdcc12915c707 Yury Norov        2022-04-28  543  			return -EINVAL;

These error paths don't unlock &init_mm.page_table_lock?

4fcdcc12915c707 Yury Norov        2022-04-28  544  
^1da177e4c3f415 Linus Torvalds    2005-04-16  545  		set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
db64fe02258f150 Nicholas Piggin   2008-10-18  546  		(*nr)++;
^1da177e4c3f415 Linus Torvalds    2005-04-16  547  	} while (pte++, addr += PAGE_SIZE, addr != end);
44562c71e2cfc9e Ryan Roberts      2025-04-22  548  
44562c71e2cfc9e Ryan Roberts      2025-04-22  549  	arch_leave_lazy_mmu_mode();
dac0cc793368851 Alexander Gordeev 2025-06-12  550  	spin_unlock(&init_mm.page_table_lock);
2ba3e6947aed9bb Joerg Roedel      2020-06-01  551  	*mask |= PGTBL_PTE_MODIFIED;
^1da177e4c3f415 Linus Torvalds    2005-04-16 @552  	return 0;
^1da177e4c3f415 Linus Torvalds    2005-04-16  553  }
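
A minimal sketch - not part of the posted patch - of how the error paths
Dan points out above could drop the lock before returning, by funneling
all exits past the unlock (using the same variables as the function above):

	int err = 0;

	spin_lock(&init_mm.page_table_lock);
	arch_enter_lazy_mmu_mode();

	do {
		struct page *page = pages[*nr];

		if (WARN_ON(!pte_none(ptep_get(pte)))) {
			err = -EBUSY;
			break;
		}
		if (WARN_ON(!page)) {
			err = -ENOMEM;
			break;
		}
		if (WARN_ON(!pfn_valid(page_to_pfn(page)))) {
			err = -EINVAL;
			break;
		}
		set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
		(*nr)++;
	} while (pte++, addr += PAGE_SIZE, addr != end);

	arch_leave_lazy_mmu_mode();
	spin_unlock(&init_mm.page_table_lock);	/* always dropped, even on error */
	if (!err)
		*mask |= PGTBL_PTE_MODIFIED;
	return err;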

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode
  2025-06-18 17:32   ` Dan Carpenter
@ 2025-06-19  9:53     ` Uladzislau Rezki
  0 siblings, 0 replies; 12+ messages in thread
From: Uladzislau Rezki @ 2025-06-19  9:53 UTC (permalink / raw)
  To: Dan Carpenter, Alexander Gordeev, Andrew Morton
  Cc: oe-kbuild, Alexander Gordeev, Andrew Morton, lkp, oe-kbuild-all,
	Linux Memory Management List, linux-kernel, sparclinux, xen-devel,
	linuxppc-dev, linux-s390, Hugh Dickins, Nicholas Piggin,
	Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge, Ryan Roberts

On Wed, Jun 18, 2025 at 08:32:28PM +0300, Dan Carpenter wrote:
> Hi Alexander,
> 
> kernel test robot noticed the following build warnings:
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Alexander-Gordeev/mm-Cleanup-apply_to_pte_range-routine/20250613-013835
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/7bd3a45dbc375dc2c15cebae09cb2bb972d6039f.1749747752.git.agordeev%40linux.ibm.com
> patch subject: [PATCH 2/6] mm: Lock kernel page tables before entering lazy MMU mode
> config: x86_64-randconfig-161-20250613 (https://download.01.org/0day-ci/archive/20250613/202506132017.T1l1l6ME-lkp@intel.com/config)
> compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
> | Closes: https://lore.kernel.org/r/202506132017.T1l1l6ME-lkp@intel.com/
> 
> smatch warnings:
> mm/vmalloc.c:552 vmap_pages_pte_range() warn: inconsistent returns 'global &init_mm.page_table_lock'.
> 
> vim +552 mm/vmalloc.c
> 
> 0a264884046f1ab Nicholas Piggin   2021-04-29  517  static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> 2ba3e6947aed9bb Joerg Roedel      2020-06-01  518  		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> 2ba3e6947aed9bb Joerg Roedel      2020-06-01  519  		pgtbl_mod_mask *mask)
> ^1da177e4c3f415 Linus Torvalds    2005-04-16  520  {
> ^1da177e4c3f415 Linus Torvalds    2005-04-16  521  	pte_t *pte;
> ^1da177e4c3f415 Linus Torvalds    2005-04-16  522  
> db64fe02258f150 Nicholas Piggin   2008-10-18  523  	/*
> db64fe02258f150 Nicholas Piggin   2008-10-18  524  	 * nr is a running index into the array which helps higher level
> db64fe02258f150 Nicholas Piggin   2008-10-18  525  	 * callers keep track of where we're up to.
> db64fe02258f150 Nicholas Piggin   2008-10-18  526  	 */
> db64fe02258f150 Nicholas Piggin   2008-10-18  527  
> 2ba3e6947aed9bb Joerg Roedel      2020-06-01  528  	pte = pte_alloc_kernel_track(pmd, addr, mask);
> ^1da177e4c3f415 Linus Torvalds    2005-04-16  529  	if (!pte)
> ^1da177e4c3f415 Linus Torvalds    2005-04-16  530  		return -ENOMEM;
> 44562c71e2cfc9e Ryan Roberts      2025-04-22  531  
> dac0cc793368851 Alexander Gordeev 2025-06-12  532  	spin_lock(&init_mm.page_table_lock);
> 44562c71e2cfc9e Ryan Roberts      2025-04-22  533  	arch_enter_lazy_mmu_mode();
> 44562c71e2cfc9e Ryan Roberts      2025-04-22  534  
> ^1da177e4c3f415 Linus Torvalds    2005-04-16  535  	do {
> db64fe02258f150 Nicholas Piggin   2008-10-18  536  		struct page *page = pages[*nr];
> db64fe02258f150 Nicholas Piggin   2008-10-18  537  
> c33c794828f2121 Ryan Roberts      2023-06-12  538  		if (WARN_ON(!pte_none(ptep_get(pte))))
> db64fe02258f150 Nicholas Piggin   2008-10-18  539  			return -EBUSY;
> db64fe02258f150 Nicholas Piggin   2008-10-18  540  		if (WARN_ON(!page))
> ^1da177e4c3f415 Linus Torvalds    2005-04-16  541  			return -ENOMEM;
> 4fcdcc12915c707 Yury Norov        2022-04-28  542  		if (WARN_ON(!pfn_valid(page_to_pfn(page))))
> 4fcdcc12915c707 Yury Norov        2022-04-28  543  			return -EINVAL;
> 
> These error paths don't unlock &init_mm.page_table_lock?
> 
> 4fcdcc12915c707 Yury Norov        2022-04-28  544  
> ^1da177e4c3f415 Linus Torvalds    2005-04-16  545  		set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
> db64fe02258f150 Nicholas Piggin   2008-10-18  546  		(*nr)++;
> ^1da177e4c3f415 Linus Torvalds    2005-04-16  547  	} while (pte++, addr += PAGE_SIZE, addr != end);
> 44562c71e2cfc9e Ryan Roberts      2025-04-22  548  
> 44562c71e2cfc9e Ryan Roberts      2025-04-22  549  	arch_leave_lazy_mmu_mode();
> dac0cc793368851 Alexander Gordeev 2025-06-12  550  	spin_unlock(&init_mm.page_table_lock);
> 2ba3e6947aed9bb Joerg Roedel      2020-06-01  551  	*mask |= PGTBL_PTE_MODIFIED;
> ^1da177e4c3f415 Linus Torvalds    2005-04-16 @552  	return 0;
> ^1da177e4c3f415 Linus Torvalds    2005-04-16  553  }
> 
> -- 
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
> 
> 
This patch introduces a huge performance degradation when tested with
the test_vmalloc.sh performance tool. We are back to a single, contended
global spinlock where we spend 90% of the cycles:

<snip>
+   91.01%     1.67%  [kernel]          [k] _raw_spin_lock
-   89.29%    89.25%  [kernel]          [k] native_queued_spin_lock_slowpath
     69.82% ret_from_fork_asm
      - ret_from_fork
         - 69.81% kthread
            - 69.66% test_func
               - 26.31% full_fit_alloc_test
                  - 19.11% __vmalloc_node_noprof
                     - __vmalloc_node_range_noprof
                        - 13.73% vmap_small_pages_range_noflush
                             _raw_spin_lock
                             native_queued_spin_lock_slowpath
                        - 5.38% __get_vm_area_node
                             alloc_vmap_area
                             _raw_spin_lock
                             native_queued_spin_lock_slowpath
                  - 13.32% vfree.part.0
                     - 13.31% remove_vm_area
                        - 13.27% __vunmap_range_noflush
                             _raw_spin_lock
                             native_queued_spin_lock_slowpath
               - 25.57% fix_size_alloc_test
                  - 22.59% __vmalloc_node_noprof
                     - __vmalloc_node_range_noprof
                        - 17.34% vmap_small_pages_range_noflush
                             _raw_spin_lock
                             native_queued_spin_lock_slowpath
                        - 5.25% __get_vm_area_node
                             alloc_vmap_area
                             _raw_spin_lock
                             native_queued_spin_lock_slowpath
                  - 11.59% vfree.part.0
                     - remove_vm_area
                        - 11.55% __vunmap_range_noflush
                             _raw_spin_lock
                             native_queued_spin_lock_slowpath
               - 17.78% long_busy_list_alloc_test
                  - 13.90% __vmalloc_node_noprof
                     - __vmalloc_node_range_noprof
                        - 9.95% vmap_small_pages_range_noflush
                             _raw_spin_lock
<snip>

No, we can not take this patch.

--
Uladzislau Rezki

