Linux s390 Architecture development
* [PATCH -next v4 0/4] s390/mm: Batch PTE updates in lazy MMU mode
@ 2026-06-18 14:47 Alexander Gordeev
  2026-06-18 14:47 ` [PATCH -next v4 1/4] mm: Make lazy MMU mode context-aware Alexander Gordeev
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Alexander Gordeev @ 2026-06-18 14:47 UTC
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Hi All!

This is v4 of the batched PTE updates in lazy MMU mode rework.

The presented implementation sets up per-cpu caches in the s390-specific
hotplug callbacks as opposed to CPUHP_BP_PREPARE_DYN hooks. I like this
approach better, since it locates the SMP-related lowcore initialization
in one place. But that is up for discussion.

Changes since v3:
- all prerequisite patches have landed in -next and were removed from the series

Changes since v2:
- lazy_mmu_mode_enable_for_pte_range() renamed to lazy_mmu_mode_enable_with_ptes()
  (David Hildenbrand)
- patch "mm/pgtable: Fix bogus comment to clear_not_present_full_ptes()"
  is dropped (David Hildenbrand)
- direct PTE dereferencing KASAN sanitizer added (Heiko Carstens)
- CONFIG_IPTE_BATCH option is dropped (Heiko Carstens)
- PTE_POISON changed from zero to 0x800 (Heiko Carstens)
- allocate per-cpu caches on CPU hot-plug (Heiko Carstens)
- introduced a lowcore field for fast lazy mode checking (Heiko Carstens)
- a few minor code changes (Heiko Carstens)

Changes since v1:
- lazy_mmu_mode_enable_pte() renamed to lazy_mmu_mode_enable_for_pte_range()
- lazy_mmu_mode_enable_for_pte_range() semantics clarified
- some sashiko comments addressed, including one bug fix [1]
- patches 2-4 added

1. https://sashiko.dev/#/patchset/cover.1774420056.git.agordeev%40linux.ibm.com

This series addresses an s390-specific aspect of how page table entries
are modified. In many cases, changing a valid PTE (for example, setting
or clearing a hardware bit) requires issuing an Invalidate Page Table
Entry (IPTE) instruction beforehand.

A disadvantage of the IPTE instruction is that it may initiate a
machine-wide quiesce state. This state acts as an expensive global
hardware lock and should be avoided whenever possible.

Currently, IPTE is invoked for each individual PTE update in most code
paths. However, the instruction itself supports invalidating multiple
PTEs at once, covering up to 256 entries. Using this capability can
significantly reduce the number of quiesce events, with a positive
impact on overall system performance. At present, this feature is not
utilized.
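
To put rough numbers on this, here is a minimal user-space sketch in C, not
kernel code. It assumes only what is stated above (one IPTE can cover up to
256 PTEs) plus that the range stays within a single page table. The names
IPTE_MAX_ENTRIES, ipte_calls_single(), ipte_calls_batched() and
ipte_calls_saved() are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption taken from the text: one IPTE covers at most 256 PTEs. */
#define IPTE_MAX_ENTRIES 256

/* IPTE invocations when every PTE is invalidated on its own. */
static size_t ipte_calls_single(size_t nr_ptes)
{
	return nr_ptes;
}

/* IPTE invocations when consecutive PTEs are invalidated in batches. */
static size_t ipte_calls_batched(size_t nr_ptes)
{
	return (nr_ptes + IPTE_MAX_ENTRIES - 1) / IPTE_MAX_ENTRIES;
}

/* IPTE invocations avoided by batching a range of nr_ptes entries. */
static size_t ipte_calls_saved(size_t nr_ptes)
{
	size_t batched = ipte_calls_batched(nr_ptes);

	/* Batching never needs more invocations than the per-PTE path. */
	assert(batched <= ipte_calls_single(nr_ptes));
	return ipte_calls_single(nr_ptes) - batched;
}
```

For a full 256-entry range this model gives 256 invocations on the per-PTE
path and one on the batched path. Real savings are smaller whenever the
modified PTEs are not all valid and adjacent.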

An effort was therefore made to identify kernel code paths that update
large numbers of consecutive PTEs. Such updates can be batched and
handled by a single IPTE invocation, leveraging the hardware support
described above.

Natural candidates for this optimization are page-table walkers that
change attributes of memory ranges and thus modify contiguous ranges
of PTEs. Many memory-management system calls enter lazy MMU mode while
updating such ranges.

Lazy MMU mode therefore provides existing infrastructure on which a
software-level deferral mechanism can be built, so that expensive PTE
invalidations on s390 are batched.

Alexander Gordeev (4):
  mm: Make lazy MMU mode context-aware
  s390/mm: Batch PTE updates in lazy MMU mode
  mm/kasan: Introduce helpers for lazy MMU mode sanitizer
  s390/mm: Lazy MMU mode sanitizer

 arch/s390/Kconfig                |   1 +
 arch/s390/include/asm/lazy_mmu.h |   9 +
 arch/s390/include/asm/lowcore.h  |   2 +-
 arch/s390/include/asm/pgtable.h  | 163 +++++++++++--
 arch/s390/kernel/setup.c         |   2 +
 arch/s390/kernel/smp.c           |   7 +
 arch/s390/mm/Makefile            |   2 +-
 arch/s390/mm/lazy_mmu.c          | 401 +++++++++++++++++++++++++++++++
 arch/s390/mm/pgtable.c           |   8 +-
 fs/proc/task_mmu.c               |   2 +-
 include/linux/kasan.h            |  16 ++
 include/linux/pgtable.h          |  46 ++++
 mm/kasan/common.c                |  10 +
 mm/kasan/kasan.h                 |   2 +
 mm/madvise.c                     |   8 +-
 mm/memory.c                      |   8 +-
 mm/mprotect.c                    |   2 +-
 mm/mremap.c                      |   2 +-
 mm/vmalloc.c                     |   6 +-
 19 files changed, 659 insertions(+), 38 deletions(-)
 create mode 100644 arch/s390/include/asm/lazy_mmu.h
 create mode 100644 arch/s390/mm/lazy_mmu.c

-- 
2.53.0



* [PATCH -next v4 1/4] mm: Make lazy MMU mode context-aware
  2026-06-18 14:47 [PATCH -next v4 0/4] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
@ 2026-06-18 14:47 ` Alexander Gordeev
  2026-06-18 14:47 ` [PATCH -next v4 2/4] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Alexander Gordeev @ 2026-06-18 14:47 UTC
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Lazy MMU mode is assumed to be context-independent, in the sense
that it does not need any additional information while operating.
However, the s390 architecture benefits from knowing the exact
page table entries being modified.

Introduce lazy_mmu_mode_enable_with_ptes(), which is provided with
the process address space and the page table being operated on.
This information is required to enable s390-specific optimizations.

The function takes the parameters that are typically passed to
page-table-level walkers, which implies that the span of PTEs never
crosses a page table boundary.

Architectures that do not require such information simply do not
need to define the lazy_mmu_mode_enable_with_ptes() callback.
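
The opt-out described in the last paragraph relies on the usual "#ifndef"
override idiom. The following is a self-contained user-space sketch of that
idiom, not the kernel code itself: kernel types are replaced by opaque
pointers, and the counter plain_enters and the caller example_walker() are
hypothetical, added only so the fallback can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counter, only here so the fallback can be observed. */
static int plain_enters;

/* Stand-in for the context-free architecture hook. */
static void arch_enter_lazy_mmu_mode(void)
{
	plain_enters++;
}

/*
 * An architecture that wants the extra context provides its own function
 * and defines a macro of the same name. Everyone else gets this fallback,
 * which drops the context and calls the plain hook.
 */
#ifndef arch_enter_lazy_mmu_mode_with_ptes
static inline void arch_enter_lazy_mmu_mode_with_ptes(void *mm,
		unsigned long addr, unsigned long end, void *ptep)
{
	(void)mm;
	(void)addr;
	(void)end;
	(void)ptep;
	arch_enter_lazy_mmu_mode();
}
#endif

/* Example caller: a walker that always passes its context along. */
static void example_walker(void)
{
	int before = plain_enters;

	arch_enter_lazy_mmu_mode_with_ptes(NULL, 0UL, 4096UL, NULL);
	assert(plain_enters == before + 1);	/* fallback reached the plain hook */
}
```

Callers can thus always pass the context, and only architectures that
define the override see it.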

Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 fs/proc/task_mmu.c      |  2 +-
 include/linux/pgtable.h | 46 +++++++++++++++++++++++++++++++++++++++++
 mm/madvise.c            |  8 +++----
 mm/memory.c             |  8 +++----
 mm/mprotect.c           |  2 +-
 mm/mremap.c             |  2 +-
 mm/vmalloc.c            |  6 +++---
 7 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d32408f7cd5e..750f6095147f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2842,7 +2842,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 		return 0;
 	}
 
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(vma->vm_mm, start, end, start_pte);
 
 	if ((p->arg.flags & PM_SCAN_WP_MATCHING) && !p->vec_out) {
 		/* Fast path for performing exclusive WP */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2981e386da7b..cc85daf30739 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -271,6 +271,50 @@ static inline void lazy_mmu_mode_enable(void)
 		arch_enter_lazy_mmu_mode();
 }
 
+#ifndef arch_enter_lazy_mmu_mode_with_ptes
+static inline void arch_enter_lazy_mmu_mode_with_ptes(struct mm_struct *mm,
+		unsigned long addr, unsigned long end, pte_t *ptep)
+{
+	arch_enter_lazy_mmu_mode();
+}
+#endif
+
+/**
+ * lazy_mmu_mode_enable_with_ptes() - Enable the lazy MMU mode with a speedup hint.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Start address of the range.
+ * @end: End address of the range.
+ * @ptep: Page table pointer for the first entry.
+ *
+ * Enters a new lazy MMU mode section; if the mode was not already enabled,
+ * enables it and calls arch_enter_lazy_mmu_mode_with_ptes().
+ *
+ * PTEs that fall within the specified range might observe update speedups.
+ * The PTEs must belong to the specified address space and be in the same PMD.
+ *
+ * There are no requirements on the order or range completeness of PTE
+ * updates for the specified range.
+ *
+ * Must be paired with a call to lazy_mmu_mode_disable().
+ *
+ * Has no effect if called:
+ * - While paused - see lazy_mmu_mode_pause()
+ * - In interrupt context
+ */
+static inline void lazy_mmu_mode_enable_with_ptes(struct mm_struct *mm,
+		unsigned long addr, unsigned long end, pte_t *ptep)
+{
+	struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+	if (in_interrupt() || state->pause_count > 0)
+		return;
+
+	VM_WARN_ON_ONCE(state->enable_count == U8_MAX);
+
+	if (state->enable_count++ == 0)
+		arch_enter_lazy_mmu_mode_with_ptes(mm, addr, end, ptep);
+}
+
 /**
  * lazy_mmu_mode_disable() - Disable the lazy MMU mode.
  *
@@ -387,6 +431,8 @@ static inline void lazy_mmu_mode_resume(void)
 }
 #else
 static inline void lazy_mmu_mode_enable(void) {}
+static inline void lazy_mmu_mode_enable_with_ptes(struct mm_struct *mm,
+		unsigned long addr, unsigned long end, pte_t *ptep) {}
 static inline void lazy_mmu_mode_disable(void) {}
 static inline void lazy_mmu_mode_pause(void) {}
 static inline void lazy_mmu_mode_resume(void) {}
diff --git a/mm/madvise.c b/mm/madvise.c
index cd9bb077072c..c14bd5d1828e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -453,7 +453,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	if (!start_pte)
 		return 0;
 	flush_tlb_batched_pending(mm);
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(mm, addr, end, start_pte);
 	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
 		nr = 1;
 		ptent = ptep_get(pte);
@@ -508,7 +508,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 				if (!start_pte)
 					break;
 				flush_tlb_batched_pending(mm);
-				lazy_mmu_mode_enable();
+				lazy_mmu_mode_enable_with_ptes(mm, addr, end, start_pte);
 				if (!err)
 					nr = 0;
 				continue;
@@ -675,7 +675,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	if (!start_pte)
 		return 0;
 	flush_tlb_batched_pending(mm);
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(mm, addr, end, start_pte);
 	for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
 		nr = 1;
 		ptent = ptep_get(pte);
@@ -735,7 +735,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				if (!start_pte)
 					break;
 				flush_tlb_batched_pending(mm);
-				lazy_mmu_mode_enable();
+				lazy_mmu_mode_enable_with_ptes(mm, addr, end, pte);
 				if (!err)
 					nr = 0;
 				continue;
diff --git a/mm/memory.c b/mm/memory.c
index ff338c2abe92..ee1770ff4a64 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1272,7 +1272,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 	orig_src_pte = src_pte;
 	orig_dst_pte = dst_pte;
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(src_mm, addr, end, src_pte);
 
 	do {
 		nr = 1;
@@ -1922,7 +1922,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		return addr;
 
 	flush_tlb_batched_pending(mm);
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(mm, addr, end, start_pte);
 	do {
 		bool any_skipped = false;
 
@@ -2919,7 +2919,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(mm, addr, end, mapped_pte);
 	do {
 		BUG_ON(!pte_none(ptep_get(pte)));
 		if (!pfn_modify_allowed(pfn, prot)) {
@@ -3330,7 +3330,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			return -EINVAL;
 	}
 
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(mm, addr, end, mapped_pte);
 
 	if (fn) {
 		do {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9cbf932b028c..3fc26418e837 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -337,7 +337,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 		is_private_single_threaded = vma_is_single_threaded_private(vma);
 
 	flush_tlb_batched_pending(vma->vm_mm);
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(vma->vm_mm, addr, end, pte);
 	do {
 		nr_ptes = 1;
 		oldpte = ptep_get(pte);
diff --git a/mm/mremap.c b/mm/mremap.c
index e9c8b1d05832..0dfe3de39ccc 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -260,7 +260,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	if (new_ptl != old_ptl)
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
 	flush_tlb_batched_pending(vma->vm_mm);
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(mm, old_addr, old_end, old_ptep);
 
 	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
 		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1afca3568b9b..b5ed2b05771f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -108,7 +108,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	if (!pte)
 		return -ENOMEM;
 
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(&init_mm, addr, end, pte);
 
 	do {
 		if (unlikely(!pte_none(ptep_get(pte)))) {
@@ -371,7 +371,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	unsigned long size = PAGE_SIZE;
 
 	pte = pte_offset_kernel(pmd, addr);
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(&init_mm, addr, end, pte);
 
 	do {
 #ifdef CONFIG_HUGETLB_PAGE
@@ -538,7 +538,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	if (!pte)
 		return -ENOMEM;
 
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(&init_mm, addr, end, pte);
 
 	do {
 		struct page *page = pages[*nr];
-- 
2.53.0



* [PATCH -next v4 2/4] s390/mm: Batch PTE updates in lazy MMU mode
  2026-06-18 14:47 [PATCH -next v4 0/4] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
  2026-06-18 14:47 ` [PATCH -next v4 1/4] mm: Make lazy MMU mode context-aware Alexander Gordeev
@ 2026-06-18 14:47 ` Alexander Gordeev
  2026-06-18 15:10   ` sashiko-bot
  2026-06-18 14:47 ` [PATCH -next v4 3/4] mm/kasan: Introduce helpers for lazy MMU mode sanitizer Alexander Gordeev
  2026-06-18 14:47 ` [PATCH -next v4 4/4] s390/mm: Lazy " Alexander Gordeev
  3 siblings, 1 reply; 7+ messages in thread
From: Alexander Gordeev @ 2026-06-18 14:47 UTC
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Make use of the IPTE instruction's "Additional Entries" field to
invalidate multiple PTEs in one go while in lazy MMU mode. This
is the mode in which many memory-management system calls (like
mremap(), mprotect(), etc.) update memory attributes.

To achieve that, the set_pte() and ptep_get() primitives use a
per-CPU cache to store and retrieve PTE values and apply the
cached values to the real page table once lazy MMU mode is left.
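
The store/retrieve/apply cycle can be modelled in user space roughly as
follows. This is only a sketch under simplifying assumptions: a single
cache instead of one per CPU, plain 64-bit integers instead of pte_t, and
no invalidation step at all. The names pte_cache, cache_init(),
cache_enter(), cache_set(), cache_get() and cache_leave() are hypothetical.
The poison value 0x800 is the one mentioned in the changelog of this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_ENTRIES 256			/* hypothetical table size */
#define POISON     UINT64_C(0x800)	/* marks "slot holds no cached value" */

struct pte_cache {
	uint64_t *table;		/* the real table being shadowed */
	uint64_t cache[NR_ENTRIES];	/* pending values, POISON if none */
	bool active;			/* true between enter and leave */
};

static void cache_init(struct pte_cache *c, uint64_t *table)
{
	c->table = table;
	c->active = false;
	for (size_t i = 0; i < NR_ENTRIES; i++)
		c->cache[i] = POISON;
}

static void cache_enter(struct pte_cache *c)
{
	c->active = true;
}

/* Store to the cache while active, otherwise straight to the table. */
static void cache_set(struct pte_cache *c, size_t idx, uint64_t val)
{
	assert(idx < NR_ENTRIES && val != POISON);
	if (c->active)
		c->cache[idx] = val;
	else
		c->table[idx] = val;
}

/* A pending cached value wins over the table content. */
static uint64_t cache_get(const struct pte_cache *c, size_t idx)
{
	assert(idx < NR_ENTRIES);
	if (c->active && c->cache[idx] != POISON)
		return c->cache[idx];
	return c->table[idx];
}

/* Write all pending values back and return to direct mode. */
static void cache_leave(struct pte_cache *c)
{
	for (size_t i = 0; i < NR_ENTRIES; i++) {
		if (c->cache[i] == POISON)
			continue;
		c->table[i] = c->cache[i];
		c->cache[i] = POISON;
	}
	c->active = false;
}
```

While the cache is active, readers that go through cache_get() see the
pending value even though the table itself is still unchanged. That is why
both the setter and the getter have to be routed through the cache.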

The same is done for memory-management platform callbacks that
would otherwise cause intense per-PTE IPTE traffic, reducing the
number of IPTE instructions from up to PTRS_PER_PTE to a single
instruction in the best case. The average reduction is of course
smaller.
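
How far the count drops depends on how the modified entries are laid out.
The sketch below estimates it under the assumption that one range
invalidation is needed per maximal run of adjacent entries that are both
modified and currently valid. The function name
count_range_invalidations() and the two flag arrays are hypothetical, and
the model ignores any splitting the hardware itself may impose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Count the range invalidations needed for one page table, given per-entry
 * flags: modified[i] says the entry has a pending update, valid[i] says the
 * entry is currently valid and therefore needs invalidating first.
 */
static size_t count_range_invalidations(const bool *modified,
					const bool *valid, size_t n)
{
	size_t calls = 0;
	bool in_run = false;

	assert(modified && valid);
	for (size_t i = 0; i < n; i++) {
		bool needs_ipte = modified[i] && valid[i];

		/* A new run starts wherever the previous entry needed none. */
		if (needs_ipte && !in_run)
			calls++;
		in_run = needs_ipte;
	}
	return calls;
}
```

With all 256 entries modified and valid the function returns 1, the best
case named above. With only every other entry modified it returns 128,
which illustrates why the average reduction is smaller.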

Since all existing page table iterators called in lazy MMU mode
handle one table at a time, the per-CPU cache does not need to be
larger than PTRS_PER_PTE entries. That also naturally aligns with
the IPTE instruction, which must not cross a page table boundary.

Before this change, the system calls did:

	lazy_mmu_mode_enable_with_ptes()
	...
	<update PTEs>		// up to PTRS_PER_PTE single-IPTEs
	...
	lazy_mmu_mode_disable()

With this change, the system calls do:

	lazy_mmu_mode_enable_with_ptes()
	...
	<store new PTE values in the per-CPU cache>
	...
	lazy_mmu_mode_disable()	// apply cache with one multi-IPTE

When applied to large memory ranges, some system calls show
significant speedups:

    mprotect()    ~15x
    munmap()      ~3x
    mremap()      ~28x

At the same time, fork() shows a measurable slowdown of ~1.5x.

The overall results depend on memory size and access patterns,
but apart from fork() the change generally does not degrade
performance.

In addition to a process-wide impact, the rework affects the
whole Central Electronics Complex (CEC). Each (global) IPTE
instruction initiates a quiesce state in a CEC, so reducing
the number of IPTE calls relieves CEC-wide quiesce traffic.

In an extreme case of mprotect() continuously triggering the
quiesce state on four LPARs in parallel, measurements show
~25x fewer quiesce events.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 arch/s390/Kconfig                |   1 +
 arch/s390/include/asm/lazy_mmu.h |   9 +
 arch/s390/include/asm/lowcore.h  |   2 +-
 arch/s390/include/asm/pgtable.h  | 157 +++++++++++--
 arch/s390/kernel/setup.c         |   2 +
 arch/s390/kernel/smp.c           |   7 +
 arch/s390/mm/Makefile            |   2 +-
 arch/s390/mm/lazy_mmu.c          | 382 +++++++++++++++++++++++++++++++
 arch/s390/mm/pgtable.c           |   8 +-
 9 files changed, 546 insertions(+), 24 deletions(-)
 create mode 100644 arch/s390/include/asm/lazy_mmu.h
 create mode 100644 arch/s390/mm/lazy_mmu.c

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 84404e6778d5..7846332dcd0a 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -97,6 +97,7 @@ config S390
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_HAS_KCOV
+	select ARCH_HAS_LAZY_MMU_MODE
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_MEM_ENCRYPT
 	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
diff --git a/arch/s390/include/asm/lazy_mmu.h b/arch/s390/include/asm/lazy_mmu.h
new file mode 100644
index 000000000000..98366e9de9bc
--- /dev/null
+++ b/arch/s390/include/asm/lazy_mmu.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LAZY_MMU_H
+#define __LAZY_MMU_H
+
+void lazy_mmu_online_boot_cpu(void);
+int lazy_mmu_online_cpu(gfp_t gfp, unsigned int cpu);
+void lazy_mmu_offline_cpu(unsigned int cpu);
+
+#endif /* __LAZY_MMU_H */
diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
index 3b3ecc647993..dba236664da9 100644
--- a/arch/s390/include/asm/lowcore.h
+++ b/arch/s390/include/asm/lowcore.h
@@ -163,7 +163,7 @@ struct lowcore {
 	__s32	preempt_count;			/* 0x03a8 */
 	__u32	spinlock_lockval;		/* 0x03ac */
 	__u32	spinlock_index;			/* 0x03b0 */
-	__u8	pad_0x03b4[0x03b8-0x03b4];	/* 0x03b4 */
+	__s32	lazy_mmu_count;			/* 0x03b4 */
 	__u64	percpu_offset;			/* 0x03b8 */
 	__u8	percpu_register;		/* 0x03c0 */
 	__u8	pad_0x03c1[0x0400-0x03c1];	/* 0x03c1 */
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index f9a8a92fa160..2b6659d61fa5 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -39,6 +39,64 @@ enum {
 
 extern atomic_long_t direct_pages_count[PG_DIRECT_MAP_MAX];
 
+bool __lazy_mmu_ptep_test_and_clear_young(unsigned long addr, pte_t *ptep, int *res);
+bool __lazy_mmu_ptep_get_and_clear(unsigned long addr, pte_t *ptep, pte_t *res);
+bool __lazy_mmu_ptep_modify_prot_start(unsigned long addr, pte_t *ptep, pte_t *res);
+bool __lazy_mmu_ptep_modify_prot_commit(unsigned long addr, pte_t *ptep, pte_t old_pte, pte_t pte);
+bool __lazy_mmu_ptep_set_wrprotect(unsigned long addr, pte_t *ptep);
+bool __lazy_mmu_set_pte(pte_t *ptep, pte_t pte);
+bool __lazy_mmu_ptep_get(pte_t *ptep, pte_t *res);
+
+static __always_inline bool is_lazy_mmu_active(void)
+{
+	if (__is_defined(__DECOMPRESSOR))
+		return false;
+	if (!get_lowcore()->lazy_mmu_count)
+		return false;
+	return true;
+}
+
+static inline
+bool lazy_mmu_ptep_test_and_clear_young(unsigned long addr, pte_t *ptep, int *res)
+{
+	if (!is_lazy_mmu_active())
+		return false;
+	return __lazy_mmu_ptep_test_and_clear_young(addr, ptep, res);
+}
+
+static inline
+bool lazy_mmu_ptep_get_and_clear(unsigned long addr, pte_t *ptep, pte_t *res)
+{
+	if (!is_lazy_mmu_active())
+		return false;
+	return __lazy_mmu_ptep_get_and_clear(addr, ptep, res);
+}
+
+static inline
+bool lazy_mmu_ptep_modify_prot_start(unsigned long addr, pte_t *ptep, pte_t *res)
+{
+	if (!is_lazy_mmu_active())
+		return false;
+	return __lazy_mmu_ptep_modify_prot_start(addr, ptep, res);
+}
+
+static inline
+bool lazy_mmu_ptep_modify_prot_commit(unsigned long addr, pte_t *ptep,
+				      pte_t old_pte, pte_t pte)
+{
+	if (!is_lazy_mmu_active())
+		return false;
+	return __lazy_mmu_ptep_modify_prot_commit(addr, ptep, old_pte, pte);
+}
+
+static inline
+bool lazy_mmu_ptep_set_wrprotect(unsigned long addr, pte_t *ptep)
+{
+	if (!is_lazy_mmu_active())
+		return false;
+	return __lazy_mmu_ptep_set_wrprotect(addr, ptep);
+}
+
 static inline void update_page_count(int level, long count)
 {
 	if (IS_ENABLED(CONFIG_PROC_FS))
@@ -978,15 +1036,30 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 	WRITE_ONCE(*pmdp, pmd);
 }
 
-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte(pte_t *ptep, pte_t pte)
 {
 	WRITE_ONCE(*ptep, pte);
 }
 
+static inline void set_pte(pte_t *ptep, pte_t pte)
+{
+	if (!is_lazy_mmu_active() || !__lazy_mmu_set_pte(ptep, pte))
+		__set_pte(ptep, pte);
+}
+
+static inline pte_t __ptep_get(pte_t *ptep)
+{
+	return READ_ONCE(*ptep);
+}
+
 #define ptep_get ptep_get
 static inline pte_t ptep_get(pte_t *ptep)
 {
-	return READ_ONCE(*ptep);
+	pte_t res;
+
+	if (!is_lazy_mmu_active() || !__lazy_mmu_ptep_get(ptep, &res))
+		res = __ptep_get(ptep);
+	return res;
 }
 
 #define pmdp_get pmdp_get
@@ -1179,6 +1252,15 @@ static __always_inline void __ptep_ipte_range(unsigned long address, int nr,
 	} while (nr != 255);
 }
 
+void arch_enter_lazy_mmu_mode_with_ptes(struct mm_struct *mm,
+					unsigned long addr, unsigned long end,
+					pte_t *pte);
+#define arch_enter_lazy_mmu_mode_with_ptes arch_enter_lazy_mmu_mode_with_ptes
+
+void arch_enter_lazy_mmu_mode(void);
+void arch_leave_lazy_mmu_mode(void);
+void arch_flush_lazy_mmu_mode(void);
+
 /*
  * This is hard to understand. ptep_get_and_clear and ptep_clear_flush
  * both clear the TLB for the unmapped pte. The reason is that
@@ -1199,10 +1281,16 @@ pte_t ptep_xchg_lazy(struct mm_struct *, unsigned long, pte_t *, pte_t);
 static inline bool ptep_test_and_clear_young(struct vm_area_struct *vma,
 		unsigned long addr, pte_t *ptep)
 {
-	pte_t pte = ptep_get(ptep);
+	pte_t pte;
+	int res;
 
-	pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
-	return pte_young(pte);
+	if (!lazy_mmu_ptep_test_and_clear_young(addr, ptep, &res)) {
+		pte = __ptep_get(ptep);
+		pte = pte_mkold(pte);
+		pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte);
+		res = pte_young(pte);
+	}
+	return res;
 }
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -1218,7 +1306,8 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 {
 	pte_t res;
 
-	res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+	if (!lazy_mmu_ptep_get_and_clear(addr, ptep, &res))
+		res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
 	page_table_check_pte_clear(mm, addr, res);
 	/* At this point the reference through the mapping is still present */
 	if (mm_is_protected(mm) && pte_present(res))
@@ -1227,9 +1316,34 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 }
 
 #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
-pte_t ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
-void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
-			     pte_t *, pte_t, pte_t);
+pte_t ___ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
+void ___ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
+				pte_t *, pte_t, pte_t);
+
+static inline
+pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
+			     unsigned long addr, pte_t *ptep)
+{
+	pte_t res;
+
+	if (!lazy_mmu_ptep_modify_prot_start(addr, ptep, &res))
+		res = ___ptep_modify_prot_start(vma, addr, ptep);
+	return res;
+}
+
+static inline
+void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t *ptep, pte_t old_pte, pte_t pte)
+{
+	if (!lazy_mmu_ptep_modify_prot_commit(addr, ptep, old_pte, pte))
+		___ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
+}
+
+bool ipte_range_ptep_modify_prot_start(struct vm_area_struct *vma,
+				       unsigned long addr, pte_t *ptep, pte_t *res);
+bool ipte_range_ptep_modify_prot_commit(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep,
+					pte_t old_pte, pte_t pte);
 
 #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
 static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
@@ -1259,11 +1373,13 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 {
 	pte_t res;
 
-	if (full) {
-		res = ptep_get(ptep);
-		set_pte(ptep, __pte(_PAGE_INVALID));
-	} else {
-		res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+	if (!lazy_mmu_ptep_get_and_clear(addr, ptep, &res)) {
+		if (full) {
+			res = __ptep_get(ptep);
+			__set_pte(ptep, __pte(_PAGE_INVALID));
+		} else {
+			res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+		}
 	}
 	page_table_check_pte_clear(mm, addr, res);
 	/* At this point the reference through the mapping is still present */
@@ -1289,10 +1405,15 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
-	pte_t pte = ptep_get(ptep);
+	pte_t pte;
 
-	if (pte_write(pte))
-		ptep_xchg_lazy(mm, addr, ptep, pte_wrprotect(pte));
+	if (!lazy_mmu_ptep_set_wrprotect(addr, ptep)) {
+		pte = __ptep_get(ptep);
+		if (pte_write(pte)) {
+			pte = pte_wrprotect(pte);
+			ptep_xchg_lazy(mm, addr, ptep, pte);
+		}
+	}
 }
 
 /*
@@ -1325,7 +1446,7 @@ static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
 	 * PTE does not have _PAGE_PROTECT set, to avoid unnecessary overhead.
 	 * A local RDP can be used to do the flush.
 	 */
-	if (cpu_has_rdp() && !(pte_val(ptep_get(ptep)) & _PAGE_PROTECT))
+	if (cpu_has_rdp() && !(pte_val(__ptep_get(ptep)) & _PAGE_PROTECT))
 		__ptep_rdp(address, ptep, 1);
 }
 #define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index b60284328fe3..f5a3c9e1b6b8 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -77,6 +77,7 @@
 #include <asm/maccess.h>
 #include <asm/uv.h>
 #include <asm/asm-offsets.h>
+#include <asm/lazy_mmu.h>
 #include "entry.h"
 
 /*
@@ -1012,5 +1013,6 @@ void __init setup_arch(char **cmdline_p)
 
 void __init arch_cpu_finalize_init(void)
 {
+	lazy_mmu_online_boot_cpu();
 	sclp_init();
 }
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 0ba7f89b8161..0a826bbaf1dd 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -59,6 +59,7 @@
 #include <asm/topology.h>
 #include <asm/vdso.h>
 #include <asm/maccess.h>
+#include <asm/lazy_mmu.h>
 #include "entry.h"
 
 enum {
@@ -866,6 +867,11 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle)
 	rc = pcpu_alloc_lowcore(pcpu, cpu);
 	if (rc)
 		return rc;
+	rc = lazy_mmu_online_cpu(GFP_KERNEL, cpu);
+	if (rc) {
+		pcpu_free_lowcore(pcpu, cpu);
+		return rc;
+	}
 	/*
 	 * Make sure global control register contents do not change
 	 * until new CPU has initialized control registers.
@@ -921,6 +927,7 @@ void __cpu_die(unsigned int cpu)
 	pcpu = per_cpu_ptr(&pcpu_devices, cpu);
 	while (!pcpu_stopped(pcpu))
 		cpu_relax();
+	lazy_mmu_offline_cpu(cpu);
 	pcpu_free_lowcore(pcpu, cpu);
 	cpumask_clear_cpu(cpu, mm_cpumask(&init_mm));
 	cpumask_clear_cpu(cpu, &init_mm.context.cpu_attach_mask);
diff --git a/arch/s390/mm/Makefile b/arch/s390/mm/Makefile
index 193899c39ca7..26e9fc11543a 100644
--- a/arch/s390/mm/Makefile
+++ b/arch/s390/mm/Makefile
@@ -3,7 +3,7 @@
 # Makefile for the linux s390-specific parts of the memory manager.
 #
 
-obj-y		:= init.o fault.o extmem.o mmap.o vmem.o maccess.o
+obj-y		:= init.o fault.o extmem.o mmap.o vmem.o maccess.o lazy_mmu.o
 obj-y		+= page-states.o pageattr.o pgtable.o pgalloc.o extable.o
 
 obj-$(CONFIG_CMM)		+= cmm.o
diff --git a/arch/s390/mm/lazy_mmu.c b/arch/s390/mm/lazy_mmu.c
new file mode 100644
index 000000000000..d75b93d9b0de
--- /dev/null
+++ b/arch/s390/mm/lazy_mmu.c
@@ -0,0 +1,382 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/pgtable.h>
+#include <linux/kasan.h>
+#include <linux/slab.h>
+#include <asm/facility.h>
+#include <asm/lazy_mmu.h>
+#include <kunit/visibility.h>
+
+#define PTE_POISON	_PAGE_LARGE
+
+struct ipte_range {
+	struct mm_struct *mm;
+	unsigned long base_addr;
+	unsigned long base_end;
+	pte_t *base_pte;
+	pte_t *start_pte;
+	pte_t *end_pte;
+	pte_t cache[PTRS_PER_PTE];
+};
+
+static DEFINE_PER_CPU(struct ipte_range *, ipte_range);
+
+static int count_contiguous(pte_t *start, pte_t *end, bool *valid)
+{
+	unsigned long page_invalid_bit;
+	pte_t *ptep, pte;
+
+	pte = __ptep_get(start);
+	page_invalid_bit = pte_val(pte) & _PAGE_INVALID;
+
+	for (ptep = start + 1; ptep < end; ptep++) {
+		pte = __ptep_get(ptep);
+		if ((pte_val(pte) & _PAGE_INVALID) != page_invalid_bit)
+			break;
+	}
+
+	*valid = !(page_invalid_bit);
+	return ptep - start;
+}
+
+static void __invalidate_pte_range(struct mm_struct *mm, unsigned long addr,
+				   int nr_ptes, pte_t *ptep)
+{
+	atomic_inc(&mm->context.flush_count);
+	if (cpu_has_tlb_lc() && cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id())))
+		__ptep_ipte_range(addr, nr_ptes - 1, ptep, IPTE_LOCAL);
+	else
+		__ptep_ipte_range(addr, nr_ptes - 1, ptep, IPTE_GLOBAL);
+	atomic_dec(&mm->context.flush_count);
+}
+
+static int invalidate_pte_range(struct mm_struct *mm, unsigned long addr,
+				pte_t *start, pte_t *end)
+{
+	int nr_ptes;
+	bool valid;
+
+	nr_ptes = count_contiguous(start, end, &valid);
+	if (valid)
+		__invalidate_pte_range(mm, addr, nr_ptes, start);
+
+	return nr_ptes;
+}
+
+static void set_pte_range(struct mm_struct *mm, unsigned long addr,
+			  pte_t *ptep, pte_t *end, pte_t *cache)
+{
+	int i, nr_ptes;
+
+	while (ptep < end) {
+		nr_ptes = invalidate_pte_range(mm, addr, ptep, end);
+
+		for (i = 0; i < nr_ptes; i++, ptep++, cache++) {
+			__set_pte(ptep, *cache);
+			*cache = __pte(PTE_POISON);
+		}
+
+		addr += nr_ptes * PAGE_SIZE;
+	}
+}
+
+static void enter_ipte_norange(void)
+{
+	struct ipte_range __maybe_unused *range;
+
+	if (!test_facility(13))
+		return;
+
+	range = get_cpu_var(ipte_range);
+	get_lowcore()->lazy_mmu_count++;
+}
+
+static void enter_ipte_range(struct mm_struct *mm,
+			     unsigned long addr, unsigned long end, pte_t *pte)
+{
+	struct ipte_range *range;
+
+	if (!test_facility(13))
+		return;
+
+	range = get_cpu_var(ipte_range);
+	get_lowcore()->lazy_mmu_count++;
+
+	range->mm = mm;
+	range->base_addr = addr;
+	range->base_end = end;
+	range->base_pte = pte;
+}
+
+static void leave_ipte_range(void)
+{
+	pte_t *ptep, *start, *start_cache, *cache;
+	unsigned long start_addr, addr;
+	struct ipte_range *range;
+	int start_idx;
+
+	if (!test_facility(13))
+		return;
+
+	lockdep_assert_preemption_disabled();
+	range = this_cpu_read(ipte_range);
+	if (!range->mm)
+		goto norange;
+	if (!range->start_pte)
+		goto done;
+
+	start = range->start_pte;
+	start_idx = range->start_pte - range->base_pte;
+	start_addr = range->base_addr + start_idx * PAGE_SIZE;
+	addr = start_addr;
+	start_cache = &range->cache[start_idx];
+	cache = start_cache;
+	for (ptep = start; ptep < range->end_pte; ptep++, cache++, addr += PAGE_SIZE) {
+		if (pte_val(*cache) == PTE_POISON) {
+			if (start) {
+				set_pte_range(range->mm, start_addr, start, ptep, start_cache);
+				start = NULL;
+			}
+		} else if (!start) {
+			start = ptep;
+			start_addr = addr;
+			start_cache = cache;
+		}
+	}
+	set_pte_range(range->mm, start_addr, start, ptep, start_cache);
+
+	range->start_pte = NULL;
+	range->end_pte = NULL;
+
+done:
+	range->mm = NULL;
+	range->base_addr = 0;
+	range->base_end = 0;
+	range->base_pte = NULL;
+
+norange:
+	get_lowcore()->lazy_mmu_count--;
+	put_cpu_var(ipte_range);
+}
+
+static void flush_lazy_mmu_mode(void)
+{
+	unsigned long addr, end;
+	struct ipte_range *range;
+	struct mm_struct *mm;
+	pte_t *pte;
+
+	if (!test_facility(13))
+		return;
+
+	range = get_cpu_var(ipte_range);
+	if (range->mm) {
+		mm = range->mm;
+		addr = range->base_addr;
+		end = range->base_end;
+		pte = range->base_pte;
+
+		leave_ipte_range();
+		enter_ipte_range(mm, addr, end, pte);
+	}
+	put_cpu_var(ipte_range);
+}
+
+void arch_enter_lazy_mmu_mode(void)
+{
+	enter_ipte_norange();
+}
+EXPORT_SYMBOL_IF_KUNIT(arch_enter_lazy_mmu_mode);
+
+void arch_enter_lazy_mmu_mode_with_ptes(struct mm_struct *mm,
+					unsigned long addr, unsigned long end,
+					pte_t *pte)
+{
+	enter_ipte_range(mm, addr, end, pte);
+}
+EXPORT_SYMBOL_IF_KUNIT(arch_enter_lazy_mmu_mode_with_ptes);
+
+void arch_leave_lazy_mmu_mode(void)
+{
+	leave_ipte_range();
+}
+EXPORT_SYMBOL_IF_KUNIT(arch_leave_lazy_mmu_mode);
+
+void arch_flush_lazy_mmu_mode(void)
+{
+	flush_lazy_mmu_mode();
+}
+EXPORT_SYMBOL_IF_KUNIT(arch_flush_lazy_mmu_mode);
+
+static void __ipte_range_set_pte(struct ipte_range *range, pte_t *ptep, pte_t pte)
+{
+	unsigned int idx = ptep - range->base_pte;
+
+	lockdep_assert_preemption_disabled();
+	range->cache[idx] = pte;
+
+	if (!range->start_pte) {
+		range->start_pte = ptep;
+		range->end_pte = ptep + 1;
+	} else if (ptep < range->start_pte) {
+		range->start_pte = ptep;
+	} else if (ptep + 1 > range->end_pte) {
+		range->end_pte = ptep + 1;
+	}
+}
+
+static pte_t __ipte_range_ptep_get(struct ipte_range *range, pte_t *ptep)
+{
+	unsigned int idx = ptep - range->base_pte;
+
+	lockdep_assert_preemption_disabled();
+	if (pte_val(range->cache[idx]) == PTE_POISON)
+		return __ptep_get(ptep);
+	return range->cache[idx];
+}
+
+static struct ipte_range *this_ipte_range(pte_t *ptep)
+{
+	struct ipte_range *range;
+	unsigned int nr_ptes;
+
+	range = this_cpu_read(ipte_range);
+	if (ptep < range->base_pte)
+		return NULL;
+	nr_ptes = (range->base_end - range->base_addr) / PAGE_SIZE;
+	if (ptep >= range->base_pte + nr_ptes)
+		return NULL;
+
+	return range;
+}
+
+bool __lazy_mmu_set_pte(pte_t *ptep, pte_t pte)
+{
+	struct ipte_range *range;
+
+	range = this_ipte_range(ptep);
+	if (!range)
+		return false;
+
+	__ipte_range_set_pte(range, ptep, pte);
+
+	return true;
+}
+
+bool __lazy_mmu_ptep_get(pte_t *ptep, pte_t *res)
+{
+	struct ipte_range *range;
+
+	range = this_ipte_range(ptep);
+	if (!range)
+		return false;
+
+	*res = __ipte_range_ptep_get(range, ptep);
+
+	return true;
+}
+
+bool __lazy_mmu_ptep_test_and_clear_young(unsigned long addr, pte_t *ptep, int *res)
+{
+	struct ipte_range *range;
+	pte_t pte, old;
+
+	range = this_ipte_range(ptep);
+	if (!range)
+		return false;
+
+	old = __ipte_range_ptep_get(range, ptep);
+	pte = pte_mkold(old);
+	__ipte_range_set_pte(range, ptep, pte);
+	*res = pte_young(old);
+
+	return true;
+}
+
+bool __lazy_mmu_ptep_get_and_clear(unsigned long addr, pte_t *ptep, pte_t *res)
+{
+	struct ipte_range *range;
+	pte_t pte, old;
+
+	range = this_ipte_range(ptep);
+	if (!range)
+		return false;
+
+	old = __ipte_range_ptep_get(range, ptep);
+	pte = __pte(_PAGE_INVALID);
+	__ipte_range_set_pte(range, ptep, pte);
+	*res = old;
+
+	return true;
+}
+
+bool __lazy_mmu_ptep_modify_prot_start(unsigned long addr, pte_t *ptep, pte_t *res)
+{
+	return __lazy_mmu_ptep_get_and_clear(addr, ptep, res);
+}
+
+bool __lazy_mmu_ptep_modify_prot_commit(unsigned long addr, pte_t *ptep,
+					pte_t old_pte, pte_t pte)
+{
+	struct ipte_range *range;
+
+	range = this_ipte_range(ptep);
+	if (!range)
+		return false;
+
+	__ipte_range_set_pte(range, ptep, pte);
+
+	return true;
+}
+
+bool __lazy_mmu_ptep_set_wrprotect(unsigned long addr, pte_t *ptep)
+{
+	struct ipte_range *range;
+	pte_t pte;
+
+	range = this_ipte_range(ptep);
+	if (!range)
+		return false;
+
+	pte = __ipte_range_ptep_get(range, ptep);
+	if (pte_write(pte)) {
+		pte = pte_wrprotect(pte);
+		__ipte_range_set_pte(range, ptep, pte);
+	}
+
+	return true;
+}
+
+int lazy_mmu_online_cpu(gfp_t gfp, unsigned int cpu)
+{
+	struct ipte_range *range;
+	int i;
+
+	if (!test_facility(13))
+		return 0;
+
+	range = kzalloc_obj(*range, gfp);
+	if (!range)
+		return -ENOMEM;
+	for (i = 0; i < ARRAY_SIZE(range->cache); i++)
+		range->cache[i] = __pte(PTE_POISON);
+	per_cpu(ipte_range, cpu) = range;
+
+	return 0;
+}
+
+void lazy_mmu_offline_cpu(unsigned int cpu)
+{
+	struct ipte_range *range;
+
+	if (!test_facility(13))
+		return;
+
+	range = per_cpu(ipte_range, cpu);
+	per_cpu(ipte_range, cpu) = NULL;
+	kfree(range);
+}
+
+void __init lazy_mmu_online_boot_cpu(void)
+{
+	lazy_mmu_online_cpu(GFP_ATOMIC, 0);
+}
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 4acd8b140c4b..df36523bcbbb 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -166,14 +166,14 @@ pte_t ptep_xchg_lazy(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL(ptep_xchg_lazy);
 
-pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t *ptep)
+pte_t ___ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
+				pte_t *ptep)
 {
 	return ptep_flush_lazy(vma->vm_mm, addr, ptep, 1);
 }
 
-void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t *ptep, pte_t old_pte, pte_t pte)
+void ___ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
+				pte_t *ptep, pte_t old_pte, pte_t pte)
 {
 	set_pte(ptep, pte);
 }
-- 
2.53.0



* [PATCH -next v4 3/4] mm/kasan: Introduce helpers for lazy MMU mode sanitizer
  2026-06-18 14:47 [PATCH -next v4 0/4] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
  2026-06-18 14:47 ` [PATCH -next v4 1/4] mm: Make lazy MMU mode context-aware Alexander Gordeev
  2026-06-18 14:47 ` [PATCH -next v4 2/4] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
@ 2026-06-18 14:47 ` Alexander Gordeev
  2026-06-18 15:00   ` sashiko-bot
  2026-06-18 14:47 ` [PATCH -next v4 4/4] s390/mm: Lazy " Alexander Gordeev
  3 siblings, 1 reply; 7+ messages in thread
From: Alexander Gordeev @ 2026-06-18 14:47 UTC (permalink / raw)
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Provide helpers that allow architectures to detect
illegitimate direct PTE accesses while lazy MMU
mode is enabled, such as:

	pte_t pte = *ptep;
	*ptep = pte;

Such accesses should instead go through the accessors:

	pte_t pte = ptep_get(ptep);
	set_pte(ptep, pte);

Direct PTE accesses pose a real issue on s390, where lazy
MMU mode defers PTE updates to a per-CPU cache.
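
The idea can be illustrated with a small userspace model. This is only a
sketch under stated assumptions: one boolean flag per PTE stands in for
KASAN shadow memory, and every toy_* name is hypothetical and not part of
this series. A direct load from a poisoned entry is counted as a report,
while an accessor lifts the poison around its own access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_PTES 8

typedef uint64_t pte_t;

/* Toy model: one flag per PTE instead of KASAN's real shadow memory. */
static pte_t table[NR_PTES];
static bool poisoned[NR_PTES];
static unsigned int reports;

/* Mark nr entries starting at idx as off-limits for direct access. */
static void toy_poison_pte(size_t idx, size_t nr)
{
	assert(idx + nr <= NR_PTES);
	for (size_t i = idx; i < idx + nr; i++)
		poisoned[i] = true;
}

/* Make nr entries starting at idx directly accessible again. */
static void toy_unpoison_pte(size_t idx, size_t nr)
{
	assert(idx + nr <= NR_PTES);
	for (size_t i = idx; i < idx + nr; i++)
		poisoned[i] = false;
}

/* Stand-in for an instrumented load such as "pte = *ptep". */
static pte_t toy_direct_read(size_t idx)
{
	assert(idx < NR_PTES);
	if (poisoned[idx])
		reports++;
	return table[idx];
}

/* Stand-in for an accessor that lifts the poison around its own load. */
static pte_t toy_ptep_get(size_t idx)
{
	pte_t pte;

	toy_unpoison_pte(idx, 1);
	pte = toy_direct_read(idx);
	toy_poison_pte(idx, 1);

	return pte;
}
```

In this model only the direct read of a poisoned entry bumps the report
counter. Reads through the accessor, or reads after the range has been
unpoisoned, do not.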

Suggested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 include/linux/kasan.h | 16 ++++++++++++++++
 mm/kasan/common.c     | 10 ++++++++++
 mm/kasan/kasan.h      |  2 ++
 3 files changed, 28 insertions(+)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index bf233bde68c7..deadf566b84a 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -134,6 +134,20 @@ static __always_inline void kasan_poison_slab(struct slab *slab)
 		__kasan_poison_slab(slab);
 }
 
+void __kasan_poison_pte(pte_t *pte, int nr);
+static __always_inline void kasan_poison_pte(pte_t *pte, int nr)
+{
+	if (kasan_enabled())
+		__kasan_poison_pte(pte, nr);
+}
+
+void __kasan_unpoison_pte(pte_t *pte, int nr);
+static __always_inline void kasan_unpoison_pte(pte_t *pte, int nr)
+{
+	if (kasan_enabled())
+		__kasan_unpoison_pte(pte, nr);
+}
+
 void __kasan_unpoison_new_object(struct kmem_cache *cache, void *object);
 /**
  * kasan_unpoison_new_object - Temporarily unpoison a new slab object.
@@ -414,6 +428,8 @@ static inline bool kasan_unpoison_pages(struct page *page, unsigned int order,
 	return false;
 }
 static inline void kasan_poison_slab(struct slab *slab) {}
+static inline void kasan_poison_pte(pte_t *pte, int nr) {}
+static inline void kasan_unpoison_pte(pte_t *pte, int nr) {}
 static inline void kasan_unpoison_new_object(struct kmem_cache *cache,
 					void *object) {}
 static inline void kasan_poison_new_object(struct kmem_cache *cache,
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index b7d05c2a6d93..cbf68680614e 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -163,6 +163,16 @@ void __kasan_poison_slab(struct slab *slab)
 		     KASAN_SLAB_REDZONE, false);
 }
 
+void __kasan_poison_pte(pte_t *pte, int nr)
+{
+	kasan_poison(pte, sizeof(*pte) * nr, KASAN_LAZY_MMU_PTE, false);
+}
+
+void __kasan_unpoison_pte(pte_t *pte, int nr)
+{
+	kasan_unpoison(pte, sizeof(*pte) * nr, false);
+}
+
 void __kasan_unpoison_new_object(struct kmem_cache *cache, void *object)
 {
 	kasan_unpoison(object, cache->object_size, false);
diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index fc9169a54766..8ba0fbabd75b 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -144,12 +144,14 @@ static inline bool kasan_requires_meta(void)
 #define KASAN_PAGE_REDZONE	0xFE  /* redzone for kmalloc_large allocation */
 #define KASAN_SLAB_REDZONE	0xFC  /* redzone for slab object */
 #define KASAN_SLAB_FREE		0xFB  /* freed slab object */
+#define KASAN_LAZY_MMU_PTE	0xFD
 #define KASAN_VMALLOC_INVALID	0xF8  /* inaccessible space in vmap area */
 #else
 #define KASAN_PAGE_FREE		KASAN_TAG_INVALID
 #define KASAN_PAGE_REDZONE	KASAN_TAG_INVALID
 #define KASAN_SLAB_REDZONE	KASAN_TAG_INVALID
 #define KASAN_SLAB_FREE		KASAN_TAG_INVALID
+#define KASAN_LAZY_MMU_PTE	KASAN_TAG_INVALID
 #define KASAN_VMALLOC_INVALID	KASAN_TAG_INVALID /* only used for SW_TAGS */
 #endif
 
-- 
2.53.0



* [PATCH -next v4 4/4] s390/mm: Lazy MMU mode sanitizer
  2026-06-18 14:47 [PATCH -next v4 0/4] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
                   ` (2 preceding siblings ...)
  2026-06-18 14:47 ` [PATCH -next v4 3/4] mm/kasan: Introduce helpers for lazy MMU mode sanitizer Alexander Gordeev
@ 2026-06-18 14:47 ` Alexander Gordeev
  3 siblings, 0 replies; 7+ messages in thread
From: Alexander Gordeev @ 2026-06-18 14:47 UTC (permalink / raw)
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Detect accesses to PTEs in lazy MMU mode by means other
than the set_pte() and ptep_get() primitives. Such direct
accesses would be a hazard.

ptep_get_lockless() may read a PTE while a concurrent lazy
MMU access to the same PTE is in progress, in which case
the KASAN shadow memory check mistakenly reports an invalid
access. To avoid that, disable instrumentation for
ptep_get_lockless() altogether.

Suggested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h |  6 ++++++
 arch/s390/mm/lazy_mmu.c         | 27 +++++++++++++++++++++++----
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2b6659d61fa5..a93e7e786457 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1047,6 +1047,12 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
 		__set_pte(ptep, pte);
 }
 
+#define ptep_get_lockless ptep_get_lockless
+static inline __no_sanitize_address pte_t ptep_get_lockless(pte_t *ptep)
+{
+	return READ_ONCE(*ptep);
+}
+
 static inline pte_t __ptep_get(pte_t *ptep)
 {
 	return READ_ONCE(*ptep);
diff --git a/arch/s390/mm/lazy_mmu.c b/arch/s390/mm/lazy_mmu.c
index d75b93d9b0de..ee2385897bc7 100644
--- a/arch/s390/mm/lazy_mmu.c
+++ b/arch/s390/mm/lazy_mmu.c
@@ -63,10 +63,13 @@ static int invalidate_pte_range(struct mm_struct *mm, unsigned long addr,
 }
 
 static void set_pte_range(struct mm_struct *mm, unsigned long addr,
-			  pte_t *ptep, pte_t *end, pte_t *cache)
+			  pte_t *start, pte_t *end, pte_t *cache)
 {
-	int i, nr_ptes;
+	int nr_ptes, nr_total = end - start;
+	pte_t *ptep = start;
+	int i;
 
+	kasan_unpoison_pte(start, nr_total);
 	while (ptep < end) {
 		nr_ptes = invalidate_pte_range(mm, addr, ptep, end);
 
@@ -77,6 +80,7 @@ static void set_pte_range(struct mm_struct *mm, unsigned long addr,
 
 		addr += nr_ptes * PAGE_SIZE;
 	}
+	kasan_poison_pte(start, nr_total);
 }
 
 static void enter_ipte_norange(void)
@@ -94,6 +98,7 @@ static void enter_ipte_range(struct mm_struct *mm,
 			     unsigned long addr, unsigned long end, pte_t *pte)
 {
 	struct ipte_range *range;
+	unsigned int nr_ptes;
 
 	if (!test_facility(13))
 		return;
@@ -105,6 +110,9 @@ static void enter_ipte_range(struct mm_struct *mm,
 	range->base_addr = addr;
 	range->base_end = end;
 	range->base_pte = pte;
+
+	nr_ptes = (range->base_end - range->base_addr) / PAGE_SIZE;
+	kasan_poison_pte(range->base_pte, nr_ptes);
 }
 
 static void leave_ipte_range(void)
@@ -112,6 +120,7 @@ static void leave_ipte_range(void)
 	pte_t *ptep, *start, *start_cache, *cache;
 	unsigned long start_addr, addr;
 	struct ipte_range *range;
+	unsigned int nr_ptes;
 	int start_idx;
 
 	if (!test_facility(13))
@@ -148,6 +157,9 @@ static void leave_ipte_range(void)
 	range->end_pte = NULL;
 
 done:
+	nr_ptes = (range->base_end - range->base_addr) / PAGE_SIZE;
+	kasan_unpoison_pte(range->base_pte, nr_ptes);
+
 	range->mm = NULL;
 	range->base_addr = 0;
 	range->base_end = 0;
@@ -227,10 +239,17 @@ static void __ipte_range_set_pte(struct ipte_range *range, pte_t *ptep, pte_t pt
 static pte_t __ipte_range_ptep_get(struct ipte_range *range, pte_t *ptep)
 {
 	unsigned int idx = ptep - range->base_pte;
+	pte_t pte;
 
 	lockdep_assert_preemption_disabled();
-	if (pte_val(range->cache[idx]) == PTE_POISON)
-		return __ptep_get(ptep);
+	if (pte_val(range->cache[idx]) == PTE_POISON) {
+		kasan_unpoison_pte(ptep, 1);
+		pte = __ptep_get(ptep);
+		kasan_poison_pte(ptep, 1);
+
+		return pte;
+	}
+
 	return range->cache[idx];
 }
 
-- 
2.53.0



* Re: [PATCH -next v4 3/4] mm/kasan: Introduce helpers for lazy MMU mode sanitizer
  2026-06-18 14:47 ` [PATCH -next v4 3/4] mm/kasan: Introduce helpers for lazy MMU mode sanitizer Alexander Gordeev
@ 2026-06-18 15:00   ` sashiko-bot
  0 siblings, 0 replies; 7+ messages in thread
From: sashiko-bot @ 2026-06-18 15:00 UTC (permalink / raw)
  To: Alexander Gordeev; +Cc: Heiko Carstens, Vasily Gorbik, linux-s390

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The new KASAN_LAZY_MMU_PTE shadow value is not mapped to a bug type in the KASAN reporting logic, resulting in uninformative "unknown-crash" reports.
--

commit 7d8731323b7762279e28d0aef3b1b3c457449b81
Author: Alexander Gordeev <agordeev@linux.ibm.com>

mm/kasan: Introduce helpers for lazy MMU mode sanitizer

This commit provides helpers to allow architectures to poison and unpoison
PTEs during lazy MMU mode. It introduces a new KASAN shadow tag to track
and catch direct PTE manipulations that improperly bypass the correct
accessors.

> diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
> index fc9169a547662..8ba0fbabd75b2 100644
> --- a/mm/kasan/kasan.h
> +++ b/mm/kasan/kasan.h
[ ... ]
> @@ -144,12 +144,14 @@ static inline bool kasan_requires_meta(void)
>  #define KASAN_PAGE_REDZONE	0xFE  /* redzone for kmalloc_large allocation */
>  #define KASAN_SLAB_REDZONE	0xFC  /* redzone for slab object */
>  #define KASAN_SLAB_FREE		0xFB  /* freed slab object */
> +#define KASAN_LAZY_MMU_PTE	0xFD

[Severity: Medium]
Since this adds a new shadow tag value, does get_shadow_bug_type() in
mm/kasan/report_generic.c also need to be updated to handle it?

Currently, if KASAN catches a direct PTE access and reads this shadow
value, it looks like it will fall through the switch statement in
get_shadow_bug_type() and default to reporting an "unknown-crash".

Would adding KASAN_LAZY_MMU_PTE to the switch cases ensure we get a
more descriptive bug type instead?
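
A minimal standalone sketch of what such a mapping could look like. It
is not the kernel's get_shadow_bug_type(): the toy_* names and the
"lazy-mmu-pte-access" string are assumptions made for illustration, and
the other strings are only meant to resemble existing bug types. The
shadow values are the ones quoted in the hunk above.

```c
#include <stdbool.h>
#include <string.h>

/* Shadow values as quoted from the mm/kasan/kasan.h hunk. */
#define KASAN_PAGE_REDZONE	0xFE
#define KASAN_LAZY_MMU_PTE	0xFD
#define KASAN_SLAB_REDZONE	0xFC
#define KASAN_SLAB_FREE		0xFB

/* Hypothetical stand-in for the switch in get_shadow_bug_type(). */
static const char *toy_shadow_bug_type(unsigned char shadow)
{
	switch (shadow) {
	case KASAN_PAGE_REDZONE:
	case KASAN_SLAB_REDZONE:
		return "slab-out-of-bounds";
	case KASAN_SLAB_FREE:
		return "slab-use-after-free";
	case KASAN_LAZY_MMU_PTE:
		/* Assumed name; the series does not define one. */
		return "lazy-mmu-pte-access";
	default:
		return "unknown-crash";
	}
}

/* True if the shadow value maps to something other than the fallback. */
static bool toy_shadow_is_known(unsigned char shadow)
{
	return strcmp(toy_shadow_bug_type(shadow), "unknown-crash") != 0;
}
```

Without the KASAN_LAZY_MMU_PTE case, the value 0xFD would take the
default branch in this model, which corresponds to the "unknown-crash"
report described above.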

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/cover.1781789772.git.agordeev@linux.ibm.com?part=3


* Re: [PATCH -next v4 2/4] s390/mm: Batch PTE updates in lazy MMU mode
  2026-06-18 14:47 ` [PATCH -next v4 2/4] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
@ 2026-06-18 15:10   ` sashiko-bot
  0 siblings, 0 replies; 7+ messages in thread
From: sashiko-bot @ 2026-06-18 15:10 UTC (permalink / raw)
  To: Alexander Gordeev; +Cc: Vasily Gorbik, Heiko Carstens, linux-s390

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] Deferred PTE updates to a per-CPU cache break synchronization with lockless page table walkers like GUP-fast, allowing them to pin pages being concurrently write-protected for COW.
- [High] The lazy MMU cache bypasses the `full=1` TLB flush optimization during process teardown, causing an O(N) IPTE storm.
--

commit 38614c455538c6b6144011b070bbbf41513506e6
Author: Alexander Gordeev <agordeev@linux.ibm.com>

s390/mm: Batch PTE updates in lazy MMU mode

This commit introduces a per-CPU cache to batch PTE updates while in lazy
MMU mode. By caching PTE changes and applying them simultaneously with the
IPTE instruction's "Additional Entries" feature, it reduces overall IPTE
traffic.

> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h

[ ... ]

> @@ -1259,11 +1373,13 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  {
>  	pte_t res;
>  
> -	if (full) {
> -		res = ptep_get(ptep);
> -		set_pte(ptep, __pte(_PAGE_INVALID));
> -	} else {
> -		res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
> +	if (!lazy_mmu_ptep_get_and_clear(addr, ptep, &res)) {

[Severity: High]
Does checking lazy_mmu_ptep_get_and_clear() before the 'full' flag bypass
the teardown optimization and cause an IPTE storm?

During process teardown, exit_mmap() and zap_pte_range() use
ptep_get_and_clear_full(..., full=1) to clear PTEs without issuing individual
TLB invalidations, deferring the flush to a global teardown at the end.
If lazy_mmu_mode_enable_with_ptes() is active, this intercepts the call,
writes _PAGE_INVALID to the cache, and bypasses the if (full) path.

Later, when lazy_mmu_mode_disable() flushes the cache, it calls
__invalidate_pte_range() in arch/s390/mm/lazy_mmu.c:

static void __invalidate_pte_range(...)
{
	...
	if (cpu_has_tlb_lc() && ...)
		__ptep_ipte_range(addr, nr_ptes - 1, ptep, IPTE_LOCAL);
	else
		__ptep_ipte_range(addr, nr_ptes - 1, ptep, IPTE_GLOBAL);
	...
}

This unconditionally executes batched IPTE instructions for all cached PTEs.
For multi-threaded processes, does this turn an O(1) TLB flush operation
into an O(N) global IPTE broadcast, causing a regression in system
performance during process termination?

> +		if (full) {
> +			res = __ptep_get(ptep);
> +			__set_pte(ptep, __pte(_PAGE_INVALID));
> +		} else {
> +			res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
> +		}
>  	}
>  	page_table_check_pte_clear(mm, addr, res);

[ ... ]

> @@ -1289,10 +1405,15 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>  				      unsigned long addr, pte_t *ptep)
>  {
> -	pte_t pte = ptep_get(ptep);
> +	pte_t pte;
>  
> -	if (pte_write(pte))
> -		ptep_xchg_lazy(mm, addr, ptep, pte_wrprotect(pte));
> +	if (!lazy_mmu_ptep_set_wrprotect(addr, ptep)) {

[Severity: Critical]
Can intercepting the write-protect operation here break synchronization with
lockless page table walkers like GUP-fast?

During fork(), copy_pte_range() calls lazy_mmu_mode_enable_with_ptes().
When copy_present_pte() attempts to write-protect the parent's PTE by
calling ptep_set_wrprotect(), the lazy MMU mode intercepts the call via
__lazy_mmu_ptep_set_wrprotect() in arch/s390/mm/lazy_mmu.c:

bool __lazy_mmu_ptep_set_wrprotect(...)
{
	...
	pte = __ipte_range_ptep_get(range, ptep);
	if (pte_write(pte)) {
		pte = pte_wrprotect(pte);
		__ipte_range_set_pte(range, ptep, pte);
	}
	return true;
}

This updates the per-CPU cache but leaves the physical PTE completely valid
and writable until the cache is flushed at the end of the PMD range. Because
GUP-fast operates locklessly, a concurrent thread in the parent process can
read the physical PTE, observe that it is still writable, and successfully
take a FOLL_WRITE pin on the page.

Meanwhile, fork() proceeds, sees no pins, and shares the page with the
child. Does this create a regression where the parent retains a writable
GUP pin to a page shared with the child, allowing writes to corrupt the
child's memory and break COW isolation?
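
The window in question can be shown with a sequential userspace model.
This is a sketch under stated assumptions, not the patch's code: the
toy_* names and the TOY_WRITE bit are hypothetical, TOY_POISON reuses
the 0x800 value mentioned in the cover letter for PTE_POISON, and the
IPTE invalidation step is left out entirely. It shows only that a
deferred write-protect is visible through the cache before it is
visible in the table a lockless walker would read.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_t;

#define NR_PTES		4
#define TOY_WRITE	((pte_t)0x1)	/* hypothetical "writable" bit */
#define TOY_POISON	((pte_t)0x800)	/* "no cached value" marker */

static pte_t table[NR_PTES];	/* what a lockless walker would read */
static pte_t cache[NR_PTES];	/* stands in for the per-CPU cache */

/* Start a lazy section: no entry has a cached value yet. */
static void toy_enter(void)
{
	for (unsigned int i = 0; i < NR_PTES; i++)
		cache[i] = TOY_POISON;
}

/* Read through the cache, falling back to the table. */
static pte_t toy_cached_get(unsigned int i)
{
	return cache[i] == TOY_POISON ? table[i] : cache[i];
}

/* Deferred write-protect: only the cache is updated. */
static void toy_lazy_wrprotect(unsigned int i)
{
	pte_t pte = toy_cached_get(i);

	if (pte & TOY_WRITE)
		cache[i] = pte & ~TOY_WRITE;
}

/* End the lazy section: write cached values back to the table. */
static void toy_leave(void)
{
	for (unsigned int i = 0; i < NR_PTES; i++) {
		if (cache[i] == TOY_POISON)
			continue;
		table[i] = cache[i];
		cache[i] = TOY_POISON;
	}
}

/* What a reader that bypasses the cache observes. */
static bool toy_lockless_sees_writable(unsigned int i)
{
	return (table[i] & TOY_WRITE) != 0;
}
```

Between toy_lazy_wrprotect() and toy_leave(), the cached view is already
read-only while the table entry still carries the write bit. Whether
that window is reachable by GUP-fast in the real code is the open
question raised here.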

> +		pte = __ptep_get(ptep);
> +		if (pte_write(pte)) {
> +			pte = pte_wrprotect(pte);
> +			ptep_xchg_lazy(mm, addr, ptep, pte);
> +		}
> +	}
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/cover.1781789772.git.agordeev@linux.ibm.com?part=2

