[PATCH v3 0/7] s390/mm: Batch PTE updates in lazy MMU mode

The Linux Kernel Mailing List

* [PATCH v3 0/7] s390/mm: Batch PTE updates in lazy MMU mode
@ 2026-06-16 12:40 Alexander Gordeev
  2026-06-16 12:40 ` [PATCH v3 1/7] mm: Make lazy MMU mode context-aware Alexander Gordeev
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Alexander Gordeev @ 2026-06-16 12:40 UTC
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Hi All!

This is v3 of the batched PTE updates in lazy MMU mode rework.

Prerequisite patches 4 and 5 are already in the mm tree and scheduled
for -next:
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything

This implementation sets up the per-CPU caches in the s390-specific
hotplug callbacks rather than in CPUHP_BP_PREPARE_DYN hooks. I prefer
this approach because it keeps the SMP-related lowcore initialization
in one place, but that is open to discussion.

Changes since v2:
- lazy_mmu_mode_enable_for_pte_range() renamed to lazy_mmu_mode_enable_with_ptes()
  (David Hildenbrand)
- patch "mm/pgtable: Fix bogus comment to clear_not_present_full_ptes()"
  is dropped (David Hildenbrand)
- KASAN sanitizer for direct PTE dereferencing added (Heiko Carstens)
- CONFIG_IPTE_BATCH option is dropped (Heiko Carstens)
- PTE_POISON changed from zero to 0x800 (Heiko Carstens)
- allocate per-cpu caches on CPU hot-plug (Heiko Carstens)
- introduced a lowcore field for fast lazy mode checking (Heiko Carstens)
- a few minor code changes (Heiko Carstens)

Changes since v1:
- lazy_mmu_mode_enable_pte() renamed to lazy_mmu_mode_enable_for_pte_range()
- lazy_mmu_mode_enable_for_pte_range() semantics clarified
- some sashiko comments addressed [1] including one bug fix
  1. https://sashiko.dev/#/patchset/cover.1774420056.git.agordeev%40linux.ibm.com
- patches 2-4 added

This series addresses an s390-specific aspect of how page table entries
are modified. In many cases, changing a valid PTE (for example, setting
or clearing a hardware bit) requires issuing an Invalidate Page Table
Entry (IPTE) instruction beforehand.

A disadvantage of the IPTE instruction is that it may initiate a
machine-wide quiesce state. This state acts as an expensive global
hardware lock and should be avoided whenever possible.

Currently, most code paths invoke IPTE once for each individual PTE
update. However, a single IPTE can invalidate multiple PTEs at once,
up to 256 entries, and the kernel does not yet use this capability.
Using it can significantly reduce the number of quiesce events, with
a positive impact on overall system performance.
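
To put rough numbers on this, here is a small stand-alone sketch of the
arithmetic. It is an illustration only: the helper names are invented
for this example and are not part of the series, and it assumes that
each batch covers consecutive PTEs of one page table.

```c
#include <assert.h>

/* Per the text above: one IPTE can cover up to 256 PTEs. */
#define IPTE_MAX_PTES 256UL

/* Hypothetical helper: IPTE invocations when every PTE is invalidated on its own. */
static unsigned long ipte_calls_unbatched(unsigned long nr_ptes)
{
	return nr_ptes;
}

/*
 * Hypothetical helper: IPTE invocations when consecutive PTEs are
 * invalidated in batches of at most IPTE_MAX_PTES (rounded up).
 */
static unsigned long ipte_calls_batched(unsigned long nr_ptes)
{
	return (nr_ptes + IPTE_MAX_PTES - 1) / IPTE_MAX_PTES;
}
```

For a range of 256 consecutive PTEs, ipte_calls_unbatched(256) yields
256 invocations while ipte_calls_batched(256) yields 1, which is the
best case the batching aims for.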

An effort was therefore made to identify kernel code paths that update
large numbers of consecutive PTEs. Such updates can be batched and
handled by a single IPTE invocation, leveraging the hardware support
described above.

Natural candidates for this optimization are page-table walkers that
change attributes of memory ranges and thus modify contiguous ranges
of PTEs. Many memory-management system calls enter lazy MMU mode while
updating such ranges.

The existing lazy MMU mode infrastructure can therefore be reused to
implement a software-level lazy MMU mechanism on s390 that batches
these expensive PTE invalidations.

Alexander Gordeev (7):
  mm: Make lazy MMU mode context-aware
  s390/mm: Complete ptep_get() conversion
  s390/mm: Batch PTE updates in lazy MMU mode
  mm/gup: Cleanup pgtable entry accessors
  mm/page_vma_mapped_walk: Use ptep_get_lockless() for lockless access
  mm/kasan: Introduce helpers for lazy MMU mode sanitizer
  s390/mm: Lazy MMU mode sanitizer

 arch/s390/Kconfig                |   1 +
 arch/s390/boot/vmem.c            |  32 +--
 arch/s390/include/asm/hugetlb.h  |   2 +-
 arch/s390/include/asm/lazy_mmu.h |   9 +
 arch/s390/include/asm/lowcore.h  |   2 +-
 arch/s390/include/asm/pgtable.h  | 213 +++++++++++++---
 arch/s390/kernel/setup.c         |   2 +
 arch/s390/kernel/smp.c           |   7 +
 arch/s390/mm/Makefile            |   2 +-
 arch/s390/mm/hugetlbpage.c       |  12 +-
 arch/s390/mm/lazy_mmu.c          | 401 +++++++++++++++++++++++++++++++
 arch/s390/mm/pageattr.c          |  45 ++--
 arch/s390/mm/pgtable.c           |   8 +-
 arch/s390/mm/vmem.c              |  82 ++++---
 fs/proc/task_mmu.c               |   2 +-
 include/linux/kasan.h            |  16 ++
 include/linux/pgtable.h          |  46 ++++
 mm/gup.c                         |   8 +-
 mm/kasan/common.c                |  10 +
 mm/kasan/kasan.h                 |   2 +
 mm/madvise.c                     |   8 +-
 mm/memory.c                      |   8 +-
 mm/mprotect.c                    |   2 +-
 mm/mremap.c                      |   2 +-
 mm/page_vma_mapped.c             |   9 +-
 mm/vmalloc.c                     |   6 +-
 26 files changed, 805 insertions(+), 132 deletions(-)
 create mode 100644 arch/s390/include/asm/lazy_mmu.h
 create mode 100644 arch/s390/mm/lazy_mmu.c

-- 
2.53.0


* [PATCH v3 1/7] mm: Make lazy MMU mode context-aware
  2026-06-16 12:40 [PATCH v3 0/7] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
@ 2026-06-16 12:40 ` Alexander Gordeev
  2026-06-16 12:40 ` [PATCH v3 2/7] s390/mm: Complete ptep_get() conversion Alexander Gordeev
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Alexander Gordeev @ 2026-06-16 12:40 UTC
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Lazy MMU mode is assumed to be context-independent, in the sense
that it does not need any additional information while operating.
However, the s390 architecture benefits from knowing the exact
page table entries being modified.

Introduce lazy_mmu_mode_enable_with_ptes(), which is provided with
the process address space and the page table being operated on.
This information is required to enable s390-specific optimizations.

The function takes the parameters that are typically passed to
page-table-level walkers, which implies that the span of PTEs
never crosses a page table boundary.

Architectures that do not require such information simply do not
need to define the arch_enter_lazy_mmu_mode_with_ptes() callback;
the generic fallback calls arch_enter_lazy_mmu_mode() instead.
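
As an illustration only, the nesting behaviour can be modelled in plain
user-space C. All names below are invented for this sketch. The
in-interrupt check is left out, the overflow check uses assert() where
the kernel code only warns, and the disable side is an assumption,
since this patch only adds the enable side.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-task lazy MMU state. */
struct lazy_mmu_model {
	uint8_t enable_count;		/* nesting depth of enabled sections */
	uint8_t pause_count;		/* non-zero while the mode is paused */
	unsigned int arch_enter_calls;	/* how often the arch hook ran */
	unsigned long hint_addr;	/* range start seen by the arch hook */
	unsigned long hint_end;		/* range end seen by the arch hook */
};

/* Models the arch hook: it only records the range it was given. */
static void model_arch_enter_with_ptes(struct lazy_mmu_model *s,
				       unsigned long addr, unsigned long end)
{
	s->arch_enter_calls++;
	s->hint_addr = addr;
	s->hint_end = end;
}

/* Models the enable side: only the outermost section reaches the arch hook. */
static void model_enable_with_ptes(struct lazy_mmu_model *s,
				   unsigned long addr, unsigned long end)
{
	if (s->pause_count > 0)
		return;
	assert(s->enable_count < UINT8_MAX);
	if (s->enable_count++ == 0)
		model_arch_enter_with_ptes(s, addr, end);
}

/* Assumed disable side: drops one nesting level. */
static void model_disable(struct lazy_mmu_model *s)
{
	if (s->pause_count > 0)
		return;
	assert(s->enable_count > 0);
	s->enable_count--;
}
```

In this model, a nested model_enable_with_ptes() call only bumps the
counter: the range passed by the outermost caller is the one the arch
hook sees.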

Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 fs/proc/task_mmu.c      |  2 +-
 include/linux/pgtable.h | 46 +++++++++++++++++++++++++++++++++++++++++
 mm/madvise.c            |  8 +++----
 mm/memory.c             |  8 +++----
 mm/mprotect.c           |  2 +-
 mm/mremap.c             |  2 +-
 mm/vmalloc.c            |  6 +++---
 7 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 751b9ba160fb..a02a83c390b9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2752,7 +2752,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 		return 0;
 	}
 
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(vma->vm_mm, start, end, start_pte);
 
 	if ((p->arg.flags & PM_SCAN_WP_MATCHING) && !p->vec_out) {
 		/* Fast path for performing exclusive WP */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cdd68ed3ae1a..6e582b9e58f3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -271,6 +271,50 @@ static inline void lazy_mmu_mode_enable(void)
 		arch_enter_lazy_mmu_mode();
 }
 
+#ifndef arch_enter_lazy_mmu_mode_with_ptes
+static inline void arch_enter_lazy_mmu_mode_with_ptes(struct mm_struct *mm,
+		unsigned long addr, unsigned long end, pte_t *ptep)
+{
+	arch_enter_lazy_mmu_mode();
+}
+#endif
+
+/**
+ * lazy_mmu_mode_enable_with_ptes() - Enable the lazy MMU mode with a speedup hint.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Start address of the range.
+ * @end: End address of the range.
+ * @ptep: Page table pointer for the first entry.
+ *
+ * Enters a new lazy MMU mode section; if the mode was not already enabled,
+ * enables it and calls arch_enter_lazy_mmu_mode_with_ptes().
+ *
+ * PTEs that fall within the specified range might observe update speedups.
+ * The PTEs must belong to the specified address space and be in the same PMD.
+ *
+ * There are no requirements on the order or range completeness of PTE
+ * updates for the specified range.
+ *
+ * Must be paired with a call to lazy_mmu_mode_disable().
+ *
+ * Has no effect if called:
+ * - While paused - see lazy_mmu_mode_pause()
+ * - In interrupt context
+ */
+static inline void lazy_mmu_mode_enable_with_ptes(struct mm_struct *mm,
+		unsigned long addr, unsigned long end, pte_t *ptep)
+{
+	struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+	if (in_interrupt() || state->pause_count > 0)
+		return;
+
+	VM_WARN_ON_ONCE(state->enable_count == U8_MAX);
+
+	if (state->enable_count++ == 0)
+		arch_enter_lazy_mmu_mode_with_ptes(mm, addr, end, ptep);
+}
+
 /**
  * lazy_mmu_mode_disable() - Disable the lazy MMU mode.
  *
@@ -353,6 +397,8 @@ static inline void lazy_mmu_mode_resume(void)
 }
 #else
 static inline void lazy_mmu_mode_enable(void) {}
+static inline void lazy_mmu_mode_enable_with_ptes(struct mm_struct *mm,
+		unsigned long addr, unsigned long end, pte_t *ptep) {}
 static inline void lazy_mmu_mode_disable(void) {}
 static inline void lazy_mmu_mode_pause(void) {}
 static inline void lazy_mmu_mode_resume(void) {}
diff --git a/mm/madvise.c b/mm/madvise.c
index 69708e953cf5..de39703c26a1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -453,7 +453,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	if (!start_pte)
 		return 0;
 	flush_tlb_batched_pending(mm);
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(mm, addr, end, start_pte);
 	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
 		nr = 1;
 		ptent = ptep_get(pte);
@@ -508,7 +508,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 				if (!start_pte)
 					break;
 				flush_tlb_batched_pending(mm);
-				lazy_mmu_mode_enable();
+				lazy_mmu_mode_enable_with_ptes(mm, addr, end, start_pte);
 				if (!err)
 					nr = 0;
 				continue;
@@ -675,7 +675,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	if (!start_pte)
 		return 0;
 	flush_tlb_batched_pending(mm);
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(mm, addr, end, start_pte);
 	for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
 		nr = 1;
 		ptent = ptep_get(pte);
@@ -735,7 +735,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				if (!start_pte)
 					break;
 				flush_tlb_batched_pending(mm);
-				lazy_mmu_mode_enable();
+				lazy_mmu_mode_enable_with_ptes(mm, addr, end, pte);
 				if (!err)
 					nr = 0;
 				continue;
diff --git a/mm/memory.c b/mm/memory.c
index 86a973119bd4..e4487564b166 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1272,7 +1272,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 	orig_src_pte = src_pte;
 	orig_dst_pte = dst_pte;
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(src_mm, addr, end, src_pte);
 
 	do {
 		nr = 1;
@@ -1922,7 +1922,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		return addr;
 
 	flush_tlb_batched_pending(mm);
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(mm, addr, end, start_pte);
 	do {
 		bool any_skipped = false;
 
@@ -2919,7 +2919,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(mm, addr, end, mapped_pte);
 	do {
 		BUG_ON(!pte_none(ptep_get(pte)));
 		if (!pfn_modify_allowed(pfn, prot)) {
@@ -3330,7 +3330,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			return -EINVAL;
 	}
 
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(mm, addr, end, mapped_pte);
 
 	if (fn) {
 		do {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9cbf932b028c..3fc26418e837 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -337,7 +337,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 		is_private_single_threaded = vma_is_single_threaded_private(vma);
 
 	flush_tlb_batched_pending(vma->vm_mm);
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(vma->vm_mm, addr, end, pte);
 	do {
 		nr_ptes = 1;
 		oldpte = ptep_get(pte);
diff --git a/mm/mremap.c b/mm/mremap.c
index e9c8b1d05832..0dfe3de39ccc 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -260,7 +260,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	if (new_ptl != old_ptl)
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
 	flush_tlb_batched_pending(vma->vm_mm);
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(mm, old_addr, old_end, old_ptep);
 
 	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
 		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index bb6ae08d18f5..11c9c78072ae 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -108,7 +108,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	if (!pte)
 		return -ENOMEM;
 
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(&init_mm, addr, end, pte);
 
 	do {
 		if (unlikely(!pte_none(ptep_get(pte)))) {
@@ -371,7 +371,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	unsigned long size = PAGE_SIZE;
 
 	pte = pte_offset_kernel(pmd, addr);
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(&init_mm, addr, end, pte);
 
 	do {
 #ifdef CONFIG_HUGETLB_PAGE
@@ -538,7 +538,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	if (!pte)
 		return -ENOMEM;
 
-	lazy_mmu_mode_enable();
+	lazy_mmu_mode_enable_with_ptes(&init_mm, addr, end, pte);
 
 	do {
 		struct page *page = pages[*nr];
-- 
2.53.0



* [PATCH v3 2/7] s390/mm: Complete ptep_get() conversion
  2026-06-16 12:40 [PATCH v3 0/7] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
  2026-06-16 12:40 ` [PATCH v3 1/7] mm: Make lazy MMU mode context-aware Alexander Gordeev
@ 2026-06-16 12:40 ` Alexander Gordeev
  2026-06-16 12:40 ` [PATCH v3 3/7] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Alexander Gordeev @ 2026-06-16 12:40 UTC
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Complete the work started by commit c33c794828f2 ("mm: ptep_get()
conversion") and replace direct page table entry dereferencing with
the proper accessors (ptep_get(), pmdp_get(), etc.).

Override the default getter implementations even though the overrides
are currently identical to the defaults: pud_clear(), p4d_clear(), and
pgd_clear() now call the corresponding getters, which must therefore
be defined in the architecture header ahead of them. This avoids a
dependency loop.
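
The override pattern can be sketched in stand-alone C as follows. This
is a simplified model, not the s390 code: the model_* names and the
MODEL_ENTRY_EMPTY value are made up, and the clear helper skips the
region-type check that the real helpers perform.

```c
#include <assert.h>

/* Hypothetical page table entry type and "empty" value. */
typedef struct { unsigned long val; } model_pud_t;
#define MODEL_ENTRY_EMPTY 0x40UL

/* ---- "architecture header" part ---- */

/* Announce the override so that the generic fallback below is skipped. */
#define model_pudp_get model_pudp_get
static inline model_pud_t model_pudp_get(const model_pud_t *pudp)
{
	/* Volatile read as a stand-in for READ_ONCE(). */
	model_pud_t pud = { .val = *(const volatile unsigned long *)&pudp->val };

	return pud;
}

/* A clear helper in the same header can now rely on the getter. */
static inline void model_pud_clear(model_pud_t *pudp)
{
	if (model_pudp_get(pudp).val != MODEL_ENTRY_EMPTY)
		pudp->val = MODEL_ENTRY_EMPTY;
}

/* ---- "generic header" part, processed later ---- */

#ifndef model_pudp_get
/* Fallback, used only when the architecture did not provide a getter. */
static inline model_pud_t model_pudp_get(const model_pud_t *pudp)
{
	return *pudp;
}
#endif
```

Because the macro with the same name is already defined when the
generic part is reached, the fallback is compiled out and only one
definition of the getter remains.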

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 arch/s390/boot/vmem.c           | 32 +++++++------
 arch/s390/include/asm/hugetlb.h |  2 +-
 arch/s390/include/asm/pgtable.h | 60 ++++++++++++++++++------
 arch/s390/mm/hugetlbpage.c      | 12 ++---
 arch/s390/mm/pageattr.c         | 45 ++++++++++--------
 arch/s390/mm/vmem.c             | 82 ++++++++++++++++++---------------
 6 files changed, 140 insertions(+), 93 deletions(-)

diff --git a/arch/s390/boot/vmem.c b/arch/s390/boot/vmem.c
index 7d6cc4c85af0..ff6d58a476ba 100644
--- a/arch/s390/boot/vmem.c
+++ b/arch/s390/boot/vmem.c
@@ -338,7 +338,7 @@ static void pgtable_pte_populate(pmd_t *pmd, unsigned long addr, unsigned long e
 
 	pte = pte_offset_kernel(pmd, addr);
 	for (; addr < end; addr += PAGE_SIZE, pte++) {
-		if (pte_none(*pte)) {
+		if (pte_none(ptep_get(pte))) {
 			if (kasan_pte_populate_zero_shadow(pte, mode))
 				continue;
 			entry = __pte(resolve_pa_may_alloc(addr, PAGE_SIZE, mode));
@@ -355,26 +355,27 @@ static void pgtable_pmd_populate(pud_t *pud, unsigned long addr, unsigned long e
 				 enum populate_mode mode)
 {
 	unsigned long pa, next, pages = 0;
-	pmd_t *pmd, entry;
+	pmd_t *pmd, entry, large_entry;
 	pte_t *pte;
 
 	pmd = pmd_offset(pud, addr);
 	for (; addr < end; addr = next, pmd++) {
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(*pmd)) {
+		entry = pmdp_get(pmd);
+		if (pmd_none(entry)) {
 			if (kasan_pmd_populate_zero_shadow(pmd, addr, next, mode))
 				continue;
 			pa = try_get_large_pmd_pa(pmd, addr, next, mode);
 			if (pa != INVALID_PHYS_ADDR) {
-				entry = __pmd(pa);
-				entry = set_pmd_bit(entry, SEGMENT_KERNEL);
-				set_pmd(pmd, entry);
+				large_entry = __pmd(pa);
+				large_entry = set_pmd_bit(large_entry, SEGMENT_KERNEL);
+				set_pmd(pmd, large_entry);
 				pages++;
 				continue;
 			}
 			pte = boot_pte_alloc();
 			pmd_populate(&init_mm, pmd, pte);
-		} else if (pmd_leaf(*pmd)) {
+		} else if (pmd_leaf(entry)) {
 			continue;
 		}
 		pgtable_pte_populate(pmd, addr, next, mode);
@@ -387,26 +388,27 @@ static void pgtable_pud_populate(p4d_t *p4d, unsigned long addr, unsigned long e
 				 enum populate_mode mode)
 {
 	unsigned long pa, next, pages = 0;
-	pud_t *pud, entry;
+	pud_t *pud, entry, large_entry;
 	pmd_t *pmd;
 
 	pud = pud_offset(p4d, addr);
 	for (; addr < end; addr = next, pud++) {
 		next = pud_addr_end(addr, end);
-		if (pud_none(*pud)) {
+		entry = pudp_get(pud);
+		if (pud_none(entry)) {
 			if (kasan_pud_populate_zero_shadow(pud, addr, next, mode))
 				continue;
 			pa = try_get_large_pud_pa(pud, addr, next, mode);
 			if (pa != INVALID_PHYS_ADDR) {
-				entry = __pud(pa);
-				entry = set_pud_bit(entry, REGION3_KERNEL);
-				set_pud(pud, entry);
+				large_entry = __pud(pa);
+				large_entry = set_pud_bit(large_entry, REGION3_KERNEL);
+				set_pud(pud, large_entry);
 				pages++;
 				continue;
 			}
 			pmd = boot_crst_alloc(_SEGMENT_ENTRY_EMPTY);
 			pud_populate(&init_mm, pud, pmd);
-		} else if (pud_leaf(*pud)) {
+		} else if (pud_leaf(entry)) {
 			continue;
 		}
 		pgtable_pmd_populate(pud, addr, next, mode);
@@ -425,7 +427,7 @@ static void pgtable_p4d_populate(pgd_t *pgd, unsigned long addr, unsigned long e
 	p4d = p4d_offset(pgd, addr);
 	for (; addr < end; addr = next, p4d++) {
 		next = p4d_addr_end(addr, end);
-		if (p4d_none(*p4d)) {
+		if (p4d_none(p4dp_get(p4d))) {
 			if (kasan_p4d_populate_zero_shadow(p4d, addr, next, mode))
 				continue;
 			pud = boot_crst_alloc(_REGION3_ENTRY_EMPTY);
@@ -451,7 +453,7 @@ static void pgtable_populate(unsigned long addr, unsigned long end, enum populat
 	pgd = pgd_offset(&init_mm, addr);
 	for (; addr < end; addr = next, pgd++) {
 		next = pgd_addr_end(addr, end);
-		if (pgd_none(*pgd)) {
+		if (pgd_none(pgdp_get(pgd))) {
 			if (kasan_pgd_populate_zero_shadow(pgd, addr, next, mode))
 				continue;
 			p4d = boot_crst_alloc(_REGION2_ENTRY_EMPTY);
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index 6983e52eaf81..e33a5b587ee4 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -41,7 +41,7 @@ static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 				  pte_t *ptep, unsigned long sz)
 {
-	if ((pte_val(*ptep) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
+	if ((pte_val(ptep_get(ptep)) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
 		set_pte(ptep, __pte(_REGION3_ENTRY_EMPTY));
 	else
 		set_pte(ptep, __pte(_SEGMENT_ENTRY_EMPTY));
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 3197b8b372a2..f9a8a92fa160 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -983,22 +983,39 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
 	WRITE_ONCE(*ptep, pte);
 }
 
-static inline void pgd_clear(pgd_t *pgd)
+#define ptep_get ptep_get
+static inline pte_t ptep_get(pte_t *ptep)
 {
-	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R1)
-		set_pgd(pgd, __pgd(_REGION1_ENTRY_EMPTY));
+	return READ_ONCE(*ptep);
 }
 
-static inline void p4d_clear(p4d_t *p4d)
+#define pmdp_get pmdp_get
+static inline pmd_t pmdp_get(pmd_t *pmdp)
 {
-	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R2)
-		set_p4d(p4d, __p4d(_REGION2_ENTRY_EMPTY));
+	return READ_ONCE(*pmdp);
 }
 
-static inline void pud_clear(pud_t *pud)
+#define pudp_get pudp_get
+static inline pud_t pudp_get(pud_t *pudp)
 {
-	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
-		set_pud(pud, __pud(_REGION3_ENTRY_EMPTY));
+	return READ_ONCE(*pudp);
+}
+
+#define p4dp_get p4dp_get
+static inline p4d_t p4dp_get(p4d_t *p4dp)
+{
+	return READ_ONCE(*p4dp);
+}
+
+#define pgdp_get pgdp_get
+static inline pgd_t pgdp_get(pgd_t *pgdp)
+{
+	return READ_ONCE(*pgdp);
+}
+
+static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+	set_pte(ptep, __pte(_PAGE_INVALID));
 }
 
 static inline void pmd_clear(pmd_t *pmdp)
@@ -1006,9 +1023,22 @@ static inline void pmd_clear(pmd_t *pmdp)
 	set_pmd(pmdp, __pmd(_SEGMENT_ENTRY_EMPTY));
 }
 
-static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+static inline void pud_clear(pud_t *pud)
 {
-	set_pte(ptep, __pte(_PAGE_INVALID));
+	if ((pud_val(pudp_get(pud)) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
+		set_pud(pud, __pud(_REGION3_ENTRY_EMPTY));
+}
+
+static inline void p4d_clear(p4d_t *p4d)
+{
+	if ((p4d_val(p4dp_get(p4d)) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R2)
+		set_p4d(p4d, __p4d(_REGION2_ENTRY_EMPTY));
+}
+
+static inline void pgd_clear(pgd_t *pgd)
+{
+	if ((pgd_val(pgdp_get(pgd)) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R1)
+		set_pgd(pgd, __pgd(_REGION1_ENTRY_EMPTY));
 }
 
 /*
@@ -1169,7 +1199,7 @@ pte_t ptep_xchg_lazy(struct mm_struct *, unsigned long, pte_t *, pte_t);
 static inline bool ptep_test_and_clear_young(struct vm_area_struct *vma,
 		unsigned long addr, pte_t *ptep)
 {
-	pte_t pte = *ptep;
+	pte_t pte = ptep_get(ptep);
 
 	pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
 	return pte_young(pte);
@@ -1230,7 +1260,7 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 	pte_t res;
 
 	if (full) {
-		res = *ptep;
+		res = ptep_get(ptep);
 		set_pte(ptep, __pte(_PAGE_INVALID));
 	} else {
 		res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
@@ -1259,7 +1289,7 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
-	pte_t pte = *ptep;
+	pte_t pte = ptep_get(ptep);
 
 	if (pte_write(pte))
 		ptep_xchg_lazy(mm, addr, ptep, pte_wrprotect(pte));
@@ -1295,7 +1325,7 @@ static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
 	 * PTE does not have _PAGE_PROTECT set, to avoid unnecessary overhead.
 	 * A local RDP can be used to do the flush.
 	 */
-	if (cpu_has_rdp() && !(pte_val(*ptep) & _PAGE_PROTECT))
+	if (cpu_has_rdp() && !(pte_val(ptep_get(ptep)) & _PAGE_PROTECT))
 		__ptep_rdp(address, ptep, 1);
 }
 #define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index 302ef5781b65..db35d8fe8609 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -143,7 +143,7 @@ void __set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 	rste = __pte_to_rste(pte);
 
 	/* Set correct table type for 2G hugepages */
-	if ((pte_val(*ptep) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3) {
+	if ((pte_val(ptep_get(ptep)) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3) {
 		if (likely(pte_present(pte)))
 			rste |= _REGION3_ENTRY_LARGE;
 		rste |= _REGION_ENTRY_TYPE_R3;
@@ -161,7 +161,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 
 pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
-	return __rste_to_pte(pte_val(*ptep));
+	return __rste_to_pte(pte_val(ptep_get(ptep)));
 }
 
 pte_t __huge_ptep_get_and_clear(struct mm_struct *mm,
@@ -171,7 +171,7 @@ pte_t __huge_ptep_get_and_clear(struct mm_struct *mm,
 	pmd_t *pmdp = (pmd_t *) ptep;
 	pud_t *pudp = (pud_t *) ptep;
 
-	if ((pte_val(*ptep) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
+	if ((pte_val(ptep_get(ptep)) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
 		pudp_xchg_direct(mm, addr, pudp, __pud(_REGION3_ENTRY_EMPTY));
 	else
 		pmdp_xchg_direct(mm, addr, pmdp, __pmd(_SEGMENT_ENTRY_EMPTY));
@@ -209,13 +209,13 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 	pmd_t *pmdp = NULL;
 
 	pgdp = pgd_offset(mm, addr);
-	if (pgd_present(*pgdp)) {
+	if (pgd_present(pgdp_get(pgdp))) {
 		p4dp = p4d_offset(pgdp, addr);
-		if (p4d_present(*p4dp)) {
+		if (p4d_present(p4dp_get(p4dp))) {
 			pudp = pud_offset(p4dp, addr);
 			if (sz == PUD_SIZE)
 				return (pte_t *)pudp;
-			if (pud_present(*pudp))
+			if (pud_present(pudp_get(pudp)))
 				pmdp = pmd_offset(pudp, addr);
 		}
 	}
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index bb29c38ae624..e6f788696dd1 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -85,7 +85,7 @@ static int walk_pte_level(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		return 0;
 	ptep = pte_offset_kernel(pmdp, addr);
 	do {
-		new = *ptep;
+		new = ptep_get(ptep);
 		if (pte_none(new))
 			return -EINVAL;
 		if (flags & SET_MEMORY_RO)
@@ -114,15 +114,16 @@ static int split_pmd_page(pmd_t *pmdp, unsigned long addr)
 {
 	unsigned long pte_addr, prot;
 	pte_t *pt_dir, *ptep;
-	pmd_t new;
+	pmd_t new, pmd;
 	int i, ro, nx;
 
 	pt_dir = vmem_pte_alloc();
 	if (!pt_dir)
 		return -ENOMEM;
-	pte_addr = pmd_pfn(*pmdp) << PAGE_SHIFT;
-	ro = !!(pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT);
-	nx = !!(pmd_val(*pmdp) & _SEGMENT_ENTRY_NOEXEC);
+	pmd = pmdp_get(pmdp);
+	pte_addr = pmd_pfn(pmd) << PAGE_SHIFT;
+	ro = !!(pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT);
+	nx = !!(pmd_val(pmd) & _SEGMENT_ENTRY_NOEXEC);
 	prot = pgprot_val(ro ? PAGE_KERNEL_RO : PAGE_KERNEL);
 	if (!nx)
 		prot &= ~_PAGE_NOEXEC;
@@ -142,7 +143,7 @@ static int split_pmd_page(pmd_t *pmdp, unsigned long addr)
 static void modify_pmd_page(pmd_t *pmdp, unsigned long addr,
 			    unsigned long flags)
 {
-	pmd_t new = *pmdp;
+	pmd_t new = pmdp_get(pmdp);
 
 	if (flags & SET_MEMORY_RO)
 		new = pmd_wrprotect(new);
@@ -165,16 +166,17 @@ static int walk_pmd_level(pud_t *pudp, unsigned long addr, unsigned long end,
 			  unsigned long flags)
 {
 	unsigned long next;
+	pmd_t *pmdp, pmd;
 	int need_split;
-	pmd_t *pmdp;
 	int rc = 0;
 
 	pmdp = pmd_offset(pudp, addr);
 	do {
-		if (pmd_none(*pmdp))
+		pmd = pmdp_get(pmdp);
+		if (pmd_none(pmd))
 			return -EINVAL;
 		next = pmd_addr_end(addr, end);
-		if (pmd_leaf(*pmdp)) {
+		if (pmd_leaf(pmd)) {
 			need_split  = !!(flags & SET_MEMORY_4K);
 			need_split |= !!(addr & ~PMD_MASK);
 			need_split |= !!(addr + PMD_SIZE > next);
@@ -201,15 +203,16 @@ int split_pud_page(pud_t *pudp, unsigned long addr)
 {
 	unsigned long pmd_addr, prot;
 	pmd_t *pm_dir, *pmdp;
-	pud_t new;
+	pud_t new, pud;
 	int i, ro, nx;
 
 	pm_dir = vmem_crst_alloc(_SEGMENT_ENTRY_EMPTY);
 	if (!pm_dir)
 		return -ENOMEM;
-	pmd_addr = pud_pfn(*pudp) << PAGE_SHIFT;
-	ro = !!(pud_val(*pudp) & _REGION_ENTRY_PROTECT);
-	nx = !!(pud_val(*pudp) & _REGION_ENTRY_NOEXEC);
+	pud = pudp_get(pudp);
+	pmd_addr = pud_pfn(pud) << PAGE_SHIFT;
+	ro = !!(pud_val(pud) & _REGION_ENTRY_PROTECT);
+	nx = !!(pud_val(pud) & _REGION_ENTRY_NOEXEC);
 	prot = pgprot_val(ro ? SEGMENT_KERNEL_RO : SEGMENT_KERNEL);
 	if (!nx)
 		prot &= ~_SEGMENT_ENTRY_NOEXEC;
@@ -229,7 +232,7 @@ int split_pud_page(pud_t *pudp, unsigned long addr)
 static void modify_pud_page(pud_t *pudp, unsigned long addr,
 			    unsigned long flags)
 {
-	pud_t new = *pudp;
+	pud_t new = pudp_get(pudp);
 
 	if (flags & SET_MEMORY_RO)
 		new = pud_wrprotect(new);
@@ -252,16 +255,17 @@ static int walk_pud_level(p4d_t *p4d, unsigned long addr, unsigned long end,
 			  unsigned long flags)
 {
 	unsigned long next;
+	pud_t *pudp, pud;
 	int need_split;
-	pud_t *pudp;
 	int rc = 0;
 
 	pudp = pud_offset(p4d, addr);
 	do {
-		if (pud_none(*pudp))
+		pud = pudp_get(pudp);
+		if (pud_none(pud))
 			return -EINVAL;
 		next = pud_addr_end(addr, end);
-		if (pud_leaf(*pudp)) {
+		if (pud_leaf(pud)) {
 			need_split  = !!(flags & SET_MEMORY_4K);
 			need_split |= !!(addr & ~PUD_MASK);
 			need_split |= !!(addr + PUD_SIZE > next);
@@ -291,7 +295,7 @@ static int walk_p4d_level(pgd_t *pgd, unsigned long addr, unsigned long end,
 
 	p4dp = p4d_offset(pgd, addr);
 	do {
-		if (p4d_none(*p4dp))
+		if (p4d_none(p4dp_get(p4dp)))
 			return -EINVAL;
 		next = p4d_addr_end(addr, end);
 		rc = walk_pud_level(p4dp, addr, next, flags);
@@ -313,7 +317,7 @@ static int change_page_attr(unsigned long addr, unsigned long end,
 
 	pgdp = pgd_offset_k(addr);
 	do {
-		if (pgd_none(*pgdp))
+		if (pgd_none(pgdp_get(pgdp)))
 			break;
 		next = pgd_addr_end(addr, end);
 		rc = walk_p4d_level(pgdp, addr, next, flags);
@@ -451,7 +455,8 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 		nr = min(numpages - i, nr);
 		if (enable) {
 			for (j = 0; j < nr; j++) {
-				pte = clear_pte_bit(*ptep, __pgprot(_PAGE_INVALID));
+				pte = ptep_get(ptep);
+				pte = clear_pte_bit(pte, __pgprot(_PAGE_INVALID));
 				set_pte(ptep, pte);
 				address += PAGE_SIZE;
 				ptep++;
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index eeadff45e0e1..803099f3db73 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -171,18 +171,19 @@ static int __ref modify_pte_table(pmd_t *pmd, unsigned long addr,
 {
 	unsigned long prot, pages = 0;
 	int ret = -ENOMEM;
-	pte_t *pte;
+	pte_t *pte, entry;
 
 	prot = pgprot_val(PAGE_KERNEL);
 	pte = pte_offset_kernel(pmd, addr);
 	for (; addr < end; addr += PAGE_SIZE, pte++) {
+		entry = ptep_get(pte);
 		if (!add) {
-			if (pte_none(*pte))
+			if (pte_none(entry))
 				continue;
 			if (!direct)
-				vmem_free_pages((unsigned long)pfn_to_virt(pte_pfn(*pte)), get_order(PAGE_SIZE), altmap);
+				vmem_free_pages((unsigned long)pfn_to_virt(pte_pfn(entry)), get_order(PAGE_SIZE), altmap);
 			pte_clear(&init_mm, addr, pte);
-		} else if (pte_none(*pte)) {
+		} else if (pte_none(entry)) {
 			if (!direct) {
 				void *new_page = vmemmap_alloc_block_buf(PAGE_SIZE, NUMA_NO_NODE, altmap);
 
@@ -212,10 +213,10 @@ static void try_free_pte_table(pmd_t *pmd, unsigned long start)
 	/* We can safely assume this is fully in 1:1 mapping & vmemmap area */
 	pte = pte_offset_kernel(pmd, start);
 	for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
-		if (!pte_none(*pte))
+		if (!pte_none(ptep_get(pte)))
 			return;
 	}
-	vmem_pte_free((unsigned long *) pmd_deref(*pmd));
+	vmem_pte_free((unsigned long *)pmd_deref(pmdp_get(pmd)));
 	pmd_clear(pmd);
 }
 
@@ -226,6 +227,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 {
 	unsigned long next, prot, pages = 0;
 	int ret = -ENOMEM;
+	pmd_t entry;
 	pmd_t *pmd;
 	pte_t *pte;
 
@@ -233,23 +235,24 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 	pmd = pmd_offset(pud, addr);
 	for (; addr < end; addr = next, pmd++) {
 		next = pmd_addr_end(addr, end);
+		entry = pmdp_get(pmd);
 		if (!add) {
-			if (pmd_none(*pmd))
+			if (pmd_none(entry))
 				continue;
-			if (pmd_leaf(*pmd)) {
+			if (pmd_leaf(entry)) {
 				if (IS_ALIGNED(addr, PMD_SIZE) &&
 				    IS_ALIGNED(next, PMD_SIZE)) {
 					if (!direct)
-						vmem_free_pages(pmd_deref(*pmd), get_order(PMD_SIZE), altmap);
+						vmem_free_pages(pmd_deref(entry), get_order(PMD_SIZE), altmap);
 					pmd_clear(pmd);
 					pages++;
 				} else if (!direct && vmemmap_unuse_sub_pmd(addr, next)) {
-					vmem_free_pages(pmd_deref(*pmd), get_order(PMD_SIZE), altmap);
+					vmem_free_pages(pmd_deref(entry), get_order(PMD_SIZE), altmap);
 					pmd_clear(pmd);
 				}
 				continue;
 			}
-		} else if (pmd_none(*pmd)) {
+		} else if (pmd_none(entry)) {
 			if (IS_ALIGNED(addr, PMD_SIZE) &&
 			    IS_ALIGNED(next, PMD_SIZE) &&
 			    cpu_has_edat1() && direct &&
@@ -281,7 +284,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 			if (!pte)
 				goto out;
 			pmd_populate(&init_mm, pmd, pte);
-		} else if (pmd_leaf(*pmd)) {
+		} else if (pmd_leaf(entry)) {
 			if (!direct)
 				vmemmap_use_sub_pmd(addr, next);
 			continue;
@@ -306,9 +309,9 @@ static void try_free_pmd_table(pud_t *pud, unsigned long start)
 
 	pmd = pmd_offset(pud, start);
 	for (i = 0; i < PTRS_PER_PMD; i++, pmd++)
-		if (!pmd_none(*pmd))
+		if (!pmd_none(pmdp_get(pmd)))
 			return;
-	vmem_free_pages(pud_deref(*pud), CRST_ALLOC_ORDER, NULL);
+	vmem_free_pages(pud_deref(pudp_get(pud)), CRST_ALLOC_ORDER, NULL);
 	pud_clear(pud);
 }
 
@@ -317,21 +320,22 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 {
 	unsigned long next, prot, pages = 0;
 	int ret = -ENOMEM;
-	pud_t *pud;
+	pud_t *pud, entry;
 	pmd_t *pmd;
 
 	prot = pgprot_val(REGION3_KERNEL);
 	pud = pud_offset(p4d, addr);
 	for (; addr < end; addr = next, pud++) {
 		next = pud_addr_end(addr, end);
+		entry = pudp_get(pud);
 		if (!add) {
-			if (pud_none(*pud))
+			if (pud_none(entry))
 				continue;
-			if (pud_leaf(*pud)) {
+			if (pud_leaf(entry)) {
 				if (IS_ALIGNED(addr, PUD_SIZE) &&
 				    IS_ALIGNED(next, PUD_SIZE)) {
 					if (!direct)
-						vmem_free_pages(pud_deref(*pud), get_order(PUD_SIZE), altmap);
+						vmem_free_pages(pud_deref(entry), get_order(PUD_SIZE), altmap);
 					pud_clear(pud);
 					pages++;
 					continue;
@@ -339,7 +343,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 					split_pud_page(pud, addr & PUD_MASK);
 				}
 			}
-		} else if (pud_none(*pud)) {
+		} else if (pud_none(entry)) {
 			if (IS_ALIGNED(addr, PUD_SIZE) &&
 			    IS_ALIGNED(next, PUD_SIZE) &&
 			    cpu_has_edat2() && direct &&
@@ -352,7 +356,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 			if (!pmd)
 				goto out;
 			pud_populate(&init_mm, pud, pmd);
-		} else if (pud_leaf(*pud)) {
+		} else if (pud_leaf(entry)) {
 			continue;
 		}
 		ret = modify_pmd_table(pud, addr, next, add, direct, altmap);
@@ -375,10 +379,10 @@ static void try_free_pud_table(p4d_t *p4d, unsigned long start)
 
 	pud = pud_offset(p4d, start);
 	for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
-		if (!pud_none(*pud))
+		if (!pud_none(pudp_get(pud)))
 			return;
 	}
-	vmem_free_pages(p4d_deref(*p4d), CRST_ALLOC_ORDER, NULL);
+	vmem_free_pages(p4d_deref(p4dp_get(p4d)), CRST_ALLOC_ORDER, NULL);
 	p4d_clear(p4d);
 }
 
@@ -387,16 +391,17 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 {
 	unsigned long next;
 	int ret = -ENOMEM;
-	p4d_t *p4d;
+	p4d_t *p4d, entry;
 	pud_t *pud;
 
 	p4d = p4d_offset(pgd, addr);
 	for (; addr < end; addr = next, p4d++) {
 		next = p4d_addr_end(addr, end);
+		entry = p4dp_get(p4d);
 		if (!add) {
-			if (p4d_none(*p4d))
+			if (p4d_none(entry))
 				continue;
-		} else if (p4d_none(*p4d)) {
+		} else if (p4d_none(entry)) {
 			pud = vmem_crst_alloc(_REGION3_ENTRY_EMPTY);
 			if (!pud)
 				goto out;
@@ -420,10 +425,10 @@ static void try_free_p4d_table(pgd_t *pgd, unsigned long start)
 
 	p4d = p4d_offset(pgd, start);
 	for (i = 0; i < PTRS_PER_P4D; i++, p4d++) {
-		if (!p4d_none(*p4d))
+		if (!p4d_none(p4dp_get(p4d)))
 			return;
 	}
-	vmem_free_pages(pgd_deref(*pgd), CRST_ALLOC_ORDER, NULL);
+	vmem_free_pages(pgd_deref(pgdp_get(pgd)), CRST_ALLOC_ORDER, NULL);
 	pgd_clear(pgd);
 }
 
@@ -432,7 +437,7 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 {
 	unsigned long addr, next;
 	int ret = -ENOMEM;
-	pgd_t *pgd;
+	pgd_t *pgd, entry;
 	p4d_t *p4d;
 
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
@@ -449,11 +454,12 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 	for (addr = start; addr < end; addr = next) {
 		next = pgd_addr_end(addr, end);
 		pgd = pgd_offset_k(addr);
+		entry = pgdp_get(pgd);
 
 		if (!add) {
-			if (pgd_none(*pgd))
+			if (pgd_none(entry))
 				continue;
-		} else if (pgd_none(*pgd)) {
+		} else if (pgd_none(entry)) {
 			p4d = vmem_crst_alloc(_REGION2_ENTRY_EMPTY);
 			if (!p4d)
 				goto out;
@@ -575,6 +581,8 @@ int vmem_add_mapping(unsigned long start, unsigned long size)
 pte_t *vmem_get_alloc_pte(unsigned long addr, bool alloc)
 {
 	pte_t *ptep = NULL;
+	pud_t pud_entry;
+	pmd_t pmd_entry;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
@@ -582,7 +590,7 @@ pte_t *vmem_get_alloc_pte(unsigned long addr, bool alloc)
 	pte_t *pte;
 
 	pgd = pgd_offset_k(addr);
-	if (pgd_none(*pgd)) {
+	if (pgd_none(pgdp_get(pgd))) {
 		if (!alloc)
 			goto out;
 		p4d = vmem_crst_alloc(_REGION2_ENTRY_EMPTY);
@@ -591,7 +599,7 @@ pte_t *vmem_get_alloc_pte(unsigned long addr, bool alloc)
 		pgd_populate(&init_mm, pgd, p4d);
 	}
 	p4d = p4d_offset(pgd, addr);
-	if (p4d_none(*p4d)) {
+	if (p4d_none(p4dp_get(p4d))) {
 		if (!alloc)
 			goto out;
 		pud = vmem_crst_alloc(_REGION3_ENTRY_EMPTY);
@@ -600,25 +608,27 @@ pte_t *vmem_get_alloc_pte(unsigned long addr, bool alloc)
 		p4d_populate(&init_mm, p4d, pud);
 	}
 	pud = pud_offset(p4d, addr);
-	if (pud_none(*pud)) {
+	pud_entry = pudp_get(pud);
+	if (pud_none(pud_entry)) {
 		if (!alloc)
 			goto out;
 		pmd = vmem_crst_alloc(_SEGMENT_ENTRY_EMPTY);
 		if (!pmd)
 			goto out;
 		pud_populate(&init_mm, pud, pmd);
-	} else if (WARN_ON_ONCE(pud_leaf(*pud))) {
+	} else if (WARN_ON_ONCE(pud_leaf(pud_entry))) {
 		goto out;
 	}
 	pmd = pmd_offset(pud, addr);
-	if (pmd_none(*pmd)) {
+	pmd_entry = pmdp_get(pmd);
+	if (pmd_none(pmd_entry)) {
 		if (!alloc)
 			goto out;
 		pte = vmem_pte_alloc();
 		if (!pte)
 			goto out;
 		pmd_populate(&init_mm, pmd, pte);
-	} else if (WARN_ON_ONCE(pmd_leaf(*pmd))) {
+	} else if (WARN_ON_ONCE(pmd_leaf(pmd_entry))) {
 		goto out;
 	}
 	ptep = pte_offset_kernel(pmd, addr);
-- 
2.53.0


* [PATCH v3 3/7] s390/mm: Batch PTE updates in lazy MMU mode
  2026-06-16 12:40 [PATCH v3 0/7] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
  2026-06-16 12:40 ` [PATCH v3 1/7] mm: Make lazy MMU mode context-aware Alexander Gordeev
  2026-06-16 12:40 ` [PATCH v3 2/7] s390/mm: Complete ptep_get() conversion Alexander Gordeev
@ 2026-06-16 12:40 ` Alexander Gordeev
  2026-06-16 12:40 ` [PATCH v3 4/7] mm/gup: Cleanup pgtable entry accessors Alexander Gordeev
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Alexander Gordeev @ 2026-06-16 12:40 UTC (permalink / raw)
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Make use of the IPTE instruction's "Additional Entries" field to
invalidate multiple PTEs in one go while in lazy MMU mode. This
is the mode in which many memory-management system calls (such as
mremap() and mprotect()) update memory attributes.

To achieve that, the set_pte() and ptep_get() primitives use a
per-CPU cache to store and retrieve PTE values and apply the
cached values to the real page table once lazy MMU mode is left.
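
The caching scheme described above can be illustrated with a small
user-space sketch. It is not the kernel code: struct entry_cache,
cache_set(), cache_get() and cache_flush() are hypothetical names,
NR_ENTRIES stands in for PTRS_PER_PTE, and ENTRY_POISON is an assumed
"nothing cached" marker that must never occur as a real value. The
sketch also leaves out the TLB invalidation that the real write-back
performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_ENTRIES	256	/* stand-in for PTRS_PER_PTE (assumption) */
#define ENTRY_POISON	0x800UL	/* "nothing cached" marker (assumption) */

struct entry_cache {
	uint64_t *table;		/* the real table being shadowed */
	uint64_t cache[NR_ENTRIES];	/* pending values, ENTRY_POISON if none */
	size_t start, end;		/* dirty window [start, end), empty if equal */
};

static void cache_init(struct entry_cache *c, uint64_t *table)
{
	c->table = table;
	c->start = c->end = 0;
	for (size_t i = 0; i < NR_ENTRIES; i++)
		c->cache[i] = ENTRY_POISON;
}

/* Store a new value in the cache only; the real table is untouched. */
static void cache_set(struct entry_cache *c, size_t idx, uint64_t val)
{
	assert(idx < NR_ENTRIES && val != ENTRY_POISON);
	c->cache[idx] = val;
	if (c->start == c->end) {
		c->start = idx;
		c->end = idx + 1;
	} else if (idx < c->start) {
		c->start = idx;
	} else if (idx + 1 > c->end) {
		c->end = idx + 1;
	}
}

/* Return the pending value if there is one, else the real entry. */
static uint64_t cache_get(const struct entry_cache *c, size_t idx)
{
	assert(idx < NR_ENTRIES);
	if (c->cache[idx] == ENTRY_POISON)
		return c->table[idx];
	return c->cache[idx];
}

/* Write all pending values back; returns how many entries were updated. */
static size_t cache_flush(struct entry_cache *c)
{
	size_t written = 0;

	for (size_t i = c->start; i < c->end; i++) {
		if (c->cache[i] == ENTRY_POISON)
			continue;
		c->table[i] = c->cache[i];
		c->cache[i] = ENTRY_POISON;
		written++;
	}
	c->start = c->end = 0;
	return written;
}
```

Reads issued between cache_set() and cache_flush() see the pending
value, which is why both the setter and the getter have to go through
the cache while the mode is active.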

The same is done for memory-management platform callbacks that
would otherwise cause intense per-PTE IPTE traffic, reducing the
number of IPTE instructions from up to PTRS_PER_PTE to a single
instruction in the best case. The average reduction is of course
smaller.

Since all existing page table iterators called in lazy MMU mode
handle one table at a time, the per-CPU cache does not need to be
larger than PTRS_PER_PTE entries. That also naturally aligns with
the IPTE instruction, which must not cross a page table boundary.
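
To make the "up to PTRS_PER_PTE versus one" figure concrete, the
sketch below counts how many ranged invalidations one table would
need if every maximal run of currently valid entries were handled by
a single instruction and invalid entries needed none. This is a
simplified model under stated assumptions: ENTRY_INVALID and
count_invalidation_batches() are hypothetical names, and the model
ignores that a ranged instruction may have to be reissued to finish
its range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRY_INVALID	0x400UL	/* hypothetical "invalid" bit */

static int entry_valid(uint64_t e)
{
	return !(e & ENTRY_INVALID);
}

/* Number of ranged invalidations needed for entries[0..n). */
static size_t count_invalidation_batches(const uint64_t *entries, size_t n)
{
	size_t batches = 0, i = 0;

	assert(entries || !n);
	while (i < n) {
		int valid = entry_valid(entries[i]);
		size_t run = 1;

		/* Extend the run while the valid/invalid state stays the same. */
		while (i + run < n && entry_valid(entries[i + run]) == valid)
			run++;
		if (valid)
			batches++;	/* one ranged invalidation covers the run */
		i += run;
	}
	return batches;
}
```

In this model a fully populated 256-entry table needs one
invalidation, while a table that alternates valid and invalid entries
needs 128, which is the sense in which the average reduction is
smaller than the best case.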

Before this change, the system calls did:

	lazy_mmu_mode_enable_with_ptes()
	...
	<update PTEs>		// up to PTRS_PER_PTE single-IPTEs
	...
	lazy_mmu_mode_disable()

With this change, the system calls do:

	lazy_mmu_mode_enable_with_ptes()
	...
	<store new PTE values in the per-CPU cache>
	...
	lazy_mmu_mode_disable()	// apply cache with one multi-IPTE

When applied to large memory ranges, some system calls show
significant speedups:

    mprotect()    ~15x
    munmap()      ~3x
    mremap()      ~28x

At the same time, fork() shows a measurable slowdown of ~1.5x.

The overall results depend on memory size and access patterns,
but apart from fork() the change generally does not degrade
performance.

In addition to the process-wide impact, the rework affects the
whole Central Electronics Complex (CEC). Each (global) IPTE
instruction initiates a quiesce state in the CEC, so reducing
the number of IPTE instructions relieves CEC-wide quiesce traffic.

In an extreme case of mprotect() continuously triggering the
quiesce state on four LPARs in parallel, measurements show
~25x fewer quiesce events.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 arch/s390/Kconfig                |   1 +
 arch/s390/include/asm/lazy_mmu.h |   9 +
 arch/s390/include/asm/lowcore.h  |   2 +-
 arch/s390/include/asm/pgtable.h  | 157 +++++++++++--
 arch/s390/kernel/setup.c         |   2 +
 arch/s390/kernel/smp.c           |   7 +
 arch/s390/mm/Makefile            |   2 +-
 arch/s390/mm/lazy_mmu.c          | 382 +++++++++++++++++++++++++++++++
 arch/s390/mm/pgtable.c           |   8 +-
 9 files changed, 546 insertions(+), 24 deletions(-)
 create mode 100644 arch/s390/include/asm/lazy_mmu.h
 create mode 100644 arch/s390/mm/lazy_mmu.c

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 998971f9a071..35cb36e29cdb 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -97,6 +97,7 @@ config S390
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_HAS_KCOV
+	select ARCH_HAS_LAZY_MMU_MODE
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_MEM_ENCRYPT
 	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
diff --git a/arch/s390/include/asm/lazy_mmu.h b/arch/s390/include/asm/lazy_mmu.h
new file mode 100644
index 000000000000..98366e9de9bc
--- /dev/null
+++ b/arch/s390/include/asm/lazy_mmu.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LAZY_MMU_H
+#define __LAZY_MMU_H
+
+void lazy_mmu_online_boot_cpu(void);
+int lazy_mmu_online_cpu(gfp_t gfp, unsigned int cpu);
+void lazy_mmu_offline_cpu(unsigned int cpu);
+
+#endif /* __LAZY_MMU_H */
diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
index cd1ddfdb5d35..afddfbf996e7 100644
--- a/arch/s390/include/asm/lowcore.h
+++ b/arch/s390/include/asm/lowcore.h
@@ -163,7 +163,7 @@ struct lowcore {
 	__s32	preempt_count;			/* 0x03a8 */
 	__u32	spinlock_lockval;		/* 0x03ac */
 	__u32	spinlock_index;			/* 0x03b0 */
-	__u8	pad_0x03b4[0x03b8-0x03b4];	/* 0x03b4 */
+	__s32	lazy_mmu_count;			/* 0x03b4 */
 	__u64	percpu_offset;			/* 0x03b8 */
 	__u8	percpu_register;		/* 0x03c0 */
 	__u8	pad_0x03c1[0x0400-0x03c1];	/* 0x03c1 */
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index f9a8a92fa160..2b6659d61fa5 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -39,6 +39,64 @@ enum {
 
 extern atomic_long_t direct_pages_count[PG_DIRECT_MAP_MAX];
 
+bool __lazy_mmu_ptep_test_and_clear_young(unsigned long addr, pte_t *ptep, int *res);
+bool __lazy_mmu_ptep_get_and_clear(unsigned long addr, pte_t *ptep, pte_t *res);
+bool __lazy_mmu_ptep_modify_prot_start(unsigned long addr, pte_t *ptep, pte_t *res);
+bool __lazy_mmu_ptep_modify_prot_commit(unsigned long addr, pte_t *ptep, pte_t old_pte, pte_t pte);
+bool __lazy_mmu_ptep_set_wrprotect(unsigned long addr, pte_t *ptep);
+bool __lazy_mmu_set_pte(pte_t *ptep, pte_t pte);
+bool __lazy_mmu_ptep_get(pte_t *ptep, pte_t *res);
+
+static __always_inline bool is_lazy_mmu_active(void)
+{
+	if (__is_defined(__DECOMPRESSOR))
+		return false;
+	if (!get_lowcore()->lazy_mmu_count)
+		return false;
+	return true;
+}
+
+static inline
+bool lazy_mmu_ptep_test_and_clear_young(unsigned long addr, pte_t *ptep, int *res)
+{
+	if (!is_lazy_mmu_active())
+		return false;
+	return __lazy_mmu_ptep_test_and_clear_young(addr, ptep, res);
+}
+
+static inline
+bool lazy_mmu_ptep_get_and_clear(unsigned long addr, pte_t *ptep, pte_t *res)
+{
+	if (!is_lazy_mmu_active())
+		return false;
+	return __lazy_mmu_ptep_get_and_clear(addr, ptep, res);
+}
+
+static inline
+bool lazy_mmu_ptep_modify_prot_start(unsigned long addr, pte_t *ptep, pte_t *res)
+{
+	if (!is_lazy_mmu_active())
+		return false;
+	return __lazy_mmu_ptep_modify_prot_start(addr, ptep, res);
+}
+
+static inline
+bool lazy_mmu_ptep_modify_prot_commit(unsigned long addr, pte_t *ptep,
+				      pte_t old_pte, pte_t pte)
+{
+	if (!is_lazy_mmu_active())
+		return false;
+	return __lazy_mmu_ptep_modify_prot_commit(addr, ptep, old_pte, pte);
+}
+
+static inline
+bool lazy_mmu_ptep_set_wrprotect(unsigned long addr, pte_t *ptep)
+{
+	if (!is_lazy_mmu_active())
+		return false;
+	return __lazy_mmu_ptep_set_wrprotect(addr, ptep);
+}
+
 static inline void update_page_count(int level, long count)
 {
 	if (IS_ENABLED(CONFIG_PROC_FS))
@@ -978,15 +1036,30 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 	WRITE_ONCE(*pmdp, pmd);
 }
 
-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte(pte_t *ptep, pte_t pte)
 {
 	WRITE_ONCE(*ptep, pte);
 }
 
+static inline void set_pte(pte_t *ptep, pte_t pte)
+{
+	if (!is_lazy_mmu_active() || !__lazy_mmu_set_pte(ptep, pte))
+		__set_pte(ptep, pte);
+}
+
+static inline pte_t __ptep_get(pte_t *ptep)
+{
+	return READ_ONCE(*ptep);
+}
+
 #define ptep_get ptep_get
 static inline pte_t ptep_get(pte_t *ptep)
 {
-	return READ_ONCE(*ptep);
+	pte_t res;
+
+	if (!is_lazy_mmu_active() || !__lazy_mmu_ptep_get(ptep, &res))
+		res = __ptep_get(ptep);
+	return res;
 }
 
 #define pmdp_get pmdp_get
@@ -1179,6 +1252,15 @@ static __always_inline void __ptep_ipte_range(unsigned long address, int nr,
 	} while (nr != 255);
 }
 
+void arch_enter_lazy_mmu_mode_with_ptes(struct mm_struct *mm,
+					unsigned long addr, unsigned long end,
+					pte_t *pte);
+#define arch_enter_lazy_mmu_mode_with_ptes arch_enter_lazy_mmu_mode_with_ptes
+
+void arch_enter_lazy_mmu_mode(void);
+void arch_leave_lazy_mmu_mode(void);
+void arch_flush_lazy_mmu_mode(void);
+
 /*
  * This is hard to understand. ptep_get_and_clear and ptep_clear_flush
  * both clear the TLB for the unmapped pte. The reason is that
@@ -1199,10 +1281,16 @@ pte_t ptep_xchg_lazy(struct mm_struct *, unsigned long, pte_t *, pte_t);
 static inline bool ptep_test_and_clear_young(struct vm_area_struct *vma,
 		unsigned long addr, pte_t *ptep)
 {
-	pte_t pte = ptep_get(ptep);
+	pte_t pte;
+	int res;
 
-	pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
-	return pte_young(pte);
+	if (!lazy_mmu_ptep_test_and_clear_young(addr, ptep, &res)) {
+		pte = __ptep_get(ptep);
+		pte = pte_mkold(pte);
+		pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte);
+		res = pte_young(pte);
+	}
+	return res;
 }
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -1218,7 +1306,8 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 {
 	pte_t res;
 
-	res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+	if (!lazy_mmu_ptep_get_and_clear(addr, ptep, &res))
+		res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
 	page_table_check_pte_clear(mm, addr, res);
 	/* At this point the reference through the mapping is still present */
 	if (mm_is_protected(mm) && pte_present(res))
@@ -1227,9 +1316,34 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 }
 
 #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
-pte_t ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
-void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
-			     pte_t *, pte_t, pte_t);
+pte_t ___ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
+void ___ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
+				pte_t *, pte_t, pte_t);
+
+static inline
+pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
+			     unsigned long addr, pte_t *ptep)
+{
+	pte_t res;
+
+	if (!lazy_mmu_ptep_modify_prot_start(addr, ptep, &res))
+		res = ___ptep_modify_prot_start(vma, addr, ptep);
+	return res;
+}
+
+static inline
+void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t *ptep, pte_t old_pte, pte_t pte)
+{
+	if (!lazy_mmu_ptep_modify_prot_commit(addr, ptep, old_pte, pte))
+		___ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
+}
+
+bool ipte_range_ptep_modify_prot_start(struct vm_area_struct *vma,
+				       unsigned long addr, pte_t *ptep, pte_t *res);
+bool ipte_range_ptep_modify_prot_commit(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep,
+					pte_t old_pte, pte_t pte);
 
 #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
 static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
@@ -1259,11 +1373,13 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 {
 	pte_t res;
 
-	if (full) {
-		res = ptep_get(ptep);
-		set_pte(ptep, __pte(_PAGE_INVALID));
-	} else {
-		res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+	if (!lazy_mmu_ptep_get_and_clear(addr, ptep, &res)) {
+		if (full) {
+			res = __ptep_get(ptep);
+			__set_pte(ptep, __pte(_PAGE_INVALID));
+		} else {
+			res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+		}
 	}
 	page_table_check_pte_clear(mm, addr, res);
 	/* At this point the reference through the mapping is still present */
@@ -1289,10 +1405,15 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
-	pte_t pte = ptep_get(ptep);
+	pte_t pte;
 
-	if (pte_write(pte))
-		ptep_xchg_lazy(mm, addr, ptep, pte_wrprotect(pte));
+	if (!lazy_mmu_ptep_set_wrprotect(addr, ptep)) {
+		pte = __ptep_get(ptep);
+		if (pte_write(pte)) {
+			pte = pte_wrprotect(pte);
+			ptep_xchg_lazy(mm, addr, ptep, pte);
+		}
+	}
 }
 
 /*
@@ -1325,7 +1446,7 @@ static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
 	 * PTE does not have _PAGE_PROTECT set, to avoid unnecessary overhead.
 	 * A local RDP can be used to do the flush.
 	 */
-	if (cpu_has_rdp() && !(pte_val(ptep_get(ptep)) & _PAGE_PROTECT))
+	if (cpu_has_rdp() && !(pte_val(__ptep_get(ptep)) & _PAGE_PROTECT))
 		__ptep_rdp(address, ptep, 1);
 }
 #define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index b60284328fe3..f5a3c9e1b6b8 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -77,6 +77,7 @@
 #include <asm/maccess.h>
 #include <asm/uv.h>
 #include <asm/asm-offsets.h>
+#include <asm/lazy_mmu.h>
 #include "entry.h"
 
 /*
@@ -1012,5 +1013,6 @@ void __init setup_arch(char **cmdline_p)
 
 void __init arch_cpu_finalize_init(void)
 {
+	lazy_mmu_online_boot_cpu();
 	sclp_init();
 }
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 50bb499cf3e5..4a778bc186a4 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -60,6 +60,7 @@
 #include <asm/topology.h>
 #include <asm/vdso.h>
 #include <asm/maccess.h>
+#include <asm/lazy_mmu.h>
 #include "entry.h"
 
 enum {
@@ -867,6 +868,11 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle)
 	rc = pcpu_alloc_lowcore(pcpu, cpu);
 	if (rc)
 		return rc;
+	rc = lazy_mmu_online_cpu(GFP_KERNEL, cpu);
+	if (rc) {
+		pcpu_free_lowcore(pcpu, cpu);
+		return rc;
+	}
 	/*
 	 * Make sure global control register contents do not change
 	 * until new CPU has initialized control registers.
@@ -922,6 +928,7 @@ void __cpu_die(unsigned int cpu)
 	pcpu = per_cpu_ptr(&pcpu_devices, cpu);
 	while (!pcpu_stopped(pcpu))
 		cpu_relax();
+	lazy_mmu_offline_cpu(cpu);
 	pcpu_free_lowcore(pcpu, cpu);
 	cpumask_clear_cpu(cpu, mm_cpumask(&init_mm));
 	cpumask_clear_cpu(cpu, &init_mm.context.cpu_attach_mask);
diff --git a/arch/s390/mm/Makefile b/arch/s390/mm/Makefile
index 193899c39ca7..26e9fc11543a 100644
--- a/arch/s390/mm/Makefile
+++ b/arch/s390/mm/Makefile
@@ -3,7 +3,7 @@
 # Makefile for the linux s390-specific parts of the memory manager.
 #
 
-obj-y		:= init.o fault.o extmem.o mmap.o vmem.o maccess.o
+obj-y		:= init.o fault.o extmem.o mmap.o vmem.o maccess.o lazy_mmu.o
 obj-y		+= page-states.o pageattr.o pgtable.o pgalloc.o extable.o
 
 obj-$(CONFIG_CMM)		+= cmm.o
diff --git a/arch/s390/mm/lazy_mmu.c b/arch/s390/mm/lazy_mmu.c
new file mode 100644
index 000000000000..d75b93d9b0de
--- /dev/null
+++ b/arch/s390/mm/lazy_mmu.c
@@ -0,0 +1,382 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/pgtable.h>
+#include <linux/kasan.h>
+#include <linux/slab.h>
+#include <asm/facility.h>
+#include <asm/lazy_mmu.h>
+#include <kunit/visibility.h>
+
+#define PTE_POISON	_PAGE_LARGE
+
+struct ipte_range {
+	struct mm_struct *mm;
+	unsigned long base_addr;
+	unsigned long base_end;
+	pte_t *base_pte;
+	pte_t *start_pte;
+	pte_t *end_pte;
+	pte_t cache[PTRS_PER_PTE];
+};
+
+static DEFINE_PER_CPU(struct ipte_range *, ipte_range);
+
+static int count_contiguous(pte_t *start, pte_t *end, bool *valid)
+{
+	unsigned long page_invalid_bit;
+	pte_t *ptep, pte;
+
+	pte = __ptep_get(start);
+	page_invalid_bit = pte_val(pte) & _PAGE_INVALID;
+
+	for (ptep = start + 1; ptep < end; ptep++) {
+		pte = __ptep_get(ptep);
+		if ((pte_val(pte) & _PAGE_INVALID) != page_invalid_bit)
+			break;
+	}
+
+	*valid = !(page_invalid_bit);
+	return ptep - start;
+}
+
+static void __invalidate_pte_range(struct mm_struct *mm, unsigned long addr,
+				   int nr_ptes, pte_t *ptep)
+{
+	atomic_inc(&mm->context.flush_count);
+	if (cpu_has_tlb_lc() && cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id())))
+		__ptep_ipte_range(addr, nr_ptes - 1, ptep, IPTE_LOCAL);
+	else
+		__ptep_ipte_range(addr, nr_ptes - 1, ptep, IPTE_GLOBAL);
+	atomic_dec(&mm->context.flush_count);
+}
+
+static int invalidate_pte_range(struct mm_struct *mm, unsigned long addr,
+				pte_t *start, pte_t *end)
+{
+	int nr_ptes;
+	bool valid;
+
+	nr_ptes = count_contiguous(start, end, &valid);
+	if (valid)
+		__invalidate_pte_range(mm, addr, nr_ptes, start);
+
+	return nr_ptes;
+}
+
+static void set_pte_range(struct mm_struct *mm, unsigned long addr,
+			  pte_t *ptep, pte_t *end, pte_t *cache)
+{
+	int i, nr_ptes;
+
+	while (ptep < end) {
+		nr_ptes = invalidate_pte_range(mm, addr, ptep, end);
+
+		for (i = 0; i < nr_ptes; i++, ptep++, cache++) {
+			__set_pte(ptep, *cache);
+			*cache = __pte(PTE_POISON);
+		}
+
+		addr += nr_ptes * PAGE_SIZE;
+	}
+}
+
+static void enter_ipte_norange(void)
+{
+	struct ipte_range __maybe_unused *range;
+
+	if (!test_facility(13))
+		return;
+
+	range = get_cpu_var(ipte_range);
+	get_lowcore()->lazy_mmu_count++;
+}
+
+static void enter_ipte_range(struct mm_struct *mm,
+			     unsigned long addr, unsigned long end, pte_t *pte)
+{
+	struct ipte_range *range;
+
+	if (!test_facility(13))
+		return;
+
+	range = get_cpu_var(ipte_range);
+	get_lowcore()->lazy_mmu_count++;
+
+	range->mm = mm;
+	range->base_addr = addr;
+	range->base_end = end;
+	range->base_pte = pte;
+}
+
+static void leave_ipte_range(void)
+{
+	pte_t *ptep, *start, *start_cache, *cache;
+	unsigned long start_addr, addr;
+	struct ipte_range *range;
+	int start_idx;
+
+	if (!test_facility(13))
+		return;
+
+	lockdep_assert_preemption_disabled();
+	range = this_cpu_read(ipte_range);
+	if (!range->mm)
+		goto norange;
+	if (!range->start_pte)
+		goto done;
+
+	start = range->start_pte;
+	start_idx = range->start_pte - range->base_pte;
+	start_addr = range->base_addr + start_idx * PAGE_SIZE;
+	addr = start_addr;
+	start_cache = &range->cache[start_idx];
+	cache = start_cache;
+	for (ptep = start; ptep < range->end_pte; ptep++, cache++, addr += PAGE_SIZE) {
+		if (pte_val(*cache) == PTE_POISON) {
+			if (start) {
+				set_pte_range(range->mm, start_addr, start, ptep, start_cache);
+				start = NULL;
+			}
+		} else if (!start) {
+			start = ptep;
+			start_addr = addr;
+			start_cache = cache;
+		}
+	}
+	set_pte_range(range->mm, start_addr, start, ptep, start_cache);
+
+	range->start_pte = NULL;
+	range->end_pte = NULL;
+
+done:
+	range->mm = NULL;
+	range->base_addr = 0;
+	range->base_end = 0;
+	range->base_pte = NULL;
+
+norange:
+	get_lowcore()->lazy_mmu_count--;
+	put_cpu_var(ipte_range);
+}
+
+static void flush_lazy_mmu_mode(void)
+{
+	unsigned long addr, end;
+	struct ipte_range *range;
+	struct mm_struct *mm;
+	pte_t *pte;
+
+	if (!test_facility(13))
+		return;
+
+	range = get_cpu_var(ipte_range);
+	if (range->mm) {
+		mm = range->mm;
+		addr = range->base_addr;
+		end = range->base_end;
+		pte = range->base_pte;
+
+		leave_ipte_range();
+		enter_ipte_range(mm, addr, end, pte);
+	}
+	put_cpu_var(ipte_range);
+}
+
+void arch_enter_lazy_mmu_mode(void)
+{
+	enter_ipte_norange();
+}
+EXPORT_SYMBOL_IF_KUNIT(arch_enter_lazy_mmu_mode);
+
+void arch_enter_lazy_mmu_mode_with_ptes(struct mm_struct *mm,
+					unsigned long addr, unsigned long end,
+					pte_t *pte)
+{
+	enter_ipte_range(mm, addr, end, pte);
+}
+EXPORT_SYMBOL_IF_KUNIT(arch_enter_lazy_mmu_mode_with_ptes);
+
+void arch_leave_lazy_mmu_mode(void)
+{
+	leave_ipte_range();
+}
+EXPORT_SYMBOL_IF_KUNIT(arch_leave_lazy_mmu_mode);
+
+void arch_flush_lazy_mmu_mode(void)
+{
+	flush_lazy_mmu_mode();
+}
+EXPORT_SYMBOL_IF_KUNIT(arch_flush_lazy_mmu_mode);
+
+static void __ipte_range_set_pte(struct ipte_range *range, pte_t *ptep, pte_t pte)
+{
+	unsigned int idx = ptep - range->base_pte;
+
+	lockdep_assert_preemption_disabled();
+	range->cache[idx] = pte;
+
+	if (!range->start_pte) {
+		range->start_pte = ptep;
+		range->end_pte = ptep + 1;
+	} else if (ptep < range->start_pte) {
+		range->start_pte = ptep;
+	} else if (ptep + 1 > range->end_pte) {
+		range->end_pte = ptep + 1;
+	}
+}
+
+static pte_t __ipte_range_ptep_get(struct ipte_range *range, pte_t *ptep)
+{
+	unsigned int idx = ptep - range->base_pte;
+
+	lockdep_assert_preemption_disabled();
+	if (pte_val(range->cache[idx]) == PTE_POISON)
+		return __ptep_get(ptep);
+	return range->cache[idx];
+}
+
+static struct ipte_range *this_ipte_range(pte_t *ptep)
+{
+	struct ipte_range *range;
+	unsigned int nr_ptes;
+
+	range = this_cpu_read(ipte_range);
+	if (ptep < range->base_pte)
+		return NULL;
+	nr_ptes = (range->base_end - range->base_addr) / PAGE_SIZE;
+	if (ptep >= range->base_pte + nr_ptes)
+		return NULL;
+
+	return range;
+}
+
+bool __lazy_mmu_set_pte(pte_t *ptep, pte_t pte)
+{
+	struct ipte_range *range;
+
+	range = this_ipte_range(ptep);
+	if (!range)
+		return false;
+
+	__ipte_range_set_pte(range, ptep, pte);
+
+	return true;
+}
+
+bool __lazy_mmu_ptep_get(pte_t *ptep, pte_t *res)
+{
+	struct ipte_range *range;
+
+	range = this_ipte_range(ptep);
+	if (!range)
+		return false;
+
+	*res = __ipte_range_ptep_get(range, ptep);
+
+	return true;
+}
+
+bool __lazy_mmu_ptep_test_and_clear_young(unsigned long addr, pte_t *ptep, int *res)
+{
+	struct ipte_range *range;
+	pte_t pte, old;
+
+	range = this_ipte_range(ptep);
+	if (!range)
+		return false;
+
+	old = __ipte_range_ptep_get(range, ptep);
+	pte = pte_mkold(old);
+	__ipte_range_set_pte(range, ptep, pte);
+	*res = pte_young(old);
+
+	return true;
+}
+
+bool __lazy_mmu_ptep_get_and_clear(unsigned long addr, pte_t *ptep, pte_t *res)
+{
+	struct ipte_range *range;
+	pte_t pte, old;
+
+	range = this_ipte_range(ptep);
+	if (!range)
+		return false;
+
+	old = __ipte_range_ptep_get(range, ptep);
+	pte = __pte(_PAGE_INVALID);
+	__ipte_range_set_pte(range, ptep, pte);
+	*res = old;
+
+	return true;
+}
+
+bool __lazy_mmu_ptep_modify_prot_start(unsigned long addr, pte_t *ptep, pte_t *res)
+{
+	return __lazy_mmu_ptep_get_and_clear(addr, ptep, res);
+}
+
+bool __lazy_mmu_ptep_modify_prot_commit(unsigned long addr, pte_t *ptep,
+					pte_t old_pte, pte_t pte)
+{
+	struct ipte_range *range;
+
+	range = this_ipte_range(ptep);
+	if (!range)
+		return false;
+
+	__ipte_range_set_pte(range, ptep, pte);
+
+	return true;
+}
+
+bool __lazy_mmu_ptep_set_wrprotect(unsigned long addr, pte_t *ptep)
+{
+	struct ipte_range *range;
+	pte_t pte;
+
+	range = this_ipte_range(ptep);
+	if (!range)
+		return false;
+
+	pte = __ipte_range_ptep_get(range, ptep);
+	if (pte_write(pte)) {
+		pte = pte_wrprotect(pte);
+		__ipte_range_set_pte(range, ptep, pte);
+	}
+
+	return true;
+}
+
+int lazy_mmu_online_cpu(gfp_t gfp, unsigned int cpu)
+{
+	struct ipte_range *range;
+	int i;
+
+	if (!test_facility(13))
+		return 0;
+
+	range = kzalloc_obj(*range, gfp);
+	if (!range)
+		return -ENOMEM;
+	for (i = 0; i < ARRAY_SIZE(range->cache); i++)
+		range->cache[i] = __pte(PTE_POISON);
+	per_cpu(ipte_range, cpu) = range;
+
+	return 0;
+}
+
+void lazy_mmu_offline_cpu(unsigned int cpu)
+{
+	struct ipte_range *range;
+
+	if (!test_facility(13))
+		return;
+
+	range = per_cpu(ipte_range, cpu);
+	per_cpu(ipte_range, cpu) = NULL;
+	kfree(range);
+}
+
+void __init lazy_mmu_online_boot_cpu(void)
+{
+	lazy_mmu_online_cpu(GFP_ATOMIC, 0);
+}
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 2acc79383e7d..d18a3263b549 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -170,14 +170,14 @@ pte_t ptep_xchg_lazy(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL(ptep_xchg_lazy);
 
-pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t *ptep)
+pte_t ___ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
+				pte_t *ptep)
 {
 	return ptep_flush_lazy(vma->vm_mm, addr, ptep, 1);
 }
 
-void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t *ptep, pte_t old_pte, pte_t pte)
+void ___ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
+				pte_t *ptep, pte_t old_pte, pte_t pte)
 {
 	if (pte_present(pte))
 		pte = clear_pte_bit(pte, __pgprot(_PAGE_UNUSED));
-- 
2.53.0


* [PATCH v3 4/7] mm/gup: Cleanup pgtable entry accessors
  2026-06-16 12:40 [PATCH v3 0/7] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
                   ` (2 preceding siblings ...)
  2026-06-16 12:40 ` [PATCH v3 3/7] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
@ 2026-06-16 12:40 ` Alexander Gordeev
  2026-06-16 12:40 ` [PATCH v3 5/7] mm/page_vma_mapped_walk: Use ptep_get_lockless() for lockless access Alexander Gordeev
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Alexander Gordeev @ 2026-06-16 12:40 UTC (permalink / raw)
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Revalidation of PMD and PUD entries has the same semantics as
revalidation of PTE entries. Convert the remaining direct entry
dereferences to the corresponding accessors.

The PTE revalidation in gup_fast_pte_range() is inconsistent with the
earlier read of the value: the value is acquired with the lockless
accessor, but the recheck drops the lockless access semantics.

Use the lockless accessor not only for the PTE revalidation, but also
for the PMD revalidation, which is likewise inconsistent with the
earlier value acquisition in gup_fast_pmd_range().
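
The revalidation pattern these checks implement can be sketched in
user-space C11 as follows. This only illustrates the idea and is not
the GUP-fast code: pin_if_unchanged() and its parameters are
hypothetical, and C11 atomics stand in for the kernel's lockless
accessors. The caller samples the entry, a reference is taken, and the
entry is read again with the same kind of access to confirm it did not
change in between.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Take a reference, then confirm that *entryp still holds the value
 * the caller sampled earlier ('orig'). If the entry changed, drop the
 * reference again and report failure so the caller can fall back.
 */
static bool pin_if_unchanged(_Atomic uint64_t *entryp, uint64_t orig,
			     atomic_int *refcount)
{
	assert(entryp && refcount);
	atomic_fetch_add(refcount, 1);
	if (atomic_load(entryp) != orig) {
		atomic_fetch_sub(refcount, 1);
		return false;
	}
	return true;
}
```

The second read uses the same atomic load as the caller's initial
sample, mirroring the point that the recheck should not be weaker than
the read it validates.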

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 mm/gup.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ad9ded39609c..0692119b7904 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2865,8 +2865,8 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 		if (!folio)
 			goto pte_unmap;
 
-		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
-		    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
+		if (unlikely(pmd_val(pmd) != pmd_val(pmdp_get_lockless(pmdp))) ||
+		    unlikely(pte_val(pte) != pte_val(ptep_get_lockless(ptep)))) {
 			gup_put_folio(folio, 1, flags);
 			goto pte_unmap;
 		}
@@ -2942,7 +2942,7 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (!folio)
 		return 0;
 
-	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+	if (unlikely(pmd_val(orig) != pmd_val(pmdp_get_lockless(pmdp)))) {
 		gup_put_folio(folio, refs, flags);
 		return 0;
 	}
@@ -2985,7 +2985,7 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (!folio)
 		return 0;
 
-	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
+	if (unlikely(pud_val(orig) != pud_val(pudp_get(pudp)))) {
 		gup_put_folio(folio, refs, flags);
 		return 0;
 	}
-- 
2.53.0



* [PATCH v3 5/7] mm/page_vma_mapped_walk: Use ptep_get_lockless() for lockless access
  2026-06-16 12:40 [PATCH v3 0/7] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
                   ` (3 preceding siblings ...)
  2026-06-16 12:40 ` [PATCH v3 4/7] mm/gup: Cleanup pgtable entry accessors Alexander Gordeev
@ 2026-06-16 12:40 ` Alexander Gordeev
  2026-06-16 12:40 ` [PATCH v3 6/7] mm/kasan: Introduce helpers for lazy MMU mode sanitizer Alexander Gordeev
  2026-06-16 12:40 ` [PATCH v3 7/7] s390/mm: Lazy " Alexander Gordeev
  6 siblings, 0 replies; 8+ messages in thread
From: Alexander Gordeev @ 2026-06-16 12:40 UTC (permalink / raw)
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Switch from the ptep_get() accessor to ptep_get_lockless() for
PTE reads performed while no lock is taken.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 mm/page_vma_mapped.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index a4d52fdb3056..2ccbabfb2cc1 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -41,7 +41,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
 	if (!pvmw->pte)
 		return false;
 
-	ptent = ptep_get(pvmw->pte);
+	ptent = ptep_get_lockless(pvmw->pte);
 
 	if (pte_none(ptent)) {
 		return false;
@@ -183,6 +183,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long end;
 	spinlock_t *ptl;
+	pte_t pteval;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
@@ -310,7 +311,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 				goto restart;
 			}
 			pvmw->pte++;
-		} while (pte_none(ptep_get(pvmw->pte)));
+			if (!pvmw->ptl)
+				pteval = ptep_get_lockless(pvmw->pte);
+			else
+				pteval = ptep_get(pvmw->pte);
+		} while (pte_none(pteval));
 
 		if (!pvmw->ptl) {
 			spin_lock(ptl);
-- 
2.53.0



* [PATCH v3 6/7] mm/kasan: Introduce helpers for lazy MMU mode sanitizer
  2026-06-16 12:40 [PATCH v3 0/7] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
                   ` (4 preceding siblings ...)
  2026-06-16 12:40 ` [PATCH v3 5/7] mm/page_vma_mapped_walk: Use ptep_get_lockless() for lockless access Alexander Gordeev
@ 2026-06-16 12:40 ` Alexander Gordeev
  2026-06-16 12:40 ` [PATCH v3 7/7] s390/mm: Lazy " Alexander Gordeev
  6 siblings, 0 replies; 8+ messages in thread
From: Alexander Gordeev @ 2026-06-16 12:40 UTC (permalink / raw)
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Provide helpers that allow architectures to detect
illegitimate direct PTE accesses while the lazy MMU
mode is enabled, such as:

	pte_t pte = *ptep;
	*ptep = pte;

The legitimate form of these accesses goes through the accessors:

	pte_t pte = ptep_get(ptep);
	set_pte(ptep, pte);

The direct PTE accesses pose a real issue on s390.
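
The poison/unpoison idea behind the helpers can be modelled with a
toy shadow map, as sketched below. This is a user-space illustration
under stated assumptions, not the KASAN implementation: shadow[],
poison_slots(), unpoison_slots() and direct_access_allowed() are
hypothetical names, and one shadow byte per slot is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NR_SLOTS	8
#define TOY_POISON	0xFD	/* same byte value the patch assigns to KASAN_LAZY_MMU_PTE */

/* Hypothetical shadow map: one byte per slot, nonzero means "poisoned". */
static unsigned char shadow[NR_SLOTS];

/* Mark nr slots starting at first as inaccessible to direct accesses. */
static void poison_slots(size_t first, size_t nr)
{
	assert(first + nr <= NR_SLOTS);
	memset(&shadow[first], TOY_POISON, nr);
}

/* Make nr slots starting at first accessible again. */
static void unpoison_slots(size_t first, size_t nr)
{
	assert(first + nr <= NR_SLOTS);
	memset(&shadow[first], 0, nr);
}

/* The check an instrumented direct access performs before touching a slot. */
static bool direct_access_allowed(size_t idx)
{
	assert(idx < NR_SLOTS);
	return shadow[idx] == 0;
}
```

In this model a direct dereference of a poisoned slot fails the
check, while an accessor may unpoison the slot around its own access.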

Suggested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 include/linux/kasan.h | 16 ++++++++++++++++
 mm/kasan/common.c     | 10 ++++++++++
 mm/kasan/kasan.h      |  2 ++
 3 files changed, 28 insertions(+)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index bf233bde68c7..deadf566b84a 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -134,6 +134,20 @@ static __always_inline void kasan_poison_slab(struct slab *slab)
 		__kasan_poison_slab(slab);
 }
 
+void __kasan_poison_pte(pte_t *pte, int nr);
+static __always_inline void kasan_poison_pte(pte_t *pte, int nr)
+{
+	if (kasan_enabled())
+		__kasan_poison_pte(pte, nr);
+}
+
+void __kasan_unpoison_pte(pte_t *pte, int nr);
+static __always_inline void kasan_unpoison_pte(pte_t *pte, int nr)
+{
+	if (kasan_enabled())
+		__kasan_unpoison_pte(pte, nr);
+}
+
 void __kasan_unpoison_new_object(struct kmem_cache *cache, void *object);
 /**
  * kasan_unpoison_new_object - Temporarily unpoison a new slab object.
@@ -414,6 +428,8 @@ static inline bool kasan_unpoison_pages(struct page *page, unsigned int order,
 	return false;
 }
 static inline void kasan_poison_slab(struct slab *slab) {}
+static inline void kasan_poison_pte(pte_t *pte, int nr) {}
+static inline void kasan_unpoison_pte(pte_t *pte, int nr) {}
 static inline void kasan_unpoison_new_object(struct kmem_cache *cache,
 					void *object) {}
 static inline void kasan_poison_new_object(struct kmem_cache *cache,
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index b7d05c2a6d93..cbf68680614e 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -163,6 +163,16 @@ void __kasan_poison_slab(struct slab *slab)
 		     KASAN_SLAB_REDZONE, false);
 }
 
+void __kasan_poison_pte(pte_t *pte, int nr)
+{
+	kasan_poison(pte, sizeof(*pte) * nr, KASAN_LAZY_MMU_PTE, false);
+}
+
+void __kasan_unpoison_pte(pte_t *pte, int nr)
+{
+	kasan_unpoison(pte, sizeof(*pte) * nr, false);
+}
+
 void __kasan_unpoison_new_object(struct kmem_cache *cache, void *object)
 {
 	kasan_unpoison(object, cache->object_size, false);
diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index fc9169a54766..8ba0fbabd75b 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -144,12 +144,14 @@ static inline bool kasan_requires_meta(void)
 #define KASAN_PAGE_REDZONE	0xFE  /* redzone for kmalloc_large allocation */
 #define KASAN_SLAB_REDZONE	0xFC  /* redzone for slab object */
 #define KASAN_SLAB_FREE		0xFB  /* freed slab object */
+#define KASAN_LAZY_MMU_PTE	0xFD
 #define KASAN_VMALLOC_INVALID	0xF8  /* inaccessible space in vmap area */
 #else
 #define KASAN_PAGE_FREE		KASAN_TAG_INVALID
 #define KASAN_PAGE_REDZONE	KASAN_TAG_INVALID
 #define KASAN_SLAB_REDZONE	KASAN_TAG_INVALID
 #define KASAN_SLAB_FREE		KASAN_TAG_INVALID
+#define KASAN_LAZY_MMU_PTE	KASAN_TAG_INVALID
 #define KASAN_VMALLOC_INVALID	KASAN_TAG_INVALID /* only used for SW_TAGS */
 #endif
 
-- 
2.53.0



* [PATCH v3 7/7] s390/mm: Lazy MMU mode sanitizer
  2026-06-16 12:40 [PATCH v3 0/7] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
                   ` (5 preceding siblings ...)
  2026-06-16 12:40 ` [PATCH v3 6/7] mm/kasan: Introduce helpers for lazy MMU mode sanitizer Alexander Gordeev
@ 2026-06-16 12:40 ` Alexander Gordeev
  6 siblings, 0 replies; 8+ messages in thread
From: Alexander Gordeev @ 2026-06-16 12:40 UTC (permalink / raw)
  To: Gerald Schaefer, Heiko Carstens, Christian Borntraeger,
	Vasily Gorbik, Claudio Imbrenda
  Cc: linux-s390, linux-mm, linux-kernel, Kevin Brodsky,
	David Hildenbrand

Detect accesses to PTEs in lazy MMU mode by means other
than the set_pte() and ptep_get() primitives. Such an
access would be a read hazard.

The KASAN shadow memory check in ptep_get_lockless()
falsely reports an invalid access if a concurrent lazy
MMU access to the same PTE is in progress. To avoid that,
disable instrumentation for ptep_get_lockless() altogether.
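
The read path touched by this patch can be pictured with the toy
model below: a pending value in the cache wins, otherwise the backing
entry is read inside a short unpoisoned window. This is a user-space
sketch under stated assumptions, not the s390 code: struct toy_range
and the toy_range_*() functions are hypothetical, the poisoned[]
array merely stands in for KASAN shadow memory, and CACHE_EMPTY
reuses the 0x800 value the cover letter gives for PTE_POISON.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ENTRIES	4
#define CACHE_EMPTY	0x800UL	/* sentinel: no pending value for this entry */

struct toy_range {
	unsigned long *base;			/* backing table */
	unsigned long cache[NR_ENTRIES];	/* pending, not yet written values */
	unsigned char poisoned[NR_ENTRIES];	/* toy stand-in for KASAN shadow */
};

/* Enter the batched mode: empty the cache and poison the whole range. */
static void toy_range_enter(struct toy_range *r, unsigned long *table)
{
	size_t i;

	r->base = table;
	for (i = 0; i < NR_ENTRIES; i++) {
		r->cache[i] = CACHE_EMPTY;
		r->poisoned[i] = 1;
	}
}

/* Writes only land in the cache; the backing table is left untouched. */
static void toy_range_set(struct toy_range *r, size_t idx, unsigned long val)
{
	assert(idx < NR_ENTRIES);
	r->cache[idx] = val;
}

/* Reads prefer the cache; a miss reads the table in an unpoisoned window. */
static unsigned long toy_range_get(struct toy_range *r, size_t idx)
{
	unsigned long val;

	assert(idx < NR_ENTRIES);
	if (r->cache[idx] != CACHE_EMPTY)
		return r->cache[idx];

	r->poisoned[idx] = 0;
	val = r->base[idx];
	r->poisoned[idx] = 1;
	return val;
}
```

The entry is poisoned again before the accessor returns, so only the
accessor itself ever sees the entry unpoisoned in this model.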

Suggested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h |  6 ++++++
 arch/s390/mm/lazy_mmu.c         | 27 +++++++++++++++++++++++----
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2b6659d61fa5..a93e7e786457 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1047,6 +1047,12 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
 		__set_pte(ptep, pte);
 }
 
+#define ptep_get_lockless ptep_get_lockless
+static inline __no_sanitize_address pte_t ptep_get_lockless(pte_t *ptep)
+{
+	return READ_ONCE(*ptep);
+}
+
 static inline pte_t __ptep_get(pte_t *ptep)
 {
 	return READ_ONCE(*ptep);
diff --git a/arch/s390/mm/lazy_mmu.c b/arch/s390/mm/lazy_mmu.c
index d75b93d9b0de..ee2385897bc7 100644
--- a/arch/s390/mm/lazy_mmu.c
+++ b/arch/s390/mm/lazy_mmu.c
@@ -63,10 +63,13 @@ static int invalidate_pte_range(struct mm_struct *mm, unsigned long addr,
 }
 
 static void set_pte_range(struct mm_struct *mm, unsigned long addr,
-			  pte_t *ptep, pte_t *end, pte_t *cache)
+			  pte_t *start, pte_t *end, pte_t *cache)
 {
-	int i, nr_ptes;
+	int nr_ptes, nr_total = end - start;
+	pte_t *ptep = start;
+	int i;
 
+	kasan_unpoison_pte(start, nr_total);
 	while (ptep < end) {
 		nr_ptes = invalidate_pte_range(mm, addr, ptep, end);
 
@@ -77,6 +80,7 @@ static void set_pte_range(struct mm_struct *mm, unsigned long addr,
 
 		addr += nr_ptes * PAGE_SIZE;
 	}
+	kasan_poison_pte(start, nr_total);
 }
 
 static void enter_ipte_norange(void)
@@ -94,6 +98,7 @@ static void enter_ipte_range(struct mm_struct *mm,
 			     unsigned long addr, unsigned long end, pte_t *pte)
 {
 	struct ipte_range *range;
+	unsigned int nr_ptes;
 
 	if (!test_facility(13))
 		return;
@@ -105,6 +110,9 @@ static void enter_ipte_range(struct mm_struct *mm,
 	range->base_addr = addr;
 	range->base_end = end;
 	range->base_pte = pte;
+
+	nr_ptes = (range->base_end - range->base_addr) / PAGE_SIZE;
+	kasan_poison_pte(range->base_pte, nr_ptes);
 }
 
 static void leave_ipte_range(void)
@@ -112,6 +120,7 @@ static void leave_ipte_range(void)
 	pte_t *ptep, *start, *start_cache, *cache;
 	unsigned long start_addr, addr;
 	struct ipte_range *range;
+	unsigned int nr_ptes;
 	int start_idx;
 
 	if (!test_facility(13))
@@ -148,6 +157,9 @@ static void leave_ipte_range(void)
 	range->end_pte = NULL;
 
 done:
+	nr_ptes = (range->base_end - range->base_addr) / PAGE_SIZE;
+	kasan_unpoison_pte(range->base_pte, nr_ptes);
+
 	range->mm = NULL;
 	range->base_addr = 0;
 	range->base_end = 0;
@@ -227,10 +239,17 @@ static void __ipte_range_set_pte(struct ipte_range *range, pte_t *ptep, pte_t pt
 static pte_t __ipte_range_ptep_get(struct ipte_range *range, pte_t *ptep)
 {
 	unsigned int idx = ptep - range->base_pte;
+	pte_t pte;
 
 	lockdep_assert_preemption_disabled();
-	if (pte_val(range->cache[idx]) == PTE_POISON)
-		return __ptep_get(ptep);
+	if (pte_val(range->cache[idx]) == PTE_POISON) {
+		kasan_unpoison_pte(ptep, 1);
+		pte = __ptep_get(ptep);
+		kasan_poison_pte(ptep, 1);
+
+		return pte;
+	}
+
 	return range->cache[idx];
 }
 
-- 
2.53.0



end of thread, other threads:[~2026-06-16 12:41 UTC | newest]
