linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v4 0/4] Optimize mprotect() for large folios
@ 2025-06-28 11:34 Dev Jain
  2025-06-28 11:34 ` [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
                   ` (6 more replies)
  0 siblings, 7 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-28 11:34 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

This patchset optimizes the mprotect() system call for large folios
by PTE-batching. No issues were observed with mm-selftests, build
tested on x86_64.

We use the following test cases to measure performance, mprotect()'ing
the mapped memory to read-only then read-write 40 times:

Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
pte-mapping those THPs
Test case 2: Mapping 1G of memory with 64K mTHPs
Test case 3: Mapping 1G of memory with 4K pages

Average execution time on arm64, Apple M3:
Before the patchset:
T1: 7.9 seconds   T2: 7.9 seconds   T3: 4.2 seconds

After the patchset:
T1: 2.1 seconds   T2: 2.2 seconds   T3: 4.3 seconds

Comparing T1/T2 with T3 before the patchset shows the regression introduced
by ptep_get() on a contpte block; this patchset removes it. For large folios
we get an almost 74% performance improvement ((7.9 - 2.1) / 7.9 ~ 73.4%),
the trade-off being a slight degradation in the small folio case.

Here is the test program:

 #define _GNU_SOURCE
 #include <sys/mman.h>
 #include <stdlib.h>
 #include <string.h>
 #include <stdio.h>
 #include <unistd.h>

 #define SIZE (1024*1024*1024)

unsigned long pmdsize = (1UL << 21);
unsigned long pagesize = (1UL << 12);

static void pte_map_thps(char *mem, size_t size)
{
	size_t offs;
	int ret = 0;

	/* PTE-map each THP by temporarily splitting the VMAs. */
	for (offs = 0; offs < size; offs += pmdsize) {
		ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
		ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
	}

	if (ret) {
		fprintf(stderr, "ERROR: madvise() failed\n");
		exit(1);
	}
}

int main(int argc, char *argv[])
{
	char *p;

	p = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p != (void *)(1UL << 30)) {
		perror("mmap");
		return 1;
	}

	/* Touch the memory to get PMD-THPs (test case 1). */
	memset(p, 0, SIZE);
	if (madvise(p, SIZE, MADV_NOHUGEPAGE))
		perror("madvise");
	explicit_bzero(p, SIZE);
	pte_map_thps(p, SIZE);

	for (int loops = 0; loops < 40; loops++) {
		if (mprotect(p, SIZE, PROT_READ))
			perror("mprotect"), exit(1);
		if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
			perror("mprotect"), exit(1);
		explicit_bzero(p, SIZE);
	}

	return 0;
}
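
The cover letter reports average execution times, but the program above does
not time anything itself, so presumably the numbers were gathered externally
(e.g. with time(1)). For reference, here is a minimal, illustrative sketch of
how the mprotect() loop could be timed in-program using clock_gettime(); this
is an assumption about the measurement setup, not the harness actually used
for the numbers above:

 #include <time.h>

/* Monotonic timestamp in seconds (illustrative helper, not part of the
 * original test program). */
static double now_secs(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

/*
 * Usage sketch, wrapping the 40-iteration loop in main():
 *
 *	double t0 = now_secs();
 *	... mprotect() read-only/read-write loop ...
 *	printf("elapsed: %.2f seconds\n", now_secs() - t0);
 */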

---
The patchset is rebased onto mm-new as of Saturday, 2025-06-28.

v3->v4:
 - Refactor skipping logic into a new function, edit patch 1 subject
   to highlight it is only for MM_CP_PROT_NUMA case (David H)
 - Refactor the optimization logic, add more documentation to the generic
   batched functions, do not add clear_flush_ptes, squash patch 4
   and 5 (Ryan)

v2->v3:
 - Add comments for the new APIs (Ryan, Lorenzo)
 - Instead of refactoring, use a "skip_batch" label
 - Move arm64 patches at the end (Ryan)
 - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
 - Resolve implicit declaration; tested build on x86 (Lance Yang)

v1->v2:
 - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
 - Abridge the anon-exclusive condition (Lance Yang)

Dev Jain (4):
  mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  mm: Add batched versions of ptep_modify_prot_start/commit
  mm: Optimize mprotect() by PTE-batching
  arm64: Add batched versions of ptep_modify_prot_start/commit

 arch/arm64/include/asm/pgtable.h |  10 ++
 arch/arm64/mm/mmu.c              |  28 +++-
 include/linux/pgtable.h          |  83 +++++++++-
 mm/mprotect.c                    | 269 +++++++++++++++++++++++--------
 4 files changed, 315 insertions(+), 75 deletions(-)

-- 
2.30.2




* [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-06-28 11:34 [PATCH v4 0/4] Optimize mprotect() for large folios Dev Jain
@ 2025-06-28 11:34 ` Dev Jain
  2025-06-30  9:42   ` Ryan Roberts
                     ` (2 more replies)
  2025-06-28 11:34 ` [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-28 11:34 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

In case of prot_numa, there are various cases in which we can skip to the
next iteration. Since the skip condition is based on the folio and not
the PTEs, we can skip a PTE batch. Additionally refactor all of this
into a new function to clean up the existing code.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mprotect.c | 134 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 87 insertions(+), 47 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 88709c01177b..af10a7fbe6b8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -83,6 +83,83 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	return pte_dirty(pte);
 }
 
+static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
+		pte_t *ptep, pte_t pte, int max_nr_ptes)
+{
+	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+
+	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
+		return 1;
+
+	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
+			       NULL, NULL, NULL);
+}
+
+static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma,
+		unsigned long addr, pte_t oldpte, pte_t *pte, int target_node,
+		int max_nr_ptes)
+{
+	struct folio *folio = NULL;
+	int nr_ptes = 1;
+	bool toptier;
+	int nid;
+
+	/* Avoid TLB flush if possible */
+	if (pte_protnone(oldpte))
+		goto skip_batch;
+
+	folio = vm_normal_folio(vma, addr, oldpte);
+	if (!folio)
+		goto skip_batch;
+
+	if (folio_is_zone_device(folio) || folio_test_ksm(folio))
+		goto skip_batch;
+
+	/* Also skip shared copy-on-write pages */
+	if (is_cow_mapping(vma->vm_flags) &&
+	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
+		goto skip_batch;
+
+	/*
+	 * While migration can move some dirty pages,
+	 * it cannot move them all from MIGRATE_ASYNC
+	 * context.
+	 */
+	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
+		goto skip_batch;
+
+	/*
+	 * Don't mess with PTEs if page is already on the node
+	 * a single-threaded process is running on.
+	 */
+	nid = folio_nid(folio);
+	if (target_node == nid)
+		goto skip_batch;
+
+	toptier = node_is_toptier(nid);
+
+	/*
+	 * Skip scanning top tier node if normal numa
+	 * balancing is disabled
+	 */
+	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
+		goto skip_batch;
+
+	if (folio_use_access_time(folio)) {
+		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
+
+		/* Do not skip in this case */
+		nr_ptes = 0;
+		goto out;
+	}
+
+skip_batch:
+	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
+out:
+	*foliop = folio;
+	return nr_ptes;
+}
+
 static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -94,6 +171,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	int nr_ptes;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -108,8 +186,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 	do {
+		nr_ptes = 1;
 		oldpte = ptep_get(pte);
 		if (pte_present(oldpte)) {
+			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
+			struct folio *folio = NULL;
 			pte_t ptent;
 
 			/*
@@ -117,53 +198,12 @@ static long change_pte_range(struct mmu_gather *tlb,
 			 * pages. See similar comment in change_huge_pmd.
 			 */
 			if (prot_numa) {
-				struct folio *folio;
-				int nid;
-				bool toptier;
-
-				/* Avoid TLB flush if possible */
-				if (pte_protnone(oldpte))
-					continue;
-
-				folio = vm_normal_folio(vma, addr, oldpte);
-				if (!folio || folio_is_zone_device(folio) ||
-				    folio_test_ksm(folio))
-					continue;
-
-				/* Also skip shared copy-on-write pages */
-				if (is_cow_mapping(vma->vm_flags) &&
-				    (folio_maybe_dma_pinned(folio) ||
-				     folio_maybe_mapped_shared(folio)))
-					continue;
-
-				/*
-				 * While migration can move some dirty pages,
-				 * it cannot move them all from MIGRATE_ASYNC
-				 * context.
-				 */
-				if (folio_is_file_lru(folio) &&
-				    folio_test_dirty(folio))
-					continue;
-
-				/*
-				 * Don't mess with PTEs if page is already on the node
-				 * a single-threaded process is running on.
-				 */
-				nid = folio_nid(folio);
-				if (target_node == nid)
-					continue;
-				toptier = node_is_toptier(nid);
-
-				/*
-				 * Skip scanning top tier node if normal numa
-				 * balancing is disabled
-				 */
-				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
-				    toptier)
+				nr_ptes = prot_numa_skip_ptes(&folio, vma,
+							      addr, oldpte, pte,
+							      target_node,
+							      max_nr_ptes);
+				if (nr_ptes)
 					continue;
-				if (folio_use_access_time(folio))
-					folio_xchg_access_time(folio,
-						jiffies_to_msecs(jiffies));
 			}
 
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
@@ -280,7 +320,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 				pages++;
 			}
 		}
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-- 
2.30.2




* [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-06-28 11:34 [PATCH v4 0/4] Optimize mprotect() for large folios Dev Jain
  2025-06-28 11:34 ` [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
@ 2025-06-28 11:34 ` Dev Jain
  2025-06-30 10:10   ` Ryan Roberts
  2025-06-30 12:57   ` Lorenzo Stoakes
  2025-06-28 11:34 ` [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching Dev Jain
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-28 11:34 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect().
Architectures can override these helpers; if they do not, the helpers are
implemented as a simple loop over the corresponding single-pte helpers.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/pgtable.h | 83 ++++++++++++++++++++++++++++++++++++++++-
 mm/mprotect.c           |  4 +-
 2 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cf1515c163e2..662f39e7475a 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1331,7 +1331,8 @@ static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
 
 /*
  * Commit an update to a pte, leaving any hardware-controlled bits in
- * the PTE unmodified.
+ * the PTE unmodified. The pte may have been "upgraded" w.r.t a/d bits compared
+ * to the old_pte, as in, it may have a/d bits on which were off in old_pte.
  */
 static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 					   unsigned long addr,
@@ -1340,6 +1341,86 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 	__ptep_modify_prot_commit(vma, addr, ptep, pte);
 }
 #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
+
+/**
+ * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
+ * over a batch of ptes, which protects against asynchronous hardware
+ * modifications to the ptes. The intention is not to prevent the hardware from
+ * making pte updates, but to prevent any updates it may make from being lost.
+ * Please see the comment above ptep_modify_prot_start() for full description.
+ *
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
+ * in the batch.
+ *
+ * Note that PTE bits in the PTE batch besides the PFN can differ.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ * Since the batch is determined from folio_pte_batch, the PTEs must differ
+ * only in a/d bits (and the soft dirty bit; see fpb_t flags in
+ * mprotect_folio_pte_batch()).
+ */
+#ifndef modify_prot_start_ptes
+static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, unsigned int nr)
+{
+	pte_t pte, tmp_pte;
+
+	pte = ptep_modify_prot_start(vma, addr, ptep);
+	while (--nr) {
+		ptep++;
+		addr += PAGE_SIZE;
+		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
+		if (pte_dirty(tmp_pte))
+			pte = pte_mkdirty(pte);
+		if (pte_young(tmp_pte))
+			pte = pte_mkyoung(pte);
+	}
+	return pte;
+}
+#endif
+
+/**
+ * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
+ * hardware-controlled bits in the PTE unmodified.
+ *
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @old_pte: Old page table entry (for the first entry) which is now cleared.
+ * @pte: New page table entry to be set.
+ * @nr: Number of entries.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_modify_prot_commit().
+ *
+ * Context: The caller holds the page table lock. The PTEs are all in the same
+ * PMD. On exit, the set ptes in the batch map the same folio. The pte may have
+ * been "upgraded" w.r.t a/d bits compared to the old_pte, as in, it may have
+ * a/d bits on which were off in old_pte.
+ */
+#ifndef modify_prot_commit_ptes
+static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; ++i) {
+		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
+		ptep++;
+		addr += PAGE_SIZE;
+		old_pte = pte_next_pfn(old_pte);
+		pte = pte_next_pfn(pte);
+	}
+}
+#endif
+
 #endif /* CONFIG_MMU */
 
 /*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index af10a7fbe6b8..627b0d67cc4a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -206,7 +206,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 					continue;
 			}
 
-			oldpte = ptep_modify_prot_start(vma, addr, pte);
+			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
 			ptent = pte_modify(oldpte, newprot);
 
 			if (uffd_wp)
@@ -232,7 +232,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 			    can_change_pte_writable(vma, addr, ptent))
 				ptent = pte_mkwrite(ptent, vma);
 
-			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
 			if (pte_needs_flush(oldpte, ptent))
 				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
 			pages++;
-- 
2.30.2




* [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-06-28 11:34 [PATCH v4 0/4] Optimize mprotect() for large folios Dev Jain
  2025-06-28 11:34 ` [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
  2025-06-28 11:34 ` [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
@ 2025-06-28 11:34 ` Dev Jain
  2025-06-28 12:39   ` Dev Jain
                     ` (2 more replies)
  2025-06-28 11:34 ` [PATCH v4 4/4] arm64: Add batched versions of ptep_modify_prot_start/commit Dev Jain
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-28 11:34 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Use folio_pte_batch to batch process a large folio. Reuse the folio from
prot_numa case if possible.

For all cases other than the PageAnonExclusive case, if the case holds true
for one pte in the batch, one can confirm that that case will hold true for
other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
and access bits across the batch, therefore batching across
pte_dirty(): this is correct since the dirty bit on the PTE really is
just an indication that the folio got written to, so even if the PTE is
not actually dirty (but one of the PTEs in the batch is), the wp-fault
optimization can be made.

The crux now is how to batch around the PageAnonExclusive case; we must
check the corresponding condition for every single page. Therefore, from
the large folio batch, we determine sub-batches of ptes mapping pages with
the same PageAnonExclusive state, process that sub-batch, then determine
and process the next sub-batch, and so on. Note that this does not cause
any extra overhead: suppose the size of the folio batch is 512, then the
sub-batch processing in total will take 512 iterations, which is the same
as what we would have done before.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 117 insertions(+), 26 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 627b0d67cc4a..28c7ce7728ff 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -40,35 +40,47 @@
 
 #include "internal.h"
 
-bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t pte)
-{
-	struct page *page;
+enum tristate {
+	TRI_FALSE = 0,
+	TRI_TRUE = 1,
+	TRI_MAYBE = -1,
+};
 
+/*
+ * Returns enum tristate indicating whether the pte can be changed to writable.
+ * If TRI_MAYBE is returned, then the folio is anonymous and the user must
+ * additionally check PageAnonExclusive() for every page in the desired range.
+ */
+static int maybe_change_pte_writable(struct vm_area_struct *vma,
+				     unsigned long addr, pte_t pte,
+				     struct folio *folio)
+{
 	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
-		return false;
+		return TRI_FALSE;
 
 	/* Don't touch entries that are not even readable. */
 	if (pte_protnone(pte))
-		return false;
+		return TRI_FALSE;
 
 	/* Do we need write faults for softdirty tracking? */
 	if (pte_needs_soft_dirty_wp(vma, pte))
-		return false;
+		return TRI_FALSE;
 
 	/* Do we need write faults for uffd-wp tracking? */
 	if (userfaultfd_pte_wp(vma, pte))
-		return false;
+		return TRI_FALSE;
 
 	if (!(vma->vm_flags & VM_SHARED)) {
 		/*
 		 * Writable MAP_PRIVATE mapping: We can only special-case on
 		 * exclusive anonymous pages, because we know that our
 		 * write-fault handler similarly would map them writable without
-		 * any additional checks while holding the PT lock.
+		 * any additional checks while holding the PT lock. So if the
+		 * folio is not anonymous, we know we cannot change pte to
+		 * writable. If it is anonymous then the caller must further
+		 * check that the page is AnonExclusive().
 		 */
-		page = vm_normal_page(vma, addr, pte);
-		return page && PageAnon(page) && PageAnonExclusive(page);
+		return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
 	}
 
 	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
@@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	 * FS was already notified and we can simply mark the PTE writable
 	 * just like the write-fault handler would do.
 	 */
-	return pte_dirty(pte);
+	return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
+}
+
+/*
+ * Returns the number of pages within the folio, starting from the page
+ * indicated by pgidx and up to pgidx + max_nr, that have the same value of
+ * PageAnonExclusive(). Must only be called for anonymous folios. Value of
+ * PageAnonExclusive() is returned in *exclusive.
+ */
+static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
+				bool *exclusive)
+{
+	struct page *page;
+	int nr = 1;
+
+	if (!folio) {
+		*exclusive = false;
+		return nr;
+	}
+
+	page = folio_page(folio, pgidx++);
+	*exclusive = PageAnonExclusive(page);
+	while (nr < max_nr) {
+		page = folio_page(folio, pgidx++);
+		if ((*exclusive) != PageAnonExclusive(page))
+			break;
+		nr++;
+	}
+
+	return nr;
+}
+
+bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte)
+{
+	struct page *page;
+	int ret;
+
+	ret = maybe_change_pte_writable(vma, addr, pte, NULL);
+	if (ret == TRI_MAYBE) {
+		page = vm_normal_page(vma, addr, pte);
+		ret = page && PageAnon(page) && PageAnonExclusive(page);
+	}
+
+	return ret;
 }
 
 static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
-		pte_t *ptep, pte_t pte, int max_nr_ptes)
+		pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
 {
-	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+	fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+
+	flags &= ~switch_off_flags;
 
-	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
+	if (!folio || !folio_test_large(folio))
 		return 1;
 
 	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
@@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma
 	}
 
 skip_batch:
-	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
+	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
+					   max_nr_ptes, 0);
 out:
 	*foliop = folio;
 	return nr_ptes;
@@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
 		if (pte_present(oldpte)) {
 			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
 			struct folio *folio = NULL;
-			pte_t ptent;
+			int sub_nr_ptes, pgidx = 0;
+			pte_t ptent, newpte;
+			bool sub_set_write;
+			int set_write;
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
@@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 					continue;
 			}
 
+			if (!folio)
+				folio = vm_normal_folio(vma, addr, oldpte);
+
+			nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
+							   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
 			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
 			ptent = pte_modify(oldpte, newprot);
 
@@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
 			 * example, if a PTE is already dirty and no other
 			 * COW or special handling is required.
 			 */
-			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
-			    !pte_write(ptent) &&
-			    can_change_pte_writable(vma, addr, ptent))
-				ptent = pte_mkwrite(ptent, vma);
-
-			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
-			if (pte_needs_flush(oldpte, ptent))
-				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
-			pages++;
+			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
+				    !pte_write(ptent);
+			if (set_write)
+				set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
+
+			while (nr_ptes) {
+				if (set_write == TRI_MAYBE) {
+					sub_nr_ptes = anon_exclusive_batch(folio,
+						pgidx, nr_ptes, &sub_set_write);
+				} else {
+					sub_nr_ptes = nr_ptes;
+					sub_set_write = (set_write == TRI_TRUE);
+				}
+
+				if (sub_set_write)
+					newpte = pte_mkwrite(ptent, vma);
+				else
+					newpte = ptent;
+
+				modify_prot_commit_ptes(vma, addr, pte, oldpte,
+							newpte, sub_nr_ptes);
+				if (pte_needs_flush(oldpte, newpte))
+					tlb_flush_pte_range(tlb, addr,
+						sub_nr_ptes * PAGE_SIZE);
+
+				addr += sub_nr_ptes * PAGE_SIZE;
+				pte += sub_nr_ptes;
+				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
+				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
+				nr_ptes -= sub_nr_ptes;
+				pages += sub_nr_ptes;
+				pgidx += sub_nr_ptes;
+			}
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 			pte_t newpte;
-- 
2.30.2




* [PATCH v4 4/4] arm64: Add batched versions of ptep_modify_prot_start/commit
  2025-06-28 11:34 [PATCH v4 0/4] Optimize mprotect() for large folios Dev Jain
                   ` (2 preceding siblings ...)
  2025-06-28 11:34 ` [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching Dev Jain
@ 2025-06-28 11:34 ` Dev Jain
  2025-06-30 10:43   ` Ryan Roberts
  2025-06-29 23:05 ` [PATCH v4 0/4] Optimize mprotect() for large folios Andrew Morton
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 62+ messages in thread
From: Dev Jain @ 2025-06-28 11:34 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Override the generic definition of modify_prot_start_ptes() to use
get_and_clear_full_ptes(). This helper does a TLBI only for the starting
and ending contpte block of the range, whereas the current implementation
will call ptep_get_and_clear() for every contpte block, thus doing a
TLBI on every contpte block. Therefore, we have a performance win.

The arm64 definition of pte_accessible() allows us to batch in the
errata specific case:

#define pte_accessible(mm, pte)	\
	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))

All ptes are obviously present in the folio batch, and they are also valid.

Override the generic definition of modify_prot_commit_ptes() to simply
use set_ptes() to map the new ptes into the pagetable.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 10 ++++++++++
 arch/arm64/mm/mmu.c              | 28 +++++++++++++++++++++++-----
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ba63c8736666..abd2dee416b3 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1643,6 +1643,16 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
 				    unsigned long addr, pte_t *ptep,
 				    pte_t old_pte, pte_t new_pte);
 
+#define modify_prot_start_ptes modify_prot_start_ptes
+extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
+				    unsigned long addr, pte_t *ptep,
+				    unsigned int nr);
+
+#define modify_prot_commit_ptes modify_prot_commit_ptes
+extern void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+				    pte_t *ptep, pte_t old_pte, pte_t pte,
+				    unsigned int nr);
+
 #ifdef CONFIG_ARM64_CONTPTE
 
 /*
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 3d5fb37424ab..38325616f467 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -26,6 +26,7 @@
 #include <linux/set_memory.h>
 #include <linux/kfence.h>
 #include <linux/pkeys.h>
+#include <linux/mm_inline.h>
 
 #include <asm/barrier.h>
 #include <asm/cputype.h>
@@ -1524,24 +1525,41 @@ static int __init prevent_bootmem_remove_init(void)
 early_initcall(prevent_bootmem_remove_init);
 #endif
 
-pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
+pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t *ptep, unsigned int nr)
 {
+	pte_t pte = get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);
+
 	if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
 		/*
 		 * Break-before-make (BBM) is required for all user space mappings
 		 * when the permission changes from executable to non-executable
 		 * in cases where cpu is affected with errata #2645198.
 		 */
-		if (pte_user_exec(ptep_get(ptep)))
-			return ptep_clear_flush(vma, addr, ptep);
+		if (pte_accessible(vma->vm_mm, pte) && pte_user_exec(pte))
+			__flush_tlb_range(vma, addr, nr * PAGE_SIZE,
+					  PAGE_SIZE, true, 3);
 	}
-	return ptep_get_and_clear(vma->vm_mm, addr, ptep);
+
+	return pte;
+}
+
+pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
+{
+	return modify_prot_start_ptes(vma, addr, ptep, 1);
+}
+
+void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t *ptep, pte_t old_pte, pte_t pte,
+			     unsigned int nr)
+{
+	set_ptes(vma->vm_mm, addr, ptep, pte, nr);
 }
 
 void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
 			     pte_t old_pte, pte_t pte)
 {
-	set_pte_at(vma->vm_mm, addr, ptep, pte);
+	modify_prot_commit_ptes(vma, addr, ptep, old_pte, pte, 1);
 }
 
 /*
-- 
2.30.2




* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-06-28 11:34 ` [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching Dev Jain
@ 2025-06-28 12:39   ` Dev Jain
  2025-06-30 10:31   ` Ryan Roberts
  2025-06-30 12:52   ` Lorenzo Stoakes
  2 siblings, 0 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-28 12:39 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 28/06/25 5:04 pm, Dev Jain wrote:
> Use folio_pte_batch to batch process a large folio. Reuse the folio from
> prot_numa case if possible.
>
> For all cases other than the PageAnonExclusive case, if the case holds true
> for one pte in the batch, one can confirm that that case will hold true for
> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
> and access bits across the batch, therefore batching across
> pte_dirty(): this is correct since the dirty bit on the PTE really is
> just an indication that the folio got written to, so even if the PTE is
> not actually dirty (but one of the PTEs in the batch is), the wp-fault
> optimization can be made.
>
> The crux now is how to batch around the PageAnonExclusive case; we must
> check the corresponding condition for every single page. Therefore, from
> the large folio batch, we determine sub-batches of ptes mapping pages with
> the same PageAnonExclusive state, process that sub-batch, then determine
> and process the next sub-batch, and so on. Note that this does not cause
> any extra overhead: suppose the size of the folio batch is 512, then the
> sub-batch processing in total will take 512 iterations, which is the same
> as what we would have done before.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>   

Forgot to add:

Co-developed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

as this patch is almost identical to the diff Ryan had suggested.




* Re: [PATCH v4 0/4] Optimize mprotect() for large folios
  2025-06-28 11:34 [PATCH v4 0/4] Optimize mprotect() for large folios Dev Jain
                   ` (3 preceding siblings ...)
  2025-06-28 11:34 ` [PATCH v4 4/4] arm64: Add batched versions of ptep_modify_prot_start/commit Dev Jain
@ 2025-06-29 23:05 ` Andrew Morton
  2025-06-30  3:33   ` Dev Jain
  2025-06-30 11:17 ` Lorenzo Stoakes
  2025-06-30 11:27 ` Lorenzo Stoakes
  6 siblings, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2025-06-29 23:05 UTC (permalink / raw)
  To: Dev Jain
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Sat, 28 Jun 2025 17:04:31 +0530 Dev Jain <dev.jain@arm.com> wrote:

> This patchset optimizes the mprotect() system call for large folios
> by PTE-batching. No issues were observed with mm-selftests, build
> tested on x86_64.

um what.  Seems to claim that "selftests still compiles after I messed
with stuff", which isn't very impressive ;)  Please clarify?

> We use the following test cases to measure performance, mprotect()'ing
> the mapped memory to read-only then read-write 40 times:
> 
> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> pte-mapping those THPs
> Test case 2: Mapping 1G of memory with 64K mTHPs
> Test case 3: Mapping 1G of memory with 4K pages
> 
> Average execution time on arm64, Apple M3:
> Before the patchset:
> T1: 7.9 seconds   T2: 7.9 seconds   T3: 4.2 seconds
> 
> After the patchset:
> T1: 2.1 seconds   T2: 2.2 seconds   T3: 4.3 seconds

Well that's tasty.

> Comparing T1/T2 with T3 before the patchset shows the regression introduced
> by ptep_get() on a contpte block; this patchset removes it. For large folios
> we get an almost 74% performance improvement ((7.9 - 2.1) / 7.9 ~ 73.4%),
> the trade-off being a slight degradation in the small folio case.
> 




* Re: [PATCH v4 0/4] Optimize mprotect() for large folios
  2025-06-29 23:05 ` [PATCH v4 0/4] Optimize mprotect() for large folios Andrew Morton
@ 2025-06-30  3:33   ` Dev Jain
  2025-06-30 10:45     ` Ryan Roberts
  0 siblings, 1 reply; 62+ messages in thread
From: Dev Jain @ 2025-06-30  3:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 30/06/25 4:35 am, Andrew Morton wrote:
> On Sat, 28 Jun 2025 17:04:31 +0530 Dev Jain <dev.jain@arm.com> wrote:
>
>> This patchset optimizes the mprotect() system call for large folios
>> by PTE-batching. No issues were observed with mm-selftests, build
>> tested on x86_64.
> um what.  Seems to claim that "selftests still compiles after I messed
> with stuff", which isn't very impressive ;)  Please clarify?

Sorry, I meant to say that the mm-selftests pass.

>
>> We use the following test cases to measure performance, mprotect()'ing
>> the mapped memory to read-only then read-write 40 times:
>>
>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>> pte-mapping those THPs
>> Test case 2: Mapping 1G of memory with 64K mTHPs
>> Test case 3: Mapping 1G of memory with 4K pages
>>
>> Average execution time on arm64, Apple M3:
>> Before the patchset:
>> T1: 7.9 seconds   T2: 7.9 seconds   T3: 4.2 seconds
>>
>> After the patchset:
>> T1: 2.1 seconds   T2: 2.2 seconds   T3: 4.3 seconds
> Well that's tasty.
>
>> Comparing T1/T2 with T3 before the patchset shows the regression introduced
>> by ptep_get() on a contpte block; this patchset removes it. For large folios
>> we get an almost 74% performance improvement ((7.9 - 2.1) / 7.9 ~ 73.4%),
>> the trade-off being a slight degradation in the small folio case.
>>



* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-06-28 11:34 ` [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
@ 2025-06-30  9:42   ` Ryan Roberts
  2025-06-30  9:49     ` Dev Jain
  2025-06-30 11:25   ` Lorenzo Stoakes
  2025-07-02  9:37   ` Lorenzo Stoakes
  2 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-06-30  9:42 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 28/06/2025 12:34, Dev Jain wrote:
> In case of prot_numa, there are various cases in which we can skip to the
> next iteration. Since the skip condition is based on the folio and not
> the PTEs, we can skip a PTE batch. Additionally refactor all of this
> into a new function to clean up the existing code.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mprotect.c | 134 ++++++++++++++++++++++++++++++++------------------
>  1 file changed, 87 insertions(+), 47 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88709c01177b..af10a7fbe6b8 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,6 +83,83 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return pte_dirty(pte);
>  }
>  
> +static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
> +		pte_t *ptep, pte_t pte, int max_nr_ptes)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +
> +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))

The !folio check wasn't in the previous version. Why is it needed now?

> +		return 1;
> +
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
> +			       NULL, NULL, NULL);
> +}
> +
> +static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t oldpte, pte_t *pte, int target_node,
> +		int max_nr_ptes)
> +{
> +	struct folio *folio = NULL;
> +	int nr_ptes = 1;
> +	bool toptier;
> +	int nid;
> +
> +	/* Avoid TLB flush if possible */
> +	if (pte_protnone(oldpte))
> +		goto skip_batch;
> +
> +	folio = vm_normal_folio(vma, addr, oldpte);
> +	if (!folio)
> +		goto skip_batch;
> +
> +	if (folio_is_zone_device(folio) || folio_test_ksm(folio))
> +		goto skip_batch;
> +
> +	/* Also skip shared copy-on-write pages */
> +	if (is_cow_mapping(vma->vm_flags) &&
> +	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
> +		goto skip_batch;
> +
> +	/*
> +	 * While migration can move some dirty pages,
> +	 * it cannot move them all from MIGRATE_ASYNC
> +	 * context.
> +	 */
> +	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
> +		goto skip_batch;
> +
> +	/*
> +	 * Don't mess with PTEs if page is already on the node
> +	 * a single-threaded process is running on.
> +	 */
> +	nid = folio_nid(folio);
> +	if (target_node == nid)
> +		goto skip_batch;
> +
> +	toptier = node_is_toptier(nid);
> +
> +	/*
> +	 * Skip scanning top tier node if normal numa
> +	 * balancing is disabled
> +	 */
> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
> +		goto skip_batch;
> +
> +	if (folio_use_access_time(folio)) {
> +		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
> +
> +		/* Do not skip in this case */
> +		nr_ptes = 0;
> +		goto out;

This doesn't smell right... perhaps I'm not understanding the logic. Why do you
return nr_ptes = 0 if you end up in this conditional, but nr_ptes = 1 if you
don't take this conditional? I think you want to return nr_ptes == 0 for both
cases?...

> +	}
> +
> +skip_batch:
> +	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
> +out:
> +	*foliop = folio;
> +	return nr_ptes;
> +}
> +
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -94,6 +171,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	int nr_ptes;
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -108,8 +186,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>  	do {
> +		nr_ptes = 1;
>  		oldpte = ptep_get(pte);
>  		if (pte_present(oldpte)) {
> +			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
> +			struct folio *folio = NULL;
>  			pte_t ptent;
>  
>  			/*
> @@ -117,53 +198,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			 * pages. See similar comment in change_huge_pmd.
>  			 */
>  			if (prot_numa) {
> -				struct folio *folio;
> -				int nid;
> -				bool toptier;
> -
> -				/* Avoid TLB flush if possible */
> -				if (pte_protnone(oldpte))
> -					continue;
> -
> -				folio = vm_normal_folio(vma, addr, oldpte);
> -				if (!folio || folio_is_zone_device(folio) ||
> -				    folio_test_ksm(folio))
> -					continue;
> -
> -				/* Also skip shared copy-on-write pages */
> -				if (is_cow_mapping(vma->vm_flags) &&
> -				    (folio_maybe_dma_pinned(folio) ||
> -				     folio_maybe_mapped_shared(folio)))
> -					continue;
> -
> -				/*
> -				 * While migration can move some dirty pages,
> -				 * it cannot move them all from MIGRATE_ASYNC
> -				 * context.
> -				 */
> -				if (folio_is_file_lru(folio) &&
> -				    folio_test_dirty(folio))
> -					continue;
> -
> -				/*
> -				 * Don't mess with PTEs if page is already on the node
> -				 * a single-threaded process is running on.
> -				 */
> -				nid = folio_nid(folio);
> -				if (target_node == nid)
> -					continue;
> -				toptier = node_is_toptier(nid);
> -
> -				/*
> -				 * Skip scanning top tier node if normal numa
> -				 * balancing is disabled
> -				 */
> -				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> -				    toptier)
> +				nr_ptes = prot_numa_skip_ptes(&folio, vma,
> +							      addr, oldpte, pte,
> +							      target_node,
> +							      max_nr_ptes);
> +				if (nr_ptes)
>  					continue;

...But now here nr_ptes == 0 for the "don't skip" case, so won't you process
that PTE twice because while (pte += nr_ptes, ...) won't advance it?

Suggest forcing nr_ptes = 1 after this conditional "continue"?

Thanks,
Ryan


> -				if (folio_use_access_time(folio))
> -					folio_xchg_access_time(folio,
> -						jiffies_to_msecs(jiffies));
>  			}
>  
>  			oldpte = ptep_modify_prot_start(vma, addr, pte);
> @@ -280,7 +320,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				pages++;
>  			}
>  		}
> -	} while (pte++, addr += PAGE_SIZE, addr != end);
> +	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);
>  




* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-06-30  9:42   ` Ryan Roberts
@ 2025-06-30  9:49     ` Dev Jain
  2025-06-30  9:55       ` Ryan Roberts
  0 siblings, 1 reply; 62+ messages in thread
From: Dev Jain @ 2025-06-30  9:49 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 30/06/25 3:12 pm, Ryan Roberts wrote:
> On 28/06/2025 12:34, Dev Jain wrote:
>> In case of prot_numa, there are various cases in which we can skip to the
>> next iteration. Since the skip condition is based on the folio and not
>> the PTEs, we can skip a PTE batch. Additionally refactor all of this
>> into a new function to clean up the existing code.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mprotect.c | 134 ++++++++++++++++++++++++++++++++------------------
>>   1 file changed, 87 insertions(+), 47 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 88709c01177b..af10a7fbe6b8 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -83,6 +83,83 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   	return pte_dirty(pte);
>>   }
>>   
>> +static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>> +		pte_t *ptep, pte_t pte, int max_nr_ptes)
>> +{
>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +
>> +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
> The !folio check wasn't in the previous version. Why is it needed now?

It was there, actually. After prot_numa_skip_ptes(), if the folio is still
NULL, we get it using vm_normal_folio(). If this returns NULL, then
mprotect_folio_pte_batch() will return 1 to say that we cannot batch.

>
>> +		return 1;
>> +
>> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>> +			       NULL, NULL, NULL);
>> +}
>> +
>> +static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t oldpte, pte_t *pte, int target_node,
>> +		int max_nr_ptes)
>> +{
>> +	struct folio *folio = NULL;
>> +	int nr_ptes = 1;
>> +	bool toptier;
>> +	int nid;
>> +
>> +	/* Avoid TLB flush if possible */
>> +	if (pte_protnone(oldpte))
>> +		goto skip_batch;
>> +
>> +	folio = vm_normal_folio(vma, addr, oldpte);
>> +	if (!folio)
>> +		goto skip_batch;
>> +
>> +	if (folio_is_zone_device(folio) || folio_test_ksm(folio))
>> +		goto skip_batch;
>> +
>> +	/* Also skip shared copy-on-write pages */
>> +	if (is_cow_mapping(vma->vm_flags) &&
>> +	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
>> +		goto skip_batch;
>> +
>> +	/*
>> +	 * While migration can move some dirty pages,
>> +	 * it cannot move them all from MIGRATE_ASYNC
>> +	 * context.
>> +	 */
>> +	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
>> +		goto skip_batch;
>> +
>> +	/*
>> +	 * Don't mess with PTEs if page is already on the node
>> +	 * a single-threaded process is running on.
>> +	 */
>> +	nid = folio_nid(folio);
>> +	if (target_node == nid)
>> +		goto skip_batch;
>> +
>> +	toptier = node_is_toptier(nid);
>> +
>> +	/*
>> +	 * Skip scanning top tier node if normal numa
>> +	 * balancing is disabled
>> +	 */
>> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
>> +		goto skip_batch;
>> +
>> +	if (folio_use_access_time(folio)) {
>> +		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
>> +
>> +		/* Do not skip in this case */
>> +		nr_ptes = 0;
>> +		goto out;
> This doesn't smell right... perhaps I'm not understanding the logic. Why do you
> return nr_ptes = 0 if you end up in this conditional, but nr_ptes = 1 if you
> don't take this conditional? I think you want to return nr_ptes == 0 for both
> cases?...

In the existing code, we do not skip if we take this conditional. So nr_ptes == 0
is only a hint that we don't have to skip in this case.

>
>> +	}
>> +
>> +skip_batch:
>> +	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>> +out:
>> +	*foliop = folio;
>> +	return nr_ptes;
>> +}
>> +
>>   static long change_pte_range(struct mmu_gather *tlb,
>>   		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>   		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -94,6 +171,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>   	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>   	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> +	int nr_ptes;
>>   
>>   	tlb_change_page_size(tlb, PAGE_SIZE);
>>   	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> @@ -108,8 +186,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   	flush_tlb_batched_pending(vma->vm_mm);
>>   	arch_enter_lazy_mmu_mode();
>>   	do {
>> +		nr_ptes = 1;
>>   		oldpte = ptep_get(pte);
>>   		if (pte_present(oldpte)) {
>> +			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>> +			struct folio *folio = NULL;
>>   			pte_t ptent;
>>   
>>   			/*
>> @@ -117,53 +198,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			 * pages. See similar comment in change_huge_pmd.
>>   			 */
>>   			if (prot_numa) {
>> -				struct folio *folio;
>> -				int nid;
>> -				bool toptier;
>> -
>> -				/* Avoid TLB flush if possible */
>> -				if (pte_protnone(oldpte))
>> -					continue;
>> -
>> -				folio = vm_normal_folio(vma, addr, oldpte);
>> -				if (!folio || folio_is_zone_device(folio) ||
>> -				    folio_test_ksm(folio))
>> -					continue;
>> -
>> -				/* Also skip shared copy-on-write pages */
>> -				if (is_cow_mapping(vma->vm_flags) &&
>> -				    (folio_maybe_dma_pinned(folio) ||
>> -				     folio_maybe_mapped_shared(folio)))
>> -					continue;
>> -
>> -				/*
>> -				 * While migration can move some dirty pages,
>> -				 * it cannot move them all from MIGRATE_ASYNC
>> -				 * context.
>> -				 */
>> -				if (folio_is_file_lru(folio) &&
>> -				    folio_test_dirty(folio))
>> -					continue;
>> -
>> -				/*
>> -				 * Don't mess with PTEs if page is already on the node
>> -				 * a single-threaded process is running on.
>> -				 */
>> -				nid = folio_nid(folio);
>> -				if (target_node == nid)
>> -					continue;
>> -				toptier = node_is_toptier(nid);
>> -
>> -				/*
>> -				 * Skip scanning top tier node if normal numa
>> -				 * balancing is disabled
>> -				 */
>> -				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>> -				    toptier)
>> +				nr_ptes = prot_numa_skip_ptes(&folio, vma,
>> +							      addr, oldpte, pte,
>> +							      target_node,
>> +							      max_nr_ptes);
>> +				if (nr_ptes)
>>   					continue;
> ...But now here nr_ptes == 0 for the "don't skip" case, so won't you process
> that PTE twice because while (pte += nr_ptes, ...) won't advance it?
>
> Suggest forcing nr_ptes = 1 after this conditional "continue"?

nr_ptes will be forced to a non zero value through mprotect_folio_pte_batch().

>
> Thanks,
> Ryan
>
>
>> -				if (folio_use_access_time(folio))
>> -					folio_xchg_access_time(folio,
>> -						jiffies_to_msecs(jiffies));
>>   			}
>>   
>>   			oldpte = ptep_modify_prot_start(vma, addr, pte);
>> @@ -280,7 +320,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   				pages++;
>>   			}
>>   		}
>> -	} while (pte++, addr += PAGE_SIZE, addr != end);
>> +	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>>   	arch_leave_lazy_mmu_mode();
>>   	pte_unmap_unlock(pte - 1, ptl);
>>   



* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-06-30  9:49     ` Dev Jain
@ 2025-06-30  9:55       ` Ryan Roberts
  2025-06-30 10:05         ` Dev Jain
  0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-06-30  9:55 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 30/06/2025 10:49, Dev Jain wrote:
> 
> On 30/06/25 3:12 pm, Ryan Roberts wrote:
>> On 28/06/2025 12:34, Dev Jain wrote:
>>> In case of prot_numa, there are various cases in which we can skip to the
>>> next iteration. Since the skip condition is based on the folio and not
>>> the PTEs, we can skip a PTE batch. Additionally refactor all of this
>>> into a new function to clean up the existing code.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>   mm/mprotect.c | 134 ++++++++++++++++++++++++++++++++------------------
>>>   1 file changed, 87 insertions(+), 47 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 88709c01177b..af10a7fbe6b8 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -83,6 +83,83 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
>>> unsigned long addr,
>>>       return pte_dirty(pte);
>>>   }
>>>   +static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>>> +        pte_t *ptep, pte_t pte, int max_nr_ptes)
>>> +{
>>> +    const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> +
>>> +    if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>> The !folio check wasn't in the previous version. Why is it needed now?
> 
> It was there, actually. After prot_numa_skip_ptes(), if the folio is still
> NULL, we get it using vm_normal_folio(). If this returns NULL, then
> mprotect_folio_pte_batch() will return 1 to say that we cannot batch.
> 
>>
>>> +        return 1;
>>> +
>>> +    return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>> +                   NULL, NULL, NULL);
>>> +}
>>> +
>>> +static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct
>>> *vma,
>>> +        unsigned long addr, pte_t oldpte, pte_t *pte, int target_node,
>>> +        int max_nr_ptes)
>>> +{
>>> +    struct folio *folio = NULL;
>>> +    int nr_ptes = 1;
>>> +    bool toptier;
>>> +    int nid;
>>> +
>>> +    /* Avoid TLB flush if possible */
>>> +    if (pte_protnone(oldpte))
>>> +        goto skip_batch;
>>> +
>>> +    folio = vm_normal_folio(vma, addr, oldpte);
>>> +    if (!folio)
>>> +        goto skip_batch;
>>> +
>>> +    if (folio_is_zone_device(folio) || folio_test_ksm(folio))
>>> +        goto skip_batch;
>>> +
>>> +    /* Also skip shared copy-on-write pages */
>>> +    if (is_cow_mapping(vma->vm_flags) &&
>>> +        (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
>>> +        goto skip_batch;
>>> +
>>> +    /*
>>> +     * While migration can move some dirty pages,
>>> +     * it cannot move them all from MIGRATE_ASYNC
>>> +     * context.
>>> +     */
>>> +    if (folio_is_file_lru(folio) && folio_test_dirty(folio))
>>> +        goto skip_batch;
>>> +
>>> +    /*
>>> +     * Don't mess with PTEs if page is already on the node
>>> +     * a single-threaded process is running on.
>>> +     */
>>> +    nid = folio_nid(folio);
>>> +    if (target_node == nid)
>>> +        goto skip_batch;
>>> +
>>> +    toptier = node_is_toptier(nid);
>>> +
>>> +    /*
>>> +     * Skip scanning top tier node if normal numa
>>> +     * balancing is disabled
>>> +     */
>>> +    if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
>>> +        goto skip_batch;
>>> +
>>> +    if (folio_use_access_time(folio)) {
>>> +        folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
>>> +
>>> +        /* Do not skip in this case */
>>> +        nr_ptes = 0;
>>> +        goto out;
>> This doesn't smell right... perhaps I'm not understanding the logic. Why do you
>> return nr_ptes = 0 if you end up in this conditional, but nr_ptes = 1 if you
>> don't take this conditional? I think you want to return nr_ptes == 0 for both
>> cases?...
> 
> In the existing code, we do not skip if we take this conditional. So nr_ptes == 0
> is only a hint that we don't have to skip in this case.

We also do not skip if we do not take the conditional, right? "hint that we don't
have to skip in this case"... no, I think it's a "directive that we must not
skip"? A hint is something that the implementation is free to ignore. But I
don't think that's the case here.

What I'm saying is that I think this block should actually be:

	if (folio_use_access_time(folio))
		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));

	/* Do not skip in this case */
	nr_ptes = 0;
	goto out;

> 
>>
>>> +    }
>>> +
>>> +skip_batch:
>>> +    nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>>> +out:
>>> +    *foliop = folio;
>>> +    return nr_ptes;
>>> +}
>>> +
>>>   static long change_pte_range(struct mmu_gather *tlb,
>>>           struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>>           unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>>> @@ -94,6 +171,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>       bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>>       bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>>       bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>>> +    int nr_ptes;
>>>         tlb_change_page_size(tlb, PAGE_SIZE);
>>>       pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>>> @@ -108,8 +186,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>       flush_tlb_batched_pending(vma->vm_mm);
>>>       arch_enter_lazy_mmu_mode();
>>>       do {
>>> +        nr_ptes = 1;
>>>           oldpte = ptep_get(pte);
>>>           if (pte_present(oldpte)) {
>>> +            int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>> +            struct folio *folio = NULL;
>>>               pte_t ptent;
>>>                 /*
>>> @@ -117,53 +198,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                * pages. See similar comment in change_huge_pmd.
>>>                */
>>>               if (prot_numa) {
>>> -                struct folio *folio;
>>> -                int nid;
>>> -                bool toptier;
>>> -
>>> -                /* Avoid TLB flush if possible */
>>> -                if (pte_protnone(oldpte))
>>> -                    continue;
>>> -
>>> -                folio = vm_normal_folio(vma, addr, oldpte);
>>> -                if (!folio || folio_is_zone_device(folio) ||
>>> -                    folio_test_ksm(folio))
>>> -                    continue;
>>> -
>>> -                /* Also skip shared copy-on-write pages */
>>> -                if (is_cow_mapping(vma->vm_flags) &&
>>> -                    (folio_maybe_dma_pinned(folio) ||
>>> -                     folio_maybe_mapped_shared(folio)))
>>> -                    continue;
>>> -
>>> -                /*
>>> -                 * While migration can move some dirty pages,
>>> -                 * it cannot move them all from MIGRATE_ASYNC
>>> -                 * context.
>>> -                 */
>>> -                if (folio_is_file_lru(folio) &&
>>> -                    folio_test_dirty(folio))
>>> -                    continue;
>>> -
>>> -                /*
>>> -                 * Don't mess with PTEs if page is already on the node
>>> -                 * a single-threaded process is running on.
>>> -                 */
>>> -                nid = folio_nid(folio);
>>> -                if (target_node == nid)
>>> -                    continue;
>>> -                toptier = node_is_toptier(nid);
>>> -
>>> -                /*
>>> -                 * Skip scanning top tier node if normal numa
>>> -                 * balancing is disabled
>>> -                 */
>>> -                if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>>> -                    toptier)
>>> +                nr_ptes = prot_numa_skip_ptes(&folio, vma,
>>> +                                  addr, oldpte, pte,
>>> +                                  target_node,
>>> +                                  max_nr_ptes);
>>> +                if (nr_ptes)
>>>                       continue;
>> ...But now here nr_ptes == 0 for the "don't skip" case, so won't you process
>> that PTE twice because while (pte += nr_ptes, ...) won't advance it?
>>
>> Suggest forcing nr_ptes = 1 after this conditional "continue"?
> 
> nr_ptes will be forced to a non-zero value through mprotect_folio_pte_batch().

But you don't call mprotect_folio_pte_batch() if you have set nr_ptes = 0;
Perhaps you are referring to calling mprotect_folio_pte_batch() on the
processing path in a future patch? But that means that this patch is buggy
without the future patch.

> 
>>
>> Thanks,
>> Ryan
>>
>>
>>> -                if (folio_use_access_time(folio))
>>> -                    folio_xchg_access_time(folio,
>>> -                        jiffies_to_msecs(jiffies));
>>>               }
>>>                 oldpte = ptep_modify_prot_start(vma, addr, pte);
>>> @@ -280,7 +320,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                   pages++;
>>>               }
>>>           }
>>> -    } while (pte++, addr += PAGE_SIZE, addr != end);
>>> +    } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>>>       arch_leave_lazy_mmu_mode();
>>>       pte_unmap_unlock(pte - 1, ptl);
>>>   



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-06-30  9:55       ` Ryan Roberts
@ 2025-06-30 10:05         ` Dev Jain
  0 siblings, 0 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-30 10:05 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 30/06/25 3:25 pm, Ryan Roberts wrote:
> On 30/06/2025 10:49, Dev Jain wrote:
>> On 30/06/25 3:12 pm, Ryan Roberts wrote:
>>> On 28/06/2025 12:34, Dev Jain wrote:
>>>> In case of prot_numa, there are various cases in which we can skip to the
>>>> next iteration. Since the skip condition is based on the folio and not
>>>> the PTEs, we can skip a PTE batch. Additionally refactor all of this
>>>> into a new function to clean up the existing code.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>    mm/mprotect.c | 134 ++++++++++++++++++++++++++++++++------------------
>>>>    1 file changed, 87 insertions(+), 47 deletions(-)
>>>>
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index 88709c01177b..af10a7fbe6b8 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -83,6 +83,83 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
>>>> unsigned long addr,
>>>>        return pte_dirty(pte);
>>>>    }
>>>>    +static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>>>> +        pte_t *ptep, pte_t pte, int max_nr_ptes)
>>>> +{
>>>> +    const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>> +
>>>> +    if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>> The !folio check wasn't in the previous version. Why is it needed now?
>> It was there, actually. After prot_numa_skip_ptes(), if the folio is still
>> NULL, we get it using vm_normal_folio(). If this returns NULL, then
>> mprotect_folio_pte_batch() will return 1 to say that we cannot batch.
>>
>>>> +        return 1;
>>>> +
>>>> +    return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>>> +                   NULL, NULL, NULL);
>>>> +}
>>>> +
>>>> +static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct
>>>> *vma,
>>>> +        unsigned long addr, pte_t oldpte, pte_t *pte, int target_node,
>>>> +        int max_nr_ptes)
>>>> +{
>>>> +    struct folio *folio = NULL;
>>>> +    int nr_ptes = 1;
>>>> +    bool toptier;
>>>> +    int nid;
>>>> +
>>>> +    /* Avoid TLB flush if possible */
>>>> +    if (pte_protnone(oldpte))
>>>> +        goto skip_batch;
>>>> +
>>>> +    folio = vm_normal_folio(vma, addr, oldpte);
>>>> +    if (!folio)
>>>> +        goto skip_batch;
>>>> +
>>>> +    if (folio_is_zone_device(folio) || folio_test_ksm(folio))
>>>> +        goto skip_batch;
>>>> +
>>>> +    /* Also skip shared copy-on-write pages */
>>>> +    if (is_cow_mapping(vma->vm_flags) &&
>>>> +        (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
>>>> +        goto skip_batch;
>>>> +
>>>> +    /*
>>>> +     * While migration can move some dirty pages,
>>>> +     * it cannot move them all from MIGRATE_ASYNC
>>>> +     * context.
>>>> +     */
>>>> +    if (folio_is_file_lru(folio) && folio_test_dirty(folio))
>>>> +        goto skip_batch;
>>>> +
>>>> +    /*
>>>> +     * Don't mess with PTEs if page is already on the node
>>>> +     * a single-threaded process is running on.
>>>> +     */
>>>> +    nid = folio_nid(folio);
>>>> +    if (target_node == nid)
>>>> +        goto skip_batch;
>>>> +
>>>> +    toptier = node_is_toptier(nid);
>>>> +
>>>> +    /*
>>>> +     * Skip scanning top tier node if normal numa
>>>> +     * balancing is disabled
>>>> +     */
>>>> +    if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
>>>> +        goto skip_batch;
>>>> +
>>>> +    if (folio_use_access_time(folio)) {
>>>> +        folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
>>>> +
>>>> +        /* Do not skip in this case */
>>>> +        nr_ptes = 0;
>>>> +        goto out;
>>> This doesn't smell right... perhaps I'm not understanding the logic. Why do you
>>> return nr_ptes = 0 if you end up in this conditional, but nr_ptes = 1 if you
>>> don't take this conditional? I think you want to return nr_ptes == 0 for both
>>> cases?...
>> In the existing code, we do not skip if we take this conditional. So nr_ptes == 0
>> is only a hint that we don't have to skip in this case.
> We also do not skip if we do not take the conditional, right? "hint that we don't
> have to skip in this case"... no I think it's a "directive that we must not
> skip"? A hint is something that the implementation is free to ignore. But I
> don't think that's the case here.
>
> What I'm saying is that I think this block should actually be:
>
> 	if (folio_use_access_time(folio))
> 		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
>
> 	/* Do not skip in this case */
> 	nr_ptes = 0;
> 	goto out;

Ah you are right. Thanks!

>>>> +    }
>>>> +
>>>> +skip_batch:
>>>> +    nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>>>> +out:
>>>> +    *foliop = folio;
>>>> +    return nr_ptes;
>>>> +}
>>>> +
>>>>    static long change_pte_range(struct mmu_gather *tlb,
>>>>            struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>>>            unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>>>> @@ -94,6 +171,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>        bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>>>        bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>>>        bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>>>> +    int nr_ptes;
>>>>          tlb_change_page_size(tlb, PAGE_SIZE);
>>>>        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>>>> @@ -108,8 +186,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>        flush_tlb_batched_pending(vma->vm_mm);
>>>>        arch_enter_lazy_mmu_mode();
>>>>        do {
>>>> +        nr_ptes = 1;
>>>>            oldpte = ptep_get(pte);
>>>>            if (pte_present(oldpte)) {
>>>> +            int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>>> +            struct folio *folio = NULL;
>>>>                pte_t ptent;
>>>>                  /*
>>>> @@ -117,53 +198,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>                 * pages. See similar comment in change_huge_pmd.
>>>>                 */
>>>>                if (prot_numa) {
>>>> -                struct folio *folio;
>>>> -                int nid;
>>>> -                bool toptier;
>>>> -
>>>> -                /* Avoid TLB flush if possible */
>>>> -                if (pte_protnone(oldpte))
>>>> -                    continue;
>>>> -
>>>> -                folio = vm_normal_folio(vma, addr, oldpte);
>>>> -                if (!folio || folio_is_zone_device(folio) ||
>>>> -                    folio_test_ksm(folio))
>>>> -                    continue;
>>>> -
>>>> -                /* Also skip shared copy-on-write pages */
>>>> -                if (is_cow_mapping(vma->vm_flags) &&
>>>> -                    (folio_maybe_dma_pinned(folio) ||
>>>> -                     folio_maybe_mapped_shared(folio)))
>>>> -                    continue;
>>>> -
>>>> -                /*
>>>> -                 * While migration can move some dirty pages,
>>>> -                 * it cannot move them all from MIGRATE_ASYNC
>>>> -                 * context.
>>>> -                 */
>>>> -                if (folio_is_file_lru(folio) &&
>>>> -                    folio_test_dirty(folio))
>>>> -                    continue;
>>>> -
>>>> -                /*
>>>> -                 * Don't mess with PTEs if page is already on the node
>>>> -                 * a single-threaded process is running on.
>>>> -                 */
>>>> -                nid = folio_nid(folio);
>>>> -                if (target_node == nid)
>>>> -                    continue;
>>>> -                toptier = node_is_toptier(nid);
>>>> -
>>>> -                /*
>>>> -                 * Skip scanning top tier node if normal numa
>>>> -                 * balancing is disabled
>>>> -                 */
>>>> -                if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>>>> -                    toptier)
>>>> +                nr_ptes = prot_numa_skip_ptes(&folio, vma,
>>>> +                                  addr, oldpte, pte,
>>>> +                                  target_node,
>>>> +                                  max_nr_ptes);
>>>> +                if (nr_ptes)
>>>>                        continue;
>>> ...But now here nr_ptes == 0 for the "don't skip" case, so won't you process
>>> that PTE twice because while (pte += nr_ptes, ...) won't advance it?
>>>
>>> Suggest forcing nr_ptes = 1 after this conditional "continue"?
>> nr_ptes will be forced to a non-zero value through mprotect_folio_pte_batch().
> But you don't call mprotect_folio_pte_batch() if you have set nr_ptes = 0;
> Perhaps you are referring to calling mprotect_folio_pte_batch() on the
> processing path in a future patch? But that means that this patch is buggy
> without the future patch.

Yup it is there in the future patch. You are correct, I'll respin and force
nr_ptes = 1 in this case.
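
To double-check we mean the same thing, roughly (a sketch of the intended
respin only, not the final code):

	if (prot_numa) {
		nr_ptes = prot_numa_skip_ptes(&folio, vma, addr, oldpte,
					      pte, target_node, max_nr_ptes);
		if (nr_ptes)
			continue;

		/* Not skipping: process one PTE until a later patch batches this path too. */
		nr_ptes = 1;
	}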

>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>> -                if (folio_use_access_time(folio))
>>>> -                    folio_xchg_access_time(folio,
>>>> -                        jiffies_to_msecs(jiffies));
>>>>                }
>>>>                  oldpte = ptep_modify_prot_start(vma, addr, pte);
>>>> @@ -280,7 +320,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>                    pages++;
>>>>                }
>>>>            }
>>>> -    } while (pte++, addr += PAGE_SIZE, addr != end);
>>>> +    } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>>>>        arch_leave_lazy_mmu_mode();
>>>>        pte_unmap_unlock(pte - 1, ptl);
>>>>    


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-06-28 11:34 ` [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
@ 2025-06-30 10:10   ` Ryan Roberts
  2025-06-30 10:17     ` Dev Jain
  2025-06-30 12:57   ` Lorenzo Stoakes
  1 sibling, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-06-30 10:10 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 28/06/2025 12:34, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> Architecture can override these helpers; in case not, they are implemented
> as a simple loop over the corresponding single pte helpers.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pgtable.h | 83 ++++++++++++++++++++++++++++++++++++++++-
>  mm/mprotect.c           |  4 +-
>  2 files changed, 84 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index cf1515c163e2..662f39e7475a 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1331,7 +1331,8 @@ static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>  
>  /*
>   * Commit an update to a pte, leaving any hardware-controlled bits in
> - * the PTE unmodified.
> + * the PTE unmodified. The pte may have been "upgraded" w.r.t a/d bits compared
> + * to the old_pte, as in, it may have a/d bits on which were off in old_pte.

I find this last sentence a bit confusing. I think what you are trying to say is
something like:

"""
old_pte is the value returned from ptep_modify_prot_start() but may additionally
have young and/or dirty bits set where previously they were not.
"""

?

>   */
>  static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  					   unsigned long addr,
> @@ -1340,6 +1341,86 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>  }
>  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> +
> +/**
> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
> + * over a batch of ptes, which protects against asynchronous hardware
> + * modifications to the ptes. The intention is not to prevent the hardware from
> + * making pte updates, but to prevent any updates it may make from being lost.
> + * Please see the comment above ptep_modify_prot_start() for full description.
> + *
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
> + * in the batch.
> + *
> + * Note that PTE bits in the PTE batch besides the PFN can differ.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + * Since the batch is determined from folio_pte_batch, the PTEs must differ
> + * only in a/d bits (and the soft dirty bit; see fpb_t flags in
> + * mprotect_folio_pte_batch()).

This last sentence is confusing... You had previously said the PFN can differ, but
here you imply only a, d and sd bits are allowed to differ.

> + */
> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> +	while (--nr) {
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
> +	}
> +	return pte;
> +}
> +#endif
> +
> +/**
> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
> + * hardware-controlled bits in the PTE unmodified.
> + *
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
> + * @pte: New page table entry to be set.
> + * @nr: Number of entries.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_modify_prot_commit().
> + *
> + * Context: The caller holds the page table lock. The PTEs are all in the same
> + * PMD. On exit, the set ptes in the batch map the same folio. The pte may have
> + * been "upgraded" w.r.t a/d bits compared to the old_pte, as in, it may have
> + * a/d bits on which were off in old_pte.

Same comment as for ptep_modify_prot_start().

> + */
> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr; ++i) {
> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		old_pte = pte_next_pfn(old_pte);
> +		pte = pte_next_pfn(pte);
> +	}
> +}
> +#endif
> +
>  #endif /* CONFIG_MMU */
>  
>  /*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index af10a7fbe6b8..627b0d67cc4a 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -206,7 +206,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  					continue;
>  			}
>  
> -			oldpte = ptep_modify_prot_start(vma, addr, pte);
> +			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);

You're calling this with nr_ptes = 0 for the prot_numa case. But the
implementation expects minimum nr_ptes == 1.
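
To spell it out with the generic fallback above: nr is an unsigned int, so
nr == 0 is not a harmless no-op there:

	pte = ptep_modify_prot_start(vma, addr, ptep);	/* runs even when nr == 0 */
	while (--nr) {	/* for nr == 0 this wraps to UINT_MAX and over-runs the batch */
		ptep++;
		addr += PAGE_SIZE;
		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
		if (pte_dirty(tmp_pte))
			pte = pte_mkdirty(pte);
		if (pte_young(tmp_pte))
			pte = pte_mkyoung(pte);
	}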

>  			ptent = pte_modify(oldpte, newprot);
>  
>  			if (uffd_wp)
> @@ -232,7 +232,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			    can_change_pte_writable(vma, addr, ptent))
>  				ptent = pte_mkwrite(ptent, vma);
>  
> -			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> +			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>  			if (pte_needs_flush(oldpte, ptent))
>  				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>  			pages++;



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-06-30 10:10   ` Ryan Roberts
@ 2025-06-30 10:17     ` Dev Jain
  2025-06-30 10:35       ` Ryan Roberts
  0 siblings, 1 reply; 62+ messages in thread
From: Dev Jain @ 2025-06-30 10:17 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 30/06/25 3:40 pm, Ryan Roberts wrote:
> On 28/06/2025 12:34, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architecture can override these helpers; in case not, they are implemented
>> as a simple loop over the corresponding single pte helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 83 ++++++++++++++++++++++++++++++++++++++++-
>>   mm/mprotect.c           |  4 +-
>>   2 files changed, 84 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index cf1515c163e2..662f39e7475a 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1331,7 +1331,8 @@ static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>>   
>>   /*
>>    * Commit an update to a pte, leaving any hardware-controlled bits in
>> - * the PTE unmodified.
>> + * the PTE unmodified. The pte may have been "upgraded" w.r.t a/d bits compared
>> + * to the old_pte, as in, it may have a/d bits on which were off in old_pte.
> I find this last sentence a bit confusing. I think what you are trying to say is
> something like:
>
> """
> old_pte is the value returned from ptep_modify_prot_start() but may additionally
> have young and/or dirty bits set where previously they were not.
> """

Thanks.

> ?
>
>>    */
>>   static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   					   unsigned long addr,
>> @@ -1340,6 +1341,86 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>>   }
>>   #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>> +
>> +/**
>> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
>> + * over a batch of ptes, which protects against asynchronous hardware
>> + * modifications to the ptes. The intention is not to prevent the hardware from
>> + * making pte updates, but to prevent any updates it may make from being lost.
>> + * Please see the comment above ptep_modify_prot_start() for full description.
>> + *
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
>> + * in the batch.
>> + *
>> + * Note that PTE bits in the PTE batch besides the PFN can differ.
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>> + * Since the batch is determined from folio_pte_batch, the PTEs must differ
>> + * only in a/d bits (and the soft dirty bit; see fpb_t flags in
>> + * mprotect_folio_pte_batch()).
> This last sentence is confusing... You had previously said the PFN can differ, but
> here you imply only a, d and sd bits are allowed to differ.

Forgot to mention the PFNs, kind of took them as implied. So will mentioning the
PFNs as well do, or do you suggest a better wording?

>
>> + */
>> +#ifndef modify_prot_start_ptes
>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
>> +{
>> +	pte_t pte, tmp_pte;
>> +
>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
>> +	while (--nr) {
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>> +		if (pte_dirty(tmp_pte))
>> +			pte = pte_mkdirty(pte);
>> +		if (pte_young(tmp_pte))
>> +			pte = pte_mkyoung(pte);
>> +	}
>> +	return pte;
>> +}
>> +#endif
>> +
>> +/**
>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
>> + * hardware-controlled bits in the PTE unmodified.
>> + *
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
>> + * @pte: New page table entry to be set.
>> + * @nr: Number of entries.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_modify_prot_commit().
>> + *
>> + * Context: The caller holds the page table lock. The PTEs are all in the same
>> + * PMD. On exit, the set ptes in the batch map the same folio. The pte may have
>> + * been "upgraded" w.r.t a/d bits compared to the old_pte, as in, it may have
>> + * a/d bits on which were off in old_pte.
> Same comment as for ptep_modify_prot_start().
>
>> + */
>> +#ifndef modify_prot_commit_ptes
>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr; ++i) {
>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		old_pte = pte_next_pfn(old_pte);
>> +		pte = pte_next_pfn(pte);
>> +	}
>> +}
>> +#endif
>> +
>>   #endif /* CONFIG_MMU */
>>   
>>   /*
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index af10a7fbe6b8..627b0d67cc4a 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -206,7 +206,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   					continue;
>>   			}
>>   
>> -			oldpte = ptep_modify_prot_start(vma, addr, pte);
>> +			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> You're calling this with nr_ptes = 0 for the prot_numa case. But the
> implementation expects minimum nr_ptes == 1.

This will get fixed when I force nr_ptes = 1 in the previous patch right?

>
>>   			ptent = pte_modify(oldpte, newprot);
>>   
>>   			if (uffd_wp)
>> @@ -232,7 +232,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			    can_change_pte_writable(vma, addr, ptent))
>>   				ptent = pte_mkwrite(ptent, vma);
>>   
>> -			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>> +			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>   			if (pte_needs_flush(oldpte, ptent))
>>   				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>   			pages++;


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-06-28 11:34 ` [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching Dev Jain
  2025-06-28 12:39   ` Dev Jain
@ 2025-06-30 10:31   ` Ryan Roberts
  2025-06-30 11:21     ` Dev Jain
  2025-07-01  5:47     ` Dev Jain
  2025-06-30 12:52   ` Lorenzo Stoakes
  2 siblings, 2 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-06-30 10:31 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 28/06/2025 12:34, Dev Jain wrote:
> Use folio_pte_batch to batch process a large folio. Reuse the folio from
> prot_numa case if possible.
> 
> For all cases other than the PageAnonExclusive case, if the case holds true
> for one pte in the batch, one can confirm that that case will hold true for
> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
> and access bits across the batch, therefore batching across
> pte_dirty(): this is correct since the dirty bit on the PTE really is
> just an indication that the folio got written to, so even if the PTE is
> not actually dirty (but one of the PTEs in the batch is), the wp-fault
> optimization can be made.
> 
> The crux now is how to batch around the PageAnonExclusive case; we must
> check the corresponding condition for every single page. Therefore, from
> the large folio batch, we process sub batches of ptes mapping pages with
> the same PageAnonExclusive condition, and process that sub batch, then
> determine and process the next sub batch, and so on. Note that this does
> not cause any extra overhead; if suppose the size of the folio batch
> is 512, then the sub batch processing in total will take 512 iterations,
> which is the same as what we would have done before.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 117 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 627b0d67cc4a..28c7ce7728ff 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -40,35 +40,47 @@
>  
>  #include "internal.h"
>  
> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte)
> -{
> -	struct page *page;
> +enum tristate {
> +	TRI_FALSE = 0,
> +	TRI_TRUE = 1,
> +	TRI_MAYBE = -1,
> +};
>  
> +/*
> + * Returns enum tristate indicating whether the pte can be changed to writable.
> + * If TRI_MAYBE is returned, then the folio is anonymous and the user must
> + * additionally check PageAnonExclusive() for every page in the desired range.
> + */
> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
> +				     unsigned long addr, pte_t pte,
> +				     struct folio *folio)
> +{
>  	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
> -		return false;
> +		return TRI_FALSE;
>  
>  	/* Don't touch entries that are not even readable. */
>  	if (pte_protnone(pte))
> -		return false;
> +		return TRI_FALSE;
>  
>  	/* Do we need write faults for softdirty tracking? */
>  	if (pte_needs_soft_dirty_wp(vma, pte))
> -		return false;
> +		return TRI_FALSE;
>  
>  	/* Do we need write faults for uffd-wp tracking? */
>  	if (userfaultfd_pte_wp(vma, pte))
> -		return false;
> +		return TRI_FALSE;
>  
>  	if (!(vma->vm_flags & VM_SHARED)) {
>  		/*
>  		 * Writable MAP_PRIVATE mapping: We can only special-case on
>  		 * exclusive anonymous pages, because we know that our
>  		 * write-fault handler similarly would map them writable without
> -		 * any additional checks while holding the PT lock.
> +		 * any additional checks while holding the PT lock. So if the
> +		 * folio is not anonymous, we know we cannot change pte to
> +		 * writable. If it is anonymous then the caller must further
> +		 * check that the page is AnonExclusive().
>  		 */
> -		page = vm_normal_page(vma, addr, pte);
> -		return page && PageAnon(page) && PageAnonExclusive(page);
> +		return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
>  	}
>  
>  	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
> @@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	 * FS was already notified and we can simply mark the PTE writable
>  	 * just like the write-fault handler would do.
>  	 */
> -	return pte_dirty(pte);
> +	return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
> +}
> +
> +/*
> + * Returns the number of pages within the folio, starting from the page
> + * indicated by pgidx and up to pgidx + max_nr, that have the same value of
> + * PageAnonExclusive(). Must only be called for anonymous folios. Value of
> + * PageAnonExclusive() is returned in *exclusive.
> + */
> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
> +				bool *exclusive)
> +{
> +	struct page *page;
> +	int nr = 1;
> +
> +	if (!folio) {
> +		*exclusive = false;
> +		return nr;
> +	}
> +
> +	page = folio_page(folio, pgidx++);
> +	*exclusive = PageAnonExclusive(page);
> +	while (nr < max_nr) {
> +		page = folio_page(folio, pgidx++);
> +		if ((*exclusive) != PageAnonExclusive(page))

nit: brackets not required around *exclusive.

> +			break;
> +		nr++;
> +	}
> +
> +	return nr;
> +}
> +
> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> +			     pte_t pte)
> +{
> +	struct page *page;
> +	int ret;
> +
> +	ret = maybe_change_pte_writable(vma, addr, pte, NULL);
> +	if (ret == TRI_MAYBE) {
> +		page = vm_normal_page(vma, addr, pte);
> +		ret = page && PageAnon(page) && PageAnonExclusive(page);
> +	}
> +
> +	return ret;
>  }
>  
>  static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
> -		pte_t *ptep, pte_t pte, int max_nr_ptes)
> +		pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
>  {
> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +	fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +
> +	flags &= ~switch_off_flags;

This is mega confusing when reading the caller, because the caller passes
FPB_IGNORE_SOFT_DIRTY and that actually means DON'T ignore soft dirty.

Can't we just pass in the flags we want?
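
e.g. something like this (a rough sketch, reusing the names from this patch),
with the prot_numa skip path passing FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY
and this path passing only FPB_IGNORE_DIRTY:

	static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
			pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t flags)
	{
		if (!folio || !folio_test_large(folio))
			return 1;

		return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
				       NULL, NULL, NULL);
	}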

>  
> -	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
> +	if (!folio || !folio_test_large(folio))

What's the rationale for dropping the max_nr_ptes == 1 condition? If you don't
need it, why did you add it in the earlier patch?

>  		return 1;
>  
>  	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
> @@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma
>  	}
>  
>  skip_batch:
> -	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
> +	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
> +					   max_nr_ptes, 0);
>  out:
>  	*foliop = folio;
>  	return nr_ptes;
> @@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>  		if (pte_present(oldpte)) {
>  			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>  			struct folio *folio = NULL;
> -			pte_t ptent;
> +			int sub_nr_ptes, pgidx = 0;
> +			pte_t ptent, newpte;
> +			bool sub_set_write;
> +			int set_write;
>  
>  			/*
>  			 * Avoid trapping faults against the zero or KSM
> @@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>  					continue;
>  			}
>  
> +			if (!folio)
> +				folio = vm_normal_folio(vma, addr, oldpte);
> +
> +			nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
> +							   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);

From the other thread, my memory is jogged that this function ignores the write
permission bit. So I think that's opening up a bug when applied here? If the
first pte is writable but the rest are not (COW), doesn't this now make them all
writable? I don't *think* that's a problem for the prot_numa use, but I could be
wrong.

>  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);

Even if I'm wrong about ignoring write bit being a bug, I don't think the docs
for this function permit write bit to be different across the batch?

>  			ptent = pte_modify(oldpte, newprot);
>  
> @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			 * example, if a PTE is already dirty and no other
>  			 * COW or special handling is required.
>  			 */
> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> -			    !pte_write(ptent) &&
> -			    can_change_pte_writable(vma, addr, ptent))
> -				ptent = pte_mkwrite(ptent, vma);
> -
> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> -			if (pte_needs_flush(oldpte, ptent))
> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> -			pages++;
> +			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> +				    !pte_write(ptent);
> +			if (set_write)
> +				set_write = maybe_change_pte_writable(vma, addr, ptent, folio);

Why not just:
			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
				    !pte_write(ptent) &&
				    maybe_change_pte_writable(...);

?

> +
> +			while (nr_ptes) {
> +				if (set_write == TRI_MAYBE) {
> +					sub_nr_ptes = anon_exclusive_batch(folio,
> +						pgidx, nr_ptes, &sub_set_write);
> +				} else {
> +					sub_nr_ptes = nr_ptes;
> +					sub_set_write = (set_write == TRI_TRUE);
> +				}
> +
> +				if (sub_set_write)
> +					newpte = pte_mkwrite(ptent, vma);
> +				else
> +					newpte = ptent;
> +
> +				modify_prot_commit_ptes(vma, addr, pte, oldpte,
> +							newpte, sub_nr_ptes);
> +				if (pte_needs_flush(oldpte, newpte))

What did we conclude with pte_needs_flush()? I thought there was an arch where
it looked dodgy calling this for just the pte at the head of the batch?

Thanks,
Ryan

> +					tlb_flush_pte_range(tlb, addr,
> +						sub_nr_ptes * PAGE_SIZE);
> +
> +				addr += sub_nr_ptes * PAGE_SIZE;
> +				pte += sub_nr_ptes;
> +				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
> +				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
> +				nr_ptes -= sub_nr_ptes;
> +				pages += sub_nr_ptes;
> +				pgidx += sub_nr_ptes;
> +			}
>  		} else if (is_swap_pte(oldpte)) {
>  			swp_entry_t entry = pte_to_swp_entry(oldpte);
>  			pte_t newpte;



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-06-30 10:17     ` Dev Jain
@ 2025-06-30 10:35       ` Ryan Roberts
  2025-06-30 10:42         ` Dev Jain
  0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-06-30 10:35 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 30/06/2025 11:17, Dev Jain wrote:
> 
> On 30/06/25 3:40 pm, Ryan Roberts wrote:
>> On 28/06/2025 12:34, Dev Jain wrote:
>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>> Architecture can override these helpers; in case not, they are implemented
>>> as a simple loop over the corresponding single pte helpers.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>   include/linux/pgtable.h | 83 ++++++++++++++++++++++++++++++++++++++++-
>>>   mm/mprotect.c           |  4 +-
>>>   2 files changed, 84 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index cf1515c163e2..662f39e7475a 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -1331,7 +1331,8 @@ static inline pte_t ptep_modify_prot_start(struct
>>> vm_area_struct *vma,
>>>     /*
>>>    * Commit an update to a pte, leaving any hardware-controlled bits in
>>> - * the PTE unmodified.
>>> + * the PTE unmodified. The pte may have been "upgraded" w.r.t a/d bits compared
>>> + * to the old_pte, as in, it may have a/d bits on which were off in old_pte.
>> I find this last sentence a bit confusing. I think what you are trying to say is
>> something like:
>>
>> """
>> old_pte is the value returned from ptep_modify_prot_start() but may additionally
>> have young and/or dirty bits set where previously they were not.
>> """
> 
> Thanks.
> 
>> ?
>>
>>>    */
>>>   static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>>                          unsigned long addr,
>>> @@ -1340,6 +1341,86 @@ static inline void ptep_modify_prot_commit(struct
>>> vm_area_struct *vma,
>>>       __ptep_modify_prot_commit(vma, addr, ptep, pte);
>>>   }
>>>   #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>>> +
>>> +/**
>>> + * modify_prot_start_ptes - Start a pte protection read-modify-write
>>> transaction
>>> + * over a batch of ptes, which protects against asynchronous hardware
>>> + * modifications to the ptes. The intention is not to prevent the hardware from
>>> + * making pte updates, but to prevent any updates it may make from being lost.
>>> + * Please see the comment above ptep_modify_prot_start() for full description.
>>> + *
>>> + * @vma: The virtual memory area the pages are mapped into.
>>> + * @addr: Address the first page is mapped at.
>>> + * @ptep: Page table pointer for the first entry.
>>> + * @nr: Number of entries.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
>>> + * in the batch.
>>> + *
>>> + * Note that PTE bits in the PTE batch besides the PFN can differ.
>>> + *
>>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>>> + * Since the batch is determined from folio_pte_batch, the PTEs must differ
>>> + * only in a/d bits (and the soft dirty bit; see fpb_t flags in
>>> + * mprotect_folio_pte_batch()).
>> This last sentence is confusing... You had previously said the PFN can differ, but
>> here you imply only a, d and sd bits are allowed to differ.
> 
>> Forgot to mention the PFNs, kind of took them as implied. So will mentioning the
>> PFNs as well do, or do you suggest a better wording?

Perhaps:

"""
Context: The caller holds the page table lock.  The PTEs map consecutive
pages that belong to the same folio.  All other PTE bits must be identical for
all PTEs in the batch except for young and dirty bits.  The PTEs are all in the
same PMD.
"""

You mention the soft dirty bit not needing to be the same in your current
wording, but I don't think that is correct? soft dirty needs to be the same, right?

> 
>>
>>> + */
>>> +#ifndef modify_prot_start_ptes
>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>> +        unsigned long addr, pte_t *ptep, unsigned int nr)
>>> +{
>>> +    pte_t pte, tmp_pte;
>>> +
>>> +    pte = ptep_modify_prot_start(vma, addr, ptep);
>>> +    while (--nr) {
>>> +        ptep++;
>>> +        addr += PAGE_SIZE;
>>> +        tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>> +        if (pte_dirty(tmp_pte))
>>> +            pte = pte_mkdirty(pte);
>>> +        if (pte_young(tmp_pte))
>>> +            pte = pte_mkyoung(pte);
>>> +    }
>>> +    return pte;
>>> +}
>>> +#endif
>>> +
>>> +/**
>>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
>>> + * hardware-controlled bits in the PTE unmodified.
>>> + *
>>> + * @vma: The virtual memory area the pages are mapped into.
>>> + * @addr: Address the first page is mapped at.
>>> + * @ptep: Page table pointer for the first entry.
>>> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
>>> + * @pte: New page table entry to be set.
>>> + * @nr: Number of entries.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>> + * loop over ptep_modify_prot_commit().
>>> + *
>>> + * Context: The caller holds the page table lock. The PTEs are all in the same
>>> + * PMD. On exit, the set ptes in the batch map the same folio. The pte may have
>>> + * been "upgraded" w.r.t a/d bits compared to the old_pte, as in, it may have
>>> + * a/d bits on which were off in old_pte.
>> Same comment as for ptep_modify_prot_start().
>>
>>> + */
>>> +#ifndef modify_prot_commit_ptes
>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
>>> unsigned long addr,
>>> +        pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>>> +{
>>> +    int i;
>>> +
>>> +    for (i = 0; i < nr; ++i) {
>>> +        ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>> +        ptep++;
>>> +        addr += PAGE_SIZE;
>>> +        old_pte = pte_next_pfn(old_pte);
>>> +        pte = pte_next_pfn(pte);
>>> +    }
>>> +}
>>> +#endif
>>> +
>>>   #endif /* CONFIG_MMU */
>>>     /*
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index af10a7fbe6b8..627b0d67cc4a 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -206,7 +206,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                       continue;
>>>               }
>>>   -            oldpte = ptep_modify_prot_start(vma, addr, pte);
>>> +            oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>> You're calling this with nr_ptes = 0 for the prot_numa case. But the
>> implementation expects minimum nr_ptes == 1.
> 
> This will get fixed when I force nr_ptes = 1 in the previous patch right?

Yep, just pointing it out.

> 
>>
>>>               ptent = pte_modify(oldpte, newprot);
>>>                 if (uffd_wp)
>>> @@ -232,7 +232,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                   can_change_pte_writable(vma, addr, ptent))
>>>                   ptent = pte_mkwrite(ptent, vma);
>>>   -            ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>>> +            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>>               if (pte_needs_flush(oldpte, ptent))
>>>                   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>>               pages++;



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-06-30 10:35       ` Ryan Roberts
@ 2025-06-30 10:42         ` Dev Jain
  0 siblings, 0 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-30 10:42 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 30/06/25 4:05 pm, Ryan Roberts wrote:
> On 30/06/2025 11:17, Dev Jain wrote:
>> On 30/06/25 3:40 pm, Ryan Roberts wrote:
>>> On 28/06/2025 12:34, Dev Jain wrote:
>>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>>> Architecture can override these helpers; in case not, they are implemented
>>>> as a simple loop over the corresponding single pte helpers.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>    include/linux/pgtable.h | 83 ++++++++++++++++++++++++++++++++++++++++-
>>>>    mm/mprotect.c           |  4 +-
>>>>    2 files changed, 84 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index cf1515c163e2..662f39e7475a 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -1331,7 +1331,8 @@ static inline pte_t ptep_modify_prot_start(struct
>>>> vm_area_struct *vma,
>>>>      /*
>>>>     * Commit an update to a pte, leaving any hardware-controlled bits in
>>>> - * the PTE unmodified.
>>>> + * the PTE unmodified. The pte may have been "upgraded" w.r.t a/d bits compared
>>>> + * to the old_pte, as in, it may have a/d bits on which were off in old_pte.
>>> I find this last sentence a bit confusing. I think what you are trying to say is
>>> something like:
>>>
>>> """
>>> old_pte is the value returned from ptep_modify_prot_start() but may additionally
>>> have young and/or dirty bits set where previously they were not.
>>> """
>> Thanks.
>>
>>> ?
>>>
>>>>     */
>>>>    static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>>>                           unsigned long addr,
>>>> @@ -1340,6 +1341,86 @@ static inline void ptep_modify_prot_commit(struct
>>>> vm_area_struct *vma,
>>>>        __ptep_modify_prot_commit(vma, addr, ptep, pte);
>>>>    }
>>>>    #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>>>> +
>>>> +/**
>>>> + * modify_prot_start_ptes - Start a pte protection read-modify-write
>>>> transaction
>>>> + * over a batch of ptes, which protects against asynchronous hardware
>>>> + * modifications to the ptes. The intention is not to prevent the hardware from
>>>> + * making pte updates, but to prevent any updates it may make from being lost.
>>>> + * Please see the comment above ptep_modify_prot_start() for full description.
>>>> + *
>>>> + * @vma: The virtual memory area the pages are mapped into.
>>>> + * @addr: Address the first page is mapped at.
>>>> + * @ptep: Page table pointer for the first entry.
>>>> + * @nr: Number of entries.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>>> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
>>>> + * in the batch.
>>>> + *
>>>> + * Note that PTE bits in the PTE batch besides the PFN can differ.
>>>> + *
>>>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>>>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>>>> + * Since the batch is determined from folio_pte_batch, the PTEs must differ
>>>> + * only in a/d bits (and the soft dirty bit; see fpb_t flags in
>>>> + * mprotect_folio_pte_batch()).
>>> This last sentence is confusing... You had previously said the PFN can differ, but
>>> here you imply only a, d and sd bits are allowed to differ.
>> Forgot to mention the PFNs, kind of took them as implied. So will mentioning the
>> PFNs as well do, or do you suggest a better wording?
> Perhaps:
>
> """
> Context: The caller holds the page table lock.  The PTEs map consecutive
> pages that belong to the same folio.  All other PTE bits must be identical for
> all PTEs in the batch except for young and dirty bits.  The PTEs are all in the
> same PMD.
> """
>
> You mention the soft dirty bit not needing to be the same in your current
> wording, but I don't think that is correct? soft dirty needs to be the same, right?

Oh god, confused this with the skipping case, you are right.

>
>>>> + */
>>>> +#ifndef modify_prot_start_ptes
>>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>>> +        unsigned long addr, pte_t *ptep, unsigned int nr)
>>>> +{
>>>> +    pte_t pte, tmp_pte;
>>>> +
>>>> +    pte = ptep_modify_prot_start(vma, addr, ptep);
>>>> +    while (--nr) {
>>>> +        ptep++;
>>>> +        addr += PAGE_SIZE;
>>>> +        tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>>> +        if (pte_dirty(tmp_pte))
>>>> +            pte = pte_mkdirty(pte);
>>>> +        if (pte_young(tmp_pte))
>>>> +            pte = pte_mkyoung(pte);
>>>> +    }
>>>> +    return pte;
>>>> +}
>>>> +#endif
>>>> +
>>>> +/**
>>>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
>>>> + * hardware-controlled bits in the PTE unmodified.
>>>> + *
>>>> + * @vma: The virtual memory area the pages are mapped into.
>>>> + * @addr: Address the first page is mapped at.
>>>> + * @ptep: Page table pointer for the first entry.
>>>> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
>>>> + * @pte: New page table entry to be set.
>>>> + * @nr: Number of entries.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>>> + * loop over ptep_modify_prot_commit().
>>>> + *
>>>> + * Context: The caller holds the page table lock. The PTEs are all in the same
>>>> + * PMD. On exit, the set ptes in the batch map the same folio. The pte may have
>>>> + * been "upgraded" w.r.t a/d bits compared to the old_pte, as in, it may have
>>>> + * a/d bits on which were off in old_pte.
>>> Same comment as for ptep_modify_prot_start().
>>>
>>>> + */
>>>> +#ifndef modify_prot_commit_ptes
>>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
>>>> unsigned long addr,
>>>> +        pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < nr; ++i) {
>>>> +        ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>>> +        ptep++;
>>>> +        addr += PAGE_SIZE;
>>>> +        old_pte = pte_next_pfn(old_pte);
>>>> +        pte = pte_next_pfn(pte);
>>>> +    }
>>>> +}
>>>> +#endif
>>>> +
>>>>    #endif /* CONFIG_MMU */
>>>>      /*
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index af10a7fbe6b8..627b0d67cc4a 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -206,7 +206,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>                        continue;
>>>>                }
>>>>    -            oldpte = ptep_modify_prot_start(vma, addr, pte);
>>>> +            oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>> You're calling this with nr_ptes = 0 for the prot_numa case. But the
>>> implementation expects minimum nr_ptes == 1.
>> This will get fixed when I force nr_ptes = 1 in the previous patch right?
> Yep, just pointing it out.
>
>>>>                ptent = pte_modify(oldpte, newprot);
>>>>                  if (uffd_wp)
>>>> @@ -232,7 +232,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>                    can_change_pte_writable(vma, addr, ptent))
>>>>                    ptent = pte_mkwrite(ptent, vma);
>>>>    -            ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>>>> +            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>>>                if (pte_needs_flush(oldpte, ptent))
>>>>                    tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>>>                pages++;


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 4/4] arm64: Add batched versions of ptep_modify_prot_start/commit
  2025-06-28 11:34 ` [PATCH v4 4/4] arm64: Add batched versions of ptep_modify_prot_start/commit Dev Jain
@ 2025-06-30 10:43   ` Ryan Roberts
  0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-06-30 10:43 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 28/06/2025 12:34, Dev Jain wrote:
> Override the generic definition of modify_prot_start_ptes() to use
> get_and_clear_full_ptes(). This helper does a TLBI only for the starting
> and ending contpte block of the range, whereas the current implementation
> will call ptep_get_and_clear() for every contpte block, thus doing a
> TLBI on every contpte block. Therefore, we have a performance win.
> 
> The arm64 definition of pte_accessible() allows us to batch in the
> errata specific case:
> 
> #define pte_accessible(mm, pte)	\
> 	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> 
> All ptes are obviously present in the folio batch, and they are also valid.
> 
> Override the generic definition of modify_prot_commit_ptes() to simply
> use set_ptes() to map the new ptes into the pagetable.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  arch/arm64/include/asm/pgtable.h | 10 ++++++++++
>  arch/arm64/mm/mmu.c              | 28 +++++++++++++++++++++++-----
>  2 files changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index ba63c8736666..abd2dee416b3 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1643,6 +1643,16 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  				    unsigned long addr, pte_t *ptep,
>  				    pte_t old_pte, pte_t new_pte);
>  
> +#define modify_prot_start_ptes modify_prot_start_ptes
> +extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +				    unsigned long addr, pte_t *ptep,
> +				    unsigned int nr);
> +
> +#define modify_prot_commit_ptes modify_prot_commit_ptes
> +extern void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> +				    pte_t *ptep, pte_t old_pte, pte_t pte,
> +				    unsigned int nr);
> +
>  #ifdef CONFIG_ARM64_CONTPTE
>  
>  /*
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 3d5fb37424ab..38325616f467 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -26,6 +26,7 @@
>  #include <linux/set_memory.h>
>  #include <linux/kfence.h>
>  #include <linux/pkeys.h>
> +#include <linux/mm_inline.h>
>  
>  #include <asm/barrier.h>
>  #include <asm/cputype.h>
> @@ -1524,24 +1525,41 @@ static int __init prevent_bootmem_remove_init(void)
>  early_initcall(prevent_bootmem_remove_init);
>  #endif
>  
> -pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
> +pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
> +			     pte_t *ptep, unsigned int nr)
>  {
> +	pte_t pte = get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);
> +
>  	if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
>  		/*
>  		 * Break-before-make (BBM) is required for all user space mappings
>  		 * when the permission changes from executable to non-executable
>  		 * in cases where cpu is affected with errata #2645198.
>  		 */
> -		if (pte_user_exec(ptep_get(ptep)))
> -			return ptep_clear_flush(vma, addr, ptep);
> +		if (pte_accessible(vma->vm_mm, pte) && pte_user_exec(pte))
> +			__flush_tlb_range(vma, addr, nr * PAGE_SIZE,
> +					  PAGE_SIZE, true, 3);
>  	}
> -	return ptep_get_and_clear(vma->vm_mm, addr, ptep);
> +
> +	return pte;
> +}
> +
> +pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
> +{
> +	return modify_prot_start_ptes(vma, addr, ptep, 1);
> +}
> +
> +void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> +			     pte_t *ptep, pte_t old_pte, pte_t pte,
> +			     unsigned int nr)
> +{
> +	set_ptes(vma->vm_mm, addr, ptep, pte, nr);
>  }
>  
>  void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
>  			     pte_t old_pte, pte_t pte)
>  {
> -	set_pte_at(vma->vm_mm, addr, ptep, pte);
> +	modify_prot_commit_ptes(vma, addr, ptep, old_pte, pte, 1);
>  }
>  
>  /*



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 0/4] Optimize mprotect() for large folios
  2025-06-30  3:33   ` Dev Jain
@ 2025-06-30 10:45     ` Ryan Roberts
  2025-06-30 11:22       ` Dev Jain
  0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-06-30 10:45 UTC (permalink / raw)
  To: Dev Jain, Andrew Morton
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 30/06/2025 04:33, Dev Jain wrote:
> 
> On 30/06/25 4:35 am, Andrew Morton wrote:
>> On Sat, 28 Jun 2025 17:04:31 +0530 Dev Jain <dev.jain@arm.com> wrote:
>>
>>> This patchset optimizes the mprotect() system call for large folios
>>> by PTE-batching. No issues were observed with mm-selftests, build
>>> tested on x86_64.
>> um what.  Seems to claim that "selftests still compiles after I messed
>> with stuff", which isn't very impressive ;)  Please clarify?
> 
> Sorry, I meant to say that the mm-selftests pass.

I think you're saying you both compiled and ran the mm selftests for arm64. And
additionally you compiled for x86_64? (Just trying to help clarify).


> 
>>
>>> We use the following test cases to measure performance, mprotect()'ing
>>> the mapped memory to read-only then read-write 40 times:
>>>
>>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>>> pte-mapping those THPs
>>> Test case 2: Mapping 1G of memory with 64K mTHPs
>>> Test case 3: Mapping 1G of memory with 4K pages
>>>
>>> Average execution time on arm64, Apple M3:
>>> Before the patchset:
>>> T1: 7.9 seconds   T2: 7.9 seconds   T3: 4.2 seconds
>>>
>>> After the patchset:
>>> T1: 2.1 seconds   T2: 2.2 seconds   T3: 4.3 seconds
>> Well that's tasty.
>>
>>> Observing T1/T2 and T3 before the patchset, we also remove the regression
>>> introduced by ptep_get() on a contpte block. And, for large folios we get
>>> an almost 74% performance improvement, albeit the trade-off being a slight
>>> degradation in the small folio case.
>>>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 0/4] Optimize mprotect() for large folios
  2025-06-28 11:34 [PATCH v4 0/4] Optimize mprotect() for large folios Dev Jain
                   ` (4 preceding siblings ...)
  2025-06-29 23:05 ` [PATCH v4 0/4] Optimize mprotect() for large folios Andrew Morton
@ 2025-06-30 11:17 ` Lorenzo Stoakes
  2025-06-30 11:25   ` Dev Jain
  2025-06-30 11:27 ` Lorenzo Stoakes
  6 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 11:17 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Sat, Jun 28, 2025 at 05:04:31PM +0530, Dev Jain wrote:
> This patchset optimizes the mprotect() system call for large folios
> by PTE-batching. No issues were observed with mm-selftests, build
> tested on x86_64.

This should also be run-tested on x86-64, not only build-tested :)

You are still not really giving details here, so same comment as on your mremap()
series: please explain why you're doing this, what for, what benefits you expect
to achieve, where, etc.

E.g. 'this is designed to optimise mTHP cases on arm64, we expect to see
benefits on amd64 also, and for intel there should be no impact'.

It's probably also worth actually going and checking to make sure that this is
the case re: other arches. See below on that...

>
> We use the following test cases to measure performance, mprotect()'ing
> the mapped memory to read-only then read-write 40 times:
>
> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> pte-mapping those THPs
> Test case 2: Mapping 1G of memory with 64K mTHPs
> Test case 3: Mapping 1G of memory with 4K pages
>
> Average execution time on arm64, Apple M3:
> Before the patchset:
> T1: 7.9 seconds   T2: 7.9 seconds   T3: 4.2 seconds
>
> After the patchset:
> T1: 2.1 seconds   T2: 2.2 seconds   T3: 4.3 seconds
>
> Observing T1/T2 and T3 before the patchset, we also remove the regression
> introduced by ptep_get() on a contpte block. And, for large folios we get
> an almost 74% performance improvement, albeit the trade-off being a slight
> degradation in the small folio case.

This is nice, though order-0 is probably going to be your bread and butter no?

Having said that, mprotect() is not a hot path, this delta is small enough to
quite possibly just be noise, and personally I'm not all that bothered.

But let's run this same test on x86-64 too please and get some before/after
numbers just to confirm no major impact.

Thanks for including code.

>
> Here is the test program:
>
>  #define _GNU_SOURCE
>  #include <sys/mman.h>
>  #include <stdlib.h>
>  #include <string.h>
>  #include <stdio.h>
>  #include <unistd.h>
>
>  #define SIZE (1024*1024*1024)
>
> unsigned long pmdsize = (1UL << 21);
> unsigned long pagesize = (1UL << 12);
>
> static void pte_map_thps(char *mem, size_t size)
> {
> 	size_t offs;
> 	int ret = 0;
>
>
> 	/* PTE-map each THP by temporarily splitting the VMAs. */
> 	for (offs = 0; offs < size; offs += pmdsize) {
> 		ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
> 		ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
> 	}
>
> 	if (ret) {
> 		fprintf(stderr, "ERROR: mprotect() failed\n");
> 		exit(1);
> 	}
> }
>
> int main(int argc, char *argv[])
> {
> 	char *p;
>         int ret = 0;
> 	p = mmap((1UL << 30), SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	if (p != (1UL << 30)) {
> 		perror("mmap");
> 		return 1;
> 	}
>
>
>
> 	memset(p, 0, SIZE);
> 	if (madvise(p, SIZE, MADV_NOHUGEPAGE))
> 		perror("madvise");
> 	explicit_bzero(p, SIZE);
> 	pte_map_thps(p, SIZE);
>
> 	for (int loops = 0; loops < 40; loops++) {
> 		if (mprotect(p, SIZE, PROT_READ))
> 			perror("mprotect"), exit(1);
> 		if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
> 			perror("mprotect"), exit(1);
> 		explicit_bzero(p, SIZE);
> 	}
> }
>
> ---
> The patchset is rebased onto Saturday's mm-new.
>
> v3->v4:
>  - Refactor skipping logic into a new function, edit patch 1 subject
>    to highlight it is only for MM_CP_PROT_NUMA case (David H)
>  - Refactor the optimization logic, add more documentation to the generic
>    batched functions, do not add clear_flush_ptes, squash patch 4
>    and 5 (Ryan)
>
> v2->v3:
>  - Add comments for the new APIs (Ryan, Lorenzo)
>  - Instead of refactoring, use a "skip_batch" label
>  - Move arm64 patches at the end (Ryan)
>  - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
>  - Resolve implicit declaration; tested build on x86 (Lance Yang)
>
> v1->v2:
>  - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
>  - Abridge the anon-exclusive condition (Lance Yang)
>
> Dev Jain (4):
>   mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
>   mm: Add batched versions of ptep_modify_prot_start/commit
>   mm: Optimize mprotect() by PTE-batching
>   arm64: Add batched versions of ptep_modify_prot_start/commit
>
>  arch/arm64/include/asm/pgtable.h |  10 ++
>  arch/arm64/mm/mmu.c              |  28 +++-
>  include/linux/pgtable.h          |  83 +++++++++-
>  mm/mprotect.c                    | 269 +++++++++++++++++++++++--------
>  4 files changed, 315 insertions(+), 75 deletions(-)
>
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-06-30 10:31   ` Ryan Roberts
@ 2025-06-30 11:21     ` Dev Jain
  2025-06-30 11:47       ` Dev Jain
  2025-06-30 11:50       ` Ryan Roberts
  2025-07-01  5:47     ` Dev Jain
  1 sibling, 2 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-30 11:21 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 30/06/25 4:01 pm, Ryan Roberts wrote:
> On 28/06/2025 12:34, Dev Jain wrote:
>> Use folio_pte_batch to batch process a large folio. Reuse the folio from
>> prot_numa case if possible.
>>
>> For all cases other than the PageAnonExclusive case, if the case holds true
>> for one pte in the batch, one can confirm that that case will hold true for
>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
>> and access bits across the batch, therefore batching across
>> pte_dirty(): this is correct since the dirty bit on the PTE really is
>> just an indication that the folio got written to, so even if the PTE is
>> not actually dirty (but one of the PTEs in the batch is), the wp-fault
>> optimization can be made.
>>
>> The crux now is how to batch around the PageAnonExclusive case; we must
>> check the corresponding condition for every single page. Therefore, from
>> the large folio batch, we process sub batches of ptes mapping pages with
>> the same PageAnonExclusive condition, and process that sub batch, then
>> determine and process the next sub batch, and so on. Note that this does
>> not cause any extra overhead; if suppose the size of the folio batch
>> is 512, then the sub batch processing in total will take 512 iterations,
>> which is the same as what we would have done before.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 117 insertions(+), 26 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 627b0d67cc4a..28c7ce7728ff 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -40,35 +40,47 @@
>>   
>>   #include "internal.h"
>>   
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -			     pte_t pte)
>> -{
>> -	struct page *page;
>> +enum tristate {
>> +	TRI_FALSE = 0,
>> +	TRI_TRUE = 1,
>> +	TRI_MAYBE = -1,
>> +};
>>   
>> +/*
>> + * Returns enum tristate indicating whether the pte can be changed to writable.
>> + * If TRI_MAYBE is returned, then the folio is anonymous and the user must
>> + * additionally check PageAnonExclusive() for every page in the desired range.
>> + */
>> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
>> +				     unsigned long addr, pte_t pte,
>> +				     struct folio *folio)
>> +{
>>   	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>> -		return false;
>> +		return TRI_FALSE;
>>   
>>   	/* Don't touch entries that are not even readable. */
>>   	if (pte_protnone(pte))
>> -		return false;
>> +		return TRI_FALSE;
>>   
>>   	/* Do we need write faults for softdirty tracking? */
>>   	if (pte_needs_soft_dirty_wp(vma, pte))
>> -		return false;
>> +		return TRI_FALSE;
>>   
>>   	/* Do we need write faults for uffd-wp tracking? */
>>   	if (userfaultfd_pte_wp(vma, pte))
>> -		return false;
>> +		return TRI_FALSE;
>>   
>>   	if (!(vma->vm_flags & VM_SHARED)) {
>>   		/*
>>   		 * Writable MAP_PRIVATE mapping: We can only special-case on
>>   		 * exclusive anonymous pages, because we know that our
>>   		 * write-fault handler similarly would map them writable without
>> -		 * any additional checks while holding the PT lock.
>> +		 * any additional checks while holding the PT lock. So if the
>> +		 * folio is not anonymous, we know we cannot change pte to
>> +		 * writable. If it is anonymous then the caller must further
>> +		 * check that the page is AnonExclusive().
>>   		 */
>> -		page = vm_normal_page(vma, addr, pte);
>> -		return page && PageAnon(page) && PageAnonExclusive(page);
>> +		return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
>>   	}
>>   
>>   	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>> @@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   	 * FS was already notified and we can simply mark the PTE writable
>>   	 * just like the write-fault handler would do.
>>   	 */
>> -	return pte_dirty(pte);
>> +	return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
>> +}
>> +
>> +/*
>> + * Returns the number of pages within the folio, starting from the page
>> + * indicated by pgidx and up to pgidx + max_nr, that have the same value of
>> + * PageAnonExclusive(). Must only be called for anonymous folios. Value of
>> + * PageAnonExclusive() is returned in *exclusive.
>> + */
>> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
>> +				bool *exclusive)
>> +{
>> +	struct page *page;
>> +	int nr = 1;
>> +
>> +	if (!folio) {
>> +		*exclusive = false;
>> +		return nr;
>> +	}
>> +
>> +	page = folio_page(folio, pgidx++);
>> +	*exclusive = PageAnonExclusive(page);
>> +	while (nr < max_nr) {
>> +		page = folio_page(folio, pgidx++);
>> +		if ((*exclusive) != PageAnonExclusive(page))
> nit: brackets not required around *exclusive.

Thanks, I'll drop it. I have a habit of putting brackets everywhere
because debugging operator precedence bugs is a nightmare - finally
the time has come to learn operator precedence!

>
>> +			break;
>> +		nr++;
>> +	}
>> +
>> +	return nr;
>> +}
>> +
>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> +			     pte_t pte)
>> +{
>> +	struct page *page;
>> +	int ret;
>> +
>> +	ret = maybe_change_pte_writable(vma, addr, pte, NULL);
>> +	if (ret == TRI_MAYBE) {
>> +		page = vm_normal_page(vma, addr, pte);
>> +		ret = page && PageAnon(page) && PageAnonExclusive(page);
>> +	}
>> +
>> +	return ret;
>>   }
>>   
>>   static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>> -		pte_t *ptep, pte_t pte, int max_nr_ptes)
>> +		pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
>>   {
>> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +
>> +	flags &= ~switch_off_flags;
> This is mega confusing when reading the caller. Because the caller passes
> FPB_IGNORE_SOFT_DIRTY and that actually means DON'T ignore soft dirty.
>
> Can't we just pass in the flags we want?

Yup that is cleaner.

>
>>   
>> -	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>> +	if (!folio || !folio_test_large(folio))
> What's the rationale for dropping the max_nr_ptes == 1 condition? If you don't
> need it, why did you add it in the earlier patch?

Stupid me forgot to drop it from the earlier patch.

>
>>   		return 1;
>>   
>>   	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>> @@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma
>>   	}
>>   
>>   skip_batch:
>> -	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>> +	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>> +					   max_nr_ptes, 0);
>>   out:
>>   	*foliop = folio;
>>   	return nr_ptes;
>> @@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   		if (pte_present(oldpte)) {
>>   			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>   			struct folio *folio = NULL;
>> -			pte_t ptent;
>> +			int sub_nr_ptes, pgidx = 0;
>> +			pte_t ptent, newpte;
>> +			bool sub_set_write;
>> +			int set_write;
>>   
>>   			/*
>>   			 * Avoid trapping faults against the zero or KSM
>> @@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   					continue;
>>   			}
>>   
>> +			if (!folio)
>> +				folio = vm_normal_folio(vma, addr, oldpte);
>> +
>> +			nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>> +							   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
>  From the other thread, my memory is jogged that this function ignores write
> permission bit. So I think that's opening up a bug when applied here? If the
> first pte is writable but the rest are not (COW), doesn't this now make them all
> writable? I don't *think* that's a problem for the prot_numa use, but I could be
> wrong.

No, we are not ignoring the write permission bit. There is no way currently to
do that via folio_pte_batch. So the pte batch is either entirely writable or
entirely not.

>
>>   			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> Even if I'm wrong about ignoring write bit being a bug, I don't think the docs
> for this function permit write bit to be different across the batch?
>
>>   			ptent = pte_modify(oldpte, newprot);
>>   
>> @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			 * example, if a PTE is already dirty and no other
>>   			 * COW or special handling is required.
>>   			 */
>> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> -			    !pte_write(ptent) &&
>> -			    can_change_pte_writable(vma, addr, ptent))
>> -				ptent = pte_mkwrite(ptent, vma);
>> -
>> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> -			if (pte_needs_flush(oldpte, ptent))
>> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> -			pages++;
>> +			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> +				    !pte_write(ptent);
>> +			if (set_write)
>> +				set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
> Why not just:
> 			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> 				    !pte_write(ptent) &&
> 				    maybe_change_pte_writable(...);

set_write is an int, which is supposed to span {TRI_MAYBE, TRI_FALSE, TRI_TRUE},
whereas the RHS of this statement will always return a boolean.

You proposed it like this in your diff; it took hours for my eyes to catch this : )
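
To spell the point out for other readers (this just restates the quoted code,
it is not new logic): the result of && in C is always 0 or 1, so folding the
call into the condition would silently collapse the tristate.

	/* && yields 0 or 1, so TRI_MAYBE (-1) would be collapsed to 1 (TRI_TRUE). */
	set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
		    !pte_write(ptent) &&
		    maybe_change_pte_writable(vma, addr, ptent, folio);

Hence the two-step assignment in the patch, which preserves TRI_MAYBE.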

>
> ?
>
>> +
>> +			while (nr_ptes) {
>> +				if (set_write == TRI_MAYBE) {
>> +					sub_nr_ptes = anon_exclusive_batch(folio,
>> +						pgidx, nr_ptes, &sub_set_write);
>> +				} else {
>> +					sub_nr_ptes = nr_ptes;
>> +					sub_set_write = (set_write == TRI_TRUE);
>> +				}
>> +
>> +				if (sub_set_write)
>> +					newpte = pte_mkwrite(ptent, vma);
>> +				else
>> +					newpte = ptent;
>> +
>> +				modify_prot_commit_ptes(vma, addr, pte, oldpte,
>> +							newpte, sub_nr_ptes);
>> +				if (pte_needs_flush(oldpte, newpte))
> What did we conclude with pte_needs_flush()? I thought there was an arch where
> it looked dodgy calling this for just the pte at the head of the batch?

Powerpc flushes if the access bit transitions from set to unset; x86 does that
for both the dirty and access bits. Both problems are solved by
modify_prot_start_ptes(), which collects the a/d bits across the batch, both in
the generic implementation and the arm64
implementation.
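
As a rough sketch of what "collects the a/d bits" means here - this is an
assumption about the helper's shape, not the actual patch 2 code, hence the
deliberately different name:

	static inline pte_t modify_prot_start_ptes_sketch(struct vm_area_struct *vma,
							  unsigned long addr,
							  pte_t *ptep, unsigned int nr)
	{
		pte_t pte, tmp;
		unsigned int i;

		pte = ptep_modify_prot_start(vma, addr, ptep);
		for (i = 1; i < nr; i++) {
			addr += PAGE_SIZE;
			ptep++;
			tmp = ptep_modify_prot_start(vma, addr, ptep);
			/* Fold access/dirty into the pte returned for the batch. */
			if (pte_dirty(tmp))
				pte = pte_mkdirty(pte);
			if (pte_young(tmp))
				pte = pte_mkyoung(pte);
		}
		return pte;
	}

With the returned pte representative of the whole batch, calling
pte_needs_flush() on that single value is sufficient for the architectures
mentioned above.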

>
> Thanks,
> Ryan
>
>> +					tlb_flush_pte_range(tlb, addr,
>> +						sub_nr_ptes * PAGE_SIZE);
>> +
>> +				addr += sub_nr_ptes * PAGE_SIZE;
>> +				pte += sub_nr_ptes;
>> +				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>> +				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>> +				nr_ptes -= sub_nr_ptes;
>> +				pages += sub_nr_ptes;
>> +				pgidx += sub_nr_ptes;
>> +			}
>>   		} else if (is_swap_pte(oldpte)) {
>>   			swp_entry_t entry = pte_to_swp_entry(oldpte);
>>   			pte_t newpte;


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 0/4] Optimize mprotect() for large folios
  2025-06-30 10:45     ` Ryan Roberts
@ 2025-06-30 11:22       ` Dev Jain
  0 siblings, 0 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-30 11:22 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 30/06/25 4:15 pm, Ryan Roberts wrote:
> On 30/06/2025 04:33, Dev Jain wrote:
>> On 30/06/25 4:35 am, Andrew Morton wrote:
>>> On Sat, 28 Jun 2025 17:04:31 +0530 Dev Jain <dev.jain@arm.com> wrote:
>>>
>>>> This patchset optimizes the mprotect() system call for large folios
>>>> by PTE-batching. No issues were observed with mm-selftests, build
>>>> tested on x86_64.
>>> um what.  Seems to claim that "selftests still compiles after I messed
>>> with stuff", which isn't very impressive ;)  Please clarify?
>> Sorry, I meant to say that the mm-selftests pass.
> I think you're saying you both compiled and ran the mm selftests for arm64. And
> additionally you compiled for x86_64? (Just trying to help clarify).

Yes, ran mm-selftests on arm64, and build-tested the patches for x86.

>
>>>> We use the following test cases to measure performance, mprotect()'ing
>>>> the mapped memory to read-only then read-write 40 times:
>>>>
>>>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>>>> pte-mapping those THPs
>>>> Test case 2: Mapping 1G of memory with 64K mTHPs
>>>> Test case 3: Mapping 1G of memory with 4K pages
>>>>
>>>> Average execution time on arm64, Apple M3:
>>>> Before the patchset:
>>>> T1: 7.9 seconds   T2: 7.9 seconds   T3: 4.2 seconds
>>>>
>>>> After the patchset:
>>>> T1: 2.1 seconds   T2: 2.2 seconds   T3: 4.3 seconds
>>> Well that's tasty.
>>>
>>>> Observing T1/T2 and T3 before the patchset, we also remove the regression
>>>> introduced by ptep_get() on a contpte block. And, for large folios we get
>>>> an almost 74% performance improvement, albeit the trade-off being a slight
>>>> degradation in the small folio case.
>>>>
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 0/4] Optimize mprotect() for large folios
  2025-06-30 11:17 ` Lorenzo Stoakes
@ 2025-06-30 11:25   ` Dev Jain
  0 siblings, 0 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-30 11:25 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 30/06/25 4:47 pm, Lorenzo Stoakes wrote:
> On Sat, Jun 28, 2025 at 05:04:31PM +0530, Dev Jain wrote:
>> This patchset optimizes the mprotect() system call for large folios
>> by PTE-batching. No issues were observed with mm-selftests, build
>> tested on x86_64.
> Should also be tested on x86-64 not only build tested :)
>
> You are still not really giving details here, so same comment as your mremap()
> series, please explain why you're doing this, what for, what benefits you expect
> to achieve, where etc.
>
> E.g. 'this is deisgned to optimise mTHP cases on arm64, we expect to see
> benefits on amd64 also and for intel there should be no impact'.

Okay.

>
> It's probably also worth actually going and checking to make sure that this is
> the case re: other arches. See below on that...
>
>> We use the following test cases to measure performance, mprotect()'ing
>> the mapped memory to read-only then read-write 40 times:
>>
>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>> pte-mapping those THPs
>> Test case 2: Mapping 1G of memory with 64K mTHPs
>> Test case 3: Mapping 1G of memory with 4K pages
>>
>> Average execution time on arm64, Apple M3:
>> Before the patchset:
>> T1: 7.9 seconds   T2: 7.9 seconds   T3: 4.2 seconds
>>
>> After the patchset:
>> T1: 2.1 seconds   T2: 2.2 seconds   T3: 4.3 seconds
>>
>> Observing T1/T2 and T3 before the patchset, we also remove the regression
>> introduced by ptep_get() on a contpte block. And, for large folios we get
>> an almost 74% performance improvement, albeit the trade-off being a slight
>> degradation in the small folio case.
> This is nice, though order-0 is probably going to be your bread and butter no?
>
> Having said that, mprotect() is not a hot path, this delta is small enough to
> quite possibly just be noise, and personally I'm not all that bothered.

It is only the vm_normal_folio() + folio_test_large() overhead. Trying to avoid
this by the horrible maybe_contiguous_pte_pfns() I introduced somewhere else
is not worth it : )
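
For reference, a sketch of where that overhead sits in the small-folio case.
This is an assumption about the final code shape, reusing names from the
quoted patches:

	folio = vm_normal_folio(vma, addr, oldpte);	/* the extra lookup */
	if (!folio || !folio_test_large(folio))
		nr_ptes = 1;	/* order-0: no batching, same path as before */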

>
> But let's run this same test on x86-64 too please and get some before/after
> numbers just to confirm no major impact.
>
> Thanks for including code.
>
>> Here is the test program:
>>
>>   #define _GNU_SOURCE
>>   #include <sys/mman.h>
>>   #include <stdlib.h>
>>   #include <string.h>
>>   #include <stdio.h>
>>   #include <unistd.h>
>>
>>   #define SIZE (1024*1024*1024)
>>
>> unsigned long pmdsize = (1UL << 21);
>> unsigned long pagesize = (1UL << 12);
>>
>> static void pte_map_thps(char *mem, size_t size)
>> {
>> 	size_t offs;
>> 	int ret = 0;
>>
>>
>> 	/* PTE-map each THP by temporarily splitting the VMAs. */
>> 	for (offs = 0; offs < size; offs += pmdsize) {
>> 		ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
>> 		ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
>> 	}
>>
>> 	if (ret) {
>> 		fprintf(stderr, "ERROR: mprotect() failed\n");
>> 		exit(1);
>> 	}
>> }
>>
>> int main(int argc, char *argv[])
>> {
>> 	char *p;
>>          int ret = 0;
>> 	p = mmap((1UL << 30), SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> 	if (p != (1UL << 30)) {
>> 		perror("mmap");
>> 		return 1;
>> 	}
>>
>>
>>
>> 	memset(p, 0, SIZE);
>> 	if (madvise(p, SIZE, MADV_NOHUGEPAGE))
>> 		perror("madvise");
>> 	explicit_bzero(p, SIZE);
>> 	pte_map_thps(p, SIZE);
>>
>> 	for (int loops = 0; loops < 40; loops++) {
>> 		if (mprotect(p, SIZE, PROT_READ))
>> 			perror("mprotect"), exit(1);
>> 		if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
>> 			perror("mprotect"), exit(1);
>> 		explicit_bzero(p, SIZE);
>> 	}
>> }
>>
>> ---
>> The patchset is rebased onto Saturday's mm-new.
>>
>> v3->v4:
>>   - Refactor skipping logic into a new function, edit patch 1 subject
>>     to highlight it is only for MM_CP_PROT_NUMA case (David H)
>>   - Refactor the optimization logic, add more documentation to the generic
>>     batched functions, do not add clear_flush_ptes, squash patch 4
>>     and 5 (Ryan)
>>
>> v2->v3:
>>   - Add comments for the new APIs (Ryan, Lorenzo)
>>   - Instead of refactoring, use a "skip_batch" label
>>   - Move arm64 patches at the end (Ryan)
>>   - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
>>   - Resolve implicit declaration; tested build on x86 (Lance Yang)
>>
>> v1->v2:
>>   - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
>>   - Abridge the anon-exclusive condition (Lance Yang)
>>
>> Dev Jain (4):
>>    mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
>>    mm: Add batched versions of ptep_modify_prot_start/commit
>>    mm: Optimize mprotect() by PTE-batching
>>    arm64: Add batched versions of ptep_modify_prot_start/commit
>>
>>   arch/arm64/include/asm/pgtable.h |  10 ++
>>   arch/arm64/mm/mmu.c              |  28 +++-
>>   include/linux/pgtable.h          |  83 +++++++++-
>>   mm/mprotect.c                    | 269 +++++++++++++++++++++++--------
>>   4 files changed, 315 insertions(+), 75 deletions(-)
>>
>> --
>> 2.30.2
>>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-06-28 11:34 ` [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
  2025-06-30  9:42   ` Ryan Roberts
@ 2025-06-30 11:25   ` Lorenzo Stoakes
  2025-06-30 11:39     ` Ryan Roberts
  2025-06-30 11:40     ` Dev Jain
  2025-07-02  9:37   ` Lorenzo Stoakes
  2 siblings, 2 replies; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 11:25 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Sat, Jun 28, 2025 at 05:04:32PM +0530, Dev Jain wrote:
> In case of prot_numa, there are various cases in which we can skip to the
> next iteration. Since the skip condition is based on the folio and not
> the PTEs, we can skip a PTE batch. Additionally refactor all of this
> into a new function to clean up the existing code.

Hmm, is this a completely new concept for this series?

Please try not to introduce brand new things to a series midway through.

This seems to be adding a whole ton of questionable logic for an edge case.

Can we maybe just drop this for this series please?

>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mprotect.c | 134 ++++++++++++++++++++++++++++++++------------------
>  1 file changed, 87 insertions(+), 47 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88709c01177b..af10a7fbe6b8 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,6 +83,83 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return pte_dirty(pte);
>  }
>
> +static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
> +		pte_t *ptep, pte_t pte, int max_nr_ptes)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +
> +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
> +		return 1;
> +
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
> +			       NULL, NULL, NULL);
> +}

I find it really odd that you're introducing this in a seemingly unrelated change.

Also won't this conflict with David's changes?

I know you like to rush out a dozen series at once, but once again I'm asking:
can you please hold off?

I seem to remember David asked you for the same thing because of this, but maybe
I'm misremembering.

We have only so much review bandwidth, and adding brand new concepts mid-way
and doing things that blatantly conflict with other series really doesn't help.

> +
> +static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t oldpte, pte_t *pte, int target_node,
> +		int max_nr_ptes)
> +{
> +	struct folio *folio = NULL;
> +	int nr_ptes = 1;
> +	bool toptier;
> +	int nid;
> +
> +	/* Avoid TLB flush if possible */
> +	if (pte_protnone(oldpte))
> +		goto skip_batch;
> +
> +	folio = vm_normal_folio(vma, addr, oldpte);
> +	if (!folio)
> +		goto skip_batch;
> +
> +	if (folio_is_zone_device(folio) || folio_test_ksm(folio))
> +		goto skip_batch;
> +
> +	/* Also skip shared copy-on-write pages */
> +	if (is_cow_mapping(vma->vm_flags) &&
> +	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
> +		goto skip_batch;
> +
> +	/*
> +	 * While migration can move some dirty pages,
> +	 * it cannot move them all from MIGRATE_ASYNC
> +	 * context.
> +	 */
> +	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
> +		goto skip_batch;
> +
> +	/*
> +	 * Don't mess with PTEs if page is already on the node
> +	 * a single-threaded process is running on.
> +	 */
> +	nid = folio_nid(folio);
> +	if (target_node == nid)
> +		goto skip_batch;
> +
> +	toptier = node_is_toptier(nid);
> +
> +	/*
> +	 * Skip scanning top tier node if normal numa
> +	 * balancing is disabled
> +	 */
> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
> +		goto skip_batch;
> +
> +	if (folio_use_access_time(folio)) {
> +		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
> +
> +		/* Do not skip in this case */
> +		nr_ptes = 0;
> +		goto out;
> +	}
> +
> +skip_batch:
> +	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
> +out:
> +	*foliop = folio;
> +	return nr_ptes;
> +}

Yeah yuck. I don't like that we're doing all this for this edge case.

> +
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -94,6 +171,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	int nr_ptes;
>
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -108,8 +186,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>  	do {
> +		nr_ptes = 1;
>  		oldpte = ptep_get(pte);
>  		if (pte_present(oldpte)) {
> +			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
> +			struct folio *folio = NULL;
>  			pte_t ptent;
>
>  			/*
> @@ -117,53 +198,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			 * pages. See similar comment in change_huge_pmd.
>  			 */
>  			if (prot_numa) {
> -				struct folio *folio;
> -				int nid;
> -				bool toptier;
> -
> -				/* Avoid TLB flush if possible */
> -				if (pte_protnone(oldpte))
> -					continue;
> -
> -				folio = vm_normal_folio(vma, addr, oldpte);
> -				if (!folio || folio_is_zone_device(folio) ||
> -				    folio_test_ksm(folio))
> -					continue;
> -
> -				/* Also skip shared copy-on-write pages */
> -				if (is_cow_mapping(vma->vm_flags) &&
> -				    (folio_maybe_dma_pinned(folio) ||
> -				     folio_maybe_mapped_shared(folio)))
> -					continue;
> -
> -				/*
> -				 * While migration can move some dirty pages,
> -				 * it cannot move them all from MIGRATE_ASYNC
> -				 * context.
> -				 */
> -				if (folio_is_file_lru(folio) &&
> -				    folio_test_dirty(folio))
> -					continue;
> -
> -				/*
> -				 * Don't mess with PTEs if page is already on the node
> -				 * a single-threaded process is running on.
> -				 */
> -				nid = folio_nid(folio);
> -				if (target_node == nid)
> -					continue;
> -				toptier = node_is_toptier(nid);
> -
> -				/*
> -				 * Skip scanning top tier node if normal numa
> -				 * balancing is disabled
> -				 */
> -				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> -				    toptier)
> +				nr_ptes = prot_numa_skip_ptes(&folio, vma,
> +							      addr, oldpte, pte,
> +							      target_node,
> +							      max_nr_ptes);
> +				if (nr_ptes)

I'm not really a fan of this being added (unless I'm missing something here) but
_generally_ it's better to separate out a move and a change if you can.

>  					continue;
> -				if (folio_use_access_time(folio))
> -					folio_xchg_access_time(folio,
> -						jiffies_to_msecs(jiffies));
>  			}
>
>  			oldpte = ptep_modify_prot_start(vma, addr, pte);
> @@ -280,7 +320,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				pages++;
>  			}
>  		}
> -	} while (pte++, addr += PAGE_SIZE, addr != end);
> +	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);
>
> --
> 2.30.2
>

Anyway will hold off on reviewing the actual changes here until we can figure
out whether this is even appropriate here.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 0/4] Optimize mprotect() for large folios
  2025-06-28 11:34 [PATCH v4 0/4] Optimize mprotect() for large folios Dev Jain
                   ` (5 preceding siblings ...)
  2025-06-30 11:17 ` Lorenzo Stoakes
@ 2025-06-30 11:27 ` Lorenzo Stoakes
  2025-06-30 11:43   ` Dev Jain
  6 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 11:27 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

To reiterate what I said on 1/4 - overall since this series conflicts with
David's changes - can we hold off on any respin please until David's
settles and lands in mm-new at least?

Thanks.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-06-30 11:25   ` Lorenzo Stoakes
@ 2025-06-30 11:39     ` Ryan Roberts
  2025-06-30 11:53       ` Lorenzo Stoakes
  2025-06-30 11:40     ` Dev Jain
  1 sibling, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-06-30 11:39 UTC (permalink / raw)
  To: Lorenzo Stoakes, Dev Jain
  Cc: akpm, david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, vbabka, jannh, anshuman.khandual, peterx,
	joey.gouly, ioworker0, baohua, kevin.brodsky, quic_zhenhuah,
	christophe.leroy, yangyicong, linux-arm-kernel, hughd, yang, ziy

On 30/06/2025 12:25, Lorenzo Stoakes wrote:
> On Sat, Jun 28, 2025 at 05:04:32PM +0530, Dev Jain wrote:
>> In case of prot_numa, there are various cases in which we can skip to the
>> next iteration. Since the skip condition is based on the folio and not
>> the PTEs, we can skip a PTE batch. Additionally refactor all of this
>> into a new function to clean up the existing code.
> 
> Hmm, is this a completely new concept for this series?
> 
> Please try not to introduce brand new things to a series midway through.
> 
> This seems to be adding a whole ton of questionable logic for an edge case.
> 
> Can we maybe just drop this for this series please?

From my perspective, at least, there are no new logical changes in here vs the
previous version. And I don't think the patches have been re-organised either.
David (I think?) was asking for the name of the patch to be changed to include
MM_CP_PROT_NUMA and also for the code to be moved out of line to it's own
function. That's all that Dev has done AFAICT (although as per my review
comments, the refactoring has introduced a bug).

My preference is that we should ultimately support this batching. It could be a
separate series if you insist, but it's all contributing to the same goal:
making mprotect support PTE batching.

Just my 2c.

Thanks,
Ryan

> 
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>  mm/mprotect.c | 134 ++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 87 insertions(+), 47 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 88709c01177b..af10a7fbe6b8 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -83,6 +83,83 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>  	return pte_dirty(pte);
>>  }
>>
>> +static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>> +		pte_t *ptep, pte_t pte, int max_nr_ptes)
>> +{
>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +
>> +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>> +		return 1;
>> +
>> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>> +			       NULL, NULL, NULL);
>> +}
> 
> I find it really odd that you're introducing this in a seemingly unrelated change.
> 
> Also won't this conflict with David's changes?
> 
> I know you like to rush out a dozen series at once, but once again I'm asking
> maybe please hold off?
> 
> I seem to remember David asked you for the same thing because of this, but maybe
> I'm misremembering.
> 
> We have only so much review resource and adding in brand new concepts mid-way
> and doing things that blatantly conflict with other series really doesn't help.
> 
>> +
>> +static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t oldpte, pte_t *pte, int target_node,
>> +		int max_nr_ptes)
>> +{
>> +	struct folio *folio = NULL;
>> +	int nr_ptes = 1;
>> +	bool toptier;
>> +	int nid;
>> +
>> +	/* Avoid TLB flush if possible */
>> +	if (pte_protnone(oldpte))
>> +		goto skip_batch;
>> +
>> +	folio = vm_normal_folio(vma, addr, oldpte);
>> +	if (!folio)
>> +		goto skip_batch;
>> +
>> +	if (folio_is_zone_device(folio) || folio_test_ksm(folio))
>> +		goto skip_batch;
>> +
>> +	/* Also skip shared copy-on-write pages */
>> +	if (is_cow_mapping(vma->vm_flags) &&
>> +	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
>> +		goto skip_batch;
>> +
>> +	/*
>> +	 * While migration can move some dirty pages,
>> +	 * it cannot move them all from MIGRATE_ASYNC
>> +	 * context.
>> +	 */
>> +	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
>> +		goto skip_batch;
>> +
>> +	/*
>> +	 * Don't mess with PTEs if page is already on the node
>> +	 * a single-threaded process is running on.
>> +	 */
>> +	nid = folio_nid(folio);
>> +	if (target_node == nid)
>> +		goto skip_batch;
>> +
>> +	toptier = node_is_toptier(nid);
>> +
>> +	/*
>> +	 * Skip scanning top tier node if normal numa
>> +	 * balancing is disabled
>> +	 */
>> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
>> +		goto skip_batch;
>> +
>> +	if (folio_use_access_time(folio)) {
>> +		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
>> +
>> +		/* Do not skip in this case */
>> +		nr_ptes = 0;
>> +		goto out;
>> +	}
>> +
>> +skip_batch:
>> +	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>> +out:
>> +	*foliop = folio;
>> +	return nr_ptes;
>> +}
> 
> Yeah yuck. I don't like that we're doing all this for this edge case.
> 
>> +
>>  static long change_pte_range(struct mmu_gather *tlb,
>>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -94,6 +171,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> +	int nr_ptes;
>>
>>  	tlb_change_page_size(tlb, PAGE_SIZE);
>>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> @@ -108,8 +186,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>  	flush_tlb_batched_pending(vma->vm_mm);
>>  	arch_enter_lazy_mmu_mode();
>>  	do {
>> +		nr_ptes = 1;
>>  		oldpte = ptep_get(pte);
>>  		if (pte_present(oldpte)) {
>> +			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>> +			struct folio *folio = NULL;
>>  			pte_t ptent;
>>
>>  			/*
>> @@ -117,53 +198,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>  			 * pages. See similar comment in change_huge_pmd.
>>  			 */
>>  			if (prot_numa) {
>> -				struct folio *folio;
>> -				int nid;
>> -				bool toptier;
>> -
>> -				/* Avoid TLB flush if possible */
>> -				if (pte_protnone(oldpte))
>> -					continue;
>> -
>> -				folio = vm_normal_folio(vma, addr, oldpte);
>> -				if (!folio || folio_is_zone_device(folio) ||
>> -				    folio_test_ksm(folio))
>> -					continue;
>> -
>> -				/* Also skip shared copy-on-write pages */
>> -				if (is_cow_mapping(vma->vm_flags) &&
>> -				    (folio_maybe_dma_pinned(folio) ||
>> -				     folio_maybe_mapped_shared(folio)))
>> -					continue;
>> -
>> -				/*
>> -				 * While migration can move some dirty pages,
>> -				 * it cannot move them all from MIGRATE_ASYNC
>> -				 * context.
>> -				 */
>> -				if (folio_is_file_lru(folio) &&
>> -				    folio_test_dirty(folio))
>> -					continue;
>> -
>> -				/*
>> -				 * Don't mess with PTEs if page is already on the node
>> -				 * a single-threaded process is running on.
>> -				 */
>> -				nid = folio_nid(folio);
>> -				if (target_node == nid)
>> -					continue;
>> -				toptier = node_is_toptier(nid);
>> -
>> -				/*
>> -				 * Skip scanning top tier node if normal numa
>> -				 * balancing is disabled
>> -				 */
>> -				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>> -				    toptier)
>> +				nr_ptes = prot_numa_skip_ptes(&folio, vma,
>> +							      addr, oldpte, pte,
>> +							      target_node,
>> +							      max_nr_ptes);
>> +				if (nr_ptes)
> 
> I'm not really a fan of this being added (unless I'm missing something here) but
> _generally_ it's better to separate out a move and a change if you can.
> 
>>  					continue;
>> -				if (folio_use_access_time(folio))
>> -					folio_xchg_access_time(folio,
>> -						jiffies_to_msecs(jiffies));
>>  			}
>>
>>  			oldpte = ptep_modify_prot_start(vma, addr, pte);
>> @@ -280,7 +320,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>  				pages++;
>>  			}
>>  		}
>> -	} while (pte++, addr += PAGE_SIZE, addr != end);
>> +	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>>  	arch_leave_lazy_mmu_mode();
>>  	pte_unmap_unlock(pte - 1, ptl);
>>
>> --
>> 2.30.2
>>
> 
> Anyway will hold off on reviewing the actual changes here until we can figure
> out whether this is even appropriate here.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-06-30 11:25   ` Lorenzo Stoakes
  2025-06-30 11:39     ` Ryan Roberts
@ 2025-06-30 11:40     ` Dev Jain
  2025-06-30 11:51       ` Lorenzo Stoakes
  1 sibling, 1 reply; 62+ messages in thread
From: Dev Jain @ 2025-06-30 11:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 30/06/25 4:55 pm, Lorenzo Stoakes wrote:
> On Sat, Jun 28, 2025 at 05:04:32PM +0530, Dev Jain wrote:
>> In case of prot_numa, there are various cases in which we can skip to the
>> next iteration. Since the skip condition is based on the folio and not
>> the PTEs, we can skip a PTE batch. Additionally refactor all of this
>> into a new function to clean up the existing code.
> Hmm, is this a completely new concept for this series?
>
> Please try not to introduce brand new things to a series midway through.
>
> This seems to be adding a whole ton of questionable logic for an edge case.
>
> Can we maybe just drop this for this series please?

I refactored this into a new function on David's suggestion:

https://lore.kernel.org/all/912757c0-8a75-4307-a0bd-8755f6135b5a@redhat.com/

Maybe you are saying we should have a refactoring patch first and then the
"skip a PTE batch" change second; I'll be happy to do that, that would be cleaner.

>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mprotect.c | 134 ++++++++++++++++++++++++++++++++------------------
>>   1 file changed, 87 insertions(+), 47 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 88709c01177b..af10a7fbe6b8 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -83,6 +83,83 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   	return pte_dirty(pte);
>>   }
>>
>> +static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>> +		pte_t *ptep, pte_t pte, int max_nr_ptes)
>> +{
>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +
>> +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>> +		return 1;
>> +
>> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>> +			       NULL, NULL, NULL);
>> +}
> I find it really odd that you're introducing this in a seemingly unrelated change.
>
> Also won't this conflict with David's changes?

This series was (of course, IMHO) pretty stable at v3, and there were comments
coming in on David's series, so I guessed that he would have to post a v2 anyway
after mine got merged. My guess could have been wrong. For the khugepaged
batching series, I sent the migration race patch separately exactly
because of his series, so that in that case the rebasing burden is mine.

>
> I know you like to rush out a dozen series at once, but once again I'm asking
> maybe please hold off?

Lorenzo : ) Except for the mremap series which you pointed out, I make it a point
never to repost in the same week, unless it is an obvious single patch, and even
in that case I give 2-3 days for the reviews to settle. I posted
v3 of this series more than a month ago, so it made total sense to post this.
Also, I have seen many people spamming the list with next versions on literally
the same day, not that I am using that as a precedent. The mistake I made here
is that on Saturday I saw David's series but then forgot that I was using the
same infrastructure he is changing, and went ahead and posted this. I suddenly
remembered this during the khugepaged series and dropped the first two patches
for that.
  

>
> I seem to remember David asked you for the same thing because of this, but maybe
> I'm misremembering.

I don't recollect that happening, but maybe I am wrong.

>
> We have only so much review resource and adding in brand new concepts mid-way
> and doing things that blatantly conflict with other series really doesn't help.
>
>> +
>> +static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t oldpte, pte_t *pte, int target_node,
>> +		int max_nr_ptes)
>> +{
>> +	struct folio *folio = NULL;
>> +	int nr_ptes = 1;
>> +	bool toptier;
>> +	int nid;
>> +
>> +	/* Avoid TLB flush if possible */
>> +	if (pte_protnone(oldpte))
>> +		goto skip_batch;
>> +
>> +	folio = vm_normal_folio(vma, addr, oldpte);
>> +	if (!folio)
>> +		goto skip_batch;
>> +
>> +	if (folio_is_zone_device(folio) || folio_test_ksm(folio))
>> +		goto skip_batch;
>> +
>> +	/* Also skip shared copy-on-write pages */
>> +	if (is_cow_mapping(vma->vm_flags) &&
>> +	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
>> +		goto skip_batch;
>> +
>> +	/*
>> +	 * While migration can move some dirty pages,
>> +	 * it cannot move them all from MIGRATE_ASYNC
>> +	 * context.
>> +	 */
>> +	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
>> +		goto skip_batch;
>> +
>> +	/*
>> +	 * Don't mess with PTEs if page is already on the node
>> +	 * a single-threaded process is running on.
>> +	 */
>> +	nid = folio_nid(folio);
>> +	if (target_node == nid)
>> +		goto skip_batch;
>> +
>> +	toptier = node_is_toptier(nid);
>> +
>> +	/*
>> +	 * Skip scanning top tier node if normal numa
>> +	 * balancing is disabled
>> +	 */
>> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
>> +		goto skip_batch;
>> +
>> +	if (folio_use_access_time(folio)) {
>> +		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
>> +
>> +		/* Do not skip in this case */
>> +		nr_ptes = 0;
>> +		goto out;
>> +	}
>> +
>> +skip_batch:
>> +	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>> +out:
>> +	*foliop = folio;
>> +	return nr_ptes;
>> +}
> Yeah yuck. I don't like that we're doing all this for this edge case.
>
>> +
>>   static long change_pte_range(struct mmu_gather *tlb,
>>   		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>   		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -94,6 +171,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>   	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>   	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> +	int nr_ptes;
>>
>>   	tlb_change_page_size(tlb, PAGE_SIZE);
>>   	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> @@ -108,8 +186,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   	flush_tlb_batched_pending(vma->vm_mm);
>>   	arch_enter_lazy_mmu_mode();
>>   	do {
>> +		nr_ptes = 1;
>>   		oldpte = ptep_get(pte);
>>   		if (pte_present(oldpte)) {
>> +			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>> +			struct folio *folio = NULL;
>>   			pte_t ptent;
>>
>>   			/*
>> @@ -117,53 +198,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			 * pages. See similar comment in change_huge_pmd.
>>   			 */
>>   			if (prot_numa) {
>> -				struct folio *folio;
>> -				int nid;
>> -				bool toptier;
>> -
>> -				/* Avoid TLB flush if possible */
>> -				if (pte_protnone(oldpte))
>> -					continue;
>> -
>> -				folio = vm_normal_folio(vma, addr, oldpte);
>> -				if (!folio || folio_is_zone_device(folio) ||
>> -				    folio_test_ksm(folio))
>> -					continue;
>> -
>> -				/* Also skip shared copy-on-write pages */
>> -				if (is_cow_mapping(vma->vm_flags) &&
>> -				    (folio_maybe_dma_pinned(folio) ||
>> -				     folio_maybe_mapped_shared(folio)))
>> -					continue;
>> -
>> -				/*
>> -				 * While migration can move some dirty pages,
>> -				 * it cannot move them all from MIGRATE_ASYNC
>> -				 * context.
>> -				 */
>> -				if (folio_is_file_lru(folio) &&
>> -				    folio_test_dirty(folio))
>> -					continue;
>> -
>> -				/*
>> -				 * Don't mess with PTEs if page is already on the node
>> -				 * a single-threaded process is running on.
>> -				 */
>> -				nid = folio_nid(folio);
>> -				if (target_node == nid)
>> -					continue;
>> -				toptier = node_is_toptier(nid);
>> -
>> -				/*
>> -				 * Skip scanning top tier node if normal numa
>> -				 * balancing is disabled
>> -				 */
>> -				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>> -				    toptier)
>> +				nr_ptes = prot_numa_skip_ptes(&folio, vma,
>> +							      addr, oldpte, pte,
>> +							      target_node,
>> +							      max_nr_ptes);
>> +				if (nr_ptes)
> I'm not really a fan of this being added (unless I'm missing something here) but
> _generally_ it's better to separate out a move and a change if you can.

Yup I'll split this patch.

>
>>   					continue;
>> -				if (folio_use_access_time(folio))
>> -					folio_xchg_access_time(folio,
>> -						jiffies_to_msecs(jiffies));
>>   			}
>>
>>   			oldpte = ptep_modify_prot_start(vma, addr, pte);
>> @@ -280,7 +320,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   				pages++;
>>   			}
>>   		}
>> -	} while (pte++, addr += PAGE_SIZE, addr != end);
>> +	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>>   	arch_leave_lazy_mmu_mode();
>>   	pte_unmap_unlock(pte - 1, ptl);
>>
>> --
>> 2.30.2
>>
> Anyway will hold off on reviewing the actual changes here until we can figure
> out whether this is even appropriate here.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 0/4] Optimize mprotect() for large folios
  2025-06-30 11:27 ` Lorenzo Stoakes
@ 2025-06-30 11:43   ` Dev Jain
  2025-07-01  0:08     ` Andrew Morton
  0 siblings, 1 reply; 62+ messages in thread
From: Dev Jain @ 2025-06-30 11:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 30/06/25 4:57 pm, Lorenzo Stoakes wrote:
> To reiterate what I said on 1/4 - overall since this series conflicts with
> David's changes - can we hold off on any respin please until David's
> settles and lands in mm-new at least?
>
> Thanks.

I agree, David's series should have stabilized by the time I'm ready to post
the next version. @Andrew, could you remove this from mm-new please?



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-06-30 11:21     ` Dev Jain
@ 2025-06-30 11:47       ` Dev Jain
  2025-06-30 11:50       ` Ryan Roberts
  1 sibling, 0 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-30 11:47 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 30/06/25 4:51 pm, Dev Jain wrote:
>
> On 30/06/25 4:01 pm, Ryan Roberts wrote:
>> On 28/06/2025 12:34, Dev Jain wrote:
>>> Use folio_pte_batch to batch process a large folio. Reuse the folio 
>>> from
>>> prot_numa case if possible.
>>>
>>> For all cases other than the PageAnonExclusive case, if the case 
>>> holds true
>>> for one pte in the batch, one can confirm that that case will hold 
>>> true for
>>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do 
>>> not pass
>>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
>>> and access bits across the batch, therefore batching across
>>> pte_dirty(): this is correct since the dirty bit on the PTE really is
>>> just an indication that the folio got written to, so even if the PTE is
>>> not actually dirty (but one of the PTEs in the batch is), the wp-fault
>>> optimization can be made.
>>>
>>> The crux now is how to batch around the PageAnonExclusive case; we must
>>> check the corresponding condition for every single page. Therefore, 
>>> from
>>> the large folio batch, we process sub batches of ptes mapping pages 
>>> with
>>> the same PageAnonExclusive condition, and process that sub batch, then
>>> determine and process the next sub batch, and so on. Note that this 
>>> does
>>> not cause any extra overhead; if suppose the size of the folio batch
>>> is 512, then the sub batch processing in total will take 512 
>>> iterations,
>>> which is the same as what we would have done before.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>   mm/mprotect.c | 143 
>>> +++++++++++++++++++++++++++++++++++++++++---------
>>>   1 file changed, 117 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 627b0d67cc4a..28c7ce7728ff 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -40,35 +40,47 @@
>>>     #include "internal.h"
>>>   -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned 
>>> long addr,
>>> -                 pte_t pte)
>>> -{
>>> -    struct page *page;
>>> +enum tristate {
>>> +    TRI_FALSE = 0,
>>> +    TRI_TRUE = 1,
>>> +    TRI_MAYBE = -1,
>>> +};
>>>   +/*
>>> + * Returns enum tristate indicating whether the pte can be changed 
>>> to writable.
>>> + * If TRI_MAYBE is returned, then the folio is anonymous and the 
>>> user must
>>> + * additionally check PageAnonExclusive() for every page in the 
>>> desired range.
>>> + */
>>> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
>>> +                     unsigned long addr, pte_t pte,
>>> +                     struct folio *folio)
>>> +{
>>>       if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>>> -        return false;
>>> +        return TRI_FALSE;
>>>         /* Don't touch entries that are not even readable. */
>>>       if (pte_protnone(pte))
>>> -        return false;
>>> +        return TRI_FALSE;
>>>         /* Do we need write faults for softdirty tracking? */
>>>       if (pte_needs_soft_dirty_wp(vma, pte))
>>> -        return false;
>>> +        return TRI_FALSE;
>>>         /* Do we need write faults for uffd-wp tracking? */
>>>       if (userfaultfd_pte_wp(vma, pte))
>>> -        return false;
>>> +        return TRI_FALSE;
>>>         if (!(vma->vm_flags & VM_SHARED)) {
>>>           /*
>>>            * Writable MAP_PRIVATE mapping: We can only special-case on
>>>            * exclusive anonymous pages, because we know that our
>>>            * write-fault handler similarly would map them writable 
>>> without
>>> -         * any additional checks while holding the PT lock.
>>> +         * any additional checks while holding the PT lock. So if the
>>> +         * folio is not anonymous, we know we cannot change pte to
>>> +         * writable. If it is anonymous then the caller must further
>>> +         * check that the page is AnonExclusive().
>>>            */
>>> -        page = vm_normal_page(vma, addr, pte);
>>> -        return page && PageAnon(page) && PageAnonExclusive(page);
>>> +        return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : 
>>> TRI_FALSE;
>>>       }
>>>         VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>>> @@ -80,15 +92,61 @@ bool can_change_pte_writable(struct 
>>> vm_area_struct *vma, unsigned long addr,
>>>        * FS was already notified and we can simply mark the PTE 
>>> writable
>>>        * just like the write-fault handler would do.
>>>        */
>>> -    return pte_dirty(pte);
>>> +    return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
>>> +}
>>> +
>>> +/*
>>> + * Returns the number of pages within the folio, starting from the 
>>> page
>>> + * indicated by pgidx and up to pgidx + max_nr, that have the same 
>>> value of
>>> + * PageAnonExclusive(). Must only be called for anonymous folios. 
>>> Value of
>>> + * PageAnonExclusive() is returned in *exclusive.
>>> + */
>>> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int 
>>> max_nr,
>>> +                bool *exclusive)
>>> +{
>>> +    struct page *page;
>>> +    int nr = 1;
>>> +
>>> +    if (!folio) {
>>> +        *exclusive = false;
>>> +        return nr;
>>> +    }
>>> +
>>> +    page = folio_page(folio, pgidx++);
>>> +    *exclusive = PageAnonExclusive(page);
>>> +    while (nr < max_nr) {
>>> +        page = folio_page(folio, pgidx++);
>>> +        if ((*exclusive) != PageAnonExclusive(page))
>> nit: brackets not required around *exclusive.
>
> Thanks I'll drop it. I have a habit of putting brackets everywhere
> because debugging operator precedence bugs is a nightmare - finally
> the time has come to learn operator precedence!
>
>>
>>> +            break;
>>> +        nr++;
>>> +    }
>>> +
>>> +    return nr;
>>> +}
>>> +
>>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned 
>>> long addr,
>>> +                 pte_t pte)
>>> +{
>>> +    struct page *page;
>>> +    int ret;
>>> +
>>> +    ret = maybe_change_pte_writable(vma, addr, pte, NULL);
>>> +    if (ret == TRI_MAYBE) {
>>> +        page = vm_normal_page(vma, addr, pte);
>>> +        ret = page && PageAnon(page) && PageAnonExclusive(page);
>>> +    }
>>> +
>>> +    return ret;
>>>   }
>>>     static int mprotect_folio_pte_batch(struct folio *folio, 
>>> unsigned long addr,
>>> -        pte_t *ptep, pte_t pte, int max_nr_ptes)
>>> +        pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t 
>>> switch_off_flags)
>>>   {
>>> -    const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> +    fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> +
>>> +    flags &= ~switch_off_flags;
>> This is mega confusing when reading the caller. Because the caller 
>> passes
>> FPB_IGNORE_SOFT_DIRTY and that actually means DON'T ignore soft dirty.
>>
>> Can't we just pass in the flags we want?
>
> Yup that is cleaner.
>
>>
>>>   -    if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>> +    if (!folio || !folio_test_large(folio))
>> What's the rationale for dropping the max_nr_ptes == 1 condition? If
>> you don't
>> need it, why did you add it in the earlier patch?
>
> Stupid me forgot to drop it from the earlier patch.
>
>>
>>>           return 1;
>>>         return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, 
>>> flags,
>>> @@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio 
>>> **foliop, struct vm_area_struct *vma
>>>       }
>>>     skip_batch:
>>> -    nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, 
>>> max_nr_ptes);
>>> +    nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>>> +                       max_nr_ptes, 0);
>>>   out:
>>>       *foliop = folio;
>>>       return nr_ptes;
>>> @@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather 
>>> *tlb,
>>>           if (pte_present(oldpte)) {
>>>               int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>>               struct folio *folio = NULL;
>>> -            pte_t ptent;
>>> +            int sub_nr_ptes, pgidx = 0;
>>> +            pte_t ptent, newpte;
>>> +            bool sub_set_write;
>>> +            int set_write;
>>>                 /*
>>>                * Avoid trapping faults against the zero or KSM
>>> @@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather 
>>> *tlb,
>>>                       continue;
>>>               }
>>>   +            if (!folio)
>>> +                folio = vm_normal_folio(vma, addr, oldpte);
>>> +
>>> +            nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, 
>>> oldpte,
>>> +                               max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
>>  From the other thread, my memory is jogged that this function 
>> ignores write
>> permission bit. So I think that's opening up a bug when applied here? 
>> If the
>> first pte is writable but the rest are not (COW), doesn't this now 
>> make them all
>> writable? I don't *think* that's a problem for the prot_numa use, but 
>> I could be
>> wrong.
>
> No, we are not ignoring the write permission bit. There is no way 
> currently to
> do that via folio_pte_batch. So the pte batch is either entirely 
> writable or
> entirely not.

Oh no. We do a pte_wrprotect() in pte_batch_clear_ignored(), I missed that.
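
For anyone following along, here is a minimal standalone model of what that
"clear the ignored bits before comparing" step means, with PTEs as plain
bitmasks. All names and the bit layout below are illustrative only; this is a
sketch of the idea discussed above, not the kernel helpers:

 #include <stdint.h>
 #include <stdio.h>

/* Illustrative PTE bits; not any real architecture's layout. */
 #define SK_PTE_WRITE		(1u << 0)
 #define SK_PTE_DIRTY		(1u << 1)
 #define SK_PTE_SOFT_DIRTY	(1u << 2)

 #define SK_IGNORE_DIRTY	(1u << 0)
 #define SK_IGNORE_SOFT_DIRTY	(1u << 1)

/*
 * Normalize away the bits the batch walk does not care about before
 * comparing two PTEs: dirty/soft-dirty only when the caller asked for them
 * to be ignored, and (as noticed above) the write bit unconditionally.
 * PTEs that differ only in these bits then land in the same batch.
 */
static uint32_t clear_ignored(uint32_t pte, uint32_t flags)
{
	if (flags & SK_IGNORE_DIRTY)
		pte &= ~SK_PTE_DIRTY;
	if (flags & SK_IGNORE_SOFT_DIRTY)
		pte &= ~SK_PTE_SOFT_DIRTY;
	return pte & ~SK_PTE_WRITE;
}

int main(void)
{
	uint32_t writable = SK_PTE_WRITE | SK_PTE_DIRTY;
	uint32_t wrprotected = 0;

	/* Prints 1: the write bit alone does not split a batch. */
	printf("%d\n", clear_ignored(writable, SK_IGNORE_DIRTY) ==
		       clear_ignored(wrprotected, SK_IGNORE_DIRTY));
	return 0;
}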


>
>>
>>>               oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>> Even if I'm wrong about ignoring write bit being a bug, I don't think 
>> the docs
>> for this function permit write bit to be different across the batch?
>>
>>>               ptent = pte_modify(oldpte, newprot);
>>>   @@ -227,15 +294,39 @@ static long change_pte_range(struct 
>>> mmu_gather *tlb,
>>>                * example, if a PTE is already dirty and no other
>>>                * COW or special handling is required.
>>>                */
>>> -            if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>> -                !pte_write(ptent) &&
>>> -                can_change_pte_writable(vma, addr, ptent))
>>> -                ptent = pte_mkwrite(ptent, vma);
>>> -
>>> -            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, 
>>> nr_ptes);
>>> -            if (pte_needs_flush(oldpte, ptent))
>>> -                tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>> -            pages++;
>>> +            set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>> +                    !pte_write(ptent);
>>> +            if (set_write)
>>> +                set_write = maybe_change_pte_writable(vma, addr, 
>>> ptent, folio);
>> Why not just:
>>             set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>                     !pte_write(ptent) &&
>>                     maybe_change_pte_writable(...);
>
> set_write is an int, which is supposed to span {TRI_MAYBE, TRI_FALSE, 
> TRI_TRUE},
> whereas the RHS of this statement will always return a boolean.
>
> You proposed it like this in your diff; it took hours for my eyes to 
> catch this : )
>
>>
>> ?
>>
>>> +
>>> +            while (nr_ptes) {
>>> +                if (set_write == TRI_MAYBE) {
>>> +                    sub_nr_ptes = anon_exclusive_batch(folio,
>>> +                        pgidx, nr_ptes, &sub_set_write);
>>> +                } else {
>>> +                    sub_nr_ptes = nr_ptes;
>>> +                    sub_set_write = (set_write == TRI_TRUE);
>>> +                }
>>> +
>>> +                if (sub_set_write)
>>> +                    newpte = pte_mkwrite(ptent, vma);
>>> +                else
>>> +                    newpte = ptent;
>>> +
>>> +                modify_prot_commit_ptes(vma, addr, pte, oldpte,
>>> +                            newpte, sub_nr_ptes);
>>> +                if (pte_needs_flush(oldpte, newpte))
>> What did we conclude with pte_needs_flush()? I thought there was an 
>> arch where
>> it looked dodgy calling this for just the pte at the head of the batch?
>
> Powerpc flushes if access bit transitions from set to unset. x86 does 
> that
> for both dirty and access. Both problems are solved by 
> modify_prot_start_ptes()
> which collects a/d bits, both in the generic implementation and the arm64
> implementation.
>
>>
>> Thanks,
>> Ryan
>>
>>> +                    tlb_flush_pte_range(tlb, addr,
>>> +                        sub_nr_ptes * PAGE_SIZE);
>>> +
>>> +                addr += sub_nr_ptes * PAGE_SIZE;
>>> +                pte += sub_nr_ptes;
>>> +                oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>>> +                ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>>> +                nr_ptes -= sub_nr_ptes;
>>> +                pages += sub_nr_ptes;
>>> +                pgidx += sub_nr_ptes;
>>> +            }
>>>           } else if (is_swap_pte(oldpte)) {
>>>               swp_entry_t entry = pte_to_swp_entry(oldpte);
>>>               pte_t newpte;
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-06-30 11:21     ` Dev Jain
  2025-06-30 11:47       ` Dev Jain
@ 2025-06-30 11:50       ` Ryan Roberts
  2025-06-30 11:53         ` Dev Jain
  1 sibling, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-06-30 11:50 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 30/06/2025 12:21, Dev Jain wrote:
> 
> On 30/06/25 4:01 pm, Ryan Roberts wrote:
>> On 28/06/2025 12:34, Dev Jain wrote:
>>> Use folio_pte_batch to batch process a large folio. Reuse the folio from
>>> prot_numa case if possible.
>>>
>>> For all cases other than the PageAnonExclusive case, if the case holds true
>>> for one pte in the batch, one can confirm that that case will hold true for
>>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
>>> and access bits across the batch, therefore batching across
>>> pte_dirty(): this is correct since the dirty bit on the PTE really is
>>> just an indication that the folio got written to, so even if the PTE is
>>> not actually dirty (but one of the PTEs in the batch is), the wp-fault
>>> optimization can be made.
>>>
>>> The crux now is how to batch around the PageAnonExclusive case; we must
>>> check the corresponding condition for every single page. Therefore, from
>>> the large folio batch, we process sub batches of ptes mapping pages with
>>> the same PageAnonExclusive condition, and process that sub batch, then
>>> determine and process the next sub batch, and so on. Note that this does
>>> not cause any extra overhead; if suppose the size of the folio batch
>>> is 512, then the sub batch processing in total will take 512 iterations,
>>> which is the same as what we would have done before.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>   mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
>>>   1 file changed, 117 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 627b0d67cc4a..28c7ce7728ff 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -40,35 +40,47 @@
>>>     #include "internal.h"
>>>   -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> -                 pte_t pte)
>>> -{
>>> -    struct page *page;
>>> +enum tristate {
>>> +    TRI_FALSE = 0,
>>> +    TRI_TRUE = 1,
>>> +    TRI_MAYBE = -1,
>>> +};
>>>   +/*
>>> + * Returns enum tristate indicating whether the pte can be changed to writable.
>>> + * If TRI_MAYBE is returned, then the folio is anonymous and the user must
>>> + * additionally check PageAnonExclusive() for every page in the desired range.
>>> + */
>>> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
>>> +                     unsigned long addr, pte_t pte,
>>> +                     struct folio *folio)
>>> +{
>>>       if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>>> -        return false;
>>> +        return TRI_FALSE;
>>>         /* Don't touch entries that are not even readable. */
>>>       if (pte_protnone(pte))
>>> -        return false;
>>> +        return TRI_FALSE;
>>>         /* Do we need write faults for softdirty tracking? */
>>>       if (pte_needs_soft_dirty_wp(vma, pte))
>>> -        return false;
>>> +        return TRI_FALSE;
>>>         /* Do we need write faults for uffd-wp tracking? */
>>>       if (userfaultfd_pte_wp(vma, pte))
>>> -        return false;
>>> +        return TRI_FALSE;
>>>         if (!(vma->vm_flags & VM_SHARED)) {
>>>           /*
>>>            * Writable MAP_PRIVATE mapping: We can only special-case on
>>>            * exclusive anonymous pages, because we know that our
>>>            * write-fault handler similarly would map them writable without
>>> -         * any additional checks while holding the PT lock.
>>> +         * any additional checks while holding the PT lock. So if the
>>> +         * folio is not anonymous, we know we cannot change pte to
>>> +         * writable. If it is anonymous then the caller must further
>>> +         * check that the page is AnonExclusive().
>>>            */
>>> -        page = vm_normal_page(vma, addr, pte);
>>> -        return page && PageAnon(page) && PageAnonExclusive(page);
>>> +        return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
>>>       }
>>>         VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>>> @@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
>>> unsigned long addr,
>>>        * FS was already notified and we can simply mark the PTE writable
>>>        * just like the write-fault handler would do.
>>>        */
>>> -    return pte_dirty(pte);
>>> +    return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
>>> +}
>>> +
>>> +/*
>>> + * Returns the number of pages within the folio, starting from the page
>>> + * indicated by pgidx and up to pgidx + max_nr, that have the same value of
>>> + * PageAnonExclusive(). Must only be called for anonymous folios. Value of
>>> + * PageAnonExclusive() is returned in *exclusive.
>>> + */
>>> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
>>> +                bool *exclusive)
>>> +{
>>> +    struct page *page;
>>> +    int nr = 1;
>>> +
>>> +    if (!folio) {
>>> +        *exclusive = false;
>>> +        return nr;
>>> +    }
>>> +
>>> +    page = folio_page(folio, pgidx++);
>>> +    *exclusive = PageAnonExclusive(page);
>>> +    while (nr < max_nr) {
>>> +        page = folio_page(folio, pgidx++);
>>> +        if ((*exclusive) != PageAnonExclusive(page))
>> nit: brackets not required around *exclusive.
> 
> Thanks I'll drop it. I have a habit of putting brackets everywhere
> because debugging operator precedence bugs is a nightmare - finally
> the time has come to learn operator precedence!
> 
>>
>>> +            break;
>>> +        nr++;
>>> +    }
>>> +
>>> +    return nr;
>>> +}
>>> +
>>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> +                 pte_t pte)
>>> +{
>>> +    struct page *page;
>>> +    int ret;
>>> +
>>> +    ret = maybe_change_pte_writable(vma, addr, pte, NULL);
>>> +    if (ret == TRI_MAYBE) {
>>> +        page = vm_normal_page(vma, addr, pte);
>>> +        ret = page && PageAnon(page) && PageAnonExclusive(page);
>>> +    }
>>> +
>>> +    return ret;
>>>   }
>>>     static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>>> -        pte_t *ptep, pte_t pte, int max_nr_ptes)
>>> +        pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
>>>   {
>>> -    const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> +    fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> +
>>> +    flags &= ~switch_off_flags;
>> This is mega confusing when reading the caller. Because the caller passes
>> FPB_IGNORE_SOFT_DIRTY and that actually means DON'T ignore soft dirty.
>>
>> Can't we just pass in the flags we want?
> 
> Yup that is cleaner.
> 
>>
>>>   -    if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>> +    if (!folio || !folio_test_large(folio))
>> What's the rationale for dropping the max_nr_ptes == 1 condition? If you don't
>> need it, why did you add it in the earlier patch?
> 
> Stupid me forgot to drop it from the earlier patch.
> 
>>
>>>           return 1;
>>>         return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>> @@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop,
>>> struct vm_area_struct *vma
>>>       }
>>>     skip_batch:
>>> -    nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>>> +    nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>>> +                       max_nr_ptes, 0);
>>>   out:
>>>       *foliop = folio;
>>>       return nr_ptes;
>>> @@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>           if (pte_present(oldpte)) {
>>>               int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>>               struct folio *folio = NULL;
>>> -            pte_t ptent;
>>> +            int sub_nr_ptes, pgidx = 0;
>>> +            pte_t ptent, newpte;
>>> +            bool sub_set_write;
>>> +            int set_write;
>>>                 /*
>>>                * Avoid trapping faults against the zero or KSM
>>> @@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                       continue;
>>>               }
>>>   +            if (!folio)
>>> +                folio = vm_normal_folio(vma, addr, oldpte);
>>> +
>>> +            nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>>> +                               max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
>>  From the other thread, my memory is jogged that this function ignores write
>> permission bit. So I think that's opening up a bug when applied here? If the
>> first pte is writable but the rest are not (COW), doesn't this now make them all
>> writable? I don't *think* that's a problem for the prot_numa use, but I could be
>> wrong.
> 
> No, we are not ignoring the write permission bit. There is no way currently to
> do that via folio_pte_batch. So the pte batch is either entirely writable or
> entirely not.

How are you enforcing that then? Surely folio_pte_batch() is the only thing
looking at the individual PTEs? It's not guaranteed that the permissions are all
the same just because the PTEs all belong to a single VMA; for example, you could
have an RW private anon VMA which has been forked, so all PTEs were
write-protected for COW, and then individual PTEs have faulted and broken COW.
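
A quick userspace sketch of that scenario, for reference. It only sets up the
state (whether the range is actually backed by a single large folio depends on
the THP settings), but after the partial write the parent's VMA contains both
writable and still-write-protected PTEs:

 #define _GNU_SOURCE
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <stdio.h>
 #include <string.h>
 #include <unistd.h>

int main(void)
{
	size_t size = 2UL << 20;	/* 2M of private anon memory */
	char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pid_t pid;

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Populate: all PTEs present and writable. */
	memset(p, 1, size);

	pid = fork();
	if (pid == 0) {
		sleep(1);	/* keep the COW sharing alive briefly */
		_exit(0);
	}

	/*
	 * fork() write-protected the parent's PTEs for COW. Writing only
	 * the first half breaks COW there, so this single VMA now has some
	 * PTEs writable again while the rest stay write-protected.
	 */
	memset(p, 2, size / 2);

	waitpid(pid, NULL, 0);
	return 0;
}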


> 
>>
>>>               oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>> Even if I'm wrong about ignoring write bit being a bug, I don't think the docs
>> for this function permit write bit to be different across the batch?
>>
>>>               ptent = pte_modify(oldpte, newprot);
>>>   @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                * example, if a PTE is already dirty and no other
>>>                * COW or special handling is required.
>>>                */
>>> -            if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>> -                !pte_write(ptent) &&
>>> -                can_change_pte_writable(vma, addr, ptent))
>>> -                ptent = pte_mkwrite(ptent, vma);
>>> -
>>> -            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>> -            if (pte_needs_flush(oldpte, ptent))
>>> -                tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>> -            pages++;
>>> +            set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>> +                    !pte_write(ptent);
>>> +            if (set_write)
>>> +                set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
>> Why not just:
>>             set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>                     !pte_write(ptent) &&
>>                     maybe_change_pte_writable(...);
> 
> set_write is an int, which is supposed to span {TRI_MAYBE, TRI_FALSE, TRI_TRUE},
> whereas the RHS of this statement will always return a boolean.
> 
> You proposed it like this in your diff; it took hours for my eyes to catch this : )

Ahh good spot! I don't really love the tristate thing, but couldn't think of
anything better. So I guess it should really be:

		set_write = ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
			    !pte_write(ptent)) ? TRI_MAYBE : TRI_FALSE;
		if (set_write == TRI_MAYBE)
			set_write = maybe_change_pte_writable(...);

That would make it a bit more obvious as to what is going on for the reader?

Thanks,
Ryan

> 
>>
>> ?
>>
>>> +
>>> +            while (nr_ptes) {
>>> +                if (set_write == TRI_MAYBE) {
>>> +                    sub_nr_ptes = anon_exclusive_batch(folio,
>>> +                        pgidx, nr_ptes, &sub_set_write);
>>> +                } else {
>>> +                    sub_nr_ptes = nr_ptes;
>>> +                    sub_set_write = (set_write == TRI_TRUE);
>>> +                }
>>> +
>>> +                if (sub_set_write)
>>> +                    newpte = pte_mkwrite(ptent, vma);
>>> +                else
>>> +                    newpte = ptent;
>>> +
>>> +                modify_prot_commit_ptes(vma, addr, pte, oldpte,
>>> +                            newpte, sub_nr_ptes);
>>> +                if (pte_needs_flush(oldpte, newpte))
>> What did we conclude with pte_needs_flush()? I thought there was an arch where
>> it looked dodgy calling this for just the pte at the head of the batch?
> 
> Powerpc flushes if access bit transitions from set to unset. x86 does that
> for both dirty and access. Both problems are solved by modify_prot_start_ptes()
> which collects a/d bits, both in the generic implementation and the arm64
> implementation.
> 
>>
>> Thanks,
>> Ryan
>>
>>> +                    tlb_flush_pte_range(tlb, addr,
>>> +                        sub_nr_ptes * PAGE_SIZE);
>>> +
>>> +                addr += sub_nr_ptes * PAGE_SIZE;
>>> +                pte += sub_nr_ptes;
>>> +                oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>>> +                ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>>> +                nr_ptes -= sub_nr_ptes;
>>> +                pages += sub_nr_ptes;
>>> +                pgidx += sub_nr_ptes;
>>> +            }
>>>           } else if (is_swap_pte(oldpte)) {
>>>               swp_entry_t entry = pte_to_swp_entry(oldpte);
>>>               pte_t newpte;



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-06-30 11:40     ` Dev Jain
@ 2025-06-30 11:51       ` Lorenzo Stoakes
  2025-06-30 11:56         ` Dev Jain
  0 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 11:51 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Mon, Jun 30, 2025 at 05:10:36PM +0530, Dev Jain wrote:
>
> On 30/06/25 4:55 pm, Lorenzo Stoakes wrote:
> > On Sat, Jun 28, 2025 at 05:04:32PM +0530, Dev Jain wrote:
> > > In case of prot_numa, there are various cases in which we can skip to the
> > > next iteration. Since the skip condition is based on the folio and not
> > > the PTEs, we can skip a PTE batch. Additionally refactor all of this
> > > into a new function to clean up the existing code.
> > Hmm, is this a completely new concept for this series?
> >
> > Please try not to introduce brand new things to a series midway through.
> >
> > This seems to be adding a whole ton of questionable logic for an edge case.
> >
> > Can we maybe just drop this for this series please?
>
> I refactored this into a new function on David's suggestion:
>
> https://lore.kernel.org/all/912757c0-8a75-4307-a0bd-8755f6135b5a@redhat.com/
>
> Maybe you are saying to have a refactoring patch first and then the "skip a
> PTE batch" patch second; I'll be happy to do that, that would be cleaner.

OK apologies then, my mistake.

So essentially you were doing this explicitly for the prot numa case and it just
wasn't clear in the subject line before, ok.

Yes please separate the two out!

> This series was (of course, IMHO) pretty stable at v3, and there were comments
> coming on David's series, so I guessed that he will have to post a v2 anyways
> after mine gets merged. My guess could have been wrong. For the khugepaged
> batching series, I have sent the migration race patch separately exactly
> because of his series, so that in that case the rebasing burden is mine.

This stuff can be difficult to align on, but I'd suggest that when there's
another series in the interim that is fundamentally changing a function
signature that will make your code not compile, it's probably best to hold off.

And that's why I'm suggesting this here.

>
> >
> > I know you like to rush out a dozen series at once, but once again I'm asking
> > maybe please hold off?
>
> Lorenzo : ) Except for the mremap series which you pointed out, I make it a point
> to never repost in the same week, unless it is an obvious single patch, and even
> in that case I give 2-3 days for the reviews to settle. I posted
> v3 of this series more than a month ago, so it makes total sense to post this.
> Also I have seen many people spamming the list with the next versions on literally
> the same day, not that I am using this as a precedent. The mistake I made here
> is that on Saturday I saw David's series but then forgot that I am using the
> same infrastructure he is changing and went ahead posting this. I suddenly
> remembered this during the khugepaged series and dropped the first two patches
> for that.

I'm not saying you shot this out shortly after a previous version, I'm saying
you have a whole bunch of series at once (I know as I'm cc'd on them all I think
:), and I'm politely asking you to maybe not send another that causes a known
conflict.

Whether or not you do so is up to you, it was a request.

>
> >
> > I seem to remember David asked you for the same thing because of this, but maybe
> > I'm misremembering.
>
> I don't recollect that happening, maybe I am wrong.

Yeah maybe I'm misremembering, apologies if so!


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-06-30 11:39     ` Ryan Roberts
@ 2025-06-30 11:53       ` Lorenzo Stoakes
  0 siblings, 0 replies; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 11:53 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Mon, Jun 30, 2025 at 12:39:33PM +0100, Ryan Roberts wrote:
> On 30/06/2025 12:25, Lorenzo Stoakes wrote:
> > On Sat, Jun 28, 2025 at 05:04:32PM +0530, Dev Jain wrote:
> >> In case of prot_numa, there are various cases in which we can skip to the
> >> next iteration. Since the skip condition is based on the folio and not
> >> the PTEs, we can skip a PTE batch. Additionally refactor all of this
> >> into a new function to clean up the existing code.
> >
> > Hmm, is this a completely new concept for this series?
> >
> > Please try not to introduce brand new things to a series midway through.
> >
> > This seems to be adding a whole ton of questionable logic for an edge case.
> >
> > Can we maybe just drop this for this series please?
>
> From my perspective, at least, there are no new logical changes in here vs the
> previous version. And I don't think the patches have been re-organised either.
> David (I think?) was asking for the name of the patch to be changed to include
> MM_CP_PROT_NUMA and also for the code to be moved out of line to its own
> function. That's all that Dev has done AFAICT (although as per my review
> comments, the refactoring has introduced a bug).
>
> My preference is that we should ultimately support this batching. It could be a
> separate series if you insist, but it's all contributing to the same goal
> ultimately: making mprotect support PTE batching.
>
> Just my 2c.
>
> Thanks,
> Ryan

Ack, was my mistake, apologies. I hadn't realised this was part of the series, I
hadn't looked at it for a while...

But I think it's better to have the refactor and the batch bit done separately
so it's clear which is which, unless the change is so trivial that splitting it
would just be noise.

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-06-30 11:50       ` Ryan Roberts
@ 2025-06-30 11:53         ` Dev Jain
  0 siblings, 0 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-30 11:53 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 30/06/25 5:20 pm, Ryan Roberts wrote:
> On 30/06/2025 12:21, Dev Jain wrote:
>> On 30/06/25 4:01 pm, Ryan Roberts wrote:
>>> On 28/06/2025 12:34, Dev Jain wrote:
>>>> Use folio_pte_batch to batch process a large folio. Reuse the folio from
>>>> prot_numa case if possible.
>>>>
>>>> For all cases other than the PageAnonExclusive case, if the case holds true
>>>> for one pte in the batch, one can confirm that that case will hold true for
>>>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>>>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
>>>> and access bits across the batch, therefore batching across
>>>> pte_dirty(): this is correct since the dirty bit on the PTE really is
>>>> just an indication that the folio got written to, so even if the PTE is
>>>> not actually dirty (but one of the PTEs in the batch is), the wp-fault
>>>> optimization can be made.
>>>>
>>>> The crux now is how to batch around the PageAnonExclusive case; we must
>>>> check the corresponding condition for every single page. Therefore, from
>>>> the large folio batch, we process sub batches of ptes mapping pages with
>>>> the same PageAnonExclusive condition, and process that sub batch, then
>>>> determine and process the next sub batch, and so on. Note that this does
>>>> not cause any extra overhead; if suppose the size of the folio batch
>>>> is 512, then the sub batch processing in total will take 512 iterations,
>>>> which is the same as what we would have done before.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>    mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
>>>>    1 file changed, 117 insertions(+), 26 deletions(-)
>>>>
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index 627b0d67cc4a..28c7ce7728ff 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -40,35 +40,47 @@
>>>>      #include "internal.h"
>>>>    -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> -                 pte_t pte)
>>>> -{
>>>> -    struct page *page;
>>>> +enum tristate {
>>>> +    TRI_FALSE = 0,
>>>> +    TRI_TRUE = 1,
>>>> +    TRI_MAYBE = -1,
>>>> +};
>>>>    +/*
>>>> + * Returns enum tristate indicating whether the pte can be changed to writable.
>>>> + * If TRI_MAYBE is returned, then the folio is anonymous and the user must
>>>> + * additionally check PageAnonExclusive() for every page in the desired range.
>>>> + */
>>>> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
>>>> +                     unsigned long addr, pte_t pte,
>>>> +                     struct folio *folio)
>>>> +{
>>>>        if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>>>> -        return false;
>>>> +        return TRI_FALSE;
>>>>          /* Don't touch entries that are not even readable. */
>>>>        if (pte_protnone(pte))
>>>> -        return false;
>>>> +        return TRI_FALSE;
>>>>          /* Do we need write faults for softdirty tracking? */
>>>>        if (pte_needs_soft_dirty_wp(vma, pte))
>>>> -        return false;
>>>> +        return TRI_FALSE;
>>>>          /* Do we need write faults for uffd-wp tracking? */
>>>>        if (userfaultfd_pte_wp(vma, pte))
>>>> -        return false;
>>>> +        return TRI_FALSE;
>>>>          if (!(vma->vm_flags & VM_SHARED)) {
>>>>            /*
>>>>             * Writable MAP_PRIVATE mapping: We can only special-case on
>>>>             * exclusive anonymous pages, because we know that our
>>>>             * write-fault handler similarly would map them writable without
>>>> -         * any additional checks while holding the PT lock.
>>>> +         * any additional checks while holding the PT lock. So if the
>>>> +         * folio is not anonymous, we know we cannot change pte to
>>>> +         * writable. If it is anonymous then the caller must further
>>>> +         * check that the page is AnonExclusive().
>>>>             */
>>>> -        page = vm_normal_page(vma, addr, pte);
>>>> -        return page && PageAnon(page) && PageAnonExclusive(page);
>>>> +        return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
>>>>        }
>>>>          VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>>>> @@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
>>>> unsigned long addr,
>>>>         * FS was already notified and we can simply mark the PTE writable
>>>>         * just like the write-fault handler would do.
>>>>         */
>>>> -    return pte_dirty(pte);
>>>> +    return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Returns the number of pages within the folio, starting from the page
>>>> + * indicated by pgidx and up to pgidx + max_nr, that have the same value of
>>>> + * PageAnonExclusive(). Must only be called for anonymous folios. Value of
>>>> + * PageAnonExclusive() is returned in *exclusive.
>>>> + */
>>>> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
>>>> +                bool *exclusive)
>>>> +{
>>>> +    struct page *page;
>>>> +    int nr = 1;
>>>> +
>>>> +    if (!folio) {
>>>> +        *exclusive = false;
>>>> +        return nr;
>>>> +    }
>>>> +
>>>> +    page = folio_page(folio, pgidx++);
>>>> +    *exclusive = PageAnonExclusive(page);
>>>> +    while (nr < max_nr) {
>>>> +        page = folio_page(folio, pgidx++);
>>>> +        if ((*exclusive) != PageAnonExclusive(page))
>>> nit: brackets not required around *exclusive.
>> Thanks I'll drop it. I have a habit of putting brackets everywhere
>> because debugging operator precedence bugs is a nightmare - finally
>> the time has come to learn operator precedence!
>>
>>>> +            break;
>>>> +        nr++;
>>>> +    }
>>>> +
>>>> +    return nr;
>>>> +}
>>>> +
>>>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> +                 pte_t pte)
>>>> +{
>>>> +    struct page *page;
>>>> +    int ret;
>>>> +
>>>> +    ret = maybe_change_pte_writable(vma, addr, pte, NULL);
>>>> +    if (ret == TRI_MAYBE) {
>>>> +        page = vm_normal_page(vma, addr, pte);
>>>> +        ret = page && PageAnon(page) && PageAnonExclusive(page);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>>    }
>>>>      static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>>>> -        pte_t *ptep, pte_t pte, int max_nr_ptes)
>>>> +        pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
>>>>    {
>>>> -    const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>> +    fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>> +
>>>> +    flags &= ~switch_off_flags;
>>> This is mega confusing when reading the caller. Because the caller passes
>>> FPB_IGNORE_SOFT_DIRTY and that actually means DON'T ignore soft dirty.
>>>
>>> Can't we just pass in the flags we want?
>> Yup that is cleaner.
>>
>>>>    -    if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>>> +    if (!folio || !folio_test_large(folio))
>>> What's the rationale for dropping the max_nr_ptes == 1 condition? If you don't
>>> need it, why did you add it in the earlier patch?
>> Stupid me forgot to drop it from the earlier patch.
>>
>>>>            return 1;
>>>>          return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>>> @@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop,
>>>> struct vm_area_struct *vma
>>>>        }
>>>>      skip_batch:
>>>> -    nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>>>> +    nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>>>> +                       max_nr_ptes, 0);
>>>>    out:
>>>>        *foliop = folio;
>>>>        return nr_ptes;
>>>> @@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>            if (pte_present(oldpte)) {
>>>>                int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>>>                struct folio *folio = NULL;
>>>> -            pte_t ptent;
>>>> +            int sub_nr_ptes, pgidx = 0;
>>>> +            pte_t ptent, newpte;
>>>> +            bool sub_set_write;
>>>> +            int set_write;
>>>>                  /*
>>>>                 * Avoid trapping faults against the zero or KSM
>>>> @@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>                        continue;
>>>>                }
>>>>    +            if (!folio)
>>>> +                folio = vm_normal_folio(vma, addr, oldpte);
>>>> +
>>>> +            nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>>>> +                               max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
>>>   From the other thread, my memory is jogged that this function ignores write
>>> permission bit. So I think that's opening up a bug when applied here? If the
>>> first pte is writable but the rest are not (COW), doesn't this now make them all
>>> writable? I don't *think* that's a problem for the prot_numa use, but I could be
>>> wrong.
>> No, we are not ignoring the write permission bit. There is no way currently to
>> do that via folio_pte_batch. So the pte batch is either entirely writable or
>> entirely not.
> How are you enforcing that then? Surely folio_pte_batch() is the only thing
> looking at the individual PTEs? It's not guaranteed that the permissions are all
> the same just because the PTEs all belong to a single VMA; for example, you could
> have an RW private anon VMA which has been forked, so all PTEs were
> write-protected for COW, and then individual PTEs have faulted and broken COW.

Yup, I just replied in the other mail; I missed the pte_wrprotect() in folio_pte_batch().

>
>
>>>>                oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>> Even if I'm wrong about ignoring write bit being a bug, I don't think the docs
>>> for this function permit write bit to be different across the batch?
>>>
>>>>                ptent = pte_modify(oldpte, newprot);
>>>>    @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>                 * example, if a PTE is already dirty and no other
>>>>                 * COW or special handling is required.
>>>>                 */
>>>> -            if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>> -                !pte_write(ptent) &&
>>>> -                can_change_pte_writable(vma, addr, ptent))
>>>> -                ptent = pte_mkwrite(ptent, vma);
>>>> -
>>>> -            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>>> -            if (pte_needs_flush(oldpte, ptent))
>>>> -                tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>>> -            pages++;
>>>> +            set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>> +                    !pte_write(ptent);
>>>> +            if (set_write)
>>>> +                set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
>>> Why not just:
>>>              set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>                      !pte_write(ptent) &&
>>>                      maybe_change_pte_writable(...);
>> set_write is an int, which is supposed to span {TRI_MAYBE, TRI_FALSE, TRI_TRUE},
>> whereas the RHS of this statement will always return a boolean.
>>
>> You proposed it like this in your diff; it took hours for my eyes to catch this : )
> Ahh good spot! I don't really love the tristate thing, but couldn't think of
> anything better. So I guess it should really be:
>
> 		set_write = ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> 			    !pte_write(ptent)) ? TRI_MAYBE : TRI_FALSE;
> 		if (set_write == TRI_MAYBE)
> 			set_write = maybe_change_pte_writable(...);
>
> That would make it a bit more obvious as to what is going on for the reader?

Nice!

>
> Thanks,
> Ryan
>
>>> ?
>>>
>>>> +
>>>> +            while (nr_ptes) {
>>>> +                if (set_write == TRI_MAYBE) {
>>>> +                    sub_nr_ptes = anon_exclusive_batch(folio,
>>>> +                        pgidx, nr_ptes, &sub_set_write);
>>>> +                } else {
>>>> +                    sub_nr_ptes = nr_ptes;
>>>> +                    sub_set_write = (set_write == TRI_TRUE);
>>>> +                }
>>>> +
>>>> +                if (sub_set_write)
>>>> +                    newpte = pte_mkwrite(ptent, vma);
>>>> +                else
>>>> +                    newpte = ptent;
>>>> +
>>>> +                modify_prot_commit_ptes(vma, addr, pte, oldpte,
>>>> +                            newpte, sub_nr_ptes);
>>>> +                if (pte_needs_flush(oldpte, newpte))
>>> What did we conclude with pte_needs_flush()? I thought there was an arch where
>>> it looked dodgy calling this for just the pte at the head of the batch?
>> Powerpc flushes if access bit transitions from set to unset. x86 does that
>> for both dirty and access. Both problems are solved by modify_prot_start_ptes()
>> which collects a/d bits, both in the generic implementation and the arm64
>> implementation.
>>
>>> Thanks,
>>> Ryan
>>>
>>>> +                    tlb_flush_pte_range(tlb, addr,
>>>> +                        sub_nr_ptes * PAGE_SIZE);
>>>> +
>>>> +                addr += sub_nr_ptes * PAGE_SIZE;
>>>> +                pte += sub_nr_ptes;
>>>> +                oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>>>> +                ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>>>> +                nr_ptes -= sub_nr_ptes;
>>>> +                pages += sub_nr_ptes;
>>>> +                pgidx += sub_nr_ptes;
>>>> +            }
>>>>            } else if (is_swap_pte(oldpte)) {
>>>>                swp_entry_t entry = pte_to_swp_entry(oldpte);
>>>>                pte_t newpte;


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-06-30 11:51       ` Lorenzo Stoakes
@ 2025-06-30 11:56         ` Dev Jain
  0 siblings, 0 replies; 62+ messages in thread
From: Dev Jain @ 2025-06-30 11:56 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 30/06/25 5:21 pm, Lorenzo Stoakes wrote:
> On Mon, Jun 30, 2025 at 05:10:36PM +0530, Dev Jain wrote:
>> On 30/06/25 4:55 pm, Lorenzo Stoakes wrote:
>>> On Sat, Jun 28, 2025 at 05:04:32PM +0530, Dev Jain wrote:
>>>> In case of prot_numa, there are various cases in which we can skip to the
>>>> next iteration. Since the skip condition is based on the folio and not
>>>> the PTEs, we can skip a PTE batch. Additionally refactor all of this
>>>> into a new function to clean up the existing code.
>>> Hmm, is this a completely new concept for this series?
>>>
>>> Please try not to introduce brand new things to a series midway through.
>>>
>>> This seems to be adding a whole ton of questionable logic for an edge case.
>>>
>>> Can we maybe just drop this for this series please?
>> I refactored this into a new function on David's suggestion:
>>
>> https://lore.kernel.org/all/912757c0-8a75-4307-a0bd-8755f6135b5a@redhat.com/
>>
>> Maybe you are saying to have a refactoring patch first and then the "skip a
>> PTE batch" patch second; I'll be happy to do that, that would be cleaner.
> OK apologies then, my mistake.
>
> So essentially you were doing this explicitly for the prot numa case and it just
> wasn't clear in subject line before, ok.
>
> Yes please separate the two out!
>
>> This series was (of course, IMHO) pretty stable at v3, and there were comments
>> coming on David's series, so I guessed that he will have to post a v2 anyways
>> after mine gets merged. My guess could have been wrong. For the khugepaged
>> batching series, I have sent the migration race patch separately exactly
>> because of his series, so that in that case the rebasing burden is mine.
> This stuff can be difficult to align on, but I'd suggest that when there's
> another series in the interim that is fundamentally changing a function
> signature that will make your code not compile, it's probably best to hold off.

Good point about changing the function signature. Anyway, my bad, I should have
been more careful, apologies.

>
> And that's why I'm suggesting this here.
>
>>> I know you like to rush out a dozen series at once, but once again I'm asking
>>> maybe please hold off?
>> Lorenzo : ) Except for the mremap series which you pointed out, I make it a point
>> to never repost in the same week, unless it is an obvious single patch, and even
>> in that case I give 2-3 days for the reviews to settle. I posted
>> v3 of this series more than a month ago, so it makes total sense to post this.
>> Also I have seen many people spamming the list with the next versions on literally
>> the same day, not that I am using this as a precedent. The mistake I made here
>> is that on Saturday I saw David's series but then forgot that I am using the
>> same infrastructure he is changing and went ahead posting this. I suddenly
>> remembered this during the khugepaged series and dropped the first two patches
>> for that.
> I'm not saying you shot this out shortly after a previous version, I'm saying
> you have a whole bunch of series at once (I know as I'm cc'd on them all I think
> :), and I'm politely asking you to maybe not send another that causes a known
> conflict.
>
> Whether or not you do so is up to you, it was a request.

Yup my bad for forgetting about the conflict, apologies!

>
>>> I seem to remember David asked you for the same thing because of this, but maybe
>>> I'm misremembering.
>> I don't recollect that happening, maybe I am wrong.
> Yeah maybe I'm misremembering, apologies if so!


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-06-28 11:34 ` [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching Dev Jain
  2025-06-28 12:39   ` Dev Jain
  2025-06-30 10:31   ` Ryan Roberts
@ 2025-06-30 12:52   ` Lorenzo Stoakes
  2025-07-01  5:30     ` Dev Jain
  2025-07-01  8:03     ` Ryan Roberts
  2 siblings, 2 replies; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 12:52 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Sat, Jun 28, 2025 at 05:04:34PM +0530, Dev Jain wrote:
> Use folio_pte_batch to batch process a large folio. Reuse the folio from
> prot_numa case if possible.
>
> For all cases other than the PageAnonExclusive case, if the case holds true
> for one pte in the batch, one can confirm that that case will hold true for
> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
> and access bits across the batch, therefore batching across
> pte_dirty(): this is correct since the dirty bit on the PTE really is
> just an indication that the folio got written to, so even if the PTE is
> not actually dirty (but one of the PTEs in the batch is), the wp-fault
> optimization can be made.
>
> The crux now is how to batch around the PageAnonExclusive case; we must
> check the corresponding condition for every single page. Therefore, from
> the large folio batch, we process sub batches of ptes mapping pages with
> the same PageAnonExclusive condition, and process that sub batch, then
> determine and process the next sub batch, and so on. Note that this does
> not cause any extra overhead; if suppose the size of the folio batch
> is 512, then the sub batch processing in total will take 512 iterations,
> which is the same as what we would have done before.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 117 insertions(+), 26 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 627b0d67cc4a..28c7ce7728ff 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -40,35 +40,47 @@
>
>  #include "internal.h"
>
> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte)
> -{
> -	struct page *page;
> +enum tristate {
> +	TRI_FALSE = 0,
> +	TRI_TRUE = 1,
> +	TRI_MAYBE = -1,
> +};

Yeah no, absolutely not, this is horrible, I don't want to see an arbitrary type
like this added, to a random file, and I absolutely think this adds confusion
and does not in any way help clarify things.

>
> +/*
> + * Returns enum tristate indicating whether the pte can be changed to writable.
> + * If TRI_MAYBE is returned, then the folio is anonymous and the user must
> + * additionally check PageAnonExclusive() for every page in the desired range.
> + */
> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
> +				     unsigned long addr, pte_t pte,
> +				     struct folio *folio)
> +{
>  	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
> -		return false;
> +		return TRI_FALSE;
>
>  	/* Don't touch entries that are not even readable. */
>  	if (pte_protnone(pte))
> -		return false;
> +		return TRI_FALSE;
>
>  	/* Do we need write faults for softdirty tracking? */
>  	if (pte_needs_soft_dirty_wp(vma, pte))
> -		return false;
> +		return TRI_FALSE;
>
>  	/* Do we need write faults for uffd-wp tracking? */
>  	if (userfaultfd_pte_wp(vma, pte))
> -		return false;
> +		return TRI_FALSE;
>
>  	if (!(vma->vm_flags & VM_SHARED)) {
>  		/*
>  		 * Writable MAP_PRIVATE mapping: We can only special-case on
>  		 * exclusive anonymous pages, because we know that our
>  		 * write-fault handler similarly would map them writable without
> -		 * any additional checks while holding the PT lock.
> +		 * any additional checks while holding the PT lock. So if the
> +		 * folio is not anonymous, we know we cannot change pte to
> +		 * writable. If it is anonymous then the caller must further
> +		 * check that the page is AnonExclusive().
>  		 */
> -		page = vm_normal_page(vma, addr, pte);
> -		return page && PageAnon(page) && PageAnonExclusive(page);
> +		return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
>  	}
>
>  	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
> @@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	 * FS was already notified and we can simply mark the PTE writable
>  	 * just like the write-fault handler would do.
>  	 */
> -	return pte_dirty(pte);
> +	return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
> +}

Yeah not a fan of this at all.

This is squashing all the logic into one place when we don't really need to.

We can separate out the shared logic and just do something like:

////// Lorenzo's suggestion //////

-bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t pte)
+static bool maybe_change_pte_writable(struct vm_area_struct *vma,
+		pte_t pte)
 {
-	struct page *page;
-
 	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
 		return false;

@@ -60,16 +58,14 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	if (userfaultfd_pte_wp(vma, pte))
 		return false;

-	if (!(vma->vm_flags & VM_SHARED)) {
-		/*
-		 * Writable MAP_PRIVATE mapping: We can only special-case on
-		 * exclusive anonymous pages, because we know that our
-		 * write-fault handler similarly would map them writable without
-		 * any additional checks while holding the PT lock.
-		 */
-		page = vm_normal_page(vma, addr, pte);
-		return page && PageAnon(page) && PageAnonExclusive(page);
-	}
+	return true;
+}
+
+static bool can_change_shared_pte_writable(struct vm_area_struct *vma,
+		pte_t pte)
+{
+	if (!maybe_change_pte_writable(vma, pte))
+		return false;

 	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));

@@ -83,6 +79,33 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	return pte_dirty(pte);
 }

+static bool can_change_private_pte_writable(struct vm_area_struct *vma,
+		unsigned long addr, pte_t pte)
+{
+	struct page *page;
+
+	if (!maybe_change_pte_writable(vma, pte))
+		return false;
+
+	/*
+	 * Writable MAP_PRIVATE mapping: We can only special-case on
+	 * exclusive anonymous pages, because we know that our
+	 * write-fault handler similarly would map them writable without
+	 * any additional checks while holding the PT lock.
+	 */
+	page = vm_normal_page(vma, addr, pte);
+	return page && PageAnon(page) && PageAnonExclusive(page);
+}
+
+bool can_change_pte_writable(struct vm_area_struct *vma,
+		unsigned long addr, pte_t pte)
+{
+	if (vma->vm_flags & VM_SHARED)
+		return can_change_shared_pte_writable(vma, pte);
+
+	return can_change_private_pte_writable(vma, addr, pte);
+}
+

////// end of Lorenzo's suggestion //////

You can obviously modify this to change other stuff like whether you feed back
the PAE or not in private case for use in your code.

> +
> +/*
> + * Returns the number of pages within the folio, starting from the page
> + * indicated by pgidx and up to pgidx + max_nr, that have the same value of
> + * PageAnonExclusive(). Must only be called for anonymous folios. Value of
> + * PageAnonExclusive() is returned in *exclusive.
> + */
> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
> +				bool *exclusive)

Let's generalise it to something like count_folio_fungible_pages()

or maybe count_folio_batchable_pages()?

Yes naming is hard... :P but right now it reads like this is returning a batch
or doing something with a batch.

> +{
> +	struct page *page;
> +	int nr = 1;
> +
> +	if (!folio) {
> +		*exclusive = false;
> +		return nr;
> +	}
> +
> +	page = folio_page(folio, pgidx++);
> +	*exclusive = PageAnonExclusive(page);
> +	while (nr < max_nr) {

The C programming language asks why you don't like using for :)

> +		page = folio_page(folio, pgidx++);
> +		if ((*exclusive) != PageAnonExclusive(page))
> +			break;
> +		nr++;

This *exclusive stuff makes me want to cry :)

Just set a local variable and hand it back at the end.

> +	}
> +
> +	return nr;
> +}
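
As a rough sketch of those two suggestions combined (a for loop, plus a local
variable that is written back to *exclusive once at the end), keeping the same
name and semantics as the posted helper:

static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
				bool *exclusive)
{
	bool first;
	int nr;

	if (!folio) {
		*exclusive = false;
		return 1;
	}

	/* Record the state of the first page, then count pages matching it. */
	first = PageAnonExclusive(folio_page(folio, pgidx));
	for (nr = 1; nr < max_nr; nr++) {
		if (PageAnonExclusive(folio_page(folio, pgidx + nr)) != first)
			break;
	}

	*exclusive = first;
	return nr;
}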
> +
> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> +			     pte_t pte)
> +{
> +	struct page *page;
> +	int ret;
> +
> +	ret = maybe_change_pte_writable(vma, addr, pte, NULL);
> +	if (ret == TRI_MAYBE) {
> +		page = vm_normal_page(vma, addr, pte);
> +		ret = page && PageAnon(page) && PageAnonExclusive(page);
> +	}
> +
> +	return ret;
>  }

See above comments on this stuff.

>
>  static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
> -		pte_t *ptep, pte_t pte, int max_nr_ptes)
> +		pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)

This last parameter is pretty horrible. It's a negative mask so now you're
passing 'ignore soft dirty' to the function meaning 'don't ignore it'. This is
just planting land mines.

Obviously David's flag changes will also alter all this.

Just add a boolean re: soft dirty.

>  {
> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +	fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +
> +	flags &= ~switch_off_flags;
>
> -	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
> +	if (!folio || !folio_test_large(folio))
>  		return 1;

Why remove this last check?

>
>  	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
> @@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma
>  	}
>
>  skip_batch:
> -	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
> +	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
> +					   max_nr_ptes, 0);

See above about flag param. If you change to boolean, please prefix this with
e.g. /* set_soft_dirty= */ true or whatever the flag ends up being :)
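
If it does become a boolean, the call sites could then document the argument
inline, roughly like this (parameter name purely illustrative; this site
currently keeps soft-dirty ignored):

	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
					   max_nr_ptes,
					   /* ignore_soft_dirty = */ true);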

>  out:
>  	*foliop = folio;
>  	return nr_ptes;
> @@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>  		if (pte_present(oldpte)) {
>  			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>  			struct folio *folio = NULL;
> -			pte_t ptent;
> +			int sub_nr_ptes, pgidx = 0;
> +			pte_t ptent, newpte;
> +			bool sub_set_write;
> +			int set_write;
>
>  			/*
>  			 * Avoid trapping faults against the zero or KSM
> @@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>  					continue;
>  			}
>
> +			if (!folio)
> +				folio = vm_normal_folio(vma, addr, oldpte);
> +
> +			nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
> +							   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);

Don't we only care about S/D if pte_needs_soft_dirty_wp()?
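
Put differently, soft-dirty only needs to constrain the batch when the VMA
tracks it at all, so the flags could plausibly be chosen per-VMA, along the
lines of (sketch only; assumes the vma_soft_dirty_enabled() helper from
mm/internal.h):

	fpb_t flags = FPB_IGNORE_DIRTY;

	/* Soft-dirty differences only matter if the VMA tracks soft-dirty. */
	if (!vma_soft_dirty_enabled(vma))
		flags |= FPB_IGNORE_SOFT_DIRTY;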

>  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>  			ptent = pte_modify(oldpte, newprot);
>
> @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			 * example, if a PTE is already dirty and no other
>  			 * COW or special handling is required.
>  			 */
> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> -			    !pte_write(ptent) &&
> -			    can_change_pte_writable(vma, addr, ptent))
> -				ptent = pte_mkwrite(ptent, vma);
> -
> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> -			if (pte_needs_flush(oldpte, ptent))
> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> -			pages++;
> +			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> +				    !pte_write(ptent);
> +			if (set_write)
> +				set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
> +
> +			while (nr_ptes) {
> +				if (set_write == TRI_MAYBE) {
> +					sub_nr_ptes = anon_exclusive_batch(folio,
> +						pgidx, nr_ptes, &sub_set_write);
> +				} else {
> +					sub_nr_ptes = nr_ptes;
> +					sub_set_write = (set_write == TRI_TRUE);
> +				}
> +
> +				if (sub_set_write)
> +					newpte = pte_mkwrite(ptent, vma);
> +				else
> +					newpte = ptent;
> +
> +				modify_prot_commit_ptes(vma, addr, pte, oldpte,
> +							newpte, sub_nr_ptes);
> +				if (pte_needs_flush(oldpte, newpte))
> +					tlb_flush_pte_range(tlb, addr,
> +						sub_nr_ptes * PAGE_SIZE);
> +
> +				addr += sub_nr_ptes * PAGE_SIZE;
> +				pte += sub_nr_ptes;
> +				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
> +				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
> +				nr_ptes -= sub_nr_ptes;
> +				pages += sub_nr_ptes;
> +				pgidx += sub_nr_ptes;
> +			}

I hate hate hate having this loop here, let's abstract this please.

I mean I think we can just use mprotect_folio_pte_batch() no? It's not
abstracting much here, and we can just do all this handling there. Maybe have to
pass in a bunch more params, but it saves us having to do all this.

Alternatively, we could add a new wrapper function, but yeah definitely not
this.

Also the C programming language asks... etc etc. ;)

Overall since you always end up processing folio_nr_pages(folio) you can just
have the batch function or a wrapper return this and do updates as necessary
here on that basis, and leave the 'sub' batching to that function.
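
For reference, one rough shape such a wrapper could take, simply hoisting the
loop out of the posted patch (it still uses the tristate set_write that is
under discussion above; the name and parameter list are illustrative only):

static long commit_pte_batch(struct mmu_gather *tlb, struct vm_area_struct *vma,
		struct folio *folio, unsigned long addr, pte_t *pte,
		pte_t oldpte, pte_t ptent, int nr_ptes, int set_write)
{
	long pages = 0;
	int pgidx = 0;

	while (nr_ptes) {
		bool sub_set_write;
		int sub_nr_ptes;
		pte_t newpte;

		if (set_write == TRI_MAYBE) {
			sub_nr_ptes = anon_exclusive_batch(folio, pgidx,
						nr_ptes, &sub_set_write);
		} else {
			sub_nr_ptes = nr_ptes;
			sub_set_write = (set_write == TRI_TRUE);
		}

		newpte = sub_set_write ? pte_mkwrite(ptent, vma) : ptent;

		modify_prot_commit_ptes(vma, addr, pte, oldpte, newpte,
					sub_nr_ptes);
		if (pte_needs_flush(oldpte, newpte))
			tlb_flush_pte_range(tlb, addr, sub_nr_ptes * PAGE_SIZE);

		/* Advance everything by the sub-batch just committed. */
		addr += sub_nr_ptes * PAGE_SIZE;
		pte += sub_nr_ptes;
		oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
		ptent = pte_advance_pfn(ptent, sub_nr_ptes);
		pgidx += sub_nr_ptes;
		nr_ptes -= sub_nr_ptes;
		pages += sub_nr_ptes;
	}

	return pages;
}

change_pte_range() would then only do pages += commit_pte_batch(...) once per
folio batch and advance addr/pte by nr_ptes itself.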


>  		} else if (is_swap_pte(oldpte)) {
>  			swp_entry_t entry = pte_to_swp_entry(oldpte);
>  			pte_t newpte;
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-06-28 11:34 ` [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
  2025-06-30 10:10   ` Ryan Roberts
@ 2025-06-30 12:57   ` Lorenzo Stoakes
  2025-07-01  4:44     ` Dev Jain
  1 sibling, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 12:57 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Sat, Jun 28, 2025 at 05:04:33PM +0530, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> Architecture can override these helpers; in case not, they are implemented
> as a simple loop over the corresponding single pte helpers.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Looks generally sensible! Some comments below.

> ---
>  include/linux/pgtable.h | 83 ++++++++++++++++++++++++++++++++++++++++-
>  mm/mprotect.c           |  4 +-
>  2 files changed, 84 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index cf1515c163e2..662f39e7475a 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1331,7 +1331,8 @@ static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>
>  /*
>   * Commit an update to a pte, leaving any hardware-controlled bits in
> - * the PTE unmodified.
> + * the PTE unmodified. The pte may have been "upgraded" w.r.t a/d bits compared
> + * to the old_pte, as in, it may have a/d bits on which were off in old_pte.
>   */
>  static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  					   unsigned long addr,
> @@ -1340,6 +1341,86 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>  }
>  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> +
> +/**
> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
> + * over a batch of ptes, which protects against asynchronous hardware
> + * modifications to the ptes. The intention is not to prevent the hardware from
> + * making pte updates, but to prevent any updates it may make from being lost.
> + * Please see the comment above ptep_modify_prot_start() for full description.
> + *
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
> + * in the batch.
> + *
> + * Note that PTE bits in the PTE batch besides the PFN can differ.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + * Since the batch is determined from folio_pte_batch, the PTEs must differ
> + * only in a/d bits (and the soft dirty bit; see fpb_t flags in
> + * mprotect_folio_pte_batch()).
> + */
> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> +	while (--nr) {
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
> +	}
> +	return pte;
> +}
> +#endif
> +
> +/**
> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
> + * hardware-controlled bits in the PTE unmodified.
> + *
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
> + * @pte: New page table entry to be set.
> + * @nr: Number of entries.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_modify_prot_commit().
> + *
> + * Context: The caller holds the page table lock. The PTEs are all in the same
> + * PMD. On exit, the set ptes in the batch map the same folio. The pte may have
> + * been "upgraded" w.r.t a/d bits compared to the old_pte, as in, it may have
> + * a/d bits on which were off in old_pte.
> + */
> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr; ++i) {
> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> +		ptep++;

Weird place to put this increment, maybe just stick it in the for loop.

> +		addr += PAGE_SIZE;

Same comment here.

> +		old_pte = pte_next_pfn(old_pte);

Could be:

		old_pte = pte;

No?

> +		pte = pte_next_pfn(pte);
> +	}
> +}
> +#endif
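
A small sketch of that restructuring (keeping the pte_next_pfn() updates,
since old_pte must keep tracking the original entry at each position, as
clarified later in the thread):

	for (i = 0; i < nr; ++i, ptep++, addr += PAGE_SIZE) {
		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
		old_pte = pte_next_pfn(old_pte);
		pte = pte_next_pfn(pte);
	}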
> +
>  #endif /* CONFIG_MMU */
>
>  /*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index af10a7fbe6b8..627b0d67cc4a 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -206,7 +206,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  					continue;
>  			}
>
> -			oldpte = ptep_modify_prot_start(vma, addr, pte);
> +			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>  			ptent = pte_modify(oldpte, newprot);
>
>  			if (uffd_wp)
> @@ -232,7 +232,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			    can_change_pte_writable(vma, addr, ptent))
>  				ptent = pte_mkwrite(ptent, vma);
>
> -			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> +			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>  			if (pte_needs_flush(oldpte, ptent))
>  				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>  			pages++;
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 0/4] Optimize mprotect() for large folios
  2025-06-30 11:43   ` Dev Jain
@ 2025-07-01  0:08     ` Andrew Morton
  0 siblings, 0 replies; 62+ messages in thread
From: Andrew Morton @ 2025-07-01  0:08 UTC (permalink / raw)
  To: Dev Jain
  Cc: Lorenzo Stoakes, ryan.roberts, david, willy, linux-mm,
	linux-kernel, catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Mon, 30 Jun 2025 17:13:07 +0530 Dev Jain <dev.jain@arm.com> wrote:

> @Andrew could you remove this from mm-new please?

Sure.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-06-30 12:57   ` Lorenzo Stoakes
@ 2025-07-01  4:44     ` Dev Jain
  2025-07-01  7:33       ` Ryan Roberts
  0 siblings, 1 reply; 62+ messages in thread
From: Dev Jain @ 2025-07-01  4:44 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 30/06/25 6:27 pm, Lorenzo Stoakes wrote:
> On Sat, Jun 28, 2025 at 05:04:33PM +0530, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architecture can override these helpers; in case not, they are implemented
>> as a simple loop over the corresponding single pte helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Looks generally sensible! Some comments below.
>
>> ---
>>   include/linux/pgtable.h | 83 ++++++++++++++++++++++++++++++++++++++++-
>>   mm/mprotect.c           |  4 +-
>>   2 files changed, 84 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index cf1515c163e2..662f39e7475a 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1331,7 +1331,8 @@ static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>>
>>   /*
>>    * Commit an update to a pte, leaving any hardware-controlled bits in
>> - * the PTE unmodified.
>> + * the PTE unmodified. The pte may have been "upgraded" w.r.t a/d bits compared
>> + * to the old_pte, as in, it may have a/d bits on which were off in old_pte.
>>    */
>>   static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   					   unsigned long addr,
>> @@ -1340,6 +1341,86 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>>   }
>>   #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>> +
>> +/**
>> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
>> + * over a batch of ptes, which protects against asynchronous hardware
>> + * modifications to the ptes. The intention is not to prevent the hardware from
>> + * making pte updates, but to prevent any updates it may make from being lost.
>> + * Please see the comment above ptep_modify_prot_start() for full description.
>> + *
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
>> + * in the batch.
>> + *
>> + * Note that PTE bits in the PTE batch besides the PFN can differ.
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>> + * Since the batch is determined from folio_pte_batch, the PTEs must differ
>> + * only in a/d bits (and the soft dirty bit; see fpb_t flags in
>> + * mprotect_folio_pte_batch()).
>> + */
>> +#ifndef modify_prot_start_ptes
>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
>> +{
>> +	pte_t pte, tmp_pte;
>> +
>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
>> +	while (--nr) {
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>> +		if (pte_dirty(tmp_pte))
>> +			pte = pte_mkdirty(pte);
>> +		if (pte_young(tmp_pte))
>> +			pte = pte_mkyoung(pte);
>> +	}
>> +	return pte;
>> +}
>> +#endif
>> +
>> +/**
>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
>> + * hardware-controlled bits in the PTE unmodified.
>> + *
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
>> + * @pte: New page table entry to be set.
>> + * @nr: Number of entries.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_modify_prot_commit().
>> + *
>> + * Context: The caller holds the page table lock. The PTEs are all in the same
>> + * PMD. On exit, the set ptes in the batch map the same folio. The pte may have
>> + * been "upgraded" w.r.t a/d bits compared to the old_pte, as in, it may have
>> + * a/d bits on which were off in old_pte.
>> + */
>> +#ifndef modify_prot_commit_ptes
>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr; ++i) {
>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>> +		ptep++;
> Weird place to put this increment, maybe just stick it in the for loop.
>
>> +		addr += PAGE_SIZE;
> Same comment here.

Sure.

>
>> +		old_pte = pte_next_pfn(old_pte);
> Could be:
>
> 		old_pte = pte;
>
> No?

We will need to update old_pte also since that
is used by powerpc in radix__ptep_modify_prot_commit().

>
>> +		pte = pte_next_pfn(pte);
>> +	}
>> +}
>> +#endif
>> +
>>   #endif /* CONFIG_MMU */
>>
>>   /*
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index af10a7fbe6b8..627b0d67cc4a 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -206,7 +206,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   					continue;
>>   			}
>>
>> -			oldpte = ptep_modify_prot_start(vma, addr, pte);
>> +			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>   			ptent = pte_modify(oldpte, newprot);
>>
>>   			if (uffd_wp)
>> @@ -232,7 +232,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			    can_change_pte_writable(vma, addr, ptent))
>>   				ptent = pte_mkwrite(ptent, vma);
>>
>> -			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>> +			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>   			if (pte_needs_flush(oldpte, ptent))
>>   				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>   			pages++;
>> --
>> 2.30.2
>>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-06-30 12:52   ` Lorenzo Stoakes
@ 2025-07-01  5:30     ` Dev Jain
  2025-07-01  8:03     ` Ryan Roberts
  1 sibling, 0 replies; 62+ messages in thread
From: Dev Jain @ 2025-07-01  5:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 30/06/25 6:22 pm, Lorenzo Stoakes wrote:
> On Sat, Jun 28, 2025 at 05:04:34PM +0530, Dev Jain wrote:
>> Use folio_pte_batch to batch process a large folio. Reuse the folio from
>> prot_numa case if possible.
>>
>> For all cases other than the PageAnonExclusive case, if the case holds true
>> for one pte in the batch, one can confirm that that case will hold true for
>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
>> and access bits across the batch, therefore batching across
>> pte_dirty(): this is correct since the dirty bit on the PTE really is
>> just an indication that the folio got written to, so even if the PTE is
>> not actually dirty (but one of the PTEs in the batch is), the wp-fault
>> optimization can be made.
>>
>> The crux now is how to batch around the PageAnonExclusive case; we must
>> check the corresponding condition for every single page. Therefore, from
>> the large folio batch, we process sub batches of ptes mapping pages with
>> the same PageAnonExclusive condition, and process that sub batch, then
>> determine and process the next sub batch, and so on. Note that this does
>> not cause any extra overhead; if suppose the size of the folio batch
>> is 512, then the sub batch processing in total will take 512 iterations,
>> which is the same as what we would have done before.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 117 insertions(+), 26 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 627b0d67cc4a..28c7ce7728ff 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -40,35 +40,47 @@
>>
>>   #include "internal.h"
>>
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -			     pte_t pte)
>> -{
>> -	struct page *page;
>> +enum tristate {
>> +	TRI_FALSE = 0,
>> +	TRI_TRUE = 1,
>> +	TRI_MAYBE = -1,
>> +};
> Yeah no, absolutely not, this is horrible, I don't want to see an arbitrary type
> like this added to a random file, and I absolutely think this adds confusion
> and does not in any way help clarify things.
>
>> +/*
>> + * Returns enum tristate indicating whether the pte can be changed to writable.
>> + * If TRI_MAYBE is returned, then the folio is anonymous and the user must
>> + * additionally check PageAnonExclusive() for every page in the desired range.
>> + */
>> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
>> +				     unsigned long addr, pte_t pte,
>> +				     struct folio *folio)
>> +{
>>   	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>> -		return false;
>> +		return TRI_FALSE;
>>
>>   	/* Don't touch entries that are not even readable. */
>>   	if (pte_protnone(pte))
>> -		return false;
>> +		return TRI_FALSE;
>>
>>   	/* Do we need write faults for softdirty tracking? */
>>   	if (pte_needs_soft_dirty_wp(vma, pte))
>> -		return false;
>> +		return TRI_FALSE;
>>
>>   	/* Do we need write faults for uffd-wp tracking? */
>>   	if (userfaultfd_pte_wp(vma, pte))
>> -		return false;
>> +		return TRI_FALSE;
>>
>>   	if (!(vma->vm_flags & VM_SHARED)) {
>>   		/*
>>   		 * Writable MAP_PRIVATE mapping: We can only special-case on
>>   		 * exclusive anonymous pages, because we know that our
>>   		 * write-fault handler similarly would map them writable without
>> -		 * any additional checks while holding the PT lock.
>> +		 * any additional checks while holding the PT lock. So if the
>> +		 * folio is not anonymous, we know we cannot change pte to
>> +		 * writable. If it is anonymous then the caller must further
>> +		 * check that the page is AnonExclusive().
>>   		 */
>> -		page = vm_normal_page(vma, addr, pte);
>> -		return page && PageAnon(page) && PageAnonExclusive(page);
>> +		return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
>>   	}
>>
>>   	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>> @@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   	 * FS was already notified and we can simply mark the PTE writable
>>   	 * just like the write-fault handler would do.
>>   	 */
>> -	return pte_dirty(pte);
>> +	return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
>> +}
> Yeah not a fan of this at all.
>
> This is squashing all the logic into one place when we don't really need to.
>
> We can separate out the shared logic and just do something like:
>
> ////// Lorenzo's suggestion //////
>
> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte)
> +static bool maybe_change_pte_writable(struct vm_area_struct *vma,
> +		pte_t pte)
>   {
> -	struct page *page;
> -
>   	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>   		return false;
>
> @@ -60,16 +58,14 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>   	if (userfaultfd_pte_wp(vma, pte))
>   		return false;
>
> -	if (!(vma->vm_flags & VM_SHARED)) {
> -		/*
> -		 * Writable MAP_PRIVATE mapping: We can only special-case on
> -		 * exclusive anonymous pages, because we know that our
> -		 * write-fault handler similarly would map them writable without
> -		 * any additional checks while holding the PT lock.
> -		 */
> -		page = vm_normal_page(vma, addr, pte);
> -		return page && PageAnon(page) && PageAnonExclusive(page);
> -	}
> +	return true;
> +}
> +
> +static bool can_change_shared_pte_writable(struct vm_area_struct *vma,
> +		pte_t pte)
> +{
> +	if (!maybe_change_pte_writable(vma, pte))
> +		return false;
>
>   	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>
> @@ -83,6 +79,33 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>   	return pte_dirty(pte);
>   }
>
> +static bool can_change_private_pte_writable(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t pte)
> +{
> +	struct page *page;
> +
> +	if (!maybe_change_pte_writable(vma, pte))
> +		return false;
> +
> +	/*
> +	 * Writable MAP_PRIVATE mapping: We can only special-case on
> +	 * exclusive anonymous pages, because we know that our
> +	 * write-fault handler similarly would map them writable without
> +	 * any additional checks while holding the PT lock.
> +	 */
> +	page = vm_normal_page(vma, addr, pte);
> +	return page && PageAnon(page) && PageAnonExclusive(page);
> +}
> +
> +bool can_change_pte_writable(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t pte)
> +{
> +	if (vma->vm_flags & VM_SHARED)
> +		return can_change_shared_pte_writable(vma, pte);
> +
> +	return can_change_private_pte_writable(vma, addr, pte);
> +}
> +
>
> ////// end of Lorenzo's suggestion //////
>
> You can obviously modify this to change other stuff like whether you feed back
> the PAE or not in private case for use in your code.
>
>> +
>> +/*
>> + * Returns the number of pages within the folio, starting from the page
>> + * indicated by pgidx and up to pgidx + max_nr, that have the same value of
>> + * PageAnonExclusive(). Must only be called for anonymous folios. Value of
>> + * PageAnonExclusive() is returned in *exclusive.
>> + */
>> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
>> +				bool *exclusive)
> Let's generalise it to something like count_folio_fungible_pages()
>
> or maybe count_folio_batchable_pages()?
>
> Yes naming is hard... :P but right now it reads like this is returning a batch
> or doing something with a batch.
>
>> +{
>> +	struct page *page;
>> +	int nr = 1;
>> +
>> +	if (!folio) {
>> +		*exclusive = false;
>> +		return nr;
>> +	}
>> +
>> +	page = folio_page(folio, pgidx++);
>> +	*exclusive = PageAnonExclusive(page);
>> +	while (nr < max_nr) {
> The C programming language asks why you don't like using for :)
>
>> +		page = folio_page(folio, pgidx++);
>> +		if ((*exclusive) != PageAnonExclusive(page))
>> +			break;
>> +		nr++;
> This *exclusive stuff makes me want to cry :)
>
> Just set a local variable and hand it back at the end.
>
>> +	}
>> +
>> +	return nr;
>> +}
>> +
>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> +			     pte_t pte)
>> +{
>> +	struct page *page;
>> +	int ret;
>> +
>> +	ret = maybe_change_pte_writable(vma, addr, pte, NULL);
>> +	if (ret == TRI_MAYBE) {
>> +		page = vm_normal_page(vma, addr, pte);
>> +		ret = page && PageAnon(page) && PageAnonExclusive(page);
>> +	}
>> +
>> +	return ret;
>>   }
> See above comments on this stuff.
>
>>   static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>> -		pte_t *ptep, pte_t pte, int max_nr_ptes)
>> +		pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
> This last parameter is pretty horrible. It's a negative mask so now you're
> passing 'ignore soft dirty' to the function meaning 'don't ignore it'. This is
> just planting land mines.
>
> Obviously David's flag changes will also alter all this.
>
> Just add a boolean re: soft dirty.
>
>>   {
>> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +
>> +	flags &= ~switch_off_flags;
>>
>> -	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>> +	if (!folio || !folio_test_large(folio))
>>   		return 1;
> Why remove this last check?

I forgot to drop this check in the last patch as well.

>
>>   	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>> @@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma
>>   	}
>>
>>   skip_batch:
>> -	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>> +	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>> +					   max_nr_ptes, 0);
> See above about flag param. If you change to boolean, please prefix this with
> e.g. /* set_soft_dirty= */ true or whatever the flag ends up being :)

On Ryan's suggestion I'll just change this to pass the fpb_flags.

>
>>   out:
>>   	*foliop = folio;
>>   	return nr_ptes;
>> @@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   		if (pte_present(oldpte)) {
>>   			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>   			struct folio *folio = NULL;
>> -			pte_t ptent;
>> +			int sub_nr_ptes, pgidx = 0;
>> +			pte_t ptent, newpte;
>> +			bool sub_set_write;
>> +			int set_write;
>>
>>   			/*
>>   			 * Avoid trapping faults against the zero or KSM
>> @@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   					continue;
>>   			}
>>
>> +			if (!folio)
>> +				folio = vm_normal_folio(vma, addr, oldpte);
>> +
>> +			nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>> +							   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
> Don't we only care about S/D if pte_needs_soft_dirty_wp()?

That's what the function does: it switches off FPB_IGNORE_SOFT_DIRTY
from the mask, so folio_pte_batch() will not ignore S/D in that case. Yeah,
that switch_off_flags thing is really horrible : )

>
>>   			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>   			ptent = pte_modify(oldpte, newprot);
>>
>> @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			 * example, if a PTE is already dirty and no other
>>   			 * COW or special handling is required.
>>   			 */
>> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> -			    !pte_write(ptent) &&
>> -			    can_change_pte_writable(vma, addr, ptent))
>> -				ptent = pte_mkwrite(ptent, vma);
>> -
>> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> -			if (pte_needs_flush(oldpte, ptent))
>> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> -			pages++;
>> +			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> +				    !pte_write(ptent);
>> +			if (set_write)
>> +				set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
>> +
>> +			while (nr_ptes) {
>> +				if (set_write == TRI_MAYBE) {
>> +					sub_nr_ptes = anon_exclusive_batch(folio,
>> +						pgidx, nr_ptes, &sub_set_write);
>> +				} else {
>> +					sub_nr_ptes = nr_ptes;
>> +					sub_set_write = (set_write == TRI_TRUE);
>> +				}
>> +
>> +				if (sub_set_write)
>> +					newpte = pte_mkwrite(ptent, vma);
>> +				else
>> +					newpte = ptent;
>> +
>> +				modify_prot_commit_ptes(vma, addr, pte, oldpte,
>> +							newpte, sub_nr_ptes);
>> +				if (pte_needs_flush(oldpte, newpte))
>> +					tlb_flush_pte_range(tlb, addr,
>> +						sub_nr_ptes * PAGE_SIZE);
>> +
>> +				addr += sub_nr_ptes * PAGE_SIZE;
>> +				pte += sub_nr_ptes;
>> +				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>> +				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>> +				nr_ptes -= sub_nr_ptes;
>> +				pages += sub_nr_ptes;
>> +				pgidx += sub_nr_ptes;
>> +			}
> I hate hate hate having this loop here, let's abstract this please.
>
> I mean I think we can just use mprotect_folio_pte_batch() no? It's not
> abstracting much here, and we can just do all this handling there. Maybe have to
> pass in a bunch more params, but it saves us having to do all this.
>
> Alternatively, we could add a new wrapper function, but yeah definitely not
> this.
>
> Also the C programming language asks... etc etc. ;)
>
> Overall since you always end up processing folio_nr_pages(folio) you can just
> have the batch function or a wrapper return this and do updates as necessary
> here on that basis, and leave the 'sub' batching to that function.

If time permits, could you take a look at the corresponding patch in v3? That was
my original implementation; does that look any cleaner?

>
>
>>   		} else if (is_swap_pte(oldpte)) {
>>   			swp_entry_t entry = pte_to_swp_entry(oldpte);
>>   			pte_t newpte;
>> --
>> 2.30.2
>>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-06-30 10:31   ` Ryan Roberts
  2025-06-30 11:21     ` Dev Jain
@ 2025-07-01  5:47     ` Dev Jain
  2025-07-01  7:39       ` Ryan Roberts
  1 sibling, 1 reply; 62+ messages in thread
From: Dev Jain @ 2025-07-01  5:47 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 30/06/25 4:01 pm, Ryan Roberts wrote:
> On 28/06/2025 12:34, Dev Jain wrote:
>> Use folio_pte_batch to batch process a large folio. Reuse the folio from
>> prot_numa case if possible.
>>
>> For all cases other than the PageAnonExclusive case, if the case holds true
>> for one pte in the batch, one can confirm that that case will hold true for
>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
>> and access bits across the batch, therefore batching across
>> pte_dirty(): this is correct since the dirty bit on the PTE really is
>> just an indication that the folio got written to, so even if the PTE is
>> not actually dirty (but one of the PTEs in the batch is), the wp-fault
>> optimization can be made.
>>
>> The crux now is how to batch around the PageAnonExclusive case; we must
>> check the corresponding condition for every single page. Therefore, from
>> the large folio batch, we process sub batches of ptes mapping pages with
>> the same PageAnonExclusive condition, and process that sub batch, then
>> determine and process the next sub batch, and so on. Note that this does
>> not cause any extra overhead; if suppose the size of the folio batch
>> is 512, then the sub batch processing in total will take 512 iterations,
>> which is the same as what we would have done before.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 117 insertions(+), 26 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 627b0d67cc4a..28c7ce7728ff 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -40,35 +40,47 @@
>>   
>>   #include "internal.h"
>>   
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -			     pte_t pte)
>> -{
>> -	struct page *page;
>> +enum tristate {
>> +	TRI_FALSE = 0,
>> +	TRI_TRUE = 1,
>> +	TRI_MAYBE = -1,
>> +};
>>   
>> +/*
>> + * Returns enum tristate indicating whether the pte can be changed to writable.
>> + * If TRI_MAYBE is returned, then the folio is anonymous and the user must
>> + * additionally check PageAnonExclusive() for every page in the desired range.
>> + */
>> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
>> +				     unsigned long addr, pte_t pte,
>> +				     struct folio *folio)
>> +{
>>   	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>> -		return false;
>> +		return TRI_FALSE;
>>   
>>   	/* Don't touch entries that are not even readable. */
>>   	if (pte_protnone(pte))
>> -		return false;
>> +		return TRI_FALSE;
>>   
>>   	/* Do we need write faults for softdirty tracking? */
>>   	if (pte_needs_soft_dirty_wp(vma, pte))
>> -		return false;
>> +		return TRI_FALSE;
>>   
>>   	/* Do we need write faults for uffd-wp tracking? */
>>   	if (userfaultfd_pte_wp(vma, pte))
>> -		return false;
>> +		return TRI_FALSE;
>>   
>>   	if (!(vma->vm_flags & VM_SHARED)) {
>>   		/*
>>   		 * Writable MAP_PRIVATE mapping: We can only special-case on
>>   		 * exclusive anonymous pages, because we know that our
>>   		 * write-fault handler similarly would map them writable without
>> -		 * any additional checks while holding the PT lock.
>> +		 * any additional checks while holding the PT lock. So if the
>> +		 * folio is not anonymous, we know we cannot change pte to
>> +		 * writable. If it is anonymous then the caller must further
>> +		 * check that the page is AnonExclusive().
>>   		 */
>> -		page = vm_normal_page(vma, addr, pte);
>> -		return page && PageAnon(page) && PageAnonExclusive(page);
>> +		return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
>>   	}
>>   
>>   	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>> @@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   	 * FS was already notified and we can simply mark the PTE writable
>>   	 * just like the write-fault handler would do.
>>   	 */
>> -	return pte_dirty(pte);
>> +	return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
>> +}
>> +
>> +/*
>> + * Returns the number of pages within the folio, starting from the page
>> + * indicated by pgidx and up to pgidx + max_nr, that have the same value of
>> + * PageAnonExclusive(). Must only be called for anonymous folios. Value of
>> + * PageAnonExclusive() is returned in *exclusive.
>> + */
>> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
>> +				bool *exclusive)
>> +{
>> +	struct page *page;
>> +	int nr = 1;
>> +
>> +	if (!folio) {
>> +		*exclusive = false;
>> +		return nr;
>> +	}
>> +
>> +	page = folio_page(folio, pgidx++);
>> +	*exclusive = PageAnonExclusive(page);
>> +	while (nr < max_nr) {
>> +		page = folio_page(folio, pgidx++);
>> +		if ((*exclusive) != PageAnonExclusive(page))
> nit: brackets not required around *exclusive.
>
>> +			break;
>> +		nr++;
>> +	}
>> +
>> +	return nr;
>> +}
>> +
>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> +			     pte_t pte)
>> +{
>> +	struct page *page;
>> +	int ret;
>> +
>> +	ret = maybe_change_pte_writable(vma, addr, pte, NULL);
>> +	if (ret == TRI_MAYBE) {
>> +		page = vm_normal_page(vma, addr, pte);
>> +		ret = page && PageAnon(page) && PageAnonExclusive(page);
>> +	}
>> +
>> +	return ret;
>>   }
>>   
>>   static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>> -		pte_t *ptep, pte_t pte, int max_nr_ptes)
>> +		pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
>>   {
>> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +
>> +	flags &= ~switch_off_flags;
> This is mega confusing when reading the caller. Because the caller passes
> FPB_IGNORE_SOFT_DIRTY and that actually means DON'T ignore soft dirty.
>
> Can't we just pass in the flags we want?
>
>>   
>> -	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>> +	if (!folio || !folio_test_large(folio))
> What's the rational for dropping the max_nr_ptes == 1 condition? If you don't
> need it, why did you add it in the earler patch?
>
>>   		return 1;
>>   
>>   	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>> @@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma
>>   	}
>>   
>>   skip_batch:
>> -	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>> +	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>> +					   max_nr_ptes, 0);
>>   out:
>>   	*foliop = folio;
>>   	return nr_ptes;
>> @@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   		if (pte_present(oldpte)) {
>>   			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>   			struct folio *folio = NULL;
>> -			pte_t ptent;
>> +			int sub_nr_ptes, pgidx = 0;
>> +			pte_t ptent, newpte;
>> +			bool sub_set_write;
>> +			int set_write;
>>   
>>   			/*
>>   			 * Avoid trapping faults against the zero or KSM
>> @@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   					continue;
>>   			}
>>   
>> +			if (!folio)
>> +				folio = vm_normal_folio(vma, addr, oldpte);
>> +
>> +			nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>> +							   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
>  From the other thread, my memory is jogged that this function ignores write
> permission bit. So I think that's opening up a bug when applied here? If the
> first pte is writable but the rest are not (COW), doesn't this now make them all
> writable? I don't *think* that's a problem for the prot_numa use, but I could be
> wrong.

Can this be fixed by introducing FPB_HONOR_WRITE?

>
>>   			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> Even if I'm wrong about ignoring write bit being a bug, I don't think the docs
> for this function permit write bit to be different across the batch?
>
>>   			ptent = pte_modify(oldpte, newprot);
>>   
>> @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			 * example, if a PTE is already dirty and no other
>>   			 * COW or special handling is required.
>>   			 */
>> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> -			    !pte_write(ptent) &&
>> -			    can_change_pte_writable(vma, addr, ptent))
>> -				ptent = pte_mkwrite(ptent, vma);
>> -
>> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> -			if (pte_needs_flush(oldpte, ptent))
>> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> -			pages++;
>> +			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> +				    !pte_write(ptent);
>> +			if (set_write)
>> +				set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
> Why not just:
> 			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> 				    !pte_write(ptent) &&
> 				    maybe_change_pte_writable(...);
>
> ?
>
>> +
>> +			while (nr_ptes) {
>> +				if (set_write == TRI_MAYBE) {
>> +					sub_nr_ptes = anon_exclusive_batch(folio,
>> +						pgidx, nr_ptes, &sub_set_write);
>> +				} else {
>> +					sub_nr_ptes = nr_ptes;
>> +					sub_set_write = (set_write == TRI_TRUE);
>> +				}
>> +
>> +				if (sub_set_write)
>> +					newpte = pte_mkwrite(ptent, vma);
>> +				else
>> +					newpte = ptent;
>> +
>> +				modify_prot_commit_ptes(vma, addr, pte, oldpte,
>> +							newpte, sub_nr_ptes);
>> +				if (pte_needs_flush(oldpte, newpte))
> What did we conclude with pte_needs_flush()? I thought there was an arch where
> it looked dodgy calling this for just the pte at the head of the batch?
>
> Thanks,
> Ryan
>
>> +					tlb_flush_pte_range(tlb, addr,
>> +						sub_nr_ptes * PAGE_SIZE);
>> +
>> +				addr += sub_nr_ptes * PAGE_SIZE;
>> +				pte += sub_nr_ptes;
>> +				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>> +				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>> +				nr_ptes -= sub_nr_ptes;
>> +				pages += sub_nr_ptes;
>> +				pgidx += sub_nr_ptes;
>> +			}
>>   		} else if (is_swap_pte(oldpte)) {
>>   			swp_entry_t entry = pte_to_swp_entry(oldpte);
>>   			pte_t newpte;


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-07-01  4:44     ` Dev Jain
@ 2025-07-01  7:33       ` Ryan Roberts
  2025-07-01  8:06         ` Lorenzo Stoakes
  0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-07-01  7:33 UTC (permalink / raw)
  To: Dev Jain, Lorenzo Stoakes
  Cc: akpm, david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, vbabka, jannh, anshuman.khandual, peterx,
	joey.gouly, ioworker0, baohua, kevin.brodsky, quic_zhenhuah,
	christophe.leroy, yangyicong, linux-arm-kernel, hughd, yang, ziy

On 01/07/2025 05:44, Dev Jain wrote:
> 
> On 30/06/25 6:27 pm, Lorenzo Stoakes wrote:
>> On Sat, Jun 28, 2025 at 05:04:33PM +0530, Dev Jain wrote:
>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>> Architecture can override these helpers; in case not, they are implemented
>>> as a simple loop over the corresponding single pte helpers.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> Looks generally sensible! Some comments below.
>>
>>> ---
>>>   include/linux/pgtable.h | 83 ++++++++++++++++++++++++++++++++++++++++-
>>>   mm/mprotect.c           |  4 +-
>>>   2 files changed, 84 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index cf1515c163e2..662f39e7475a 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -1331,7 +1331,8 @@ static inline pte_t ptep_modify_prot_start(struct
>>> vm_area_struct *vma,
>>>
>>>   /*
>>>    * Commit an update to a pte, leaving any hardware-controlled bits in
>>> - * the PTE unmodified.
>>> + * the PTE unmodified. The pte may have been "upgraded" w.r.t a/d bits compared
>>> + * to the old_pte, as in, it may have a/d bits on which were off in old_pte.
>>>    */
>>>   static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>>                          unsigned long addr,
>>> @@ -1340,6 +1341,86 @@ static inline void ptep_modify_prot_commit(struct
>>> vm_area_struct *vma,
>>>       __ptep_modify_prot_commit(vma, addr, ptep, pte);
>>>   }
>>>   #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>>> +
>>> +/**
>>> + * modify_prot_start_ptes - Start a pte protection read-modify-write
>>> transaction
>>> + * over a batch of ptes, which protects against asynchronous hardware
>>> + * modifications to the ptes. The intention is not to prevent the hardware from
>>> + * making pte updates, but to prevent any updates it may make from being lost.
>>> + * Please see the comment above ptep_modify_prot_start() for full description.
>>> + *
>>> + * @vma: The virtual memory area the pages are mapped into.
>>> + * @addr: Address the first page is mapped at.
>>> + * @ptep: Page table pointer for the first entry.
>>> + * @nr: Number of entries.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
>>> + * in the batch.
>>> + *
>>> + * Note that PTE bits in the PTE batch besides the PFN can differ.
>>> + *
>>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>>> + * Since the batch is determined from folio_pte_batch, the PTEs must differ
>>> + * only in a/d bits (and the soft dirty bit; see fpb_t flags in
>>> + * mprotect_folio_pte_batch()).
>>> + */
>>> +#ifndef modify_prot_start_ptes
>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>> +        unsigned long addr, pte_t *ptep, unsigned int nr)
>>> +{
>>> +    pte_t pte, tmp_pte;
>>> +
>>> +    pte = ptep_modify_prot_start(vma, addr, ptep);
>>> +    while (--nr) {
>>> +        ptep++;
>>> +        addr += PAGE_SIZE;
>>> +        tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>> +        if (pte_dirty(tmp_pte))
>>> +            pte = pte_mkdirty(pte);
>>> +        if (pte_young(tmp_pte))
>>> +            pte = pte_mkyoung(pte);
>>> +    }
>>> +    return pte;
>>> +}
>>> +#endif
>>> +
>>> +/**
>>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
>>> + * hardware-controlled bits in the PTE unmodified.
>>> + *
>>> + * @vma: The virtual memory area the pages are mapped into.
>>> + * @addr: Address the first page is mapped at.
>>> + * @ptep: Page table pointer for the first entry.
>>> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
>>> + * @pte: New page table entry to be set.
>>> + * @nr: Number of entries.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>> + * loop over ptep_modify_prot_commit().
>>> + *
>>> + * Context: The caller holds the page table lock. The PTEs are all in the same
>>> + * PMD. On exit, the set ptes in the batch map the same folio. The pte may have
>>> + * been "upgraded" w.r.t a/d bits compared to the old_pte, as in, it may have
>>> + * a/d bits on which were off in old_pte.
>>> + */
>>> +#ifndef modify_prot_commit_ptes
>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
>>> unsigned long addr,
>>> +        pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>>> +{
>>> +    int i;
>>> +
>>> +    for (i = 0; i < nr; ++i) {
>>> +        ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>> +        ptep++;
>> Weird place to put this increment, maybe just stick it in the for loop.
>>
>>> +        addr += PAGE_SIZE;
>> Same comment here.
> 
> Sure.
> 
>>
>>> +        old_pte = pte_next_pfn(old_pte);
>> Could be:
>>
>>         old_pte = pte;
>>
>> No?
> 
> We will need to update old_pte also since that
> is used by powerpc in radix__ptep_modify_prot_commit().

I think perhaps Lorenzo has the model in his head where old_pte is the previous
pte in the batch. That's not the case. old_pte is the value of the pte in the
current position of the batch before any changes were made. pte is the new value
for the pte. So we need to explicitly advance the PFN in both old_pte and pte
each iteration round the loop.

> 
>>
>>> +        pte = pte_next_pfn(pte);
>>> +    }
>>> +}
>>> +#endif
>>> +
>>>   #endif /* CONFIG_MMU */
>>>
>>>   /*
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index af10a7fbe6b8..627b0d67cc4a 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -206,7 +206,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                       continue;
>>>               }
>>>
>>> -            oldpte = ptep_modify_prot_start(vma, addr, pte);
>>> +            oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>>               ptent = pte_modify(oldpte, newprot);
>>>
>>>               if (uffd_wp)
>>> @@ -232,7 +232,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                   can_change_pte_writable(vma, addr, ptent))
>>>                   ptent = pte_mkwrite(ptent, vma);
>>>
>>> -            ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>>> +            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>>               if (pte_needs_flush(oldpte, ptent))
>>>                   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>>               pages++;
>>> -- 
>>> 2.30.2
>>>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-01  5:47     ` Dev Jain
@ 2025-07-01  7:39       ` Ryan Roberts
  0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-07-01  7:39 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 01/07/2025 06:47, Dev Jain wrote:
> 
> On 30/06/25 4:01 pm, Ryan Roberts wrote:
>> On 28/06/2025 12:34, Dev Jain wrote:
>>> Use folio_pte_batch to batch process a large folio. Reuse the folio from
>>> prot_numa case if possible.
>>>
>>> For all cases other than the PageAnonExclusive case, if the case holds true
>>> for one pte in the batch, one can confirm that that case will hold true for
>>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
>>> and access bits across the batch, therefore batching across
>>> pte_dirty(): this is correct since the dirty bit on the PTE really is
>>> just an indication that the folio got written to, so even if the PTE is
>>> not actually dirty (but one of the PTEs in the batch is), the wp-fault
>>> optimization can be made.
>>>
>>> The crux now is how to batch around the PageAnonExclusive case; we must
>>> check the corresponding condition for every single page. Therefore, from
>>> the large folio batch, we process sub batches of ptes mapping pages with
>>> the same PageAnonExclusive condition, and process that sub batch, then
>>> determine and process the next sub batch, and so on. Note that this does
>>> not cause any extra overhead; if suppose the size of the folio batch
>>> is 512, then the sub batch processing in total will take 512 iterations,
>>> which is the same as what we would have done before.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>   mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
>>>   1 file changed, 117 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 627b0d67cc4a..28c7ce7728ff 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -40,35 +40,47 @@
>>>     #include "internal.h"
>>>   -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> -                 pte_t pte)
>>> -{
>>> -    struct page *page;
>>> +enum tristate {
>>> +    TRI_FALSE = 0,
>>> +    TRI_TRUE = 1,
>>> +    TRI_MAYBE = -1,
>>> +};
>>>   +/*
>>> + * Returns enum tristate indicating whether the pte can be changed to writable.
>>> + * If TRI_MAYBE is returned, then the folio is anonymous and the user must
>>> + * additionally check PageAnonExclusive() for every page in the desired range.
>>> + */
>>> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
>>> +                     unsigned long addr, pte_t pte,
>>> +                     struct folio *folio)
>>> +{
>>>       if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>>> -        return false;
>>> +        return TRI_FALSE;
>>>         /* Don't touch entries that are not even readable. */
>>>       if (pte_protnone(pte))
>>> -        return false;
>>> +        return TRI_FALSE;
>>>         /* Do we need write faults for softdirty tracking? */
>>>       if (pte_needs_soft_dirty_wp(vma, pte))
>>> -        return false;
>>> +        return TRI_FALSE;
>>>         /* Do we need write faults for uffd-wp tracking? */
>>>       if (userfaultfd_pte_wp(vma, pte))
>>> -        return false;
>>> +        return TRI_FALSE;
>>>         if (!(vma->vm_flags & VM_SHARED)) {
>>>           /*
>>>            * Writable MAP_PRIVATE mapping: We can only special-case on
>>>            * exclusive anonymous pages, because we know that our
>>>            * write-fault handler similarly would map them writable without
>>> -         * any additional checks while holding the PT lock.
>>> +         * any additional checks while holding the PT lock. So if the
>>> +         * folio is not anonymous, we know we cannot change pte to
>>> +         * writable. If it is anonymous then the caller must further
>>> +         * check that the page is AnonExclusive().
>>>            */
>>> -        page = vm_normal_page(vma, addr, pte);
>>> -        return page && PageAnon(page) && PageAnonExclusive(page);
>>> +        return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
>>>       }
>>>         VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>>> @@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
>>> unsigned long addr,
>>>        * FS was already notified and we can simply mark the PTE writable
>>>        * just like the write-fault handler would do.
>>>        */
>>> -    return pte_dirty(pte);
>>> +    return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
>>> +}
>>> +
>>> +/*
>>> + * Returns the number of pages within the folio, starting from the page
>>> + * indicated by pgidx and up to pgidx + max_nr, that have the same value of
>>> + * PageAnonExclusive(). Must only be called for anonymous folios. Value of
>>> + * PageAnonExclusive() is returned in *exclusive.
>>> + */
>>> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
>>> +                bool *exclusive)
>>> +{
>>> +    struct page *page;
>>> +    int nr = 1;
>>> +
>>> +    if (!folio) {
>>> +        *exclusive = false;
>>> +        return nr;
>>> +    }
>>> +
>>> +    page = folio_page(folio, pgidx++);
>>> +    *exclusive = PageAnonExclusive(page);
>>> +    while (nr < max_nr) {
>>> +        page = folio_page(folio, pgidx++);
>>> +        if ((*exclusive) != PageAnonExclusive(page))
>> nit: brackets not required around *exclusive.
>>
>>> +            break;
>>> +        nr++;
>>> +    }
>>> +
>>> +    return nr;
>>> +}
>>> +
>>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> +                 pte_t pte)
>>> +{
>>> +    struct page *page;
>>> +    int ret;
>>> +
>>> +    ret = maybe_change_pte_writable(vma, addr, pte, NULL);
>>> +    if (ret == TRI_MAYBE) {
>>> +        page = vm_normal_page(vma, addr, pte);
>>> +        ret = page && PageAnon(page) && PageAnonExclusive(page);
>>> +    }
>>> +
>>> +    return ret;
>>>   }
>>>     static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>>> -        pte_t *ptep, pte_t pte, int max_nr_ptes)
>>> +        pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
>>>   {
>>> -    const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> +    fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> +
>>> +    flags &= ~switch_off_flags;
>> This is mega confusing when reading the caller. Because the caller passes
>> FPB_IGNORE_SOFT_DIRTY and that actually means DON'T ignore soft dirty.
>>
>> Can't we just pass in the flags we want?
>>
>>>   -    if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>> +    if (!folio || !folio_test_large(folio))
>> What's the rationale for dropping the max_nr_ptes == 1 condition? If you don't
>> need it, why did you add it in the earlier patch?
>>
>>>           return 1;
>>>         return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>> @@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop,
>>> struct vm_area_struct *vma
>>>       }
>>>     skip_batch:
>>> -    nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>>> +    nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>>> +                       max_nr_ptes, 0);
>>>   out:
>>>       *foliop = folio;
>>>       return nr_ptes;
>>> @@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>           if (pte_present(oldpte)) {
>>>               int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>>               struct folio *folio = NULL;
>>> -            pte_t ptent;
>>> +            int sub_nr_ptes, pgidx = 0;
>>> +            pte_t ptent, newpte;
>>> +            bool sub_set_write;
>>> +            int set_write;
>>>                 /*
>>>                * Avoid trapping faults against the zero or KSM
>>> @@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                       continue;
>>>               }
>>>   +            if (!folio)
>>> +                folio = vm_normal_folio(vma, addr, oldpte);
>>> +
>>> +            nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>>> +                               max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
>>  From the other thread, my memory is jogged that this function ignores write
>> permission bit. So I think that's opening up a bug when applied here? If the
>> first pte is writable but the rest are not (COW), doesn't this now make them all
>> writable? I don't *think* that's a problem for the prot_numa use, but I could be
>> wrong.
> 
> Can this be fixed by introducing FPB_HONOR_WRITE?

Yes I think so. Suddenly David's change looks very appealing because it's going
to define a set of bits that are ignored by default (young, dirty, soft-dirty,
write) and use FPB_HONOR_ flags to stop ignoring those bits. So we can follow
that pattern for write, I guess? And this avoids mixing FPB_IGNORE_ and
FPB_HONOR_ flags.
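
To make the caller side concrete, something like the below is what I have in
mind (FPB_HONOR_SOFT_DIRTY / FPB_HONOR_WRITE are assumed names for wherever
David's series ends up, not an existing API; treat this as a sketch only):

	/* prot_numa skip path: differences in any per-pte bits are fine. */
	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
					   max_nr_ptes, 0);

	/*
	 * Main path: end the batch wherever soft-dirty or write differs, so
	 * the per-batch soft-dirty-wp and pte_mkwrite() decisions stay valid
	 * for every pte in the batch.
	 */
	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
					   max_nr_ptes,
					   FPB_HONOR_SOFT_DIRTY | FPB_HONOR_WRITE);

with mprotect_folio_pte_batch() just forwarding the flags it is given, rather
than taking a "switch off" mask.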


> 
>>
>>>               oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>> Even if I'm wrong about ignoring write bit being a bug, I don't think the docs
>> for this function permit write bit to be different across the batch?
>>
>>>               ptent = pte_modify(oldpte, newprot);
>>>   @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                * example, if a PTE is already dirty and no other
>>>                * COW or special handling is required.
>>>                */
>>> -            if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>> -                !pte_write(ptent) &&
>>> -                can_change_pte_writable(vma, addr, ptent))
>>> -                ptent = pte_mkwrite(ptent, vma);
>>> -
>>> -            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>> -            if (pte_needs_flush(oldpte, ptent))
>>> -                tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>> -            pages++;
>>> +            set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>> +                    !pte_write(ptent);
>>> +            if (set_write)
>>> +                set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
>> Why not just:
>>             set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>                     !pte_write(ptent) &&
>>                     maybe_change_pte_writable(...);
>>
>> ?
>>
>>> +
>>> +            while (nr_ptes) {
>>> +                if (set_write == TRI_MAYBE) {
>>> +                    sub_nr_ptes = anon_exclusive_batch(folio,
>>> +                        pgidx, nr_ptes, &sub_set_write);
>>> +                } else {
>>> +                    sub_nr_ptes = nr_ptes;
>>> +                    sub_set_write = (set_write == TRI_TRUE);
>>> +                }
>>> +
>>> +                if (sub_set_write)
>>> +                    newpte = pte_mkwrite(ptent, vma);
>>> +                else
>>> +                    newpte = ptent;
>>> +
>>> +                modify_prot_commit_ptes(vma, addr, pte, oldpte,
>>> +                            newpte, sub_nr_ptes);
>>> +                if (pte_needs_flush(oldpte, newpte))
>> What did we conclude with pte_needs_flush()? I thought there was an arch where
>> it looked dodgy calling this for just the pte at the head of the batch?
>>
>> Thanks,
>> Ryan
>>
>>> +                    tlb_flush_pte_range(tlb, addr,
>>> +                        sub_nr_ptes * PAGE_SIZE);
>>> +
>>> +                addr += sub_nr_ptes * PAGE_SIZE;
>>> +                pte += sub_nr_ptes;
>>> +                oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>>> +                ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>>> +                nr_ptes -= sub_nr_ptes;
>>> +                pages += sub_nr_ptes;
>>> +                pgidx += sub_nr_ptes;
>>> +            }
>>>           } else if (is_swap_pte(oldpte)) {
>>>               swp_entry_t entry = pte_to_swp_entry(oldpte);
>>>               pte_t newpte;



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-06-30 12:52   ` Lorenzo Stoakes
  2025-07-01  5:30     ` Dev Jain
@ 2025-07-01  8:03     ` Ryan Roberts
  2025-07-01  8:06       ` Dev Jain
  2025-07-01  8:15       ` Lorenzo Stoakes
  1 sibling, 2 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-07-01  8:03 UTC (permalink / raw)
  To: Lorenzo Stoakes, Dev Jain
  Cc: akpm, david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, vbabka, jannh, anshuman.khandual, peterx,
	joey.gouly, ioworker0, baohua, kevin.brodsky, quic_zhenhuah,
	christophe.leroy, yangyicong, linux-arm-kernel, hughd, yang, ziy

On 30/06/2025 13:52, Lorenzo Stoakes wrote:
> On Sat, Jun 28, 2025 at 05:04:34PM +0530, Dev Jain wrote:
>> Use folio_pte_batch to batch process a large folio. Reuse the folio from
>> prot_numa case if possible.
>>
>> For all cases other than the PageAnonExclusive case, if the case holds true
>> for one pte in the batch, one can confirm that that case will hold true for
>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
>> and access bits across the batch, therefore batching across
>> pte_dirty(): this is correct since the dirty bit on the PTE really is
>> just an indication that the folio got written to, so even if the PTE is
>> not actually dirty (but one of the PTEs in the batch is), the wp-fault
>> optimization can be made.
>>
>> The crux now is how to batch around the PageAnonExclusive case; we must
>> check the corresponding condition for every single page. Therefore, from
>> the large folio batch, we process sub batches of ptes mapping pages with
>> the same PageAnonExclusive condition, and process that sub batch, then
>> determine and process the next sub batch, and so on. Note that this does
>> not cause any extra overhead; if suppose the size of the folio batch
>> is 512, then the sub batch processing in total will take 512 iterations,
>> which is the same as what we would have done before.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>  mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 117 insertions(+), 26 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 627b0d67cc4a..28c7ce7728ff 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -40,35 +40,47 @@
>>
>>  #include "internal.h"
>>
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -			     pte_t pte)
>> -{
>> -	struct page *page;
>> +enum tristate {
>> +	TRI_FALSE = 0,
>> +	TRI_TRUE = 1,
>> +	TRI_MAYBE = -1,
>> +};
> 
> Yeah no, absolutely not, this is horrible, I don't want to see an arbitrary type
> like this added, to a random file, and I absolutely think this adds confusion
> and does not in any way help clarify things.
> 
>>
>> +/*
>> + * Returns enum tristate indicating whether the pte can be changed to writable.
>> + * If TRI_MAYBE is returned, then the folio is anonymous and the user must
>> + * additionally check PageAnonExclusive() for every page in the desired range.
>> + */
>> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
>> +				     unsigned long addr, pte_t pte,
>> +				     struct folio *folio)
>> +{
>>  	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>> -		return false;
>> +		return TRI_FALSE;
>>
>>  	/* Don't touch entries that are not even readable. */
>>  	if (pte_protnone(pte))
>> -		return false;
>> +		return TRI_FALSE;
>>
>>  	/* Do we need write faults for softdirty tracking? */
>>  	if (pte_needs_soft_dirty_wp(vma, pte))
>> -		return false;
>> +		return TRI_FALSE;
>>
>>  	/* Do we need write faults for uffd-wp tracking? */
>>  	if (userfaultfd_pte_wp(vma, pte))
>> -		return false;
>> +		return TRI_FALSE;
>>
>>  	if (!(vma->vm_flags & VM_SHARED)) {
>>  		/*
>>  		 * Writable MAP_PRIVATE mapping: We can only special-case on
>>  		 * exclusive anonymous pages, because we know that our
>>  		 * write-fault handler similarly would map them writable without
>> -		 * any additional checks while holding the PT lock.
>> +		 * any additional checks while holding the PT lock. So if the
>> +		 * folio is not anonymous, we know we cannot change pte to
>> +		 * writable. If it is anonymous then the caller must further
>> +		 * check that the page is AnonExclusive().
>>  		 */
>> -		page = vm_normal_page(vma, addr, pte);
>> -		return page && PageAnon(page) && PageAnonExclusive(page);
>> +		return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
>>  	}
>>
>>  	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>> @@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>  	 * FS was already notified and we can simply mark the PTE writable
>>  	 * just like the write-fault handler would do.
>>  	 */
>> -	return pte_dirty(pte);
>> +	return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
>> +}
> 
> Yeah not a fan of this at all.
> 
> This is squashing all the logic into one place when we don't really need to.
> 
> We can separate out the shared logic and just do something like:
> 
> ////// Lorenzo's suggestion //////
> 
> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte)
> +static bool maybe_change_pte_writable(struct vm_area_struct *vma,
> +		pte_t pte)
>  {
> -	struct page *page;
> -
>  	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>  		return false;
> 
> @@ -60,16 +58,14 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	if (userfaultfd_pte_wp(vma, pte))
>  		return false;
> 
> -	if (!(vma->vm_flags & VM_SHARED)) {
> -		/*
> -		 * Writable MAP_PRIVATE mapping: We can only special-case on
> -		 * exclusive anonymous pages, because we know that our
> -		 * write-fault handler similarly would map them writable without
> -		 * any additional checks while holding the PT lock.
> -		 */
> -		page = vm_normal_page(vma, addr, pte);
> -		return page && PageAnon(page) && PageAnonExclusive(page);
> -	}
> +	return true;
> +}
> +
> +static bool can_change_shared_pte_writable(struct vm_area_struct *vma,
> +		pte_t pte)
> +{
> +	if (!maybe_change_pte_writable(vma, pte))
> +		return false;
> 
>  	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
> 
> @@ -83,6 +79,33 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return pte_dirty(pte);
>  }
> 
> +static bool can_change_private_pte_writable(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t pte)
> +{
> +	struct page *page;
> +
> +	if (!maybe_change_pte_writable(vma, pte))
> +		return false;
> +
> +	/*
> +	 * Writable MAP_PRIVATE mapping: We can only special-case on
> +	 * exclusive anonymous pages, because we know that our
> +	 * write-fault handler similarly would map them writable without
> +	 * any additional checks while holding the PT lock.
> +	 */
> +	page = vm_normal_page(vma, addr, pte);
> +	return page && PageAnon(page) && PageAnonExclusive(page);
> +}
> +
> +bool can_change_pte_writable(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t pte)
> +{
> +	if (vma->vm_flags & VM_SHARED)
> +		return can_change_shared_pte_writable(vma, pte);
> +
> +	return can_change_private_pte_writable(vma, addr, pte);
> +}
> +
> 
> ////// end of Lorenzo's suggestion //////
> 
> You can obviously modify this to change other stuff like whether you feed back
> the PAE or not in private case for use in your code.

This suggestion for this part of the problem looks much cleaner!

Sorry; this whole struct tristate thing was my idea. I never really liked it but
I was more focussed on trying to illustrate the big picture flow that I thought
would work well with a batch and sub-batches (which it seems below that you
hate... but let's talk about that down there).

> 
>> +
>> +/*
>> + * Returns the number of pages within the folio, starting from the page
>> + * indicated by pgidx and up to pgidx + max_nr, that have the same value of
>> + * PageAnonExclusive(). Must only be called for anonymous folios. Value of
>> + * PageAnonExclusive() is returned in *exclusive.
>> + */
>> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
>> +				bool *exclusive)
> 
> Let's generalise it to something like count_folio_fungible_pages()
> 
> or maybe count_folio_batchable_pages()?
> 
> Yes naming is hard... :P but right now it reads like this is returning a batch
> or doing something with a batch.
> 
>> +{
>> +	struct page *page;
>> +	int nr = 1;
>> +
>> +	if (!folio) {
>> +		*exclusive = false;
>> +		return nr;
>> +	}
>> +
>> +	page = folio_page(folio, pgidx++);
>> +	*exclusive = PageAnonExclusive(page);
>> +	while (nr < max_nr) {
> 
> The C programming language asks why you don't like using for :)
> 
>> +		page = folio_page(folio, pgidx++);
>> +		if ((*exclusive) != PageAnonExclusive(page))
>> +			break;
>> +		nr++;
> 
> This *exclusive stuff makes me want to cry :)
> 
> Just set a local variable and hand it back at the end.
> 
>> +	}
>> +
>> +	return nr;
>> +}
>> +
>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> +			     pte_t pte)
>> +{
>> +	struct page *page;
>> +	int ret;
>> +
>> +	ret = maybe_change_pte_writable(vma, addr, pte, NULL);
>> +	if (ret == TRI_MAYBE) {
>> +		page = vm_normal_page(vma, addr, pte);
>> +		ret = page && PageAnon(page) && PageAnonExclusive(page);
>> +	}
>> +
>> +	return ret;
>>  }
> 
> See above comments on this stuff.
> 
>>
>>  static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>> -		pte_t *ptep, pte_t pte, int max_nr_ptes)
>> +		pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
> 
> This last parameter is pretty horrible. It's a negative mask so now you're
> passing 'ignore soft dirty' to the function meaning 'don't ignore it'. This is
> just planting land mines.
> 
> Obviously David's flag changes will also alter all this.
> 
> Just add a boolean re: soft dirty.

Dev had a boolean for this in the last round. I've seen various functions expand
over time with increasing numbers of bool flags. So I asked to convert to a
flags parameter and just pass in the flags we need. Then it's a bit more future
proof and self documenting. (For the record I dislike the "switch_off_flags"
approach taken here).

> 
>>  {
>> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +
>> +	flags &= ~switch_off_flags;
>>
>> -	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>> +	if (!folio || !folio_test_large(folio))
>>  		return 1;
> 
> Why remove this last check?
> 
>>
>>  	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>> @@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma
>>  	}
>>
>>  skip_batch:
>> -	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>> +	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>> +					   max_nr_ptes, 0);
> 
> See above about flag param. If you change to boolean, please prefix this with
> e.g. /* set_soft_dirty= */ true or whatever the flag ends up being :)
> 
>>  out:
>>  	*foliop = folio;
>>  	return nr_ptes;
>> @@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>>  		if (pte_present(oldpte)) {
>>  			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>  			struct folio *folio = NULL;
>> -			pte_t ptent;
>> +			int sub_nr_ptes, pgidx = 0;
>> +			pte_t ptent, newpte;
>> +			bool sub_set_write;
>> +			int set_write;
>>
>>  			/*
>>  			 * Avoid trapping faults against the zero or KSM
>> @@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>  					continue;
>>  			}
>>
>> +			if (!folio)
>> +				folio = vm_normal_folio(vma, addr, oldpte);
>> +
>> +			nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>> +							   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
> 
> Don't we only care about S/D if pte_needs_soft_dirty_wp()?
> 
>>  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>  			ptent = pte_modify(oldpte, newprot);
>>
>> @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
>>  			 * example, if a PTE is already dirty and no other
>>  			 * COW or special handling is required.
>>  			 */
>> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> -			    !pte_write(ptent) &&
>> -			    can_change_pte_writable(vma, addr, ptent))
>> -				ptent = pte_mkwrite(ptent, vma);
>> -
>> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> -			if (pte_needs_flush(oldpte, ptent))
>> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> -			pages++;
>> +			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> +				    !pte_write(ptent);
>> +			if (set_write)
>> +				set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
>> +
>> +			while (nr_ptes) {
>> +				if (set_write == TRI_MAYBE) {
>> +					sub_nr_ptes = anon_exclusive_batch(folio,
>> +						pgidx, nr_ptes, &sub_set_write);
>> +				} else {
>> +					sub_nr_ptes = nr_ptes;
>> +					sub_set_write = (set_write == TRI_TRUE);
>> +				}
>> +
>> +				if (sub_set_write)
>> +					newpte = pte_mkwrite(ptent, vma);
>> +				else
>> +					newpte = ptent;
>> +
>> +				modify_prot_commit_ptes(vma, addr, pte, oldpte,
>> +							newpte, sub_nr_ptes);
>> +				if (pte_needs_flush(oldpte, newpte))
>> +					tlb_flush_pte_range(tlb, addr,
>> +						sub_nr_ptes * PAGE_SIZE);
>> +
>> +				addr += sub_nr_ptes * PAGE_SIZE;
>> +				pte += sub_nr_ptes;
>> +				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>> +				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>> +				nr_ptes -= sub_nr_ptes;
>> +				pages += sub_nr_ptes;
>> +				pgidx += sub_nr_ptes;
>> +			}
> 
> I hate hate hate having this loop here, let's abstract this please.
> 
> I mean I think we can just use mprotect_folio_pte_batch() no? It's not
> abstracting much here, and we can just do all this handling there. Maybe have to
> pass in a bunch more params, but it saves us having to do all this.

In an ideal world we would flatten and just have mprotect_folio_pte_batch()
return the batch size considering all the relevant PTE bits AND the
AnonExclusive bit on the pages. IIRC one of Dev's earlier versions modified the
core folio_pte_batch() function to also look at the AnonExclusive bit, but I
really disliked changing that core function (I think others did too?).

So barring that approach, we are really only left with the batch and sub-batch
approach - although, yes, it could be abstracted more. We could maintain a
context struct that persists across all calls to mprotect_folio_pte_batch() and
it can use that to keep its state to remember if we are in the middle of a
sub-batch and decide either to call folio_pte_batch() to get a new batch, or
call anon_exclusive_batch() to get the next sub-batch within the current batch.
But that started to feel overly abstracted to me.
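
To make that concrete, the context version might look roughly like this
(purely illustrative; none of these names exist, and I'm not advocating for it):

	struct mprotect_batch_ctx {
		int batch_left;		/* ptes left in the current folio batch */
		int pgidx;		/* next page index within the folio */
	};

	/*
	 * Return the next sub-batch of ptes that can be committed identically,
	 * starting a new folio batch when the previous one is exhausted.
	 */
	static int mprotect_next_sub_batch(struct mprotect_batch_ctx *ctx,
			struct folio *folio, unsigned long addr, pte_t *ptep,
			pte_t oldpte, int max_nr_ptes, bool *exclusive)
	{
		int nr;

		if (!ctx->batch_left) {
			ctx->batch_left = mprotect_folio_pte_batch(folio, addr,
						ptep, oldpte, max_nr_ptes, 0);
			ctx->pgidx = 0;
		}

		/* Carve out the run of pages with equal PageAnonExclusive(). */
		nr = anon_exclusive_batch(folio, ctx->pgidx, ctx->batch_left,
					  exclusive);
		ctx->pgidx += nr;
		ctx->batch_left -= nr;
		return nr;
	}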

This loop approach, as written, felt more straightforward for the reader to
understand (i.e. the least-worst option). Is the context approach what you are
suggesting or do you have something else in mind?

> 
> Alternatively, we could add a new wrapper function, but yeah definitely not
> this.
> 
> Also the C programming language asks... etc etc. ;)
> 
> Overall since you always end up processing folio_nr_pages(folio) you can just
> have the batch function or a wrapper return this and do updates as necessary
> here on that basis, and leave the 'sub' batching to that function.

Sorry I don't understand this statement - could you clarify? Especially the bit
about "always ... processing folio_nr_pages(folio)"; I don't think we do. In
various corner cases the size of the folio has no relationship to the way the
PTEs are mapped.

Thanks,
Ryan

> 
> 
>>  		} else if (is_swap_pte(oldpte)) {
>>  			swp_entry_t entry = pte_to_swp_entry(oldpte);
>>  			pte_t newpte;
>> --
>> 2.30.2
>>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-07-01  7:33       ` Ryan Roberts
@ 2025-07-01  8:06         ` Lorenzo Stoakes
  2025-07-01  8:23           ` Ryan Roberts
  0 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01  8:06 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Tue, Jul 01, 2025 at 08:33:32AM +0100, Ryan Roberts wrote:
> On 01/07/2025 05:44, Dev Jain wrote:
> >
> > On 30/06/25 6:27 pm, Lorenzo Stoakes wrote:
> >> On Sat, Jun 28, 2025 at 05:04:33PM +0530, Dev Jain wrote:
> >>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> >>> Architecture can override these helpers; in case not, they are implemented
> >>> as a simple loop over the corresponding single pte helpers.
> >>>
> >>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >> Looks generally sensible! Some comments below.
> >>
> >>> ---
> >>>   include/linux/pgtable.h | 83 ++++++++++++++++++++++++++++++++++++++++-
> >>>   mm/mprotect.c           |  4 +-
> >>>   2 files changed, 84 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>> index cf1515c163e2..662f39e7475a 100644
> >>> --- a/include/linux/pgtable.h
> >>> +++ b/include/linux/pgtable.h
> >>> @@ -1331,7 +1331,8 @@ static inline pte_t ptep_modify_prot_start(struct
> >>> vm_area_struct *vma,
> >>>
> >>>   /*
> >>>    * Commit an update to a pte, leaving any hardware-controlled bits in
> >>> - * the PTE unmodified.
> >>> + * the PTE unmodified. The pte may have been "upgraded" w.r.t a/d bits compared
> >>> + * to the old_pte, as in, it may have a/d bits on which were off in old_pte.
> >>>    */
> >>>   static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
> >>>                          unsigned long addr,
> >>> @@ -1340,6 +1341,86 @@ static inline void ptep_modify_prot_commit(struct
> >>> vm_area_struct *vma,
> >>>       __ptep_modify_prot_commit(vma, addr, ptep, pte);
> >>>   }
> >>>   #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> >>> +
> >>> +/**
> >>> + * modify_prot_start_ptes - Start a pte protection read-modify-write
> >>> transaction
> >>> + * over a batch of ptes, which protects against asynchronous hardware
> >>> + * modifications to the ptes. The intention is not to prevent the hardware from
> >>> + * making pte updates, but to prevent any updates it may make from being lost.
> >>> + * Please see the comment above ptep_modify_prot_start() for full description.
> >>> + *
> >>> + * @vma: The virtual memory area the pages are mapped into.
> >>> + * @addr: Address the first page is mapped at.
> >>> + * @ptep: Page table pointer for the first entry.
> >>> + * @nr: Number of entries.
> >>> + *
> >>> + * May be overridden by the architecture; otherwise, implemented as a simple
> >>> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
> >>> + * in the batch.
> >>> + *
> >>> + * Note that PTE bits in the PTE batch besides the PFN can differ.
> >>> + *
> >>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> >>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> >>> + * Since the batch is determined from folio_pte_batch, the PTEs must differ
> >>> + * only in a/d bits (and the soft dirty bit; see fpb_t flags in
> >>> + * mprotect_folio_pte_batch()).
> >>> + */
> >>> +#ifndef modify_prot_start_ptes
> >>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> >>> +        unsigned long addr, pte_t *ptep, unsigned int nr)
> >>> +{
> >>> +    pte_t pte, tmp_pte;
> >>> +
> >>> +    pte = ptep_modify_prot_start(vma, addr, ptep);
> >>> +    while (--nr) {
> >>> +        ptep++;
> >>> +        addr += PAGE_SIZE;
> >>> +        tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> >>> +        if (pte_dirty(tmp_pte))
> >>> +            pte = pte_mkdirty(pte);
> >>> +        if (pte_young(tmp_pte))
> >>> +            pte = pte_mkyoung(pte);
> >>> +    }
> >>> +    return pte;
> >>> +}
> >>> +#endif
> >>> +
> >>> +/**
> >>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
> >>> + * hardware-controlled bits in the PTE unmodified.
> >>> + *
> >>> + * @vma: The virtual memory area the pages are mapped into.
> >>> + * @addr: Address the first page is mapped at.
> >>> + * @ptep: Page table pointer for the first entry.
> >>> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
> >>> + * @pte: New page table entry to be set.
> >>> + * @nr: Number of entries.
> >>> + *
> >>> + * May be overridden by the architecture; otherwise, implemented as a simple
> >>> + * loop over ptep_modify_prot_commit().
> >>> + *
> >>> + * Context: The caller holds the page table lock. The PTEs are all in the same
> >>> + * PMD. On exit, the set ptes in the batch map the same folio. The pte may have
> >>> + * been "upgraded" w.r.t a/d bits compared to the old_pte, as in, it may have
> >>> + * a/d bits on which were off in old_pte.
> >>> + */
> >>> +#ifndef modify_prot_commit_ptes
> >>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
> >>> unsigned long addr,
> >>> +        pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> >>> +{
> >>> +    int i;
> >>> +
> >>> +    for (i = 0; i < nr; ++i) {
> >>> +        ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >>> +        ptep++;
> >> Weird place to put this increment, maybe just stick it in the for loop.
> >>
> >>> +        addr += PAGE_SIZE;
> >> Same comment here.
> >
> > Sure.
> >
> >>
> >>> +        old_pte = pte_next_pfn(old_pte);
> >> Could be:
> >>
> >>         old_pte = pte;
> >>
> >> No?
> >
> > We will need to update old_pte also since that
> > is used by powerpc in radix__ptep_modify_prot_commit().
>
> I think perhaps Lorenzo has the model in his head where old_pte is the previous
> pte in the batch. That's not the case. old_pte is the value of the pte in the
> current position of the batch before any changes were made. pte is the new value
> for the pte. So we need to explicitly advance the PFN in both old_pte and pte
> each iteration round the loop.

Yeah, you're right, apologies, I'd misinterpreted.

I really, really, really hate how all this is implemented. This is obviously an
mprotect() and legacy thing but it's almost designed for confusion. Not the
fault of this series, and todo++ on improving mprotect as a whole (been on my
list for a while...)

So we're ultimately updating ptep (this thing that we update, of course, is
buried in the middle of the function invocation) in:

	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);

We are setting *ptep++ = pte essentially (roughly speaking) right?

And the arch needs to know about any bits that have changed I guess hence
providing old_pte as well right?

OK so yeah, I get it now, we're not actually advancing through ptes here, we're
just advancing the PFN and applying the same 'template'.

How about something like:

static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
	       pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
{
	int i;

	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE) {
		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);

		/* Advance PFN only, set same flags. */
		old_pte = pte_next_pfn(old_pte);
		pte = pte_next_pfn(pte);
	}
}

Neatens it up a bit and makes it clear that we're effectively propagating the
flags here.

>
> >
> >>
> >>> +        pte = pte_next_pfn(pte);
> >>> +    }
> >>> +}
> >>> +#endif
> >>> +
> >>>   #endif /* CONFIG_MMU */
> >>>
> >>>   /*
> >>> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >>> index af10a7fbe6b8..627b0d67cc4a 100644
> >>> --- a/mm/mprotect.c
> >>> +++ b/mm/mprotect.c
> >>> @@ -206,7 +206,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> >>>                       continue;
> >>>               }
> >>>
> >>> -            oldpte = ptep_modify_prot_start(vma, addr, pte);
> >>> +            oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> >>>               ptent = pte_modify(oldpte, newprot);
> >>>
> >>>               if (uffd_wp)
> >>> @@ -232,7 +232,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> >>>                   can_change_pte_writable(vma, addr, ptent))
> >>>                   ptent = pte_mkwrite(ptent, vma);
> >>>
> >>> -            ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> >>> +            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> >>>               if (pte_needs_flush(oldpte, ptent))
> >>>                   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> >>>               pages++;
> >>> --
> >>> 2.30.2
> >>>
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-01  8:03     ` Ryan Roberts
@ 2025-07-01  8:06       ` Dev Jain
  2025-07-01  8:24         ` Ryan Roberts
  2025-07-01  8:15       ` Lorenzo Stoakes
  1 sibling, 1 reply; 62+ messages in thread
From: Dev Jain @ 2025-07-01  8:06 UTC (permalink / raw)
  To: Ryan Roberts, Lorenzo Stoakes
  Cc: akpm, david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, vbabka, jannh, anshuman.khandual, peterx,
	joey.gouly, ioworker0, baohua, kevin.brodsky, quic_zhenhuah,
	christophe.leroy, yangyicong, linux-arm-kernel, hughd, yang, ziy


On 01/07/25 1:33 pm, Ryan Roberts wrote:
> On 30/06/2025 13:52, Lorenzo Stoakes wrote:
>> On Sat, Jun 28, 2025 at 05:04:34PM +0530, Dev Jain wrote:
>>> Use folio_pte_batch to batch process a large folio. Reuse the folio from
>>> prot_numa case if possible.
>>>
>>> For all cases other than the PageAnonExclusive case, if the case holds true
>>> for one pte in the batch, one can confirm that that case will hold true for
>>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
>>> and access bits across the batch, therefore batching across
>>> pte_dirty(): this is correct since the dirty bit on the PTE really is
>>> just an indication that the folio got written to, so even if the PTE is
>>> not actually dirty (but one of the PTEs in the batch is), the wp-fault
>>> optimization can be made.
>>>
>>> The crux now is how to batch around the PageAnonExclusive case; we must
>>> check the corresponding condition for every single page. Therefore, from
>>> the large folio batch, we process sub batches of ptes mapping pages with
>>> the same PageAnonExclusive condition, and process that sub batch, then
>>> determine and process the next sub batch, and so on. Note that this does
>>> not cause any extra overhead; if suppose the size of the folio batch
>>> is 512, then the sub batch processing in total will take 512 iterations,
>>> which is the same as what we would have done before.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>   mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
>>>   1 file changed, 117 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 627b0d67cc4a..28c7ce7728ff 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -40,35 +40,47 @@
>>>
>>>   #include "internal.h"
>>>
>>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> -			     pte_t pte)
>>> -{
>>> -	struct page *page;
>>> +enum tristate {
>>> +	TRI_FALSE = 0,
>>> +	TRI_TRUE = 1,
>>> +	TRI_MAYBE = -1,
>>> +};
>> Yeah no, absolutely not, this is horrible, I don't want to see an arbitrary type
>> like this added, to a random file, and I absolutely think this adds confusion
>> and does not in any way help clarify things.
>>
>>> +/*
>>> + * Returns enum tristate indicating whether the pte can be changed to writable.
>>> + * If TRI_MAYBE is returned, then the folio is anonymous and the user must
>>> + * additionally check PageAnonExclusive() for every page in the desired range.
>>> + */
>>> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
>>> +				     unsigned long addr, pte_t pte,
>>> +				     struct folio *folio)
>>> +{
>>>   	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>>> -		return false;
>>> +		return TRI_FALSE;
>>>
>>>   	/* Don't touch entries that are not even readable. */
>>>   	if (pte_protnone(pte))
>>> -		return false;
>>> +		return TRI_FALSE;
>>>
>>>   	/* Do we need write faults for softdirty tracking? */
>>>   	if (pte_needs_soft_dirty_wp(vma, pte))
>>> -		return false;
>>> +		return TRI_FALSE;
>>>
>>>   	/* Do we need write faults for uffd-wp tracking? */
>>>   	if (userfaultfd_pte_wp(vma, pte))
>>> -		return false;
>>> +		return TRI_FALSE;
>>>
>>>   	if (!(vma->vm_flags & VM_SHARED)) {
>>>   		/*
>>>   		 * Writable MAP_PRIVATE mapping: We can only special-case on
>>>   		 * exclusive anonymous pages, because we know that our
>>>   		 * write-fault handler similarly would map them writable without
>>> -		 * any additional checks while holding the PT lock.
>>> +		 * any additional checks while holding the PT lock. So if the
>>> +		 * folio is not anonymous, we know we cannot change pte to
>>> +		 * writable. If it is anonymous then the caller must further
>>> +		 * check that the page is AnonExclusive().
>>>   		 */
>>> -		page = vm_normal_page(vma, addr, pte);
>>> -		return page && PageAnon(page) && PageAnonExclusive(page);
>>> +		return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
>>>   	}
>>>
>>>   	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>>> @@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>   	 * FS was already notified and we can simply mark the PTE writable
>>>   	 * just like the write-fault handler would do.
>>>   	 */
>>> -	return pte_dirty(pte);
>>> +	return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
>>> +}
>> Yeah not a fan of this at all.
>>
>> This is squashing all the logic into one place when we don't really need to.
>>
>> We can separate out the shared logic and just do something like:
>>
>> ////// Lorenzo's suggestion //////
>>
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -			     pte_t pte)
>> +static bool maybe_change_pte_writable(struct vm_area_struct *vma,
>> +		pte_t pte)
>>   {
>> -	struct page *page;
>> -
>>   	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>>   		return false;
>>
>> @@ -60,16 +58,14 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   	if (userfaultfd_pte_wp(vma, pte))
>>   		return false;
>>
>> -	if (!(vma->vm_flags & VM_SHARED)) {
>> -		/*
>> -		 * Writable MAP_PRIVATE mapping: We can only special-case on
>> -		 * exclusive anonymous pages, because we know that our
>> -		 * write-fault handler similarly would map them writable without
>> -		 * any additional checks while holding the PT lock.
>> -		 */
>> -		page = vm_normal_page(vma, addr, pte);
>> -		return page && PageAnon(page) && PageAnonExclusive(page);
>> -	}
>> +	return true;
>> +}
>> +
>> +static bool can_change_shared_pte_writable(struct vm_area_struct *vma,
>> +		pte_t pte)
>> +{
>> +	if (!maybe_change_pte_writable(vma, pte))
>> +		return false;
>>
>>   	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>>
>> @@ -83,6 +79,33 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   	return pte_dirty(pte);
>>   }
>>
>> +static bool can_change_private_pte_writable(struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t pte)
>> +{
>> +	struct page *page;
>> +
>> +	if (!maybe_change_pte_writable(vma, pte))
>> +		return false;
>> +
>> +	/*
>> +	 * Writable MAP_PRIVATE mapping: We can only special-case on
>> +	 * exclusive anonymous pages, because we know that our
>> +	 * write-fault handler similarly would map them writable without
>> +	 * any additional checks while holding the PT lock.
>> +	 */
>> +	page = vm_normal_page(vma, addr, pte);
>> +	return page && PageAnon(page) && PageAnonExclusive(page);
>> +}
>> +
>> +bool can_change_pte_writable(struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t pte)
>> +{
>> +	if (vma->vm_flags & VM_SHARED)
>> +		return can_change_shared_pte_writable(vma, pte);
>> +
>> +	return can_change_private_pte_writable(vma, addr, pte);
>> +}
>> +
>>
>> ////// end of Lorenzo's suggestion //////
>>
>> You can obviously modify this to change other stuff like whether you feed back
>> the PAE or not in private case for use in your code.
> This suggestion for this part of the problem looks much cleaner!
>
> Sorry; this whole struct tristate thing was my idea. I never really liked it but
> I was more focussed on trying to illustrate the big picture flow that I thought
> would work well with a batch and sub-batches (which it seems below that you
> hate... but let's talk about that down there).
>
>>> +
>>> +/*
>>> + * Returns the number of pages within the folio, starting from the page
>>> + * indicated by pgidx and up to pgidx + max_nr, that have the same value of
>>> + * PageAnonExclusive(). Must only be called for anonymous folios. Value of
>>> + * PageAnonExclusive() is returned in *exclusive.
>>> + */
>>> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
>>> +				bool *exclusive)
>> Let's generalise it to something like count_folio_fungible_pages()
>>
>> or maybe count_folio_batchable_pages()?
>>
>> Yes naming is hard... :P but right now it reads like this is returning a batch
>> or doing something with a batch.
>>
>>> +{
>>> +	struct page *page;
>>> +	int nr = 1;
>>> +
>>> +	if (!folio) {
>>> +		*exclusive = false;
>>> +		return nr;
>>> +	}
>>> +
>>> +	page = folio_page(folio, pgidx++);
>>> +	*exclusive = PageAnonExclusive(page);
>>> +	while (nr < max_nr) {
>> The C programming language asks why you don't like using for :)
>>
>>> +		page = folio_page(folio, pgidx++);
>>> +		if ((*exclusive) != PageAnonExclusive(page))
>>> +			break;
>>> +		nr++;
>> This *exclusive stuff makes me want to cry :)
>>
>> Just set a local variable and hand it back at the end.
>>
>>> +	}
>>> +
>>> +	return nr;
>>> +}
>>> +
>>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> +			     pte_t pte)
>>> +{
>>> +	struct page *page;
>>> +	int ret;
>>> +
>>> +	ret = maybe_change_pte_writable(vma, addr, pte, NULL);
>>> +	if (ret == TRI_MAYBE) {
>>> +		page = vm_normal_page(vma, addr, pte);
>>> +		ret = page && PageAnon(page) && PageAnonExclusive(page);
>>> +	}
>>> +
>>> +	return ret;
>>>   }
>> See above comments on this stuff.
>>
>>>   static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>>> -		pte_t *ptep, pte_t pte, int max_nr_ptes)
>>> +		pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
>> This last parameter is pretty horrible. It's a negative mask so now you're
>> passing 'ignore soft dirty' to the function meaning 'don't ignore it'. This is
>> just planting land mines.
>>
>> Obviously David's flag changes will also alter all this.
>>
>> Just add a boolean re: soft dirty.
> Dev had a boolean for this in the last round. I've seen various functions expand
> over time with increasing numbers of bool flags. So I asked to convert to a
> flags parameter and just pass in the flags we need. Then it's a bit more future
> proof and self documenting. (For the record I dislike the "switch_off_flags"
> approach taken here).
>
>>>   {
>>> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> +	fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> +
>>> +	flags &= ~switch_off_flags;
>>>
>>> -	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>> +	if (!folio || !folio_test_large(folio))
>>>   		return 1;
>> Why remove this last check?
>>
>>>   	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>> @@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma
>>>   	}
>>>
>>>   skip_batch:
>>> -	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>>> +	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>>> +					   max_nr_ptes, 0);
>> See above about flag param. If you change to boolean, please prefix this with
>> e.g. /* set_soft_dirty= */ true or whatever the flag ends up being :)
>>
>>>   out:
>>>   	*foliop = folio;
>>>   	return nr_ptes;
>>> @@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>   		if (pte_present(oldpte)) {
>>>   			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>>   			struct folio *folio = NULL;
>>> -			pte_t ptent;
>>> +			int sub_nr_ptes, pgidx = 0;
>>> +			pte_t ptent, newpte;
>>> +			bool sub_set_write;
>>> +			int set_write;
>>>
>>>   			/*
>>>   			 * Avoid trapping faults against the zero or KSM
>>> @@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>   					continue;
>>>   			}
>>>
>>> +			if (!folio)
>>> +				folio = vm_normal_folio(vma, addr, oldpte);
>>> +
>>> +			nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>>> +							   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
>> Don't we only care about S/D if pte_needs_soft_dirty_wp()?
>>
>>>   			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>>   			ptent = pte_modify(oldpte, newprot);
>>>
>>> @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>   			 * example, if a PTE is already dirty and no other
>>>   			 * COW or special handling is required.
>>>   			 */
>>> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>> -			    !pte_write(ptent) &&
>>> -			    can_change_pte_writable(vma, addr, ptent))
>>> -				ptent = pte_mkwrite(ptent, vma);
>>> -
>>> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>> -			if (pte_needs_flush(oldpte, ptent))
>>> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>> -			pages++;
>>> +			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>> +				    !pte_write(ptent);
>>> +			if (set_write)
>>> +				set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
>>> +
>>> +			while (nr_ptes) {
>>> +				if (set_write == TRI_MAYBE) {
>>> +					sub_nr_ptes = anon_exclusive_batch(folio,
>>> +						pgidx, nr_ptes, &sub_set_write);
>>> +				} else {
>>> +					sub_nr_ptes = nr_ptes;
>>> +					sub_set_write = (set_write == TRI_TRUE);
>>> +				}
>>> +
>>> +				if (sub_set_write)
>>> +					newpte = pte_mkwrite(ptent, vma);
>>> +				else
>>> +					newpte = ptent;
>>> +
>>> +				modify_prot_commit_ptes(vma, addr, pte, oldpte,
>>> +							newpte, sub_nr_ptes);
>>> +				if (pte_needs_flush(oldpte, newpte))
>>> +					tlb_flush_pte_range(tlb, addr,
>>> +						sub_nr_ptes * PAGE_SIZE);
>>> +
>>> +				addr += sub_nr_ptes * PAGE_SIZE;
>>> +				pte += sub_nr_ptes;
>>> +				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>>> +				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>>> +				nr_ptes -= sub_nr_ptes;
>>> +				pages += sub_nr_ptes;
>>> +				pgidx += sub_nr_ptes;
>>> +			}
>> I hate hate hate having this loop here, let's abstract this please.
>>
>> I mean I think we can just use mprotect_folio_pte_batch() no? It's not
>> abstracting much here, and we can just do all this handling there. Maybe have to
>> pass in a bunch more params, but it saves us having to do all this.
> In an ideal world we would flatten and just have mprotect_folio_pte_batch()
> return the batch size considering all the relevant PTE bits AND the
> AnonExclusive bit on the pages. IIRC one of Dev's earlier versions modified the
> core folio_pte_batch() function to also look at the AnonExclusive bit, but I
> really disliked changing that core function (I think others did too?).

That patch was in our private exchange, not on the list.

>
> So barring that approach, we are really only left with the batch and sub-batch
> approach - although, yes, it could be abstracted more. We could maintain a
> context struct that persists across all calls to mprotect_folio_pte_batch() and
> it can use that to keep its state to remember if we are in the middle of a
> sub-batch and decide either to call folio_pte_batch() to get a new batch, or
> call anon_exclusive_batch() to get the next sub-batch within the current batch.
> But that started to feel overly abstracted to me.
>
> This loop approach, as written, felt more straightforward for the reader to
> understand (i.e. the least-worst option). Is the context approach what you are
> suggesting or do you have something else in mind?
>
>> Alternatively, we could add a new wrapper function, but yeah definitely not
>> this.
>>
>> Also the C programming language asks... etc etc. ;)
>>
>> Overall since you always end up processing folio_nr_pages(folio) you can just
>> have the batch function or a wrapper return this and do updates as necessary
>> here on that basis, and leave the 'sub' batching to that function.
> Sorry I don't understand this statement - could you clarify? Especially the bit
> about "always ... processing folio_nr_pages(folio)"; I don't think we do. In
> various corner cases the size of the folio has no relationship to the way the
> PTEs are mapped.
>
> Thanks,
> Ryan
>
>>
>>>   		} else if (is_swap_pte(oldpte)) {
>>>   			swp_entry_t entry = pte_to_swp_entry(oldpte);
>>>   			pte_t newpte;
>>> --
>>> 2.30.2
>>>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-01  8:03     ` Ryan Roberts
  2025-07-01  8:06       ` Dev Jain
@ 2025-07-01  8:15       ` Lorenzo Stoakes
  2025-07-01  8:30         ` Ryan Roberts
  1 sibling, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01  8:15 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Tue, Jul 01, 2025 at 09:03:27AM +0100, Ryan Roberts wrote:
> >
> > ////// end of Lorenzo's suggestion //////
> >
> > You can obviously modify this to change other stuff like whether you feed back
> > the PAE or not in private case for use in your code.
>
> This suggestion for this part of the problem looks much cleaner!

Thanks :)

>
> Sorry; this whole struct tristate thing was my idea. I never really liked it but
> I was more focussed on trying to illustrate the big picture flow that I thought
> would work well with a batch and sub-batches (which it seems below that you
> hate... but let's talk about that down there).

Yeah, this is fiddly stuff so I get it as a sort of pseudocode, but as code
obviously I've made my feelings known haha.

It seems that we can apply the fundamental underlying idea without needing to do
it this way at any rate so we should be good.

> >>
> >>  static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
> >> -		pte_t *ptep, pte_t pte, int max_nr_ptes)
> >> +		pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
> >
> > This last parameter is pretty horrible. It's a negative mask so now you're
> > passing 'ignore soft dirty' to the function meaning 'don't ignore it'. This is
> > just planting land mines.
> >
> > Obviously David's flag changes will also alter all this.
> >
> > Just add a boolean re: soft dirty.
>
> Dev had a boolean for this in the last round. I've seen various functions expand
> over time with increasing numbers of bool flags. So I asked to convert to a
> flags parameter and just pass in the flags we need. Then it's a bit more future
> proof and self documenting. (For the record I dislike the "switch_off_flags"
> approach taken here).

Yeah, but we can change this when it needs to be changed. When it comes to
internal non-uAPI stuff I don't think we need to be too worried about
future-proofing like this at least just yet.

Do not fear the future churn... ;)

I mean I guess David's new flags will make this less egregious anyway.

> >>  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> >>  			ptent = pte_modify(oldpte, newprot);
> >>
> >> @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
> >>  			 * example, if a PTE is already dirty and no other
> >>  			 * COW or special handling is required.
> >>  			 */
> >> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> >> -			    !pte_write(ptent) &&
> >> -			    can_change_pte_writable(vma, addr, ptent))
> >> -				ptent = pte_mkwrite(ptent, vma);
> >> -
> >> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> >> -			if (pte_needs_flush(oldpte, ptent))
> >> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> >> -			pages++;
> >> +			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> >> +				    !pte_write(ptent);
> >> +			if (set_write)
> >> +				set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
> >> +
> >> +			while (nr_ptes) {
> >> +				if (set_write == TRI_MAYBE) {
> >> +					sub_nr_ptes = anon_exclusive_batch(folio,
> >> +						pgidx, nr_ptes, &sub_set_write);
> >> +				} else {
> >> +					sub_nr_ptes = nr_ptes;
> >> +					sub_set_write = (set_write == TRI_TRUE);
> >> +				}
> >> +
> >> +				if (sub_set_write)
> >> +					newpte = pte_mkwrite(ptent, vma);
> >> +				else
> >> +					newpte = ptent;
> >> +
> >> +				modify_prot_commit_ptes(vma, addr, pte, oldpte,
> >> +							newpte, sub_nr_ptes);
> >> +				if (pte_needs_flush(oldpte, newpte))
> >> +					tlb_flush_pte_range(tlb, addr,
> >> +						sub_nr_ptes * PAGE_SIZE);
> >> +
> >> +				addr += sub_nr_ptes * PAGE_SIZE;
> >> +				pte += sub_nr_ptes;
> >> +				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
> >> +				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
> >> +				nr_ptes -= sub_nr_ptes;
> >> +				pages += sub_nr_ptes;
> >> +				pgidx += sub_nr_ptes;
> >> +			}
> >
> > I hate hate hate having this loop here, let's abstract this please.
> >
> > I mean I think we can just use mprotect_folio_pte_batch() no? It's not
> > abstracting much here, and we can just do all this handling there. Maybe have to
> > pass in a bunch more params, but it saves us having to do all this.
>
> In an ideal world we would flatten and just have mprotect_folio_pte_batch()
> return the batch size considering all the relevant PTE bits AND the
> AnonExclusive bit on the pages. IIRC one of Dev's earlier versions modified the
> core folio_pte_batch() function to also look at the AnonExclusive bit, but I
> really disliked changing that core function (I think others did too?).

Yeah let's not change the core function.

My suggestion is to have mprotect_folio_pte_batch() do this.

>
> So barring that approach, we are really only left with the batch and sub-batch
> approach - although, yes, it could be abstracted more. We could maintain a
> context struct that persists across all calls to mprotect_folio_pte_batch() and
> it can use that to keep it's state to remember if we are in the middle of a
> sub-batch and decide either to call folio_pte_batch() to get a new batch, or
> call anon_exclusive_batch() to get the next sub-batch within the current batch.
> But that started to feel overly abstracted to me.

Having this nested batch/sub-batch loop really feels worse. You just get lost in
the complexity here very easily.

But i"m also not sure we need to maintain _that_ much state?

We're already looping over all of the PTEs here, so abstracting _the entire
loop_ and all the sub-batch stuff into another function - sensibly
mprotect_folio_pte_batch() I think - so it handles this for you, makes a ton
of sense.

>
> This loop approach, as written, felt more straightforward for the reader to
> understand (i.e. the least-worst option). Is the context approach what you are
> suggesting or do you have something else in mind?
>

See above.

> >
> > Alternatively, we could add a new wrapper function, but yeah definitely not
> > this.
> >
> > Also the C programming language asks... etc etc. ;)
> >
> > Overall since you always end up processing folio_nr_pages(folio) you can just
> > have the batch function or a wrapper return this and do updates as necessary
> > here on that basis, and leave the 'sub' batching to that function.
>
> Sorry I don't understand this statement - could you clarify? Especially the bit
> about "always ... processing folio_nr_pages(folio)"; I don't think we do. In
> various corner cases the size of the folio has no relationship to the way the
> PTEs are mapped.

Right, yeah, I put this badly. Obviously you can have all sorts of fun with large
folios partially mapped and page-table split, etc. etc.

I should have said 'always process nr_ptes'.

The idea is to abstract this sub-batch stuff to another function, fundamentally.
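
Very roughly the sort of shape I have in mind, purely to illustrate - the
helper's extra parameters and the set_write plumbing here are invented:

	/*
	 * Hypothetical: return the number of PTEs from pte onwards that can be
	 * committed identically - same folio, compatible PTE bits and, for the
	 * private anon case, the same PageAnonExclusive() value - reporting via
	 * *set_write whether that run can be made writable.
	 */
	nr_ptes = mprotect_folio_pte_batch(vma, folio, addr, pte, oldpte,
					   max_nr_ptes, cp_flags, &set_write);

	if (set_write)
		ptent = pte_mkwrite(ptent, vma);

	modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
	if (pte_needs_flush(oldpte, ptent))
		tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);

So change_pte_range() stays a single flat loop and all the sub-batch logic
lives behind that one call.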


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-07-01  8:06         ` Lorenzo Stoakes
@ 2025-07-01  8:23           ` Ryan Roberts
  2025-07-01  8:34             ` Lorenzo Stoakes
  0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-07-01  8:23 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 01/07/2025 09:06, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 08:33:32AM +0100, Ryan Roberts wrote:
>> On 01/07/2025 05:44, Dev Jain wrote:
>>>
>>> On 30/06/25 6:27 pm, Lorenzo Stoakes wrote:
>>>> On Sat, Jun 28, 2025 at 05:04:33PM +0530, Dev Jain wrote:
>>>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>>>> Architecture can override these helpers; in case not, they are implemented
>>>>> as a simple loop over the corresponding single pte helpers.
>>>>>
>>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> Looks generally sensible! Some comments below.
>>>>
>>>>> ---
>>>>>   include/linux/pgtable.h | 83 ++++++++++++++++++++++++++++++++++++++++-
>>>>>   mm/mprotect.c           |  4 +-
>>>>>   2 files changed, 84 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>> index cf1515c163e2..662f39e7475a 100644
>>>>> --- a/include/linux/pgtable.h
>>>>> +++ b/include/linux/pgtable.h
>>>>> @@ -1331,7 +1331,8 @@ static inline pte_t ptep_modify_prot_start(struct
>>>>> vm_area_struct *vma,
>>>>>
>>>>>   /*
>>>>>    * Commit an update to a pte, leaving any hardware-controlled bits in
>>>>> - * the PTE unmodified.
>>>>> + * the PTE unmodified. The pte may have been "upgraded" w.r.t a/d bits compared
>>>>> + * to the old_pte, as in, it may have a/d bits on which were off in old_pte.
>>>>>    */
>>>>>   static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>>>>                          unsigned long addr,
>>>>> @@ -1340,6 +1341,86 @@ static inline void ptep_modify_prot_commit(struct
>>>>> vm_area_struct *vma,
>>>>>       __ptep_modify_prot_commit(vma, addr, ptep, pte);
>>>>>   }
>>>>>   #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>>>>> +
>>>>> +/**
>>>>> + * modify_prot_start_ptes - Start a pte protection read-modify-write
>>>>> transaction
>>>>> + * over a batch of ptes, which protects against asynchronous hardware
>>>>> + * modifications to the ptes. The intention is not to prevent the hardware from
>>>>> + * making pte updates, but to prevent any updates it may make from being lost.
>>>>> + * Please see the comment above ptep_modify_prot_start() for full description.
>>>>> + *
>>>>> + * @vma: The virtual memory area the pages are mapped into.
>>>>> + * @addr: Address the first page is mapped at.
>>>>> + * @ptep: Page table pointer for the first entry.
>>>>> + * @nr: Number of entries.
>>>>> + *
>>>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>>>> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
>>>>> + * in the batch.
>>>>> + *
>>>>> + * Note that PTE bits in the PTE batch besides the PFN can differ.
>>>>> + *
>>>>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>>>>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>>>>> + * Since the batch is determined from folio_pte_batch, the PTEs must differ
>>>>> + * only in a/d bits (and the soft dirty bit; see fpb_t flags in
>>>>> + * mprotect_folio_pte_batch()).
>>>>> + */
>>>>> +#ifndef modify_prot_start_ptes
>>>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>>>> +        unsigned long addr, pte_t *ptep, unsigned int nr)
>>>>> +{
>>>>> +    pte_t pte, tmp_pte;
>>>>> +
>>>>> +    pte = ptep_modify_prot_start(vma, addr, ptep);
>>>>> +    while (--nr) {
>>>>> +        ptep++;
>>>>> +        addr += PAGE_SIZE;
>>>>> +        tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>>>> +        if (pte_dirty(tmp_pte))
>>>>> +            pte = pte_mkdirty(pte);
>>>>> +        if (pte_young(tmp_pte))
>>>>> +            pte = pte_mkyoung(pte);
>>>>> +    }
>>>>> +    return pte;
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>> +/**
>>>>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
>>>>> + * hardware-controlled bits in the PTE unmodified.
>>>>> + *
>>>>> + * @vma: The virtual memory area the pages are mapped into.
>>>>> + * @addr: Address the first page is mapped at.
>>>>> + * @ptep: Page table pointer for the first entry.
>>>>> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
>>>>> + * @pte: New page table entry to be set.
>>>>> + * @nr: Number of entries.
>>>>> + *
>>>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>>>> + * loop over ptep_modify_prot_commit().
>>>>> + *
>>>>> + * Context: The caller holds the page table lock. The PTEs are all in the same
>>>>> + * PMD. On exit, the set ptes in the batch map the same folio. The pte may have
>>>>> + * been "upgraded" w.r.t a/d bits compared to the old_pte, as in, it may have
>>>>> + * a/d bits on which were off in old_pte.
>>>>> + */
>>>>> +#ifndef modify_prot_commit_ptes
>>>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
>>>>> unsigned long addr,
>>>>> +        pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>>>>> +{
>>>>> +    int i;
>>>>> +
>>>>> +    for (i = 0; i < nr; ++i) {
>>>>> +        ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>>>> +        ptep++;
>>>> Weird place to put this increment, maybe just stick it in the for loop.
>>>>
>>>>> +        addr += PAGE_SIZE;
>>>> Same comment here.
>>>
>>> Sure.
>>>
>>>>
>>>>> +        old_pte = pte_next_pfn(old_pte);
>>>> Could be:
>>>>
>>>>         old_pte = pte;
>>>>
>>>> No?
>>>
>>> We will need to update old_pte also since that
>>> is used by powerpc in radix__ptep_modify_prot_commit().
>>
>> I think perhaps Lorenzo has the model in his head where old_pte is the previous
>> pte in the batch. That's not the case. old_pte is the value of the pte in the
>> current position of the batch before any changes were made. pte is the new value
>> for the pte. So we need to expliticly advance the PFN in both old_pte and pte
>> each iteration round the loop.
> 
> Yeah, you're right, apologies, I'd misinterpreted.
> 
> I really, really, really hate how all this is implemented. This is obviously an
> mprotect() and legacy thing but it's almost designed for confusion. Not the
> fault of this series, and todo++ on improving mprotect as a whole (been on my
> list for a while...)

Agreed. I struggled for a long time with some of the pgtable helper abstractions
to the arch and all the assumptions they make. But ultimately all Dev is trying
to do here is make some incremental improvements, following the established
patterns. Hopefully you agree that cleanups on a larger scale should be reserved
for a systematic, focussed series.

> 
> So we're ultimately updating ptep (this thing that we update, of course, is
> buried in the middle of the function invocation) in:
> 
> 	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> 
> We are setting *ptep++ = pte essentially (roughly speaking) right?

Yeah, pretty much.

The API was originally created for Xen IIRC. The problem is that the HW can
update the A/D bits asynchronously if the PTE is valid (from the HW perspective)
so the previous approach was to get_and_clear (atomic), modify, write. But that
required 2 Xen hypervisor calls per PTE. This start/commit approach allows Xen
to both avoid the get_and_clear() and batch the writes for all PTEs in a lazy
mmu batch. So hypervisor calls are reduced from 2 per PTE to 1 per lazy mmu
batch. TBH I'm no Xen expert; some of those details may be off, but the big picture
is correct.

Anyway, arm64 doesn't care about any of that, but it does override
ptep_modify_prot_start() / ptep_modify_prot_commit() to implement an erratum
workaround. And it can benefit substantially from batching.
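
For reference, my recollection is that the generic fallback (when the arch
doesn't define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION) boils down to roughly
this - paraphrased from memory, so the details may be slightly off:

	static inline pte_t __ptep_modify_prot_start(struct vm_area_struct *vma,
						     unsigned long addr, pte_t *ptep)
	{
		/* Clear the PTE so HW can't asynchronously set A/D while we work. */
		return ptep_get_and_clear(vma->vm_mm, addr, ptep);
	}

	static inline void __ptep_modify_prot_commit(struct vm_area_struct *vma,
						     unsigned long addr, pte_t *ptep,
						     pte_t pte)
	{
		/* The PTE is currently non-present, so nothing can have raced us. */
		set_pte_at(vma->vm_mm, addr, ptep, pte);
	}

i.e. the "transaction" is just a get_and_clear/set pair unless the arch (or
Xen) plugs in something smarter.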

> 
> And the arch needs to know about any bits that have changed I guess hence
> providing old_pte as well right?
> 
> OK so yeah, I get it now, we're not actually advancing through ptes here, we're
> just advancing the PFN and applying the same 'template'.
> 
> How about something like:
> 
> static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> 	       pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> {
> 	int i;
> 
> 	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE) {
> 		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> 
> 		/* Advance PFN only, set same flags. */
> 		old_pte = pte_next_pfn(old_pte);
> 		pte = pte_next_pfn(pte);
> 	}
> }
> 
> Neatens it up a bit and makes it clear that we're effectively propagating the
> flags here.

Yes, except we don't usually refer to the non-pfn parts of a pte as "flags". We
normally call them pgprot or prot. God knows why...

> 
>>
>>>
>>>>
>>>>> +        pte = pte_next_pfn(pte);
>>>>> +    }
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>>   #endif /* CONFIG_MMU */
>>>>>
>>>>>   /*
>>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>>> index af10a7fbe6b8..627b0d67cc4a 100644
>>>>> --- a/mm/mprotect.c
>>>>> +++ b/mm/mprotect.c
>>>>> @@ -206,7 +206,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>>                       continue;
>>>>>               }
>>>>>
>>>>> -            oldpte = ptep_modify_prot_start(vma, addr, pte);
>>>>> +            oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>>>>               ptent = pte_modify(oldpte, newprot);
>>>>>
>>>>>               if (uffd_wp)
>>>>> @@ -232,7 +232,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>>                   can_change_pte_writable(vma, addr, ptent))
>>>>>                   ptent = pte_mkwrite(ptent, vma);
>>>>>
>>>>> -            ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>>>>> +            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>>>>               if (pte_needs_flush(oldpte, ptent))
>>>>>                   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>>>>               pages++;
>>>>> --
>>>>> 2.30.2
>>>>>
>>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-01  8:06       ` Dev Jain
@ 2025-07-01  8:24         ` Ryan Roberts
  0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-07-01  8:24 UTC (permalink / raw)
  To: Dev Jain, Lorenzo Stoakes
  Cc: akpm, david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, vbabka, jannh, anshuman.khandual, peterx,
	joey.gouly, ioworker0, baohua, kevin.brodsky, quic_zhenhuah,
	christophe.leroy, yangyicong, linux-arm-kernel, hughd, yang, ziy

On 01/07/2025 09:06, Dev Jain wrote:
> 
> On 01/07/25 1:33 pm, Ryan Roberts wrote:
>> On 30/06/2025 13:52, Lorenzo Stoakes wrote:
>>> On Sat, Jun 28, 2025 at 05:04:34PM +0530, Dev Jain wrote:
>>>> Use folio_pte_batch to batch process a large folio. Reuse the folio from
>>>> prot_numa case if possible.
>>>>
>>>> For all cases other than the PageAnonExclusive case, if the case holds true
>>>> for one pte in the batch, one can confirm that that case will hold true for
>>>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>>>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty
>>>> and access bits across the batch, therefore batching across
>>>> pte_dirty(): this is correct since the dirty bit on the PTE really is
>>>> just an indication that the folio got written to, so even if the PTE is
>>>> not actually dirty (but one of the PTEs in the batch is), the wp-fault
>>>> optimization can be made.
>>>>
>>>> The crux now is how to batch around the PageAnonExclusive case; we must
>>>> check the corresponding condition for every single page. Therefore, from
>>>> the large folio batch, we process sub batches of ptes mapping pages with
>>>> the same PageAnonExclusive condition, and process that sub batch, then
>>>> determine and process the next sub batch, and so on. Note that this does
>>>> not cause any extra overhead; if suppose the size of the folio batch
>>>> is 512, then the sub batch processing in total will take 512 iterations,
>>>> which is the same as what we would have done before.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>   mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
>>>>   1 file changed, 117 insertions(+), 26 deletions(-)
>>>>
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index 627b0d67cc4a..28c7ce7728ff 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -40,35 +40,47 @@
>>>>
>>>>   #include "internal.h"
>>>>
>>>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> -                 pte_t pte)
>>>> -{
>>>> -    struct page *page;
>>>> +enum tristate {
>>>> +    TRI_FALSE = 0,
>>>> +    TRI_TRUE = 1,
>>>> +    TRI_MAYBE = -1,
>>>> +};
>>> Yeah no, absolutely not, this is horrible, I don't want to see an arbitrary type
>>> like this added, to a random file, and I absolutely think this adds confusion
>>> and does not in any way help clarify things.
>>>
>>>> +/*
>>>> + * Returns enum tristate indicating whether the pte can be changed to
>>>> writable.
>>>> + * If TRI_MAYBE is returned, then the folio is anonymous and the user must
>>>> + * additionally check PageAnonExclusive() for every page in the desired range.
>>>> + */
>>>> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
>>>> +                     unsigned long addr, pte_t pte,
>>>> +                     struct folio *folio)
>>>> +{
>>>>       if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>>>> -        return false;
>>>> +        return TRI_FALSE;
>>>>
>>>>       /* Don't touch entries that are not even readable. */
>>>>       if (pte_protnone(pte))
>>>> -        return false;
>>>> +        return TRI_FALSE;
>>>>
>>>>       /* Do we need write faults for softdirty tracking? */
>>>>       if (pte_needs_soft_dirty_wp(vma, pte))
>>>> -        return false;
>>>> +        return TRI_FALSE;
>>>>
>>>>       /* Do we need write faults for uffd-wp tracking? */
>>>>       if (userfaultfd_pte_wp(vma, pte))
>>>> -        return false;
>>>> +        return TRI_FALSE;
>>>>
>>>>       if (!(vma->vm_flags & VM_SHARED)) {
>>>>           /*
>>>>            * Writable MAP_PRIVATE mapping: We can only special-case on
>>>>            * exclusive anonymous pages, because we know that our
>>>>            * write-fault handler similarly would map them writable without
>>>> -         * any additional checks while holding the PT lock.
>>>> +         * any additional checks while holding the PT lock. So if the
>>>> +         * folio is not anonymous, we know we cannot change pte to
>>>> +         * writable. If it is anonymous then the caller must further
>>>> +         * check that the page is AnonExclusive().
>>>>            */
>>>> -        page = vm_normal_page(vma, addr, pte);
>>>> -        return page && PageAnon(page) && PageAnonExclusive(page);
>>>> +        return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
>>>>       }
>>>>
>>>>       VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>>>> @@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
>>>> unsigned long addr,
>>>>        * FS was already notified and we can simply mark the PTE writable
>>>>        * just like the write-fault handler would do.
>>>>        */
>>>> -    return pte_dirty(pte);
>>>> +    return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
>>>> +}
>>> Yeah not a fan of this at all.
>>>
>>> This is squashing all the logic into one place when we don't really need to.
>>>
>>> We can separate out the shared logic and just do something like:
>>>
>>> ////// Lorenzo's suggestion //////
>>>
>>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> -                 pte_t pte)
>>> +static bool maybe_change_pte_writable(struct vm_area_struct *vma,
>>> +        pte_t pte)
>>>   {
>>> -    struct page *page;
>>> -
>>>       if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>>>           return false;
>>>
>>> @@ -60,16 +58,14 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
>>> unsigned long addr,
>>>       if (userfaultfd_pte_wp(vma, pte))
>>>           return false;
>>>
>>> -    if (!(vma->vm_flags & VM_SHARED)) {
>>> -        /*
>>> -         * Writable MAP_PRIVATE mapping: We can only special-case on
>>> -         * exclusive anonymous pages, because we know that our
>>> -         * write-fault handler similarly would map them writable without
>>> -         * any additional checks while holding the PT lock.
>>> -         */
>>> -        page = vm_normal_page(vma, addr, pte);
>>> -        return page && PageAnon(page) && PageAnonExclusive(page);
>>> -    }
>>> +    return true;
>>> +}
>>> +
>>> +static bool can_change_shared_pte_writable(struct vm_area_struct *vma,
>>> +        pte_t pte)
>>> +{
>>> +    if (!maybe_change_pte_writable(vma, pte))
>>> +        return false;
>>>
>>>       VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>>>
>>> @@ -83,6 +79,33 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
>>> unsigned long addr,
>>>       return pte_dirty(pte);
>>>   }
>>>
>>> +static bool can_change_private_pte_writable(struct vm_area_struct *vma,
>>> +        unsigned long addr, pte_t pte)
>>> +{
>>> +    struct page *page;
>>> +
>>> +    if (!maybe_change_pte_writable(vma, pte))
>>> +        return false;
>>> +
>>> +    /*
>>> +     * Writable MAP_PRIVATE mapping: We can only special-case on
>>> +     * exclusive anonymous pages, because we know that our
>>> +     * write-fault handler similarly would map them writable without
>>> +     * any additional checks while holding the PT lock.
>>> +     */
>>> +    page = vm_normal_page(vma, addr, pte);
>>> +    return page && PageAnon(page) && PageAnonExclusive(page);
>>> +}
>>> +
>>> +bool can_change_pte_writable(struct vm_area_struct *vma,
>>> +        unsigned long addr, pte_t pte)
>>> +{
>>> +    if (vma->vm_flags & VM_SHARED)
>>> +        return can_change_shared_pte_writable(vma, pte);
>>> +
>>> +    return can_change_private_pte_writable(vma, addr, pte);
>>> +}
>>> +
>>>
>>> ////// end of Lorenzo's suggestion //////
>>>
>>> You can obviously modify this to change other stuff like whether you feed back
>>> the PAE or not in private case for use in your code.
>> This sugestion for this part of the problem looks much cleaner!
>>
>> Sorry; this whole struct tristate thing was my idea. I never really liked it but
>> I was more focussed on trying to illustrate the big picture flow that I thought
>> would work well with a batch and sub-batches (which it seems below that you
>> hate... but let's talk about that down there).
>>
>>>> +
>>>> +/*
>>>> + * Returns the number of pages within the folio, starting from the page
>>>> + * indicated by pgidx and up to pgidx + max_nr, that have the same value of
>>>> + * PageAnonExclusive(). Must only be called for anonymous folios. Value of
>>>> + * PageAnonExclusive() is returned in *exclusive.
>>>> + */
>>>> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
>>>> +                bool *exclusive)
>>> Let's generalise it to something like count_folio_fungible_pages()
>>>
>>> or maybe count_folio_batchable_pages()?
>>>
>>> Yes naming is hard... :P but right now it reads like this is returning a batch
>>> or doing something with a batch.
>>>
>>>> +{
>>>> +    struct page *page;
>>>> +    int nr = 1;
>>>> +
>>>> +    if (!folio) {
>>>> +        *exclusive = false;
>>>> +        return nr;
>>>> +    }
>>>> +
>>>> +    page = folio_page(folio, pgidx++);
>>>> +    *exclusive = PageAnonExclusive(page);
>>>> +    while (nr < max_nr) {
>>> The C programming language asks why you don't like using for :)
>>>
>>>> +        page = folio_page(folio, pgidx++);
>>>> +        if ((*exclusive) != PageAnonExclusive(page))
>>>> +            break;
>>>> +        nr++;
>>> This *exclusive stuff makes me want to cry :)
>>>
>>> Just set a local variable and hand it back at the end.
>>>
>>>> +    }
>>>> +
>>>> +    return nr;
>>>> +}
>>>> +
>>>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> +                 pte_t pte)
>>>> +{
>>>> +    struct page *page;
>>>> +    int ret;
>>>> +
>>>> +    ret = maybe_change_pte_writable(vma, addr, pte, NULL);
>>>> +    if (ret == TRI_MAYBE) {
>>>> +        page = vm_normal_page(vma, addr, pte);
>>>> +        ret = page && PageAnon(page) && PageAnonExclusive(page);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>>   }
>>> See above comments on this stuff.
>>>
>>>>   static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>>>> -        pte_t *ptep, pte_t pte, int max_nr_ptes)
>>>> +        pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
>>> This last parameter is pretty horrible. It's a negative mask so now you're
>>> passing 'ignore soft dirty' to the function meaning 'don't ignore it'. This is
>>> just planting land mines.
>>>
>>> Obviously David's flag changes will also alter all this.
>>>
>>> Just add a boolean re: soft dirty.
>> Dev had a boolean for this in the last round. I've seen various functions expand
>> over time with increasing numbers of bool flags. So I asked to convert to a
>> flags parameter and just pass in the flags we need. Then it's a bit more future
>> proof and self documenting. (For the record I dislike the "switch_off_flags"
>> approach taken here).
>>
>>>>   {
>>>> -    const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>> +    fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>> +
>>>> +    flags &= ~switch_off_flags;
>>>>
>>>> -    if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>>> +    if (!folio || !folio_test_large(folio))
>>>>           return 1;
>>> Why remove this last check?
>>>
>>>>       return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>>> @@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop,
>>>> struct vm_area_struct *vma
>>>>       }
>>>>
>>>>   skip_batch:
>>>> -    nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
>>>> +    nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>>>> +                       max_nr_ptes, 0);
>>> See above about flag param. If you change to boolean, please prefix this with
>>> e.g. /* set_soft_dirty= */ true or whatever the flag ends up being :)
>>>
>>>>   out:
>>>>       *foliop = folio;
>>>>       return nr_ptes;
>>>> @@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>           if (pte_present(oldpte)) {
>>>>               int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>>>               struct folio *folio = NULL;
>>>> -            pte_t ptent;
>>>> +            int sub_nr_ptes, pgidx = 0;
>>>> +            pte_t ptent, newpte;
>>>> +            bool sub_set_write;
>>>> +            int set_write;
>>>>
>>>>               /*
>>>>                * Avoid trapping faults against the zero or KSM
>>>> @@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>                       continue;
>>>>               }
>>>>
>>>> +            if (!folio)
>>>> +                folio = vm_normal_folio(vma, addr, oldpte);
>>>> +
>>>> +            nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
>>>> +                               max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
>>> Don't we only care about S/D if pte_needs_soft_dirty_wp()?
>>>
>>>>               oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>>>               ptent = pte_modify(oldpte, newprot);
>>>>
>>>> @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>                * example, if a PTE is already dirty and no other
>>>>                * COW or special handling is required.
>>>>                */
>>>> -            if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>> -                !pte_write(ptent) &&
>>>> -                can_change_pte_writable(vma, addr, ptent))
>>>> -                ptent = pte_mkwrite(ptent, vma);
>>>> -
>>>> -            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>>> -            if (pte_needs_flush(oldpte, ptent))
>>>> -                tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>>> -            pages++;
>>>> +            set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>> +                    !pte_write(ptent);
>>>> +            if (set_write)
>>>> +                set_write = maybe_change_pte_writable(vma, addr, ptent,
>>>> folio);
>>>> +
>>>> +            while (nr_ptes) {
>>>> +                if (set_write == TRI_MAYBE) {
>>>> +                    sub_nr_ptes = anon_exclusive_batch(folio,
>>>> +                        pgidx, nr_ptes, &sub_set_write);
>>>> +                } else {
>>>> +                    sub_nr_ptes = nr_ptes;
>>>> +                    sub_set_write = (set_write == TRI_TRUE);
>>>> +                }
>>>> +
>>>> +                if (sub_set_write)
>>>> +                    newpte = pte_mkwrite(ptent, vma);
>>>> +                else
>>>> +                    newpte = ptent;
>>>> +
>>>> +                modify_prot_commit_ptes(vma, addr, pte, oldpte,
>>>> +                            newpte, sub_nr_ptes);
>>>> +                if (pte_needs_flush(oldpte, newpte))
>>>> +                    tlb_flush_pte_range(tlb, addr,
>>>> +                        sub_nr_ptes * PAGE_SIZE);
>>>> +
>>>> +                addr += sub_nr_ptes * PAGE_SIZE;
>>>> +                pte += sub_nr_ptes;
>>>> +                oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>>>> +                ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>>>> +                nr_ptes -= sub_nr_ptes;
>>>> +                pages += sub_nr_ptes;
>>>> +                pgidx += sub_nr_ptes;
>>>> +            }
>>> I hate hate hate having this loop here, let's abstract this please.
>>>
>>> I mean I think we can just use mprotect_folio_pte_batch() no? It's not
>>> abstracting much here, and we can just do all this handling there. Maybe have to
>>> pass in a bunch more params, but it saves us having to do all this.
>> In an ideal world we would flatten and just have mprotect_folio_pte_batch()
>> return the batch size considering all the relevant PTE bits AND the
>> AnonExclusive bit on the pages. IIRC one of Dev's earlier versions modified the
>> core folio_pte_batch() function to also look at the AnonExclusive bit, but I
>> really disliked changing that core function (I think others did too?).
> 
> That patch was in our private exchange, not on the list.

Ahh, my bad. Well perhaps that type of approach is what Lorenzo is arguing in
favour of then? Perhaps I should just shut up and get out of the way :)

> 
>>
>> So barring that approach, we are really only left with the batch and sub-batch
>> approach - although, yes, it could be abstracted more. We could maintain a
>> context struct that persists across all calls to mprotect_folio_pte_batch() and
>> it can use that to keep it's state to remember if we are in the middle of a
>> sub-batch and decide either to call folio_pte_batch() to get a new batch, or
>> call anon_exclusive_batch() to get the next sub-batch within the current batch.
>> But that started to feel overly abstracted to me.
>>
>> This loop approach, as written, felt more straightforward for the reader to
>> understand (i.e. the least-worst option). Is the context approach what you are
>> suggesting or do you have something else in mind?
>>
>>> Alternatively, we could add a new wrapper function, but yeah definitely not
>>> this.
>>>
>>> Also the C programming language asks... etc etc. ;)
>>>
>>> Overall since you always end up processing folio_nr_pages(folio) you can just
>>> have the batch function or a wrapper return this and do updates as necessary
>>> here on that basis, and leave the 'sub' batching to that function.
>> Sorry I don't understand this statement - could you clarify? Especially the bit
>> about "always ... processing folio_nr_pages(folio)"; I don't think we do. In
>> various corner cases the size of the folio has no relationship to the way the
>> PTEs are mapped.
>>
>> Thanks,
>> Ryan
>>
>>>
>>>>           } else if (is_swap_pte(oldpte)) {
>>>>               swp_entry_t entry = pte_to_swp_entry(oldpte);
>>>>               pte_t newpte;
>>>> -- 
>>>> 2.30.2
>>>>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-01  8:15       ` Lorenzo Stoakes
@ 2025-07-01  8:30         ` Ryan Roberts
  2025-07-01  8:51           ` Lorenzo Stoakes
  0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-07-01  8:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 01/07/2025 09:15, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 09:03:27AM +0100, Ryan Roberts wrote:
>>>
>>> ////// end of Lorenzo's suggestion //////
>>>
>>> You can obviously modify this to change other stuff like whether you feed back
>>> the PAE or not in private case for use in your code.
>>
>> This sugestion for this part of the problem looks much cleaner!
> 
> Thanks :)
> 
>>
>> Sorry; this whole struct tristate thing was my idea. I never really liked it but
>> I was more focussed on trying to illustrate the big picture flow that I thought
>> would work well with a batch and sub-batches (which it seems below that you
>> hate... but let's talk about that down there).
> 
> Yeah, this is fiddly stuff so I get it as a sort of psuedocode, but as code
> obviously I've made my feelings known haha.
> 
> It seems that we can apply the fundamental underlying idea without needing to do
> it this way at any rate so we should be good.
> 
>>>>
>>>>  static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>>>> -		pte_t *ptep, pte_t pte, int max_nr_ptes)
>>>> +		pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
>>>
>>> This last parameter is pretty horrible. It's a negative mask so now you're
>>> passing 'ignore soft dirty' to the function meaning 'don't ignore it'. This is
>>> just planting land mines.
>>>
>>> Obviously David's flag changes will also alter all this.
>>>
>>> Just add a boolean re: soft dirty.
>>
>> Dev had a boolean for this in the last round. I've seen various functions expand
>> over time with increasing numbers of bool flags. So I asked to convert to a
>> flags parameter and just pass in the flags we need. Then it's a bit more future
>> proof and self documenting. (For the record I dislike the "switch_off_flags"
>> approach taken here).
> 
> Yeah, but we can change this when it needs to be changed. When it comes to
> internal non-uAPI stuff I don't think we need to be too worried about
> future-proofing like this at least just yet.
> 
> Do not fear the future churn... ;)
> 
> I mean I guess David's new flags will make this less egregious anyway.
> 
>>>>  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>>>  			ptent = pte_modify(oldpte, newprot);
>>>>
>>>> @@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>  			 * example, if a PTE is already dirty and no other
>>>>  			 * COW or special handling is required.
>>>>  			 */
>>>> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>> -			    !pte_write(ptent) &&
>>>> -			    can_change_pte_writable(vma, addr, ptent))
>>>> -				ptent = pte_mkwrite(ptent, vma);
>>>> -
>>>> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>>> -			if (pte_needs_flush(oldpte, ptent))
>>>> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>>> -			pages++;
>>>> +			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>> +				    !pte_write(ptent);
>>>> +			if (set_write)
>>>> +				set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
>>>> +
>>>> +			while (nr_ptes) {
>>>> +				if (set_write == TRI_MAYBE) {
>>>> +					sub_nr_ptes = anon_exclusive_batch(folio,
>>>> +						pgidx, nr_ptes, &sub_set_write);
>>>> +				} else {
>>>> +					sub_nr_ptes = nr_ptes;
>>>> +					sub_set_write = (set_write == TRI_TRUE);
>>>> +				}
>>>> +
>>>> +				if (sub_set_write)
>>>> +					newpte = pte_mkwrite(ptent, vma);
>>>> +				else
>>>> +					newpte = ptent;
>>>> +
>>>> +				modify_prot_commit_ptes(vma, addr, pte, oldpte,
>>>> +							newpte, sub_nr_ptes);
>>>> +				if (pte_needs_flush(oldpte, newpte))
>>>> +					tlb_flush_pte_range(tlb, addr,
>>>> +						sub_nr_ptes * PAGE_SIZE);
>>>> +
>>>> +				addr += sub_nr_ptes * PAGE_SIZE;
>>>> +				pte += sub_nr_ptes;
>>>> +				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>>>> +				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>>>> +				nr_ptes -= sub_nr_ptes;
>>>> +				pages += sub_nr_ptes;
>>>> +				pgidx += sub_nr_ptes;
>>>> +			}
>>>
>>> I hate hate hate having this loop here, let's abstract this please.
>>>
>>> I mean I think we can just use mprotect_folio_pte_batch() no? It's not
>>> abstracting much here, and we can just do all this handling there. Maybe have to
>>> pass in a bunch more params, but it saves us having to do all this.
>>
>> In an ideal world we would flatten and just have mprotect_folio_pte_batch()
>> return the batch size considering all the relevant PTE bits AND the
>> AnonExclusive bit on the pages. IIRC one of Dev's earlier versions modified the
>> core folio_pte_batch() function to also look at the AnonExclusive bit, but I
>> really disliked changing that core function (I think others did too?).
> 
> Yeah let's not change the core function.
> 
> My suggestion is to have mprotect_folio_pte_batch() do this.
> 
>>
>> So barring that approach, we are really only left with the batch and sub-batch
>> approach - although, yes, it could be abstracted more. We could maintain a
>> context struct that persists across all calls to mprotect_folio_pte_batch() and
>> it can use that to keep it's state to remember if we are in the middle of a
>> sub-batch and decide either to call folio_pte_batch() to get a new batch, or
>> call anon_exclusive_batch() to get the next sub-batch within the current batch.
>> But that started to feel overly abstracted to me.
> 
> Having this nested batch/sub-batch loop really feels worse. You just get lost in
> the complexity here very easily.
> 
> But i"m also not sure we need to maintain _that_ much state?
> 
> We're already looping over all of the PTEs here, so abstracting _the entire
> loop_ and all the sub-batch stuff to another function, that is
> mprotect_folio_pte_batch() I think sensibly, so it handles this for you makes a
> ton of sense.

So effectively turn mprotect_folio_pte_batch() into an iterator; have a struct
and a function to init the struct for the number of ptes we want to iterate
over, then a per-iteration function that progressively returns batches?

Then we just have a simple loop here that gets the next batch and processes it?
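
Something like this, perhaps (completely untested sketch, all the names are
invented just to show the shape):

	struct prot_batch_state {
		int batch_left;	/* PTEs remaining in the current folio_pte_batch() run */
		int pgidx;	/* next page index within the folio */
	};

	/*
	 * Return the size of the next sub-batch and whether it can be made
	 * writable: call folio_pte_batch() when batch_left is exhausted,
	 * otherwise anon_exclusive_batch() within the current run.
	 */
	static int prot_batch_next(struct prot_batch_state *state, struct folio *folio,
				   unsigned long addr, pte_t *ptep, pte_t pte,
				   int max_nr_ptes, bool *set_write);

Then the caller just repeatedly asks for the next sub-batch and processes it,
with no nested sub-batch handling in change_pte_range() itself.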

> 
>>
>> This loop approach, as written, felt more straightforward for the reader to
>> understand (i.e. the least-worst option). Is the context approach what you are
>> suggesting or do you have something else in mind?
>>
> 
> See above.
> 
>>>
>>> Alternatively, we could add a new wrapper function, but yeah definitely not
>>> this.
>>>
>>> Also the C programming language asks... etc etc. ;)
>>>
>>> Overall since you always end up processing folio_nr_pages(folio) you can just
>>> have the batch function or a wrapper return this and do updates as necessary
>>> here on that basis, and leave the 'sub' batching to that function.
>>
>> Sorry I don't understand this statement - could you clarify? Especially the bit
>> about "always ... processing folio_nr_pages(folio)"; I don't think we do. In
>> various corner cases the size of the folio has no relationship to the way the
>> PTEs are mapped.
> 
> Right yeah I put this badly. Obviously you can have all sorts of fun with large
> folios partially mapped and page-table split and etc. etc.
> 
> I should have said 'always process nr_ptes'.
> 
> The idea is to abstract this sub-batch stuff to another function, fundamentally.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-07-01  8:23           ` Ryan Roberts
@ 2025-07-01  8:34             ` Lorenzo Stoakes
  0 siblings, 0 replies; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01  8:34 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Tue, Jul 01, 2025 at 09:23:05AM +0100, Ryan Roberts wrote:
> >>>>> +#ifndef modify_prot_commit_ptes
> >>>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
> >>>>> unsigned long addr,
> >>>>> +        pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> >>>>> +{
> >>>>> +    int i;
> >>>>> +
> >>>>> +    for (i = 0; i < nr; ++i) {
> >>>>> +        ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >>>>> +        ptep++;
> >>>> Weird place to put this increment, maybe just stick it in the for loop.
> >>>>
> >>>>> +        addr += PAGE_SIZE;
> >>>> Same comment here.
> >>>
> >>> Sure.
> >>>
> >>>>
> >>>>> +        old_pte = pte_next_pfn(old_pte);
> >>>> Could be:
> >>>>
> >>>>         old_pte = pte;
> >>>>
> >>>> No?
> >>>
> >>> We will need to update old_pte also since that
> >>> is used by powerpc in radix__ptep_modify_prot_commit().
> >>
> >> I think perhaps Lorenzo has the model in his head where old_pte is the previous
> >> pte in the batch. That's not the case. old_pte is the value of the pte in the
> >> current position of the batch before any changes were made. pte is the new value
> >> for the pte. So we need to expliticly advance the PFN in both old_pte and pte
> >> each iteration round the loop.
> >
> > Yeah, you're right, apologies, I'd misinterpreted.
> >
> > I really, really, really hate how all this is implemented. This is obviously an
> > mprotect() and legacy thing but it's almost designed for confusion. Not the
> > fault of this series, and todo++ on improving mprotect as a whole (been on my
> > list for a while...)
>
> Agreed. I struggled for a long time with some of the pgtable helper abstractions
> to the arch and all the assumptions they make. But ultimately all Dev is trying
> to do here is make some incremental improvements, following the established
> patterns. Hopefully you agree that cleanups on a larger scale should be reserved
> for a systematic, focussed series.

Totally agree - when I mention my distaste for existing logic, see those as
asides; I'm not asking the series to be blocked until that's fixed :)

I'm happy for us to take Dev's changes obviously once review issues are
resolved.

I think my suggestion below helps get us to a good compromise (modulo
mm's beautifully overloaded/confusing use of terminology :>)

>
> >
> > So we're ultimately updating ptep (this thing that we update, of course, is
> > buried in the middle of the function invocation) in:
> >
> > 	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >
> > We are setting *ptep++ = pte essentially (roughly speaking) right?
>
> Yeah, pretty much.
>
> The API was originally created for Xen IIRC. The problem is that the HW can
> update the A/D bits asynchronously if the PTE is valid (from the HW perspective)
> so the previous approach was to get_and_clear (atomic), modify, write. But that
> required 2 Xen hypervisor calls per PTE. This start/commit approach allows Xen
> to both avoid the get_and_clear() and batch the writes for all PTEs in a lazy
> mmu batch. So hypervisor calls are reduced from 2 per PTE to 1 per lazy mmu
> batch. TBH I'm no Xen expert; some of those details may be off, but big picture
> is correct.

Yeah, here we go again with some horror show stuff on Xen's behalf. I've played
Half Life, so I already know to fear Xen ;)

I believe David has _thoughts_ on this :)

Again this is aside stuff.

>
> Anyway, arm64 doesn't care about any of that, but it does override
> ptep_modify_prot_start() / ptep_modify_prot_commit() to implement an erratum
> workaround. And it can benefit substantially from batching.

Ack. And of course, the batching part is why we're all here!

>
> >
> > And the arch needs to know about any bits that have changed I guess hence
> > providing old_pte as well right?
> >
> > OK so yeah, I get it now, we're not actually advancing through ptes here, we're
> > just advancing the PFN and applying the same 'template'.
> >
> > How about something like:
> >
> > static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> > 	       pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> > {
> > 	int i;
> >
> > 	for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE) {
> > 		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >
> > 		/* Advance PFN only, set same flags. */
> > 		old_pte = pte_next_pfn(old_pte);
> > 		pte = pte_next_pfn(pte);
> > 	}
> > }
> >
> > Neatens it up a bit and makes it clear that we're effectively propagating the
> > flags here.
>
> Yes, except we don't usually refer to the non-pfn parts of a pte as "flags". We
> normally call them pgprot or prot. God knows why...

Ah of course we love to do this kind of thing to ourselves :>)

Dev - suggestion above then but s/flags/prot/.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-01  8:30         ` Ryan Roberts
@ 2025-07-01  8:51           ` Lorenzo Stoakes
  2025-07-01  9:53             ` Ryan Roberts
  0 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01  8:51 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Tue, Jul 01, 2025 at 09:30:51AM +0100, Ryan Roberts wrote:
> >> In an ideal world we would flatten and just have mprotect_folio_pte_batch()
> >> return the batch size considering all the relevant PTE bits AND the
> >> AnonExclusive bit on the pages. IIRC one of Dev's earlier versions modified the
> >> core folio_pte_batch() function to also look at the AnonExclusive bit, but I
> >> really disliked changing that core function (I think others did too?).
> >
> > Yeah let's not change the core function.
> >
> > My suggestion is to have mprotect_folio_pte_batch() do this.
> >
> >>
> >> So barring that approach, we are really only left with the batch and sub-batch
> >> approach - although, yes, it could be abstracted more. We could maintain a
> >> context struct that persists across all calls to mprotect_folio_pte_batch() and
> >> it can use that to keep it's state to remember if we are in the middle of a
> >> sub-batch and decide either to call folio_pte_batch() to get a new batch, or
> >> call anon_exclusive_batch() to get the next sub-batch within the current batch.
> >> But that started to feel overly abstracted to me.
> >
> > Having this nested batch/sub-batch loop really feels worse. You just get lost in
> > the complexity here very easily.
> >
> > But i"m also not sure we need to maintain _that_ much state?
> >
> > We're already looping over all of the PTEs here, so abstracting _the entire
> > loop_ and all the sub-batch stuff to another function, that is
> > mprotect_folio_pte_batch() I think sensibly, so it handles this for you makes a
> > ton of sense.
>
> So effectively turn mprotect_folio_pte_batch() into an iterator; have a struct
> and a funtion to init the struct for the the number of ptes we want to iterate
> over, then a per iteration function that progressively returns batches?

Is that really necessary though?

Idea is that mprotect_folio_pte_batch() returns the nr_ptes _taking into account
the PAE stuff_.

Would this break anything?

We might need to pass a flag to say 'don't account for this' for prot numa case.
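
i.e. hypothetically something like this (the check_pae parameter is invented,
just to show what I mean):

	/* The prot_numa skip path doesn't need PAE-aware sub-batches. */
	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes,
					   flags, /* check_pae = */
					   !(cp_flags & MM_CP_PROT_NUMA));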

>
> Then we just have a simple loop here that gets the next batch and processes it?


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-01  8:51           ` Lorenzo Stoakes
@ 2025-07-01  9:53             ` Ryan Roberts
  2025-07-01 10:21               ` Lorenzo Stoakes
  0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-07-01  9:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 01/07/2025 09:51, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 09:30:51AM +0100, Ryan Roberts wrote:
>>>> In an ideal world we would flatten and just have mprotect_folio_pte_batch()
>>>> return the batch size considering all the relevant PTE bits AND the
>>>> AnonExclusive bit on the pages. IIRC one of Dev's earlier versions modified the
>>>> core folio_pte_batch() function to also look at the AnonExclusive bit, but I
>>>> really disliked changing that core function (I think others did too?).
>>>
>>> Yeah let's not change the core function.
>>>
>>> My suggestion is to have mprotect_folio_pte_batch() do this.
>>>
>>>>
>>>> So barring that approach, we are really only left with the batch and sub-batch
>>>> approach - although, yes, it could be abstracted more. We could maintain a
>>>> context struct that persists across all calls to mprotect_folio_pte_batch() and
>>>> it can use that to keep it's state to remember if we are in the middle of a
>>>> sub-batch and decide either to call folio_pte_batch() to get a new batch, or
>>>> call anon_exclusive_batch() to get the next sub-batch within the current batch.
>>>> But that started to feel overly abstracted to me.
>>>
>>> Having this nested batch/sub-batch loop really feels worse. You just get lost in
>>> the complexity here very easily.
>>>
>>> But i"m also not sure we need to maintain _that_ much state?
>>>
>>> We're already looping over all of the PTEs here, so abstracting _the entire
>>> loop_ and all the sub-batch stuff to another function, that is
>>> mprotect_folio_pte_batch() I think sensibly, so it handles this for you makes a
>>> ton of sense.
>>
>> So effectively turn mprotect_folio_pte_batch() into an iterator; have a struct
>> and a funtion to init the struct for the the number of ptes we want to iterate
>> over, then a per iteration function that progressively returns batches?
> 
> Is that really necessary though?
> 
> Idea is that mprotect_folio_pte_batch() returns the nr_ptes _taking into account
> the PAE stuff_.

The issue is the efficiency. Assuming we want to keep the PTE scan contained
within the core folio_pte_batch() function and we _don't_ want to add PAE
awareness to that function, then we have 2 separate, independent loops; one for
PTE scanning and the other for PAE scanning. If the first loop scans through and
returns 512, but then the PAE scan returns 1, we return 1. If we don't remember
for the next time that we already determined we have a PTE batch of 512 (now
511) then we will rescan the 511 PTEs and potentially return 1 again due to PAE.
Then 510, then 509...

That feels inefficient to me. So I'd much rather just remember that we have a
batch of 512, then split into sub batches as needed for PAE compliance. Then we
only scan each PTE once and each struct page once.

But to achieve this, we either need to merge the 2 loops or we need to carry
state across function calls (i.e. like an iterator).

> 
> Would this break anything?

It's not about breaking anything, it's about scanning efficiently. Perhaps you
don't think it's worth worrying about in practice?

> 
> We might need to pass a flag to say 'don't account for this' for prot numa case.

Yep, another bool ;-)

> 
>>
>> Then we just have a simple loop here that gets the next batch and processes it?



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-01  9:53             ` Ryan Roberts
@ 2025-07-01 10:21               ` Lorenzo Stoakes
  2025-07-01 11:31                 ` Ryan Roberts
  0 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01 10:21 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Tue, Jul 01, 2025 at 10:53:08AM +0100, Ryan Roberts wrote:
> On 01/07/2025 09:51, Lorenzo Stoakes wrote:
> > On Tue, Jul 01, 2025 at 09:30:51AM +0100, Ryan Roberts wrote:
> >>>> In an ideal world we would flatten and just have mprotect_folio_pte_batch()
> >>>> return the batch size considering all the relevant PTE bits AND the
> >>>> AnonExclusive bit on the pages. IIRC one of Dev's earlier versions modified the
> >>>> core folio_pte_batch() function to also look at the AnonExclusive bit, but I
> >>>> really disliked changing that core function (I think others did too?).
> >>>
> >>> Yeah let's not change the core function.
> >>>
> >>> My suggestion is to have mprotect_folio_pte_batch() do this.
> >>>
> >>>>
> >>>> So barring that approach, we are really only left with the batch and sub-batch
> >>>> approach - although, yes, it could be abstracted more. We could maintain a
> >>>> context struct that persists across all calls to mprotect_folio_pte_batch() and
> >>>> it can use that to keep it's state to remember if we are in the middle of a
> >>>> sub-batch and decide either to call folio_pte_batch() to get a new batch, or
> >>>> call anon_exclusive_batch() to get the next sub-batch within the current batch.
> >>>> But that started to feel overly abstracted to me.
> >>>
> >>> Having this nested batch/sub-batch loop really feels worse. You just get lost in
> >>> the complexity here very easily.
> >>>
> >>> But i"m also not sure we need to maintain _that_ much state?
> >>>
> >>> We're already looping over all of the PTEs here, so abstracting _the entire
> >>> loop_ and all the sub-batch stuff to another function, that is
> >>> mprotect_folio_pte_batch() I think sensibly, so it handles this for you makes a
> >>> ton of sense.
> >>
> >> So effectively turn mprotect_folio_pte_batch() into an iterator; have a struct
> >> and a funtion to init the struct for the the number of ptes we want to iterate
> >> over, then a per iteration function that progressively returns batches?
> >
> > Is that really necessary though?
> >
> > Idea is that mprotect_folio_pte_batch() returns the nr_ptes _taking into account
> > the PAE stuff_.
>
> The issue is the efficiency. Assuming we want to keep the PTE scan contained
> within the core folio_pte_batch() function and we _don't_ want to add PAE
> awareness to that function, then we have 2 separate, independent loops; one for
> PTE scanning and the other for PAE scanning. If the first loop scans through ans
> returns 512, but then the PAE scan returns 1, we return 1. If we don't remember
> for the next time that we already determined we have a PTE batch of 512 (now
> 511) then we will rescan the 511 PTEs and potentially return 1 again due to PAE.
> Then 510, then 509...

Hm really?

The idea is mprotect_folio_pte_batch() would look ahead and determine the
PAE/non-PAE sub-batch and return this nr_pages. It'd check 'this page is PAE, so
when's the next page that is not/hit max_nr_pages?'

So that'd be 1 in the first case.

Then you loop around and go again, and this time it'd check 'this page is !PAE,
so when's the next page that is/hit max_nr_pages?' and then it'd return 511.

A better example, I think, is if, for the sake of argument, it returned 16,
16, 480.

Then we scan ahead 16, set nr_ptes = 16, process 16 PTEs. Then the same again,
then the same again only for 480 PTEs.

Each time we set nr_ptes = the sub-batch size.

So I don't think we'll see O(n^2) here?

It would be like:

	do {
		/* now returns sub-batch count */
		nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
				   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);

		... rest of logic remains roughly the same ...
	} while (...);

I may be being naive here in some way?

>
> That feels inefficient to me. So I'd much rather just remember that we have a
> batch of 512, then split into sub batches as needed for PAE compliance. Then we
> only scan each PTE once and each struct page once.
>
> But to achieve this, we either need to merge the 2 loops or we need to carry
> state across function calls (i.e. like an iterator).
>
> >
> > Would this break anything?
>
> It's not about breaking anything, it's about scanning efficiently. Perhaps you
> don't think it's worth worrying about in practice?

The question re: breaking was - if we re-do things like getting oldpte, ptent,
etc. on each sub-batch does _that_ break anything?


The current implementation is not acceptable on the basis of adding a horrible
level of complexity. That function is already terrible, and adding an inner loop
for this batch special casing with _sub batches_ to account for PAE- nobody is
going to understand what's going on.

	do {
		if (...) {
			while (...) {
				help!!!


We can do better, and I'm going to go further and say - this series _has_ to do
better, because I can't accept that, however we do it.

I want the efficiency gainz as much as you guys but I"m convinced we can do it
without causing eye bleeding confusion...

>
> >
> > We might need to pass a flag to say 'don't account for this' for prot numa case.
>
> Yep, another bool ;-)

Yeah... but we can't sensibly add a flag for this so the flag idea doesn't fly
for that either... :>)

I mean I don't think we actually need that flag, let it skip the sub-batch size
then check again. Now that, I reckon, is a small overhead.

>
> >
> >>
> >> Then we just have a simple loop here that gets the next batch and processes it?
>

Another advantage of doing things this way is we can actually add a comment
explaining the sub-batch size.

Right now there's just absolutely _nothing_ and it's entirely unclear what on
earth is going on.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-01 10:21               ` Lorenzo Stoakes
@ 2025-07-01 11:31                 ` Ryan Roberts
  2025-07-01 13:40                   ` Lorenzo Stoakes
  0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-07-01 11:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 01/07/2025 11:21, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 10:53:08AM +0100, Ryan Roberts wrote:
>> On 01/07/2025 09:51, Lorenzo Stoakes wrote:
>>> On Tue, Jul 01, 2025 at 09:30:51AM +0100, Ryan Roberts wrote:
>>>>>> In an ideal world we would flatten and just have mprotect_folio_pte_batch()
>>>>>> return the batch size considering all the relevant PTE bits AND the
>>>>>> AnonExclusive bit on the pages. IIRC one of Dev's earlier versions modified the
>>>>>> core folio_pte_batch() function to also look at the AnonExclusive bit, but I
>>>>>> really disliked changing that core function (I think others did too?).
>>>>>
>>>>> Yeah let's not change the core function.
>>>>>
>>>>> My suggestion is to have mprotect_folio_pte_batch() do this.
>>>>>
>>>>>>
>>>>>> So barring that approach, we are really only left with the batch and sub-batch
>>>>>> approach - although, yes, it could be abstracted more. We could maintain a
>>>>>> context struct that persists across all calls to mprotect_folio_pte_batch() and
>>>>>> it can use that to keep its state to remember if we are in the middle of a
>>>>>> sub-batch and decide either to call folio_pte_batch() to get a new batch, or
>>>>>> call anon_exclusive_batch() to get the next sub-batch within the current batch.
>>>>>> But that started to feel overly abstracted to me.
>>>>>
>>>>> Having this nested batch/sub-batch loop really feels worse. You just get lost in
>>>>> the complexity here very easily.
>>>>>
>>>>> But I'm also not sure we need to maintain _that_ much state?
>>>>>
>>>>> We're already looping over all of the PTEs here, so abstracting _the entire
>>>>> loop_ and all the sub-batch stuff to another function, that is
>>>>> mprotect_folio_pte_batch() I think sensibly, so it handles this for you makes a
>>>>> ton of sense.
>>>>
>>>> So effectively turn mprotect_folio_pte_batch() into an iterator; have a struct
>>>> and a function to init the struct for the number of ptes we want to iterate
>>>> over, then a per iteration function that progressively returns batches?
>>>
>>> Is that really necessary though?
>>>
>>> Idea is that mprotect_folio_pte_batch() returns the nr_ptes _taking into account
>>> the PAE stuff_.
>>
>> The issue is the efficiency. Assuming we want to keep the PTE scan contained
>> within the core folio_pte_batch() function and we _don't_ want to add PAE
>> awareness to that function, then we have 2 separate, independent loops; one for
>> PTE scanning and the other for PAE scanning. If the first loop scans through and
>> returns 512, but then the PAE scan returns 1, we return 1. If we don't remember
>> for the next time that we already determined we have a PTE batch of 512 (now
>> 511) then we will rescan the 511 PTEs and potentially return 1 again due to PAE.
>> Then 510, then 509...
> 
> Hm really?
> 
> The idea is mprotect_folio_pte_batch() would look ahead and determine the
> PAE/non-PAE sub-batch and return this nr_pages. It'd check 'this page is PAE, so
> when's the next page that is not/hit max_nr_pages?'
> 
> So that'd be 1 in the first case.
> 
> Then you loop around and go again, and this time it'd check 'this page is !PAE,
> so when's the next page that is/hit max_nr_pages?' and then it'd return 511.
> 
> A better example I think is e.g. if we had, for the sake of argument, it return 16,
> 16, 480.
> 
> Then we scan ahead 16, set nr_ptes = 16, process 16 PTEs. Then the same again,
> then the same again only for 480 PTEs.
> 
> Each time we set nr_ptes = the sub-batch size.
> 
> So I don't think we'll see O(n^2) here?
> 
> It would be like:
> 
> 	do {
> 		/* now returns sub-batch count */
> 		nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
> 				   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
> 
> 		... rest of logic remains roughly the same ...
> 	} while (...);
> 
> I may be being naive here in some way?

I believe so, yes. But usually it's me that ends up being wrong. Let me try to
explain my point and we shall see...

There are 2 separate requirements that need to be met for a batch to be assembled:

  - All PTEs have to map consecutive pages from the same folio, all with the
    same pgprots (except a/d/sd).
  - If anon, all of the mapped pages must have the same PAE value.

The first requirement is managed by scanning forward through PTEs until it hits
the first PTE that is non-conformant (or hits max_nr). Currently implemented by
folio_pte_batch().

The second requirement is managed by scanning through the struct pages, checking
PAE (or hits max_nr).

The final batch is determined according to the smaller of the 2 batches
determined using both these checks.

So, assuming we don't want to fold both of those into the same loop (which would
enable checking the PTE and PAE in lock-step), mprotect_folio_pte_batch() needs
to call both folio_pte_batch() and loop over the pages looking at PAE in order
to decide where the batch boundary is.

If we want it to be stateless, then if it scans the PTEs first and that batch is
larger than the batch determined for the subsequent PAE scan, we return the
smaller and next time it is called it will rescan those excess PTEs. The same
logic applies in reverse if you scan PAE first.

If we make it stateful, it can remember "I've already scanned PTEs and the PTE
batch ends at X. So now I just need to iterate through that to create
sub-batches taking PAE into account".
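
To make the stateful option concrete, a rough sketch (every name here is
invented for illustration; mprotect_folio_pte_batch() means the patch's
existing PTE-only version, and anon_exclusive_batch() is the hypothetical
helper mentioned earlier in the thread):

	struct sub_batch_iter {
		int pte_batch_left;	/* remainder of the already-scanned PTE batch */
	};

	static int next_sub_batch(struct sub_batch_iter *iter, struct folio *folio,
			unsigned long addr, pte_t *ptep, pte_t pte, int max_nr_ptes)
	{
		int nr_ptes;

		/* Only rescan PTEs once the previous PTE batch is used up. */
		if (!iter->pte_batch_left)
			iter->pte_batch_left = mprotect_folio_pte_batch(folio,
						addr, ptep, pte, max_nr_ptes);

		/* Chunk the remembered PTE batch on PAE. */
		nr_ptes = anon_exclusive_batch(folio, pte_page(pte),
					       iter->pte_batch_left);
		iter->pte_batch_left -= nr_ptes;
		return nr_ptes;
	}

That way each PTE and each struct page only gets looked at once; the only state
carried across calls is how much of the current PTE batch is left.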


> 
>>
>> That feels inefficient to me. So I'd much rather just remember that we have a
>> batch of 512, then split into sub batches as needed for PAE compliance. Then we
>> only scan each PTE once and each struct page once.
>>
>> But to achieve this, we either need to merge the 2 loops or we need to carry
>> state across function calls (i.e. like an iterator).
>>
>>>
>>> Would this break anything?
>>
>> It's not about breaking anything, it's about scanning efficiently. Perhaps you
>> don't think it's worth worrying about in practice?
> 
> The question re: breaking was - if we re-do things like getting oldpte, ptent,
> etc. on each sub-batch does _that_ break anything?
> 
> 
> The current implementation is not acceptable on the basis of adding a horrible
> level of complexity. That function is already terrible, and adding an inner loop
> for this batch special casing with _sub batches_ to account for PAE- nobody is
> going to understand what's going on.
> 
> 	do {
> 		if (...) {
> 			while (...) {
> 				help!!!
> 
> 
> We can do better, and I'm going to go further and say - this series _has_ to do
> better, because I can't accept that, however we do it.
> 
> I want the efficiency gainz as much as you guys but I'm convinced we can do it
> without causing eye bleeding confusion...

That's completely reasonable - we will get there! I'm very happy for this to be
refactored into helper function(s) to make it more accessible.

I'm just saying that fundamentally, we either need to flatten this to a single
loop so that the PTE and PAE can be assessed in lock-step and we never
over-scan. Or we need to keep some state to remember when we have already
scanned for a PTE batch and are currently working our way through that chunking
it into sub-batches based on PAE. I don't think we should entertain a stateless
two-loop solution.
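
For the flattened lock-step variant I mean something like this (sketch only;
'page' is the first page mapped by the batch, and the same-folio/pgprot and
a/d/soft-dirty handling that folio_pte_batch() does is hand-waved away for
brevity):

	bool anon = folio_test_anon(folio);
	bool first_pae = anon && PageAnonExclusive(page);
	pte_t expected = pte;
	int nr_ptes;

	for (nr_ptes = 1; nr_ptes < max_nr_ptes; nr_ptes++) {
		expected = pte_next_pfn(expected);
		/* Stop at the first PTE that doesn't extend the batch... */
		if (!pte_same(ptep_get(ptep + nr_ptes), expected))
			break;
		/* ...or, for anon, at the first page whose PAE value differs. */
		if (anon && PageAnonExclusive(page + nr_ptes) != first_pae)
			break;
	}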

> 
>>
>>>
>>> We might need to pass a flag to say 'don't account for this' for prot numa case.
>>
>> Yep, another bool ;-)
> 
> Yeah... but we can't sensibly add a flag for this so the flag idea doesn't fly
> for that either... :>)
> 
> I mean I don't think we actually need that flag, let it skip the sub-batch size
> then check again. Now that, I reckon, is a small overhead.

Yeah, agreed. That's probably fine in practice.

> 
>>
>>>
>>>>
>>>> Then we just have a simple loop here that gets the next batch and processes it?
>>
> 
> Another advantage of doing things this way is we can actually add a comment
> explaining the sub-batch size.
> 
> Right now there's just absolutely _nothing_ and it's entirely unclear what on
> earth is going on.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-01 11:31                 ` Ryan Roberts
@ 2025-07-01 13:40                   ` Lorenzo Stoakes
  2025-07-02 10:32                     ` Lorenzo Stoakes
  0 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-07-01 13:40 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Tue, Jul 01, 2025 at 12:31:02PM +0100, Ryan Roberts wrote:
> >> The issue is the efficiency. Assuming we want to keep the PTE scan contained
> >> within the core folio_pte_batch() function and we _don't_ want to add PAE
> >> awareness to that function, then we have 2 separate, independent loops; one for
> >> PTE scanning and the other for PAE scanning. If the first loop scans through and
> >> returns 512, but then the PAE scan returns 1, we return 1. If we don't remember
> >> for the next time that we already determined we have a PTE batch of 512 (now
> >> 511) then we will rescan the 511 PTEs and potentially return 1 again due to PAE.
> >> Then 510, then 509...
> >
> > Hm really?
> >
> > The idea is mprotect_folio_pte_batch() would look ahead and determine the
> > PAE/non-PAE sub-batch and return this nr_pages. It'd check 'this page is PAE, so
> > when's the next page that is not/hit max_nr_pages?'
> >
> > So that'd be 1 in the first case.
> >
> > Then you loop around and go again, and this time it'd check 'this page is !PAE,
> > so when's the next page that is/hit max_nr_pages?' and then it'd return 511.
> >
> > A better example I think is e.g. if we had, for the sake of argument, it return 16,
> > 16, 480.
> >
> > Then we scan ahead 16, set nr_ptes = 16, process 16 PTEs. Then the same again,
> > then the same again only for 480 PTEs.
> >
> > Each time we set nr_ptes = the sub-batch size.
> >
> > So I don't think we'll see O(n^2) here?
> >
> > It would be like:
> >
> > 	do {
> > 		/* now returns sub-batch count */
> > 		nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
> > 				   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
> >
> > 		... rest of logic remains roughly the same ...
> > 	} while (...);
> >
> > I may be being naive here in some way?
>
> I believe so, yes. But usually it's me that ends up being wrong. Let me try to
> explain my point and we shall see...

Haha no, being embarrassingly wrong is my speciality. Don't take that from me ;)

But in this case I think you're actually right, and I've underestimated the
complication here.

>
> There are 2 separate requirements that need to be met for a batch to be assembled:
>
>   - All PTEs have to map consecutive pages from the same folio, all with the
>     same pgprots (except a/d/sd).
>   - If anon, all of the mapped pages must have the same PAE value.
>
> The first requirement is managed by scanning forward through PTEs until it hits
> the first PTE that is non-conformant (or hits max_nr). Currently implemented by
> folio_pte_batch().

OK so I think this is the crux. The folio_pte_batch() is naive to the PAE thing,
and so we end up having to rescan.

>
> The second requirement is managed by scanning through the struct pages, checking
> PAE (or hits max_nr).
>
> The final batch is determined according to the smaller of the 2 batches
> determined using both these checks.
>
> So, assuming we don't want to fold both of those into the same loop (which would
> enable checking the PTE and PAE in lock-step), mprotect_folio_pte_batch() needs
> to call both folio_pte_batch() and loop over the pages looking at PAE in order
> to decide where the batch boundary is.
>
> If we want it to be stateless, then if it scans the PTEs first and that batch is
> larger than the batch determined for the subsequent PAE scan, we return the
> smaller and next time it is called it will rescan those excess PTEs. The same
> logic applies in reverse if you scan PAE first.
>
> If we make it stateful, it can remember "I've already scanned PTEs and the PTE
> batch ends at X. So now I just need to iterate through that to create
> sub-batches taking PAE into account".

Right yeah. Statelessness is not crucial here and doesn't seem workable then.


> > The current implementation is not acceptable on the basis of adding a horrible
> > level of complexity. That function is already terrible, and adding an inner loop
> > for this batch special casing with _sub batches_ to account for PAE- nobody is
> > going to understand what's going on.
> >
> > 	do {
> > 		if (...) {
> > 			while (...) {
> > 				help!!!
> >
> >
> > We can do better, and I'm going to go further and say - this series _has_ to do
> > better, because I can't accept that, however we do it.
> >
> > I want the efficiency gainz as much as you guys but I'm convinced we can do it
> > without causing eye bleeding confusion...
>
> That's completely reasonable - we will get there! I'm very happy for this to be
> refactored into helper function(s) to make it more accessible.

We'll definitely get there :)

>
> I'm just saying that fundamentally, we either need to flatten this to a single
> loop so that the PTE and PAE can be assessed in lock-step and we never
> over-scan. Or we need to keep some state to remember when we have already
> scanned for a PTE batch and are currently working our way through that chunking
> it into sub-batches based on PAE. I don't think we should entertain a stateless
> two-loop solution.

Yes.

>
> >
> >>
> >>>
> >>> We might need to pass a flag to say 'don't account for this' for prot numa case.
> >>
> >> Yep, another bool ;-)
> >
> > Yeah... but we can't sensibly add a flag for this so the flag idea doesn't fly
> > for that either... :>)
> >
> > I mean I don't think we actually need that flag, let it skip the sub-batch size
> > then check again. Now that, I reckon, is a small overhead.
>
> Yeah, agreed. That's probably fine in practice.

Ack.

Let me fiddle with this code and see if I can suggest something sensible.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-06-28 11:34 ` [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
  2025-06-30  9:42   ` Ryan Roberts
  2025-06-30 11:25   ` Lorenzo Stoakes
@ 2025-07-02  9:37   ` Lorenzo Stoakes
  2025-07-02 15:01     ` Dev Jain
  2 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-07-02  9:37 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Sat, Jun 28, 2025 at 05:04:32PM +0530, Dev Jain wrote:
> In case of prot_numa, there are various cases in which we can skip to the
> next iteration. Since the skip condition is based on the folio and not
> the PTEs, we can skip a PTE batch. Additionally refactor all of this
> into a new function to clean up the existing code.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mprotect.c | 134 ++++++++++++++++++++++++++++++++------------------
>  1 file changed, 87 insertions(+), 47 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88709c01177b..af10a7fbe6b8 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,6 +83,83 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return pte_dirty(pte);
>  }
>
> +static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
> +		pte_t *ptep, pte_t pte, int max_nr_ptes)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +
> +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
> +		return 1;
> +
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
> +			       NULL, NULL, NULL);
> +}
> +
> +static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t oldpte, pte_t *pte, int target_node,
> +		int max_nr_ptes)
> +{

While it's nice to separate this out, it's not so nice to pass folio as a
pointer-to-a-pointer like this and then maybe (or maybe not) set it.

Just get the folio before you call this... you'll need it either way.
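
i.e. something along these lines (just a sketch - the signature is reshuffled
for illustration and the return-value convention is hand-waved):

	folio = vm_normal_folio(vma, addr, oldpte);

	if (prot_numa) {
		nr_ptes = prot_numa_skip_ptes(folio, vma, addr, oldpte, pte,
					      target_node, max_nr_ptes);
		if (nr_ptes)
			continue;	/* skip the whole batch */
	}

	/* Same folio reused here, no second vm_normal_folio() lookup. */
	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
					   max_nr_ptes);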

I'll wait until you separate it all out before reviewing the next revision, as
it's a bit tricky as-is.

Thanks!


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-01 13:40                   ` Lorenzo Stoakes
@ 2025-07-02 10:32                     ` Lorenzo Stoakes
  2025-07-02 15:03                       ` Dev Jain
  0 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-07-02 10:32 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Tue, Jul 01, 2025 at 02:40:14PM +0100, Lorenzo Stoakes wrote:
> Let me fiddle with this code and see if I can suggest something sensible.

OK this is _even more fiddly_ than I thought.

Dev - feel free to do a respin (once David's stuff's landed in mm-new, which I
think maybe it has now?), and we can maybe discuss a v6 with everything else in
place if this is still problematic.

Ryan mentioned an iterator solution, which actually now seems sensible here, if
fiddly. Please do try to attack that in v5 to see if something's workable there.

Why are computers so complicated...


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-07-02  9:37   ` Lorenzo Stoakes
@ 2025-07-02 15:01     ` Dev Jain
  2025-07-02 15:37       ` Lorenzo Stoakes
  0 siblings, 1 reply; 62+ messages in thread
From: Dev Jain @ 2025-07-02 15:01 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 02/07/25 3:07 pm, Lorenzo Stoakes wrote:
> On Sat, Jun 28, 2025 at 05:04:32PM +0530, Dev Jain wrote:
>> In case of prot_numa, there are various cases in which we can skip to the
>> next iteration. Since the skip condition is based on the folio and not
>> the PTEs, we can skip a PTE batch. Additionally refactor all of this
>> into a new function to clean up the existing code.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mprotect.c | 134 ++++++++++++++++++++++++++++++++------------------
>>   1 file changed, 87 insertions(+), 47 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 88709c01177b..af10a7fbe6b8 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -83,6 +83,83 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   	return pte_dirty(pte);
>>   }
>>
>> +static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
>> +		pte_t *ptep, pte_t pte, int max_nr_ptes)
>> +{
>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +
>> +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>> +		return 1;
>> +
>> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>> +			       NULL, NULL, NULL);
>> +}
>> +
>> +static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t oldpte, pte_t *pte, int target_node,
>> +		int max_nr_ptes)
>> +{
> While it's nice to separate this out, it's not so nice to pass folio as a
> pointer to a pointer like this and maybe or maybe not set it.
>
> Just get the folio before you call this... you'll need it either way.

I did that on David's suggestion:

https://lore.kernel.org/all/8c389ee5-f7a4-44f6-a0d6-cc01c3da4d91@redhat.com/

We were trying to reuse the folio if available from prot_numa_skip_ptes,
to avoid using vm_normal_folio() again. Not sure if avoiding vm_normal_folio
is worth the complexity.

>
> I'll wait until you separate it all out before reviewing the next revision, as
> it's a bit tricky as-is.
>
> Thanks!


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-02 10:32                     ` Lorenzo Stoakes
@ 2025-07-02 15:03                       ` Dev Jain
  2025-07-02 15:22                         ` Lorenzo Stoakes
  0 siblings, 1 reply; 62+ messages in thread
From: Dev Jain @ 2025-07-02 15:03 UTC (permalink / raw)
  To: Lorenzo Stoakes, Ryan Roberts
  Cc: akpm, david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, vbabka, jannh, anshuman.khandual, peterx,
	joey.gouly, ioworker0, baohua, kevin.brodsky, quic_zhenhuah,
	christophe.leroy, yangyicong, linux-arm-kernel, hughd, yang, ziy


On 02/07/25 4:02 pm, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 02:40:14PM +0100, Lorenzo Stoakes wrote:
>> Let me fiddle with this code and see if I can suggest something sensible.
> OK this is _even more fiddly_ than I thought.
>
> Dev - feel free to do a respin (once David's stuff's landed in mm-new, which I
> think maybe it has now?), and we can maybe discuss a v6 with everything else in
> place if this is still problematic.
>
> Ryan mentioned an iterator solution, which actually now seems sensible here, if
> fiddly. Please do try to attack that in v5 to see if something's workable there.
>
> Why are computers so complicated...

Sure! I'll be out next week so the mm list can take a breather from my patches : )
My brain exploded trying to understand your and Ryan's implementation conversation,
will read it all and try to come up with something nice.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-02 15:03                       ` Dev Jain
@ 2025-07-02 15:22                         ` Lorenzo Stoakes
  2025-07-03 12:59                           ` David Hildenbrand
  0 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-07-02 15:22 UTC (permalink / raw)
  To: Dev Jain
  Cc: Ryan Roberts, akpm, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Wed, Jul 02, 2025 at 08:33:39PM +0530, Dev Jain wrote:
>
> On 02/07/25 4:02 pm, Lorenzo Stoakes wrote:
> > On Tue, Jul 01, 2025 at 02:40:14PM +0100, Lorenzo Stoakes wrote:
> > > Let me fiddle with this code and see if I can suggest something sensible.
> > OK this is _even more fiddly_ than I thought.
> >
> > Dev - feel free to do a respin (once David's stuff's landed in mm-new, which I
> > think maybe it has now?), and we can maybe discuss a v6 with everything else in
> > place if this is still problematic.
> >
> > Ryan mentioned an iterator solution, which actually now seems sensible here, if
> > fiddly. Please do try to attack that in v5 to see if something's workable there.
> >
> > Why are computers so complicated...
>
> Sure! I'll be out next week so the mm list can take a breather from my patches : )
> My brain exploded trying to understand your and Ryan's implementation conversation,
> will read it all and try to come up with something nice.
>

No worries, it's actually genuinely annoyingly tricky this... :) we may need a
few more iterations to get there.

Really the underlying issue (other than inherent complexity) is the poor
mprotect implementation in the first instance, which makes the complexity here
harder to integrate.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-07-02 15:01     ` Dev Jain
@ 2025-07-02 15:37       ` Lorenzo Stoakes
  0 siblings, 0 replies; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-07-02 15:37 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Wed, Jul 02, 2025 at 08:31:33PM +0530, Dev Jain wrote:
>
> On 02/07/25 3:07 pm, Lorenzo Stoakes wrote:
> > On Sat, Jun 28, 2025 at 05:04:32PM +0530, Dev Jain wrote:
> > > In case of prot_numa, there are various cases in which we can skip to the
> > > next iteration. Since the skip condition is based on the folio and not
> > > the PTEs, we can skip a PTE batch. Additionally refactor all of this
> > > into a new function to clean up the existing code.
> > >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > >   mm/mprotect.c | 134 ++++++++++++++++++++++++++++++++------------------
> > >   1 file changed, 87 insertions(+), 47 deletions(-)
> > >
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 88709c01177b..af10a7fbe6b8 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -83,6 +83,83 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> > >   	return pte_dirty(pte);
> > >   }
> > >
> > > +static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
> > > +		pte_t *ptep, pte_t pte, int max_nr_ptes)
> > > +{
> > > +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > > +
> > > +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
> > > +		return 1;
> > > +
> > > +	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
> > > +			       NULL, NULL, NULL);
> > > +}
> > > +
> > > +static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma,
> > > +		unsigned long addr, pte_t oldpte, pte_t *pte, int target_node,
> > > +		int max_nr_ptes)
> > > +{
> > While it's nice to separate this out, it's not so nice to pass folio as a
> > pointer to a pointer like this and maybe or maybe not set it.
> >
> > Just get the folio before you call this... you'll need it either way.
>
> I did that on David's suggestion:
>
> https://lore.kernel.org/all/8c389ee5-f7a4-44f6-a0d6-cc01c3da4d91@redhat.com/
>
> We were trying to reuse the folio if available from prot_numa_skip_ptes,
> to avoid using vm_normal_folio() again. Not sure if avoiding vm_normal_folio
> is worth the complexity.

Well, do you need to? You're doing vm_normal_folio() in both cases, why not just
put the vm_normal_folio() lookup in change_pte_range() before invoking this
function, then reuse the folio in the loop?

Oh right, I guess David was concerned about not needing to look it up in the
pte_protnone(oldpte) case?

I'm not sure that's worth it honestly. If we do _really_ want to do this, then
at least put the param last.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
  2025-07-02 15:22                         ` Lorenzo Stoakes
@ 2025-07-03 12:59                           ` David Hildenbrand
  0 siblings, 0 replies; 62+ messages in thread
From: David Hildenbrand @ 2025-07-03 12:59 UTC (permalink / raw)
  To: Lorenzo Stoakes, Dev Jain
  Cc: Ryan Roberts, akpm, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 02.07.25 17:22, Lorenzo Stoakes wrote:
> On Wed, Jul 02, 2025 at 08:33:39PM +0530, Dev Jain wrote:
>>
>> On 02/07/25 4:02 pm, Lorenzo Stoakes wrote:
>>> On Tue, Jul 01, 2025 at 02:40:14PM +0100, Lorenzo Stoakes wrote:
>>>> Let me fiddle with this code and see if I can suggest something sensible.
>>> OK this is _even more fiddly_ than I thought.
>>>
>>> Dev - feel free to do a respin (once David's stuff's landed in mm-new, which I
>>> think maybe it has now?), and we can maybe discuss a v6 with everything else in
>>> place if this is still problematic.
>>>
>>> Ryan mentioned an iterator solution, which actually now seems sensible here, if
>>> fiddly. Please do try to attack that in v5 to see if something's workable there.
>>>
>>> Why are computers so complicated...
>>
>> Sure! I'll be out next week so the mm list can take a breather from my patches : )
>> My brain exploded trying to understand your and Ryan's implementation conversation,
>> will read it all and try to come up with something nice.
>>
> 
> No worries, it's actually genuinely annoyingly tricky this... :) we may need a
> few more iterations to get there.
> 
> Really the underlying issue (other than inherent complexity) is the poor
> mprotect implementation in the first instance that makes the complexity here
> less easy to integrate.
> 

I haven't really scanned all the discussions yet, just something to keep 
in mind:

In the common case, all PAE values will match. This is the case to 
optimize for.

Having some PAE values not match is the corner case, which can be left
suboptimal if it makes the code nicer.
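
So e.g. something as simple as this (sketch only; 'page' is the first page
mapped by the batch, everything else comes from the surrounding
change_pte_range() loop) would keep the common case fast:

	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
	if (folio_test_anon(folio)) {
		bool pae = PageAnonExclusive(page);
		int i;

		/* Common case: every page agrees -> keep the whole batch. */
		for (i = 1; i < nr_ptes; i++) {
			if (PageAnonExclusive(page + i) != pae) {
				/* Rare mixed case: just process one PTE. */
				nr_ptes = 1;
				break;
			}
		}
	}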

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2025-07-03 14:30 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-28 11:34 [PATCH v4 0/4] Optimize mprotect() for large folios Dev Jain
2025-06-28 11:34 ` [PATCH v4 1/4] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
2025-06-30  9:42   ` Ryan Roberts
2025-06-30  9:49     ` Dev Jain
2025-06-30  9:55       ` Ryan Roberts
2025-06-30 10:05         ` Dev Jain
2025-06-30 11:25   ` Lorenzo Stoakes
2025-06-30 11:39     ` Ryan Roberts
2025-06-30 11:53       ` Lorenzo Stoakes
2025-06-30 11:40     ` Dev Jain
2025-06-30 11:51       ` Lorenzo Stoakes
2025-06-30 11:56         ` Dev Jain
2025-07-02  9:37   ` Lorenzo Stoakes
2025-07-02 15:01     ` Dev Jain
2025-07-02 15:37       ` Lorenzo Stoakes
2025-06-28 11:34 ` [PATCH v4 2/4] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
2025-06-30 10:10   ` Ryan Roberts
2025-06-30 10:17     ` Dev Jain
2025-06-30 10:35       ` Ryan Roberts
2025-06-30 10:42         ` Dev Jain
2025-06-30 12:57   ` Lorenzo Stoakes
2025-07-01  4:44     ` Dev Jain
2025-07-01  7:33       ` Ryan Roberts
2025-07-01  8:06         ` Lorenzo Stoakes
2025-07-01  8:23           ` Ryan Roberts
2025-07-01  8:34             ` Lorenzo Stoakes
2025-06-28 11:34 ` [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching Dev Jain
2025-06-28 12:39   ` Dev Jain
2025-06-30 10:31   ` Ryan Roberts
2025-06-30 11:21     ` Dev Jain
2025-06-30 11:47       ` Dev Jain
2025-06-30 11:50       ` Ryan Roberts
2025-06-30 11:53         ` Dev Jain
2025-07-01  5:47     ` Dev Jain
2025-07-01  7:39       ` Ryan Roberts
2025-06-30 12:52   ` Lorenzo Stoakes
2025-07-01  5:30     ` Dev Jain
2025-07-01  8:03     ` Ryan Roberts
2025-07-01  8:06       ` Dev Jain
2025-07-01  8:24         ` Ryan Roberts
2025-07-01  8:15       ` Lorenzo Stoakes
2025-07-01  8:30         ` Ryan Roberts
2025-07-01  8:51           ` Lorenzo Stoakes
2025-07-01  9:53             ` Ryan Roberts
2025-07-01 10:21               ` Lorenzo Stoakes
2025-07-01 11:31                 ` Ryan Roberts
2025-07-01 13:40                   ` Lorenzo Stoakes
2025-07-02 10:32                     ` Lorenzo Stoakes
2025-07-02 15:03                       ` Dev Jain
2025-07-02 15:22                         ` Lorenzo Stoakes
2025-07-03 12:59                           ` David Hildenbrand
2025-06-28 11:34 ` [PATCH v4 4/4] arm64: Add batched versions of ptep_modify_prot_start/commit Dev Jain
2025-06-30 10:43   ` Ryan Roberts
2025-06-29 23:05 ` [PATCH v4 0/4] Optimize mprotect() for large folios Andrew Morton
2025-06-30  3:33   ` Dev Jain
2025-06-30 10:45     ` Ryan Roberts
2025-06-30 11:22       ` Dev Jain
2025-06-30 11:17 ` Lorenzo Stoakes
2025-06-30 11:25   ` Dev Jain
2025-06-30 11:27 ` Lorenzo Stoakes
2025-06-30 11:43   ` Dev Jain
2025-07-01  0:08     ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).