* [PATCH v2 0/7] Optimize mprotect for large folios
@ 2025-04-29 5:23 Dev Jain
2025-04-29 5:23 ` [PATCH v2 1/7] mm: Refactor code in mprotect Dev Jain
` (8 more replies)
0 siblings, 9 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-29 5:23 UTC (permalink / raw)
To: akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy, Dev Jain
This patchset optimizes the mprotect() system call for large folios
by PTE-batching.
We use the following test cases to measure performance, mprotect()'ing
the mapped memory to read-only then read-write 40 times:
Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
pte-mapping those THPs
Test case 2: Mapping 1G of memory with 64K mTHPs
Test case 3: Mapping 1G of memory with 4K pages
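For reference, the core of the harness looks roughly like this (a minimal
sketch of test case 3; the THP/mTHP setup for test cases 1 and 2 and the
timing instrumentation are elided, and our actual harness may differ):
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#define SZ	(1UL << 30)	/* 1G */
int main(void)
{
	char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, SZ);		/* fault the mapping in */
	for (int i = 0; i < 40; i++) {	/* the measured loop */
		mprotect(p, SZ, PROT_READ);
		mprotect(p, SZ, PROT_READ | PROT_WRITE);
	}
	return 0;
}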
Average execution time on arm64, Apple M3:
Before the patchset:
T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
After the patchset:
T1: 2.1 seconds T2: 2.2 seconds T3: 4.2 seconds
Comparing T1/T2 against T3 before the patchset reveals the regression
introduced by ptep_get() on a contpte block (7.9s vs 4.2s); the patchset
also removes this regression. And for large folios we get an almost 74%
performance improvement (7.9s down to 2.1s).
v1->v2:
- Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
- Abridge the anon-exclusive condition (Lance Yang)
Dev Jain (7):
mm: Refactor code in mprotect
mm: Optimize mprotect() by batch-skipping PTEs
mm: Add batched versions of ptep_modify_prot_start/commit
arm64: Add batched version of ptep_modify_prot_start
arm64: Add batched version of ptep_modify_prot_commit
mm: Batch around can_change_pte_writable()
mm: Optimize mprotect() through PTE-batching
arch/arm64/include/asm/pgtable.h | 10 ++
arch/arm64/mm/mmu.c | 21 +++-
include/linux/mm.h | 4 +-
include/linux/pgtable.h | 42 ++++++++
mm/gup.c | 2 +-
mm/huge_memory.c | 4 +-
mm/memory.c | 6 +-
mm/mprotect.c | 165 ++++++++++++++++++++-----------
mm/pgtable-generic.c | 16 ++-
9 files changed, 198 insertions(+), 72 deletions(-)
--
2.30.2
* [PATCH v2 1/7] mm: Refactor code in mprotect
2025-04-29 5:23 [PATCH v2 0/7] Optimize mprotect for large folios Dev Jain
@ 2025-04-29 5:23 ` Dev Jain
2025-04-29 6:41 ` Anshuman Khandual
2025-04-29 11:00 ` Lorenzo Stoakes
2025-04-29 5:23 ` [PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs Dev Jain
` (7 subsequent siblings)
8 siblings, 2 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-29 5:23 UTC (permalink / raw)
To: akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy, Dev Jain
Reduce indentation in change_pte_range() by refactoring some of the code
into a new function. No functional change.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/mprotect.c | 116 +++++++++++++++++++++++++++++---------------------
1 file changed, 68 insertions(+), 48 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 88608d0dc2c2..70f59aa8c2a8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -83,6 +83,71 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
return pte_dirty(pte);
}
+
+
+static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
+ int target_node)
+{
+ bool toptier;
+ int nid;
+
+ /* Also skip shared copy-on-write pages */
+ if (is_cow_mapping(vma->vm_flags) &&
+ (folio_maybe_dma_pinned(folio) ||
+ folio_maybe_mapped_shared(folio)))
+ return true;
+
+ /*
+ * While migration can move some dirty pages,
+ * it cannot move them all from MIGRATE_ASYNC
+ * context.
+ */
+ if (folio_is_file_lru(folio) &&
+ folio_test_dirty(folio))
+ return true;
+
+ /*
+ * Don't mess with PTEs if page is already on the node
+ * a single-threaded process is running on.
+ */
+ nid = folio_nid(folio);
+ if (target_node == nid)
+ return true;
+ toptier = node_is_toptier(nid);
+
+ /*
+ * Skip scanning top tier node if normal numa
+ * balancing is disabled
+ */
+ if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+ toptier)
+ return true;
+ return false;
+}
+
+static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
+ unsigned long addr, pte_t oldpte, int target_node)
+{
+ struct folio *folio;
+ int ret;
+
+ /* Avoid TLB flush if possible */
+ if (pte_protnone(oldpte))
+ return true;
+
+ folio = vm_normal_folio(vma, addr, oldpte);
+ if (!folio || folio_is_zone_device(folio) ||
+ folio_test_ksm(folio))
+ return true;
+ ret = prot_numa_skip(vma, folio, target_node);
+ if (ret)
+ return ret;
+ if (folio_use_access_time(folio))
+ folio_xchg_access_time(folio,
+ jiffies_to_msecs(jiffies));
+ return false;
+}
+
static long change_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -116,56 +181,11 @@ static long change_pte_range(struct mmu_gather *tlb,
* Avoid trapping faults against the zero or KSM
* pages. See similar comment in change_huge_pmd.
*/
- if (prot_numa) {
- struct folio *folio;
- int nid;
- bool toptier;
-
- /* Avoid TLB flush if possible */
- if (pte_protnone(oldpte))
- continue;
-
- folio = vm_normal_folio(vma, addr, oldpte);
- if (!folio || folio_is_zone_device(folio) ||
- folio_test_ksm(folio))
- continue;
-
- /* Also skip shared copy-on-write pages */
- if (is_cow_mapping(vma->vm_flags) &&
- (folio_maybe_dma_pinned(folio) ||
- folio_maybe_mapped_shared(folio)))
- continue;
-
- /*
- * While migration can move some dirty pages,
- * it cannot move them all from MIGRATE_ASYNC
- * context.
- */
- if (folio_is_file_lru(folio) &&
- folio_test_dirty(folio))
+ if (prot_numa &&
+ prot_numa_avoid_fault(vma, addr,
+ oldpte, target_node))
continue;
- /*
- * Don't mess with PTEs if page is already on the node
- * a single-threaded process is running on.
- */
- nid = folio_nid(folio);
- if (target_node == nid)
- continue;
- toptier = node_is_toptier(nid);
-
- /*
- * Skip scanning top tier node if normal numa
- * balancing is disabled
- */
- if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
- toptier)
- continue;
- if (folio_use_access_time(folio))
- folio_xchg_access_time(folio,
- jiffies_to_msecs(jiffies));
- }
-
oldpte = ptep_modify_prot_start(vma, addr, pte);
ptent = pte_modify(oldpte, newprot);
--
2.30.2
* [PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs
2025-04-29 5:23 [PATCH v2 0/7] Optimize mprotect for large folios Dev Jain
2025-04-29 5:23 ` [PATCH v2 1/7] mm: Refactor code in mprotect Dev Jain
@ 2025-04-29 5:23 ` Dev Jain
2025-04-29 7:14 ` Anshuman Khandual
2025-04-29 13:19 ` Lorenzo Stoakes
2025-04-29 5:23 ` [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
` (6 subsequent siblings)
8 siblings, 2 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-29 5:23 UTC (permalink / raw)
To: akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy, Dev Jain
In case of prot_numa, there are various cases in which we can skip to the
next iteration. Since the skip condition is based on the folio and not
the PTEs, we can skip a PTE batch.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/mprotect.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 70f59aa8c2a8..ec5d17af7650 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -91,6 +91,9 @@ static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
bool toptier;
int nid;
+ if (folio_is_zone_device(folio) || folio_test_ksm(folio))
+ return true;
+
/* Also skip shared copy-on-write pages */
if (is_cow_mapping(vma->vm_flags) &&
(folio_maybe_dma_pinned(folio) ||
@@ -126,8 +129,10 @@ static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
}
static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
- unsigned long addr, pte_t oldpte, int target_node)
+ unsigned long addr, pte_t *pte, pte_t oldpte, int target_node,
+ int max_nr, int *nr)
{
+ const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
struct folio *folio;
int ret;
@@ -136,12 +141,16 @@ static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
return true;
folio = vm_normal_folio(vma, addr, oldpte);
- if (!folio || folio_is_zone_device(folio) ||
- folio_test_ksm(folio))
+ if (!folio)
return true;
+
ret = prot_numa_skip(vma, folio, target_node);
- if (ret)
+ if (ret) {
+ if (folio_test_large(folio) && max_nr != 1)
+ *nr = folio_pte_batch(folio, addr, pte, oldpte,
+ max_nr, flags, NULL, NULL, NULL);
return ret;
+ }
if (folio_use_access_time(folio))
folio_xchg_access_time(folio,
jiffies_to_msecs(jiffies));
@@ -159,6 +168,7 @@ static long change_pte_range(struct mmu_gather *tlb,
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ int nr;
tlb_change_page_size(tlb, PAGE_SIZE);
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -173,8 +183,10 @@ static long change_pte_range(struct mmu_gather *tlb,
flush_tlb_batched_pending(vma->vm_mm);
arch_enter_lazy_mmu_mode();
do {
+ nr = 1;
oldpte = ptep_get(pte);
if (pte_present(oldpte)) {
+ int max_nr = (end - addr) >> PAGE_SHIFT;
pte_t ptent;
/*
@@ -182,8 +194,9 @@ static long change_pte_range(struct mmu_gather *tlb,
* pages. See similar comment in change_huge_pmd.
*/
if (prot_numa &&
- prot_numa_avoid_fault(vma, addr,
- oldpte, target_node))
+ prot_numa_avoid_fault(vma, addr, pte,
+ oldpte, target_node,
+ max_nr, &nr))
continue;
oldpte = ptep_modify_prot_start(vma, addr, pte);
@@ -300,7 +313,7 @@ static long change_pte_range(struct mmu_gather *tlb,
pages++;
}
}
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ } while (pte += nr, addr += nr * PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
--
2.30.2
* [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-29 5:23 [PATCH v2 0/7] Optimize mprotect for large folios Dev Jain
2025-04-29 5:23 ` [PATCH v2 1/7] mm: Refactor code in mprotect Dev Jain
2025-04-29 5:23 ` [PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs Dev Jain
@ 2025-04-29 5:23 ` Dev Jain
2025-04-29 8:39 ` Anshuman Khandual
` (4 more replies)
2025-04-29 5:23 ` [PATCH v2 4/7] arm64: Add batched version of ptep_modify_prot_start Dev Jain
` (5 subsequent siblings)
8 siblings, 5 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-29 5:23 UTC (permalink / raw)
To: akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy, Dev Jain
Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
Architectures can override these helpers.
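For illustration, the intended call pattern (as wired up in the last patch
of this series) is:
	oldpte = modify_prot_start_ptes(vma, addr, pte, nr);
	ptent = pte_modify(oldpte, newprot);
	/* ... adjust ptent (uffd-wp, writability) ... */
	modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr);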
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c92..ed287289335f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
}
#endif
+/* See the comment for ptep_modify_prot_start */
+#ifndef modify_prot_start_ptes
+static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, unsigned int nr)
+{
+ pte_t pte, tmp_pte;
+
+ pte = ptep_modify_prot_start(vma, addr, ptep);
+ while (--nr) {
+ ptep++;
+ addr += PAGE_SIZE;
+ tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
+ if (pte_dirty(tmp_pte))
+ pte = pte_mkdirty(pte);
+ if (pte_young(tmp_pte))
+ pte = pte_mkyoung(pte);
+ }
+ return pte;
+}
+#endif
+
+/* See the comment for ptep_modify_prot_commit */
+#ifndef modify_prot_commit_ptes
+static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
+{
+ for (;;) {
+ ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ old_pte = pte_next_pfn(old_pte);
+ pte = pte_next_pfn(pte);
+ }
+}
+#endif
+
/*
* On some architectures hardware does not set page access bit when accessing
* memory page, it is responsibility of software setting this bit. It brings
--
2.30.2
* [PATCH v2 4/7] arm64: Add batched version of ptep_modify_prot_start
2025-04-29 5:23 [PATCH v2 0/7] Optimize mprotect for large folios Dev Jain
` (2 preceding siblings ...)
2025-04-29 5:23 ` [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
@ 2025-04-29 5:23 ` Dev Jain
2025-04-30 5:43 ` Anshuman Khandual
2025-04-29 5:23 ` [PATCH v2 5/7] arm64: Add batched version of ptep_modify_prot_commit Dev Jain
` (4 subsequent siblings)
8 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-04-29 5:23 UTC (permalink / raw)
To: akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy, Dev Jain
Override the generic definition to use get_and_clear_full_ptes(). This way
we issue a TLBI, at most, only on the "contpte-edges" of the large PTE
block, instead of on every contpte block, which is what happens with a
per-PTE ptep_get_and_clear().
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
arch/arm64/include/asm/pgtable.h | 5 +++++
arch/arm64/mm/mmu.c | 12 +++++++++---
include/linux/pgtable.h | 4 ++++
mm/pgtable-generic.c | 16 +++++++++++-----
4 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 2a77f11b78d5..8872ea5f0642 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1553,6 +1553,11 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);
+#define modify_prot_start_ptes modify_prot_start_ptes
+extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr);
+
#ifdef CONFIG_ARM64_CONTPTE
/*
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 8fcf59ba39db..fe60be8774f4 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1523,7 +1523,8 @@ static int __init prevent_bootmem_remove_init(void)
early_initcall(prevent_bootmem_remove_init);
#endif
-pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
+pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *ptep, unsigned int nr)
{
if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
/*
@@ -1532,9 +1533,14 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
* in cases where cpu is affected with errata #2645198.
*/
if (pte_user_exec(ptep_get(ptep)))
- return ptep_clear_flush(vma, addr, ptep);
+ return clear_flush_ptes(vma, addr, ptep, nr);
}
- return ptep_get_and_clear(vma->vm_mm, addr, ptep);
+ return get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);
+}
+
+pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
+{
+ return modify_prot_start_ptes(vma, addr, ptep, 1);
}
void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index ed287289335f..10cdb87ccecf 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -828,6 +828,10 @@ extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
pte_t *ptep);
#endif
+extern pte_t clear_flush_ptes(struct vm_area_struct *vma,
+ unsigned long address,
+ pte_t *ptep, unsigned int nr);
+
#ifndef __HAVE_ARCH_PMDP_HUGE_CLEAR_FLUSH
extern pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
unsigned long address,
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 5a882f2b10f9..e238f88c3cac 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -90,17 +90,23 @@ int ptep_clear_flush_young(struct vm_area_struct *vma,
}
#endif
-#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
-pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
- pte_t *ptep)
+pte_t clear_flush_ptes(struct vm_area_struct *vma, unsigned long address,
+ pte_t *ptep, unsigned int nr)
{
struct mm_struct *mm = (vma)->vm_mm;
pte_t pte;
- pte = ptep_get_and_clear(mm, address, ptep);
+ pte = get_and_clear_full_ptes(mm, address, ptep, nr, 0);
if (pte_accessible(mm, pte))
- flush_tlb_page(vma, address);
+ flush_tlb_range(vma, address, address + nr * PAGE_SIZE);
return pte;
}
+
+#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
+pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
+ pte_t *ptep)
+{
+ return clear_flush_ptes(vma, address, ptep, 1);
+}
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--
2.30.2
* [PATCH v2 5/7] arm64: Add batched version of ptep_modify_prot_commit
2025-04-29 5:23 [PATCH v2 0/7] Optimize mprotect for large folios Dev Jain
` (3 preceding siblings ...)
2025-04-29 5:23 ` [PATCH v2 4/7] arm64: Add batched version of ptep_modify_prot_start Dev Jain
@ 2025-04-29 5:23 ` Dev Jain
2025-04-29 5:23 ` [PATCH v2 6/7] mm: Batch around can_change_pte_writable() Dev Jain
` (3 subsequent siblings)
8 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-29 5:23 UTC (permalink / raw)
To: akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy, Dev Jain
Override the generic definition to simply use set_ptes() to map the new
ptes into the pagetable.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
arch/arm64/include/asm/pgtable.h | 5 +++++
arch/arm64/mm/mmu.c | 9 ++++++++-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 8872ea5f0642..0b13ca38f80c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1558,6 +1558,11 @@ extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
unsigned int nr);
+#define modify_prot_commit_ptes modify_prot_commit_ptes
+extern void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *ptep, pte_t old_pte, pte_t pte,
+ unsigned int nr);
+
#ifdef CONFIG_ARM64_CONTPTE
/*
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index fe60be8774f4..5f04bcdcd946 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1543,10 +1543,17 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
return modify_prot_start_ptes(vma, addr, ptep, 1);
}
+void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *ptep, pte_t old_pte, pte_t pte,
+ unsigned int nr)
+{
+ set_ptes(vma->vm_mm, addr, ptep, pte, nr);
+}
+
void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t pte)
{
- set_pte_at(vma->vm_mm, addr, ptep, pte);
+ modify_prot_commit_ptes(vma, addr, ptep, old_pte, pte, 1);
}
/*
--
2.30.2
* [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
2025-04-29 5:23 [PATCH v2 0/7] Optimize mprotect for large folios Dev Jain
` (4 preceding siblings ...)
2025-04-29 5:23 ` [PATCH v2 5/7] arm64: Add batched version of ptep_modify_prot_commit Dev Jain
@ 2025-04-29 5:23 ` Dev Jain
2025-04-29 9:15 ` David Hildenbrand
` (2 more replies)
2025-04-29 5:23 ` [PATCH v2 7/7] mm: Optimize mprotect() through PTE-batching Dev Jain
` (2 subsequent siblings)
8 siblings, 3 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-29 5:23 UTC (permalink / raw)
To: akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy, Dev Jain
In preparation for patch 7, we need to properly batch around
can_change_pte_writable(). We batch around pte_needs_soft_dirty_wp() via
the corresponding fpb flag, and we batch around the page-anon-exclusive
check using folio_maybe_mapped_shared(). modify_prot_start_ptes() collects
the dirty and access bits across the batch, which lets us batch around
pte_dirty(): this is correct because the dirty bit on the PTE is really
just an indication that the folio got written to, so even if the PTE at
hand is not actually dirty (but one of the PTEs in the batch is), the
wp-fault optimization can still be made.
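Concretely, the property we rely on (a sketch using the helpers from
patch 3; not new code in this patch):
	/* modify_prot_start_ptes() folds the A/D bits of the whole batch: */
	oldpte = modify_prot_start_ptes(vma, addr, pte, nr);
	/*
	 * pte_dirty(oldpte) is now true if ANY of the nr PTEs was dirty,
	 * i.e. "the folio got written to" - which is all that the
	 * VM_SHARED branch of can_change_ptes_writable() needs.
	 */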
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
include/linux/mm.h | 4 ++--
mm/gup.c | 2 +-
mm/huge_memory.c | 4 ++--
mm/memory.c | 6 +++---
mm/mprotect.c | 11 ++++++-----
5 files changed, 14 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21dd110b6655..2f639f6d93f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2487,8 +2487,8 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
MM_CP_UFFD_WP_RESOLVE)
-bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte);
+bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, struct folio *folio, unsigned int nr);
extern long change_protection(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
unsigned long end, unsigned long cp_flags);
diff --git a/mm/gup.c b/mm/gup.c
index f32168339390..c39f587842a0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -614,7 +614,7 @@ static inline bool can_follow_write_common(struct page *page,
return false;
/*
- * See can_change_pte_writable(): we broke COW and could map the page
+ * See can_change_ptes_writable(): we broke COW and could map the page
* writable if we have an exclusive anonymous page ...
*/
return page && PageAnon(page) && PageAnonExclusive(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2780a12b25f0..a58445fcedfc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2032,12 +2032,12 @@ static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
return false;
if (!(vma->vm_flags & VM_SHARED)) {
- /* See can_change_pte_writable(). */
+ /* See can_change_ptes_writable(). */
page = vm_normal_page_pmd(vma, addr, pmd);
return page && PageAnon(page) && PageAnonExclusive(page);
}
- /* See can_change_pte_writable(). */
+ /* See can_change_ptes_writable(). */
return pmd_dirty(pmd);
}
diff --git a/mm/memory.c b/mm/memory.c
index 68c1d962d0ad..e7ebc6b70421 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -750,7 +750,7 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
pte = pte_mkuffd_wp(pte);
if ((vma->vm_flags & VM_WRITE) &&
- can_change_pte_writable(vma, address, pte)) {
+ can_change_ptes_writable(vma, address, pte, NULL, 1)) {
if (folio_test_dirty(folio))
pte = pte_mkdirty(pte);
pte = pte_mkwrite(pte, vma);
@@ -5796,7 +5796,7 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
ptent = pte_modify(ptent, vma->vm_page_prot);
writable = pte_write(ptent);
if (!writable && pte_write_upgrade &&
- can_change_pte_writable(vma, addr, ptent))
+ can_change_ptes_writable(vma, addr, ptent, NULL, 1))
writable = true;
}
@@ -5837,7 +5837,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
*/
writable = pte_write(pte);
if (!writable && pte_write_upgrade &&
- can_change_pte_writable(vma, vmf->address, pte))
+ can_change_ptes_writable(vma, vmf->address, pte, NULL, 1))
writable = true;
folio = vm_normal_folio(vma, vmf->address, pte);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ec5d17af7650..baff009fc981 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -40,8 +40,8 @@
#include "internal.h"
-bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte)
+bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, struct folio *folio, unsigned int nr)
{
struct page *page;
@@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
* write-fault handler similarly would map them writable without
* any additional checks while holding the PT lock.
*/
- page = vm_normal_page(vma, addr, pte);
- return page && PageAnon(page) && PageAnonExclusive(page);
+ if (!folio)
+ folio = vm_normal_folio(vma, addr, pte);
+ return folio && folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
}
VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
@@ -222,7 +223,7 @@ static long change_pte_range(struct mmu_gather *tlb,
*/
if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
!pte_write(ptent) &&
- can_change_pte_writable(vma, addr, ptent))
+ can_change_ptes_writable(vma, addr, ptent, folio, 1))
ptent = pte_mkwrite(ptent, vma);
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
--
2.30.2
* [PATCH v2 7/7] mm: Optimize mprotect() through PTE-batching
2025-04-29 5:23 [PATCH v2 0/7] Optimize mprotect for large folios Dev Jain
` (5 preceding siblings ...)
2025-04-29 5:23 ` [PATCH v2 6/7] mm: Batch around can_change_pte_writable() Dev Jain
@ 2025-04-29 5:23 ` Dev Jain
2025-04-29 7:06 ` [PATCH v2 0/7] Optimize mprotect for large folios Lance Yang
2025-04-29 11:03 ` Lorenzo Stoakes
8 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-29 5:23 UTC (permalink / raw)
To: akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy, Dev Jain
The common pte_present case does not require the folio. Elide the overhead
of vm_normal_folio() for the small-folio case by making an approximation:
on arm64, pte_batch_hint() is conclusive; on other arches, if the pfns
pointed to by the current and the next PTE are contiguous, check whether a
large folio is actually mapped, and only then apply the batch optimization.
Reuse the folio from the prot_numa case if possible. Since
modify_prot_start_ptes() gathers the access/dirty bits across the batch,
it lets us batch around pte_needs_flush() (on parisc, the definition
includes the access bit).
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/mprotect.c | 49 +++++++++++++++++++++++++++++++++++--------------
1 file changed, 35 insertions(+), 14 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index baff009fc981..f8382806611f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -129,7 +129,7 @@ static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
return false;
}
-static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
+static struct folio *prot_numa_avoid_fault(struct vm_area_struct *vma,
unsigned long addr, pte_t *pte, pte_t oldpte, int target_node,
int max_nr, int *nr)
{
@@ -139,25 +139,37 @@ static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
/* Avoid TLB flush if possible */
if (pte_protnone(oldpte))
- return true;
+ return NULL;
folio = vm_normal_folio(vma, addr, oldpte);
if (!folio)
- return true;
+ return NULL;
ret = prot_numa_skip(vma, folio, target_node);
if (ret) {
if (folio_test_large(folio) && max_nr != 1)
*nr = folio_pte_batch(folio, addr, pte, oldpte,
max_nr, flags, NULL, NULL, NULL);
- return ret;
+ return NULL;
}
if (folio_use_access_time(folio))
folio_xchg_access_time(folio,
jiffies_to_msecs(jiffies));
- return false;
+ return folio;
}
+static bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
+{
+ pte_t *next_ptep, next_pte;
+
+ if (pte_batch_hint(ptep, pte) != 1)
+ return true;
+
+ next_ptep = ptep + 1;
+ next_pte = ptep_get(next_ptep);
+
+ return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);
+}
static long change_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -188,19 +200,28 @@ static long change_pte_range(struct mmu_gather *tlb,
oldpte = ptep_get(pte);
if (pte_present(oldpte)) {
int max_nr = (end - addr) >> PAGE_SHIFT;
+ const fpb_t flags = FPB_IGNORE_DIRTY;
+ struct folio *folio = NULL;
pte_t ptent;
/*
* Avoid trapping faults against the zero or KSM
* pages. See similar comment in change_huge_pmd.
*/
- if (prot_numa &&
- prot_numa_avoid_fault(vma, addr, pte,
- oldpte, target_node,
- max_nr, &nr))
+ if (prot_numa) {
+ folio = prot_numa_avoid_fault(vma, addr, pte,
+ oldpte, target_node, max_nr, &nr);
+ if (!folio)
continue;
+ }
- oldpte = ptep_modify_prot_start(vma, addr, pte);
+ if (!folio && (max_nr != 1) && maybe_contiguous_pte_pfns(pte, oldpte)) {
+ folio = vm_normal_folio(vma, addr, oldpte);
+ if (folio && folio_test_large(folio))
+ nr = folio_pte_batch(folio, addr, pte,
+ oldpte, max_nr, flags, NULL, NULL, NULL);
+ }
+ oldpte = modify_prot_start_ptes(vma, addr, pte, nr);
ptent = pte_modify(oldpte, newprot);
if (uffd_wp)
@@ -223,13 +244,13 @@ static long change_pte_range(struct mmu_gather *tlb,
*/
if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
!pte_write(ptent) &&
- can_change_ptes_writable(vma, addr, ptent, folio, 1))
+ can_change_ptes_writable(vma, addr, ptent, folio, nr))
ptent = pte_mkwrite(ptent, vma);
- ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+ modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr);
if (pte_needs_flush(oldpte, ptent))
- tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
- pages++;
+ tlb_flush_pte_range(tlb, addr, nr * PAGE_SIZE);
+ pages += nr;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
pte_t newpte;
--
2.30.2
* Re: [PATCH v2 1/7] mm: Refactor code in mprotect
2025-04-29 5:23 ` [PATCH v2 1/7] mm: Refactor code in mprotect Dev Jain
@ 2025-04-29 6:41 ` Anshuman Khandual
2025-04-29 6:54 ` Dev Jain
2025-04-29 11:00 ` Lorenzo Stoakes
1 sibling, 1 reply; 53+ messages in thread
From: Anshuman Khandual @ 2025-04-29 6:41 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
namit, hughd, yang, ziy
On 4/29/25 10:53, Dev Jain wrote:
> Reduce indentation in change_pte_range() by refactoring some of the code
> into a new function. No functional change.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/mprotect.c | 116 +++++++++++++++++++++++++++++---------------------
> 1 file changed, 68 insertions(+), 48 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88608d0dc2c2..70f59aa8c2a8 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,6 +83,71 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> return pte_dirty(pte);
> }
>
> +
> +
> +static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
> + int target_node)
> +{
> + bool toptier;
> + int nid;
> +
> + /* Also skip shared copy-on-write pages */
> + if (is_cow_mapping(vma->vm_flags) &&
> + (folio_maybe_dma_pinned(folio) ||
> + folio_maybe_mapped_shared(folio)))
> + return true;
> +
> + /*
> + * While migration can move some dirty pages,
> + * it cannot move them all from MIGRATE_ASYNC
> + * context.
> + */
> + if (folio_is_file_lru(folio) &&
> + folio_test_dirty(folio))
> + return true;
> +
> + /*
> + * Don't mess with PTEs if page is already on the node
> + * a single-threaded process is running on.
> + */
> + nid = folio_nid(folio);
> + if (target_node == nid)
> + return true;
> + toptier = node_is_toptier(nid);
> +
> + /*
> + * Skip scanning top tier node if normal numa
> + * balancing is disabled
> + */
> + if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> + toptier)
> + return true;
> + return false;
> +}
> +
> +static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
> + unsigned long addr, pte_t oldpte, int target_node)
> +{
> + struct folio *folio;
> + int ret;
> +
> + /* Avoid TLB flush if possible */
> + if (pte_protnone(oldpte))
> + return true;
> +
> + folio = vm_normal_folio(vma, addr, oldpte);
> + if (!folio || folio_is_zone_device(folio) ||
> + folio_test_ksm(folio))
> + return true;
> + ret = prot_numa_skip(vma, folio, target_node);
What purpose does creating the additional helper prot_numa_skip() serve?
IOW - why can't all of this be inside prot_numa_avoid_fault() itself?
> + if (ret)
> + return ret;
> + if (folio_use_access_time(folio))
> + folio_xchg_access_time(folio,
> + jiffies_to_msecs(jiffies));
> + return false;
> +}
> +
> static long change_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
> unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -116,56 +181,11 @@ static long change_pte_range(struct mmu_gather *tlb,
> * Avoid trapping faults against the zero or KSM
> * pages. See similar comment in change_huge_pmd.
> */
> - if (prot_numa) {
> - struct folio *folio;
> - int nid;
> - bool toptier;
> -
> - /* Avoid TLB flush if possible */
> - if (pte_protnone(oldpte))
> - continue;
> -
> - folio = vm_normal_folio(vma, addr, oldpte);
> - if (!folio || folio_is_zone_device(folio) ||
> - folio_test_ksm(folio))
> - continue;
> -
> - /* Also skip shared copy-on-write pages */
> - if (is_cow_mapping(vma->vm_flags) &&
> - (folio_maybe_dma_pinned(folio) ||
> - folio_maybe_mapped_shared(folio)))
> - continue;
> -
> - /*
> - * While migration can move some dirty pages,
> - * it cannot move them all from MIGRATE_ASYNC
> - * context.
> - */
> - if (folio_is_file_lru(folio) &&
> - folio_test_dirty(folio))
> + if (prot_numa &&
> + prot_numa_avoid_fault(vma, addr,
> + oldpte, target_node))
> continue;
>
> - /*
> - * Don't mess with PTEs if page is already on the node
> - * a single-threaded process is running on.
> - */
> - nid = folio_nid(folio);
> - if (target_node == nid)
> - continue;
> - toptier = node_is_toptier(nid);
> -
> - /*
> - * Skip scanning top tier node if normal numa
> - * balancing is disabled
> - */
> - if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> - toptier)
> - continue;
> - if (folio_use_access_time(folio))
> - folio_xchg_access_time(folio,
> - jiffies_to_msecs(jiffies));
> - }
> -
> oldpte = ptep_modify_prot_start(vma, addr, pte);
> ptent = pte_modify(oldpte, newprot);
>
* Re: [PATCH v2 1/7] mm: Refactor code in mprotect
2025-04-29 6:41 ` Anshuman Khandual
@ 2025-04-29 6:54 ` Dev Jain
0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-29 6:54 UTC (permalink / raw)
To: Anshuman Khandual, akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
namit, hughd, yang, ziy
On 29/04/25 12:11 pm, Anshuman Khandual wrote:
>
>
> On 4/29/25 10:53, Dev Jain wrote:
>> Reduce indentation in change_pte_range() by refactoring some of the code
>> into a new function. No functional change.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> mm/mprotect.c | 116 +++++++++++++++++++++++++++++---------------------
>> 1 file changed, 68 insertions(+), 48 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 88608d0dc2c2..70f59aa8c2a8 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -83,6 +83,71 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> return pte_dirty(pte);
>> }
>>
>> +
>> +
>> +static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
>> + int target_node)
>> +{
>> + bool toptier;
>> + int nid;
>> +
>> + /* Also skip shared copy-on-write pages */
>> + if (is_cow_mapping(vma->vm_flags) &&
>> + (folio_maybe_dma_pinned(folio) ||
>> + folio_maybe_mapped_shared(folio)))
>> + return true;
>> +
>> + /*
>> + * While migration can move some dirty pages,
>> + * it cannot move them all from MIGRATE_ASYNC
>> + * context.
>> + */
>> + if (folio_is_file_lru(folio) &&
>> + folio_test_dirty(folio))
>> + return true;
>> +
>> + /*
>> + * Don't mess with PTEs if page is already on the node
>> + * a single-threaded process is running on.
>> + */
>> + nid = folio_nid(folio);
>> + if (target_node == nid)
>> + return true;
>> + toptier = node_is_toptier(nid);
>> +
>> + /*
>> + * Skip scanning top tier node if normal numa
>> + * balancing is disabled
>> + */
>> + if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>> + toptier)
>> + return true;
>> + return false;
>> +}
>> +
>> +static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t oldpte, int target_node)
>> +{
>> + struct folio *folio;
>> + int ret;
>> +
>> + /* Avoid TLB flush if possible */
>> + if (pte_protnone(oldpte))
>> + return true;
>> +
>> + folio = vm_normal_folio(vma, addr, oldpte);
>> + if (!folio || folio_is_zone_device(folio) ||
>> + folio_test_ksm(folio))
>> + return true;
>> + ret = prot_numa_skip(vma, folio, target_node);
>
> What purpose does creating the additional helper prot_numa_skip() serve?
> IOW - why can't all of this be inside prot_numa_avoid_fault() itself?
That is in preparation for patch 2, so that I can skip a PTE batch.
>
>> + if (ret)
>> + return ret;
>> + if (folio_use_access_time(folio))
>> + folio_xchg_access_time(folio,
>> + jiffies_to_msecs(jiffies));
>> + return false;
>> +}
>> +
>> static long change_pte_range(struct mmu_gather *tlb,
>> struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>> unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -116,56 +181,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>> * Avoid trapping faults against the zero or KSM
>> * pages. See similar comment in change_huge_pmd.
>> */
>> - if (prot_numa) {
>> - struct folio *folio;
>> - int nid;
>> - bool toptier;
>> -
>> - /* Avoid TLB flush if possible */
>> - if (pte_protnone(oldpte))
>> - continue;
>> -
>> - folio = vm_normal_folio(vma, addr, oldpte);
>> - if (!folio || folio_is_zone_device(folio) ||
>> - folio_test_ksm(folio))
>> - continue;
>> -
>> - /* Also skip shared copy-on-write pages */
>> - if (is_cow_mapping(vma->vm_flags) &&
>> - (folio_maybe_dma_pinned(folio) ||
>> - folio_maybe_mapped_shared(folio)))
>> - continue;
>> -
>> - /*
>> - * While migration can move some dirty pages,
>> - * it cannot move them all from MIGRATE_ASYNC
>> - * context.
>> - */
>> - if (folio_is_file_lru(folio) &&
>> - folio_test_dirty(folio))
>> + if (prot_numa &&
>> + prot_numa_avoid_fault(vma, addr,
>> + oldpte, target_node))
>> continue;
>>
>> - /*
>> - * Don't mess with PTEs if page is already on the node
>> - * a single-threaded process is running on.
>> - */
>> - nid = folio_nid(folio);
>> - if (target_node == nid)
>> - continue;
>> - toptier = node_is_toptier(nid);
>> -
>> - /*
>> - * Skip scanning top tier node if normal numa
>> - * balancing is disabled
>> - */
>> - if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>> - toptier)
>> - continue;
>> - if (folio_use_access_time(folio))
>> - folio_xchg_access_time(folio,
>> - jiffies_to_msecs(jiffies));
>> - }
>> -
>> oldpte = ptep_modify_prot_start(vma, addr, pte);
>> ptent = pte_modify(oldpte, newprot);
>>
* Re: [PATCH v2 0/7] Optimize mprotect for large folios
2025-04-29 5:23 [PATCH v2 0/7] Optimize mprotect for large folios Dev Jain
` (6 preceding siblings ...)
2025-04-29 5:23 ` [PATCH v2 7/7] mm: Optimize mprotect() through PTE-batching Dev Jain
@ 2025-04-29 7:06 ` Lance Yang
2025-04-29 9:02 ` Dev Jain
2025-04-29 11:03 ` Lorenzo Stoakes
8 siblings, 1 reply; 53+ messages in thread
From: Lance Yang @ 2025-04-29 7:06 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
Hey Dev,
Hmm... I also hit the same compilation errors:
In file included from ./include/linux/kasan.h:37,
from ./include/linux/slab.h:260,
from ./include/linux/crypto.h:19,
from arch/x86/kernel/asm-offsets.c:9:
./include/linux/pgtable.h: In function ‘modify_prot_start_ptes’:
./include/linux/pgtable.h:905:15: error: implicit declaration of function ‘ptep_modify_prot_start’ [-Werror=implicit-function-declaration]
  905 |         pte = ptep_modify_prot_start(vma, addr, ptep);
      |               ^~~~~~~~~~~~~~~~~~~~~~
./include/linux/pgtable.h:905:15: error: incompatible types when assigning to type ‘pte_t’ from type ‘int’
./include/linux/pgtable.h:909:27: error: incompatible types when assigning to type ‘pte_t’ from type ‘int’
  909 |                 tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
      |                           ^~~~~~~~~~~~~~~~~~~~~~
./include/linux/pgtable.h: In function ‘modify_prot_commit_ptes’:
./include/linux/pgtable.h:925:17: error: implicit declaration of function ‘ptep_modify_prot_commit’ [-Werror=implicit-function-declaration]
  925 |                 ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~
./include/linux/pgtable.h: At top level:
./include/linux/pgtable.h:1360:21: error: conflicting types for ‘ptep_modify_prot_start’; have ‘pte_t(struct vm_area_struct *, long unsigned int, pte_t *)’
 1360 | static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
      |                     ^~~~~~~~~~~~~~~~~~~~~~
./include/linux/pgtable.h:905:15: note: previous implicit declaration of ‘ptep_modify_prot_start’ with type ‘int()’
  905 |         pte = ptep_modify_prot_start(vma, addr, ptep);
      |               ^~~~~~~~~~~~~~~~~~~~~~
./include/linux/pgtable.h:1371:20: warning: conflicting types for ‘ptep_modify_prot_commit’; have ‘void(struct vm_area_struct *, long unsigned int, pte_t *, pte_t, pte_t)’
 1371 | static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
      |                    ^~~~~~~~~~~~~~~~~~~~~~~
./include/linux/pgtable.h:1371:20: error: static declaration of ‘ptep_modify_prot_commit’ follows non-static declaration
./include/linux/pgtable.h:925:17: note: previous implicit declaration of ‘ptep_modify_prot_commit’ with type ‘void(struct vm_area_struct *, long unsigned int, pte_t *, pte_t, pte_t)’
  925 |                 ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~
  CC      /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/libstring.o
  CC      /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/libctype.o
  CC      /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/str_error_r.o
  CC      /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/librbtree.o
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:98: arch/x86/kernel/asm-offsets.s] Error 1
make[1]: *** [/home/runner/work/mm-test-robot/mm-test-robot/linux/Makefile:1280: prepare0] Error 2
make[1]: *** Waiting for unfinished jobs....
  LD      /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/objtool-in.o
  LINK    /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/objtool
make: *** [Makefile:248: __sub-make] Error 2
Well, modify_prot_start_ptes() calls ptep_modify_prot_start(), but x86
does not define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION. To avoid
implicit declaration errors, the architecture-independent
ptep_modify_prot_start() must be defined before modify_prot_start_ptes().
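(A stand-alone illustration of the ordering problem - hypothetical code,
not from the kernel:
	static inline int caller(void) { return callee(); }	/* error: implicit declaration of 'callee' */
	static inline int callee(void) { return 0; }
In C, callee()'s declaration must be visible at the point of use, which is
exactly what moving the helpers below the generic
ptep_modify_prot_start/commit definitions restores.)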
With the changes below, things work correctly now ;)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 10cdb87ccecf..d9d6c49bb914 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -895,44 +895,6 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
}
#endif
-/* See the comment for ptep_modify_prot_start */
-#ifndef modify_prot_start_ptes
-static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep, unsigned int nr)
-{
- pte_t pte, tmp_pte;
-
- pte = ptep_modify_prot_start(vma, addr, ptep);
- while (--nr) {
- ptep++;
- addr += PAGE_SIZE;
- tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
- if (pte_dirty(tmp_pte))
- pte = pte_mkdirty(pte);
- if (pte_young(tmp_pte))
- pte = pte_mkyoung(pte);
- }
- return pte;
-}
-#endif
-
-/* See the comment for ptep_modify_prot_commit */
-#ifndef modify_prot_commit_ptes
-static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
- pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
-{
- for (;;) {
- ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
- if (--nr == 0)
- break;
- ptep++;
- addr += PAGE_SIZE;
- old_pte = pte_next_pfn(old_pte);
- pte = pte_next_pfn(pte);
- }
-}
-#endif
-
/*
* On some architectures hardware does not set page access bit when accessing
* memory page, it is responsibility of software setting this bit. It brings
@@ -1375,6 +1337,45 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
__ptep_modify_prot_commit(vma, addr, ptep, pte);
}
#endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
+
+/* See the comment for ptep_modify_prot_start */
+#ifndef modify_prot_start_ptes
+static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, unsigned int nr)
+{
+ pte_t pte, tmp_pte;
+
+ pte = ptep_modify_prot_start(vma, addr, ptep);
+ while (--nr) {
+ ptep++;
+ addr += PAGE_SIZE;
+ tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
+ if (pte_dirty(tmp_pte))
+ pte = pte_mkdirty(pte);
+ if (pte_young(tmp_pte))
+ pte = pte_mkyoung(pte);
+ }
+ return pte;
+}
+#endif
+
+/* See the comment for ptep_modify_prot_commit */
+#ifndef modify_prot_commit_ptes
+static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
+{
+ for (;;) {
+ ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ old_pte = pte_next_pfn(old_pte);
+ pte = pte_next_pfn(pte);
+ }
+}
+#endif
+
#endif /* CONFIG_MMU */
/*
--
Thanks,
Lance
On 2025/4/29 13:23, Dev Jain wrote:
> This patchset optimizes the mprotect() system call for large folios
> by PTE-batching.
>
> We use the following test cases to measure performance, mprotect()'ing
> the mapped memory to read-only then read-write 40 times:
>
> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> pte-mapping those THPs
> Test case 2: Mapping 1G of memory with 64K mTHPs
> Test case 3: Mapping 1G of memory with 4K pages
>
> Average execution time on arm64, Apple M3:
> Before the patchset:
> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>
> After the patchset:
> T1: 2.1 seconds T2: 2.2 seconds T3: 4.2 seconds
>
> Comparing T1/T2 against T3 before the patchset reveals the regression
> introduced by ptep_get() on a contpte block (7.9s vs 4.2s); the patchset
> also removes this regression. And for large folios we get an almost 74%
> performance improvement (7.9s down to 2.1s).
>
> v1->v2:
> - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
> - Abridge the anon-exclusive condition (Lance Yang)
>
> Dev Jain (7):
> mm: Refactor code in mprotect
> mm: Optimize mprotect() by batch-skipping PTEs
> mm: Add batched versions of ptep_modify_prot_start/commit
> arm64: Add batched version of ptep_modify_prot_start
> arm64: Add batched version of ptep_modify_prot_commit
> mm: Batch around can_change_pte_writable()
> mm: Optimize mprotect() through PTE-batching
>
> arch/arm64/include/asm/pgtable.h | 10 ++
> arch/arm64/mm/mmu.c | 21 +++-
> include/linux/mm.h | 4 +-
> include/linux/pgtable.h | 42 ++++++++
> mm/gup.c | 2 +-
> mm/huge_memory.c | 4 +-
> mm/memory.c | 6 +-
> mm/mprotect.c | 165 ++++++++++++++++++++-----------
> mm/pgtable-generic.c | 16 ++-
> 9 files changed, 198 insertions(+), 72 deletions(-)
>
* Re: [PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs
2025-04-29 5:23 ` [PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs Dev Jain
@ 2025-04-29 7:14 ` Anshuman Khandual
2025-04-29 8:59 ` Dev Jain
2025-04-29 13:19 ` Lorenzo Stoakes
1 sibling, 1 reply; 53+ messages in thread
From: Anshuman Khandual @ 2025-04-29 7:14 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
namit, hughd, yang, ziy
On 4/29/25 10:53, Dev Jain wrote:
> In case of prot_numa, there are various cases in which we can skip to the
> next iteration. Since the skip condition is based on the folio and not
> the PTEs, we can skip a PTE batch.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/mprotect.c | 27 ++++++++++++++++++++-------
> 1 file changed, 20 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 70f59aa8c2a8..ec5d17af7650 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -91,6 +91,9 @@ static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
> bool toptier;
> int nid;
>
> + if (folio_is_zone_device(folio) || folio_test_ksm(folio))
> + return true;
> +
Moving these here from prot_numa_avoid_fault() could have been done
earlier, while adding prot_numa_skip() itself in the previous patch
(in case this helper is determined to be really required).
> /* Also skip shared copy-on-write pages */
> if (is_cow_mapping(vma->vm_flags) &&
> (folio_maybe_dma_pinned(folio) ||
> @@ -126,8 +129,10 @@ static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
> }
>
> static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
> - unsigned long addr, pte_t oldpte, int target_node)
> + unsigned long addr, pte_t *pte, pte_t oldpte, int target_node,
> + int max_nr, int *nr)
> {
> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
Flags are all correct.
> struct folio *folio;
> int ret;
>
> @@ -136,12 +141,16 @@ static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
> return true;
>
> folio = vm_normal_folio(vma, addr, oldpte);
> - if (!folio || folio_is_zone_device(folio) ||
> - folio_test_ksm(folio))
> + if (!folio)
> return true;
> +
> ret = prot_numa_skip(vma, folio, target_node);
> - if (ret)
> + if (ret) {
> + if (folio_test_large(folio) && max_nr != 1)
Conditional checks are all correct.
> + *nr = folio_pte_batch(folio, addr, pte, oldpte,
> + max_nr, flags, NULL, NULL, NULL);
> return ret;
> + }
> if (folio_use_access_time(folio))
> folio_xchg_access_time(folio,
> jiffies_to_msecs(jiffies));
> @@ -159,6 +168,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
> bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> + int nr;
>
> tlb_change_page_size(tlb, PAGE_SIZE);
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -173,8 +183,10 @@ static long change_pte_range(struct mmu_gather *tlb,
> flush_tlb_batched_pending(vma->vm_mm);
> arch_enter_lazy_mmu_mode();
> do {
> + nr = 1;
'nr' resets each iteration.
> oldpte = ptep_get(pte);
> if (pte_present(oldpte)) {
> + int max_nr = (end - addr) >> PAGE_SHIFT;
Small nit - 'max_nr' declaration could be moved earlier along with 'nr'.
> pte_t ptent;
>
> /*
> @@ -182,8 +194,9 @@ static long change_pte_range(struct mmu_gather *tlb,
> * pages. See similar comment in change_huge_pmd.
> */
> if (prot_numa &&
> - prot_numa_avoid_fault(vma, addr,
> - oldpte, target_node))
> + prot_numa_avoid_fault(vma, addr, pte,
> + oldpte, target_node,
> + max_nr, &nr))
> continue;
>
> oldpte = ptep_modify_prot_start(vma, addr, pte);
> @@ -300,7 +313,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> pages++;
> }
> }
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> + } while (pte += nr, addr += nr * PAGE_SIZE, addr != end);
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(pte - 1, ptl);
>
Otherwise LGTM
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-29 5:23 ` [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
@ 2025-04-29 8:39 ` Anshuman Khandual
2025-04-29 9:01 ` Dev Jain
2025-04-29 13:52 ` Lorenzo Stoakes
` (3 subsequent siblings)
4 siblings, 1 reply; 53+ messages in thread
From: Anshuman Khandual @ 2025-04-29 8:39 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
namit, hughd, yang, ziy
On 4/29/25 10:53, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> Architectures can override these helpers.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 38 insertions(+)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..ed287289335f 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> }
> #endif
>
> +/* See the comment for ptep_modify_prot_start */
> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr)
> +{
> + pte_t pte, tmp_pte;
> +
> + pte = ptep_modify_prot_start(vma, addr, ptep);
> + while (--nr) {
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
> + }
> + return pte;
> +}
> +#endif
> +
> +/* See the comment for ptep_modify_prot_commit */
> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> +{
> + for (;;) {
> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + old_pte = pte_next_pfn(old_pte);
> + pte = pte_next_pfn(pte);
> + }
> +}
> +#endif
Is there any particular reason why the first loop here is while { }
based, whereas the second one is a for { } loop?
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
These helpers should be added with at least a single caller using them.
* Re: [PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs
2025-04-29 7:14 ` Anshuman Khandual
@ 2025-04-29 8:59 ` Dev Jain
0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-29 8:59 UTC (permalink / raw)
To: Anshuman Khandual, akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
namit, hughd, yang, ziy
On 29/04/25 12:44 pm, Anshuman Khandual wrote:
> On 4/29/25 10:53, Dev Jain wrote:
>> In case of prot_numa, there are various cases in which we can skip to the
>> next iteration. Since the skip condition is based on the folio and not
>> the PTEs, we can skip a PTE batch.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> mm/mprotect.c | 27 ++++++++++++++++++++-------
>> 1 file changed, 20 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 70f59aa8c2a8..ec5d17af7650 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -91,6 +91,9 @@ static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
>> bool toptier;
>> int nid;
>>
>> + if (folio_is_zone_device(folio) || folio_test_ksm(folio))
>> + return true;
>> +
>
> Moving these here from prot_numa_avoid_fault() could have been done
> earlier, while adding prot_numa_skip() itself in the previous patch
> (in case this helper is determined to be really required).
True. I'll do that.
>
>> /* Also skip shared copy-on-write pages */
>> if (is_cow_mapping(vma->vm_flags) &&
>> (folio_maybe_dma_pinned(folio) ||
>> @@ -126,8 +129,10 @@ static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
>> }
>>
>> static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t oldpte, int target_node)
>> + unsigned long addr, pte_t *pte, pte_t oldpte, int target_node,
>> + int max_nr, int *nr)
>> {
>> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>
> Flags are all correct.
>
>> struct folio *folio;
>> int ret;
>>
>> @@ -136,12 +141,16 @@ static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
>> return true;
>>
>> folio = vm_normal_folio(vma, addr, oldpte);
>> - if (!folio || folio_is_zone_device(folio) ||
>> - folio_test_ksm(folio))
>> + if (!folio)
>> return true;
>> +
>> ret = prot_numa_skip(vma, folio, target_node);
>> - if (ret)
>> + if (ret) {
>> + if (folio_test_large(folio) && max_nr != 1)
>
> Conditional checks are all correct.
>
>> + *nr = folio_pte_batch(folio, addr, pte, oldpte,
>> + max_nr, flags, NULL, NULL, NULL);
>> return ret;
>> + }
>> if (folio_use_access_time(folio))
>> folio_xchg_access_time(folio,
>> jiffies_to_msecs(jiffies));
>> @@ -159,6 +168,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>> bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>> bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>> bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> + int nr;
>>
>> tlb_change_page_size(tlb, PAGE_SIZE);
>> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> @@ -173,8 +183,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>> flush_tlb_batched_pending(vma->vm_mm);
>> arch_enter_lazy_mmu_mode();
>> do {
>> + nr = 1;
>
> 'nr' resets each iteration.
>
>> oldpte = ptep_get(pte);
>> if (pte_present(oldpte)) {
>> + int max_nr = (end - addr) >> PAGE_SHIFT;
>
> Small nit - 'max_nr' declaration could be moved earlier along with 'nr'.
Sure.
>
>> pte_t ptent;
>>
>> /*
>> @@ -182,8 +194,9 @@ static long change_pte_range(struct mmu_gather *tlb,
>> * pages. See similar comment in change_huge_pmd.
>> */
>> if (prot_numa &&
>> - prot_numa_avoid_fault(vma, addr,
>> - oldpte, target_node))
>> + prot_numa_avoid_fault(vma, addr, pte,
>> + oldpte, target_node,
>> + max_nr, &nr))
>> continue;
>>
>> oldpte = ptep_modify_prot_start(vma, addr, pte);
>> @@ -300,7 +313,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>> pages++;
>> }
>> }
>> - } while (pte++, addr += PAGE_SIZE, addr != end);
>> + } while (pte += nr, addr += nr * PAGE_SIZE, addr != end);
>> arch_leave_lazy_mmu_mode();
>> pte_unmap_unlock(pte - 1, ptl);
>>
>
> Otherwise LGTM
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-29 8:39 ` Anshuman Khandual
@ 2025-04-29 9:01 ` Dev Jain
0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-29 9:01 UTC (permalink / raw)
To: Anshuman Khandual, akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
namit, hughd, yang, ziy
On 29/04/25 2:09 pm, Anshuman Khandual wrote:
>
>
> On 4/29/25 10:53, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architecture can override these helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 38 insertions(+)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..ed287289335f 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> }
>> #endif
>>
>> +/* See the comment for ptep_modify_prot_start */
>> +#ifndef modify_prot_start_ptes
>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep, unsigned int nr)
>> +{
>> + pte_t pte, tmp_pte;
>> +
>> + pte = ptep_modify_prot_start(vma, addr, ptep);
>> + while (--nr) {
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>> + if (pte_dirty(tmp_pte))
>> + pte = pte_mkdirty(pte);
>> + if (pte_young(tmp_pte))
>> + pte = pte_mkyoung(pte);
>> + }
>> + return pte;
>> +}
>> +#endif
>> +
>> +/* See the comment for ptep_modify_prot_commit */
>> +#ifndef modify_prot_commit_ptes
>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>> +{
>> + for (;;) {
>> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>> + if (--nr == 0)
>> + break;
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + old_pte = pte_next_pfn(old_pte);
>> + pte = pte_next_pfn(pte);
>> + }
>> +}
>> +#endif
>
> Is there any particular reason why the first loop here is while { }
> based whereas the second one is a for { } loop?
modify_prot_start_ptes follows the pattern of get_and_clear_full_ptes.
modify_prot_commit_ptes follows the pattern of wrprotect_ptes.
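For reference, those two helpers are shaped roughly like this
(paraphrased from include/linux/pgtable.h; declarations and comments
elided):

	/* get_and_clear_full_ptes() shape: accumulate A/D into the return */
	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
	while (--nr) {
		ptep++;
		addr += PAGE_SIZE;
		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
		if (pte_dirty(tmp_pte))
			pte = pte_mkdirty(pte);
		if (pte_young(tmp_pte))
			pte = pte_mkyoung(pte);
	}

	/* wrprotect_ptes() shape: nothing to accumulate or return */
	for (;;) {
		ptep_set_wrprotect(mm, addr, ptep);
		if (--nr == 0)
			break;
		ptep++;
		addr += PAGE_SIZE;
	}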
>
>> +
>> /*
>> * On some architectures hardware does not set page access bit when accessing
>> * memory page, it is responsibility of software setting this bit. It brings
>
> These helpers should be added with at least a single caller using them.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 0/7] Optimize mprotect for large folios
2025-04-29 7:06 ` [PATCH v2 0/7] Optimize mprotect for large folios Lance Yang
@ 2025-04-29 9:02 ` Dev Jain
2025-04-29 10:41 ` Lorenzo Stoakes
0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-04-29 9:02 UTC (permalink / raw)
To: Lance Yang, akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On 29/04/25 12:36 pm, Lance Yang wrote:
> Hey Dev,
>
> Hmm... I also hit the same compilation errors:
>
> In file included from ./include/linux/kasan.h:37,
> from ./include/linux/slab.h:260,
> from ./include/linux/crypto.h:19,
> from arch/x86/kernel/asm-offsets.c:9:
> ./include/linux/pgtable.h: In function ‘modify_prot_start_ptes’:
> ./include/linux/pgtable.h:905:15: error: implicit declaration of
> function ‘ptep_modify_prot_start’ [-Werror=implicit-function-declaration]
> 905 | pte = ptep_modify_prot_start(vma, addr, ptep);
> | ^~~~~~~~~~~~~~~~~~~~~~
> ./include/linux/pgtable.h:905:15: error: incompatible types when
> assigning to type ‘pte_t’ from type ‘int’
> ./include/linux/pgtable.h:909:27: error: incompatible types when
> assigning to type ‘pte_t’ from type ‘int’
> 909 | tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> | ^~~~~~~~~~~~~~~~~~~~~~
> ./include/linux/pgtable.h: In function ‘modify_prot_commit_ptes’:
> ./include/linux/pgtable.h:925:17: error: implicit declaration of
> function ‘ptep_modify_prot_commit’ [-Werror=implicit-function-declaration]
> 925 | ptep_modify_prot_commit(vma, addr, ptep,
> old_pte, pte);
> | ^~~~~~~~~~~~~~~~~~~~~~~
> ./include/linux/pgtable.h: At top level:
> ./include/linux/pgtable.h:1360:21: error: conflicting types for
> ‘ptep_modify_prot_start’; have ‘pte_t(struct vm_area_struct *, long
> unsigned int, pte_t *)’
> 1360 | static inline pte_t ptep_modify_prot_start(struct
> vm_area_struct *vma,
> | ^~~~~~~~~~~~~~~~~~~~~~
> ./include/linux/pgtable.h:905:15: note: previous implicit declaration of
> ‘ptep_modify_prot_start’ with type ‘int()’
> 905 | pte = ptep_modify_prot_start(vma, addr, ptep);
> | ^~~~~~~~~~~~~~~~~~~~~~
> ./include/linux/pgtable.h:1371:20: warning: conflicting types for
> ‘ptep_modify_prot_commit’; have ‘void(struct vm_area_struct *, long
> unsigned int, pte_t *, pte_t, pte_t)’
> 1371 | static inline void ptep_modify_prot_commit(struct
> vm_area_struct *vma,
> | ^~~~~~~~~~~~~~~~~~~~~~~
> ./include/linux/pgtable.h:1371:20: error: static declaration of
> ‘ptep_modify_prot_commit’ follows non-static declaration
> ./include/linux/pgtable.h:925:17: note: previous implicit declaration of
> ‘ptep_modify_prot_commit’ with type ‘void(struct vm_area_struct *, long
> unsigned int, pte_t *, pte_t, pte_t)’
> 925 | ptep_modify_prot_commit(vma, addr, ptep,
> old_pte, pte);
> | ^~~~~~~~~~~~~~~~~~~~~~~
> CC /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/
> libstring.o
> CC /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/
> libctype.o
> CC /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/
> str_error_r.o
> CC /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/
> librbtree.o
> cc1: some warnings being treated as errors
> make[2]: *** [scripts/Makefile.build:98: arch/x86/kernel/asm-offsets.s]
> Error 1
> make[1]: *** [/home/runner/work/mm-test-robot/mm-test-robot/linux/
> Makefile:1280: prepare0] Error 2
> make[1]: *** Waiting for unfinished jobs....
> LD /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/
> objtool-in.o
> LINK /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/
> objtool/objtool
> make: *** [Makefile:248: __sub-make] Error 2
>
> Well, modify_prot_start_ptes() calls ptep_modify_prot_start(), but x86
> does not define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION. To avoid
> implicit declaration errors, the architecture-independent
> ptep_modify_prot_start() must be defined before modify_prot_start_ptes().
>
> With the changes below, things work correctly now ;)
Ah thanks! My bad :(
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 10cdb87ccecf..d9d6c49bb914 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -895,44 +895,6 @@ static inline void wrprotect_ptes(struct mm_struct
> *mm, unsigned long addr,
> }
> #endif
>
> -/* See the comment for ptep_modify_prot_start */
> -#ifndef modify_prot_start_ptes
> -static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep, unsigned int nr)
> -{
> - pte_t pte, tmp_pte;
> -
> - pte = ptep_modify_prot_start(vma, addr, ptep);
> - while (--nr) {
> - ptep++;
> - addr += PAGE_SIZE;
> - tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> - if (pte_dirty(tmp_pte))
> - pte = pte_mkdirty(pte);
> - if (pte_young(tmp_pte))
> - pte = pte_mkyoung(pte);
> - }
> - return pte;
> -}
> -#endif
> -
> -/* See the comment for ptep_modify_prot_commit */
> -#ifndef modify_prot_commit_ptes
> -static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
> unsigned long addr,
> - pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> -{
> - for (;;) {
> - ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> - if (--nr == 0)
> - break;
> - ptep++;
> - addr += PAGE_SIZE;
> - old_pte = pte_next_pfn(old_pte);
> - pte = pte_next_pfn(pte);
> - }
> -}
> -#endif
> -
> /*
> * On some architectures hardware does not set page access bit when
> accessing
> * memory page, it is responsibility of software setting this bit. It
> brings
> @@ -1375,6 +1337,45 @@ static inline void ptep_modify_prot_commit(struct
> vm_area_struct *vma,
> __ptep_modify_prot_commit(vma, addr, ptep, pte);
> }
> #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> +
> +/* See the comment for ptep_modify_prot_start */
> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr)
> +{
> + pte_t pte, tmp_pte;
> +
> + pte = ptep_modify_prot_start(vma, addr, ptep);
> + while (--nr) {
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
> + }
> + return pte;
> +}
> +#endif
> +
> +/* See the comment for ptep_modify_prot_commit */
> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
> unsigned long addr,
> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> +{
> + for (;;) {
> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + old_pte = pte_next_pfn(old_pte);
> + pte = pte_next_pfn(pte);
> + }
> +}
> +#endif
> +
> #endif /* CONFIG_MMU */
>
> /*
> --
>
> Thanks,
> Lance
>
> On 2025/4/29 13:23, Dev Jain wrote:
>> [snip]
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
2025-04-29 5:23 ` [PATCH v2 6/7] mm: Batch around can_change_pte_writable() Dev Jain
@ 2025-04-29 9:15 ` David Hildenbrand
2025-04-29 9:19 ` David Hildenbrand
2025-04-30 6:17 ` kernel test robot
2 siblings, 0 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-04-29 9:15 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On 29.04.25 07:23, Dev Jain wrote:
> In preparation for patch 7, we need to properly batch around
> can_change_pte_writable(). We batch around pte_needs_soft_dirty_wp() by
> the corresponding fpb flag, we batch around the page-anon exclusive check
> using folio_maybe_mapped_shared(); modify_prot_start_ptes() collects the
> dirty and access bits across the batch, therefore batching across
> pte_dirty(): this is correct since the dirty bit on the PTE really
> is just an indication that the folio got written to, so even if
> the PTE is not actually dirty (but one of the PTEs in the batch is),
> the wp-fault optimization can be made.
If you want to add a batched version of can_change_pte_writable(), do it
right away instead of just adding parameters to functions.
Then, add a simple
#define can_change_pte_writable(...) can_change_ptes_writable(..., 1);
So you don't have to touch all callers and don't have to update the
function name in comments.
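Roughly, as a sketch (the exact prototype is up to you, modulo whatever
extra parameters the final version needs):

	bool can_change_ptes_writable(struct vm_area_struct *vma,
			unsigned long addr, pte_t pte, unsigned int nr);

	#define can_change_pte_writable(vma, addr, pte) \
		can_change_ptes_writable(vma, addr, pte, 1)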
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
2025-04-29 5:23 ` [PATCH v2 6/7] mm: Batch around can_change_pte_writable() Dev Jain
2025-04-29 9:15 ` David Hildenbrand
@ 2025-04-29 9:19 ` David Hildenbrand
2025-04-29 9:27 ` David Hildenbrand
2025-04-30 6:17 ` kernel test robot
2 siblings, 1 reply; 53+ messages in thread
From: David Hildenbrand @ 2025-04-29 9:19 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
> #include "internal.h"
>
> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> - pte_t pte)
> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
> + pte_t pte, struct folio *folio, unsigned int nr)
> {
> struct page *page;
>
> @@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> * write-fault handler similarly would map them writable without
> * any additional checks while holding the PT lock.
> */
> - page = vm_normal_page(vma, addr, pte);
> - return page && PageAnon(page) && PageAnonExclusive(page);
> + if (!folio)
> + folio = vm_normal_folio(vma, addr, pte);
> + return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
Oh no, now I spot it. That is horribly wrong.
Please understand first what you are doing.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
2025-04-29 9:19 ` David Hildenbrand
@ 2025-04-29 9:27 ` David Hildenbrand
2025-04-29 13:57 ` Lorenzo Stoakes
2025-05-06 9:16 ` Dev Jain
0 siblings, 2 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-04-29 9:27 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On 29.04.25 11:19, David Hildenbrand wrote:
>
>> [snip]
>
> Oh no, now I spot it. That is horribly wrong.
>
> Please understand first what you are doing.
Also, would expect that the cow.c selftest would catch that:
"vmsplice() + unmap in child with mprotect() optimization"
After fork() we have a R/O PTE in the parent. Our child then uses
vmsplice() and unmaps the R/O PTE, meaning it is only left mapped by the
parent.
ret = mprotect(mem, size, PROT_READ);
ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);
would turn the PTE writable, although it shouldn't.
If that test case does not detect the issue you're introducing, we
should look into adding a test case that detects it.
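In outline, the scenario is (simplified; the real test lives in
tools/testing/selftests/mm/cow.c, names here are illustrative):

	if (fork() == 0) {
		/* child: pin the pages via vmsplice(), then unmap */
		vmsplice(pipefd[1], &iov, 1, 0);
		munmap(mem, size);
		exit(0);
	}
	/* parent: the post-fork R/O PTE is now mapped only here */
	ret = mprotect(mem, size, PROT_READ);
	ret |= mprotect(mem, size, PROT_READ | PROT_WRITE);
	/*
	 * A write to mem must still trigger CoW; the data the child fed
	 * into the pipe must not observe the parent's new values.
	 */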
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 0/7] Optimize mprotect for large folios
2025-04-29 9:02 ` Dev Jain
@ 2025-04-29 10:41 ` Lorenzo Stoakes
2025-04-30 5:42 ` Dev Jain
0 siblings, 1 reply; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-04-29 10:41 UTC (permalink / raw)
To: Dev Jain
Cc: Lance Yang, akpm, ryan.roberts, david, willy, linux-mm,
linux-kernel, catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
FWIW can confirm the same thing. Lance's fixes sort most of it out, but I also
get this error:
mm/mprotect.c: In function ‘can_change_ptes_writable’:
mm/mprotect.c:46:22: error: unused variable ‘page’ [-Werror=unused-variable]
46 | struct page *page;
| ^~~~
So you also need to remove this unused variable at the top of
can_change_ptes_writable().
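i.e., on top of the series, something like (sketch):

	--- a/mm/mprotect.c
	+++ b/mm/mprotect.c
	@@ bool can_change_ptes_writable(struct vm_area_struct *vma, ...
	 {
	-	struct page *page;
	-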
Cheers, Lorenzo
On Tue, Apr 29, 2025 at 02:32:59PM +0530, Dev Jain wrote:
> [snip]
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 1/7] mm: Refactor code in mprotect
2025-04-29 5:23 ` [PATCH v2 1/7] mm: Refactor code in mprotect Dev Jain
2025-04-29 6:41 ` Anshuman Khandual
@ 2025-04-29 11:00 ` Lorenzo Stoakes
1 sibling, 0 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-04-29 11:00 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
For changes like this, difftastic comes in very handy :)
On Tue, Apr 29, 2025 at 10:53:30AM +0530, Dev Jain wrote:
> Reduce indentation in change_pte_range() by refactoring some of the code
> into a new function. No functional change.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
Overall a big fan of the intent of this patch! This is a nice cleanup, just
need to nail down details.
> ---
> mm/mprotect.c | 116 +++++++++++++++++++++++++++++---------------------
> 1 file changed, 68 insertions(+), 48 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88608d0dc2c2..70f59aa8c2a8 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,6 +83,71 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> return pte_dirty(pte);
> }
>
> +
> +
Nit: stray extra newline.
> +static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
> + int target_node)
This is a bit weird, it's like you have two functions to determine whether to
skip a PTE entry, but named differently?
I think you say in response to a comment elsewhere that you intend to further
split things up in subsequent patches, but this kinda bugs me as subjective as
it is :)
I'd say rename prot_numa_avoid_fault() -> can_skip_prot_numa_pte()
And this to can_skip_prot_numa_folio()?
Then again, the below function does some folio stuff too, so I'm not sure
exactly what the separation is? Can you explain?
Also it'd be good to add some brief comment, something like 'the prot_numa
change-prot (cp) flag indicates that this protection change is due to NUMA
hinting, we determine if we actually have work to do or can skip this folio
entirely'.
Or equivalent in the below function.
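i.e. something like:

	/*
	 * MM_CP_PROT_NUMA indicates this protection change is due to NUMA
	 * hinting: work out whether we actually have anything to do, or
	 * whether we can skip this folio entirely.
	 */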
> +{
> + bool toptier;
> + int nid;
> +
> + /* Also skip shared copy-on-write pages */
> + if (is_cow_mapping(vma->vm_flags) &&
> + (folio_maybe_dma_pinned(folio) ||
> + folio_maybe_mapped_shared(folio)))
> + return true;
> +
> + /*
> + * While migration can move some dirty pages,
> + * it cannot move them all from MIGRATE_ASYNC
> + * context.
> + */
> + if (folio_is_file_lru(folio) &&
> + folio_test_dirty(folio))
> + return true;
> +
> + /*
> + * Don't mess with PTEs if page is already on the node
> + * a single-threaded process is running on.
> + */
> + nid = folio_nid(folio);
> + if (target_node == nid)
> + return true;
> + toptier = node_is_toptier(nid);
> +
> + /*
> + * Skip scanning top tier node if normal numa
> + * balancing is disabled
> + */
> + if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> + toptier)
> + return true;
> + return false;
> +}
> +
> +static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
> + unsigned long addr, pte_t oldpte, int target_node)
> +{
> + struct folio *folio;
> + int ret;
> +
> + /* Avoid TLB flush if possible */
> + if (pte_protnone(oldpte))
> + return true;
> +
> + folio = vm_normal_folio(vma, addr, oldpte);
> + if (!folio || folio_is_zone_device(folio) ||
> + folio_test_ksm(folio))
> + return true;
> + ret = prot_numa_skip(vma, folio, target_node);
> + if (ret)
> + return ret;
This is a bit silly as it returns a boolean value, surely;
if (prot_numa_skip(vma, folio, target_node))
return true;
Is better?
> + if (folio_use_access_time(folio))
> + folio_xchg_access_time(folio,
> + jiffies_to_msecs(jiffies));
Why is this here and not in prot_numa_skip() or whatever we rename it to?
> + return false;
> +}
> +
> static long change_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
> unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -116,56 +181,11 @@ static long change_pte_range(struct mmu_gather *tlb,
> * Avoid trapping faults against the zero or KSM
> * pages. See similar comment in change_huge_pmd.
> */
> - if (prot_numa) {
> - struct folio *folio;
> - int nid;
> - bool toptier;
> -
> - /* Avoid TLB flush if possible */
> - if (pte_protnone(oldpte))
> - continue;
> -
> - folio = vm_normal_folio(vma, addr, oldpte);
> - if (!folio || folio_is_zone_device(folio) ||
> - folio_test_ksm(folio))
> - continue;
> -
> - /* Also skip shared copy-on-write pages */
> - if (is_cow_mapping(vma->vm_flags) &&
> - (folio_maybe_dma_pinned(folio) ||
> - folio_maybe_mapped_shared(folio)))
> - continue;
> -
> - /*
> - * While migration can move some dirty pages,
> - * it cannot move them all from MIGRATE_ASYNC
> - * context.
> - */
> - if (folio_is_file_lru(folio) &&
> - folio_test_dirty(folio))
> + if (prot_numa &&
> + prot_numa_avoid_fault(vma, addr,
> + oldpte, target_node))
> continue;
>
> - /*
> - * Don't mess with PTEs if page is already on the node
> - * a single-threaded process is running on.
> - */
> - nid = folio_nid(folio);
> - if (target_node == nid)
> - continue;
> - toptier = node_is_toptier(nid);
> -
> - /*
> - * Skip scanning top tier node if normal numa
> - * balancing is disabled
> - */
> - if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> - toptier)
> - continue;
> - if (folio_use_access_time(folio))
> - folio_xchg_access_time(folio,
> - jiffies_to_msecs(jiffies));
> - }
> -
> oldpte = ptep_modify_prot_start(vma, addr, pte);
> ptent = pte_modify(oldpte, newprot);
>
> --
> 2.30.2
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 0/7] Optimize mprotect for large folios
2025-04-29 5:23 [PATCH v2 0/7] Optimize mprotect for large folios Dev Jain
` (7 preceding siblings ...)
2025-04-29 7:06 ` [PATCH v2 0/7] Optimize mprotect for large folios Lance Yang
@ 2025-04-29 11:03 ` Lorenzo Stoakes
2025-04-29 14:02 ` David Hildenbrand
8 siblings, 1 reply; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-04-29 11:03 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, hughd, yang, ziy
-cc namit@vmware.com
Also,
On future respins can we please drop Namit from vmware from cc list?
I am being spammed with these messages whenever I reply here:
'A custom mail flow rule created by an admin at onevmw.onmicrosoft.com has
blocked your message. Recipient is not authorized to accept external mail'
Or ask him to fix this :)
Cheers, Lorenzo
On Tue, Apr 29, 2025 at 10:53:29AM +0530, Dev Jain wrote:
> [snip]
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs
2025-04-29 5:23 ` [PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs Dev Jain
2025-04-29 7:14 ` Anshuman Khandual
@ 2025-04-29 13:19 ` Lorenzo Stoakes
2025-04-30 6:37 ` Dev Jain
1 sibling, 1 reply; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-04-29 13:19 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
Very very very nitty on subject (sorry I realise this is annoying :P) -
generally don't need to capitalise 'Optimize' here :>)
Generally I like the idea here. But there are some issues with the
implementation.
On Tue, Apr 29, 2025 at 10:53:31AM +0530, Dev Jain wrote:
> In case of prot_numa, there are various cases in which we can skip to the
> next iteration. Since the skip condition is based on the folio and not
> the PTEs, we can skip a PTE batch.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/mprotect.c | 27 ++++++++++++++++++++-------
> 1 file changed, 20 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 70f59aa8c2a8..ec5d17af7650 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -91,6 +91,9 @@ static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
> bool toptier;
> int nid;
>
> + if (folio_is_zone_device(folio) || folio_test_ksm(folio))
> + return true;
> +
Hm why not just put this here from the start? I think you should put this back
in the prior commit.
> /* Also skip shared copy-on-write pages */
> if (is_cow_mapping(vma->vm_flags) &&
> (folio_maybe_dma_pinned(folio) ||
> @@ -126,8 +129,10 @@ static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
> }
>
> static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
> - unsigned long addr, pte_t oldpte, int target_node)
> + unsigned long addr, pte_t *pte, pte_t oldpte, int target_node,
> + int max_nr, int *nr)
Hate this ptr to nr.
Why not just return nr, if it's 0 then skip? Simple!
> {
> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> struct folio *folio;
> int ret;
>
> @@ -136,12 +141,16 @@ static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
> return true;
>
> folio = vm_normal_folio(vma, addr, oldpte);
> - if (!folio || folio_is_zone_device(folio) ||
> - folio_test_ksm(folio))
> + if (!folio)
> return true;
> +
Very nitty, but stray extra line unless intended...
Not sure why we can't just put this !folio check in prot_numa_skip()?
> ret = prot_numa_skip(vma, folio, target_node);
> - if (ret)
> + if (ret) {
> + if (folio_test_large(folio) && max_nr != 1)
> + *nr = folio_pte_batch(folio, addr, pte, oldpte,
> + max_nr, flags, NULL, NULL, NULL);
So max_nr can be <= 0 too? Shouldn't this be max_nr > 1?
> return ret;
Again x = fn_return_bool(); if (x) { return x; } is a bit silly, just do if
(fn_return_bool()) { return true; }.
If we return the number of pages processed, then this can become really
simple. (I feel like maybe we should abstract the large folio handling
here, though it'd be a tiny function, so hm.)

Anyway, assuming we leave it in place and return the number of pages
processed, this can become:
if (prot_numa_skip(vma, folio, target_node)) {
if (folio_test_large(folio) && max_nr > 1)
return folio_pte_batch(folio, addr, pte, oldpte, max_nr, flags,
NULL, NULL, NULL);
return 1;
}
Which is neater I think!
> + }
> if (folio_use_access_time(folio))
> folio_xchg_access_time(folio,
> jiffies_to_msecs(jiffies));
> @@ -159,6 +168,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
> bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> + int nr;
>
> tlb_change_page_size(tlb, PAGE_SIZE);
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -173,8 +183,10 @@ static long change_pte_range(struct mmu_gather *tlb,
> flush_tlb_batched_pending(vma->vm_mm);
> arch_enter_lazy_mmu_mode();
> do {
> + nr = 1;
> oldpte = ptep_get(pte);
> if (pte_present(oldpte)) {
> + int max_nr = (end - addr) >> PAGE_SHIFT;
Not a fan of open-coding this. Since we already provide addr, why not just
provide end as well and have prot_numa_avoid_fault() calculate it?
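Something like this (sketch, reusing the computation from this patch):

	static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
			unsigned long addr, unsigned long end, pte_t *pte,
			pte_t oldpte, int target_node, int *nr)
	{
		const int max_nr = (end - addr) >> PAGE_SHIFT;
		...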
> pte_t ptent;
>
> /*
> @@ -182,8 +194,9 @@ static long change_pte_range(struct mmu_gather *tlb,
> * pages. See similar comment in change_huge_pmd.
> */
> if (prot_numa &&
> - prot_numa_avoid_fault(vma, addr,
> - oldpte, target_node))
> + prot_numa_avoid_fault(vma, addr, pte,
> + oldpte, target_node,
> + max_nr, &nr))
> continue;
>
> oldpte = ptep_modify_prot_start(vma, addr, pte);
> @@ -300,7 +313,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> pages++;
> }
> }
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> + } while (pte += nr, addr += nr * PAGE_SIZE, addr != end);
This is icky, having 'nr' here like this.
But alternatives might be _even more_ icky (that is, advancing both in
prot_numa_avoid_fault()), so probably we need to keep it like this.
Maybe more a moan at the C programming language tbh haha!
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(pte - 1, ptl);
>
> --
> 2.30.2
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-29 5:23 ` [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
2025-04-29 8:39 ` Anshuman Khandual
@ 2025-04-29 13:52 ` Lorenzo Stoakes
2025-04-30 6:25 ` Dev Jain
2025-04-30 14:09 ` Ryan Roberts
2025-04-30 5:35 ` kernel test robot
` (2 subsequent siblings)
4 siblings, 2 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-04-29 13:52 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> Architecture can override these helpers.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 38 insertions(+)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..ed287289335f 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> }
> #endif
>
> +/* See the comment for ptep_modify_prot_start */
I feel like you really should add a little more here, perhaps point out
that it's batched etc.
> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr)
This name is a bit confusing: it's not just any PTEs, it's the PTE
entries belonging to a large folio, capped to the PTE table, that you
are batching, right?

Perhaps modify_prot_start_large_folio()? Or something with 'batched' in
the name?
We definitely need to mention in comment or name or _somewhere_ the intent
and motivation for this.
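e.g. a comment along these lines (sketch):

	/*
	 * Batched version of ptep_modify_prot_start(): start the protection
	 * update on @nr consecutive present PTEs, all mapping the same
	 * large folio, accumulating the A/D bits of every entry into the
	 * returned PTE value.
	 */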
> +{
> + pte_t pte, tmp_pte;
> +
are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
love this interface, where you require the user to know the number of
remaining PTE entries in a PTE table.
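For reference, in this series the caller ends up computing it roughly
as:

	int max_nr = (end - addr) >> PAGE_SHIFT;
	...
	nr = folio_pte_batch(folio, addr, pte, oldpte, max_nr, flags,
			     NULL, NULL, NULL);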
> + pte = ptep_modify_prot_start(vma, addr, ptep);
> + while (--nr) {
This loop is a bit horrible. It seems needlessly confusing and you're in
_dire_ need of comments to explain what's going on.
So my understanding is, you have the user figure out:
nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
Then, you want to return the pte entry belonging to the start of the large
folio batch, but you want to adjust that pte value to propagate dirty and
young page table flags if any page table entries within the range contain
those page table flags, having called ptep_modify_prot_start() on all of
them?
This is quite a lot to (a) put in a header like this and (b) not
comment or explain.
So maybe something like:
pte = ptep_modify_prot_start(vma, addr, ptep);
/* Iterate through large folio tail PTEs. */
for (pg = 1; pg < nr; pg++) {
pte_t inner_pte;
ptep++;
addr += PAGE_SIZE;
inner_pte = ptep_modify_prot_start(vma, addr, ptep);
/* We must propagate A/D state from tail PTEs. */
if (pte_dirty(inner_pte))
pte = pte_mkdirty(pte);
if (pte_young(inner_pte))
pte = pte_mkyoung(pte);
}
Would work better?
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
Why are you propagating these?
> + }
> + return pte;
> +}
> +#endif
> +
> +/* See the comment for ptep_modify_prot_commit */
Same comments as above, needs more meat on the bones!
> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
Again need to reference large folio, batched or something relevant here,
'ptes' is super vague.
> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
I'm even more concerned about the 'nr' API here now.
So this is now a user-calculated:
min3(large_folio_pages, number of pte entries left in ptep,
number of pte entries left in old_pte)
It really feels like something that should be calculated here, or at least
be broken out more clearly.
You definitely _at the very least_ need to document it in a comment.
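e.g. (sketch):

	/*
	 * Batched version of ptep_modify_prot_commit(): @old_pte and @pte
	 * refer to the first entry, and the PFN is advanced with
	 * pte_next_pfn() for each subsequent one. @nr must not exceed the
	 * number of PTE entries remaining in the page table.
	 */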
> +{
> + for (;;) {
> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> + if (--nr == 0)
> + break;
Why are you doing an infinite loop here with a break like this? Again feels
needlessly confusing.
I think it's ok to duplicate this single line for the sake of clarity,
also.
Which gives us:
unsigned long pg;
ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
for (pg = 1; pg < nr; pg++) {
ptep++;
addr += PAGE_SIZE;
old_pte = pte_next_pfn(old_pte);
pte = pte_next_pfn(pte);
ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
}
There are alternative approaches, but I think an infinite loop that
breaks, and especially the confusing 'if (--nr == 0) break;' pattern, is
much harder to parse than a super simple ranged loop.
> + ptep++;
> + addr += PAGE_SIZE;
> + old_pte = pte_next_pfn(old_pte);
> + pte = pte_next_pfn(pte);
> + }
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> --
> 2.30.2
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
2025-04-29 9:27 ` David Hildenbrand
@ 2025-04-29 13:57 ` Lorenzo Stoakes
2025-04-29 14:00 ` David Hildenbrand
2025-04-30 5:44 ` Dev Jain
2025-05-06 9:16 ` Dev Jain
1 sibling, 2 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-04-29 13:57 UTC (permalink / raw)
To: David Hildenbrand
Cc: Dev Jain, akpm, ryan.roberts, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, hughd, yang, ziy
On Tue, Apr 29, 2025 at 11:27:43AM +0200, David Hildenbrand wrote:
> On 29.04.25 11:19, David Hildenbrand wrote:
> > > [snip]
> >
> > Oh no, now I spot it. That is horribly wrong.
> >
> > Please understand first what you are doing.
>
> Also, would expect that the cow.c selftest would catch that:
>
> "vmsplice() + unmap in child with mprotect() optimization"
>
> After fork() we have a R/O PTE in the parent. Our child then uses vmsplice()
> and unmaps the R/O PTE, meaning it is only left mapped by the parent.
>
> ret = mprotect(mem, size, PROT_READ);
> ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);
>
> should turn the PTE writable, although it shouldn't.
This makes me concerned about the stability of this series as a whole...
>
> If that test case does not detect the issue you're introducing, we should
> look into adding a test case that detects it.
There are 25 tests that fail for the cow self-test with this series
applied:
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with base page
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (16 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (16 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (16 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (32 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (32 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (32 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (64 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (64 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (64 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (128 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (128 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (128 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (256 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (256 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (256 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (512 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (512 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (512 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (1024 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (1024 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (1024 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (2048 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (2048 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (2048 kB)
Dev, please take a little more time to test your series :) The current
patch set doesn't compile without fixes applied, we're at v2, and you've
clearly not run the self-tests, as these also fail.

Please ensure you do a smoke test and check that the series compiles
before sending out, and run the self-tests as well.
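For reference, the cow test can be built and run roughly like so
(assuming a typical kselftest setup):

	$ make -C tools/testing/selftests/mm
	$ ./tools/testing/selftests/mm/cow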
Thanks, Lorenzo
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
2025-04-29 13:57 ` Lorenzo Stoakes
@ 2025-04-29 14:00 ` David Hildenbrand
2025-04-30 5:44 ` Dev Jain
1 sibling, 0 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-04-29 14:00 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Dev Jain, akpm, ryan.roberts, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, hughd, yang, ziy
On 29.04.25 15:57, Lorenzo Stoakes wrote:
> On Tue, Apr 29, 2025 at 11:27:43AM +0200, David Hildenbrand wrote:
>> On 29.04.25 11:19, David Hildenbrand wrote:
>>> [snip]
>>
>> Also, would expect that the cow.c selftest would catch that:
>>
>> "vmsplice() + unmap in child with mprotect() optimization"
>>
>> After fork() we have a R/O PTE in the parent. Our child then uses vmsplice()
>> and unmaps the R/O PTE, meaning it is only left mapped by the parent.
>>
>> ret = mprotect(mem, size, PROT_READ);
>> ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);
>>
>> would then turn the PTE writable, although it shouldn't.
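[ For illustration — a minimal standalone sketch of the scenario described
above, assuming a 4K base page and omitting all error handling; the actual
cow.c selftest is more thorough: ]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	size_t size = 4096;
	unsigned char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned char buf;
	int fds[2];

	memset(mem, 0xaa, size);
	pipe(fds);

	if (fork() == 0) {
		/* Child: pin the page via vmsplice(), then unmap it. */
		struct iovec iov = { .iov_base = mem, .iov_len = size };

		vmsplice(fds[1], &iov, 1, 0);
		munmap(mem, size);
		exit(0);
	}
	wait(NULL);

	/* Parent: the PTE is R/O after fork(); this must not map it R/W. */
	mprotect(mem, size, PROT_READ);
	mprotect(mem, size, PROT_READ | PROT_WRITE);
	mem[0] = 0xff;	/* must trigger CoW; the pipe must keep seeing 0xaa */

	read(fds[0], &buf, 1);
	printf("%s\n", buf == 0xaa ? "OK" : "BUG: pipe observes parent's write");
	return 0;
}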
>
> This makes me concerned about the stability of this series as a whole...
>
>>
>> If that test case does not detect the issue you're introducing, we should
>> look into adding a test case that detects it.
>
> There are 25 tests that fail for the cow self-test with this series
> applied:
>
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with base page
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (16 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (16 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (16 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (32 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (32 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (32 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (64 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (64 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (64 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (128 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (128 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (128 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (256 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (256 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (256 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (512 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (512 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (512 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (1024 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (1024 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (1024 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (2048 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (2048 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (2048 kB)
As expected ... :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 0/7] Optimize mprotect for large folios
2025-04-29 11:03 ` Lorenzo Stoakes
@ 2025-04-29 14:02 ` David Hildenbrand
0 siblings, 0 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-04-29 14:02 UTC (permalink / raw)
To: Lorenzo Stoakes, Dev Jain
Cc: akpm, ryan.roberts, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, hughd, yang, ziy
On 29.04.25 13:03, Lorenzo Stoakes wrote:
> -cc namit@vmware.com
Yes, Nadav is no longer working for VMware.
.mailmap should already include the correct mapping to the gmail address
AFAIKS?
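[ For reference — a .mailmap entry maps an old address to a canonical one
in the form "Proper Name <canonical@address> <old@address>"; the gmail
address below is an assumption for illustration: ]

Nadav Amit <nadav.amit@gmail.com> <namit@vmware.com>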
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-29 5:23 ` [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
2025-04-29 8:39 ` Anshuman Khandual
2025-04-29 13:52 ` Lorenzo Stoakes
@ 2025-04-30 5:35 ` kernel test robot
2025-04-30 5:45 ` kernel test robot
2025-04-30 14:16 ` Ryan Roberts
4 siblings, 0 replies; 53+ messages in thread
From: kernel test robot @ 2025-04-30 5:35 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: oe-kbuild-all, ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy, Dev Jain
Hi Dev,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on arm64/for-next/core linus/master v6.15-rc4 next-20250429]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-Refactor-code-in-mprotect/20250429-133151
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250429052336.18912-4-dev.jain%40arm.com
patch subject: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
config: arm-randconfig-001-20250430 (https://download.01.org/0day-ci/archive/20250430/202504301319.aph2eccP-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 10.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250430/202504301319.aph2eccP-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504301319.aph2eccP-lkp@intel.com/
All error/warnings (new ones prefixed by >>):
In file included from include/linux/mm.h:31,
from arch/arm/kernel/asm-offsets.c:12:
include/linux/pgtable.h: In function 'modify_prot_start_ptes':
include/linux/pgtable.h:901:8: error: implicit declaration of function 'ptep_modify_prot_start' [-Werror=implicit-function-declaration]
901 | pte = ptep_modify_prot_start(vma, addr, ptep);
| ^~~~~~~~~~~~~~~~~~~~~~
include/linux/pgtable.h: In function 'modify_prot_commit_ptes':
include/linux/pgtable.h:921:3: error: implicit declaration of function 'ptep_modify_prot_commit' [-Werror=implicit-function-declaration]
921 | ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
| ^~~~~~~~~~~~~~~~~~~~~~~
include/linux/pgtable.h: At top level:
>> include/linux/pgtable.h:1356:21: error: conflicting types for 'ptep_modify_prot_start'
1356 | static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
| ^~~~~~~~~~~~~~~~~~~~~~
include/linux/pgtable.h:901:8: note: previous implicit declaration of 'ptep_modify_prot_start' was here
901 | pte = ptep_modify_prot_start(vma, addr, ptep);
| ^~~~~~~~~~~~~~~~~~~~~~
>> include/linux/pgtable.h:1367:20: warning: conflicting types for 'ptep_modify_prot_commit'
1367 | static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
| ^~~~~~~~~~~~~~~~~~~~~~~
include/linux/pgtable.h:1367:20: error: static declaration of 'ptep_modify_prot_commit' follows non-static declaration
include/linux/pgtable.h:921:3: note: previous implicit declaration of 'ptep_modify_prot_commit' was here
921 | ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
| ^~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
make[3]: *** [scripts/Makefile.build:98: arch/arm/kernel/asm-offsets.s] Error 1 shuffle=2044487564
make[3]: Target 'prepare' not remade because of errors.
make[2]: *** [Makefile:1280: prepare0] Error 2 shuffle=2044487564
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:248: __sub-make] Error 2 shuffle=2044487564
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:248: __sub-make] Error 2 shuffle=2044487564
make: Target 'prepare' not remade because of errors.
vim +/ptep_modify_prot_start +1356 include/linux/pgtable.h
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1340
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1341 #ifndef __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1342 /*
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1343 * Start a pte protection read-modify-write transaction, which
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1344 * protects against asynchronous hardware modifications to the pte.
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1345 * The intention is not to prevent the hardware from making pte
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1346 * updates, but to prevent any updates it may make from being lost.
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1347 *
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1348 * This does not protect against other software modifications of the
2eb70aab25dd9b include/linux/pgtable.h Bhaskar Chowdhury 2021-05-06 1349 * pte; the appropriate pte lock must be held over the transaction.
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1350 *
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1351 * Note that this interface is intended to be batchable, meaning that
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1352 * ptep_modify_prot_commit may not actually update the pte, but merely
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1353 * queue the update to be done at some later time. The update must be
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1354 * actually committed before the pte lock is released, however.
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1355 */
0cbe3e26abe0cf include/asm-generic/pgtable.h Aneesh Kumar K.V 2019-03-05 @1356 static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1357 unsigned long addr,
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1358 pte_t *ptep)
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1359 {
0cbe3e26abe0cf include/asm-generic/pgtable.h Aneesh Kumar K.V 2019-03-05 1360 return __ptep_modify_prot_start(vma, addr, ptep);
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1361 }
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1362
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1363 /*
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1364 * Commit an update to a pte, leaving any hardware-controlled bits in
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1365 * the PTE unmodified.
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1366 */
0cbe3e26abe0cf include/asm-generic/pgtable.h Aneesh Kumar K.V 2019-03-05 @1367 static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1368 unsigned long addr,
04a8645304500b include/asm-generic/pgtable.h Aneesh Kumar K.V 2019-03-05 1369 pte_t *ptep, pte_t old_pte, pte_t pte)
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1370 {
0cbe3e26abe0cf include/asm-generic/pgtable.h Aneesh Kumar K.V 2019-03-05 1371 __ptep_modify_prot_commit(vma, addr, ptep, pte);
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1372 }
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1373 #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
fe1a6875fcaaac include/asm-generic/pgtable.h Sebastian Siewior 2008-07-15 1374 #endif /* CONFIG_MMU */
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16 1375
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 0/7] Optimize mprotect for large folios
2025-04-29 10:41 ` Lorenzo Stoakes
@ 2025-04-30 5:42 ` Dev Jain
2025-04-30 6:22 ` Lance Yang
0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-04-30 5:42 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Lance Yang, akpm, ryan.roberts, david, willy, linux-mm,
linux-kernel, catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On 29/04/25 4:11 pm, Lorenzo Stoakes wrote:
> FWIW can confirm the same thing. Lance's fixes sort most of it out, but I also
> get this error:
>
> mm/mprotect.c: In function ‘can_change_ptes_writable’:
> mm/mprotect.c:46:22: error: unused variable ‘page’ [-Werror=unused-variable]
> 46 | struct page *page;
> | ^~~~
>
> So you also need to remove this unused variable at the top of
> can_change_ptes_writable().
Strange that my build didn't catch this.
>
> Cheers, Lorenzo
>
> On Tue, Apr 29, 2025 at 02:32:59PM +0530, Dev Jain wrote:
>>
>>
>> On 29/04/25 12:36 pm, Lance Yang wrote:
>>> Hey Dev,
>>>
>>> Hmm... I also hit the same compilation errors:
>>>
>>> In file included from ./include/linux/kasan.h:37,
>>> from ./include/linux/slab.h:260,
>>> from ./include/linux/crypto.h:19,
>>> from arch/x86/kernel/asm-offsets.c:9:
>>> ./include/linux/pgtable.h: In function ‘modify_prot_start_ptes’:
>>> ./include/linux/pgtable.h:905:15: error: implicit declaration of
>>> function ‘ptep_modify_prot_start’
>>> [-Werror=implicit-function-declaration]
>>> 905 | pte = ptep_modify_prot_start(vma, addr, ptep);
>>> | ^~~~~~~~~~~~~~~~~~~~~~
>>> ./include/linux/pgtable.h:905:15: error: incompatible types when
>>> assigning to type ‘pte_t’ from type ‘int’
>>> ./include/linux/pgtable.h:909:27: error: incompatible types when
>>> assigning to type ‘pte_t’ from type ‘int’
>>> 909 | tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>> | ^~~~~~~~~~~~~~~~~~~~~~
>>> ./include/linux/pgtable.h: In function ‘modify_prot_commit_ptes’:
>>> ./include/linux/pgtable.h:925:17: error: implicit declaration of
>>> function ‘ptep_modify_prot_commit’
>>> [-Werror=implicit-function-declaration]
>>> 925 | ptep_modify_prot_commit(vma, addr, ptep,
>>> old_pte, pte);
>>> | ^~~~~~~~~~~~~~~~~~~~~~~
>>> ./include/linux/pgtable.h: At top level:
>>> ./include/linux/pgtable.h:1360:21: error: conflicting types for
>>> ‘ptep_modify_prot_start’; have ‘pte_t(struct vm_area_struct *, long
>>> unsigned int, pte_t *)’
>>> 1360 | static inline pte_t ptep_modify_prot_start(struct
>>> vm_area_struct *vma,
>>> | ^~~~~~~~~~~~~~~~~~~~~~
>>> ./include/linux/pgtable.h:905:15: note: previous implicit declaration of
>>> ‘ptep_modify_prot_start’ with type ‘int()’
>>> 905 | pte = ptep_modify_prot_start(vma, addr, ptep);
>>> | ^~~~~~~~~~~~~~~~~~~~~~
>>> ./include/linux/pgtable.h:1371:20: warning: conflicting types for
>>> ‘ptep_modify_prot_commit’; have ‘void(struct vm_area_struct *, long
>>> unsigned int, pte_t *, pte_t, pte_t)’
>>> 1371 | static inline void ptep_modify_prot_commit(struct
>>> vm_area_struct *vma,
>>> | ^~~~~~~~~~~~~~~~~~~~~~~
>>> ./include/linux/pgtable.h:1371:20: error: static declaration of
>>> ‘ptep_modify_prot_commit’ follows non-static declaration
>>> ./include/linux/pgtable.h:925:17: note: previous implicit declaration of
>>> ‘ptep_modify_prot_commit’ with type ‘void(struct vm_area_struct *, long
>>> unsigned int, pte_t *, pte_t, pte_t)’
>>> 925 | ptep_modify_prot_commit(vma, addr, ptep,
>>> old_pte, pte);
>>> | ^~~~~~~~~~~~~~~~~~~~~~~
>>> CC /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/
>>> libstring.o
>>> CC /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/
>>> libctype.o
>>> CC /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/
>>> str_error_r.o
>>> CC /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/
>>> librbtree.o
>>> cc1: some warnings being treated as errors
>>> make[2]: *** [scripts/Makefile.build:98: arch/x86/kernel/asm-offsets.s]
>>> Error 1
>>> make[1]: *** [/home/runner/work/mm-test-robot/mm-test-robot/linux/
>>> Makefile:1280: prepare0] Error 2
>>> make[1]: *** Waiting for unfinished jobs....
>>> LD /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/objtool/
>>> objtool-in.o
>>> LINK /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/
>>> objtool/objtool
>>> make: *** [Makefile:248: __sub-make] Error 2
>>>
>>> Well, modify_prot_start_ptes() calls ptep_modify_prot_start(), but x86
>>> does not define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION. To avoid
>>> implicit declaration errors, the architecture-independent
>>> ptep_modify_prot_start() must be defined before modify_prot_start_ptes().
>>>
>>> With the changes below, things work correctly now ;)
>>
>> Ah thanks! My bad :(
>>
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 10cdb87ccecf..d9d6c49bb914 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -895,44 +895,6 @@ static inline void wrprotect_ptes(struct mm_struct
>>> *mm, unsigned long addr,
>>> }
>>> #endif
>>>
>>> -/* See the comment for ptep_modify_prot_start */
>>> -#ifndef modify_prot_start_ptes
>>> -static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>> - unsigned long addr, pte_t *ptep, unsigned int nr)
>>> -{
>>> - pte_t pte, tmp_pte;
>>> -
>>> - pte = ptep_modify_prot_start(vma, addr, ptep);
>>> - while (--nr) {
>>> - ptep++;
>>> - addr += PAGE_SIZE;
>>> - tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>> - if (pte_dirty(tmp_pte))
>>> - pte = pte_mkdirty(pte);
>>> - if (pte_young(tmp_pte))
>>> - pte = pte_mkyoung(pte);
>>> - }
>>> - return pte;
>>> -}
>>> -#endif
>>> -
>>> -/* See the comment for ptep_modify_prot_commit */
>>> -#ifndef modify_prot_commit_ptes
>>> -static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
>>> unsigned long addr,
>>> - pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>>> -{
>>> - for (;;) {
>>> - ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>> - if (--nr == 0)
>>> - break;
>>> - ptep++;
>>> - addr += PAGE_SIZE;
>>> - old_pte = pte_next_pfn(old_pte);
>>> - pte = pte_next_pfn(pte);
>>> - }
>>> -}
>>> -#endif
>>> -
>>> /*
>>> * On some architectures hardware does not set page access bit when
>>> accessing
>>> * memory page, it is responsibility of software setting this bit. It
>>> brings
>>> @@ -1375,6 +1337,45 @@ static inline void ptep_modify_prot_commit(struct
>>> vm_area_struct *vma,
>>> __ptep_modify_prot_commit(vma, addr, ptep, pte);
>>> }
>>> #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>>> +
>>> +/* See the comment for ptep_modify_prot_start */
>>> +#ifndef modify_prot_start_ptes
>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep, unsigned int nr)
>>> +{
>>> + pte_t pte, tmp_pte;
>>> +
>>> + pte = ptep_modify_prot_start(vma, addr, ptep);
>>> + while (--nr) {
>>> + ptep++;
>>> + addr += PAGE_SIZE;
>>> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>> + if (pte_dirty(tmp_pte))
>>> + pte = pte_mkdirty(pte);
>>> + if (pte_young(tmp_pte))
>>> + pte = pte_mkyoung(pte);
>>> + }
>>> + return pte;
>>> +}
>>> +#endif
>>> +
>>> +/* See the comment for ptep_modify_prot_commit */
>>> +#ifndef modify_prot_commit_ptes
>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
>>> unsigned long addr,
>>> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>>> +{
>>> + for (;;) {
>>> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>> + if (--nr == 0)
>>> + break;
>>> + ptep++;
>>> + addr += PAGE_SIZE;
>>> + old_pte = pte_next_pfn(old_pte);
>>> + pte = pte_next_pfn(pte);
>>> + }
>>> +}
>>> +#endif
>>> +
>>> #endif /* CONFIG_MMU */
>>>
>>> /*
>>> --
>>>
>>> Thanks,
>>> Lance
>>>
>>> On 2025/4/29 13:23, Dev Jain wrote:
>>>> This patchset optimizes the mprotect() system call for large folios
>>>> by PTE-batching.
>>>>
>>>> We use the following test cases to measure performance, mprotect()'ing
>>>> the mapped memory to read-only then read-write 40 times:
>>>>
>>>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>>>> pte-mapping those THPs
>>>> Test case 2: Mapping 1G of memory with 64K mTHPs
>>>> Test case 3: Mapping 1G of memory with 4K pages
>>>>
>>>> Average execution time on arm64, Apple M3:
>>>> Before the patchset:
>>>> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>>>>
>>>> After the patchset:
>>>> T1: 2.1 seconds T2: 2.2 seconds T3: 4.2 seconds
>>>>
>>>> Observing T1/T2 and T3 before the patchset, we also remove the regression
>>>> introduced by ptep_get() on a contpte block. And, for large folios we get
>>>> an almost 74% performance improvement.
>>>>
>>>> v1->v2:
>>>> - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the
>>>> header more resilient)
>>>> - Abridge the anon-exclusive condition (Lance Yang)
>>>>
>>>> Dev Jain (7):
>>>> mm: Refactor code in mprotect
>>>> mm: Optimize mprotect() by batch-skipping PTEs
>>>> mm: Add batched versions of ptep_modify_prot_start/commit
>>>> arm64: Add batched version of ptep_modify_prot_start
>>>> arm64: Add batched version of ptep_modify_prot_commit
>>>> mm: Batch around can_change_pte_writable()
>>>> mm: Optimize mprotect() through PTE-batching
>>>>
>>>> arch/arm64/include/asm/pgtable.h | 10 ++
>>>> arch/arm64/mm/mmu.c | 21 +++-
>>>> include/linux/mm.h | 4 +-
>>>> include/linux/pgtable.h | 42 ++++++++
>>>> mm/gup.c | 2 +-
>>>> mm/huge_memory.c | 4 +-
>>>> mm/memory.c | 6 +-
>>>> mm/mprotect.c | 165 ++++++++++++++++++++-----------
>>>> mm/pgtable-generic.c | 16 ++-
>>>> 9 files changed, 198 insertions(+), 72 deletions(-)
>>>>
>>>
>>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 4/7] arm64: Add batched version of ptep_modify_prot_start
2025-04-29 5:23 ` [PATCH v2 4/7] arm64: Add batched version of ptep_modify_prot_start Dev Jain
@ 2025-04-30 5:43 ` Anshuman Khandual
2025-04-30 5:49 ` Dev Jain
0 siblings, 1 reply; 53+ messages in thread
From: Anshuman Khandual @ 2025-04-30 5:43 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
namit, hughd, yang, ziy
On 4/29/25 10:53, Dev Jain wrote:
> Override the generic definition to use get_and_clear_full_ptes(), so that
> we do a TLBI possibly only on the "contpte-edges" of the large PTE block,
> instead of doing it for every contpte block, which happens for ptep_get_and_clear().
Could you please explain what "contpte-edges" really signifies in the
context of large PTE blocks? Also, how does a TLBI operation only on
these edges never run the risk of missing a TLB invalidation of some
other mapped areas?
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 5 +++++
> arch/arm64/mm/mmu.c | 12 +++++++++---
> include/linux/pgtable.h | 4 ++++
> mm/pgtable-generic.c | 16 +++++++++++-----
> 4 files changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 2a77f11b78d5..8872ea5f0642 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1553,6 +1553,11 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> pte_t old_pte, pte_t new_pte);
>
> +#define modify_prot_start_ptes modify_prot_start_ptes
> +extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr);
> +
> #ifdef CONFIG_ARM64_CONTPTE
>
> /*
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 8fcf59ba39db..fe60be8774f4 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1523,7 +1523,8 @@ static int __init prevent_bootmem_remove_init(void)
> early_initcall(prevent_bootmem_remove_init);
> #endif
>
> -pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
> +pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
> + pte_t *ptep, unsigned int nr)
> {
> if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
> /*
> @@ -1532,9 +1533,14 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
> * in cases where cpu is affected with errata #2645198.
> */
> if (pte_user_exec(ptep_get(ptep)))
> - return ptep_clear_flush(vma, addr, ptep);
> + return clear_flush_ptes(vma, addr, ptep, nr);
> }
> - return ptep_get_and_clear(vma->vm_mm, addr, ptep);
> + return get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);
> +}
> +
> +pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
> +{
> + return modify_prot_start_ptes(vma, addr, ptep, 1);
> }
>
> void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index ed287289335f..10cdb87ccecf 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -828,6 +828,10 @@ extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
> pte_t *ptep);
> #endif
>
> +extern pte_t clear_flush_ptes(struct vm_area_struct *vma,
> + unsigned long address,
> + pte_t *ptep, unsigned int nr);
> +
> #ifndef __HAVE_ARCH_PMDP_HUGE_CLEAR_FLUSH
> extern pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
> unsigned long address,
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 5a882f2b10f9..e238f88c3cac 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -90,17 +90,23 @@ int ptep_clear_flush_young(struct vm_area_struct *vma,
> }
> #endif
>
> -#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
> -pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
> - pte_t *ptep)
> +pte_t clear_flush_ptes(struct vm_area_struct *vma, unsigned long address,
> + pte_t *ptep, unsigned int nr)
> {
> struct mm_struct *mm = (vma)->vm_mm;
> pte_t pte;
> - pte = ptep_get_and_clear(mm, address, ptep);
> + pte = get_and_clear_full_ptes(mm, address, ptep, nr, 0);
> if (pte_accessible(mm, pte))
> - flush_tlb_page(vma, address);
> + flush_tlb_range(vma, address, address + nr * PAGE_SIZE);
> return pte;
> }
> +
> +#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
> +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
> + pte_t *ptep)
> +{
> + return clear_flush_ptes(vma, address, ptep, 1);
> +}
> #endif
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
2025-04-29 13:57 ` Lorenzo Stoakes
2025-04-29 14:00 ` David Hildenbrand
@ 2025-04-30 5:44 ` Dev Jain
1 sibling, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-30 5:44 UTC (permalink / raw)
To: Lorenzo Stoakes, David Hildenbrand
Cc: akpm, ryan.roberts, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, hughd, yang, ziy
On 29/04/25 7:27 pm, Lorenzo Stoakes wrote:
> On Tue, Apr 29, 2025 at 11:27:43AM +0200, David Hildenbrand wrote:
>> On 29.04.25 11:19, David Hildenbrand wrote:
>>>
>>>> #include "internal.h"
>>>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> - pte_t pte)
>>>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> + pte_t pte, struct folio *folio, unsigned int nr)
>>>> {
>>>> struct page *page;
>>>> @@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> * write-fault handler similarly would map them writable without
>>>> * any additional checks while holding the PT lock.
>>>> */
>>>> - page = vm_normal_page(vma, addr, pte);
>>>> - return page && PageAnon(page) && PageAnonExclusive(page);
>>>> + if (!folio)
>>>> + folio = vm_normal_folio(vma, addr, pte);
>>>> + return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
>>>
>>> Oh no, now I spot it. That is horribly wrong.
>>>
>>> Please understand first what you are doing.
>>
>> Also, I would expect that the cow.c selftest would catch that:
>>
>> "vmsplice() + unmap in child with mprotect() optimization"
>>
>> After fork() we have a R/O PTE in the parent. Our child then uses vmsplice()
>> and unmaps the R/O PTE, meaning it is only left mapped by the parent.
>>
>> ret = mprotect(mem, size, PROT_READ);
>> ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);
>>
>> would then turn the PTE writable, although it shouldn't.
>
> This makes me concerned about the stability of this series as a whole...
>
>>
>> If that test case does not detect the issue you're introducing, we should
>> look into adding a test case that detects it.
>
> There are 25 tests that fail for the cow self-test with this series
> applied:
>
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with base page
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (16 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (16 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (16 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (32 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (32 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (32 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (64 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (64 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (64 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (128 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (128 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (128 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (256 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (256 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (256 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (512 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (512 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (512 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (1024 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (1024 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (1024 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (2048 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (2048 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (2048 kB)
>
>
> Dev, please take a little more time to test your series :) the current
> patch set doesn't compile without fixes applied, we're already at v2,
> and you've clearly not run the self-tests, as these fail too.
>
> Please ensure you do a smoke test and check that the series compiles
> before sending it out, and run the self-tests as well.
Apologies, I over-confidently skipped the selftests and didn't build
for x86 :( I shall take care of that.
>
> Thanks, Lorenzo
>
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-29 5:23 ` [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
` (2 preceding siblings ...)
2025-04-30 5:35 ` kernel test robot
@ 2025-04-30 5:45 ` kernel test robot
2025-04-30 14:16 ` Ryan Roberts
4 siblings, 0 replies; 53+ messages in thread
From: kernel test robot @ 2025-04-30 5:45 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: llvm, oe-kbuild-all, ryan.roberts, david, willy, linux-mm,
linux-kernel, catalin.marinas, will, Liam.Howlett,
lorenzo.stoakes, vbabka, jannh, anshuman.khandual, peterx,
joey.gouly, ioworker0, baohua, kevin.brodsky, quic_zhenhuah,
christophe.leroy, yangyicong, linux-arm-kernel, namit, hughd,
yang, ziy, Dev Jain
Hi Dev,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on arm64/for-next/core linus/master v6.15-rc4 next-20250429]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-Refactor-code-in-mprotect/20250429-133151
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250429052336.18912-4-dev.jain%40arm.com
patch subject: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
config: arm-randconfig-003-20250430 (https://download.01.org/0day-ci/archive/20250430/202504301328.ltLGuTxD-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f819f46284f2a79790038e1f6649172789734ae8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250430/202504301328.ltLGuTxD-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504301328.ltLGuTxD-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:31:
>> include/linux/pgtable.h:901:8: error: call to undeclared function 'ptep_modify_prot_start'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
901 | pte = ptep_modify_prot_start(vma, addr, ptep);
| ^
>> include/linux/pgtable.h:921:3: error: call to undeclared function 'ptep_modify_prot_commit'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
921 | ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
| ^
>> include/linux/pgtable.h:1356:21: error: static declaration of 'ptep_modify_prot_start' follows non-static declaration
1356 | static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
| ^
include/linux/pgtable.h:901:8: note: previous implicit declaration is here
901 | pte = ptep_modify_prot_start(vma, addr, ptep);
| ^
include/linux/pgtable.h:1367:20: error: static declaration of 'ptep_modify_prot_commit' follows non-static declaration
1367 | static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
| ^
include/linux/pgtable.h:921:3: note: previous implicit declaration is here
921 | ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
| ^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:11: warning: array index 3 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
98 | return (set->sig[3] | set->sig[2] |
| ^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
17 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:25: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
98 | return (set->sig[3] | set->sig[2] |
| ^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
17 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:11: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
114 | return (set1->sig[3] == set2->sig[3]) &&
| ^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
17 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:27: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
114 | return (set1->sig[3] == set2->sig[3]) &&
| ^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
17 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:115:5: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
115 | (set1->sig[2] == set2->sig[2]) &&
| ^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
17 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:115:21: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
115 | (set1->sig[2] == set2->sig[2]) &&
| ^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
17 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:157:1: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
157 | _SIG_SET_BINOP(sigorsets, _sig_or)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:138:8: note: expanded from macro '_SIG_SET_BINOP'
138 | a3 = a->sig[3]; a2 = a->sig[2]; \
| ^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
17 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:157:1: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
157 | _SIG_SET_BINOP(sigorsets, _sig_or)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:138:24: note: expanded from macro '_SIG_SET_BINOP'
138 | a3 = a->sig[3]; a2 = a->sig[2]; \
| ^ ~
arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
17 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from arch/arm/kernel/asm-offsets.c:12:
In file included from include/linux/mm.h:36:
In file included from include/linux/rcuwait.h:6:
vim +/ptep_modify_prot_start +901 include/linux/pgtable.h
893
894 /* See the comment for ptep_modify_prot_start */
895 #ifndef modify_prot_start_ptes
896 static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
897 unsigned long addr, pte_t *ptep, unsigned int nr)
898 {
899 pte_t pte, tmp_pte;
900
> 901 pte = ptep_modify_prot_start(vma, addr, ptep);
902 while (--nr) {
903 ptep++;
904 addr += PAGE_SIZE;
905 tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
906 if (pte_dirty(tmp_pte))
907 pte = pte_mkdirty(pte);
908 if (pte_young(tmp_pte))
909 pte = pte_mkyoung(pte);
910 }
911 return pte;
912 }
913 #endif
914
915 /* See the comment for ptep_modify_prot_commit */
916 #ifndef modify_prot_commit_ptes
917 static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
918 pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
919 {
920 for (;;) {
> 921 ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
922 if (--nr == 0)
923 break;
924 ptep++;
925 addr += PAGE_SIZE;
926 old_pte = pte_next_pfn(old_pte);
927 pte = pte_next_pfn(pte);
928 }
929 }
930 #endif
931
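[ For context — a hedged sketch of how a caller such as change_pte_range()
might drive these batched helpers; change_prot_batch() is a made-up name
for illustration, not code from patch 7: ]

/*
 * Batch-change the protection of 'nr' contiguous PTEs mapping one
 * large folio, under the PT lock.
 */
static void change_prot_batch(struct vm_area_struct *vma, unsigned long addr,
			      pte_t *ptep, unsigned int nr, pgprot_t newprot)
{
	/* Clear all 'nr' PTEs, folding dirty/young bits into the result. */
	pte_t oldpte = modify_prot_start_ptes(vma, addr, ptep, nr);
	/* Apply the new protection bits. */
	pte_t ptent = pte_modify(oldpte, newprot);

	/* Write back all 'nr' PTEs, advancing the PFN for each one. */
	modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr);
}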
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 4/7] arm64: Add batched version of ptep_modify_prot_start
2025-04-30 5:43 ` Anshuman Khandual
@ 2025-04-30 5:49 ` Dev Jain
2025-04-30 6:14 ` Anshuman Khandual
0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-04-30 5:49 UTC (permalink / raw)
To: Anshuman Khandual, akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
namit, hughd, yang, ziy
On 30/04/25 11:13 am, Anshuman Khandual wrote:
> On 4/29/25 10:53, Dev Jain wrote:
>> Override the generic definition to use get_and_clear_full_ptes(), so that
>> we do a TLBI possibly only on the "contpte-edges" of the large PTE block,
>> instead of doing it for every contpte block, which happens for ptep_get_and_clear().
>
> Could you please explain what "contpte-edges" really signifies in the
> context of large PTE blocks? Also, how does a TLBI operation only on
> these edges never run the risk of missing a TLB invalidation of some
> other mapped areas?
We are doing a TLBI over the whole range already, in the mprotect code:
see tlb_flush_pte_range. What the arm64 internal API does, irrespective
of the caller, is to do a TLBI for every contpte block in case of
unfolding. We don't need that for the intermediate blocks because the
caller does that. We do need a TLBI for the start and end contpte block,
because if the range we are invalidating only partially covers them, the
caller will not do a TLBI for the non-overlapped PTEs of those blocks.
I'll explain some more in the changelog next version.
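[ A worked illustration of the edge argument above, assuming 4K pages and
16-PTE contpte blocks; the helper below is hypothetical, not kernel code: ]

/*
 * Example: a batch covering PTEs [8, 40) spans contpte blocks [0,16),
 * [16,32) and [32,48). The caller's flush_tlb_range() only covers
 * [8, 40), so the partially covered edge blocks [0,16) and [32,48)
 * must be invalidated by the unfold itself; block [16,32) lies fully
 * inside the flushed range and needs no extra TLBI.
 */
static bool edge_block_needs_tlbi(unsigned int first_pte, unsigned int nr)
{
	const unsigned int block = 16;	/* CONT_PTES with 4K pages */

	return (first_pte % block) != 0 || ((first_pte + nr) % block) != 0;
}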
>
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 5 +++++
>> arch/arm64/mm/mmu.c | 12 +++++++++---
>> include/linux/pgtable.h | 4 ++++
>> mm/pgtable-generic.c | 16 +++++++++++-----
>> 4 files changed, 29 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 2a77f11b78d5..8872ea5f0642 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1553,6 +1553,11 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>> unsigned long addr, pte_t *ptep,
>> pte_t old_pte, pte_t new_pte);
>>
>> +#define modify_prot_start_ptes modify_prot_start_ptes
>> +extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr);
>> +
>> #ifdef CONFIG_ARM64_CONTPTE
>>
>> /*
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 8fcf59ba39db..fe60be8774f4 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1523,7 +1523,8 @@ static int __init prevent_bootmem_remove_init(void)
>> early_initcall(prevent_bootmem_remove_init);
>> #endif
>>
>> -pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
>> +pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
>> + pte_t *ptep, unsigned int nr)
>> {
>> if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
>> /*
>> @@ -1532,9 +1533,14 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
>> * in cases where cpu is affected with errata #2645198.
>> */
>> if (pte_user_exec(ptep_get(ptep)))
>> - return ptep_clear_flush(vma, addr, ptep);
>> + return clear_flush_ptes(vma, addr, ptep, nr);
>> }
>> - return ptep_get_and_clear(vma->vm_mm, addr, ptep);
>> + return get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);
>> +}
>> +
>> +pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
>> +{
>> + return modify_prot_start_ptes(vma, addr, ptep, 1);
>> }
>>
>> void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index ed287289335f..10cdb87ccecf 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -828,6 +828,10 @@ extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>> pte_t *ptep);
>> #endif
>>
>> +extern pte_t clear_flush_ptes(struct vm_area_struct *vma,
>> + unsigned long address,
>> + pte_t *ptep, unsigned int nr);
>> +
>> #ifndef __HAVE_ARCH_PMDP_HUGE_CLEAR_FLUSH
>> extern pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
>> unsigned long address,
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 5a882f2b10f9..e238f88c3cac 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -90,17 +90,23 @@ int ptep_clear_flush_young(struct vm_area_struct *vma,
>> }
>> #endif
>>
>> -#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>> -pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
>> - pte_t *ptep)
>> +pte_t clear_flush_ptes(struct vm_area_struct *vma, unsigned long address,
>> + pte_t *ptep, unsigned int nr)
>> {
>> struct mm_struct *mm = (vma)->vm_mm;
>> pte_t pte;
>> - pte = ptep_get_and_clear(mm, address, ptep);
>> + pte = get_and_clear_full_ptes(mm, address, ptep, nr, 0);
>> if (pte_accessible(mm, pte))
>> - flush_tlb_page(vma, address);
>> + flush_tlb_range(vma, address, address + nr * PAGE_SIZE);
>> return pte;
>> }
>> +
>> +#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>> +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
>> + pte_t *ptep)
>> +{
>> + return clear_flush_ptes(vma, address, ptep, 1);
>> +}
>> #endif
>>
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 4/7] arm64: Add batched version of ptep_modify_prot_start
2025-04-30 5:49 ` Dev Jain
@ 2025-04-30 6:14 ` Anshuman Khandual
2025-04-30 6:32 ` Dev Jain
0 siblings, 1 reply; 53+ messages in thread
From: Anshuman Khandual @ 2025-04-30 6:14 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
namit, hughd, yang, ziy
On 4/30/25 11:19, Dev Jain wrote:
>
>
> On 30/04/25 11:13 am, Anshuman Khandual wrote:
>> On 4/29/25 10:53, Dev Jain wrote:
>>> Override the generic definition to use get_and_clear_full_ptes(), so that
>>> we do a TLBI possibly only on the "contpte-edges" of the large PTE block,
>>> instead of doing it for every contpte block, which happens for ptep_get_and_clear().
>>
>> Could you please explain what "contpte-edges" really signifies in the
>> context of large PTE blocks? Also, how does a TLBI operation only on
>> these edges never run the risk of missing a TLB invalidation of some
>> other mapped areas?
>
> We are doing a TLBI over the whole range already, in the mprotect code:
> see tlb_flush_pte_range. What the arm64 internal API does, irrespective
> of the caller, is to do a TLBI for every contpte block in case of
> unfolding. We don't need that for the intermediate blocks because the
> caller does that. We do need a TLBI for the start and end contpte block,
> because if the range we are invalidating only partially covers them, the
> caller will not do a TLBI for the non-overlapped PTEs of those blocks.
But isn't splitting the TLBI flush responsibility between the callers
(intermediate blocks) and the platform API (contpte-edges) somewhat
problematic from a semantics perspective, and more susceptible to
missing TLB flushes, etc.?
> I'll explain some more in the changelog next version.
>
>>
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>> arch/arm64/include/asm/pgtable.h | 5 +++++
>>> arch/arm64/mm/mmu.c | 12 +++++++++---
>>> include/linux/pgtable.h | 4 ++++
>>> mm/pgtable-generic.c | 16 +++++++++++-----
>>> 4 files changed, 29 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 2a77f11b78d5..8872ea5f0642 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -1553,6 +1553,11 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>> unsigned long addr, pte_t *ptep,
>>> pte_t old_pte, pte_t new_pte);
>>> +#define modify_prot_start_ptes modify_prot_start_ptes
>>> +extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep,
>>> + unsigned int nr);
>>> +
>>> #ifdef CONFIG_ARM64_CONTPTE
>>> /*
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index 8fcf59ba39db..fe60be8774f4 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -1523,7 +1523,8 @@ static int __init prevent_bootmem_remove_init(void)
>>> early_initcall(prevent_bootmem_remove_init);
>>> #endif
>>> -pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
>>> +pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
>>> + pte_t *ptep, unsigned int nr)
>>> {
>>> if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
>>> /*
>>> @@ -1532,9 +1533,14 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
>>> * in cases where cpu is affected with errata #2645198.
>>> */
>>> if (pte_user_exec(ptep_get(ptep)))
>>> - return ptep_clear_flush(vma, addr, ptep);
>>> + return clear_flush_ptes(vma, addr, ptep, nr);
>>> }
>>> - return ptep_get_and_clear(vma->vm_mm, addr, ptep);
>>> + return get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);
>>> +}
>>> +
>>> +pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
>>> +{
>>> + return modify_prot_start_ptes(vma, addr, ptep, 1);
>>> }
>>> void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index ed287289335f..10cdb87ccecf 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -828,6 +828,10 @@ extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>>> pte_t *ptep);
>>> #endif
>>> +extern pte_t clear_flush_ptes(struct vm_area_struct *vma,
>>> + unsigned long address,
>>> + pte_t *ptep, unsigned int nr);
>>> +
>>> #ifndef __HAVE_ARCH_PMDP_HUGE_CLEAR_FLUSH
>>> extern pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
>>> unsigned long address,
>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>> index 5a882f2b10f9..e238f88c3cac 100644
>>> --- a/mm/pgtable-generic.c
>>> +++ b/mm/pgtable-generic.c
>>> @@ -90,17 +90,23 @@ int ptep_clear_flush_young(struct vm_area_struct *vma,
>>> }
>>> #endif
>>> -#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>>> -pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
>>> - pte_t *ptep)
>>> +pte_t clear_flush_ptes(struct vm_area_struct *vma, unsigned long address,
>>> + pte_t *ptep, unsigned int nr)
>>> {
>>> struct mm_struct *mm = (vma)->vm_mm;
>>> pte_t pte;
>>> - pte = ptep_get_and_clear(mm, address, ptep);
>>> + pte = get_and_clear_full_ptes(mm, address, ptep, nr, 0);
>>> if (pte_accessible(mm, pte))
>>> - flush_tlb_page(vma, address);
>>> + flush_tlb_range(vma, address, address + nr * PAGE_SIZE);
>>> return pte;
>>> }
>>> +
>>> +#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>>> +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
>>> + pte_t *ptep)
>>> +{
>>> + return clear_flush_ptes(vma, address, ptep, 1);
>>> +}
>>> #endif
>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
2025-04-29 5:23 ` [PATCH v2 6/7] mm: Batch around can_change_pte_writable() Dev Jain
2025-04-29 9:15 ` David Hildenbrand
2025-04-29 9:19 ` David Hildenbrand
@ 2025-04-30 6:17 ` kernel test robot
2 siblings, 0 replies; 53+ messages in thread
From: kernel test robot @ 2025-04-30 6:17 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: llvm, oe-kbuild-all, ryan.roberts, david, willy, linux-mm,
linux-kernel, catalin.marinas, will, Liam.Howlett,
lorenzo.stoakes, vbabka, jannh, anshuman.khandual, peterx,
joey.gouly, ioworker0, baohua, kevin.brodsky, quic_zhenhuah,
christophe.leroy, yangyicong, linux-arm-kernel, namit, hughd,
yang, ziy, Dev Jain
Hi Dev,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on arm64/for-next/core linus/master v6.15-rc4 next-20250429]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-Refactor-code-in-mprotect/20250429-133151
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250429052336.18912-7-dev.jain%40arm.com
patch subject: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
config: arm64-randconfig-002-20250430 (https://download.01.org/0day-ci/archive/20250430/202504301306.AU2G1yvg-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f819f46284f2a79790038e1f6649172789734ae8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250430/202504301306.AU2G1yvg-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504301306.AU2G1yvg-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> mm/mprotect.c:46:15: warning: unused variable 'page' [-Wunused-variable]
46 | struct page *page;
| ^~~~
mm/mprotect.c:226:51: error: use of undeclared identifier 'folio'
226 | can_change_ptes_writable(vma, addr, ptent, folio, 1))
| ^
1 warning and 1 error generated.
vim +/page +46 mm/mprotect.c
36f881883c5794 Kirill A. Shutemov 2015-06-24 42
695112a1385b39 Dev Jain 2025-04-29 43 bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
695112a1385b39 Dev Jain 2025-04-29 44 pte_t pte, struct folio *folio, unsigned int nr)
64fe24a3e05e5f David Hildenbrand 2022-06-14 45 {
64fe24a3e05e5f David Hildenbrand 2022-06-14 @46 struct page *page;
64fe24a3e05e5f David Hildenbrand 2022-06-14 47
7ea7e333842ed5 David Hildenbrand 2022-11-08 48 if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
7ea7e333842ed5 David Hildenbrand 2022-11-08 49 return false;
64fe24a3e05e5f David Hildenbrand 2022-06-14 50
7ea7e333842ed5 David Hildenbrand 2022-11-08 51 /* Don't touch entries that are not even readable. */
d84887739d5c98 Nadav Amit 2022-11-08 52 if (pte_protnone(pte))
64fe24a3e05e5f David Hildenbrand 2022-06-14 53 return false;
64fe24a3e05e5f David Hildenbrand 2022-06-14 54
64fe24a3e05e5f David Hildenbrand 2022-06-14 55 /* Do we need write faults for softdirty tracking? */
f38ee285191813 Barry Song 2024-06-08 56 if (pte_needs_soft_dirty_wp(vma, pte))
64fe24a3e05e5f David Hildenbrand 2022-06-14 57 return false;
64fe24a3e05e5f David Hildenbrand 2022-06-14 58
64fe24a3e05e5f David Hildenbrand 2022-06-14 59 /* Do we need write faults for uffd-wp tracking? */
64fe24a3e05e5f David Hildenbrand 2022-06-14 60 if (userfaultfd_pte_wp(vma, pte))
64fe24a3e05e5f David Hildenbrand 2022-06-14 61 return false;
64fe24a3e05e5f David Hildenbrand 2022-06-14 62
64fe24a3e05e5f David Hildenbrand 2022-06-14 63 if (!(vma->vm_flags & VM_SHARED)) {
64fe24a3e05e5f David Hildenbrand 2022-06-14 64 /*
7ea7e333842ed5 David Hildenbrand 2022-11-08 65 * Writable MAP_PRIVATE mapping: We can only special-case on
7ea7e333842ed5 David Hildenbrand 2022-11-08 66 * exclusive anonymous pages, because we know that our
7ea7e333842ed5 David Hildenbrand 2022-11-08 67 * write-fault handler similarly would map them writable without
7ea7e333842ed5 David Hildenbrand 2022-11-08 68 * any additional checks while holding the PT lock.
64fe24a3e05e5f David Hildenbrand 2022-06-14 69 */
695112a1385b39 Dev Jain 2025-04-29 70 if (!folio)
695112a1385b39 Dev Jain 2025-04-29 71 folio = vm_normal_folio(vma, addr, pte);
695112a1385b39 Dev Jain 2025-04-29 72 return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
64fe24a3e05e5f David Hildenbrand 2022-06-14 73 }
64fe24a3e05e5f David Hildenbrand 2022-06-14 74
fce831c92092ad David Hildenbrand 2024-05-22 75 VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
fce831c92092ad David Hildenbrand 2024-05-22 76
7ea7e333842ed5 David Hildenbrand 2022-11-08 77 /*
7ea7e333842ed5 David Hildenbrand 2022-11-08 78 * Writable MAP_SHARED mapping: "clean" might indicate that the FS still
7ea7e333842ed5 David Hildenbrand 2022-11-08 79 * needs a real write-fault for writenotify
7ea7e333842ed5 David Hildenbrand 2022-11-08 80 * (see vma_wants_writenotify()). If "dirty", the assumption is that the
7ea7e333842ed5 David Hildenbrand 2022-11-08 81 * FS was already notified and we can simply mark the PTE writable
7ea7e333842ed5 David Hildenbrand 2022-11-08 82 * just like the write-fault handler would do.
7ea7e333842ed5 David Hildenbrand 2022-11-08 83 */
d84887739d5c98 Nadav Amit 2022-11-08 84 return pte_dirty(pte);
64fe24a3e05e5f David Hildenbrand 2022-06-14 85 }
64fe24a3e05e5f David Hildenbrand 2022-06-14 86
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
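For clarity, the warning points at a leftover local: after switching to
vm_normal_folio(), the 'page' variable at the top of
can_change_ptes_writable() is never used, so the expected fixup is simply
to drop the declaration. A sketch of the obvious fix follows (hunk offsets
approximate, not a posted patch); the line-226 error is a separate
problem, a call site referencing a 'folio' that is not in scope there:

--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -44,6 +44,4 @@ bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
 		pte_t pte, struct folio *folio, unsigned int nr)
 {
-	struct page *page;
-
 	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
 		return false;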
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 0/7] Optimize mprotect for large folios
2025-04-30 5:42 ` Dev Jain
@ 2025-04-30 6:22 ` Lance Yang
2025-04-30 7:07 ` Dev Jain
0 siblings, 1 reply; 53+ messages in thread
From: Lance Yang @ 2025-04-30 6:22 UTC (permalink / raw)
To: Dev Jain, Lorenzo Stoakes
Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On 2025/4/30 13:42, Dev Jain wrote:
>
>
> On 29/04/25 4:11 pm, Lorenzo Stoakes wrote:
>> FWIW can confirm the same thing. Lance's fixes sort most of it out,
>> but I also get this error:
Good catch!
>>
>> mm/mprotect.c: In function ‘can_change_ptes_writable’:
>> mm/mprotect.c:46:22: error: unused variable ‘page’ [-Werror=unused-variable]
>> 46 | struct page *page;
>> | ^~~~
>>
>> So you also need to remove this unused variable at the top of
>> can_change_ptes_writable().
>
> Strange that my build didn't catch this.
Well, to catch unused variable warnings with GCC, enable stricter
checks by passing -Wunused-variable via KCFLAGS, and use
-Werror=unused-variable to force the build to fail if any variable
is declared but unused:

make -j$(nproc) KCFLAGS="-Wunused-variable -Werror=unused-variable"
Thanks,
Lance
>
>>
>> Cheers, Lorenzo
>>
>> On Tue, Apr 29, 2025 at 02:32:59PM +0530, Dev Jain wrote:
>>>
>>>
>>> On 29/04/25 12:36 pm, Lance Yang wrote:
>>>> Hey Dev,
>>>>
>>>> Hmm... I also hit the same compilation errors:
>>>>
>>>> In file included from ./include/linux/kasan.h:37,
>>>> from ./include/linux/slab.h:260,
>>>> from ./include/linux/crypto.h:19,
>>>> from arch/x86/kernel/asm-offsets.c:9:
>>>> ./include/linux/pgtable.h: In function ‘modify_prot_start_ptes’:
>>>> ./include/linux/pgtable.h:905:15: error: implicit declaration of
>>>> function ‘ptep_modify_prot_start’
>>>> [-Werror=implicit-function-declaration]
>>>> 905 | pte = ptep_modify_prot_start(vma, addr, ptep);
>>>> | ^~~~~~~~~~~~~~~~~~~~~~
>>>> ./include/linux/pgtable.h:905:15: error: incompatible types when
>>>> assigning to type ‘pte_t’ from type ‘int’
>>>> ./include/linux/pgtable.h:909:27: error: incompatible types when
>>>> assigning to type ‘pte_t’ from type ‘int’
>>>> 909 | tmp_pte = ptep_modify_prot_start(vma,
>>>> addr, ptep);
>>>> | ^~~~~~~~~~~~~~~~~~~~~~
>>>> ./include/linux/pgtable.h: In function ‘modify_prot_commit_ptes’:
>>>> ./include/linux/pgtable.h:925:17: error: implicit declaration of
>>>> function ‘ptep_modify_prot_commit’
>>>> [-Werror=implicit-function-declaration]
>>>> 925 | ptep_modify_prot_commit(vma, addr, ptep,
>>>> old_pte, pte);
>>>> | ^~~~~~~~~~~~~~~~~~~~~~~
>>>> ./include/linux/pgtable.h: At top level:
>>>> ./include/linux/pgtable.h:1360:21: error: conflicting types for
>>>> ‘ptep_modify_prot_start’; have ‘pte_t(struct vm_area_struct *, long
>>>> unsigned int, pte_t *)’
>>>> 1360 | static inline pte_t ptep_modify_prot_start(struct
>>>> vm_area_struct *vma,
>>>> | ^~~~~~~~~~~~~~~~~~~~~~
>>>> ./include/linux/pgtable.h:905:15: note: previous implicit
>>>> declaration of
>>>> ‘ptep_modify_prot_start’ with type ‘int()’
>>>> 905 | pte = ptep_modify_prot_start(vma, addr, ptep);
>>>> | ^~~~~~~~~~~~~~~~~~~~~~
>>>> ./include/linux/pgtable.h:1371:20: warning: conflicting types for
>>>> ‘ptep_modify_prot_commit’; have ‘void(struct vm_area_struct *, long
>>>> unsigned int, pte_t *, pte_t, pte_t)’
>>>> 1371 | static inline void ptep_modify_prot_commit(struct
>>>> vm_area_struct *vma,
>>>> | ^~~~~~~~~~~~~~~~~~~~~~~
>>>> ./include/linux/pgtable.h:1371:20: error: static declaration of
>>>> ‘ptep_modify_prot_commit’ follows non-static declaration
>>>> ./include/linux/pgtable.h:925:17: note: previous implicit
>>>> declaration of
>>>> ‘ptep_modify_prot_commit’ with type ‘void(struct vm_area_struct *, long
>>>> unsigned int, pte_t *, pte_t, pte_t)’
>>>> 925 | ptep_modify_prot_commit(vma, addr, ptep,
>>>> old_pte, pte);
>>>> | ^~~~~~~~~~~~~~~~~~~~~~~
>>>> CC /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/
>>>> objtool/
>>>> libstring.o
>>>> CC /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/
>>>> objtool/
>>>> libctype.o
>>>> CC /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/
>>>> objtool/
>>>> str_error_r.o
>>>> CC /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/
>>>> objtool/
>>>> librbtree.o
>>>> cc1: some warnings being treated as errors
>>>> make[2]: *** [scripts/Makefile.build:98: arch/x86/kernel/asm-offsets.s]
>>>> Error 1
>>>> make[1]: *** [/home/runner/work/mm-test-robot/mm-test-robot/linux/
>>>> Makefile:1280: prepare0] Error 2
>>>> make[1]: *** Waiting for unfinished jobs....
>>>> LD /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/
>>>> objtool/
>>>> objtool-in.o
>>>> LINK /home/runner/work/mm-test-robot/mm-test-robot/linux/tools/
>>>> objtool/objtool
>>>> make: *** [Makefile:248: __sub-make] Error 2
>>>>
>>>> Well, modify_prot_start_ptes() calls ptep_modify_prot_start(), but x86
>>>> does not define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION. To avoid
>>>> implicit declaration errors, the architecture-independent
>>>> ptep_modify_prot_start() must be defined before
>>>> modify_prot_start_ptes().
>>>>
>>>> With the changes below, things work correctly now ;)
>>>
>>> Ah thanks! My bad :(
>>>
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index 10cdb87ccecf..d9d6c49bb914 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -895,44 +895,6 @@ static inline void wrprotect_ptes(struct mm_struct
>>>> *mm, unsigned long addr,
>>>> }
>>>> #endif
>>>>
>>>> -/* See the comment for ptep_modify_prot_start */
>>>> -#ifndef modify_prot_start_ptes
>>>> -static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep, unsigned int nr)
>>>> -{
>>>> - pte_t pte, tmp_pte;
>>>> -
>>>> - pte = ptep_modify_prot_start(vma, addr, ptep);
>>>> - while (--nr) {
>>>> - ptep++;
>>>> - addr += PAGE_SIZE;
>>>> - tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>>> - if (pte_dirty(tmp_pte))
>>>> - pte = pte_mkdirty(pte);
>>>> - if (pte_young(tmp_pte))
>>>> - pte = pte_mkyoung(pte);
>>>> - }
>>>> - return pte;
>>>> -}
>>>> -#endif
>>>> -
>>>> -/* See the comment for ptep_modify_prot_commit */
>>>> -#ifndef modify_prot_commit_ptes
>>>> -static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
>>>> unsigned long addr,
>>>> - pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>>>> -{
>>>> - for (;;) {
>>>> - ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>>> - if (--nr == 0)
>>>> - break;
>>>> - ptep++;
>>>> - addr += PAGE_SIZE;
>>>> - old_pte = pte_next_pfn(old_pte);
>>>> - pte = pte_next_pfn(pte);
>>>> - }
>>>> -}
>>>> -#endif
>>>> -
>>>> /*
>>>> * On some architectures hardware does not set page access bit when
>>>> accessing
>>>> * memory page, it is responsibility of software setting this
>>>> bit. It
>>>> brings
>>>> @@ -1375,6 +1337,45 @@ static inline void
>>>> ptep_modify_prot_commit(struct
>>>> vm_area_struct *vma,
>>>> __ptep_modify_prot_commit(vma, addr, ptep, pte);
>>>> }
>>>> #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>>>> +
>>>> +/* See the comment for ptep_modify_prot_start */
>>>> +#ifndef modify_prot_start_ptes
>>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr)
>>>> +{
>>>> + pte_t pte, tmp_pte;
>>>> +
>>>> + pte = ptep_modify_prot_start(vma, addr, ptep);
>>>> + while (--nr) {
>>>> + ptep++;
>>>> + addr += PAGE_SIZE;
>>>> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>>> + if (pte_dirty(tmp_pte))
>>>> + pte = pte_mkdirty(pte);
>>>> + if (pte_young(tmp_pte))
>>>> + pte = pte_mkyoung(pte);
>>>> + }
>>>> + return pte;
>>>> +}
>>>> +#endif
>>>> +
>>>> +/* See the comment for ptep_modify_prot_commit */
>>>> +#ifndef modify_prot_commit_ptes
>>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma,
>>>> unsigned long addr,
>>>> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>>>> +{
>>>> + for (;;) {
>>>> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>>> + if (--nr == 0)
>>>> + break;
>>>> + ptep++;
>>>> + addr += PAGE_SIZE;
>>>> + old_pte = pte_next_pfn(old_pte);
>>>> + pte = pte_next_pfn(pte);
>>>> + }
>>>> +}
>>>> +#endif
>>>> +
>>>> #endif /* CONFIG_MMU */
>>>>
>>>> /*
>>>> --
>>>>
>>>> Thanks,
>>>> Lance
>>>>
>>>> On 2025/4/29 13:23, Dev Jain wrote:
>>>>> [snip]
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-29 13:52 ` Lorenzo Stoakes
@ 2025-04-30 6:25 ` Dev Jain
2025-04-30 14:37 ` Lorenzo Stoakes
2025-04-30 14:09 ` Ryan Roberts
1 sibling, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-04-30 6:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On 29/04/25 7:22 pm, Lorenzo Stoakes wrote:
> On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architecture can override these helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 38 insertions(+)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..ed287289335f 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> }
>> #endif
>>
>> +/* See the comment for ptep_modify_prot_start */
>
> I feel like you really should add a little more here, perhaps point out
> that it's batched etc.
Sure. I couldn't easily figure out a way to write the documentation
nicely; I'll do it this time.
>
>> +#ifndef modify_prot_start_ptes
>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep, unsigned int nr)
>
> This name is a bit confusing, it's not any ptes, it's those pte entries
> belonging to a large folio, capped to the PTE table, that you are
> batching, right?
Yes, but I am just following the convention. See wrprotect_ptes(), etc.
I don't have a strong preference anyway.
>
> Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> the name?
How about modify_prot_start_batched_ptes()?
>
> We definitely need to mention in comment or name or _somewhere_ the intent
> and motivation for this.
>
>> +{
>> + pte_t pte, tmp_pte;
>> +
>
> are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
> love this interface, where you require the user to know the number of
> remaining PTE entries in a PTE table.
Shall I write in the comments that the range is supposed to be within a
PTE table?
>
>> + pte = ptep_modify_prot_start(vma, addr, ptep);
>> + while (--nr) {
>
> This loop is a bit horrible. It seems needlessly confusing and you're in
> _dire_ need of comments to explain what's going on.
Again, following the pattern of get_and_clear_full_ptes :)
>
> So my understanding is, you have the user figure out:
>
> nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
>
> Then, you want to return the pte entry belonging to the start of the large
> folio batch, but you want to adjust that pte value to propagate dirty and
> young page table flags if any page table entries within the range contain
> those page table flags, having called ptep_modify_prot_start() on all of
> them?
>
> This is quite a bit to a. put in a header like this and b. not
> comment/explain.
>
> So maybe something like:
>
> pte = ptep_modify_prot_start(vma, addr, ptep);
>
> /* Iterate through large folio tail PTEs. */
> for (pg = 1; pg < nr; pg++) {
> pte_t inner_pte;
>
> ptep++;
> addr += PAGE_SIZE;
>
> inner_pte = ptep_modify_prot_start(vma, addr, ptep);
>
> /* We must propagate A/D state from tail PTEs. */
> if (pte_dirty(inner_pte))
> pte = pte_mkdirty(pte);
> if (pte_young(inner_pte))
> pte = pte_mkyoung(pte);
> }
>
> Would work better?
No preference; I'll do this then.
>
>
>
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>
>
>
>> + if (pte_dirty(tmp_pte))
>> + pte = pte_mkdirty(pte);
>> + if (pte_young(tmp_pte))
>> + pte = pte_mkyoung(pte);
>
> Why are you propagating these?
Because the a/d bits are per-folio; this will help us batch around
can_change_pte_writable() (which returns pte_dirty(pte)) and around
pte_needs_flush() for parisc.
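A rough sketch of the intended call site may help (illustrative only;
modify_prot_start_ptes()/modify_prot_commit_ptes() are the helpers this
series adds, the rest mirrors the existing change_pte_range() flow).
Because the returned pte accumulates dirty/young across the batch, one
pte_dirty() check can decide writability for all nr entries:

	/* oldpte summarises the dirty/young state of the whole batch */
	oldpte = modify_prot_start_ptes(vma, addr, pte, nr);
	ptent = pte_modify(oldpte, newprot);

	/*
	 * Writable MAP_SHARED: pte_dirty(ptent) true means writenotify
	 * already happened for some PTE in the batch, so the whole batch
	 * can be made writable in one go.
	 */
	if (can_change_pte_writable(vma, addr, ptent))
		ptent = pte_mkwrite(ptent, vma);

	modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr);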
>
>> + }
>> + return pte;
>> +}
>> +#endif
>> +
>> +/* See the comment for ptep_modify_prot_commit */
>
> Same comments as above, needs more meat on the bones!
>
>> +#ifndef modify_prot_commit_ptes
>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>
> Again need to reference large folio, batched or something relevant here,
> 'ptes' is super vague.
>
>> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>
> Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
Because ptep is a pointer, and old_pte isn't.
>
> I'm even more concerned about the 'nr' API here now.
>
> So this is now a user-calculated:
>
> min3(large_folio_pages, number of pte entries left in ptep,
> number of pte entries left in old_pte)
>
> It really feels like something that should be calculated here, or at least
> be broken out more clearly.
>
> You definitely _at the very least_ need to document it in a comment.
>
>> +{
>> + for (;;) {
>> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>> + if (--nr == 0)
>> + break;
>
> Why are you doing an infinite loop here with a break like this? Again feels
> needlessly confusing.
Following wrprotect_ptes().
I agree that this is confusing, which is why I wondered why it was done
in the first place :) but I just followed what was already there.
I'll change this to a simple for loop if that is your inclination.
>
> I think it's ok to duplicate this single line for the sake of clarity,
> also.
>
> Which gives us:
>
> unsigned long pg;
>
> ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> for (pg = 1; pg < nr; pg++) {
> ptep++;
> addr += PAGE_SIZE;
> old_pte = pte_next_pfn(old_pte);
> pte = pte_next_pfn(pte);
>
> ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> }
>
> There are alternative approaches, but I think doing an infinite loop that
> breaks and especially the confusing 'if (--foo) break;' stuff is much
> harder to parse than a super simple ranged loop.
>
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + old_pte = pte_next_pfn(old_pte);
>> + pte = pte_next_pfn(pte);
>> + }
>> +}
>> +#endif
>> +
>> /*
>> * On some architectures hardware does not set page access bit when accessing
>> * memory page, it is responsibility of software setting this bit. It brings
>> --
>> 2.30.2
>>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 4/7] arm64: Add batched version of ptep_modify_prot_start
2025-04-30 6:14 ` Anshuman Khandual
@ 2025-04-30 6:32 ` Dev Jain
0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-30 6:32 UTC (permalink / raw)
To: Anshuman Khandual, akpm
Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
jannh, peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
namit, hughd, yang, ziy
On 30/04/25 11:44 am, Anshuman Khandual wrote:
>
>
> On 4/30/25 11:19, Dev Jain wrote:
>>
>>
>> On 30/04/25 11:13 am, Anshuman Khandual wrote:
>>> On 4/29/25 10:53, Dev Jain wrote:
>>>> Override the generic definition to use get_and_clear_full_ptes(), so that
>>>> we do a TLBI possibly only on the "contpte-edges" of the large PTE block,
>>>> instead of doing it for every contpte block, which happens for ptep_get_and_clear().
>>>
>>> Could you please explain what does "contpte-edges" really signify in the
>>> context of large PTE blocks ? Also how TLBI operation only on these edges
>>> will never run into the risk of missing TLB invalidation of some other
>>> mapped areas ?
>>
>> We are doing a TLBI over the whole range already, in the mprotect code:
>> see tlb_flush_pte_range. What the arm64 internal API does, irrespective
>> of the caller, is a TLBI for every contpte block in case of unfolding.
>> We don't need that for the intermediate blocks, because the caller does
>> that. We do need a TLBI for the start and end contpte blocks, because
>> if the range we are invalidating only partially covers them, the caller
>> will not do a TLBI for the non-overlapped PTEs of those blocks.
>
> But is not splitting the TLBI flush responsibility between the callers
> (intermediate blocks) and the platform API (contpte-edges) somewhat
> problematic from a semantics perspective, and more susceptible to
> missing TLB flushes etc ?
I seem to agree that this is a semantic problem. That said, we won't ever
miss a TLB flush - we will only have extras. It is the responsibility of
the caller to do the TLBI; the platform API is only checking whether we
need some more TLBIs.
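To illustrate, a minimal sketch of the edge-only flush (not the actual
contpte code; it uses the arm64 CONT_PTE_SIZE block geometry and ignores
whether the edge blocks were really contpte-mapped):

/*
 * The caller already flushes [addr, addr + nr * PAGE_SIZE); only the
 * contpte blocks that the range covers partially need extra TLBIs.
 */
static void flush_contpte_edges(struct vm_area_struct *vma,
				unsigned long addr, unsigned int nr)
{
	unsigned long start = addr;
	unsigned long end = addr + nr * PAGE_SIZE;
	unsigned long first = ALIGN_DOWN(start, CONT_PTE_SIZE);
	unsigned long last = ALIGN_DOWN(end, CONT_PTE_SIZE);

	/* Leading block partially covered: flush it whole. */
	if (start & (CONT_PTE_SIZE - 1))
		flush_tlb_range(vma, first, first + CONT_PTE_SIZE);

	/* Trailing block partially covered and distinct from the first. */
	if ((end & (CONT_PTE_SIZE - 1)) && last != first)
		flush_tlb_range(vma, last, last + CONT_PTE_SIZE);
}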
>
>> I'll explain some more in the changelog next version.
>>
>>>
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>> arch/arm64/include/asm/pgtable.h | 5 +++++
>>>> arch/arm64/mm/mmu.c | 12 +++++++++---
>>>> include/linux/pgtable.h | 4 ++++
>>>> mm/pgtable-generic.c | 16 +++++++++++-----
>>>> 4 files changed, 29 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 2a77f11b78d5..8872ea5f0642 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -1553,6 +1553,11 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>>> unsigned long addr, pte_t *ptep,
>>>> pte_t old_pte, pte_t new_pte);
>>>> +#define modify_prot_start_ptes modify_prot_start_ptes
>>>> +extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep,
>>>> + unsigned int nr);
>>>> +
>>>> #ifdef CONFIG_ARM64_CONTPTE
>>>> /*
>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>> index 8fcf59ba39db..fe60be8774f4 100644
>>>> --- a/arch/arm64/mm/mmu.c
>>>> +++ b/arch/arm64/mm/mmu.c
>>>> @@ -1523,7 +1523,8 @@ static int __init prevent_bootmem_remove_init(void)
>>>> early_initcall(prevent_bootmem_remove_init);
>>>> #endif
>>>> -pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
>>>> +pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
>>>> + pte_t *ptep, unsigned int nr)
>>>> {
>>>> if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
>>>> /*
>>>> @@ -1532,9 +1533,14 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
>>>> * in cases where cpu is affected with errata #2645198.
>>>> */
>>>> if (pte_user_exec(ptep_get(ptep)))
>>>> - return ptep_clear_flush(vma, addr, ptep);
>>>> + return clear_flush_ptes(vma, addr, ptep, nr);
>>>> }
>>>> - return ptep_get_and_clear(vma->vm_mm, addr, ptep);
>>>> + return get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);
>>>> +}
>>>> +
>>>> +pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
>>>> +{
>>>> + return modify_prot_start_ptes(vma, addr, ptep, 1);
>>>> }
>>>> void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index ed287289335f..10cdb87ccecf 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -828,6 +828,10 @@ extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>>>> pte_t *ptep);
>>>> #endif
>>>> +extern pte_t clear_flush_ptes(struct vm_area_struct *vma,
>>>> + unsigned long address,
>>>> + pte_t *ptep, unsigned int nr);
>>>> +
>>>> #ifndef __HAVE_ARCH_PMDP_HUGE_CLEAR_FLUSH
>>>> extern pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
>>>> unsigned long address,
>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>> index 5a882f2b10f9..e238f88c3cac 100644
>>>> --- a/mm/pgtable-generic.c
>>>> +++ b/mm/pgtable-generic.c
>>>> @@ -90,17 +90,23 @@ int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> }
>>>> #endif
>>>> -#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>>>> -pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
>>>> - pte_t *ptep)
>>>> +pte_t clear_flush_ptes(struct vm_area_struct *vma, unsigned long address,
>>>> + pte_t *ptep, unsigned int nr)
>>>> {
>>>> struct mm_struct *mm = (vma)->vm_mm;
>>>> pte_t pte;
>>>> - pte = ptep_get_and_clear(mm, address, ptep);
>>>> + pte = get_and_clear_full_ptes(mm, address, ptep, nr, 0);
>>>> if (pte_accessible(mm, pte))
>>>> - flush_tlb_page(vma, address);
>>>> + flush_tlb_range(vma, address, address + nr * PAGE_SIZE);
>>>> return pte;
>>>> }
>>>> +
>>>> +#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>>>> +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
>>>> + pte_t *ptep)
>>>> +{
>>>> + return clear_flush_ptes(vma, address, ptep, 1);
>>>> +}
>>>> #endif
>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs
2025-04-29 13:19 ` Lorenzo Stoakes
@ 2025-04-30 6:37 ` Dev Jain
2025-04-30 13:18 ` Ryan Roberts
0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-04-30 6:37 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On 29/04/25 6:49 pm, Lorenzo Stoakes wrote:
> Very very very nitty on subject (sorry I realise this is annoying :P) -
> generally don't need to capitalise 'Optimize' here :>)
>
> Generally I like the idea here. But some issues on impl.
>
> On Tue, Apr 29, 2025 at 10:53:31AM +0530, Dev Jain wrote:
>> In case of prot_numa, there are various cases in which we can skip to the
>> next iteration. Since the skip condition is based on the folio and not
>> the PTEs, we can skip a PTE batch.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> mm/mprotect.c | 27 ++++++++++++++++++++-------
>> 1 file changed, 20 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 70f59aa8c2a8..ec5d17af7650 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -91,6 +91,9 @@ static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
>> bool toptier;
>> int nid;
>>
>> + if (folio_is_zone_device(folio) || folio_test_ksm(folio))
>> + return true;
>> +
>
> Hm why not just put this here from the start? I think you should put this back
> in the prior commit.
>
>> /* Also skip shared copy-on-write pages */
>> if (is_cow_mapping(vma->vm_flags) &&
>> (folio_maybe_dma_pinned(folio) ||
>> @@ -126,8 +129,10 @@ static bool prot_numa_skip(struct vm_area_struct *vma, struct folio *folio,
>> }
>>
>> static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t oldpte, int target_node)
>> + unsigned long addr, pte_t *pte, pte_t oldpte, int target_node,
>> + int max_nr, int *nr)
>
> Hate this ptr to nr.
>
> Why not just return nr, if it's 0 then skip? Simple!
>
>> {
>> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> struct folio *folio;
>> int ret;
>>
>> @@ -136,12 +141,16 @@ static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
>> return true;
>>
>> folio = vm_normal_folio(vma, addr, oldpte);
>> - if (!folio || folio_is_zone_device(folio) ||
>> - folio_test_ksm(folio))
>> + if (!folio)
>> return true;
>> +
>
> Very nitty, but stray extra line unless intended...
>
> Not sure why we can't just put this !folio check in prot_numa_skip()?
Because we won't be able to batch if the folio is NULL.
I think I really messed up by having separate patch 1 and 2. The real
intent of patch 1 was to do batching in patch 2 *and* not have insane
indentation. Perhaps I should merge them, or completely separate them
logically; I'll figure this out.
>
>> ret = prot_numa_skip(vma, folio, target_node);
>> - if (ret)
>> + if (ret) {
>> + if (folio_test_large(folio) && max_nr != 1)
>> + *nr = folio_pte_batch(folio, addr, pte, oldpte,
>> + max_nr, flags, NULL, NULL, NULL);
>
>> So max_nr can be <= 0 too? Shouldn't this be max_nr > 1?
>
>> return ret;
>
> Again x = fn_return_bool(); if (x) { return x; } is a bit silly, just do if
> (fn_return_bool()) { return true; }.
>
> If we return the number of pages, then this can become really simple, like:
>
> I feel like maybe we should abstract the folio large handling here, though it'd
> be a tiny function so hm.
>
> Anyway assuming we leave it in place, and return number of pages processed, this
> can become:
>
> if (prot_numa_skip(vma, folio, target_node)) {
> if (folio_test_large(folio) && max_nr > 1)
> return folio_pte_batch(folio, addr, pte, oldpte, max_nr, flags,
> NULL, NULL, NULL);
> return 1;
> }
>
> Which is neater I think!
>
>
>> + }
>> if (folio_use_access_time(folio))
>> folio_xchg_access_time(folio,
>> jiffies_to_msecs(jiffies));
>> @@ -159,6 +168,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>> bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>> bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>> bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> + int nr;
>>
>> tlb_change_page_size(tlb, PAGE_SIZE);
>> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> @@ -173,8 +183,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>> flush_tlb_batched_pending(vma->vm_mm);
>> arch_enter_lazy_mmu_mode();
>> do {
>> + nr = 1;
>> oldpte = ptep_get(pte);
>> if (pte_present(oldpte)) {
>> + int max_nr = (end - addr) >> PAGE_SHIFT;
>
> Not a fan of open-coding this. Since we already provide addr, why not just
> provide end as well and have prot_numa_avoid_fault() calculate it?
>
>> pte_t ptent;
>>
>> /*
>> @@ -182,8 +194,9 @@ static long change_pte_range(struct mmu_gather *tlb,
>> * pages. See similar comment in change_huge_pmd.
>> */
>> if (prot_numa &&
>> - prot_numa_avoid_fault(vma, addr,
>> - oldpte, target_node))
>> + prot_numa_avoid_fault(vma, addr, pte,
>> + oldpte, target_node,
>> + max_nr, &nr))
>> continue;
>>
>> oldpte = ptep_modify_prot_start(vma, addr, pte);
>> @@ -300,7 +313,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>> pages++;
>> }
>> }
>> - } while (pte++, addr += PAGE_SIZE, addr != end);
>> + } while (pte += nr, addr += nr * PAGE_SIZE, addr != end);
>
> This is icky, having 'nr' here like this.
>
> But alternatives might be _even more_ icky (that is, advancing both in
> prot_numa_avoid_fault()), so probably we need to keep it like this.
>
> Maybe more a moan at the C programming language tbh haha!
>
>
>> arch_leave_lazy_mmu_mode();
>> pte_unmap_unlock(pte - 1, ptl);
>>
>> --
>> 2.30.2
>>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 0/7] Optimize mprotect for large folios
2025-04-30 6:22 ` Lance Yang
@ 2025-04-30 7:07 ` Dev Jain
0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-04-30 7:07 UTC (permalink / raw)
To: Lance Yang, Lorenzo Stoakes
Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On 30/04/25 11:52 am, Lance Yang wrote:
>
>
> On 2025/4/30 13:42, Dev Jain wrote:
>>
>>
>> On 29/04/25 4:11 pm, Lorenzo Stoakes wrote:
>>> FWIW can confirm the same thing. Lance's fixes sort most of it out,
>>> but I also
>>> get this error:
>
> Good catch!
>
>>>
>>> mm/mprotect.c: In function ‘can_change_ptes_writable’:
>>> mm/mprotect.c:46:22: error: unused variable ‘page’ [-Werror=unused-variable]
>>> 46 | struct page *page;
>>> | ^~~~
>>>
>>> So you also need to remove this unused variable at the top of
>>> can_change_ptes_writable().
>>
>> Strange that my build didn't catch this.
>
> Well, to catch unused variable warnings with GCC, enable stricter
> checks by passing -Wunused-variable via KCFLAGS, and use
> -Werror=unused-variable to force the build to fail if any variable is
> declared but unused:
>
> make -j$(nproc) KCFLAGS="-Wunused-variable -Werror=unused-variable"
I thought this was the default, but thanks Lance!
>
> Thanks,
> Lance
>
> [snip]
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs
2025-04-30 6:37 ` Dev Jain
@ 2025-04-30 13:18 ` Ryan Roberts
2025-04-30 13:36 ` Lorenzo Stoakes
0 siblings, 1 reply; 53+ messages in thread
From: Ryan Roberts @ 2025-04-30 13:18 UTC (permalink / raw)
To: Dev Jain, Lorenzo Stoakes
Cc: akpm, david, willy, linux-mm, linux-kernel, catalin.marinas, will,
Liam.Howlett, vbabka, jannh, anshuman.khandual, peterx,
joey.gouly, ioworker0, baohua, kevin.brodsky, quic_zhenhuah,
christophe.leroy, yangyicong, linux-arm-kernel, namit, hughd,
yang, ziy
On 30/04/2025 07:37, Dev Jain wrote:
>
>
> On 29/04/25 6:49 pm, Lorenzo Stoakes wrote:
>> Very very very nitty on subject (sorry I realise this is annoying :P) -
>> generally don't need to capitalise 'Optimize' here :>)
>>
>> Generally I like the idea here. But some issues on impl.
>>
>> On Tue, Apr 29, 2025 at 10:53:31AM +0530, Dev Jain wrote:
>>> In case of prot_numa, there are various cases in which we can skip to the
>>> next iteration. Since the skip condition is based on the folio and not
>>> the PTEs, we can skip a PTE batch.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>> mm/mprotect.c | 27 ++++++++++++++++++++-------
>>> 1 file changed, 20 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 70f59aa8c2a8..ec5d17af7650 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -91,6 +91,9 @@ static bool prot_numa_skip(struct vm_area_struct *vma,
>>> struct folio *folio,
>>> bool toptier;
>>> int nid;
>>>
>>> + if (folio_is_zone_device(folio) || folio_test_ksm(folio))
>>> + return true;
>>> +
>>
>> Hm why not just put this here from the start? I think you should put this back
>> in the prior commit.
>>
>>> /* Also skip shared copy-on-write pages */
>>> if (is_cow_mapping(vma->vm_flags) &&
>>> (folio_maybe_dma_pinned(folio) ||
>>> @@ -126,8 +129,10 @@ static bool prot_numa_skip(struct vm_area_struct *vma,
>>> struct folio *folio,
>>> }
>>>
>>> static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
>>> - unsigned long addr, pte_t oldpte, int target_node)
>>> + unsigned long addr, pte_t *pte, pte_t oldpte, int target_node,
>>> + int max_nr, int *nr)
>>
>> Hate this ptr to nr.
>>
>> Why not just return nr, if it's 0 then skip? Simple!
>>
>>> {
>>> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> struct folio *folio;
>>> int ret;
>>>
>>> @@ -136,12 +141,16 @@ static bool prot_numa_avoid_fault(struct vm_area_struct
>>> *vma,
>>> return true;
>>>
>>> folio = vm_normal_folio(vma, addr, oldpte);
>>> - if (!folio || folio_is_zone_device(folio) ||
>>> - folio_test_ksm(folio))
>>> + if (!folio)
>>> return true;
>>> +
>>
>> Very nitty, but stray extra line unless intended...
>>
>> Not sure why we can't just put this !folio check in prot_numa_skip()?
>
> Because we won't be able to batch if the folio is NULL.
>
> I think I really messed up by having separate patch 1 and 2. The real intent of
> patch 1 was to do batching in patch 2 *and* not have insane indentation. Perhaps
> I should merge them, or completely separate them logically; I'll figure this out.
I'd be inclined to just merge into a single patch...
>
>>
>>> ret = prot_numa_skip(vma, folio, target_node);
>>> - if (ret)
>>> + if (ret) {
>>> + if (folio_test_large(folio) && max_nr != 1)
>>> + *nr = folio_pte_batch(folio, addr, pte, oldpte,
>>> + max_nr, flags, NULL, NULL, NULL);
>>
>> So max_nr can be <= 0 too? Shouldn't this be max_nr > 1?
>>
>>> return ret;
>>
>> Again x = fn_return_bool(); if (x) { return x; } is a bit silly, just do if
>> (fn_return_bool()) { return true; }.
>>
>> If we return the number of pages, then this can become really simple, like:
>>
>> I feel like maybe we should abstract the folio large handling here, though it'd
>> be a tiny function so hm.
>>
>> Anyway assuming we leave it in place, and return number of pages processed, this
>> can become:
>>
>> if (prot_numa_skip(vma, folio, target_node)) {
>> if (folio_test_large(folio) && max_nr > 1)
>> return folio_pte_batch(folio, addr, pte, oldpte, max_nr, flags,
>> NULL, NULL, NULL);
>> return 1;
>> }
>>
>> Which is neater I think!
>>
>>
>>> + }
>>> if (folio_use_access_time(folio))
>>> folio_xchg_access_time(folio,
>>> jiffies_to_msecs(jiffies));
>>> @@ -159,6 +168,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>> bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>> bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>> bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>>> + int nr;
>>>
>>> tlb_change_page_size(tlb, PAGE_SIZE);
>>> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>>> @@ -173,8 +183,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>>> flush_tlb_batched_pending(vma->vm_mm);
>>> arch_enter_lazy_mmu_mode();
>>> do {
>>> + nr = 1;
>>> oldpte = ptep_get(pte);
>>> if (pte_present(oldpte)) {
>>> + int max_nr = (end - addr) >> PAGE_SHIFT;
>>
>> Not a fan of open-coding this. Since we already provide addr, why not just
>> provide end as well and have prot_numa_avoid_fault() calculate it?
>>
>>> pte_t ptent;
>>>
>>> /*
>>> @@ -182,8 +194,9 @@ static long change_pte_range(struct mmu_gather *tlb,
>>> * pages. See similar comment in change_huge_pmd.
>>> */
>>> if (prot_numa &&
>>> - prot_numa_avoid_fault(vma, addr,
>>> - oldpte, target_node))
>>> + prot_numa_avoid_fault(vma, addr, pte,
>>> + oldpte, target_node,
>>> + max_nr, &nr))
>>> continue;
>>>
>>> oldpte = ptep_modify_prot_start(vma, addr, pte);
>>> @@ -300,7 +313,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>> pages++;
>>> }
>>> }
>>> - } while (pte++, addr += PAGE_SIZE, addr != end);
>>> + } while (pte += nr, addr += nr * PAGE_SIZE, addr != end);
>>
>> This is icky, having 'nr' here like this.
For better or worse, this is the pattern we have already established in other
loops that are batching-aware. See zap_pte_range(), copy_pte_range(), etc. So
I'd prefer to follow that pattern here, as Dev has done.
Thanks.
Ryan
>>
>> But alternatives might be _even more_ icky (that is, advancing both in
>> prot_numa_avoid_fault()), so probably we need to keep it like this.
>>
>> Maybe more a moan at the C programming language tbh haha!
>>
>>
>>> arch_leave_lazy_mmu_mode();
>>> pte_unmap_unlock(pte - 1, ptl);
>>>
>>> --
>>> 2.30.2
>>>
>
^ permalink raw reply [flat|nested] 53+ messages in thread
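For reference, the batching-aware loop shape Ryan refers to (as in
zap_pte_range()/copy_pte_range()) looks roughly like this; a simplified
sketch, with process_present_pte() a made-up stand-in for the per-batch
work:

	do {
		int nr = 1;	/* default: consume one entry */
		pte_t ptent = ptep_get(pte);

		if (pte_present(ptent)) {
			int max_nr = (end - addr) >> PAGE_SHIFT;

			/* may consume up to max_nr entries of one folio */
			nr = process_present_pte(vma, addr, pte, ptent, max_nr);
		}
	} while (pte += nr, addr += nr * PAGE_SIZE, addr != end);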
* Re: [PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs
2025-04-30 13:18 ` Ryan Roberts
@ 2025-04-30 13:36 ` Lorenzo Stoakes
0 siblings, 0 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-04-30 13:36 UTC (permalink / raw)
To: Ryan Roberts
Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On Wed, Apr 30, 2025 at 02:18:20PM +0100, Ryan Roberts wrote:
> On 30/04/2025 07:37, Dev Jain wrote:
> >
> >
> > On 29/04/25 6:49 pm, Lorenzo Stoakes wrote:
> >> Very very very nitty on subject (sorry I realise this is annoying :P) -
> >> generally don't need to capitalise 'Optimize' here :>)
> >>
> >> Generally I like the idea here. But some issues on impl.
> >>
> >> On Tue, Apr 29, 2025 at 10:53:31AM +0530, Dev Jain wrote:
> >>> In case of prot_numa, there are various cases in which we can skip to the
> >>> next iteration. Since the skip condition is based on the folio and not
> >>> the PTEs, we can skip a PTE batch.
> >>>
> >>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >>> ---
> >>> mm/mprotect.c | 27 ++++++++++++++++++++-------
> >>> 1 file changed, 20 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >>> index 70f59aa8c2a8..ec5d17af7650 100644
> >>> --- a/mm/mprotect.c
> >>> +++ b/mm/mprotect.c
> >>> @@ -91,6 +91,9 @@ static bool prot_numa_skip(struct vm_area_struct *vma,
> >>> struct folio *folio,
> >>> bool toptier;
> >>> int nid;
> >>>
> >>> + if (folio_is_zone_device(folio) || folio_test_ksm(folio))
> >>> + return true;
> >>> +
> >>
> >> Hm why not just put this here from the start? I think you should put this back
> >> in the prior commit.
> >>
> >>> /* Also skip shared copy-on-write pages */
> >>> if (is_cow_mapping(vma->vm_flags) &&
> >>> (folio_maybe_dma_pinned(folio) ||
> >>> @@ -126,8 +129,10 @@ static bool prot_numa_skip(struct vm_area_struct *vma,
> >>> struct folio *folio,
> >>> }
> >>>
> >>> static bool prot_numa_avoid_fault(struct vm_area_struct *vma,
> >>> - unsigned long addr, pte_t oldpte, int target_node)
> >>> + unsigned long addr, pte_t *pte, pte_t oldpte, int target_node,
> >>> + int max_nr, int *nr)
> >>
> >> Hate this ptr to nr.
> >>
> >> Why not just return nr, if it's 0 then skip? Simple!
> >>
> >>> {
> >>> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> >>> struct folio *folio;
> >>> int ret;
> >>>
> >>> @@ -136,12 +141,16 @@ static bool prot_numa_avoid_fault(struct vm_area_struct
> >>> *vma,
> >>> return true;
> >>>
> >>> folio = vm_normal_folio(vma, addr, oldpte);
> >>> - if (!folio || folio_is_zone_device(folio) ||
> >>> - folio_test_ksm(folio))
> >>> + if (!folio)
> >>> return true;
> >>> +
> >>
> >> Very nitty, but stray extra line unless intended...
> >>
> >> Not sure why we can't just put this !folio check in prot_numa_skip()?
> >
> > Because we won't be able to batch if the folio is NULL.
> >
> > I think I really messed up by having separate patch 1 and 2. The real intent of
> > patch 1 was to do batching in patch 2 *and* not have insane indentation. Perhaps
> > I should merge them, or completely separate them logically; I'll figure this out.
>
> I'd be inclined to just merge into a single patch...
Agreed!
>
> >
> >>
> >>> ret = prot_numa_skip(vma, folio, target_node);
> >>> - if (ret)
> >>> + if (ret) {
> >>> + if (folio_test_large(folio) && max_nr != 1)
> >>> + *nr = folio_pte_batch(folio, addr, pte, oldpte,
> >>> + max_nr, flags, NULL, NULL, NULL);
> >>
> >> So max_nr can be <= 0 too? Shouldn't this be max_nr > 1?
> >>
> >>> return ret;
> >>
> >> Again x = fn_return_bool(); if (x) { return x; } is a bit silly, just do if
> >> (fn_return_bool()) { return true; }.
> >>
> >> If we return the number of pages, then this can become really simple, like:
> >>
> >> I feel like maybe we should abstract the folio large handling here, though it'd
> >> be a tiny function so hm.
> >>
> >> Anyway assuming we leave it in place, and return number of pages processed, this
> >> can become:
> >>
> >> if (prot_numa_skip(vma, folio, target_node)) {
> >> if (folio_test_large(folio) && max_nr > 1)
> >> return folio_pte_batch(folio, addr, pte, oldpte, max_nr, flags,
> >> NULL, NULL, NULL);
> >> return 1;
> >> }
> >>
> >> Which is neater I think!
> >>
> >>
> >>> + }
> >>> if (folio_use_access_time(folio))
> >>> folio_xchg_access_time(folio,
> >>> jiffies_to_msecs(jiffies));
> >>> @@ -159,6 +168,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> >>> bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
> >>> bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> >>> bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> >>> + int nr;
> >>>
> >>> tlb_change_page_size(tlb, PAGE_SIZE);
> >>> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >>> @@ -173,8 +183,10 @@ static long change_pte_range(struct mmu_gather *tlb,
> >>> flush_tlb_batched_pending(vma->vm_mm);
> >>> arch_enter_lazy_mmu_mode();
> >>> do {
> >>> + nr = 1;
> >>> oldpte = ptep_get(pte);
> >>> if (pte_present(oldpte)) {
> >>> + int max_nr = (end - addr) >> PAGE_SHIFT;
> >>
> >> Not a fan of open-coding this. Since we already provide addr, why not just
> >> provide end as well and have prot_numa_avoid_fault() calculate it?
> >>
> >>> pte_t ptent;
> >>>
> >>> /*
> >>> @@ -182,8 +194,9 @@ static long change_pte_range(struct mmu_gather *tlb,
> >>> * pages. See similar comment in change_huge_pmd.
> >>> */
> >>> if (prot_numa &&
> >>> - prot_numa_avoid_fault(vma, addr,
> >>> - oldpte, target_node))
> >>> + prot_numa_avoid_fault(vma, addr, pte,
> >>> + oldpte, target_node,
> >>> + max_nr, &nr))
> >>> continue;
> >>>
> >>> oldpte = ptep_modify_prot_start(vma, addr, pte);
> >>> @@ -300,7 +313,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> >>> pages++;
> >>> }
> >>> }
> >>> - } while (pte++, addr += PAGE_SIZE, addr != end);
> >>> + } while (pte += nr, addr += nr * PAGE_SIZE, addr != end);
> >>
> >> This is icky, having 'nr' here like this.
>
> For better or worse, this is the pattern we have already established in other
> loops that are batching-aware. See zap_pte_range(), copy_pte_range(), etc. So
> I'd prefer to follow that pattern here, as Dev has done.
Yeah, I'm fine with keeping this 'nr' stuff; I don't think there's a great
alternative.
>
> Thanks.
> Ryan
Cheers, Lorenzo
>
> >>
> >> But alternatives might be _even more_ icky (that is, advancing both in
> >> prot_numa_avoid_fault()), so probably we need to keep it like this.
> >>
> >> Maybe more a moan at the C programming language tbh haha!
> >>
> >>
> >>> arch_leave_lazy_mmu_mode();
> >>> pte_unmap_unlock(pte - 1, ptl);
> >>>
> >>> --
> >>> 2.30.2
> >>>
> >
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-29 13:52 ` Lorenzo Stoakes
2025-04-30 6:25 ` Dev Jain
@ 2025-04-30 14:09 ` Ryan Roberts
2025-04-30 14:34 ` Lorenzo Stoakes
1 sibling, 1 reply; 53+ messages in thread
From: Ryan Roberts @ 2025-04-30 14:09 UTC (permalink / raw)
To: Lorenzo Stoakes, Dev Jain
Cc: akpm, david, willy, linux-mm, linux-kernel, catalin.marinas, will,
Liam.Howlett, vbabka, jannh, anshuman.khandual, peterx,
joey.gouly, ioworker0, baohua, kevin.brodsky, quic_zhenhuah,
christophe.leroy, yangyicong, linux-arm-kernel, namit, hughd,
yang, ziy
On 29/04/2025 14:52, Lorenzo Stoakes wrote:
> On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architecture can override these helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 38 insertions(+)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..ed287289335f 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> }
>> #endif
>>
>> +/* See the comment for ptep_modify_prot_start */
>
> I feel like you really should add a little more here, perhaps point out
> that it's batched etc.
>
>> +#ifndef modify_prot_start_ptes
>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep, unsigned int nr)
>
> This name is a bit confusing,
On naming, the existing (modern) convention for single-pte helpers is to start
the function name with ptep_. When we started adding batched versions, we took
the approach of adding _ptes as a suffix. For example:
set_pte_at()
ptep_get_and_clear_full()
ptep_set_wrprotect()
set_ptes()
get_and_clear_full_ptes()
wrprotect_ptes()
In this case, we already have ptep_modify_prot_start() and
ptep_modify_prot_commit() for the existing single-pte helper versions. So
according to the convention (or at least how I interpret the convention), the
proposed names seem reasonable.
> it's not any ptes, it's those pte entries
> belonging to a large folio, capped to the PTE table, that you are
> batching, right?
Yes, but again by convention, that is captured in the kerneldoc comment for the
functions. We are operating on a batch of *ptes* not on a folio or batch of
folios. But it is a requirement of the function that the batch of ptes all lie
within a single large folio (i.e. the pfns are sequential).
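Spelled out, the informal contract on the batch is something like:

	/*
	 * For all i in [0, nr): pte_pfn(ptep[i]) == pte_pfn(ptep[0]) + i,
	 * and all nr entries lie within a single PTE table and map pages
	 * of a single large folio.
	 */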
> Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> the name?
>
> We definitely need to mention in comment or name or _somewhere_ the intent
> and motivation for this.
Agreed!
>
>> +{
>> + pte_t pte, tmp_pte;
>> +
>
> are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
> love this interface, where you require the user to know the number of
> remaining PTE entries in a PTE table.
For better or worse, that's the established convention. See part of comment for
set_ptes() for example:
"""
* Context: The caller holds the page table lock. The pages all belong
* to the same folio. The PTEs are all in the same PMD.
"""
>
>> + pte = ptep_modify_prot_start(vma, addr, ptep);
>> + while (--nr) {
>
> This loop is a bit horrible. It seems needlessly confusing and you're in
> _dire_ need of comments to explain what's going on.
>
> So my understanding is, you have the user figure out:
>
> nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
>
> Then, you want to return the pte entry belonging to the start of the large
> folio batch, but you want to adjust that pte value to propagate dirty and
> young page table flags if any page table entries within the range contain
> those page table flags, having called ptep_modify_prot_start() on all of
> them?
>
> This is quite a bit to a. put in a header like this and b. not
> comment/explain.
This style is copied from get_and_clear_full_ptes(), which has this comment,
which explains all this complexity. My vote would be to have a simple comment
for this function:
/**
* get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
* the same folio, collecting dirty/accessed bits.
* @mm: Address space the pages are mapped into.
* @addr: Address the first page is mapped at.
* @ptep: Page table pointer for the first entry.
* @nr: Number of entries to clear.
* @full: Whether we are clearing a full mm.
*
* May be overridden by the architecture; otherwise, implemented as a simple
* loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
* returned PTE.
*
* Note that PTE bits in the PTE range besides the PFN can differ. For example,
* some PTEs might be write-protected.
*
* Context: The caller holds the page table lock. The PTEs map consecutive
* pages that belong to the same folio. The PTEs are all in the same PMD.
*/
>
> So maybe something like:
>
> pte = ptep_modify_prot_start(vma, addr, ptep);
>
> /* Iterate through large folio tail PTEs. */
> for (pg = 1; pg < nr; pg++) {
> pte_t inner_pte;
>
> ptep++;
> addr += PAGE_SIZE;
>
> inner_pte = ptep_modify_prot_start(vma, addr, ptep);
>
> /* We must propagate A/D state from tail PTEs. */
> if (pte_dirty(inner_pte))
> pte = pte_mkdirty(pte);
> if (pte_young(inner_pte))
> pte = pte_mkyoung(pte);
> }
>
> Would work better?
>
>
>
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>
>
>
>> + if (pte_dirty(tmp_pte))
>> + pte = pte_mkdirty(pte);
>> + if (pte_young(tmp_pte))
>> + pte = pte_mkyoung(pte);
>
> Why are you propagating these?
>
>> + }
>> + return pte;
>> +}
>> +#endif
>> +
>> +/* See the comment for ptep_modify_prot_commit */
>
> Same comments as above, needs more meat on the bones!
>
>> +#ifndef modify_prot_commit_ptes
>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>
> Again need to reference large folio, batched or something relevant here,
> 'ptes' is super vague.
>
>> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>
> Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
>
> I'm even more concerned about the 'nr' API here now.
>
> So this is now a user-calculated:
>
> min3(large_folio_pages, number of pte entries left in ptep,
> number of pte entries left in old_pte)
>
> It really feels like something that should be calculated here, or at least
> be broken out more clearly.
>
> You definitely _at the very least_ need to document it in a comment.
>
>> +{
>> + for (;;) {
>> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>> + if (--nr == 0)
>> + break;
>
> Why are you doing an infinite loop here with a break like this? Again feels
> needlessly confusing.
I agree it's not pretty to look at. But apparently it's the most efficient. This
is Willy's commit that started it all: Commit bcc6cc832573 ("mm: add default
definition of set_ptes()").
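For reference, the default definition added there has the same shape; a
lightly simplified sketch from memory (the page_table_check hook is omitted -
see the commit for the exact body):

	static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
			pte_t *ptep, pte_t pte, unsigned int nr)
	{
		arch_enter_lazy_mmu_mode();
		for (;;) {
			set_pte(ptep, pte);
			if (--nr == 0)
				break;
			ptep++;
			/* advance the PFN encoded in the PTE value */
			pte = pte_next_pfn(pte);
		}
		arch_leave_lazy_mmu_mode();
	}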
For the record, I think all your comments make good sense, Lorenzo. But there is
an established style, and personally I think at this point it is more confusing
to break from that style.
Thanks,
Ryan
>
> I think it's ok to duplicate this single line for the sake of clarity,
> also.
>
> Which gives us:
>
> unsigned long pg;
>
> ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> for (pg = 1; pg < nr; pg++) {
> ptep++;
> addr += PAGE_SIZE;
> old_pte = pte_next_pfn(old_pte);
> pte = pte_next_pfn(pte);
>
> ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> }
>
> There are alternative approaches, but I think doing an infinite loop that
> breaks and especially the confusing 'if (--foo) break;' stuff is much
> harder to parse than a super simple ranged loop.
>
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + old_pte = pte_next_pfn(old_pte);
>> + pte = pte_next_pfn(pte);
>> + }
>> +}
>> +#endif
>> +
>> /*
>> * On some architectures hardware does not set page access bit when accessing
>> * memory page, it is responsibility of software setting this bit. It brings
>> --
>> 2.30.2
>>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-29 5:23 ` [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
` (3 preceding siblings ...)
2025-04-30 5:45 ` kernel test robot
@ 2025-04-30 14:16 ` Ryan Roberts
4 siblings, 0 replies; 53+ messages in thread
From: Ryan Roberts @ 2025-04-30 14:16 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
namit, hughd, yang, ziy
On 29/04/2025 06:23, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> Architecture can override these helpers.
I would suggest merging this with patch #7 since that's where they actually get
used. Then you can add a single patch after that to specialize them for arm64,
which will give a performance win.
Thanks,
Ryan
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 38 insertions(+)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..ed287289335f 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> }
> #endif
>
> +/* See the comment for ptep_modify_prot_start */
> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr)
> +{
> + pte_t pte, tmp_pte;
> +
> + pte = ptep_modify_prot_start(vma, addr, ptep);
> + while (--nr) {
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
> + }
> + return pte;
> +}
> +#endif
> +
> +/* See the comment for ptep_modify_prot_commit */
> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> +{
> + for (;;) {
> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + old_pte = pte_next_pfn(old_pte);
> + pte = pte_next_pfn(pte);
> + }
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-30 14:09 ` Ryan Roberts
@ 2025-04-30 14:34 ` Lorenzo Stoakes
2025-05-01 11:33 ` Ryan Roberts
0 siblings, 1 reply; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-04-30 14:34 UTC (permalink / raw)
To: Ryan Roberts
Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On Wed, Apr 30, 2025 at 03:09:50PM +0100, Ryan Roberts wrote:
> On 29/04/2025 14:52, Lorenzo Stoakes wrote:
> > On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
> >> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> >> Architecture can override these helpers.
> >>
> >> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >> ---
> >> include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 38 insertions(+)
> >>
> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >> index b50447ef1c92..ed287289335f 100644
> >> --- a/include/linux/pgtable.h
> >> +++ b/include/linux/pgtable.h
> >> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> >> }
> >> #endif
> >>
> >> +/* See the comment for ptep_modify_prot_start */
> >
> > I feel like you really should add a little more here, perhaps point out
> > that it's batched etc.
> >
> >> +#ifndef modify_prot_start_ptes
> >> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> >> + unsigned long addr, pte_t *ptep, unsigned int nr)
> >
> > This name is a bit confusing,
>
> On naming, the existing (modern) convention for single-pte helpers is to start
> the function name with ptep_. When we started adding batched versions, we took
> the approach of adding _ptes as a suffix. For example:
>
> set_pte_at()
> ptep_get_and_clear_full()
> ptep_set_wrprotect()
>
> set_ptes()
> get_and_clear_full_ptes()
> wrprotect_ptes()
>
> In this case, we already have ptep_modify_prot_start() and
> ptep_modify_prot_commit() for the existing single-pte helper versions. So
> according to the convention (or at least how I interpret the convention), the
> proposed names seem reasonable.
>
Right, I'm fine with following convention (we should), I just find 'ptes'
really ambiguous. It's not just a -set of PTE entries- it's very explicitly
for a large folio. I'd interpret some 'ptes' case to mean 'any number of
pte entries', though I suppose it'd not in practice be any different if
that were the intended use.
However, the proposed use case is large folio 'sub' PTEs and it'd be useful
in callers to know this is explicitly what you're doing.
I feel like '_batched_ptes' makes it clear it's a _specific_ set of PTE
entries you're after (not just in effect multiple PTE entries).
However, I'm less insistent on this with a comment that explains what's
going on.
I don't want to hold this up with trivialities around naming...
ASIDE: I continue to absolutely HATE the ambiguity between 'PxD/PTE' and
'PxD/PTE entries' and the fact we use both as a short-hand for each
other. But that's not related to this series, just a pet peeve... :)
> > it's not any ptes, it's those pte entries
> > belonging to a large folio, capped to the PTE table, that you are
> > batching, right?
>
> Yes, but again by convention, that is captured in the kerneldoc comment for the
> functions. We are operating on a batch of *ptes* not on a folio or batch of
> folios. But it is a requirement of the function that the batch of ptes all lie
> within a single large folio (i.e. the pfns are sequential).
Ack, yeah don't love this nr stuff but fine if it's convention...
> > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> > the name?
> >
> > We definitely need to mention in comment or name or _somewhere_ the intent
> > and motivation for this.
>
> Agreed!
...and luckily we are aligned on this :)
>
> >
> >> +{
> >> + pte_t pte, tmp_pte;
> >> +
> >
> > are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
> > love this interface, where you require the user to know the number of
> > remaining PTE entries in a PTE table.
>
> For better or worse, that's the established convention. See part of comment for
> set_ptes() for example:
>
> """
> * Context: The caller holds the page table lock. The pages all belong
> * to the same folio. The PTEs are all in the same PMD.
> """
>
> >
> >> + pte = ptep_modify_prot_start(vma, addr, ptep);
> >> + while (--nr) {
> >
> > This loop is a bit horrible. It seems needlessly confusing and you're in
> > _dire_ need of comments to explain what's going on.
> >
> > So my understanding is, you have the user figure out:
> >
> > nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
> >
> > Then, you want to return the pte entry belonging to the start of the large
> > folio batch, but you want to adjust that pte value to propagate dirty and
> > young page table flags if any page table entries within the range contain
> > those page table flags, having called ptep_modify_prot_start() on all of
> > them?
> >
> > This is quite a bit to a. put in a header like this and b. not
> > comment/explain.
>
> This style is copied from get_and_clear_full_ptes(), which has this comment,
> which explains all this complexity. My vote would be to have a simple comment
> for this function:
>
> /**
> * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
> * the same folio, collecting dirty/accessed bits.
> * @mm: Address space the pages are mapped into.
> * @addr: Address the first page is mapped at.
> * @ptep: Page table pointer for the first entry.
> * @nr: Number of entries to clear.
> * @full: Whether we are clearing a full mm.
> *
> * May be overridden by the architecture; otherwise, implemented as a simple
> * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
> * returned PTE.
> *
> * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> * some PTEs might be write-protected.
> *
> * Context: The caller holds the page table lock. The PTEs map consecutive
> * pages that belong to the same folio. The PTEs are all in the same PMD.
> */
>
OK I think the key bit here is 'consecutive pages of the same folio'.
I'd like at least a paragraph about implementation, yes the original
function doesn't have that (and should imo), something like:
We perform the operation on the first PTE, then if any others
follow, we invoke the ptep_modify_prot_start() for each and
aggregate A/D bits.
Something like this.
Point taken on consistency though!
> >
> > So maybe something like:
> >
> > pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> > /* Iterate through large folio tail PTEs. */
> > for (pg = 1; pg < nr; pg++) {
> > pte_t inner_pte;
> >
> > ptep++;
> > addr += PAGE_SIZE;
> >
> > inner_pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> > /* We must propagate A/D state from tail PTEs. */
> > if (pte_dirty(inner_pte))
> > pte = pte_mkdirty(pte);
> > if (pte_young(inner_pte))
> > pte = pte_mkyoung(pte);
> > }
> >
> > Would work better?
> >
> >
> >
> >> + ptep++;
> >> + addr += PAGE_SIZE;
> >> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> >
> >
> >> + if (pte_dirty(tmp_pte))
> >> + pte = pte_mkdirty(pte);
> >> + if (pte_young(tmp_pte))
> >> + pte = pte_mkyoung(pte);
> >
> > Why are you propagating these?
> >
> >> + }
> >> + return pte;
> >> +}
> >> +#endif
> >> +
> >> +/* See the comment for ptep_modify_prot_commit */
> >
> > Same comments as above, needs more meat on the bones!
> >
> >> +#ifndef modify_prot_commit_ptes
> >> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> >
> > Again need to reference large folio, batched or something relevant here,
> > 'ptes' is super vague.
> >
> >> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> >
> > Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
> >
> > I'm even more concerned about the 'nr' API here now.
> >
> > So this is now a user-calculated:
> >
> > min3(large_folio_pages, number of pte entries left in ptep,
> > number of pte entries left in old_pte)
> >
> > It really feels like something that should be calculated here, or at least
> > be broken out more clearly.
> >
> > You definitely _at the very least_ need to document it in a comment.
> >
> >> +{
> >> + for (;;) {
> >> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >> + if (--nr == 0)
> >> + break;
> >
> > Why are you doing an infinite loop here with a break like this? Again feels
> > needlessly confusing.
>
> I agree it's not pretty to look at. But apparently it's the most efficient. This
> is Willy's commit that started it all: Commit bcc6cc832573 ("mm: add default
> definition of set_ptes()").
>
> For the record, I think all your comments make good sense, Lorenzo. But there is
> an established style, and personally I think at this point it is more confusing
> to break from that style.
This isn't _quite_ style, I'd say it's implementation, we're kind of
crossing over into something a little more I'd say :) but I mean I get your
point, sure.
I mean, fine, if (I presume you're referring _only_ to the for (;;) case
above) you are absolutely certain it is more performant in practice I
wouldn't want to stand in the way of that.
I would at least like a comment in the commit message about propagating an
odd loop for performance though to explain the 'qualities'... :)
>
> Thanks,
> Ryan
>
>
> >
> > I think it's ok to duplicate this single line for the sake of clarity,
> > also.
> >
> > Which gives us:
> >
> > unsigned long pg;
> >
> > ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> > for (pg = 1; pg < nr; pg++) {
> > ptep++;
> > addr += PAGE_SIZE;
> > old_pte = pte_next_pfn(old_pte);
> > pte = pte_next_pfn(pte);
> >
> > ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> > }
> >
> > There are alternative approaches, but I think doing an infinite loop that
> > breaks and especially the confusing 'if (--foo) break;' stuff is much
> > harder to parse than a super simple ranged loop.
> >
> >> + ptep++;
> >> + addr += PAGE_SIZE;
> >> + old_pte = pte_next_pfn(old_pte);
> >> + pte = pte_next_pfn(pte);
> >> + }
> >> +}
> >> +#endif
> >> +
> >> /*
> >> * On some architectures hardware does not set page access bit when accessing
> >> * memory page, it is responsibility of software setting this bit. It brings
> >> --
> >> 2.30.2
> >>
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-30 6:25 ` Dev Jain
@ 2025-04-30 14:37 ` Lorenzo Stoakes
2025-05-06 14:30 ` David Hildenbrand
0 siblings, 1 reply; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-04-30 14:37 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On Wed, Apr 30, 2025 at 11:55:12AM +0530, Dev Jain wrote:
>
>
> On 29/04/25 7:22 pm, Lorenzo Stoakes wrote:
> > On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
> > > Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> > > Architecture can override these helpers.
> > >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > > include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 38 insertions(+)
> > >
> > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > index b50447ef1c92..ed287289335f 100644
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> > > }
> > > #endif
> > >
> > > +/* See the comment for ptep_modify_prot_start */
> >
> > I feel like you really should add a little more here, perhaps point out
> > that it's batched etc.
>
> Sure. I couldn't easily figure out a way to write the documentation nicely,
> I'll do it this time.
Thanks! Though see the discussion with Ryan also.
>
> >
> > > +#ifndef modify_prot_start_ptes
> > > +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> > > + unsigned long addr, pte_t *ptep, unsigned int nr)
> >
> > This name is a bit confusing, it's not any ptes, it's those pte entries
> > belonging to a large folio, capped to the PTE table, that you are
> > batching, right?
>
> yes, but I am just following the convention. See wrprotect_ptes(), etc. I
> don't have a strong preference anyways.
>
> >
> > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> > the name?
>
> How about modify_prot_start_batched_ptes()?
I like this :) Ryan - does that work for you, or do you feel _batched_ should be
dropped here?
>
> >
> > We definitely need to mention in comment or name or _somewhere_ the intent
> > and motivation for this.
> >
> > > +{
> > > + pte_t pte, tmp_pte;
> > > +
> >
> > are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
> > love this interface, where you require the user to know the number of
> > remaining PTE entries in a PTE table.
>
> Shall I write in the comments that the range is supposed to be within a PTE
> table?
Yeah that'd be helpful I think thanks!
>
> >
> > > + pte = ptep_modify_prot_start(vma, addr, ptep);
> > > + while (--nr) {
> >
> > This loop is a bit horrible. It seems needlessly confusing and you're in
> > _dire_ need of comments to explain what's going on.
>
> Again, following the pattern of get_and_clear_full_ptes :)
Yeah, see discussion with Ryan :>)
> >
> > So my understanding is, you have the user figure out:
> >
> > nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
> >
> > Then, you want to return the pte entry belonging to the start of the large
> > folio batch, but you want to adjust that pte value to propagate dirty and
> > young page table flags if any page table entries within the range contain
> > those page table flags, having called ptep_modify_prot_start() on all of
> > them?
> >
> > This is quite a bit to a. put in a header like this and b. not
> > comment/explain.
> >
> > So maybe something like:
> >
> > pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> > /* Iterate through large folio tail PTEs. */
> > for (pg = 1; pg < nr; pg++) {
> > pte_t inner_pte;
> >
> > ptep++;
> > addr += PAGE_SIZE;
> >
> > inner_pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> > /* We must propagate A/D state from tail PTEs. */
> > if (pte_dirty(inner_pte))
> > pte = pte_mkdirty(pte);
> > if (pte_young(inner_pte))
> > pte = pte_mkyoung(pte);
> > }
> >
> > Would work better?
>
> No preference, I'll do this then.
Thanks!
>
> >
> >
> >
> > > + ptep++;
> > > + addr += PAGE_SIZE;
> > > + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> >
> >
> > > + if (pte_dirty(tmp_pte))
> > > + pte = pte_mkdirty(pte);
> > > + if (pte_young(tmp_pte))
> > > + pte = pte_mkyoung(pte);
> >
> > Why are you propagating these?
>
> Because the a/d bits are per-folio; this will help us batch around
> can_change_pte_writable() (which returns pte_dirty(pte)) and batch around
> pte_needs_flush() for parisc.
Understood, thanks!
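To spell out the payoff for my own benefit: once the batch's A/D bits are
folded into the returned value, a single writability check can cover all nr
entries - roughly this shape (a sketch of the intended usage, not the literal
patch):

	oldpte = modify_prot_start_ptes(vma, addr, pte, nr);
	ptent = pte_modify(oldpte, newprot);
	/* pte_dirty(ptent) is true if *any* PTE in the batch was dirty */
	if (!pte_write(ptent) && can_change_pte_writable(vma, addr, ptent))
		ptent = pte_mkwrite(ptent, vma);
	modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr);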
>
> >
> > > + }
> > > + return pte;
> > > +}
> > > +#endif
> > > +
> > > +/* See the comment for ptep_modify_prot_commit */
> >
> > Same comments as above, needs more meat on the bones!
> >
> > > +#ifndef modify_prot_commit_ptes
> > > +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> >
> > Again need to reference large folio, batched or something relevant here,
> > 'ptes' is super vague.
> >
> > > + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> >
> > Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
>
> Because ptep is a pointer, and old_pte isn't.
Oops :P :) sorry, this is me being a little 'slow' here... I missed that. Carry
on then :P
>
> >
> > I'm even more concerned about the 'nr' API here now.
> >
> > So this is now a user-calculated:
> >
> > min3(large_folio_pages, number of pte entries left in ptep,
> > number of pte entries left in old_pte)
> >
> > It really feels like something that should be calculated here, or at least
> > be broken out more clearly.
> >
> > You definitely _at the very least_ need to document it in a comment.
> >
> > > +{
> > > + for (;;) {
> > > + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> > > + if (--nr == 0)
> > > + break;
> >
> > Why are you doing an infinite loop here with a break like this? Again feels
> > needlessly confusing.
>
> Following wrprotect_ptes().
> I agree that this is confusing, which is why I wondered why it was done in
> the first place :) but I just followed what is already there.
> I'll change this to a simple for loop if that is your inclination.
No, I guess let's keep it as-is; Ryan pointed out there are perf considerations
here. This one is a lot less egregious.
>
> >
> > I think it's ok to duplicate this single line for the sake of clarity,
> > also.
> >
> > Which gives us:
> >
> > unsigned long pg;
> >
> > ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> > for (pg = 1; pg < nr; pg++) {
> > ptep++;
> > addr += PAGE_SIZE;
> > old_pte = pte_next_pfn(old_pte);
> > pte = pte_next_pfn(pte);
> >
> > ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> > }
> >
> > There are alternative approaches, but I think doing an infinite loop that
> > breaks and especially the confusing 'if (--foo) break;' stuff is much
> > harder to parse than a super simple ranged loop.
> >
> > > + ptep++;
> > > + addr += PAGE_SIZE;
> > > + old_pte = pte_next_pfn(old_pte);
> > > + pte = pte_next_pfn(pte);
> > > + }
> > > +}
> > > +#endif
> > > +
> > > /*
> > > * On some architectures hardware does not set page access bit when accessing
> > > * memory page, it is responsibility of software setting this bit. It brings
> > > --
> > > 2.30.2
> > >
>
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-30 14:34 ` Lorenzo Stoakes
@ 2025-05-01 11:33 ` Ryan Roberts
2025-05-01 12:58 ` Lorenzo Stoakes
0 siblings, 1 reply; 53+ messages in thread
From: Ryan Roberts @ 2025-05-01 11:33 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On 30/04/2025 15:34, Lorenzo Stoakes wrote:
> On Wed, Apr 30, 2025 at 03:09:50PM +0100, Ryan Roberts wrote:
>> On 29/04/2025 14:52, Lorenzo Stoakes wrote:
>>> On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
>>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>>> Architecture can override these helpers.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>> include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 38 insertions(+)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index b50447ef1c92..ed287289335f 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>> }
>>>> #endif
>>>>
>>>> +/* See the comment for ptep_modify_prot_start */
>>>
>>> I feel like you really should add a little more here, perhaps point out
>>> that it's batched etc.
>>>
>>>> +#ifndef modify_prot_start_ptes
>>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr)
>>>
>>> This name is a bit confusing,
>>
>> On naming, the existing (modern) convention for single-pte helpers is to start
>> the function name with ptep_. When we started adding batched versions, we took
>> the approach of adding _ptes as a suffix. For example:
>>
>> set_pte_at()
>> ptep_get_and_clear_full()
>> ptep_set_wrprotect()
>>
>> set_ptes()
>> get_and_clear_full_ptes()
>> wrprotect_ptes()
>>
>> In this case, we already have ptep_modify_prot_start() and
>> ptep_modify_prot_commit() for the existing single-pte helper versions. So
>> according to the convention (or at least how I interpret the convention), the
>> proposed names seem reasonable.
>>
>
> Right, I'm fine with following convention (we should), I just find 'ptes'
> really ambiguous. It's not just a -set of PTE entries- it's very explicitly
> for a large folio. I'd interpret some 'ptes' case to mean 'any number of
> pte entries', though I suppose it'd not in practice be any different if
> that were the intended use.
>
> However, the proposed use case is large folio 'sub' PTEs and it'd be useful
> in callers to know this is explicitly what you're doing.
>
> I feel like '_batched_ptes' makes it clear it's a _specific_ set of PTE
> entries you're after (not just in effect multiple PTE entries).
I don't mind _batched_ptes. _pte_batch would be shorter though - what do you think?
But if we go with one of these, then we should consistently apply it to all the
existing helpers IMHO - perhaps with a preparatory patch at the start of the series.
>
> However, I'm less insistent on this with a comment that explains what's
> going on.
That would still get my vote :)
>
> I don't want to hold this up with trivialities around naming...
There are TWO hard things in computer science: cache invalidation, naming, and
off-by-one errors :)
>
> ASIDE: I continue to absolutely HATE the ambiguity between 'PxD/PTE' and
> 'PxD/PTE entries' and the fact we use both as a short-hand for each
> other. But that's not related to this series, just a pet peeve... :)
I assume you are referring to the ambiguity between the *table* and the *entry*
(which just goes to show how ambiguous it is I guess)... I also hate this and
still trip over it all the time...
>
>>> it's not any ptes, it's those pte entries
>>> belonging to a large folio, capped to the PTE table, that you are
>>> batching, right?
>>
>> Yes, but again by convention, that is captured in the kerneldoc comment for the
>> functions. We are operating on a batch of *ptes* not on a folio or batch of
>> folios. But it is a requirement of the function that the batch of ptes all lie
>> within a single large folio (i.e. the pfns are sequential).
>
> Ack, yeah don't love this nr stuff but fine if it's convention...
>
>> > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
>>> the name?
>>>
>>> We definitely need to mention in comment or name or _somewhere_ the intent
>>> and motivation for this.
>>
>> Agreed!
>
> ...and luckily we are aligned on this :)
>
>>
>>>
>>>> +{
>>>> + pte_t pte, tmp_pte;
>>>> +
>>>
>>> are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
>>> love this interface, where you require the user to know the number of
>>> remaining PTE entries in a PTE table.
>>
>> For better or worse, that's the established convention. See part of comment for
>> set_ptes() for example:
>>
>> """
>> * Context: The caller holds the page table lock. The pages all belong
>> * to the same folio. The PTEs are all in the same PMD.
>> """
>>
>>>
>>>> + pte = ptep_modify_prot_start(vma, addr, ptep);
>>>> + while (--nr) {
>>>
>>> This loop is a bit horrible. It seems needlessly confusing and you're in
>>> _dire_ need of comments to explain what's going on.
>>>
>>> So my understanding is, you have the user figure out:
>>>
>>> nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
>>>
>>> Then, you want to return the pte entry belonging to the start of the large
>>> folio batch, but you want to adjust that pte value to propagate dirty and
>>> young page table flags if any page table entries within the range contain
>>> those page table flags, having called ptep_modify_prot_start() on all of
>>> them?
>>>
>>> This is quite a bit to a. put in a header like this and b. not
>>> comment/explain.
>>
>> This style is copied from get_and_clear_full_ptes(), which has this comment,
>> which explains all this complexity. My vote would be to have a simple comment
Oops; I meant "similar" when my fingers somehow typed "simple"... This is not
simple :)
>> for this function:
>>
>> /**
>> * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
>> * the same folio, collecting dirty/accessed bits.
>> * @mm: Address space the pages are mapped into.
>> * @addr: Address the first page is mapped at.
>> * @ptep: Page table pointer for the first entry.
>> * @nr: Number of entries to clear.
>> * @full: Whether we are clearing a full mm.
>> *
>> * May be overridden by the architecture; otherwise, implemented as a simple
>> * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
>> * returned PTE.
>> *
>> * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>> * some PTEs might be write-protected.
>> *
>> * Context: The caller holds the page table lock. The PTEs map consecutive
>> * pages that belong to the same folio. The PTEs are all in the same PMD.
>> */
>>
>
> OK I think the key bit here is 'consecutive pages of the same folio'.
>
> I'd like at least a paragraph about implementation, yes the original
> function doesn't have that (and should imo), something like:
>
> We perform the operation on the first PTE, then if any others
> follow, we invoke the ptep_modify_prot_start() for each and
> aggregate A/D bits.
>
> Something like this.
>
> Point taken on consistency though!
>
>>>
>>> So maybe something like:
>>>
>>> pte = ptep_modify_prot_start(vma, addr, ptep);
>>>
>>> /* Iterate through large folio tail PTEs. */
>>> for (pg = 1; pg < nr; pg++) {
>>> pte_t inner_pte;
>>>
>>> ptep++;
>>> addr += PAGE_SIZE;
>>>
>>> inner_pte = ptep_modify_prot_start(vma, addr, ptep);
>>>
>>> /* We must propagate A/D state from tail PTEs. */
>>> if (pte_dirty(inner_pte))
>>> pte = pte_mkdirty(pte);
>>> if (pte_young(inner_pte))
>>> pte = pte_mkyoung(pte);
>>> }
>>>
>>> Would work better?
>>>
>>>
>>>
>>>> + ptep++;
>>>> + addr += PAGE_SIZE;
>>>> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>>
>>>
>>>
>>>> + if (pte_dirty(tmp_pte))
>>>> + pte = pte_mkdirty(pte);
>>>> + if (pte_young(tmp_pte))
>>>> + pte = pte_mkyoung(pte);
>>>
>>> Why are you propagating these?
>>>
>>>> + }
>>>> + return pte;
>>>> +}
>>>> +#endif
>>>> +
>>>> +/* See the comment for ptep_modify_prot_commit */
>>>
>>> Same comments as above, needs more meat on the bones!
>>>
>>>> +#ifndef modify_prot_commit_ptes
>>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>>>
>>> Again need to reference large folio, batched or something relevant here,
>>> 'ptes' is super vague.
>>>
>>>> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>>>
>>> Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
>>>
>>> I'm even more concerned about the 'nr' API here now.
>>>
>>> So this is now a user-calculated:
>>>
>>> min3(large_folio_pages, number of pte entries left in ptep,
>>> number of pte entries left in old_pte)
>>>
>>> It really feels like something that should be calculated here, or at least
>>> be broken out more clearly.
>>>
>>> You definitely _at the very least_ need to document it in a comment.
>>>
>>>> +{
>>>> + for (;;) {
>>>> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>>> + if (--nr == 0)
>>>> + break;
>>>
>>> Why are you doing an infinite loop here with a break like this? Again feels
>>> needlessly confusing.
>>
>> I agree it's not pretty to look at. But apparently it's the most efficient. This
>> is Willy's commit that started it all: Commit bcc6cc832573 ("mm: add default
>> definition of set_ptes()").
>>
>> For the record, I think all your comments make good sense, Lorenzo. But there is
> >> an established style, and personally I think at this point it is more confusing
>> to break from that style.
>
> This isn't _quite_ style, I'd say it's implementation, we're kind of
> crossing over into something a little more I'd say :) but I mean I get your
> point, sure.
>
> I mean, fine, if (I presume you're referring _only_ to the for (;;) case
> above) you are absolutely certain it is more performant in practice I
> wouldn't want to stand in the way of that.
No I'm not certain at all... I'm just saying that's been the argument in the
past. I vaguely recall I even tried changing the loop style in batched helpers I
implemented in the past and David asked me to stick with the established style.
>
> I would at least like a comment in the commit message about propagating an
> odd loop for performance though to explain the 'qualities'... :)
Just to make it clear, I'm just trying to provide some historical context, I'm
not arguing that all those decisions were perfect. How about we take these
concrete steps:
- Stick with the _ptes naming convention
- Add kerneldoc comments for the 2 new functions that are very clear about
what the function does and the requirements on the batch of ptes (just like
the other batched helpers)
- Rework the looping styles in the 2 new functions to be more "standard";
let's not micro-optimize unless we have real evidence that it is useful.
- Merge this patch with the one that uses these new functions
How does that sound as a way forwards?
Thanks,
Ryan
>
>>
>> Thanks,
>> Ryan
>>
>>
>>>
>>> I think it's ok to duplicate this single line for the sake of clarity,
>>> also.
>>>
>>> Which gives us:
>>>
>>> unsigned long pg;
>>>
>>> ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>> for (pg = 1; pg < nr; pg++) {
>>> ptep++;
>>> addr += PAGE_SIZE;
>>> old_pte = pte_next_pfn(old_pte);
>>> pte = pte_next_pfn(pte);
>>>
>>> ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>> }
>>>
>>> There are alternative approaches, but I think doing an infinite loop that
>>> breaks and especially the confusing 'if (--foo) break;' stuff is much
>>> harder to parse than a super simple ranged loop.
>>>
>>>> + ptep++;
>>>> + addr += PAGE_SIZE;
>>>> + old_pte = pte_next_pfn(old_pte);
>>>> + pte = pte_next_pfn(pte);
>>>> + }
>>>> +}
>>>> +#endif
>>>> +
>>>> /*
>>>> * On some architectures hardware does not set page access bit when accessing
>>>> * memory page, it is responsibility of software setting this bit. It brings
>>>> --
>>>> 2.30.2
>>>>
>>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-05-01 11:33 ` Ryan Roberts
@ 2025-05-01 12:58 ` Lorenzo Stoakes
2025-05-06 14:28 ` David Hildenbrand
0 siblings, 1 reply; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-05-01 12:58 UTC (permalink / raw)
To: Ryan Roberts
Cc: Dev Jain, akpm, david, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On Thu, May 01, 2025 at 12:33:30PM +0100, Ryan Roberts wrote:
> On 30/04/2025 15:34, Lorenzo Stoakes wrote:
> > On Wed, Apr 30, 2025 at 03:09:50PM +0100, Ryan Roberts wrote:
> >> On 29/04/2025 14:52, Lorenzo Stoakes wrote:
> >>> On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
> >>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> >>>> Architecture can override these helpers.
> >>>>
> >>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >>>> ---
> >>>> include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
> >>>> 1 file changed, 38 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>>> index b50447ef1c92..ed287289335f 100644
> >>>> --- a/include/linux/pgtable.h
> >>>> +++ b/include/linux/pgtable.h
> >>>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> >>>> }
> >>>> #endif
> >>>>
> >>>> +/* See the comment for ptep_modify_prot_start */
> >>>
> >>> I feel like you really should add a little more here, perhaps point out
> >>> that it's batched etc.
> >>>
> >>>> +#ifndef modify_prot_start_ptes
> >>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> >>>> + unsigned long addr, pte_t *ptep, unsigned int nr)
> >>>
> >>> This name is a bit confusing,
> >>
> >> On naming, the existing (modern) convention for single-pte helpers is to start
> >> the function name with ptep_. When we started adding batched versions, we took
> >> the approach of adding _ptes as a suffix. For example:
> >>
> >> set_pte_at()
> >> ptep_get_and_clear_full()
> >> ptep_set_wrprotect()
> >>
> >> set_ptes()
> >> get_and_clear_full_ptes()
> >> wrprotect_ptes()
> >>
> >> In this case, we already have ptep_modify_prot_start() and
> >> ptep_modify_prot_commit() for the existing single-pte helper versions. So
> >> according to the convention (or at least how I interpret the convention), the
> >> proposed names seem reasonable.
> >>
> >
> > Right, I'm fine with following convention (we should), I just find 'ptes'
> > really ambiguous. It's not just a -set of PTE entries- it's very explicitly
> > for a large folio. I'd interpret some 'ptes' case to mean 'any number of
> > pte entries', though I suppose it'd not in practice be any different if
> > that were the intended use.
> >
> > However, the proposed use case is large folio 'sub' PTEs and it'd be useful
> > in callers to know this is explicitly what you're doing.
> >
> > I feel like '_batched_ptes' makes it clear it's a _specific_ set of PTE
> > entriess you're after (not just in effect multiple PTE entries).
>
> I don't mind _batched_ptes. _pte_batch would be shorter though - what do you think?
Sounds good!
>
> But if we go with one of these, then we should consistently apply it to all the
> existing helpers IMHO - perhaps with a preparatory patch at the start of the series.
>
> >
> > However, I'm less insistent on this with a comment that explains what's
> > going on.
>
> That would still get my vote :)
Awesome :)
>
> >
> > I don't want to hold this up with trivialities around naming...
>
> There are TWO hard things in computer science: cache invalidation, naming, and
> off-by-one errors :)
Haha yes... I continue to be surprised at how bloody hard it is as my
career goes on...
>
> >
> > ASIDE: I continue to absolutely HATE the ambiguity between 'PxD/PTE' and
> > 'PxD/PTE entries' and the fact we use both as a short-hand for each
> > other. But that's not related to this series, just a pet peeve... :)
>
> I assume you are referring to the ambiguity between the *table* and the *entry*
> (which just goes to show how ambiguous it is I guess)... I also hate this and
> still trip over it all the time...
Yes. As do I, as does everybody I think... Sadly I think it's unavoidable :(
>
> >
> >>> it's not any ptes, it's those pte entries
> >>> belonging to a large folio, capped to the PTE table, that you are
> >>> batching, right?
> >>
> >> Yes, but again by convention, that is captured in the kerneldoc comment for the
> >> functions. We are operating on a batch of *ptes* not on a folio or batch of
> >> folios. But it is a requirement of the function that the batch of ptes all lie
> >> within a single large folio (i.e. the pfns are sequential).
> >
> > Ack, yeah don't love this nr stuff but fine if it's convention...
> >
> >> > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> >>> the name?
> >>>
> >>> We definitely need to mention in comment or name or _somewhere_ the intent
> >>> and motivation for this.
> >>
> >> Agreed!
> >
> > ...and luckily we are aligned on this :)
> >
> >>
> >>>
> >>>> +{
> >>>> + pte_t pte, tmp_pte;
> >>>> +
> >>>
> >>> are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
> >>> love this interface, where you require the user to know the number of
> >>> remaining PTE entries in a PTE table.
> >>
> >> For better or worse, that's the established convention. See part of comment for
> >> set_ptes() for example:
> >>
> >> """
> >> * Context: The caller holds the page table lock. The pages all belong
> >> * to the same folio. The PTEs are all in the same PMD.
> >> """
> >>
> >>>
> >>>> + pte = ptep_modify_prot_start(vma, addr, ptep);
> >>>> + while (--nr) {
> >>>
> >>> This loop is a bit horrible. It seems needlessly confusing and you're in
> >>> _dire_ need of comments to explain what's going on.
> >>>
> >>> So my understanding is, you have the user figure out:
> >>>
> >>> nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
> >>>
> >>> Then, you want to return the pte entry belonging to the start of the large
> >>> folio batch, but you want to adjust that pte value to propagate dirty and
> >>> young page table flags if any page table entries within the range contain
> >>> those page table flags, having called ptep_modify_prot_start() on all of
> >>> them?
> >>>
> >>> This is quite a bit to a. put in a header like this and b. not
> >>> comment/explain.
> >>
> >> This style is copied from get_and_clear_full_ptes(), which has this comment,
> >> which explains all this complexity. My vote would be to have a simple comment
>
> Oops; I meant "similar" when my fingers somehow typed "simple"... This is not
> simple :)
Ha, yeah indeed :P that makes more sense!
>
> >> for this function:
> >>
> >> /**
> >> * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
> >> * the same folio, collecting dirty/accessed bits.
> >> * @mm: Address space the pages are mapped into.
> >> * @addr: Address the first page is mapped at.
> >> * @ptep: Page table pointer for the first entry.
> >> * @nr: Number of entries to clear.
> >> * @full: Whether we are clearing a full mm.
> >> *
> >> * May be overridden by the architecture; otherwise, implemented as a simple
> >> * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
> >> * returned PTE.
> >> *
> >> * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> >> * some PTEs might be write-protected.
> >> *
> >> * Context: The caller holds the page table lock. The PTEs map consecutive
> >> * pages that belong to the same folio. The PTEs are all in the same PMD.
> >> */
> >>
> >
> > OK I think the key bit here is 'consecutive pages of the same folio'.
> >
> > I'd like at least a paragraph about implementation, yes the original
> > function doesn't have that (and should imo), something like:
> >
> > We perform the operation on the first PTE, then if any others
> > follow, we invoke the ptep_modify_prot_start() for each and
> > aggregate A/D bits.
> >
> > Something like this.
> >
> > Point taken on consistency though!
> >
> >>>
> >>> So maybe something like:
> >>>
> >>> pte = ptep_modify_prot_start(vma, addr, ptep);
> >>>
> >>> /* Iterate through large folio tail PTEs. */
> >>> for (pg = 1; pg < nr; pg++) {
> >>> pte_t inner_pte;
> >>>
> >>> ptep++;
> >>> addr += PAGE_SIZE;
> >>>
> >>> inner_pte = ptep_modify_prot_start(vma, addr, ptep);
> >>>
> >>> /* We must propagate A/D state from tail PTEs. */
> >>> if (pte_dirty(inner_pte))
> >>> pte = pte_mkdirty(pte);
> >>> if (pte_young(inner_pte))
> >>> pte = pte_mkyoung(pte);
> >>> }
> >>>
> >>> Would work better?
> >>>
> >>>
> >>>
> >>>> + ptep++;
> >>>> + addr += PAGE_SIZE;
> >>>> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> >>>
> >>>
> >>>
> >>>> + if (pte_dirty(tmp_pte))
> >>>> + pte = pte_mkdirty(pte);
> >>>> + if (pte_young(tmp_pte))
> >>>> + pte = pte_mkyoung(pte);
> >>>
> >>> Why are you propagating these?
> >>>
> >>>> + }
> >>>> + return pte;
> >>>> +}
> >>>> +#endif
> >>>> +
> >>>> +/* See the comment for ptep_modify_prot_commit */
> >>>
> >>> Same comments as above, needs more meat on the bones!
> >>>
> >>>> +#ifndef modify_prot_commit_ptes
> >>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> >>>
> >>> Again need to reference large folio, batched or something relevant here,
> >>> 'ptes' is super vague.
> >>>
> >>>> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> >>>
> >>> Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
> >>>
> >>> I'm even more concerned about the 'nr' API here now.
> >>>
> >>> So this is now a user-calculated:
> >>>
> >>> min3(large_folio_pages, number of pte entries left in ptep,
> >>> number of pte entries left in old_pte)
> >>>
> >>> It really feels like something that should be calculated here, or at least
> >>> be broken out more clearly.
> >>>
> >>> You definitely _at the very least_ need to document it in a comment.
> >>>
> >>>> +{
> >>>> + for (;;) {
> >>>> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >>>> + if (--nr == 0)
> >>>> + break;
> >>>
> >>> Why are you doing an infinite loop here with a break like this? Again feels
> >>> needlessly confusing.
> >>
> >> I agree it's not pretty to look at. But apparently it's the most efficient. This
> >> is Willy's commit that started it all: Commit bcc6cc832573 ("mm: add default
> >> definition of set_ptes()").
> >>
> >> For the record, I think all your comments make good sense, Lorenzo. But there is
> >> an established style, and personally I think at this point it is more confusing
> >> to break from that style.
> >
> > This isn't _quite_ style, I'd say it's implementation, we're kind of
> > crossing over into something a little more I'd say :) but I mean I get your
> > point, sure.
> >
> > I mean, fine, if (I presume you're referring _only_ to the for (;;) case
> > above) you are absolutely certain it is more performant in practice I
> > wouldn't want to stand in the way of that.
>
> No I'm not certain at all... I'm just saying that's been the argument in the
> past. I vaguely recall I even tried changing the loop style in batched helpers I
> implemented in the past and David asked me to stick with the established style.
I definitely defer to David's expertise, but I feel there's some room here
for improving things.
>
> >
> > I would at least like a comment in the commit message about propagating an
> > odd loop for performance though to explain the 'qualities'... :)
>
> Just to make it clear, I'm just trying to provide some historical context, I'm
> not arguing that all those decisions were perfect. How about we take these
> concrete steps:
Ack sure.
>
> - Stick with the _ptes naming convention
> - Add kerneldoc comments for the 2 new functions that are very clear about
> what the function does and the requirements on the batch of ptes (just like
> the other batched helpers)
> - Rework the looping styles in the 2 new functions to be more "standard";
> let's not micro-optimize unless we have real evidence that it is useful.
> - Merge this patch with the one that uses these new functions
>
> How does that sound as a way forwards?
Sounds good to me!
Cheers, Lorenzo
>
> Thanks,
> Ryan
>
> >
> >>
> >> Thanks,
> >> Ryan
> >>
> >>
> >>>
> >>> I think it's ok to duplicate this single line for the sake of clarity,
> >>> also.
> >>>
> >>> Which gives us:
> >>>
> >>> unsigned long pg;
> >>>
> >>> ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >>> for (pg = 1; pg < nr; pg++) {
> >>> ptep++;
> >>> addr += PAGE_SIZE;
> >>> old_pte = pte_next_pfn(old_pte);
> >>> pte = pte_next_pfn(pte);
> >>>
> >>> ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >>> }
> >>>
> >>> There are alternative approaches, but I think doing an infinite loop that
> >>> breaks and especially the confusing 'if (--foo) break;' stuff is much
> >>> harder to parse than a super simple ranged loop.
> >>>
> >>>> + ptep++;
> >>>> + addr += PAGE_SIZE;
> >>>> + old_pte = pte_next_pfn(old_pte);
> >>>> + pte = pte_next_pfn(pte);
> >>>> + }
> >>>> +}
> >>>> +#endif
> >>>> +
> >>>> /*
> >>>> * On some architectures hardware does not set page access bit when accessing
> >>>> * memory page, it is responsibility of software setting this bit. It brings
> >>>> --
> >>>> 2.30.2
> >>>>
> >>
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
2025-04-29 9:27 ` David Hildenbrand
2025-04-29 13:57 ` Lorenzo Stoakes
@ 2025-05-06 9:16 ` Dev Jain
2025-05-06 14:34 ` David Hildenbrand
1 sibling, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-05-06 9:16 UTC (permalink / raw)
To: David Hildenbrand, akpm
Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On 29/04/25 2:57 pm, David Hildenbrand wrote:
> On 29.04.25 11:19, David Hildenbrand wrote:
>>
>>> #include "internal.h"
>>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> - pte_t pte)
>>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>>> + pte_t pte, struct folio *folio, unsigned int nr)
>>> {
>>> struct page *page;
>>> @@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> * write-fault handler similarly would map them writable without
>>> * any additional checks while holding the PT lock.
>>> */
>>> - page = vm_normal_page(vma, addr, pte);
>>> - return page && PageAnon(page) && PageAnonExclusive(page);
>>> + if (!folio)
>>> + folio = vm_normal_folio(vma, addr, pte);
>>> + return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
>>
>> Oh no, now I spot it. That is horribly wrong.
>>
>> Please understand first what you are doing.
>
> Also, would expect that the cow.c selftest would catch that:
>
> "vmsplice() + unmap in child with mprotect() optimization"
>
> After fork() we have a R/O PTE in the parent. Our child then uses
> vmsplice() and unmaps the R/O PTE, meaning it is only left mapped by the
> parent.
>
> ret = mprotect(mem, size, PROT_READ);
> ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);
>
> should turn the PTE writable, although it shouldn't.
>
> If that test case does not detect the issue you're introducing, we
> should look into adding a test case that detects it.
>
Hi David, I am afraid I don't understand my mistake :( PageAnon(page)
boils down to folio_test_anon(folio). Next we want to determine whether
the page underlying a PTE is mapped exclusively or not. I approximate
this with folio_maybe_mapped_shared() -> if the folio is not
maybe-mapped-shared, I take that to mean all of its pages are mapped
exclusively, and I convert the entire batch to writable. If one of the
pages is mapped shared, then I do not convert the batch to writable, thus
missing out on the optimization. As far as I understand, the test failure
points out exactly this, right?
Do you suggest an alternate way? My initial approach was to add a new
flag to folio_pte_batch: FPB_IGNORE_ANON_EXCLUSIVE, but from an API
design PoV Ryan pointed out that that looked bad.
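For concreteness, a per-page variant keeping the old semantics would have
looked something like this hypothetical helper (not an existing kernel
function):

	/*
	 * Writable only if every page of the batch is mapped exclusively;
	 * PageAnonExclusive() is tracked per page, not per folio.
	 */
	static bool batch_anon_exclusive(struct page *page, unsigned int nr)
	{
		unsigned int i;

		for (i = 0; i < nr; i++) {
			if (!PageAnonExclusive(page + i))
				return false;
		}
		return true;
	}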
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-05-01 12:58 ` Lorenzo Stoakes
@ 2025-05-06 14:28 ` David Hildenbrand
0 siblings, 0 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-05-06 14:28 UTC (permalink / raw)
To: Lorenzo Stoakes, Ryan Roberts
Cc: Dev Jain, akpm, willy, linux-mm, linux-kernel, catalin.marinas,
will, Liam.Howlett, vbabka, jannh, anshuman.khandual, peterx,
joey.gouly, ioworker0, baohua, kevin.brodsky, quic_zhenhuah,
christophe.leroy, yangyicong, linux-arm-kernel, namit, hughd,
yang, ziy
On 01.05.25 14:58, Lorenzo Stoakes wrote:
> On Thu, May 01, 2025 at 12:33:30PM +0100, Ryan Roberts wrote:
>> On 30/04/2025 15:34, Lorenzo Stoakes wrote:
>>> On Wed, Apr 30, 2025 at 03:09:50PM +0100, Ryan Roberts wrote:
>>>> On 29/04/2025 14:52, Lorenzo Stoakes wrote:
>>>>> On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
>>>>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>>>>> Architecture can override these helpers.
>>>>>>
>>>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>>>> ---
>>>>>> include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>>>>>> 1 file changed, 38 insertions(+)
>>>>>>
>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>> index b50447ef1c92..ed287289335f 100644
>>>>>> --- a/include/linux/pgtable.h
>>>>>> +++ b/include/linux/pgtable.h
>>>>>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>> }
>>>>>> #endif
>>>>>>
>>>>>> +/* See the comment for ptep_modify_prot_start */
>>>>>
>>>>> I feel like you really should add a little more here, perhaps point out
>>>>> that it's batched etc.
>>>>>
>>>>>> +#ifndef modify_prot_start_ptes
>>>>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>>>>> + unsigned long addr, pte_t *ptep, unsigned int nr)
>>>>>
>>>>> This name is a bit confusing,
>>>>
>>>> On naming, the existing (modern) convention for single-pte helpers is to start
>>>> the function name with ptep_. When we started adding batched versions, we took
>>>> the approach of adding _ptes as a suffix. For example:
>>>>
>>>> set_pte_at()
>>>> ptep_get_and_clear_full()
>>>> ptep_set_wrprotect()
>>>>
>>>> set_ptes()
>>>> get_and_clear_full_ptes()
>>>> wrprotect_ptes()
>>>>
>>>> In this case, we already have ptep_modify_prot_start() and
>>>> ptep_modify_prot_commit() for the existing single-pte helper versions. So
>>>> according to the convention (or at least how I interpret the convention), the
>>>> proposed names seem reasonable.
>>>>
>>>
>>> Right, I'm fine with following convention (we should), I just find 'ptes'
>>> really ambiguous. It's not just a -set of PTE entries- it's very explicitly
>>> for a large folio. I'd interpret some 'ptes' case to mean 'any number of
>>> pte entries', though I suppose it'd not in practice be any different if
>>> that were the intended use.
>>>
>>> However, the proposed use case is large folio 'sub' PTEs and it'd be useful
>>> in callers to know this is explicitly what you're doing.
>>>
>>> I feel like '_batched_ptes' makes it clear it's a _specific_ set of PTE
>>> entries you're after (not just in effect multiple PTE entries).
>>
>> I don't mind _batched_ptes. _pte_batch would be shorter though - what do you think?
>
> Sounds good!
>
>>
>> But if we go with one of these, then we should consistently apply it to all the
>> existing helpers IMHO - perhaps with a preparatory patch at the start of the series.
>>
>>>
>>> However, I'm less insistent on this with a comment that explains what's
>>> going on.
>>
>> That would still get my vote :)
>
> Awesome :)
>
>>
>>>
>>> I don't want to hold this up with trivialities around naming...
>>
>> There are TWO hard things in computer science: cache invalidation, naming, and
>> off-by-one errors :)
>
> Haha yes... I continue to be surprised at how bloody hard it is as my
> career goes on...
>
>>
>>>
>>> ASIDE: I continue to absolutely HATE the ambiguity between 'PxD/PTE' and
>>> 'PxD/PTE entries' and the fact we use both as a short-hand for each
>>> other. But that's not related to this series, just a pet peeve... :)
>>
>> I assume you are referring to the ambiguity between the *table* and the *entry*
>> (which just goes to show how ambiguous it is I guess)... I also hate this and
>> still trip over it all the time...
>
> Yes. As do I, as does everybody I think... Sadly I think unavoidable :(
>
>>
>>>
>>>>> it's not any ptes, it's those pte entries
>>>>> belonging to a large folio capped to the PTE table that you are
>>>>> batching, right?
>>>>
>>>> Yes, but again by convention, that is captured in the kerneldoc comment for the
>>>> functions. We are operating on a batch of *ptes* not on a folio or batch of
>>>> folios. But it is a requirement of the function that the batch of ptes all lie
>>>> within a single large folio (i.e. the pfns are sequential).
>>>
>>> Ack, yeah don't love this nr stuff but fine if it's convention...
>>>
>>>>> Perhaps modify_prot_start_large_folio()? Or something with 'batched' in
>>>>> the name?
>>>>>
>>>>> We definitely need to mention in comment or name or _somewhere_ the intent
>>>>> and motivation for this.
>>>>
>>>> Agreed!
>>>
>>> ...and luckily we are aligned on this :)
>>>
>>>>
>>>>>
>>>>>> +{
>>>>>> + pte_t pte, tmp_pte;
>>>>>> +
>>>>>
>>>>> are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
>>>>> love this interface, where you require the user to know the number of
>>>>> remaining PTE entries in a PTE table.
>>>>
>>>> For better or worse, that's the established convention. See part of comment for
>>>> set_ptes() for example:
>>>>
>>>> """
>>>> * Context: The caller holds the page table lock. The pages all belong
>>>> * to the same folio. The PTEs are all in the same PMD.
>>>> """
>>>>
>>>>>
>>>>>> + pte = ptep_modify_prot_start(vma, addr, ptep);
>>>>>> + while (--nr) {
>>>>>
>>>>> This loop is a bit horrible. It seems needlessly confusing and you're in
>>>>> _dire_ need of comments to explain what's going on.
>>>>>
>>>>> So my understanding is, you have the user figure out:
>>>>>
>>>>> nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
>>>>>
>>>>> Then, you want to return the pte entry belonging to the start of the large
>>>>> folio batch, but you want to adjust that pte value to propagate dirty and
>>>>> young page table flags if any page table entries within the range contain
>>>>> those page table flags, having called ptep_modify_prot_start() on all of
>>>>> them?
>>>>>
>>>>> This is quite a bit to a. put in a header like this and b. not
>>>>> comment/explain.
>>>>
>>>> This style is copied from get_and_clear_full_ptes(), which has this comment,
>>>> which explains all this complexity. My vote would be to have a simple comment
>>
>> Oops; I meant "similar" when my fingers somehow typed "simple"... This is not
>> simple :)
>
> Ha, yeah indeed :P that makes more sense!
>
>>
>>>> for this function:
>>>>
>>>> /**
>>>> * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
>>>> * the same folio, collecting dirty/accessed bits.
>>>> * @mm: Address space the pages are mapped into.
>>>> * @addr: Address the first page is mapped at.
>>>> * @ptep: Page table pointer for the first entry.
>>>> * @nr: Number of entries to clear.
>>>> * @full: Whether we are clearing a full mm.
>>>> *
>>>> * May be overridden by the architecture; otherwise, implemented as a simple
>>>> * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
>>>> * returned PTE.
>>>> *
>>>> * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>>>> * some PTEs might be write-protected.
>>>> *
>>>> * Context: The caller holds the page table lock. The PTEs map consecutive
>>>> * pages that belong to the same folio. The PTEs are all in the same PMD.
>>>> */
>>>>
>>>
>>> OK I think the key bit here is 'consecutive pages of the same folio'.
>>>
>>> I'd like at least a paragraph about implementation, yes the original
>>> function doesn't have that (and should imo), something like:
>>>
>>> We perform the operation on the first PTE, then if any others
>>> follow, we invoke the ptep_modify_prot_start() for each and
>>> aggregate A/D bits.
>>>
>>> Something like this.
>>>
>>> Point taken on consistency though!
>>>
>>>>>
>>>>> So maybe something like:
>>>>>
>>>>> pte = ptep_modify_prot_start(vma, addr, ptep);
>>>>>
>>>>> /* Iterate through large folio tail PTEs. */
>>>>> for (pg = 1; pg < nr; pg++) {
>>>>> pte_t inner_pte;
>>>>>
>>>>> ptep++;
>>>>> addr += PAGE_SIZE;
>>>>>
>>>>> inner_pte = ptep_modify_prot_start(vma, addr, ptep);
>>>>>
>>>>> /* We must propagate A/D state from tail PTEs. */
>>>>> if (pte_dirty(inner_pte))
>>>>> pte = pte_mkdirty(pte);
>>>>> if (pte_young(inner_pte))
>>>>> pte = pte_mkyoung(pte);
>>>>> }
>>>>>
>>>>> Would work better?
>>>>>
>>>>>
>>>>>
>>>>>> + ptep++;
>>>>>> + addr += PAGE_SIZE;
>>>>>> + tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>>>>
>>>>>
>>>>>
>>>>>> + if (pte_dirty(tmp_pte))
>>>>>> + pte = pte_mkdirty(pte);
>>>>>> + if (pte_young(tmp_pte))
>>>>>> + pte = pte_mkyoung(pte);
>>>>>
>>>>> Why are you propagating these?
>>>>>
>>>>>> + }
>>>>>> + return pte;
>>>>>> +}
>>>>>> +#endif
>>>>>> +
>>>>>> +/* See the comment for ptep_modify_prot_commit */
>>>>>
>>>>> Same comments as above, needs more meat on the bones!
>>>>>
>>>>>> +#ifndef modify_prot_commit_ptes
>>>>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>>>>>
>>>>> Again need to reference large folio, batched or something relevant here,
>>>>> 'ptes' is super vague.
>>>>>
>>>>>> + pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>>>>>
>>>>> Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
>>>>>
>>>>> I'm even more concerned about the 'nr' API here now.
>>>>>
>>>>> So this is now a user-calculated:
>>>>>
>>>>> min3(large_folio_pages, number of pte entries left in ptep,
>>>>> number of pte entries left in old_pte)
>>>>>
>>>>> It really feels like something that should be calculated here, or at least
>>>>> be broken out more clearly.
>>>>>
>>>>> You definitely _at the very least_ need to document it in a comment.
>>>>>
>>>>>> +{
>>>>>> + for (;;) {
>>>>>> + ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>>>>> + if (--nr == 0)
>>>>>> + break;
>>>>>
>>>>> Why are you doing an infinite loop here with a break like this? Again feels
>>>>> needlessly confusing.
>>>>
>>>> I agree it's not pretty to look at. But apparently it's the most efficient. This
>>>> is Willy's commit that started it all: Commit bcc6cc832573 ("mm: add default
>>>> definition of set_ptes()").
>>>>
>>>> For the record, I think all your comments make good sense, Lorenzo. But there is
>>>> an established style, and personally I think at this point is it more confusing
>>>> to break from that style.
>>>
>>> This isn't _quite_ style, I'd say it's implementation, we're kind of
>>> crossing over into something a little more I'd say :) but I mean I get your
>>> point, sure.
>>>
>>> I mean, fine, if (I presume you're referring _only_ to the for (;;) case
>>> above) you are absolutely certain it is more performant in practice I
>>> wouldn't want to stand in the way of that.
>>
>> No I'm not certain at all... I'm just saying that's been the argument in the
>> past. I vaguely recall I even tried changing the loop style in batched helpers I
>> implemented in the past and David asked me to stick with the established style.
>
> I definitely defer to David's expertise, but I feel there's some room here
> for improving things.
Yeah, I recall Willy introducing that scheme, arguing that it is the
most efficient one. Can't argue with that :)
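For reference, the loop shape that commit established looks roughly like
this (a sketch from memory, not a verbatim copy of the kernel source):

	static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
			pte_t *ptep, pte_t pte, unsigned int nr)
	{
		for (;;) {
			set_pte_at(mm, addr, ptep, pte);
			if (--nr == 0)
				break;
			/*
			 * The advance work below is skipped entirely on
			 * the final iteration; a 0..nr-1 ranged loop
			 * would execute it once more before the exit
			 * test.
			 */
			ptep++;
			addr += PAGE_SIZE;
			pte = pte_next_pfn(pte);
		}
	}

Whether that saving is measurable is debatable, as discussed above, but
it explains why the batched helpers keep this shape.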
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-04-30 14:37 ` Lorenzo Stoakes
@ 2025-05-06 14:30 ` David Hildenbrand
2025-05-06 15:03 ` Lorenzo Stoakes
0 siblings, 1 reply; 53+ messages in thread
From: David Hildenbrand @ 2025-05-06 14:30 UTC (permalink / raw)
To: Lorenzo Stoakes, Dev Jain
Cc: akpm, ryan.roberts, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On 30.04.25 16:37, Lorenzo Stoakes wrote:
> On Wed, Apr 30, 2025 at 11:55:12AM +0530, Dev Jain wrote:
>>
>>
>> On 29/04/25 7:22 pm, Lorenzo Stoakes wrote:
>>> On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
>>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>>> Architecture can override these helpers.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>> include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 38 insertions(+)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index b50447ef1c92..ed287289335f 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>> }
>>>> #endif
>>>>
>>>> +/* See the comment for ptep_modify_prot_start */
>>>
>>> I feel like you really should add a little more here, perhaps point out
>>> that it's batched etc.
>>
>> Sure. I couldn't easily figure out a way to write the documentation nicely;
>> I'll do it this time.
>
> Thanks! Though see the discussion with Ryan also.
>
>>
>>>
>>>> +#ifndef modify_prot_start_ptes
>>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr)
>>>
>>> This name is a bit confusing, it's not any ptes, it's those pte entries
>>> belonging to a large folio capped to the PTE table that you are
>>> batching, right?
>>
>> yes, but I am just following the convention. See wrprotect_ptes(), etc. I
>> don't have a strong preference anyway.
>>
>>>
>>> Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
>>> the name?
>>
>> How about modify_prot_start_batched_ptes()?
>
> I like this :) Ryan - that work for you, or do you feel _batched_ should be
> dropped here?
modify_prot_start_folio_ptes ?
But I would rather go with
modify_prot_folio_ptes_start
The "batched" is implicit, and "large folio" is not required if it's
more than one pte ...
:)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
2025-05-06 9:16 ` Dev Jain
@ 2025-05-06 14:34 ` David Hildenbrand
0 siblings, 0 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-05-06 14:34 UTC (permalink / raw)
To: Dev Jain, akpm
Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On 06.05.25 11:16, Dev Jain wrote:
>
>
> On 29/04/25 2:57 pm, David Hildenbrand wrote:
>> On 29.04.25 11:19, David Hildenbrand wrote:
>>>
>>>> #include "internal.h"
>>>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> - pte_t pte)
>>>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> + pte_t pte, struct folio *folio, unsigned int nr)
>>>> {
>>>> struct page *page;
>>>> @@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> * write-fault handler similarly would map them writable without
>>>> * any additional checks while holding the PT lock.
>>>> */
>>>> - page = vm_normal_page(vma, addr, pte);
>>>> - return page && PageAnon(page) && PageAnonExclusive(page);
>>>> + if (!folio)
>>>> + folio = vm_normal_folio(vma, addr, pte);
>>>> + return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
>>>
>>> Oh no, now I spot it. That is horribly wrong.
>>>
>>> Please understand first what you are doing.
>>
>> Also, I would expect that the cow.c selftest would catch that:
>>
>> "vmsplice() + unmap in child with mprotect() optimization"
>>
>> After fork() we have a R/O PTE in the parent. Our child then uses
>> vmsplice() and unmaps the R/O PTE, meaning it is only left mapped by the
>> parent.
>>
>> ret = mprotect(mem, size, PROT_READ);
>> ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);
>>
>> would, with this change, turn the PTE writable, although it shouldn't.
>>
>> If that test case does not detect the issue you're introducing, we
>> should look into adding a test case that detects it.
>>
>
> Hi David, I am afraid I don't understand my mistake :( PageAnon(page)
> boils down to folio_test_anon(folio). Next we want to determine whether
> the page underlying a PTE is mapped exclusively or not.
No. :)
There is your mistake.
We need to know if this folio page is *exclusive*, not if the folio is
*mapped exclusively*.
I know, it's confusing, but that's an important distinction.
You really have to test all PAE bits. I recently sent a mail on how we
could remove PAE and encode it in the PTE value itself (W || (!W+dirty)),
which would mean that we could batch more easily. For the time being, we
have to stick with what we have.
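To make the distinction concrete: keeping the current semantics for a
batch means checking every page individually, along the lines of this
sketch (a hypothetical helper, not part of the series):

	/*
	 * A batch may only be made writable if every page in it is
	 * anon-exclusive; a single non-exclusive page vetoes the batch.
	 * PageAnonExclusive() is tracked per page, not per folio, which
	 * is why folio_maybe_mapped_shared() alone cannot answer this.
	 */
	static bool batch_all_pages_anon_exclusive(struct folio *folio,
			struct page *page, unsigned int nr)
	{
		unsigned int i;

		if (!folio_test_anon(folio))
			return false;
		for (i = 0; i < nr; i++) {
			if (!PageAnonExclusive(page + i))
				return false;
		}
		return true;
	}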
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
2025-05-06 14:30 ` David Hildenbrand
@ 2025-05-06 15:03 ` Lorenzo Stoakes
0 siblings, 0 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-05-06 15:03 UTC (permalink / raw)
To: David Hildenbrand
Cc: Dev Jain, akpm, ryan.roberts, willy, linux-mm, linux-kernel,
catalin.marinas, will, Liam.Howlett, vbabka, jannh,
anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
linux-arm-kernel, namit, hughd, yang, ziy
On Tue, May 06, 2025 at 04:30:18PM +0200, David Hildenbrand wrote:
> On 30.04.25 16:37, Lorenzo Stoakes wrote:
> > On Wed, Apr 30, 2025 at 11:55:12AM +0530, Dev Jain wrote:
> > >
> > >
> > > On 29/04/25 7:22 pm, Lorenzo Stoakes wrote:
> > > > On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
> > > > > Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> > > > > Architecture can override these helpers.
> > > > >
> > > > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > > > ---
> > > > > include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 38 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > > > index b50447ef1c92..ed287289335f 100644
> > > > > --- a/include/linux/pgtable.h
> > > > > +++ b/include/linux/pgtable.h
> > > > > @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> > > > > }
> > > > > #endif
> > > > >
> > > > > +/* See the comment for ptep_modify_prot_start */
> > > >
> > > > I feel like you really should add a little more here, perhaps point out
> > > > that it's batched etc.
> > >
> > > Sure. I couldn't easily figure out a way to write the documentation nicely;
> > > I'll do it this time.
> >
> > Thanks! Though see the discussion with Ryan also.
> >
> > >
> > > >
> > > > > +#ifndef modify_prot_start_ptes
> > > > > +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> > > > > + unsigned long addr, pte_t *ptep, unsigned int nr)
> > > >
> > > > This name is a bit confusing, it's not any ptes, it's those pte entries
> > > > belonging to a large folio capped to the PTE table that you are
> > > > batching, right?
> > >
> > > yes, but I am just following the convention. See wrprotect_ptes(), etc. I
> > > don't have a strong preference anyway.
> > >
> > > >
> > > > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> > > > the name?
> > >
> > > How about modify_prot_start_batched_ptes()?
> >
> > I like this :) Ryan - that work for you, or do you feel _batched_ should be
> > dropped here?
>
>
> modify_prot_start_folio_ptes ?
>
> But I would rather go with
>
> modify_prot_folio_ptes_start
>
> The "batched" is implicit, and "large folio" is not required if it's more
> than one pte ...
Yeah that works for me! The mention of folio with plural does neatly imply
the rest.
Naming is hard :P
>
> :)
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 53+ messages in thread
Thread overview: 53+ messages (newest: 2025-05-06 18:54 UTC)
2025-04-29 5:23 [PATCH v2 0/7] Optimize mprotect for large folios Dev Jain
2025-04-29 5:23 ` [PATCH v2 1/7] mm: Refactor code in mprotect Dev Jain
2025-04-29 6:41 ` Anshuman Khandual
2025-04-29 6:54 ` Dev Jain
2025-04-29 11:00 ` Lorenzo Stoakes
2025-04-29 5:23 ` [PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs Dev Jain
2025-04-29 7:14 ` Anshuman Khandual
2025-04-29 8:59 ` Dev Jain
2025-04-29 13:19 ` Lorenzo Stoakes
2025-04-30 6:37 ` Dev Jain
2025-04-30 13:18 ` Ryan Roberts
2025-04-30 13:36 ` Lorenzo Stoakes
2025-04-29 5:23 ` [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
2025-04-29 8:39 ` Anshuman Khandual
2025-04-29 9:01 ` Dev Jain
2025-04-29 13:52 ` Lorenzo Stoakes
2025-04-30 6:25 ` Dev Jain
2025-04-30 14:37 ` Lorenzo Stoakes
2025-05-06 14:30 ` David Hildenbrand
2025-05-06 15:03 ` Lorenzo Stoakes
2025-04-30 14:09 ` Ryan Roberts
2025-04-30 14:34 ` Lorenzo Stoakes
2025-05-01 11:33 ` Ryan Roberts
2025-05-01 12:58 ` Lorenzo Stoakes
2025-05-06 14:28 ` David Hildenbrand
2025-04-30 5:35 ` kernel test robot
2025-04-30 5:45 ` kernel test robot
2025-04-30 14:16 ` Ryan Roberts
2025-04-29 5:23 ` [PATCH v2 4/7] arm64: Add batched version of ptep_modify_prot_start Dev Jain
2025-04-30 5:43 ` Anshuman Khandual
2025-04-30 5:49 ` Dev Jain
2025-04-30 6:14 ` Anshuman Khandual
2025-04-30 6:32 ` Dev Jain
2025-04-29 5:23 ` [PATCH v2 5/7] arm64: Add batched version of ptep_modify_prot_commit Dev Jain
2025-04-29 5:23 ` [PATCH v2 6/7] mm: Batch around can_change_pte_writable() Dev Jain
2025-04-29 9:15 ` David Hildenbrand
2025-04-29 9:19 ` David Hildenbrand
2025-04-29 9:27 ` David Hildenbrand
2025-04-29 13:57 ` Lorenzo Stoakes
2025-04-29 14:00 ` David Hildenbrand
2025-04-30 5:44 ` Dev Jain
2025-05-06 9:16 ` Dev Jain
2025-05-06 14:34 ` David Hildenbrand
2025-04-30 6:17 ` kernel test robot
2025-04-29 5:23 ` [PATCH v2 7/7] mm: Optimize mprotect() through PTE-batching Dev Jain
2025-04-29 7:06 ` [PATCH v2 0/7] Optimize mprotect for large folios Lance Yang
2025-04-29 9:02 ` Dev Jain
2025-04-29 10:41 ` Lorenzo Stoakes
2025-04-30 5:42 ` Dev Jain
2025-04-30 6:22 ` Lance Yang
2025-04-30 7:07 ` Dev Jain
2025-04-29 11:03 ` Lorenzo Stoakes
2025-04-29 14:02 ` David Hildenbrand