[PATCH v3 0/5] Optimize mprotect() for large folios

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/5] Optimize mprotect() for large folios
@ 2025-05-19  7:48 Dev Jain
  2025-05-19  7:48 ` [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs Dev Jain
                   ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: Dev Jain @ 2025-05-19  7:48 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

This patchset optimizes the mprotect() system call for large folios
by PTE-batching.

We use the following test cases to measure performance, mprotect()'ing
the mapped memory to read-only then read-write 40 times:

Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
pte-mapping those THPs
Test case 2: Mapping 1G of memory with 64K mTHPs
Test case 3: Mapping 1G of memory with 4K pages

Average execution time on arm64, Apple M3:
Before the patchset:
T1: 7.9 seconds   T2: 7.9 seconds   T3: 4.2 seconds

After the patchset:
T1: 2.1 seconds   T2: 2.2 seconds   T3: 4.3 seconds

Observing T1/T2 and T3 before the patchset, we also remove the regression
introduced by ptep_get() on a contpte block. And, for large folios we get
an almost 74% performance improvement, albeit the trade-off being a slight
degradation in the small folio case.

Here is the test program:

 #define _GNU_SOURCE
 #include <sys/mman.h>
 #include <stdlib.h>
 #include <string.h>
 #include <stdio.h>
 #include <unistd.h>

 #define SIZE (1024*1024*1024)

unsigned long pmdsize = (1UL << 21);
unsigned long pagesize = (1UL << 12);

static void pte_map_thps(char *mem, size_t size)
{
	size_t offs;
	int ret = 0;


	/* PTE-map each THP by temporarily splitting the VMAs. */
	for (offs = 0; offs < size; offs += pmdsize) {
		ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
		ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
	}

	if (ret) {
		fprintf(stderr, "ERROR: mprotect() failed\n");
		exit(1);
	}
}

int main(int argc, char *argv[])
{
	char *p;
        int ret = 0;
	p = mmap((1UL << 30), SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p != (1UL << 30)) {
		perror("mmap");
		return 1;
	}



	memset(p, 0, SIZE);
	if (madvise(p, SIZE, MADV_NOHUGEPAGE))
		perror("madvise");
	explicit_bzero(p, SIZE);
	pte_map_thps(p, SIZE);

	for (int loops = 0; loops < 40; loops++) {
		if (mprotect(p, SIZE, PROT_READ))
			perror("mprotect"), exit(1);
		if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
			perror("mprotect"), exit(1);
		explicit_bzero(p, SIZE);
	}
}

The patchset is rebased onto mm-unstable (9ead4336d7c07f085def6ab372245640a22af5bd).

v2->v3:
 - Add comments for the new APIs (Ryan, Lorenzo)
 - Instead of refactoring, use a "skip_batch" label
 - Move arm64 patches at the end (Ryan)
 - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
 - Resolve implicit declaration; tested build on x86 (Lance Yang)

v1->v2:
 - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
 - Abridge the anon-exclusive condition (Lance Yang)
  
Dev Jain (5):
  mm: Optimize mprotect() by batch-skipping PTEs
  mm: Add batched versions of ptep_modify_prot_start/commit
  mm: Optimize mprotect() by PTE batching
  arm64: Add batched version of ptep_modify_prot_start
  arm64: Add batched version of ptep_modify_prot_commit

 arch/arm64/include/asm/pgtable.h |  10 ++
 arch/arm64/mm/mmu.c              |  21 ++++-
 include/linux/mm.h               |   7 +-
 include/linux/pgtable.h          |  79 ++++++++++++++++
 mm/mprotect.c                    | 154 +++++++++++++++++++++++++------
 mm/pgtable-generic.c             |  16 +++-
 6 files changed, 246 insertions(+), 41 deletions(-)

-- 
2.30.2



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs
  2025-05-19  7:48 [PATCH v3 0/5] Optimize mprotect() for large folios Dev Jain
@ 2025-05-19  7:48 ` Dev Jain
  2025-05-21  8:43   ` Ryan Roberts
                     ` (2 more replies)
  2025-05-19  7:48 ` [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 34+ messages in thread
From: Dev Jain @ 2025-05-19  7:48 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

In case of prot_numa, there are various cases in which we can skip to the
next iteration. Since the skip condition is based on the folio and not
the PTEs, we can skip a PTE batch.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mprotect.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 88608d0dc2c2..1ee160ed0b14 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -83,6 +83,18 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	return pte_dirty(pte);
 }
 
+static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
+		pte_t pte, int max_nr_ptes)
+{
+	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+
+	if (!folio_test_large(folio) || (max_nr_ptes == 1))
+		return 1;
+
+	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
+			       NULL, NULL, NULL);
+}
+
 static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -94,6 +106,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	int nr_ptes;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -108,8 +121,10 @@ static long change_pte_range(struct mmu_gather *tlb,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 	do {
+		nr_ptes = 1;
 		oldpte = ptep_get(pte);
 		if (pte_present(oldpte)) {
+			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
 			pte_t ptent;
 
 			/*
@@ -126,15 +141,18 @@ static long change_pte_range(struct mmu_gather *tlb,
 					continue;
 
 				folio = vm_normal_folio(vma, addr, oldpte);
-				if (!folio || folio_is_zone_device(folio) ||
-				    folio_test_ksm(folio))
+				if (!folio)
 					continue;
 
+				if (folio_is_zone_device(folio) ||
+				    folio_test_ksm(folio))
+					goto skip_batch;
+
 				/* Also skip shared copy-on-write pages */
 				if (is_cow_mapping(vma->vm_flags) &&
 				    (folio_maybe_dma_pinned(folio) ||
 				     folio_maybe_mapped_shared(folio)))
-					continue;
+					goto skip_batch;
 
 				/*
 				 * While migration can move some dirty pages,
@@ -143,7 +161,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 				 */
 				if (folio_is_file_lru(folio) &&
 				    folio_test_dirty(folio))
-					continue;
+					goto skip_batch;
 
 				/*
 				 * Don't mess with PTEs if page is already on the node
@@ -151,7 +169,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 				 */
 				nid = folio_nid(folio);
 				if (target_node == nid)
-					continue;
+					goto skip_batch;
 				toptier = node_is_toptier(nid);
 
 				/*
@@ -159,8 +177,12 @@ static long change_pte_range(struct mmu_gather *tlb,
 				 * balancing is disabled
 				 */
 				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
-				    toptier)
+				    toptier) {
+skip_batch:
+					nr_ptes = mprotect_batch(folio, addr, pte,
+								 oldpte, max_nr_ptes);
 					continue;
+				}
 				if (folio_use_access_time(folio))
 					folio_xchg_access_time(folio,
 						jiffies_to_msecs(jiffies));
@@ -280,7 +302,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 				pages++;
 			}
 		}
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs
  2025-05-19  7:48 ` [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs Dev Jain
@ 2025-05-21  8:43   ` Ryan Roberts
  2025-05-21 11:58   ` Ryan Roberts
  2025-05-21 12:06   ` David Hildenbrand
  2 siblings, 0 replies; 34+ messages in thread
From: Ryan Roberts @ 2025-05-21  8:43 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 19/05/2025 08:48, Dev Jain wrote:
> In case of prot_numa, there are various cases in which we can skip to the
> next iteration. Since the skip condition is based on the folio and not
> the PTEs, we can skip a PTE batch.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>

LGTM; a lot less churn than before.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  mm/mprotect.c | 36 +++++++++++++++++++++++++++++-------
>  1 file changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88608d0dc2c2..1ee160ed0b14 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,6 +83,18 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return pte_dirty(pte);
>  }
>  
> +static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
> +		pte_t pte, int max_nr_ptes)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +
> +	if (!folio_test_large(folio) || (max_nr_ptes == 1))
> +		return 1;
> +
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
> +			       NULL, NULL, NULL);
> +}
> +
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -94,6 +106,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	int nr_ptes;
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -108,8 +121,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>  	do {
> +		nr_ptes = 1;
>  		oldpte = ptep_get(pte);
>  		if (pte_present(oldpte)) {
> +			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>  			pte_t ptent;
>  
>  			/*
> @@ -126,15 +141,18 @@ static long change_pte_range(struct mmu_gather *tlb,
>  					continue;
>  
>  				folio = vm_normal_folio(vma, addr, oldpte);
> -				if (!folio || folio_is_zone_device(folio) ||
> -				    folio_test_ksm(folio))
> +				if (!folio)
>  					continue;
>  
> +				if (folio_is_zone_device(folio) ||
> +				    folio_test_ksm(folio))
> +					goto skip_batch;
> +
>  				/* Also skip shared copy-on-write pages */
>  				if (is_cow_mapping(vma->vm_flags) &&
>  				    (folio_maybe_dma_pinned(folio) ||
>  				     folio_maybe_mapped_shared(folio)))
> -					continue;
> +					goto skip_batch;
>  
>  				/*
>  				 * While migration can move some dirty pages,
> @@ -143,7 +161,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				 */
>  				if (folio_is_file_lru(folio) &&
>  				    folio_test_dirty(folio))
> -					continue;
> +					goto skip_batch;
>  
>  				/*
>  				 * Don't mess with PTEs if page is already on the node
> @@ -151,7 +169,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				 */
>  				nid = folio_nid(folio);
>  				if (target_node == nid)
> -					continue;
> +					goto skip_batch;
>  				toptier = node_is_toptier(nid);
>  
>  				/*
> @@ -159,8 +177,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				 * balancing is disabled
>  				 */
>  				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> -				    toptier)
> +				    toptier) {
> +skip_batch:
> +					nr_ptes = mprotect_batch(folio, addr, pte,
> +								 oldpte, max_nr_ptes);
>  					continue;
> +				}
>  				if (folio_use_access_time(folio))
>  					folio_xchg_access_time(folio,
>  						jiffies_to_msecs(jiffies));
> @@ -280,7 +302,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				pages++;
>  			}
>  		}
> -	} while (pte++, addr += PAGE_SIZE, addr != end);
> +	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);
>  



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs
  2025-05-19  7:48 ` [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs Dev Jain
  2025-05-21  8:43   ` Ryan Roberts
@ 2025-05-21 11:58   ` Ryan Roberts
  2025-05-22  5:45     ` Dev Jain
  2025-05-21 12:06   ` David Hildenbrand
  2 siblings, 1 reply; 34+ messages in thread
From: Ryan Roberts @ 2025-05-21 11:58 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 19/05/2025 08:48, Dev Jain wrote:
> In case of prot_numa, there are various cases in which we can skip to the
> next iteration. Since the skip condition is based on the folio and not
> the PTEs, we can skip a PTE batch.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mprotect.c | 36 +++++++++++++++++++++++++++++-------
>  1 file changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88608d0dc2c2..1ee160ed0b14 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,6 +83,18 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return pte_dirty(pte);
>  }
>  
> +static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,

Perhaps it should be called mprotect_folio_pte_batch() to match the existing
madvise_folio_pte_batch()?

> +		pte_t pte, int max_nr_ptes)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +
> +	if (!folio_test_large(folio) || (max_nr_ptes == 1))
> +		return 1;
> +
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
> +			       NULL, NULL, NULL);
> +}
> +
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -94,6 +106,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	int nr_ptes;
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -108,8 +121,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>  	do {
> +		nr_ptes = 1;
>  		oldpte = ptep_get(pte);
>  		if (pte_present(oldpte)) {
> +			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>  			pte_t ptent;
>  
>  			/*
> @@ -126,15 +141,18 @@ static long change_pte_range(struct mmu_gather *tlb,
>  					continue;
>  
>  				folio = vm_normal_folio(vma, addr, oldpte);
> -				if (!folio || folio_is_zone_device(folio) ||
> -				    folio_test_ksm(folio))
> +				if (!folio)
>  					continue;

You modify mprotect_batch() to handle folio == NULL later, perhaps just add that
here, then you don't need to unpick this conditional and can just goto
skip_branch, even for the !folio case.

Thanks,
Ryan

>  
> +				if (folio_is_zone_device(folio) ||
> +				    folio_test_ksm(folio))
> +					goto skip_batch;
> +
>  				/* Also skip shared copy-on-write pages */
>  				if (is_cow_mapping(vma->vm_flags) &&
>  				    (folio_maybe_dma_pinned(folio) ||
>  				     folio_maybe_mapped_shared(folio)))
> -					continue;
> +					goto skip_batch;
>  
>  				/*
>  				 * While migration can move some dirty pages,
> @@ -143,7 +161,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				 */
>  				if (folio_is_file_lru(folio) &&
>  				    folio_test_dirty(folio))
> -					continue;
> +					goto skip_batch;
>  
>  				/*
>  				 * Don't mess with PTEs if page is already on the node
> @@ -151,7 +169,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				 */
>  				nid = folio_nid(folio);
>  				if (target_node == nid)
> -					continue;
> +					goto skip_batch;
>  				toptier = node_is_toptier(nid);
>  
>  				/*
> @@ -159,8 +177,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				 * balancing is disabled
>  				 */
>  				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> -				    toptier)
> +				    toptier) {
> +skip_batch:
> +					nr_ptes = mprotect_batch(folio, addr, pte,
> +								 oldpte, max_nr_ptes);
>  					continue;
> +				}
>  				if (folio_use_access_time(folio))
>  					folio_xchg_access_time(folio,
>  						jiffies_to_msecs(jiffies));
> @@ -280,7 +302,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				pages++;
>  			}
>  		}
> -	} while (pte++, addr += PAGE_SIZE, addr != end);
> +	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);
>  



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs
  2025-05-21 11:58   ` Ryan Roberts
@ 2025-05-22  5:45     ` Dev Jain
  0 siblings, 0 replies; 34+ messages in thread
From: Dev Jain @ 2025-05-22  5:45 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 21/05/25 5:28 pm, Ryan Roberts wrote:
> On 19/05/2025 08:48, Dev Jain wrote:
>> In case of prot_numa, there are various cases in which we can skip to the
>> next iteration. Since the skip condition is based on the folio and not
>> the PTEs, we can skip a PTE batch.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mprotect.c | 36 +++++++++++++++++++++++++++++-------
>>   1 file changed, 29 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 88608d0dc2c2..1ee160ed0b14 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -83,6 +83,18 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   	return pte_dirty(pte);
>>   }
>>   
>> +static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
> Perhaps it should be called mprotect_folio_pte_batch() to match the existing
> madvise_folio_pte_batch()?


Thanks, this is better.


>
>> +		pte_t pte, int max_nr_ptes)
>> +{
>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +
>> +	if (!folio_test_large(folio) || (max_nr_ptes == 1))
>> +		return 1;
>> +
>> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>> +			       NULL, NULL, NULL);
>> +}
>> +
>>   static long change_pte_range(struct mmu_gather *tlb,
>>   		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>   		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -94,6 +106,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>   	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>   	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> +	int nr_ptes;
>>   
>>   	tlb_change_page_size(tlb, PAGE_SIZE);
>>   	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> @@ -108,8 +121,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   	flush_tlb_batched_pending(vma->vm_mm);
>>   	arch_enter_lazy_mmu_mode();
>>   	do {
>> +		nr_ptes = 1;
>>   		oldpte = ptep_get(pte);
>>   		if (pte_present(oldpte)) {
>> +			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>   			pte_t ptent;
>>   
>>   			/*
>> @@ -126,15 +141,18 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   					continue;
>>   
>>   				folio = vm_normal_folio(vma, addr, oldpte);
>> -				if (!folio || folio_is_zone_device(folio) ||
>> -				    folio_test_ksm(folio))
>> +				if (!folio)
>>   					continue;
> You modify mprotect_batch() to handle folio == NULL later, perhaps just add that
> here, then you don't need to unpick this conditional and can just goto
> skip_branch, even for the !folio case.


I'll check this.


>
> Thanks,
> Ryan
>
>>   
>> +				if (folio_is_zone_device(folio) ||
>> +				    folio_test_ksm(folio))
>> +					goto skip_batch;
>> +
>>   				/* Also skip shared copy-on-write pages */
>>   				if (is_cow_mapping(vma->vm_flags) &&
>>   				    (folio_maybe_dma_pinned(folio) ||
>>   				     folio_maybe_mapped_shared(folio)))
>> -					continue;
>> +					goto skip_batch;
>>   
>>   				/*
>>   				 * While migration can move some dirty pages,
>> @@ -143,7 +161,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   				 */
>>   				if (folio_is_file_lru(folio) &&
>>   				    folio_test_dirty(folio))
>> -					continue;
>> +					goto skip_batch;
>>   
>>   				/*
>>   				 * Don't mess with PTEs if page is already on the node
>> @@ -151,7 +169,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   				 */
>>   				nid = folio_nid(folio);
>>   				if (target_node == nid)
>> -					continue;
>> +					goto skip_batch;
>>   				toptier = node_is_toptier(nid);
>>   
>>   				/*
>> @@ -159,8 +177,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   				 * balancing is disabled
>>   				 */
>>   				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>> -				    toptier)
>> +				    toptier) {
>> +skip_batch:
>> +					nr_ptes = mprotect_batch(folio, addr, pte,
>> +								 oldpte, max_nr_ptes);
>>   					continue;
>> +				}
>>   				if (folio_use_access_time(folio))
>>   					folio_xchg_access_time(folio,
>>   						jiffies_to_msecs(jiffies));
>> @@ -280,7 +302,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   				pages++;
>>   			}
>>   		}
>> -	} while (pte++, addr += PAGE_SIZE, addr != end);
>> +	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>>   	arch_leave_lazy_mmu_mode();
>>   	pte_unmap_unlock(pte - 1, ptl);
>>   


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs
  2025-05-19  7:48 ` [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs Dev Jain
  2025-05-21  8:43   ` Ryan Roberts
  2025-05-21 11:58   ` Ryan Roberts
@ 2025-05-21 12:06   ` David Hildenbrand
  2025-05-22  5:43     ` Dev Jain
  2 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-05-21 12:06 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
	will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 19.05.25 09:48, Dev Jain wrote:

Please highlight in the subject that this is only about MM_CP_PROT_NUMA 
handling.

> In case of prot_numa, there are various cases in which we can skip to the
> next iteration. Since the skip condition is based on the folio and not
> the PTEs, we can skip a PTE batch.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>   mm/mprotect.c | 36 +++++++++++++++++++++++++++++-------
>   1 file changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88608d0dc2c2..1ee160ed0b14 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,6 +83,18 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>   	return pte_dirty(pte);
>   }
>   
> +static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
> +		pte_t pte, int max_nr_ptes)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +
> +	if (!folio_test_large(folio) || (max_nr_ptes == 1))
> +		return 1;
> +
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
> +			       NULL, NULL, NULL);
> +}
> +
>   static long change_pte_range(struct mmu_gather *tlb,
>   		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>   		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -94,6 +106,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>   	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>   	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>   	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	int nr_ptes;
>   
>   	tlb_change_page_size(tlb, PAGE_SIZE);
>   	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -108,8 +121,10 @@ static long change_pte_range(struct mmu_gather *tlb,
>   	flush_tlb_batched_pending(vma->vm_mm);
>   	arch_enter_lazy_mmu_mode();
>   	do {
> +		nr_ptes = 1;
>   		oldpte = ptep_get(pte);
>   		if (pte_present(oldpte)) {
> +			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>   			pte_t ptent;
>   
>   			/*
> @@ -126,15 +141,18 @@ static long change_pte_range(struct mmu_gather *tlb,
>   					continue;
>   
>   				folio = vm_normal_folio(vma, addr, oldpte);
> -				if (!folio || folio_is_zone_device(folio) ||
> -				    folio_test_ksm(folio))
> +				if (!folio)
>   					continue;
>   
> +				if (folio_is_zone_device(folio) ||
> +				    folio_test_ksm(folio))
> +					goto skip_batch;
> +
>   				/* Also skip shared copy-on-write pages */
>   				if (is_cow_mapping(vma->vm_flags) &&
>   				    (folio_maybe_dma_pinned(folio) ||
>   				     folio_maybe_mapped_shared(folio)))
> -					continue;
> +					goto skip_batch;
>   
>   				/*
>   				 * While migration can move some dirty pages,
> @@ -143,7 +161,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>   				 */
>   				if (folio_is_file_lru(folio) &&
>   				    folio_test_dirty(folio))
> -					continue;
> +					goto skip_batch;
>   
>   				/*
>   				 * Don't mess with PTEs if page is already on the node
> @@ -151,7 +169,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>   				 */
>   				nid = folio_nid(folio);
>   				if (target_node == nid)
> -					continue;
> +					goto skip_batch;
>   				toptier = node_is_toptier(nid);
>   
>   				/*
> @@ -159,8 +177,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>   				 * balancing is disabled
>   				 */
>   				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> -				    toptier)
> +				    toptier) {
> +skip_batch:
> +					nr_ptes = mprotect_batch(folio, addr, pte,
> +								 oldpte, max_nr_ptes);
>   					continue;
> +				}


I suggest

a) not burying that skip_batch label in another if condition

b) looking into factoring out prot_numa handling into a separate 
function first.


Likely we want something like

if (prot_numa) {
	nr_ptes = prot_numa_pte_range_skip_ptes(vma, addr, oldpte);
	if (nr_ptes)
		continue;
}

... likely with a better function name,


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs
  2025-05-21 12:06   ` David Hildenbrand
@ 2025-05-22  5:43     ` Dev Jain
  2025-05-22  7:13       ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Dev Jain @ 2025-05-22  5:43 UTC (permalink / raw)
  To: David Hildenbrand, akpm
  Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
	will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 21/05/25 5:36 pm, David Hildenbrand wrote:
> On 19.05.25 09:48, Dev Jain wrote:
>
> Please highlight in the subject that this is only about 
> MM_CP_PROT_NUMA handling.


Sure.


>
>> In case of prot_numa, there are various cases in which we can skip to 
>> the
>> next iteration. Since the skip condition is based on the folio and not
>> the PTEs, we can skip a PTE batch.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mprotect.c | 36 +++++++++++++++++++++++++++++-------
>>   1 file changed, 29 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 88608d0dc2c2..1ee160ed0b14 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -83,6 +83,18 @@ bool can_change_pte_writable(struct vm_area_struct 
>> *vma, unsigned long addr,
>>       return pte_dirty(pte);
>>   }
>>   +static int mprotect_batch(struct folio *folio, unsigned long addr, 
>> pte_t *ptep,
>> +        pte_t pte, int max_nr_ptes)
>> +{
>> +    const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +
>> +    if (!folio_test_large(folio) || (max_nr_ptes == 1))
>> +        return 1;
>> +
>> +    return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>> +                   NULL, NULL, NULL);
>> +}
>> +
>>   static long change_pte_range(struct mmu_gather *tlb,
>>           struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>           unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -94,6 +106,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>       bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>       bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>       bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> +    int nr_ptes;
>>         tlb_change_page_size(tlb, PAGE_SIZE);
>>       pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> @@ -108,8 +121,10 @@ static long change_pte_range(struct mmu_gather 
>> *tlb,
>>       flush_tlb_batched_pending(vma->vm_mm);
>>       arch_enter_lazy_mmu_mode();
>>       do {
>> +        nr_ptes = 1;
>>           oldpte = ptep_get(pte);
>>           if (pte_present(oldpte)) {
>> +            int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>               pte_t ptent;
>>                 /*
>> @@ -126,15 +141,18 @@ static long change_pte_range(struct mmu_gather 
>> *tlb,
>>                       continue;
>>                     folio = vm_normal_folio(vma, addr, oldpte);
>> -                if (!folio || folio_is_zone_device(folio) ||
>> -                    folio_test_ksm(folio))
>> +                if (!folio)
>>                       continue;
>>   +                if (folio_is_zone_device(folio) ||
>> +                    folio_test_ksm(folio))
>> +                    goto skip_batch;
>> +
>>                   /* Also skip shared copy-on-write pages */
>>                   if (is_cow_mapping(vma->vm_flags) &&
>>                       (folio_maybe_dma_pinned(folio) ||
>>                        folio_maybe_mapped_shared(folio)))
>> -                    continue;
>> +                    goto skip_batch;
>>                     /*
>>                    * While migration can move some dirty pages,
>> @@ -143,7 +161,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                    */
>>                   if (folio_is_file_lru(folio) &&
>>                       folio_test_dirty(folio))
>> -                    continue;
>> +                    goto skip_batch;
>>                     /*
>>                    * Don't mess with PTEs if page is already on the node
>> @@ -151,7 +169,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                    */
>>                   nid = folio_nid(folio);
>>                   if (target_node == nid)
>> -                    continue;
>> +                    goto skip_batch;
>>                   toptier = node_is_toptier(nid);
>>                     /*
>> @@ -159,8 +177,12 @@ static long change_pte_range(struct mmu_gather 
>> *tlb,
>>                    * balancing is disabled
>>                    */
>>                   if (!(sysctl_numa_balancing_mode & 
>> NUMA_BALANCING_NORMAL) &&
>> -                    toptier)
>> +                    toptier) {
>> +skip_batch:
>> +                    nr_ptes = mprotect_batch(folio, addr, pte,
>> +                                 oldpte, max_nr_ptes);
>>                       continue;
>> +                }
>
>
> I suggest
>
> a) not burying that skip_batch label in another if condition
>
> b) looking into factoring out prot_numa handling into a separate 
> function first.
>
>
> Likely we want something like
>
> if (prot_numa) {
>     nr_ptes = prot_numa_pte_range_skip_ptes(vma, addr, oldpte);
>     if (nr_ptes)
>         continue;
> }
>
> ... likely with a better function name,


I want to be able to reuse the folio from vm_normal_folio(), and we also 
need

nr_ptes to know how much to skip, so if there is no objection in passing 
int *nr_ptes,

or struct folio **foliop to this new function, then I'll carry on with 
your suggestion :)


>
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs
  2025-05-22  5:43     ` Dev Jain
@ 2025-05-22  7:13       ` David Hildenbrand
  2025-05-22  7:47         ` Dev Jain
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-05-22  7:13 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
	will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


>> ... likely with a better function name,
> 
> 
> I want to be able to reuse the folio from vm_normal_folio(), and we also
> need
> 
> nr_ptes to know how much to skip, so if there is no objection in passing
> int *nr_ptes,
> 
> or struct folio **foliop to this new function, then I'll carry on with
> your suggestion :)

Can you quickly prototype what you have in mind and paste it here? Will 
make it easier :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs
  2025-05-22  7:13       ` David Hildenbrand
@ 2025-05-22  7:47         ` Dev Jain
  2025-05-22 16:18           ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Dev Jain @ 2025-05-22  7:47 UTC (permalink / raw)
  To: David Hildenbrand, akpm
  Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
	will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 22/05/25 12:43 pm, David Hildenbrand wrote:
>
>>> ... likely with a better function name,
>>
>>
>> I want to be able to reuse the folio from vm_normal_folio(), and we also
>> need
>>
>> nr_ptes to know how much to skip, so if there is no objection in passing
>> int *nr_ptes,
>>
>> or struct folio **foliop to this new function, then I'll carry on with
>> your suggestion :)
>
> Can you quickly prototype what you have in mind and paste it here? 
> Will make it easier :)


if (prot_numa)

     func(vma, addr, oldpte, &nr);


struct folio *folio func(vma, addr, oldpte, int *nr)

{

     if (pte_protnone(oldpte))

         *nr = 1, return NULL;

     folio = vm_normal_folio();

     if (!folio)

         *nr =1, return NULL;

     if the other skipping conditions happen, goto skip_batch, return 
folio and set *nr = batch


     if (sysctl.....) {

skip_batch:

     *nr = mprotect_batch();

}

     if (folio_use_access_time)....

     return folio;

}



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs
  2025-05-22  7:47         ` Dev Jain
@ 2025-05-22 16:18           ` David Hildenbrand
  2025-06-04 10:38             ` Dev Jain
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-05-22 16:18 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
	will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 22.05.25 09:47, Dev Jain wrote:
> 
> On 22/05/25 12:43 pm, David Hildenbrand wrote:
>>
>>>> ... likely with a better function name,
>>>
>>>
>>> I want to be able to reuse the folio from vm_normal_folio(), and we also
>>> need
>>>
>>> nr_ptes to know how much to skip, so if there is no objection in passing
>>> int *nr_ptes,
>>>
>>> or struct folio **foliop to this new function, then I'll carry on with
>>> your suggestion :)
>>
>> Can you quickly prototype what you have in mind and paste it here?
>> Will make it easier :)
> 
> 
> if (prot_numa)
> 
>       func(vma, addr, oldpte, &nr);

I'd probably return "nr_ptes" and return the folio using a &folio instead.

That way, you can easily extend the function to return the folio in the 
patch where you really need it (not this patch IIUR :) )

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs
  2025-05-22 16:18           ` David Hildenbrand
@ 2025-06-04 10:38             ` Dev Jain
  2025-06-04 11:44               ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Dev Jain @ 2025-06-04 10:38 UTC (permalink / raw)
  To: David Hildenbrand, akpm
  Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
	will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 22/05/25 9:48 pm, David Hildenbrand wrote:
> On 22.05.25 09:47, Dev Jain wrote:
>>
>> On 22/05/25 12:43 pm, David Hildenbrand wrote:
>>>
>>>>> ... likely with a better function name,
>>>>
>>>>
>>>> I want to be able to reuse the folio from vm_normal_folio(), and we 
>>>> also
>>>> need
>>>>
>>>> nr_ptes to know how much to skip, so if there is no objection in 
>>>> passing
>>>> int *nr_ptes,
>>>>
>>>> or struct folio **foliop to this new function, then I'll carry on with
>>>> your suggestion :)
>>>
>>> Can you quickly prototype what you have in mind and paste it here?
>>> Will make it easier :)
>>
>>
>> if (prot_numa)
>>
>>       func(vma, addr, oldpte, &nr);
>
> I'd probably return "nr_ptes" and return the folio using a &folio 
> instead.
>
> That way, you can easily extend the function to return the folio in 
> the patch where you really need it (not this patch IIUR :) )

Just confirming, you mean to return nr_ptes and get the folio by passing 
&folio, and the function parameter will be struct folio **foliop?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs
  2025-06-04 10:38             ` Dev Jain
@ 2025-06-04 11:44               ` David Hildenbrand
  0 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2025-06-04 11:44 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
	will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 04.06.25 12:38, Dev Jain wrote:
> 
> On 22/05/25 9:48 pm, David Hildenbrand wrote:
>> On 22.05.25 09:47, Dev Jain wrote:
>>>
>>> On 22/05/25 12:43 pm, David Hildenbrand wrote:
>>>>
>>>>>> ... likely with a better function name,
>>>>>
>>>>>
>>>>> I want to be able to reuse the folio from vm_normal_folio(), and we
>>>>> also
>>>>> need
>>>>>
>>>>> nr_ptes to know how much to skip, so if there is no objection in
>>>>> passing
>>>>> int *nr_ptes,
>>>>>
>>>>> or struct folio **foliop to this new function, then I'll carry on with
>>>>> your suggestion :)
>>>>
>>>> Can you quickly prototype what you have in mind and paste it here?
>>>> Will make it easier :)
>>>
>>>
>>> if (prot_numa)
>>>
>>>        func(vma, addr, oldpte, &nr);
>>
>> I'd probably return "nr_ptes" and return the folio using a &folio
>> instead.
>>
>> That way, you can easily extend the function to return the folio in
>> the patch where you really need it (not this patch IIUR :) )
> 
> Just confirming, you mean to return nr_ptes and get the folio by passing
> &folio, and the function parameter will be struct folio **foliop?

Yes.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-05-19  7:48 [PATCH v3 0/5] Optimize mprotect() for large folios Dev Jain
  2025-05-19  7:48 ` [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs Dev Jain
@ 2025-05-19  7:48 ` Dev Jain
  2025-05-21 11:16   ` Ryan Roberts
  2025-05-19  7:48 ` [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching Dev Jain
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 34+ messages in thread
From: Dev Jain @ 2025-05-19  7:48 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
Architecture can override these helpers; in case not, they are implemented
as a simple loop over the corresponding single pte helpers.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/pgtable.h | 75 +++++++++++++++++++++++++++++++++++++++++
 mm/mprotect.c           |  4 +--
 2 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c92..e40ed57e034d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1333,6 +1333,81 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 	__ptep_modify_prot_commit(vma, addr, ptep, pte);
 }
 #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
+
+/**
+ * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
+ * over a batch of ptes, which protects against asynchronous hardware modifications
+ * to the ptes. The intention is not to prevent the hardware from making pte
+ * updates, but to prevent any updates it may make from being lost.
+ * Please see the comment above ptep_modify_prot_start() for full description.
+ *
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_modify_prot_start(), collecting the a/d bits of the mapped
+ * folio.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+#ifndef modify_prot_start_ptes
+static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, unsigned int nr)
+{
+	pte_t pte, tmp_pte;
+
+	pte = ptep_modify_prot_start(vma, addr, ptep);
+	while (--nr) {
+		ptep++;
+		addr += PAGE_SIZE;
+		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
+		if (pte_dirty(tmp_pte))
+			pte = pte_mkdirty(pte);
+		if (pte_young(tmp_pte))
+			pte = pte_mkyoung(pte);
+	}
+	return pte;
+}
+#endif
+
+/**
+ * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
+ * hardware-controlled bits in the PTE unmodified.
+ *
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_modify_prot_commit().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+#ifndef modify_prot_commit_ptes
+static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; ++i) {
+		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
+		ptep++;
+		addr += PAGE_SIZE;
+		old_pte = pte_next_pfn(old_pte);
+		pte = pte_next_pfn(pte);
+	}
+}
+#endif
+
 #endif /* CONFIG_MMU */
 
 /*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1ee160ed0b14..124612ce3d24 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -188,7 +188,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 						jiffies_to_msecs(jiffies));
 			}
 
-			oldpte = ptep_modify_prot_start(vma, addr, pte);
+			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
 			ptent = pte_modify(oldpte, newprot);
 
 			if (uffd_wp)
@@ -214,7 +214,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 			    can_change_pte_writable(vma, addr, ptent))
 				ptent = pte_mkwrite(ptent, vma);
 
-			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
 			if (pte_needs_flush(oldpte, ptent))
 				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
 			pages++;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-05-19  7:48 ` [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
@ 2025-05-21 11:16   ` Ryan Roberts
  2025-05-21 11:45     ` Ryan Roberts
                       ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Ryan Roberts @ 2025-05-21 11:16 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 19/05/2025 08:48, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> Architecture can override these helpers; in case not, they are implemented
> as a simple loop over the corresponding single pte helpers.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pgtable.h | 75 +++++++++++++++++++++++++++++++++++++++++
>  mm/mprotect.c           |  4 +--
>  2 files changed, 77 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..e40ed57e034d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1333,6 +1333,81 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>  }
>  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> +
> +/**
> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
> + * over a batch of ptes, which protects against asynchronous hardware modifications

nit: This overflows the 80 char soft limit.

> + * to the ptes. The intention is not to prevent the hardware from making pte
> + * updates, but to prevent any updates it may make from being lost.
> + * Please see the comment above ptep_modify_prot_start() for full description.
> + *
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_modify_prot_start(), collecting the a/d bits of the mapped
> + * folio.

nit: "mapped folio" is a bit confusing given we are operating on ptes. Perhaps
"collecting the a/d bits from each pte in the batch" is clearer.

> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ.

nit: Perhaps "batch" would be more consistent than "range"?

> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, unsigned int nr)

I thought David H suggested modify_prot_ptes_start() and
modify_prot_ptes_commit(), which we settled on? I'm personally fine with either
though.

> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> +	while (--nr) {

I thought we agreed to make the loop logic a bit more standard. I don't recall
exactly what was finally agreed, but I would think something like this would be
better:

	for (i = 1; i < nr; i++) {

> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
> +	}
> +	return pte;
> +}
> +#endif
> +
> +/**
> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
> + * hardware-controlled bits in the PTE unmodified.
> + *
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.

You've missed pte and old_pte params here.

> + * @nr: Number of entries.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_modify_prot_commit().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ.

How can it? All the applied bits other than the PFN will be exactly the same for
the range because they all come from pte. I think this line can be dropped.

> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.

The middle sentance doesn't apply; the PTEs will all initially be none if using
the default version of modify_prot_start_ptes(). I think that can be dropped.
But I think you need to explain that this will be the case on exit.

> + */
> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr; ++i) {
> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		old_pte = pte_next_pfn(old_pte);
> +		pte = pte_next_pfn(pte);
> +	}
> +}
> +#endif

I have some general concerns about the correctness of batching these functions.
The support was originally added by Commit 1ea0704e0da6 ("mm: add a
ptep_modify_prot transaction abstraction"), and the intent was to make it easier
to defer the pte updates for XEN on x86.

Your default implementations of the batched versions will match the number of
ptep_modify_prot_start() calls with the same number of ptep_modify_prot_commit()
calls, even if modify_prot_commit_ptes() is called incrementally for sub-batches
of the batch used for modify_prot_start_ptes(). That's a requirement and you've
met it. But in the batched case, there are 2 differences;

  - You can now have multiple PTEs within a start-commit block at one time. I
hope none of the specialized implementations care about that (i.e. XEN).

  - when calling ptep_modify_prot_commit(), old_pte may not be exactly what
ptep_modify_prot_start() returned for that pte. You have collected the A/D bits,
and according to your docs "PTE bits in the PTE range besides the PFN can
differ" when calling modify_prot_start_ptes() so R/W and other things could
differ here.

I'm not sure if these are problems in practice; they probably are not. But have
you checked the XEN implementation (and any other specialized implementations)
are definitely compatible with your batched semantics?

Thanks,
Ryan

> +
>  #endif /* CONFIG_MMU */
>  
>  /*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 1ee160ed0b14..124612ce3d24 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -188,7 +188,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  						jiffies_to_msecs(jiffies));
>  			}
>  
> -			oldpte = ptep_modify_prot_start(vma, addr, pte);
> +			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>  			ptent = pte_modify(oldpte, newprot);
>  
>  			if (uffd_wp)
> @@ -214,7 +214,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			    can_change_pte_writable(vma, addr, ptent))
>  				ptent = pte_mkwrite(ptent, vma);
>  
> -			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> +			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>  			if (pte_needs_flush(oldpte, ptent))
>  				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>  			pages++;



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-05-21 11:16   ` Ryan Roberts
@ 2025-05-21 11:45     ` Ryan Roberts
  2025-05-22  6:33       ` Dev Jain
  2025-05-22  6:39     ` Dev Jain
  2025-06-16  6:37     ` Dev Jain
  2 siblings, 1 reply; 34+ messages in thread
From: Ryan Roberts @ 2025-05-21 11:45 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 21/05/2025 12:16, Ryan Roberts wrote:
> On 19/05/2025 08:48, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architecture can override these helpers; in case not, they are implemented
>> as a simple loop over the corresponding single pte helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>

[...]

> 
> I have some general concerns about the correctness of batching these functions.
> The support was originally added by Commit 1ea0704e0da6 ("mm: add a
> ptep_modify_prot transaction abstraction"), and the intent was to make it easier
> to defer the pte updates for XEN on x86.
> 
> Your default implementations of the batched versions will match the number of
> ptep_modify_prot_start() calls with the same number of ptep_modify_prot_commit()
> calls, even if modify_prot_commit_ptes() is called incrementally for sub-batches
> of the batch used for modify_prot_start_ptes(). That's a requirement and you've
> met it. But in the batched case, there are 2 differences;
> 
>   - You can now have multiple PTEs within a start-commit block at one time. I
> hope none of the specialized implementations care about that (i.e. XEN).

I had a look; this isn't a problem.

> 
>   - when calling ptep_modify_prot_commit(), old_pte may not be exactly what
> ptep_modify_prot_start() returned for that pte. You have collected the A/D bits,
> and according to your docs "PTE bits in the PTE range besides the PFN can
> differ" when calling modify_prot_start_ptes() so R/W and other things could
> differ here.

It looks like powerpc will break if you provide old_pte which has different
permissions to the "real" old_pte, see radix__ptep_modify_prot_commit(). So I
think you need to at least spec modify_prot_start_ptes() to require that all
bits of the PTE except the PFN, access and dirty are identical. And perhaps you
can VM_WARN if found to be otherwise? And perhaps modify
ptep_modify_prot_commit()'s documentation to explcitly allow old_pte's
access/dirty to be "upgraded" from what was actually read in
ptep_modify_prot_start()?

XEN/x86 and arm64 don't care about old_pte.

Thanks,
Ryan

> 
> I'm not sure if these are problems in practice; they probably are not. But have
> you checked the XEN implementation (and any other specialized implementations)
> are definitely compatible with your batched semantics?
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-05-21 11:45     ` Ryan Roberts
@ 2025-05-22  6:33       ` Dev Jain
  2025-05-22  7:51         ` Ryan Roberts
  0 siblings, 1 reply; 34+ messages in thread
From: Dev Jain @ 2025-05-22  6:33 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 21/05/25 5:15 pm, Ryan Roberts wrote:
> On 21/05/2025 12:16, Ryan Roberts wrote:
>> On 19/05/2025 08:48, Dev Jain wrote:
>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>> Architecture can override these helpers; in case not, they are implemented
>>> as a simple loop over the corresponding single pte helpers.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> [...]
>
>> I have some general concerns about the correctness of batching these functions.
>> The support was originally added by Commit 1ea0704e0da6 ("mm: add a
>> ptep_modify_prot transaction abstraction"), and the intent was to make it easier
>> to defer the pte updates for XEN on x86.
>>
>> Your default implementations of the batched versions will match the number of
>> ptep_modify_prot_start() calls with the same number of ptep_modify_prot_commit()
>> calls, even if modify_prot_commit_ptes() is called incrementally for sub-batches
>> of the batch used for modify_prot_start_ptes(). That's a requirement and you've
>> met it. But in the batched case, there are 2 differences;
>>
>>    - You can now have multiple PTEs within a start-commit block at one time. I
>> hope none of the specialized implementations care about that (i.e. XEN).
> I had a look; this isn't a problem.
>
>>    - when calling ptep_modify_prot_commit(), old_pte may not be exactly what
>> ptep_modify_prot_start() returned for that pte. You have collected the A/D bits,
>> and according to your docs "PTE bits in the PTE range besides the PFN can
>> differ" when calling modify_prot_start_ptes() so R/W and other things could
>> differ here.
> It looks like powerpc will break if you provide old_pte which has different
> permissions to the "real" old_pte, see radix__ptep_modify_prot_commit(). So I
> think you need to at least spec modify_prot_start_ptes() to require that all
> bits of the PTE except the PFN, access and dirty are identical. And perhaps you
> can VM_WARN if found to be otherwise? And perhaps modify
> ptep_modify_prot_commit()'s documentation to explcitly allow old_pte's
> access/dirty to be "upgraded" from what was actually read in
> ptep_modify_prot_start()?


Got it, so we just need to document that, the permissions for all ptes 
must be identical

when using modify_prot_start_ptes(). And that we may be smearing extra 
a/d bits in

modify_prot_commit_ptes().


>
> XEN/x86 and arm64 don't care about old_pte.
>
> Thanks,
> Ryan
>
>> I'm not sure if these are problems in practice; they probably are not. But have
>> you checked the XEN implementation (and any other specialized implementations)
>> are definitely compatible with your batched semantics?
>>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-05-22  6:33       ` Dev Jain
@ 2025-05-22  7:51         ` Ryan Roberts
  0 siblings, 0 replies; 34+ messages in thread
From: Ryan Roberts @ 2025-05-22  7:51 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 22/05/2025 07:33, Dev Jain wrote:
> 
> On 21/05/25 5:15 pm, Ryan Roberts wrote:
>> On 21/05/2025 12:16, Ryan Roberts wrote:
>>> On 19/05/2025 08:48, Dev Jain wrote:
>>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>>> Architecture can override these helpers; in case not, they are implemented
>>>> as a simple loop over the corresponding single pte helpers.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> [...]
>>
>>> I have some general concerns about the correctness of batching these functions.
>>> The support was originally added by Commit 1ea0704e0da6 ("mm: add a
>>> ptep_modify_prot transaction abstraction"), and the intent was to make it easier
>>> to defer the pte updates for XEN on x86.
>>>
>>> Your default implementations of the batched versions will match the number of
>>> ptep_modify_prot_start() calls with the same number of ptep_modify_prot_commit()
>>> calls, even if modify_prot_commit_ptes() is called incrementally for sub-batches
>>> of the batch used for modify_prot_start_ptes(). That's a requirement and you've
>>> met it. But in the batched case, there are 2 differences;
>>>
>>>    - You can now have multiple PTEs within a start-commit block at one time. I
>>> hope none of the specialized implementations care about that (i.e. XEN).
>> I had a look; this isn't a problem.
>>
>>>    - when calling ptep_modify_prot_commit(), old_pte may not be exactly what
>>> ptep_modify_prot_start() returned for that pte. You have collected the A/D bits,
>>> and according to your docs "PTE bits in the PTE range besides the PFN can
>>> differ" when calling modify_prot_start_ptes() so R/W and other things could
>>> differ here.
>> It looks like powerpc will break if you provide old_pte which has different
>> permissions to the "real" old_pte, see radix__ptep_modify_prot_commit(). So I
>> think you need to at least spec modify_prot_start_ptes() to require that all
>> bits of the PTE except the PFN, access and dirty are identical. And perhaps you
>> can VM_WARN if found to be otherwise? And perhaps modify
>> ptep_modify_prot_commit()'s documentation to explcitly allow old_pte's
>> access/dirty to be "upgraded" from what was actually read in
>> ptep_modify_prot_start()?
> 
> 
> Got it, so we just need to document that, the permissions for all ptes must be
> identical

Not just permissions; all bits (inc SW bits) except PFN and A/D.

> 
> when using modify_prot_start_ptes(). And that we may be smearing extra a/d bits in
> 
> modify_prot_commit_ptes().
> 
> 
>>
>> XEN/x86 and arm64 don't care about old_pte.
>>
>> Thanks,
>> Ryan
>>
>>> I'm not sure if these are problems in practice; they probably are not. But have
>>> you checked the XEN implementation (and any other specialized implementations)
>>> are definitely compatible with your batched semantics?
>>>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-05-21 11:16   ` Ryan Roberts
  2025-05-21 11:45     ` Ryan Roberts
@ 2025-05-22  6:39     ` Dev Jain
  2025-06-16  6:37     ` Dev Jain
  2 siblings, 0 replies; 34+ messages in thread
From: Dev Jain @ 2025-05-22  6:39 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 21/05/25 4:46 pm, Ryan Roberts wrote:
> On 19/05/2025 08:48, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architecture can override these helpers; in case not, they are implemented
>> as a simple loop over the corresponding single pte helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 75 +++++++++++++++++++++++++++++++++++++++++
>>   mm/mprotect.c           |  4 +--
>>   2 files changed, 77 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..e40ed57e034d 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1333,6 +1333,81 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>>   }
>>   #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>> +
>> +/**
>> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
>> + * over a batch of ptes, which protects against asynchronous hardware modifications
> nit: This overflows the 80 char soft limit.
>
>> + * to the ptes. The intention is not to prevent the hardware from making pte
>> + * updates, but to prevent any updates it may make from being lost.
>> + * Please see the comment above ptep_modify_prot_start() for full description.
>> + *
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_modify_prot_start(), collecting the a/d bits of the mapped
>> + * folio.
> nit: "mapped folio" is a bit confusing given we are operating on ptes. Perhaps
> "collecting the a/d bits from each pte in the batch" is clearer.


Sure.


>
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ.
> nit: Perhaps "batch" would be more consistent than "range"?


Sure.


>
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>> + */
>> +#ifndef modify_prot_start_ptes
>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> I thought David H suggested modify_prot_ptes_start() and
> modify_prot_ptes_commit(), which we settled on? I'm personally fine with either
> though.


No strong opinion, I'll do that.


>
>> +{
>> +	pte_t pte, tmp_pte;
>> +
>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
>> +	while (--nr) {
> I thought we agreed to make the loop logic a bit more standard. I don't recall
> exactly what was finally agreed, but I would think something like this would be
> better:
>
> 	for (i = 1; i < nr; i++) {


Sure.


>
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>> +		if (pte_dirty(tmp_pte))
>> +			pte = pte_mkdirty(pte);
>> +		if (pte_young(tmp_pte))
>> +			pte = pte_mkyoung(pte);
>> +	}
>> +	return pte;
>> +}
>> +#endif
>> +
>> +/**
>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
>> + * hardware-controlled bits in the PTE unmodified.
>> + *
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
> You've missed pte and old_pte params here.


My bad.


>
>> + * @nr: Number of entries.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_modify_prot_commit().
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ.
> How can it? All the applied bits other than the PFN will be exactly the same for
> the range because they all come from pte. I think this line can be dropped.


Copy pasted, then forgot to remove :)


>
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> The middle sentance doesn't apply; the PTEs will all initially be none if using
> the default version of modify_prot_start_ptes(). I think that can be dropped.
> But I think you need to explain that this will be the case on exit.


Ah got it. "On exit, the set ptes will map the same folio."


>
>> + */
>> +#ifndef modify_prot_commit_ptes
>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr; ++i) {
>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		old_pte = pte_next_pfn(old_pte);
>> +		pte = pte_next_pfn(pte);
>> +	}
>> +}
>> +#endif
> I have some general concerns about the correctness of batching these functions.
> The support was originally added by Commit 1ea0704e0da6 ("mm: add a
> ptep_modify_prot transaction abstraction"), and the intent was to make it easier
> to defer the pte updates for XEN on x86.
>
> Your default implementations of the batched versions will match the number of
> ptep_modify_prot_start() calls with the same number of ptep_modify_prot_commit()
> calls, even if modify_prot_commit_ptes() is called incrementally for sub-batches
> of the batch used for modify_prot_start_ptes(). That's a requirement and you've
> met it. But in the batched case, there are 2 differences;
>
>    - You can now have multiple PTEs within a start-commit block at one time. I
> hope none of the specialized implementations care about that (i.e. XEN).
>
>    - when calling ptep_modify_prot_commit(), old_pte may not be exactly what
> ptep_modify_prot_start() returned for that pte. You have collected the A/D bits,
> and according to your docs "PTE bits in the PTE range besides the PFN can
> differ" when calling modify_prot_start_ptes() so R/W and other things could
> differ here.
>
> I'm not sure if these are problems in practice; they probably are not. But have
> you checked the XEN implementation (and any other specialized implementations)
> are definitely compatible with your batched semantics?
>
> Thanks,
> Ryan
>
>> +
>>   #endif /* CONFIG_MMU */
>>   
>>   /*
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 1ee160ed0b14..124612ce3d24 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -188,7 +188,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   						jiffies_to_msecs(jiffies));
>>   			}
>>   
>> -			oldpte = ptep_modify_prot_start(vma, addr, pte);
>> +			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>   			ptent = pte_modify(oldpte, newprot);
>>   
>>   			if (uffd_wp)
>> @@ -214,7 +214,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			    can_change_pte_writable(vma, addr, ptent))
>>   				ptent = pte_mkwrite(ptent, vma);
>>   
>> -			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>> +			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>   			if (pte_needs_flush(oldpte, ptent))
>>   				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>   			pages++;


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-05-21 11:16   ` Ryan Roberts
  2025-05-21 11:45     ` Ryan Roberts
  2025-05-22  6:39     ` Dev Jain
@ 2025-06-16  6:37     ` Dev Jain
  2 siblings, 0 replies; 34+ messages in thread
From: Dev Jain @ 2025-06-16  6:37 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 21/05/25 4:46 pm, Ryan Roberts wrote:
> On 19/05/2025 08:48, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architecture can override these helpers; in case not, they are implemented
>> as a simple loop over the corresponding single pte helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 75 +++++++++++++++++++++++++++++++++++++++++
>>   mm/mprotect.c           |  4 +--
>>   2 files changed, 77 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..e40ed57e034d 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1333,6 +1333,81 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>>   }
>>   #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>> +
>> +/**
>> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
>> + * over a batch of ptes, which protects against asynchronous hardware modifications
> nit: This overflows the 80 char soft limit.
>
>> + * to the ptes. The intention is not to prevent the hardware from making pte
>> + * updates, but to prevent any updates it may make from being lost.
>> + * Please see the comment above ptep_modify_prot_start() for full description.
>> + *
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_modify_prot_start(), collecting the a/d bits of the mapped
>> + * folio.
> nit: "mapped folio" is a bit confusing given we are operating on ptes. Perhaps
> "collecting the a/d bits from each pte in the batch" is clearer.
>
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ.
> nit: Perhaps "batch" would be more consistent than "range"?
>
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>> + */
>> +#ifndef modify_prot_start_ptes
>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> I thought David H suggested modify_prot_ptes_start() and
> modify_prot_ptes_commit(), which we settled on? I'm personally fine with either
> though.

Nothing was decided upon strongly. I will personally keep it the same for sake
of consistency.

>> +{
>> +	pte_t pte, tmp_pte;
>> +
>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
>> +	while (--nr) {
> I thought we agreed to make the loop logic a bit more standard. I don't recall
> exactly what was finally agreed, but I would think something like this would be
> better:

Again, nothing was particularly decided on as far as I remember : ) Let us
keep it the same - https://lore.kernel.org/all/d048366b-eb6a-4fea-9b60-af834182b1b9@redhat.com/

> 	for (i = 1; i < nr; i++) {
>
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>> +		if (pte_dirty(tmp_pte))
>> +			pte = pte_mkdirty(pte);
>> +		if (pte_young(tmp_pte))
>> +			pte = pte_mkyoung(pte);
>> +	}
>> +	return pte;
>> +}
>> +#endif
>> +
>> +/**
>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
>> + * hardware-controlled bits in the PTE unmodified.
>> + *
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
> You've missed pte and old_pte params here.
>
>> + * @nr: Number of entries.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_modify_prot_commit().
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ.
> How can it? All the applied bits other than the PFN will be exactly the same for
> the range because they all come from pte. I think this line can be dropped.
>
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> The middle sentance doesn't apply; the PTEs will all initially be none if using
> the default version of modify_prot_start_ptes(). I think that can be dropped.
> But I think you need to explain that this will be the case on exit.
>
>> + */
>> +#ifndef modify_prot_commit_ptes
>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr; ++i) {
>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		old_pte = pte_next_pfn(old_pte);
>> +		pte = pte_next_pfn(pte);
>> +	}
>> +}
>> +#endif
> I have some general concerns about the correctness of batching these functions.
> The support was originally added by Commit 1ea0704e0da6 ("mm: add a
> ptep_modify_prot transaction abstraction"), and the intent was to make it easier
> to defer the pte updates for XEN on x86.
>
> Your default implementations of the batched versions will match the number of
> ptep_modify_prot_start() calls with the same number of ptep_modify_prot_commit()
> calls, even if modify_prot_commit_ptes() is called incrementally for sub-batches
> of the batch used for modify_prot_start_ptes(). That's a requirement and you've
> met it. But in the batched case, there are 2 differences;
>
>    - You can now have multiple PTEs within a start-commit block at one time. I
> hope none of the specialized implementations care about that (i.e. XEN).
>
>    - when calling ptep_modify_prot_commit(), old_pte may not be exactly what
> ptep_modify_prot_start() returned for that pte. You have collected the A/D bits,
> and according to your docs "PTE bits in the PTE range besides the PFN can
> differ" when calling modify_prot_start_ptes() so R/W and other things could
> differ here.
>
> I'm not sure if these are problems in practice; they probably are not. But have
> you checked the XEN implementation (and any other specialized implementations)
> are definitely compatible with your batched semantics?
>
> Thanks,
> Ryan
>
>> +
>>   #endif /* CONFIG_MMU */
>>   
>>   /*
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 1ee160ed0b14..124612ce3d24 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -188,7 +188,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   						jiffies_to_msecs(jiffies));
>>   			}
>>   
>> -			oldpte = ptep_modify_prot_start(vma, addr, pte);
>> +			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>   			ptent = pte_modify(oldpte, newprot);
>>   
>>   			if (uffd_wp)
>> @@ -214,7 +214,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			    can_change_pte_writable(vma, addr, ptent))
>>   				ptent = pte_mkwrite(ptent, vma);
>>   
>> -			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>> +			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>   			if (pte_needs_flush(oldpte, ptent))
>>   				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>   			pages++;


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching
  2025-05-19  7:48 [PATCH v3 0/5] Optimize mprotect() for large folios Dev Jain
  2025-05-19  7:48 ` [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs Dev Jain
  2025-05-19  7:48 ` [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
@ 2025-05-19  7:48 ` Dev Jain
  2025-05-19  8:18   ` Barry Song
  2025-05-21 13:26   ` Ryan Roberts
  2025-05-19  7:48 ` [PATCH v3 4/5] arm64: Add batched version of ptep_modify_prot_start Dev Jain
  2025-05-19  7:48 ` [PATCH v3 5/5] arm64: Add batched version of ptep_modify_prot_commit Dev Jain
  4 siblings, 2 replies; 34+ messages in thread
From: Dev Jain @ 2025-05-19  7:48 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Use folio_pte_batch to batch process a large folio. Reuse the folio from prot_numa
case if possible. Since modify_prot_start_ptes() gathers access/dirty bits,
it lets us batch around pte_needs_flush() (for parisc, the definition includes
the access bit).
For all cases other than the PageAnonExclusive case, if the case holds true
for one pte in the batch, one can confirm that that case will hold true for
other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty and access bits
across the batch, therefore batching across pte_dirty(): this is correct since
the dirty bit on the PTE really is just an indication that the folio got written
to, so even if the PTE is not actually dirty (but one of the PTEs in the batch is),
the wp-fault optimization can be made.
The crux now is how to batch around the PageAnonExclusive case; we must check
the corresponding condition for every single page. Therefore, from the large
folio batch, we process sub batches of ptes mapping pages with the same PageAnonExclusive
condition, and process that sub batch, then determine and process the next sub batch,
and so on. Note that this does not cause any extra overhead; if suppose the size of
the folio batch is 512, then the sub batch processing in total will take 512 iterations,
which is the same as what we would have done before.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/mm.h |   7 ++-
 mm/mprotect.c      | 126 +++++++++++++++++++++++++++++++++++----------
 2 files changed, 104 insertions(+), 29 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 43748c8f3454..7d5b96f005dc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2542,8 +2542,11 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)
 
-bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t pte);
+bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte, int max_len, int *len);
+#define can_change_pte_writable(vma, addr, pte)	\
+	can_change_ptes_writable(vma, addr, pte, 1, NULL)
+
 extern long change_protection(struct mmu_gather *tlb,
 			      struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, unsigned long cp_flags);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 124612ce3d24..6cd8cdc168fa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -40,25 +40,36 @@
 
 #include "internal.h"
 
-bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t pte)
+bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte, int max_len, int *len)
 {
 	struct page *page;
+	bool temp_ret;
+	bool ret;
+	int i;
 
-	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
-		return false;
+	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE))) {
+		ret = false;
+		goto out;
+	}
 
 	/* Don't touch entries that are not even readable. */
-	if (pte_protnone(pte))
-		return false;
+	if (pte_protnone(pte)) {
+		ret = false;
+		goto out;
+	}
 
 	/* Do we need write faults for softdirty tracking? */
-	if (pte_needs_soft_dirty_wp(vma, pte))
-		return false;
+	if (pte_needs_soft_dirty_wp(vma, pte)) {
+		ret = false;
+		goto out;
+	}
 
 	/* Do we need write faults for uffd-wp tracking? */
-	if (userfaultfd_pte_wp(vma, pte))
-		return false;
+	if (userfaultfd_pte_wp(vma, pte)) {
+		ret = false;
+		goto out;
+	}
 
 	if (!(vma->vm_flags & VM_SHARED)) {
 		/*
@@ -68,7 +79,19 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 		 * any additional checks while holding the PT lock.
 		 */
 		page = vm_normal_page(vma, addr, pte);
-		return page && PageAnon(page) && PageAnonExclusive(page);
+		ret = (page && PageAnon(page) && PageAnonExclusive(page));
+		if (!len)
+			return ret;
+
+		/* Check how many consecutive pages are AnonExclusive or not */
+		for (i = 1; i < max_len; ++i) {
+			++page;
+			temp_ret = (page && PageAnon(page) && PageAnonExclusive(page));
+			if (temp_ret != ret)
+				break;
+		}
+		*len = i;
+		return ret;
 	}
 
 	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
@@ -80,21 +103,55 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	 * FS was already notified and we can simply mark the PTE writable
 	 * just like the write-fault handler would do.
 	 */
-	return pte_dirty(pte);
+	ret = pte_dirty(pte);
+
+out:
+	/* The entire batch is guaranteed to have the same return value */
+	if (len)
+		*len = max_len;
+	return ret;
 }
 
 static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
-		pte_t pte, int max_nr_ptes)
+		pte_t pte, int max_nr_ptes, bool ignore_soft_dirty)
 {
-	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+	fpb_t flags = FPB_IGNORE_DIRTY;
 
-	if (!folio_test_large(folio) || (max_nr_ptes == 1))
+	if (ignore_soft_dirty)
+		flags |= FPB_IGNORE_SOFT_DIRTY;
+
+	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
 		return 1;
 
 	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
 			       NULL, NULL, NULL);
 }
 
+/**
+ * modify_sub_batch - Identifies a sub-batch which has the same return value
+ * of can_change_pte_writable(), from within a folio batch. max_len is the
+ * max length of the possible sub-batch. sub_batch_idx is the offset from
+ * the start of the original folio batch.
+ */
+static int modify_sub_batch(struct vm_area_struct *vma, struct mmu_gather *tlb,
+		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
+		int max_len, int sub_batch_idx)
+{
+	unsigned long new_addr = addr + sub_batch_idx * PAGE_SIZE;
+	pte_t new_oldpte = pte_advance_pfn(oldpte, sub_batch_idx);
+	pte_t new_ptent = pte_advance_pfn(ptent, sub_batch_idx);
+	pte_t *new_ptep = ptep + sub_batch_idx;
+	int len = 1;
+
+	if (can_change_ptes_writable(vma, new_addr, new_ptent, max_len, &len))
+		new_ptent = pte_mkwrite(new_ptent, vma);
+
+	modify_prot_commit_ptes(vma, new_addr, new_ptep, new_oldpte, new_ptent, len);
+	if (pte_needs_flush(new_oldpte, new_ptent))
+		tlb_flush_pte_range(tlb, new_addr, len * PAGE_SIZE);
+	return len;
+}
+
 static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -106,7 +163,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
-	int nr_ptes;
+	int sub_batch_idx, max_len, len, nr_ptes;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -121,10 +178,12 @@ static long change_pte_range(struct mmu_gather *tlb,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 	do {
+		sub_batch_idx = 0;
 		nr_ptes = 1;
 		oldpte = ptep_get(pte);
 		if (pte_present(oldpte)) {
 			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
+			struct folio *folio = NULL;
 			pte_t ptent;
 
 			/*
@@ -132,7 +191,6 @@ static long change_pte_range(struct mmu_gather *tlb,
 			 * pages. See similar comment in change_huge_pmd.
 			 */
 			if (prot_numa) {
-				struct folio *folio;
 				int nid;
 				bool toptier;
 
@@ -180,7 +238,8 @@ static long change_pte_range(struct mmu_gather *tlb,
 				    toptier) {
 skip_batch:
 					nr_ptes = mprotect_batch(folio, addr, pte,
-								 oldpte, max_nr_ptes);
+								 oldpte, max_nr_ptes,
+								 true);
 					continue;
 				}
 				if (folio_use_access_time(folio))
@@ -188,6 +247,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 						jiffies_to_msecs(jiffies));
 			}
 
+			if (!folio)
+				folio = vm_normal_folio(vma, addr, oldpte);
+
+			nr_ptes = mprotect_batch(folio, addr, pte, oldpte,
+						 max_nr_ptes, false);
 			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
 			ptent = pte_modify(oldpte, newprot);
 
@@ -209,15 +273,23 @@ static long change_pte_range(struct mmu_gather *tlb,
 			 * example, if a PTE is already dirty and no other
 			 * COW or special handling is required.
 			 */
-			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
-			    !pte_write(ptent) &&
-			    can_change_pte_writable(vma, addr, ptent))
-				ptent = pte_mkwrite(ptent, vma);
-
-			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
-			if (pte_needs_flush(oldpte, ptent))
-				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
-			pages++;
+			if (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) {
+				max_len = nr_ptes;
+				while (sub_batch_idx < nr_ptes) {
+
+					/* Get length of sub batch */
+					len = modify_sub_batch(vma, tlb, addr, pte,
+							       oldpte, ptent, max_len,
+							       sub_batch_idx);
+					sub_batch_idx += len;
+					max_len -= len;
+				}
+			} else {
+				modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
+				if (pte_needs_flush(oldpte, ptent))
+					tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
+			}
+			pages += nr_ptes;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 			pte_t newpte;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching
  2025-05-19  7:48 ` [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching Dev Jain
@ 2025-05-19  8:18   ` Barry Song
  2025-05-20  9:18     ` Dev Jain
  2025-05-21 13:26   ` Ryan Roberts
  1 sibling, 1 reply; 34+ messages in thread
From: Barry Song @ 2025-05-19  8:18 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Mon, May 19, 2025 at 7:49 PM Dev Jain <dev.jain@arm.com> wrote:
>
> Use folio_pte_batch to batch process a large folio. Reuse the folio from prot_numa
> case if possible. Since modify_prot_start_ptes() gathers access/dirty bits,
> it lets us batch around pte_needs_flush() (for parisc, the definition includes
> the access bit).
> For all cases other than the PageAnonExclusive case, if the case holds true
> for one pte in the batch, one can confirm that that case will hold true for
> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty and access bits
> across the batch, therefore batching across pte_dirty(): this is correct since
> the dirty bit on the PTE really is just an indication that the folio got written
> to, so even if the PTE is not actually dirty (but one of the PTEs in the batch is),
> the wp-fault optimization can be made.
> The crux now is how to batch around the PageAnonExclusive case; we must check
> the corresponding condition for every single page. Therefore, from the large
> folio batch, we process sub batches of ptes mapping pages with the same PageAnonExclusive
> condition, and process that sub batch, then determine and process the next sub batch,
> and so on. Note that this does not cause any extra overhead; if suppose the size of
> the folio batch is 512, then the sub batch processing in total will take 512 iterations,
> which is the same as what we would have done before.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/mm.h |   7 ++-
>  mm/mprotect.c      | 126 +++++++++++++++++++++++++++++++++++----------
>  2 files changed, 104 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 43748c8f3454..7d5b96f005dc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2542,8 +2542,11 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>                                             MM_CP_UFFD_WP_RESOLVE)
>
> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> -                            pte_t pte);
> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
> +                            pte_t pte, int max_len, int *len);
> +#define can_change_pte_writable(vma, addr, pte)        \
> +       can_change_ptes_writable(vma, addr, pte, 1, NULL)
> +
>  extern long change_protection(struct mmu_gather *tlb,
>                               struct vm_area_struct *vma, unsigned long start,
>                               unsigned long end, unsigned long cp_flags);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 124612ce3d24..6cd8cdc168fa 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -40,25 +40,36 @@
>
>  #include "internal.h"
>
> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> -                            pte_t pte)
> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
> +                            pte_t pte, int max_len, int *len)
>  {
>         struct page *page;
> +       bool temp_ret;
> +       bool ret;
> +       int i;
>
> -       if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
> -               return false;
> +       if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE))) {
> +               ret = false;
> +               goto out;
> +       }
>
>         /* Don't touch entries that are not even readable. */
> -       if (pte_protnone(pte))
> -               return false;
> +       if (pte_protnone(pte)) {
> +               ret = false;
> +               goto out;
> +       }
>
>         /* Do we need write faults for softdirty tracking? */
> -       if (pte_needs_soft_dirty_wp(vma, pte))
> -               return false;
> +       if (pte_needs_soft_dirty_wp(vma, pte)) {
> +               ret = false;
> +               goto out;
> +       }
>
>         /* Do we need write faults for uffd-wp tracking? */
> -       if (userfaultfd_pte_wp(vma, pte))
> -               return false;
> +       if (userfaultfd_pte_wp(vma, pte)) {
> +               ret = false;
> +               goto out;
> +       }
>
>         if (!(vma->vm_flags & VM_SHARED)) {
>                 /*
> @@ -68,7 +79,19 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>                  * any additional checks while holding the PT lock.
>                  */
>                 page = vm_normal_page(vma, addr, pte);
> -               return page && PageAnon(page) && PageAnonExclusive(page);
> +               ret = (page && PageAnon(page) && PageAnonExclusive(page));
> +               if (!len)
> +                       return ret;
> +
> +               /* Check how many consecutive pages are AnonExclusive or not */
> +               for (i = 1; i < max_len; ++i) {
> +                       ++page;
> +                       temp_ret = (page && PageAnon(page) && PageAnonExclusive(page));
> +                       if (temp_ret != ret)

Do we really need to do PageAnon for each subpage which is:
folio_test_anon(page_folio(page))

since we have checked subpage[0] ?

> +                               break;
> +               }
> +               *len = i;
> +               return ret;
>         }
>
>         VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
> @@ -80,21 +103,55 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>          * FS was already notified and we can simply mark the PTE writable
>          * just like the write-fault handler would do.
>          */
> -       return pte_dirty(pte);
> +       ret = pte_dirty(pte);
> +
> +out:
> +       /* The entire batch is guaranteed to have the same return value */
> +       if (len)
> +               *len = max_len;
> +       return ret;
>  }
>
>  static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
> -               pte_t pte, int max_nr_ptes)
> +               pte_t pte, int max_nr_ptes, bool ignore_soft_dirty)
>  {
> -       const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +       fpb_t flags = FPB_IGNORE_DIRTY;
>
> -       if (!folio_test_large(folio) || (max_nr_ptes == 1))
> +       if (ignore_soft_dirty)
> +               flags |= FPB_IGNORE_SOFT_DIRTY;
> +
> +       if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>                 return 1;
>
>         return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>                                NULL, NULL, NULL);
>  }
>
> +/**
> + * modify_sub_batch - Identifies a sub-batch which has the same return value
> + * of can_change_pte_writable(), from within a folio batch. max_len is the
> + * max length of the possible sub-batch. sub_batch_idx is the offset from
> + * the start of the original folio batch.
> + */
> +static int modify_sub_batch(struct vm_area_struct *vma, struct mmu_gather *tlb,
> +               unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
> +               int max_len, int sub_batch_idx)
> +{
> +       unsigned long new_addr = addr + sub_batch_idx * PAGE_SIZE;
> +       pte_t new_oldpte = pte_advance_pfn(oldpte, sub_batch_idx);
> +       pte_t new_ptent = pte_advance_pfn(ptent, sub_batch_idx);
> +       pte_t *new_ptep = ptep + sub_batch_idx;
> +       int len = 1;
> +
> +       if (can_change_ptes_writable(vma, new_addr, new_ptent, max_len, &len))
> +               new_ptent = pte_mkwrite(new_ptent, vma);
> +
> +       modify_prot_commit_ptes(vma, new_addr, new_ptep, new_oldpte, new_ptent, len);
> +       if (pte_needs_flush(new_oldpte, new_ptent))
> +               tlb_flush_pte_range(tlb, new_addr, len * PAGE_SIZE);
> +       return len;
> +}
> +
>  static long change_pte_range(struct mmu_gather *tlb,
>                 struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>                 unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -106,7 +163,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>         bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>         bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>         bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> -       int nr_ptes;
> +       int sub_batch_idx, max_len, len, nr_ptes;
>
>         tlb_change_page_size(tlb, PAGE_SIZE);
>         pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -121,10 +178,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>         flush_tlb_batched_pending(vma->vm_mm);
>         arch_enter_lazy_mmu_mode();
>         do {
> +               sub_batch_idx = 0;
>                 nr_ptes = 1;
>                 oldpte = ptep_get(pte);
>                 if (pte_present(oldpte)) {
>                         int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
> +                       struct folio *folio = NULL;
>                         pte_t ptent;
>
>                         /*
> @@ -132,7 +191,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>                          * pages. See similar comment in change_huge_pmd.
>                          */
>                         if (prot_numa) {
> -                               struct folio *folio;
>                                 int nid;
>                                 bool toptier;
>
> @@ -180,7 +238,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>                                     toptier) {
>  skip_batch:
>                                         nr_ptes = mprotect_batch(folio, addr, pte,
> -                                                                oldpte, max_nr_ptes);
> +                                                                oldpte, max_nr_ptes,
> +                                                                true);
>                                         continue;
>                                 }
>                                 if (folio_use_access_time(folio))
> @@ -188,6 +247,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>                                                 jiffies_to_msecs(jiffies));
>                         }
>
> +                       if (!folio)
> +                               folio = vm_normal_folio(vma, addr, oldpte);
> +
> +                       nr_ptes = mprotect_batch(folio, addr, pte, oldpte,
> +                                                max_nr_ptes, false);
>                         oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>                         ptent = pte_modify(oldpte, newprot);
>
> @@ -209,15 +273,23 @@ static long change_pte_range(struct mmu_gather *tlb,
>                          * example, if a PTE is already dirty and no other
>                          * COW or special handling is required.
>                          */
> -                       if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> -                           !pte_write(ptent) &&
> -                           can_change_pte_writable(vma, addr, ptent))
> -                               ptent = pte_mkwrite(ptent, vma);
> -
> -                       modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> -                       if (pte_needs_flush(oldpte, ptent))
> -                               tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> -                       pages++;
> +                       if (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) {
> +                               max_len = nr_ptes;
> +                               while (sub_batch_idx < nr_ptes) {
> +
> +                                       /* Get length of sub batch */
> +                                       len = modify_sub_batch(vma, tlb, addr, pte,
> +                                                              oldpte, ptent, max_len,
> +                                                              sub_batch_idx);
> +                                       sub_batch_idx += len;
> +                                       max_len -= len;
> +                               }
> +                       } else {
> +                               modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> +                               if (pte_needs_flush(oldpte, ptent))
> +                                       tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
> +                       }
> +                       pages += nr_ptes;
>                 } else if (is_swap_pte(oldpte)) {
>                         swp_entry_t entry = pte_to_swp_entry(oldpte);
>                         pte_t newpte;
> --
> 2.30.2
>

Thanks
barry


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching
  2025-05-19  8:18   ` Barry Song
@ 2025-05-20  9:18     ` Dev Jain
  0 siblings, 0 replies; 34+ messages in thread
From: Dev Jain @ 2025-05-20  9:18 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

Hi, sorry for the late reply, my Thunderbird is not working and I am not
being able to receive emails. I am manually downloading the mbox and 
then replying, so my email response may be slow :(

On 19/05/25 1:48 pm, Barry Song wrote:
> On Mon, May 19, 2025 at 7:49 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>> Use folio_pte_batch to batch process a large folio. Reuse the folio from prot_numa
>> case if possible. Since modify_prot_start_ptes() gathers access/dirty bits,
>> it lets us batch around pte_needs_flush() (for parisc, the definition includes
>> the access bit).
>> For all cases other than the PageAnonExclusive case, if the case holds true
>> for one pte in the batch, one can confirm that that case will hold true for
>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty and access bits
>> across the batch, therefore batching across pte_dirty(): this is correct since
>> the dirty bit on the PTE really is just an indication that the folio got written
>> to, so even if the PTE is not actually dirty (but one of the PTEs in the batch is),
>> the wp-fault optimization can be made.
>> The crux now is how to batch around the PageAnonExclusive case; we must check
>> the corresponding condition for every single page. Therefore, from the large
>> folio batch, we process sub batches of ptes mapping pages with the same PageAnonExclusive
>> condition, and process that sub batch, then determine and process the next sub batch,
>> and so on. Note that this does not cause any extra overhead; if suppose the size of
>> the folio batch is 512, then the sub batch processing in total will take 512 iterations,
>> which is the same as what we would have done before.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/mm.h |   7 ++-
>>   mm/mprotect.c      | 126 +++++++++++++++++++++++++++++++++++----------
>>   2 files changed, 104 insertions(+), 29 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 43748c8f3454..7d5b96f005dc 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2542,8 +2542,11 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
>>   #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>                                              MM_CP_UFFD_WP_RESOLVE)
>>
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -                            pte_t pte);
>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>> +                            pte_t pte, int max_len, int *len);
>> +#define can_change_pte_writable(vma, addr, pte)        \
>> +       can_change_ptes_writable(vma, addr, pte, 1, NULL)
>> +
>>   extern long change_protection(struct mmu_gather *tlb,
>>                                struct vm_area_struct *vma, unsigned long start,
>>                                unsigned long end, unsigned long cp_flags);
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 124612ce3d24..6cd8cdc168fa 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -40,25 +40,36 @@
>>
>>   #include "internal.h"
>>
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -                            pte_t pte)
>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>> +                            pte_t pte, int max_len, int *len)
>>   {
>>          struct page *page;
>> +       bool temp_ret;
>> +       bool ret;
>> +       int i;
>>
>> -       if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>> -               return false;
>> +       if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE))) {
>> +               ret = false;
>> +               goto out;
>> +       }
>>
>>          /* Don't touch entries that are not even readable. */
>> -       if (pte_protnone(pte))
>> -               return false;
>> +       if (pte_protnone(pte)) {
>> +               ret = false;
>> +               goto out;
>> +       }
>>
>>          /* Do we need write faults for softdirty tracking? */
>> -       if (pte_needs_soft_dirty_wp(vma, pte))
>> -               return false;
>> +       if (pte_needs_soft_dirty_wp(vma, pte)) {
>> +               ret = false;
>> +               goto out;
>> +       }
>>
>>          /* Do we need write faults for uffd-wp tracking? */
>> -       if (userfaultfd_pte_wp(vma, pte))
>> -               return false;
>> +       if (userfaultfd_pte_wp(vma, pte)) {
>> +               ret = false;
>> +               goto out;
>> +       }
>>
>>          if (!(vma->vm_flags & VM_SHARED)) {
>>                  /*
>> @@ -68,7 +79,19 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>                   * any additional checks while holding the PT lock.
>>                   */
>>                  page = vm_normal_page(vma, addr, pte);
>> -               return page && PageAnon(page) && PageAnonExclusive(page);
>> +               ret = (page && PageAnon(page) && PageAnonExclusive(page));
>> +               if (!len)
>> +                       return ret;
>> +
>> +               /* Check how many consecutive pages are AnonExclusive or not */
>> +               for (i = 1; i < max_len; ++i) {
>> +                       ++page;
>> +                       temp_ret = (page && PageAnon(page) && PageAnonExclusive(page));
>> +                       if (temp_ret != ret)
> 
> Do we really need to do PageAnon for each subpage which is:
> folio_test_anon(page_folio(page))
> 
> since we have checked subpage[0] ?

You are correct. I just wrote it like that for consistency, I'll change
it in v4.

> 
>> +                               break;
>> +               }
>> +               *len = i;
>> +               return ret;
>>          }
>>
>>          VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>> @@ -80,21 +103,55 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>           * FS was already notified and we can simply mark the PTE writable
>>           * just like the write-fault handler would do.
>>           */
>> -       return pte_dirty(pte);
>> +       ret = pte_dirty(pte);
>> +
>> +out:
>> +       /* The entire batch is guaranteed to have the same return value */
>> +       if (len)
>> +               *len = max_len;
>> +       return ret;
>>   }
>>
>>   static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
>> -               pte_t pte, int max_nr_ptes)
>> +               pte_t pte, int max_nr_ptes, bool ignore_soft_dirty)
>>   {
>> -       const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +       fpb_t flags = FPB_IGNORE_DIRTY;
>>
>> -       if (!folio_test_large(folio) || (max_nr_ptes == 1))
>> +       if (ignore_soft_dirty)
>> +               flags |= FPB_IGNORE_SOFT_DIRTY;
>> +
>> +       if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>                  return 1;
>>
>>          return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>                                 NULL, NULL, NULL);
>>   }
>>
>> +/**
>> + * modify_sub_batch - Identifies a sub-batch which has the same return value
>> + * of can_change_pte_writable(), from within a folio batch. max_len is the
>> + * max length of the possible sub-batch. sub_batch_idx is the offset from
>> + * the start of the original folio batch.
>> + */
>> +static int modify_sub_batch(struct vm_area_struct *vma, struct mmu_gather *tlb,
>> +               unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
>> +               int max_len, int sub_batch_idx)
>> +{
>> +       unsigned long new_addr = addr + sub_batch_idx * PAGE_SIZE;
>> +       pte_t new_oldpte = pte_advance_pfn(oldpte, sub_batch_idx);
>> +       pte_t new_ptent = pte_advance_pfn(ptent, sub_batch_idx);
>> +       pte_t *new_ptep = ptep + sub_batch_idx;
>> +       int len = 1;
>> +
>> +       if (can_change_ptes_writable(vma, new_addr, new_ptent, max_len, &len))
>> +               new_ptent = pte_mkwrite(new_ptent, vma);
>> +
>> +       modify_prot_commit_ptes(vma, new_addr, new_ptep, new_oldpte, new_ptent, len);
>> +       if (pte_needs_flush(new_oldpte, new_ptent))
>> +               tlb_flush_pte_range(tlb, new_addr, len * PAGE_SIZE);
>> +       return len;
>> +}
>> +
>>   static long change_pte_range(struct mmu_gather *tlb,
>>                  struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>                  unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -106,7 +163,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>          bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>          bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>          bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> -       int nr_ptes;
>> +       int sub_batch_idx, max_len, len, nr_ptes;
>>
>>          tlb_change_page_size(tlb, PAGE_SIZE);
>>          pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> @@ -121,10 +178,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>          flush_tlb_batched_pending(vma->vm_mm);
>>          arch_enter_lazy_mmu_mode();
>>          do {
>> +               sub_batch_idx = 0;
>>                  nr_ptes = 1;
>>                  oldpte = ptep_get(pte);
>>                  if (pte_present(oldpte)) {
>>                          int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>> +                       struct folio *folio = NULL;
>>                          pte_t ptent;
>>
>>                          /*
>> @@ -132,7 +191,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                           * pages. See similar comment in change_huge_pmd.
>>                           */
>>                          if (prot_numa) {
>> -                               struct folio *folio;
>>                                  int nid;
>>                                  bool toptier;
>>
>> @@ -180,7 +238,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                                      toptier) {
>>   skip_batch:
>>                                          nr_ptes = mprotect_batch(folio, addr, pte,
>> -                                                                oldpte, max_nr_ptes);
>> +                                                                oldpte, max_nr_ptes,
>> +                                                                true);
>>                                          continue;
>>                                  }
>>                                  if (folio_use_access_time(folio))
>> @@ -188,6 +247,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                                                  jiffies_to_msecs(jiffies));
>>                          }
>>
>> +                       if (!folio)
>> +                               folio = vm_normal_folio(vma, addr, oldpte);
>> +
>> +                       nr_ptes = mprotect_batch(folio, addr, pte, oldpte,
>> +                                                max_nr_ptes, false);
>>                          oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>                          ptent = pte_modify(oldpte, newprot);
>>
>> @@ -209,15 +273,23 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                           * example, if a PTE is already dirty and no other
>>                           * COW or special handling is required.
>>                           */
>> -                       if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> -                           !pte_write(ptent) &&
>> -                           can_change_pte_writable(vma, addr, ptent))
>> -                               ptent = pte_mkwrite(ptent, vma);
>> -
>> -                       modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> -                       if (pte_needs_flush(oldpte, ptent))
>> -                               tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> -                       pages++;
>> +                       if (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) {
>> +                               max_len = nr_ptes;
>> +                               while (sub_batch_idx < nr_ptes) {
>> +
>> +                                       /* Get length of sub batch */
>> +                                       len = modify_sub_batch(vma, tlb, addr, pte,
>> +                                                              oldpte, ptent, max_len,
>> +                                                              sub_batch_idx);
>> +                                       sub_batch_idx += len;
>> +                                       max_len -= len;
>> +                               }
>> +                       } else {
>> +                               modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> +                               if (pte_needs_flush(oldpte, ptent))
>> +                                       tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
>> +                       }
>> +                       pages += nr_ptes;
>>                  } else if (is_swap_pte(oldpte)) {
>>                          swp_entry_t entry = pte_to_swp_entry(oldpte);
>>                          pte_t newpte;
>> --
>> 2.30.2
>>
> 
> Thanks
> barry



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching
  2025-05-19  7:48 ` [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching Dev Jain
  2025-05-19  8:18   ` Barry Song
@ 2025-05-21 13:26   ` Ryan Roberts
  2025-05-22  6:59     ` Dev Jain
                       ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: Ryan Roberts @ 2025-05-21 13:26 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 19/05/2025 08:48, Dev Jain wrote:
> Use folio_pte_batch to batch process a large folio. Reuse the folio from prot_numa
> case if possible. Since modify_prot_start_ptes() gathers access/dirty bits,
> it lets us batch around pte_needs_flush() (for parisc, the definition includes
> the access bit).

parisc's pte_needs_flush() is a different (static) function, just coincidentally 
named - it takes different arguments.

But powerpc and x86 both specialize it and they consider the dirty bit. Both 
implementations look to me like they will function correctly but it is a bit 
ugly. They both conclude that if there is no change or if dirty has gone from 
clear to set, then no flush is needed. (flush only needed when dirty goes from 
set to clear). Given after your change, oldpte may have dirty set when it wasn't 
really set for the pte in question that means the function could detect no 
change when really its a clear -> set change; it gives the same answer in both 
cases so it's safe. Perhaps a bit fragile though...

> For all cases other than the PageAnonExclusive case, if the case holds true
> for one pte in the batch, one can confirm that that case will hold true for
> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty and access bits
> across the batch, therefore batching across pte_dirty(): this is correct since
> the dirty bit on the PTE really is just an indication that the folio got written
> to, so even if the PTE is not actually dirty (but one of the PTEs in the batch is),
> the wp-fault optimization can be made.
> The crux now is how to batch around the PageAnonExclusive case; we must check
> the corresponding condition for every single page. Therefore, from the large
> folio batch, we process sub batches of ptes mapping pages with the same PageAnonExclusive
> condition, and process that sub batch, then determine and process the next sub batch,
> and so on. Note that this does not cause any extra overhead; if suppose the size of
> the folio batch is 512, then the sub batch processing in total will take 512 iterations,
> which is the same as what we would have done before.

This commit log could do with some reformatting; blank lines between paragraphs 
and break at whatever the usual git commit log char limit is (72 chars?).

> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/mm.h |   7 ++-
>  mm/mprotect.c      | 126 +++++++++++++++++++++++++++++++++++----------
>  2 files changed, 104 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 43748c8f3454..7d5b96f005dc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2542,8 +2542,11 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>  					    MM_CP_UFFD_WP_RESOLVE)
>  
> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte);
> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
> +			     pte_t pte, int max_len, int *len);
> +#define can_change_pte_writable(vma, addr, pte)	\
> +	can_change_ptes_writable(vma, addr, pte, 1, NULL)

nit: add an extra tab for readability:

#define can_change_pte_writable(vma, addr, pte)	\
		can_change_ptes_writable(vma, addr, pte, 1, NULL)

> +
>  extern long change_protection(struct mmu_gather *tlb,
>  			      struct vm_area_struct *vma, unsigned long start,
>  			      unsigned long end, unsigned long cp_flags);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 124612ce3d24..6cd8cdc168fa 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -40,25 +40,36 @@
>  
>  #include "internal.h"
>  
> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte)
> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
> +			     pte_t pte, int max_len, int *len)
>  {
>  	struct page *page;
> +	bool temp_ret;
> +	bool ret;
> +	int i;
>  
> -	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
> -		return false;
> +	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE))) {
> +		ret = false;
> +		goto out;
> +	}
>  
>  	/* Don't touch entries that are not even readable. */
> -	if (pte_protnone(pte))
> -		return false;
> +	if (pte_protnone(pte)) {
> +		ret = false;
> +		goto out;
> +	}
>  
>  	/* Do we need write faults for softdirty tracking? */
> -	if (pte_needs_soft_dirty_wp(vma, pte))
> -		return false;
> +	if (pte_needs_soft_dirty_wp(vma, pte)) {
> +		ret = false;
> +		goto out;
> +	}
>  
>  	/* Do we need write faults for uffd-wp tracking? */
> -	if (userfaultfd_pte_wp(vma, pte))
> -		return false;
> +	if (userfaultfd_pte_wp(vma, pte)) {
> +		ret = false;
> +		goto out;
> +	}
>  
>  	if (!(vma->vm_flags & VM_SHARED)) {
>  		/*
> @@ -68,7 +79,19 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  		 * any additional checks while holding the PT lock.
>  		 */
>  		page = vm_normal_page(vma, addr, pte);
> -		return page && PageAnon(page) && PageAnonExclusive(page);
> +		ret = (page && PageAnon(page) && PageAnonExclusive(page));
> +		if (!len)
> +			return ret;
> +
> +		/* Check how many consecutive pages are AnonExclusive or not */
> +		for (i = 1; i < max_len; ++i) {
> +			++page;
> +			temp_ret = (page && PageAnon(page) && PageAnonExclusive(page));
> +			if (temp_ret != ret)
> +				break;
> +		}
> +		*len = i;
> +		return ret;
>  	}
>  
>  	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
> @@ -80,21 +103,55 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	 * FS was already notified and we can simply mark the PTE writable
>  	 * just like the write-fault handler would do.
>  	 */
> -	return pte_dirty(pte);
> +	ret = pte_dirty(pte);
> +
> +out:
> +	/* The entire batch is guaranteed to have the same return value */
> +	if (len)
> +		*len = max_len;
> +	return ret;
>  }
>  
>  static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
> -		pte_t pte, int max_nr_ptes)
> +		pte_t pte, int max_nr_ptes, bool ignore_soft_dirty)

We'll almost certainly want more flags here in future. I wonder if it's cleaner 
just to pass "fpb_t extra_flags" here instead of the bool, then OR them in. You 
can pass 0 or FPB_IGNORE_SOFT_DIRTY at your 2 callsites. No strong opinion 
though.

>  {
> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +	fpb_t flags = FPB_IGNORE_DIRTY;
>  
> -	if (!folio_test_large(folio) || (max_nr_ptes == 1))
> +	if (ignore_soft_dirty)
> +		flags |= FPB_IGNORE_SOFT_DIRTY;
> +
> +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>  		return 1;
>  
>  	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>  			       NULL, NULL, NULL);
>  }
>  
> +/**
> + * modify_sub_batch - Identifies a sub-batch which has the same return value
> + * of can_change_pte_writable(), from within a folio batch. max_len is the
> + * max length of the possible sub-batch. sub_batch_idx is the offset from
> + * the start of the original folio batch.
> + */
> +static int modify_sub_batch(struct vm_area_struct *vma, struct mmu_gather *tlb,
> +		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
> +		int max_len, int sub_batch_idx)
> +{
> +	unsigned long new_addr = addr + sub_batch_idx * PAGE_SIZE;
> +	pte_t new_oldpte = pte_advance_pfn(oldpte, sub_batch_idx);
> +	pte_t new_ptent = pte_advance_pfn(ptent, sub_batch_idx);
> +	pte_t *new_ptep = ptep + sub_batch_idx;
> +	int len = 1;
> +
> +	if (can_change_ptes_writable(vma, new_addr, new_ptent, max_len, &len))
> +		new_ptent = pte_mkwrite(new_ptent, vma);
> +
> +	modify_prot_commit_ptes(vma, new_addr, new_ptep, new_oldpte, new_ptent, len);
> +	if (pte_needs_flush(new_oldpte, new_ptent))
> +		tlb_flush_pte_range(tlb, new_addr, len * PAGE_SIZE);
> +	return len;
> +}
> +
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -106,7 +163,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> -	int nr_ptes;
> +	int sub_batch_idx, max_len, len, nr_ptes;
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -121,10 +178,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>  	do {
> +		sub_batch_idx = 0;
>  		nr_ptes = 1;
>  		oldpte = ptep_get(pte);
>  		if (pte_present(oldpte)) {
>  			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
> +			struct folio *folio = NULL;
>  			pte_t ptent;
>  
>  			/*
> @@ -132,7 +191,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			 * pages. See similar comment in change_huge_pmd.
>  			 */
>  			if (prot_numa) {
> -				struct folio *folio;
>  				int nid;
>  				bool toptier;
>  
> @@ -180,7 +238,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				    toptier) {
>  skip_batch:
>  					nr_ptes = mprotect_batch(folio, addr, pte,
> -								 oldpte, max_nr_ptes);
> +								 oldpte, max_nr_ptes,
> +								 true);
>  					continue;
>  				}
>  				if (folio_use_access_time(folio))
> @@ -188,6 +247,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>  						jiffies_to_msecs(jiffies));
>  			}
>  
> +			if (!folio)
> +				folio = vm_normal_folio(vma, addr, oldpte);
> +
> +			nr_ptes = mprotect_batch(folio, addr, pte, oldpte,
> +						 max_nr_ptes, false);
>  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>  			ptent = pte_modify(oldpte, newprot);
>  
> @@ -209,15 +273,23 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			 * example, if a PTE is already dirty and no other
>  			 * COW or special handling is required.
>  			 */
> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> -			    !pte_write(ptent) &&

Don't you need to keep the !pte_write(ptent) condition here? No need to sub-
batch if write is already set.

> -			    can_change_pte_writable(vma, addr, ptent))
> -				ptent = pte_mkwrite(ptent, vma);
> -
> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> -			if (pte_needs_flush(oldpte, ptent))
> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> -			pages++;
> +			if (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) {
> +				max_len = nr_ptes;
> +				while (sub_batch_idx < nr_ptes) {
> +
> +					/* Get length of sub batch */
> +					len = modify_sub_batch(vma, tlb, addr, pte,
> +							       oldpte, ptent, max_len,
> +							       sub_batch_idx);
> +					sub_batch_idx += len;
> +					max_len -= len;
> +				}
> +			} else {
> +				modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> +				if (pte_needs_flush(oldpte, ptent))
> +					tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
> +			}

This if/else block and modify_sub_block() is all pretty ugly. I wonder if 
something like this would be neater:

			use_sub_batch = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && 
					!pte_write(ptent) && 
					LIKE_can_change_pte_writable(vma, addr, ptent, folio);

			while (nr_ptes) {
				if (use_sub_batch)
					sub_nr_ptes = sub_batch(...);
				else
					sub_nr_ptes = nr_ptes;

				modify_prot_commit_ptes(vma, addr, pte, oldpte, 
							ptent, sub_nr_ptes);
				if (pte_needs_flush(oldpte, ptent))
					tlb_flush_pte_range(tlb, addr,
						sub_nr_ptes * PAGE_SIZE);

				addr += sub_nr_ptes * PAGE_SIZE;
				pte += sub_nr_ptes;
				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
				nr_ptes -= sub_nr_ptes;
				pages += sub_nr_ptes;
			}

Where:

 - LIKE_can_change_pte_writable() is similar to can_change_pte_writable() but
   does everything except checking the per-page exclusive flag. Note that we
   already have the folio so can pass that rather than calling vm_normal_page()
   again.

 - sub_batch() can be passed the folio and the index of the first page within
   the folio and the max number of pages to check (nr_ptes). Then returns the
   number of pages where exclusive flag is the same.

Obviously they both need better names...

Thanks,
Ryan


> +			pages += nr_ptes;
>  		} else if (is_swap_pte(oldpte)) {
>  			swp_entry_t entry = pte_to_swp_entry(oldpte);
>  			pte_t newpte;



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching
  2025-05-21 13:26   ` Ryan Roberts
@ 2025-05-22  6:59     ` Dev Jain
  2025-05-22  7:11     ` Dev Jain
  2025-06-16 11:24     ` Dev Jain
  2 siblings, 0 replies; 34+ messages in thread
From: Dev Jain @ 2025-05-22  6:59 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 21/05/25 6:56 pm, Ryan Roberts wrote:
> On 19/05/2025 08:48, Dev Jain wrote:
>> Use folio_pte_batch to batch process a large folio. Reuse the folio from prot_numa
>> case if possible. Since modify_prot_start_ptes() gathers access/dirty bits,
>> it lets us batch around pte_needs_flush() (for parisc, the definition includes
>> the access bit).
> parisc's pte_needs_flush() is a different (static) function, just coincidentally
> named - it takes different arguments.


Oh my god I'm *that* stupid :(


>
> But powerpc and x86 both specialize it and they consider the dirty bit. Both
> implementations look to me like they will function correctly but it is a bit
> ugly. They both conclude that if there is no change or if dirty has gone from
> clear to set, then no flush is needed. (flush only needed when dirty goes from
> set to clear). Given after your change, oldpte may have dirty set when it wasn't
> really set for the pte in question that means the function could detect no
> change when really its a clear -> set change; it gives the same answer in both
> cases so it's safe. Perhaps a bit fragile though...


Yeah I was wondering about the fragility too. There is a hard dependency 
between

arch private functions doing a very specific thing so that our batching 
does not break.

Best we can do is add a comment explaining this, I think.


>
>> For all cases other than the PageAnonExclusive case, if the case holds true
>> for one pte in the batch, one can confirm that that case will hold true for
>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty and access bits
>> across the batch, therefore batching across pte_dirty(): this is correct since
>> the dirty bit on the PTE really is just an indication that the folio got written
>> to, so even if the PTE is not actually dirty (but one of the PTEs in the batch is),
>> the wp-fault optimization can be made.
>> The crux now is how to batch around the PageAnonExclusive case; we must check
>> the corresponding condition for every single page. Therefore, from the large
>> folio batch, we process sub batches of ptes mapping pages with the same PageAnonExclusive
>> condition, and process that sub batch, then determine and process the next sub batch,
>> and so on. Note that this does not cause any extra overhead; if suppose the size of
>> the folio batch is 512, then the sub batch processing in total will take 512 iterations,
>> which is the same as what we would have done before.
> This commit log could do with some reformatting; blank lines between paragraphs
> and break at whatever the usual git commit log char limit is (72 chars?).


Sure, and I think it is 75 chars.


>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/mm.h |   7 ++-
>>   mm/mprotect.c      | 126 +++++++++++++++++++++++++++++++++++----------
>>   2 files changed, 104 insertions(+), 29 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 43748c8f3454..7d5b96f005dc 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2542,8 +2542,11 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
>>   #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>   					    MM_CP_UFFD_WP_RESOLVE)
>>   
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -			     pte_t pte);
>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>> +			     pte_t pte, int max_len, int *len);
>> +#define can_change_pte_writable(vma, addr, pte)	\
>> +	can_change_ptes_writable(vma, addr, pte, 1, NULL)
> nit: add an extra tab for readability:
>
> #define can_change_pte_writable(vma, addr, pte)	\
> 		can_change_ptes_writable(vma, addr, pte, 1, NULL)


Sure.


>
>> +
>>   extern long change_protection(struct mmu_gather *tlb,
>>   			      struct vm_area_struct *vma, unsigned long start,
>>   			      unsigned long end, unsigned long cp_flags);
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 124612ce3d24..6cd8cdc168fa 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -40,25 +40,36 @@
>>   
>>   #include "internal.h"
>>   
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -			     pte_t pte)
>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>> +			     pte_t pte, int max_len, int *len)
>>   {
>>   	struct page *page;
>> +	bool temp_ret;
>> +	bool ret;
>> +	int i;
>>   
>> -	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>> -		return false;
>> +	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE))) {
>> +		ret = false;
>> +		goto out;
>> +	}
>>   
>>   	/* Don't touch entries that are not even readable. */
>> -	if (pte_protnone(pte))
>> -		return false;
>> +	if (pte_protnone(pte)) {
>> +		ret = false;
>> +		goto out;
>> +	}
>>   
>>   	/* Do we need write faults for softdirty tracking? */
>> -	if (pte_needs_soft_dirty_wp(vma, pte))
>> -		return false;
>> +	if (pte_needs_soft_dirty_wp(vma, pte)) {
>> +		ret = false;
>> +		goto out;
>> +	}
>>   
>>   	/* Do we need write faults for uffd-wp tracking? */
>> -	if (userfaultfd_pte_wp(vma, pte))
>> -		return false;
>> +	if (userfaultfd_pte_wp(vma, pte)) {
>> +		ret = false;
>> +		goto out;
>> +	}
>>   
>>   	if (!(vma->vm_flags & VM_SHARED)) {
>>   		/*
>> @@ -68,7 +79,19 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   		 * any additional checks while holding the PT lock.
>>   		 */
>>   		page = vm_normal_page(vma, addr, pte);
>> -		return page && PageAnon(page) && PageAnonExclusive(page);
>> +		ret = (page && PageAnon(page) && PageAnonExclusive(page));
>> +		if (!len)
>> +			return ret;
>> +
>> +		/* Check how many consecutive pages are AnonExclusive or not */
>> +		for (i = 1; i < max_len; ++i) {
>> +			++page;
>> +			temp_ret = (page && PageAnon(page) && PageAnonExclusive(page));
>> +			if (temp_ret != ret)
>> +				break;
>> +		}
>> +		*len = i;
>> +		return ret;
>>   	}
>>   
>>   	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>> @@ -80,21 +103,55 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   	 * FS was already notified and we can simply mark the PTE writable
>>   	 * just like the write-fault handler would do.
>>   	 */
>> -	return pte_dirty(pte);
>> +	ret = pte_dirty(pte);
>> +
>> +out:
>> +	/* The entire batch is guaranteed to have the same return value */
>> +	if (len)
>> +		*len = max_len;
>> +	return ret;
>>   }
>>   
>>   static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
>> -		pte_t pte, int max_nr_ptes)
>> +		pte_t pte, int max_nr_ptes, bool ignore_soft_dirty)
> We'll almost certainly want more flags here in future. I wonder if it's cleaner
> just to pass "fpb_t extra_flags" here instead of the bool, then OR them in. You
> can pass 0 or FPB_IGNORE_SOFT_DIRTY at your 2 callsites. No strong opinion
> though.


I think your suggestion is better.


>
>>   {
>> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	fpb_t flags = FPB_IGNORE_DIRTY;
>>   
>> -	if (!folio_test_large(folio) || (max_nr_ptes == 1))
>> +	if (ignore_soft_dirty)
>> +		flags |= FPB_IGNORE_SOFT_DIRTY;
>> +
>> +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>   		return 1;
>>   
>>   	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>   			       NULL, NULL, NULL);
>>   }
>>   
>> +/**
>> + * modify_sub_batch - Identifies a sub-batch which has the same return value
>> + * of can_change_pte_writable(), from within a folio batch. max_len is the
>> + * max length of the possible sub-batch. sub_batch_idx is the offset from
>> + * the start of the original folio batch.
>> + */
>> +static int modify_sub_batch(struct vm_area_struct *vma, struct mmu_gather *tlb,
>> +		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
>> +		int max_len, int sub_batch_idx)
>> +{
>> +	unsigned long new_addr = addr + sub_batch_idx * PAGE_SIZE;
>> +	pte_t new_oldpte = pte_advance_pfn(oldpte, sub_batch_idx);
>> +	pte_t new_ptent = pte_advance_pfn(ptent, sub_batch_idx);
>> +	pte_t *new_ptep = ptep + sub_batch_idx;
>> +	int len = 1;
>> +
>> +	if (can_change_ptes_writable(vma, new_addr, new_ptent, max_len, &len))
>> +		new_ptent = pte_mkwrite(new_ptent, vma);
>> +
>> +	modify_prot_commit_ptes(vma, new_addr, new_ptep, new_oldpte, new_ptent, len);
>> +	if (pte_needs_flush(new_oldpte, new_ptent))
>> +		tlb_flush_pte_range(tlb, new_addr, len * PAGE_SIZE);
>> +	return len;
>> +}
>> +
>>   static long change_pte_range(struct mmu_gather *tlb,
>>   		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>   		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -106,7 +163,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>   	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>   	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> -	int nr_ptes;
>> +	int sub_batch_idx, max_len, len, nr_ptes;
>>   
>>   	tlb_change_page_size(tlb, PAGE_SIZE);
>>   	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> @@ -121,10 +178,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   	flush_tlb_batched_pending(vma->vm_mm);
>>   	arch_enter_lazy_mmu_mode();
>>   	do {
>> +		sub_batch_idx = 0;
>>   		nr_ptes = 1;
>>   		oldpte = ptep_get(pte);
>>   		if (pte_present(oldpte)) {
>>   			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>> +			struct folio *folio = NULL;
>>   			pte_t ptent;
>>   
>>   			/*
>> @@ -132,7 +191,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			 * pages. See similar comment in change_huge_pmd.
>>   			 */
>>   			if (prot_numa) {
>> -				struct folio *folio;
>>   				int nid;
>>   				bool toptier;
>>   
>> @@ -180,7 +238,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   				    toptier) {
>>   skip_batch:
>>   					nr_ptes = mprotect_batch(folio, addr, pte,
>> -								 oldpte, max_nr_ptes);
>> +								 oldpte, max_nr_ptes,
>> +								 true);
>>   					continue;
>>   				}
>>   				if (folio_use_access_time(folio))
>> @@ -188,6 +247,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   						jiffies_to_msecs(jiffies));
>>   			}
>>   
>> +			if (!folio)
>> +				folio = vm_normal_folio(vma, addr, oldpte);
>> +
>> +			nr_ptes = mprotect_batch(folio, addr, pte, oldpte,
>> +						 max_nr_ptes, false);
>>   			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>   			ptent = pte_modify(oldpte, newprot);
>>   
>> @@ -209,15 +273,23 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			 * example, if a PTE is already dirty and no other
>>   			 * COW or special handling is required.
>>   			 */
>> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> -			    !pte_write(ptent) &&
> Don't you need to keep the !pte_write(ptent) condition here? No need to sub-
> batch if write is already set.


Ah right, the permissions for the entire batch are the same, so 
pte_write(ptent) will

return the same value for all ptes in the batch, so we can batch through 
this too. I

thought that we may skip on the wp-optimization for other ptes if the first

pte is writable and the others are not, but that won't happen because we 
are batching.


>
>> -			    can_change_pte_writable(vma, addr, ptent))
>> -				ptent = pte_mkwrite(ptent, vma);
>> -
>> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> -			if (pte_needs_flush(oldpte, ptent))
>> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> -			pages++;
>> +			if (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) {
>> +				max_len = nr_ptes;
>> +				while (sub_batch_idx < nr_ptes) {
>> +
>> +					/* Get length of sub batch */
>> +					len = modify_sub_batch(vma, tlb, addr, pte,
>> +							       oldpte, ptent, max_len,
>> +							       sub_batch_idx);
>> +					sub_batch_idx += len;
>> +					max_len -= len;
>> +				}
>> +			} else {
>> +				modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> +				if (pte_needs_flush(oldpte, ptent))
>> +					tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
>> +			}
> This if/else block and modify_sub_block() is all pretty ugly. I wonder if
> something like this would be neater:
>
> 			use_sub_batch = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> 					!pte_write(ptent) &&
> 					LIKE_can_change_pte_writable(vma, addr, ptent, folio);
>
> 			while (nr_ptes) {
> 				if (use_sub_batch)
> 					sub_nr_ptes = sub_batch(...);
> 				else
> 					sub_nr_ptes = nr_ptes;
>
> 				modify_prot_commit_ptes(vma, addr, pte, oldpte,
> 							ptent, sub_nr_ptes);
> 				if (pte_needs_flush(oldpte, ptent))
> 					tlb_flush_pte_range(tlb, addr,
> 						sub_nr_ptes * PAGE_SIZE);
>
> 				addr += sub_nr_ptes * PAGE_SIZE;
> 				pte += sub_nr_ptes;
> 				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
> 				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
> 				nr_ptes -= sub_nr_ptes;
> 				pages += sub_nr_ptes;
> 			}
>
> Where:
>
>   - LIKE_can_change_pte_writable() is similar to can_change_pte_writable() but
>     does everything except checking the per-page exclusive flag. Note that we
>     already have the folio so can pass that rather than calling vm_normal_page()
>     again.
>
>   - sub_batch() can be passed the folio and the index of the first page within
>     the folio and the max number of pages to check (nr_ptes). Then returns the
>     number of pages where exclusive flag is the same.
>
> Obviously they both need better names...
>
> Thanks,
> Ryan
>
>
>> +			pages += nr_ptes;
>>   		} else if (is_swap_pte(oldpte)) {
>>   			swp_entry_t entry = pte_to_swp_entry(oldpte);
>>   			pte_t newpte;


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching
  2025-05-21 13:26   ` Ryan Roberts
  2025-05-22  6:59     ` Dev Jain
@ 2025-05-22  7:11     ` Dev Jain
  2025-06-16 11:24     ` Dev Jain
  2 siblings, 0 replies; 34+ messages in thread
From: Dev Jain @ 2025-05-22  7:11 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 21/05/25 6:56 pm, Ryan Roberts wrote:
> On 19/05/2025 08:48, Dev Jain wrote:
>> Use folio_pte_batch to batch process a large folio. Reuse the folio from prot_numa
>> case if possible. Since modify_prot_start_ptes() gathers access/dirty bits,
>> it lets us batch around pte_needs_flush() (for parisc, the definition includes
>> the access bit).
> parisc's pte_needs_flush() is a different (static) function, just coincidentally
> named - it takes different arguments.
>
> But powerpc and x86 both specialize it and they consider the dirty bit. Both
> implementations look to me like they will function correctly but it is a bit
> ugly. They both conclude that if there is no change or if dirty has gone from
> clear to set, then no flush is needed. (flush only needed when dirty goes from
> set to clear). Given after your change, oldpte may have dirty set when it wasn't
> really set for the pte in question that means the function could detect no
> change when really its a clear -> set change; it gives the same answer in both
> cases so it's safe. Perhaps a bit fragile though...
>
>> For all cases other than the PageAnonExclusive case, if the case holds true
>> for one pte in the batch, one can confirm that that case will hold true for
>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty and access bits
>> across the batch, therefore batching across pte_dirty(): this is correct since
>> the dirty bit on the PTE really is just an indication that the folio got written
>> to, so even if the PTE is not actually dirty (but one of the PTEs in the batch is),
>> the wp-fault optimization can be made.
>> The crux now is how to batch around the PageAnonExclusive case; we must check
>> the corresponding condition for every single page. Therefore, from the large
>> folio batch, we process sub batches of ptes mapping pages with the same PageAnonExclusive
>> condition, and process that sub batch, then determine and process the next sub batch,
>> and so on. Note that this does not cause any extra overhead; if suppose the size of
>> the folio batch is 512, then the sub batch processing in total will take 512 iterations,
>> which is the same as what we would have done before.
> This commit log could do with some reformatting; blank lines between paragraphs
> and break at whatever the usual git commit log char limit is (72 chars?).
>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/mm.h |   7 ++-
>>   mm/mprotect.c      | 126 +++++++++++++++++++++++++++++++++++----------
>>   2 files changed, 104 insertions(+), 29 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 43748c8f3454..7d5b96f005dc 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2542,8 +2542,11 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
>>   #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>   					    MM_CP_UFFD_WP_RESOLVE)
>>   
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -			     pte_t pte);
>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>> +			     pte_t pte, int max_len, int *len);
>> +#define can_change_pte_writable(vma, addr, pte)	\
>> +	can_change_ptes_writable(vma, addr, pte, 1, NULL)
> nit: add an extra tab for readability:
>
> #define can_change_pte_writable(vma, addr, pte)	\
> 		can_change_ptes_writable(vma, addr, pte, 1, NULL)
>
>> +
>>   extern long change_protection(struct mmu_gather *tlb,
>>   			      struct vm_area_struct *vma, unsigned long start,
>>   			      unsigned long end, unsigned long cp_flags);
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 124612ce3d24..6cd8cdc168fa 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -40,25 +40,36 @@
>>   
>>   #include "internal.h"
>>   
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -			     pte_t pte)
>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>> +			     pte_t pte, int max_len, int *len)
>>   {
>>   	struct page *page;
>> +	bool temp_ret;
>> +	bool ret;
>> +	int i;
>>   
>> -	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>> -		return false;
>> +	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE))) {
>> +		ret = false;
>> +		goto out;
>> +	}
>>   
>>   	/* Don't touch entries that are not even readable. */
>> -	if (pte_protnone(pte))
>> -		return false;
>> +	if (pte_protnone(pte)) {
>> +		ret = false;
>> +		goto out;
>> +	}
>>   
>>   	/* Do we need write faults for softdirty tracking? */
>> -	if (pte_needs_soft_dirty_wp(vma, pte))
>> -		return false;
>> +	if (pte_needs_soft_dirty_wp(vma, pte)) {
>> +		ret = false;
>> +		goto out;
>> +	}
>>   
>>   	/* Do we need write faults for uffd-wp tracking? */
>> -	if (userfaultfd_pte_wp(vma, pte))
>> -		return false;
>> +	if (userfaultfd_pte_wp(vma, pte)) {
>> +		ret = false;
>> +		goto out;
>> +	}
>>   
>>   	if (!(vma->vm_flags & VM_SHARED)) {
>>   		/*
>> @@ -68,7 +79,19 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   		 * any additional checks while holding the PT lock.
>>   		 */
>>   		page = vm_normal_page(vma, addr, pte);
>> -		return page && PageAnon(page) && PageAnonExclusive(page);
>> +		ret = (page && PageAnon(page) && PageAnonExclusive(page));
>> +		if (!len)
>> +			return ret;
>> +
>> +		/* Check how many consecutive pages are AnonExclusive or not */
>> +		for (i = 1; i < max_len; ++i) {
>> +			++page;
>> +			temp_ret = (page && PageAnon(page) && PageAnonExclusive(page));
>> +			if (temp_ret != ret)
>> +				break;
>> +		}
>> +		*len = i;
>> +		return ret;
>>   	}
>>   
>>   	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>> @@ -80,21 +103,55 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   	 * FS was already notified and we can simply mark the PTE writable
>>   	 * just like the write-fault handler would do.
>>   	 */
>> -	return pte_dirty(pte);
>> +	ret = pte_dirty(pte);
>> +
>> +out:
>> +	/* The entire batch is guaranteed to have the same return value */
>> +	if (len)
>> +		*len = max_len;
>> +	return ret;
>>   }
>>   
>>   static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
>> -		pte_t pte, int max_nr_ptes)
>> +		pte_t pte, int max_nr_ptes, bool ignore_soft_dirty)
> We'll almost certainly want more flags here in future. I wonder if it's cleaner
> just to pass "fpb_t extra_flags" here instead of the bool, then OR them in. You
> can pass 0 or FPB_IGNORE_SOFT_DIRTY at your 2 callsites. No strong opinion
> though.
>
>>   {
>> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	fpb_t flags = FPB_IGNORE_DIRTY;
>>   
>> -	if (!folio_test_large(folio) || (max_nr_ptes == 1))
>> +	if (ignore_soft_dirty)
>> +		flags |= FPB_IGNORE_SOFT_DIRTY;
>> +
>> +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>   		return 1;
>>   
>>   	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>   			       NULL, NULL, NULL);
>>   }
>>   
>> +/**
>> + * modify_sub_batch - Identifies a sub-batch which has the same return value
>> + * of can_change_pte_writable(), from within a folio batch. max_len is the
>> + * max length of the possible sub-batch. sub_batch_idx is the offset from
>> + * the start of the original folio batch.
>> + */
>> +static int modify_sub_batch(struct vm_area_struct *vma, struct mmu_gather *tlb,
>> +		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
>> +		int max_len, int sub_batch_idx)
>> +{
>> +	unsigned long new_addr = addr + sub_batch_idx * PAGE_SIZE;
>> +	pte_t new_oldpte = pte_advance_pfn(oldpte, sub_batch_idx);
>> +	pte_t new_ptent = pte_advance_pfn(ptent, sub_batch_idx);
>> +	pte_t *new_ptep = ptep + sub_batch_idx;
>> +	int len = 1;
>> +
>> +	if (can_change_ptes_writable(vma, new_addr, new_ptent, max_len, &len))
>> +		new_ptent = pte_mkwrite(new_ptent, vma);
>> +
>> +	modify_prot_commit_ptes(vma, new_addr, new_ptep, new_oldpte, new_ptent, len);
>> +	if (pte_needs_flush(new_oldpte, new_ptent))
>> +		tlb_flush_pte_range(tlb, new_addr, len * PAGE_SIZE);
>> +	return len;
>> +}
>> +
>>   static long change_pte_range(struct mmu_gather *tlb,
>>   		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>   		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -106,7 +163,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>   	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>   	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> -	int nr_ptes;
>> +	int sub_batch_idx, max_len, len, nr_ptes;
>>   
>>   	tlb_change_page_size(tlb, PAGE_SIZE);
>>   	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> @@ -121,10 +178,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   	flush_tlb_batched_pending(vma->vm_mm);
>>   	arch_enter_lazy_mmu_mode();
>>   	do {
>> +		sub_batch_idx = 0;
>>   		nr_ptes = 1;
>>   		oldpte = ptep_get(pte);
>>   		if (pte_present(oldpte)) {
>>   			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>> +			struct folio *folio = NULL;
>>   			pte_t ptent;
>>   
>>   			/*
>> @@ -132,7 +191,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			 * pages. See similar comment in change_huge_pmd.
>>   			 */
>>   			if (prot_numa) {
>> -				struct folio *folio;
>>   				int nid;
>>   				bool toptier;
>>   
>> @@ -180,7 +238,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   				    toptier) {
>>   skip_batch:
>>   					nr_ptes = mprotect_batch(folio, addr, pte,
>> -								 oldpte, max_nr_ptes);
>> +								 oldpte, max_nr_ptes,
>> +								 true);
>>   					continue;
>>   				}
>>   				if (folio_use_access_time(folio))
>> @@ -188,6 +247,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   						jiffies_to_msecs(jiffies));
>>   			}
>>   
>> +			if (!folio)
>> +				folio = vm_normal_folio(vma, addr, oldpte);
>> +
>> +			nr_ptes = mprotect_batch(folio, addr, pte, oldpte,
>> +						 max_nr_ptes, false);
>>   			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>   			ptent = pte_modify(oldpte, newprot);
>>   
>> @@ -209,15 +273,23 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			 * example, if a PTE is already dirty and no other
>>   			 * COW or special handling is required.
>>   			 */
>> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> -			    !pte_write(ptent) &&
> Don't you need to keep the !pte_write(ptent) condition here? No need to sub-
> batch if write is already set.
>
>> -			    can_change_pte_writable(vma, addr, ptent))
>> -				ptent = pte_mkwrite(ptent, vma);
>> -
>> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> -			if (pte_needs_flush(oldpte, ptent))
>> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> -			pages++;
>> +			if (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) {
>> +				max_len = nr_ptes;
>> +				while (sub_batch_idx < nr_ptes) {
>> +
>> +					/* Get length of sub batch */
>> +					len = modify_sub_batch(vma, tlb, addr, pte,
>> +							       oldpte, ptent, max_len,
>> +							       sub_batch_idx);
>> +					sub_batch_idx += len;
>> +					max_len -= len;
>> +				}
>> +			} else {
>> +				modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> +				if (pte_needs_flush(oldpte, ptent))
>> +					tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
>> +			}
> This if/else block and modify_sub_block() is all pretty ugly. I wonder if
> something like this would be neater:
>
> 			use_sub_batch = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> 					!pte_write(ptent) &&
> 					LIKE_can_change_pte_writable(vma, addr, ptent, folio);
>
> 			while (nr_ptes) {
> 				if (use_sub_batch)
> 					sub_nr_ptes = sub_batch(...);
> 				else
> 					sub_nr_ptes = nr_ptes;
>
> 				modify_prot_commit_ptes(vma, addr, pte, oldpte,
> 							ptent, sub_nr_ptes);
> 				if (pte_needs_flush(oldpte, ptent))
> 					tlb_flush_pte_range(tlb, addr,
> 						sub_nr_ptes * PAGE_SIZE);
>
> 				addr += sub_nr_ptes * PAGE_SIZE;
> 				pte += sub_nr_ptes;
> 				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
> 				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
> 				nr_ptes -= sub_nr_ptes;
> 				pages += sub_nr_ptes;
> 			}
>
> Where:
>
>   - LIKE_can_change_pte_writable() is similar to can_change_pte_writable() but
>     does everything except checking the per-page exclusive flag. Note that we
>     already have the folio so can pass that rather than calling vm_normal_page()
>     again.


Hmm...all I can say is that I can *try* to come up with something 
cleaner, but can't

promise :) I'll try modifying your approach.


>
>   - sub_batch() can be passed the folio and the index of the first page within
>     the folio and the max number of pages to check (nr_ptes). Then returns the
>     number of pages where exclusive flag is the same.
>
> Obviously they both need better names...
>
> Thanks,
> Ryan
>
>
>> +			pages += nr_ptes;
>>   		} else if (is_swap_pte(oldpte)) {
>>   			swp_entry_t entry = pte_to_swp_entry(oldpte);
>>   			pte_t newpte;


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching
  2025-05-21 13:26   ` Ryan Roberts
  2025-05-22  6:59     ` Dev Jain
  2025-05-22  7:11     ` Dev Jain
@ 2025-06-16 11:24     ` Dev Jain
  2025-06-26  8:09       ` Ryan Roberts
  2 siblings, 1 reply; 34+ messages in thread
From: Dev Jain @ 2025-06-16 11:24 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 21/05/25 6:56 pm, Ryan Roberts wrote:
> On 19/05/2025 08:48, Dev Jain wrote:
>> Use folio_pte_batch to batch process a large folio. Reuse the folio from prot_numa
>> case if possible. Since modify_prot_start_ptes() gathers access/dirty bits,
>> it lets us batch around pte_needs_flush() (for parisc, the definition includes
>> the access bit).
> parisc's pte_needs_flush() is a different (static) function, just coincidentally
> named - it takes different arguments.
>
> But powerpc and x86 both specialize it and they consider the dirty bit. Both
> implementations look to me like they will function correctly but it is a bit
> ugly. They both conclude that if there is no change or if dirty has gone from
> clear to set, then no flush is needed. (flush only needed when dirty goes from
> set to clear). Given after your change, oldpte may have dirty set when it wasn't
> really set for the pte in question that means the function could detect no
> change when really its a clear -> set change; it gives the same answer in both
> cases so it's safe. Perhaps a bit fragile though...
>
>> For all cases other than the PageAnonExclusive case, if the case holds true
>> for one pte in the batch, one can confirm that that case will hold true for
>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty and access bits
>> across the batch, therefore batching across pte_dirty(): this is correct since
>> the dirty bit on the PTE really is just an indication that the folio got written
>> to, so even if the PTE is not actually dirty (but one of the PTEs in the batch is),
>> the wp-fault optimization can be made.
>> The crux now is how to batch around the PageAnonExclusive case; we must check
>> the corresponding condition for every single page. Therefore, from the large
>> folio batch, we process sub batches of ptes mapping pages with the same PageAnonExclusive
>> condition, and process that sub batch, then determine and process the next sub batch,
>> and so on. Note that this does not cause any extra overhead; if suppose the size of
>> the folio batch is 512, then the sub batch processing in total will take 512 iterations,
>> which is the same as what we would have done before.
> This commit log could do with some reformatting; blank lines between paragraphs
> and break at whatever the usual git commit log char limit is (72 chars?).
>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/mm.h |   7 ++-
>>   mm/mprotect.c      | 126 +++++++++++++++++++++++++++++++++++----------
>>   2 files changed, 104 insertions(+), 29 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 43748c8f3454..7d5b96f005dc 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2542,8 +2542,11 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
>>   #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>   					    MM_CP_UFFD_WP_RESOLVE)
>>   
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -			     pte_t pte);
>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>> +			     pte_t pte, int max_len, int *len);
>> +#define can_change_pte_writable(vma, addr, pte)	\
>> +	can_change_ptes_writable(vma, addr, pte, 1, NULL)
> nit: add an extra tab for readability:
>
> #define can_change_pte_writable(vma, addr, pte)	\
> 		can_change_ptes_writable(vma, addr, pte, 1, NULL)
>
>> +
>>   extern long change_protection(struct mmu_gather *tlb,
>>   			      struct vm_area_struct *vma, unsigned long start,
>>   			      unsigned long end, unsigned long cp_flags);
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 124612ce3d24..6cd8cdc168fa 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -40,25 +40,36 @@
>>   
>>   #include "internal.h"
>>   
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -			     pte_t pte)
>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>> +			     pte_t pte, int max_len, int *len)
>>   {
>>   	struct page *page;
>> +	bool temp_ret;
>> +	bool ret;
>> +	int i;
>>   
>> -	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>> -		return false;
>> +	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE))) {
>> +		ret = false;
>> +		goto out;
>> +	}
>>   
>>   	/* Don't touch entries that are not even readable. */
>> -	if (pte_protnone(pte))
>> -		return false;
>> +	if (pte_protnone(pte)) {
>> +		ret = false;
>> +		goto out;
>> +	}
>>   
>>   	/* Do we need write faults for softdirty tracking? */
>> -	if (pte_needs_soft_dirty_wp(vma, pte))
>> -		return false;
>> +	if (pte_needs_soft_dirty_wp(vma, pte)) {
>> +		ret = false;
>> +		goto out;
>> +	}
>>   
>>   	/* Do we need write faults for uffd-wp tracking? */
>> -	if (userfaultfd_pte_wp(vma, pte))
>> -		return false;
>> +	if (userfaultfd_pte_wp(vma, pte)) {
>> +		ret = false;
>> +		goto out;
>> +	}
>>   
>>   	if (!(vma->vm_flags & VM_SHARED)) {
>>   		/*
>> @@ -68,7 +79,19 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   		 * any additional checks while holding the PT lock.
>>   		 */
>>   		page = vm_normal_page(vma, addr, pte);
>> -		return page && PageAnon(page) && PageAnonExclusive(page);
>> +		ret = (page && PageAnon(page) && PageAnonExclusive(page));
>> +		if (!len)
>> +			return ret;
>> +
>> +		/* Check how many consecutive pages are AnonExclusive or not */
>> +		for (i = 1; i < max_len; ++i) {
>> +			++page;
>> +			temp_ret = (page && PageAnon(page) && PageAnonExclusive(page));
>> +			if (temp_ret != ret)
>> +				break;
>> +		}
>> +		*len = i;
>> +		return ret;
>>   	}
>>   
>>   	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>> @@ -80,21 +103,55 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   	 * FS was already notified and we can simply mark the PTE writable
>>   	 * just like the write-fault handler would do.
>>   	 */
>> -	return pte_dirty(pte);
>> +	ret = pte_dirty(pte);
>> +
>> +out:
>> +	/* The entire batch is guaranteed to have the same return value */
>> +	if (len)
>> +		*len = max_len;
>> +	return ret;
>>   }
>>   
>>   static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
>> -		pte_t pte, int max_nr_ptes)
>> +		pte_t pte, int max_nr_ptes, bool ignore_soft_dirty)
> We'll almost certainly want more flags here in future. I wonder if it's cleaner
> just to pass "fpb_t extra_flags" here instead of the bool, then OR them in. You
> can pass 0 or FPB_IGNORE_SOFT_DIRTY at your 2 callsites. No strong opinion
> though.
>
>>   {
>> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	fpb_t flags = FPB_IGNORE_DIRTY;
>>   
>> -	if (!folio_test_large(folio) || (max_nr_ptes == 1))
>> +	if (ignore_soft_dirty)
>> +		flags |= FPB_IGNORE_SOFT_DIRTY;
>> +
>> +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>   		return 1;
>>   
>>   	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>   			       NULL, NULL, NULL);
>>   }
>>   
>> +/**
>> + * modify_sub_batch - Identifies a sub-batch which has the same return value
>> + * of can_change_pte_writable(), from within a folio batch. max_len is the
>> + * max length of the possible sub-batch. sub_batch_idx is the offset from
>> + * the start of the original folio batch.
>> + */
>> +static int modify_sub_batch(struct vm_area_struct *vma, struct mmu_gather *tlb,
>> +		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
>> +		int max_len, int sub_batch_idx)
>> +{
>> +	unsigned long new_addr = addr + sub_batch_idx * PAGE_SIZE;
>> +	pte_t new_oldpte = pte_advance_pfn(oldpte, sub_batch_idx);
>> +	pte_t new_ptent = pte_advance_pfn(ptent, sub_batch_idx);
>> +	pte_t *new_ptep = ptep + sub_batch_idx;
>> +	int len = 1;
>> +
>> +	if (can_change_ptes_writable(vma, new_addr, new_ptent, max_len, &len))
>> +		new_ptent = pte_mkwrite(new_ptent, vma);
>> +
>> +	modify_prot_commit_ptes(vma, new_addr, new_ptep, new_oldpte, new_ptent, len);
>> +	if (pte_needs_flush(new_oldpte, new_ptent))
>> +		tlb_flush_pte_range(tlb, new_addr, len * PAGE_SIZE);
>> +	return len;
>> +}
>> +
>>   static long change_pte_range(struct mmu_gather *tlb,
>>   		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>   		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -106,7 +163,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>   	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>   	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> -	int nr_ptes;
>> +	int sub_batch_idx, max_len, len, nr_ptes;
>>   
>>   	tlb_change_page_size(tlb, PAGE_SIZE);
>>   	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> @@ -121,10 +178,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   	flush_tlb_batched_pending(vma->vm_mm);
>>   	arch_enter_lazy_mmu_mode();
>>   	do {
>> +		sub_batch_idx = 0;
>>   		nr_ptes = 1;
>>   		oldpte = ptep_get(pte);
>>   		if (pte_present(oldpte)) {
>>   			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>> +			struct folio *folio = NULL;
>>   			pte_t ptent;
>>   
>>   			/*
>> @@ -132,7 +191,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			 * pages. See similar comment in change_huge_pmd.
>>   			 */
>>   			if (prot_numa) {
>> -				struct folio *folio;
>>   				int nid;
>>   				bool toptier;
>>   
>> @@ -180,7 +238,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   				    toptier) {
>>   skip_batch:
>>   					nr_ptes = mprotect_batch(folio, addr, pte,
>> -								 oldpte, max_nr_ptes);
>> +								 oldpte, max_nr_ptes,
>> +								 true);
>>   					continue;
>>   				}
>>   				if (folio_use_access_time(folio))
>> @@ -188,6 +247,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   						jiffies_to_msecs(jiffies));
>>   			}
>>   
>> +			if (!folio)
>> +				folio = vm_normal_folio(vma, addr, oldpte);
>> +
>> +			nr_ptes = mprotect_batch(folio, addr, pte, oldpte,
>> +						 max_nr_ptes, false);
>>   			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>   			ptent = pte_modify(oldpte, newprot);
>>   
>> @@ -209,15 +273,23 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			 * example, if a PTE is already dirty and no other
>>   			 * COW or special handling is required.
>>   			 */
>> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> -			    !pte_write(ptent) &&
> Don't you need to keep the !pte_write(ptent) condition here? No need to sub-
> batch if write is already set.
>
>> -			    can_change_pte_writable(vma, addr, ptent))
>> -				ptent = pte_mkwrite(ptent, vma);
>> -
>> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> -			if (pte_needs_flush(oldpte, ptent))
>> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> -			pages++;
>> +			if (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) {
>> +				max_len = nr_ptes;
>> +				while (sub_batch_idx < nr_ptes) {
>> +
>> +					/* Get length of sub batch */
>> +					len = modify_sub_batch(vma, tlb, addr, pte,
>> +							       oldpte, ptent, max_len,
>> +							       sub_batch_idx);
>> +					sub_batch_idx += len;
>> +					max_len -= len;
>> +				}
>> +			} else {
>> +				modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> +				if (pte_needs_flush(oldpte, ptent))
>> +					tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
>> +			}
> This if/else block and modify_sub_block() is all pretty ugly. I wonder if
> something like this would be neater:
>
> 			use_sub_batch = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> 					!pte_write(ptent) &&
> 					LIKE_can_change_pte_writable(vma, addr, ptent, folio);
>
> 			while (nr_ptes) {
> 				if (use_sub_batch)
> 					sub_nr_ptes = sub_batch(...);
> 				else
> 					sub_nr_ptes = nr_ptes;
>
> 				modify_prot_commit_ptes(vma, addr, pte, oldpte,
> 							ptent, sub_nr_ptes);
> 				if (pte_needs_flush(oldpte, ptent))
> 					tlb_flush_pte_range(tlb, addr,
> 						sub_nr_ptes * PAGE_SIZE);
>
> 				addr += sub_nr_ptes * PAGE_SIZE;
> 				pte += sub_nr_ptes;
> 				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
> 				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
> 				nr_ptes -= sub_nr_ptes;
> 				pages += sub_nr_ptes;
> 			}
>
> Where:
>
>   - LIKE_can_change_pte_writable() is similar to can_change_pte_writable() but
>     does everything except checking the per-page exclusive flag. Note that we
>     already have the folio so can pass that rather than calling vm_normal_page()
>     again.
>
>   - sub_batch() can be passed the folio and the index of the first page within
>     the folio and the max number of pages to check (nr_ptes). Then returns the
>     number of pages where exclusive flag is the same.
>
> Obviously they both need better names...

I cannot figure a nice way of implementing your suggestion. We need to get the length
of the sub batch and the true/false return value from can_change_ptes_writable, and
we also need to call pte_mkwrite in case the ret is true. Doing that in your suggested
way seems harder to me, along with the fact that it is causing indentation problem by
one extra tab compared to my current implementation.

I don't know, my method looks pretty neat to me : )

>
> Thanks,
> Ryan
>
>
>> +			pages += nr_ptes;
>>   		} else if (is_swap_pte(oldpte)) {
>>   			swp_entry_t entry = pte_to_swp_entry(oldpte);
>>   			pte_t newpte;


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching
  2025-06-16 11:24     ` Dev Jain
@ 2025-06-26  8:09       ` Ryan Roberts
  2025-06-27  4:55         ` Dev Jain
  0 siblings, 1 reply; 34+ messages in thread
From: Ryan Roberts @ 2025-06-26  8:09 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 16/06/2025 12:24, Dev Jain wrote:
> 
> On 21/05/25 6:56 pm, Ryan Roberts wrote:
>> On 19/05/2025 08:48, Dev Jain wrote:
>>> Use folio_pte_batch to batch process a large folio. Reuse the folio from
>>> prot_numa
>>> case if possible. Since modify_prot_start_ptes() gathers access/dirty bits,
>>> it lets us batch around pte_needs_flush() (for parisc, the definition includes
>>> the access bit).
>> parisc's pte_needs_flush() is a different (static) function, just coincidentally
>> named - it takes different arguments.
>>
>> But powerpc and x86 both specialize it and they consider the dirty bit. Both
>> implementations look to me like they will function correctly but it is a bit
>> ugly. They both conclude that if there is no change or if dirty has gone from
>> clear to set, then no flush is needed. (flush only needed when dirty goes from
>> set to clear). Given after your change, oldpte may have dirty set when it wasn't
>> really set for the pte in question that means the function could detect no
>> change when really its a clear -> set change; it gives the same answer in both
>> cases so it's safe. Perhaps a bit fragile though...
>>
>>> For all cases other than the PageAnonExclusive case, if the case holds true
>>> for one pte in the batch, one can confirm that that case will hold true for
>>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty and access
>>> bits
>>> across the batch, therefore batching across pte_dirty(): this is correct since
>>> the dirty bit on the PTE really is just an indication that the folio got written
>>> to, so even if the PTE is not actually dirty (but one of the PTEs in the
>>> batch is),
>>> the wp-fault optimization can be made.
>>> The crux now is how to batch around the PageAnonExclusive case; we must check
>>> the corresponding condition for every single page. Therefore, from the large
>>> folio batch, we process sub batches of ptes mapping pages with the same
>>> PageAnonExclusive
>>> condition, and process that sub batch, then determine and process the next
>>> sub batch,
>>> and so on. Note that this does not cause any extra overhead; if suppose the
>>> size of
>>> the folio batch is 512, then the sub batch processing in total will take 512
>>> iterations,
>>> which is the same as what we would have done before.
>> This commit log could do with some reformatting; blank lines between paragraphs
>> and break at whatever the usual git commit log char limit is (72 chars?).
>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>   include/linux/mm.h |   7 ++-
>>>   mm/mprotect.c      | 126 +++++++++++++++++++++++++++++++++++----------
>>>   2 files changed, 104 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 43748c8f3454..7d5b96f005dc 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2542,8 +2542,11 @@ int get_cmdline(struct task_struct *task, char
>>> *buffer, int buflen);
>>>   #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>>                           MM_CP_UFFD_WP_RESOLVE)
>>>   -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> -                 pte_t pte);
>>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>>> +                 pte_t pte, int max_len, int *len);
>>> +#define can_change_pte_writable(vma, addr, pte)    \
>>> +    can_change_ptes_writable(vma, addr, pte, 1, NULL)
>> nit: add an extra tab for readability:
>>
>> #define can_change_pte_writable(vma, addr, pte)    \
>>         can_change_ptes_writable(vma, addr, pte, 1, NULL)
>>
>>> +
>>>   extern long change_protection(struct mmu_gather *tlb,
>>>                     struct vm_area_struct *vma, unsigned long start,
>>>                     unsigned long end, unsigned long cp_flags);
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 124612ce3d24..6cd8cdc168fa 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -40,25 +40,36 @@
>>>     #include "internal.h"
>>>   -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> -                 pte_t pte)
>>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>>> +                 pte_t pte, int max_len, int *len)
>>>   {
>>>       struct page *page;
>>> +    bool temp_ret;
>>> +    bool ret;
>>> +    int i;
>>>   -    if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>>> -        return false;
>>> +    if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE))) {
>>> +        ret = false;
>>> +        goto out;
>>> +    }
>>>         /* Don't touch entries that are not even readable. */
>>> -    if (pte_protnone(pte))
>>> -        return false;
>>> +    if (pte_protnone(pte)) {
>>> +        ret = false;
>>> +        goto out;
>>> +    }
>>>         /* Do we need write faults for softdirty tracking? */
>>> -    if (pte_needs_soft_dirty_wp(vma, pte))
>>> -        return false;
>>> +    if (pte_needs_soft_dirty_wp(vma, pte)) {
>>> +        ret = false;
>>> +        goto out;
>>> +    }
>>>         /* Do we need write faults for uffd-wp tracking? */
>>> -    if (userfaultfd_pte_wp(vma, pte))
>>> -        return false;
>>> +    if (userfaultfd_pte_wp(vma, pte)) {
>>> +        ret = false;
>>> +        goto out;
>>> +    }
>>>         if (!(vma->vm_flags & VM_SHARED)) {
>>>           /*
>>> @@ -68,7 +79,19 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
>>> unsigned long addr,
>>>            * any additional checks while holding the PT lock.
>>>            */
>>>           page = vm_normal_page(vma, addr, pte);
>>> -        return page && PageAnon(page) && PageAnonExclusive(page);
>>> +        ret = (page && PageAnon(page) && PageAnonExclusive(page));
>>> +        if (!len)
>>> +            return ret;
>>> +
>>> +        /* Check how many consecutive pages are AnonExclusive or not */
>>> +        for (i = 1; i < max_len; ++i) {
>>> +            ++page;
>>> +            temp_ret = (page && PageAnon(page) && PageAnonExclusive(page));
>>> +            if (temp_ret != ret)
>>> +                break;
>>> +        }
>>> +        *len = i;
>>> +        return ret;
>>>       }
>>>         VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>>> @@ -80,21 +103,55 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
>>> unsigned long addr,
>>>        * FS was already notified and we can simply mark the PTE writable
>>>        * just like the write-fault handler would do.
>>>        */
>>> -    return pte_dirty(pte);
>>> +    ret = pte_dirty(pte);
>>> +
>>> +out:
>>> +    /* The entire batch is guaranteed to have the same return value */
>>> +    if (len)
>>> +        *len = max_len;
>>> +    return ret;
>>>   }
>>>     static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t
>>> *ptep,
>>> -        pte_t pte, int max_nr_ptes)
>>> +        pte_t pte, int max_nr_ptes, bool ignore_soft_dirty)
>> We'll almost certainly want more flags here in future. I wonder if it's cleaner
>> just to pass "fpb_t extra_flags" here instead of the bool, then OR them in. You
>> can pass 0 or FPB_IGNORE_SOFT_DIRTY at your 2 callsites. No strong opinion
>> though.
>>
>>>   {
>>> -    const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> +    fpb_t flags = FPB_IGNORE_DIRTY;
>>>   -    if (!folio_test_large(folio) || (max_nr_ptes == 1))
>>> +    if (ignore_soft_dirty)
>>> +        flags |= FPB_IGNORE_SOFT_DIRTY;
>>> +
>>> +    if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>>           return 1;
>>>         return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>>                      NULL, NULL, NULL);
>>>   }
>>>   +/**
>>> + * modify_sub_batch - Identifies a sub-batch which has the same return value
>>> + * of can_change_pte_writable(), from within a folio batch. max_len is the
>>> + * max length of the possible sub-batch. sub_batch_idx is the offset from
>>> + * the start of the original folio batch.
>>> + */
>>> +static int modify_sub_batch(struct vm_area_struct *vma, struct mmu_gather *tlb,
>>> +        unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
>>> +        int max_len, int sub_batch_idx)
>>> +{
>>> +    unsigned long new_addr = addr + sub_batch_idx * PAGE_SIZE;
>>> +    pte_t new_oldpte = pte_advance_pfn(oldpte, sub_batch_idx);
>>> +    pte_t new_ptent = pte_advance_pfn(ptent, sub_batch_idx);
>>> +    pte_t *new_ptep = ptep + sub_batch_idx;
>>> +    int len = 1;
>>> +
>>> +    if (can_change_ptes_writable(vma, new_addr, new_ptent, max_len, &len))
>>> +        new_ptent = pte_mkwrite(new_ptent, vma);
>>> +
>>> +    modify_prot_commit_ptes(vma, new_addr, new_ptep, new_oldpte, new_ptent,
>>> len);
>>> +    if (pte_needs_flush(new_oldpte, new_ptent))
>>> +        tlb_flush_pte_range(tlb, new_addr, len * PAGE_SIZE);
>>> +    return len;
>>> +}
>>> +
>>>   static long change_pte_range(struct mmu_gather *tlb,
>>>           struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>>           unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>>> @@ -106,7 +163,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>       bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>>       bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>>       bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>>> -    int nr_ptes;
>>> +    int sub_batch_idx, max_len, len, nr_ptes;
>>>         tlb_change_page_size(tlb, PAGE_SIZE);
>>>       pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>>> @@ -121,10 +178,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>       flush_tlb_batched_pending(vma->vm_mm);
>>>       arch_enter_lazy_mmu_mode();
>>>       do {
>>> +        sub_batch_idx = 0;
>>>           nr_ptes = 1;
>>>           oldpte = ptep_get(pte);
>>>           if (pte_present(oldpte)) {
>>>               int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>> +            struct folio *folio = NULL;
>>>               pte_t ptent;
>>>                 /*
>>> @@ -132,7 +191,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                * pages. See similar comment in change_huge_pmd.
>>>                */
>>>               if (prot_numa) {
>>> -                struct folio *folio;
>>>                   int nid;
>>>                   bool toptier;
>>>   @@ -180,7 +238,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                       toptier) {
>>>   skip_batch:
>>>                       nr_ptes = mprotect_batch(folio, addr, pte,
>>> -                                 oldpte, max_nr_ptes);
>>> +                                 oldpte, max_nr_ptes,
>>> +                                 true);
>>>                       continue;
>>>                   }
>>>                   if (folio_use_access_time(folio))
>>> @@ -188,6 +247,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                           jiffies_to_msecs(jiffies));
>>>               }
>>>   +            if (!folio)
>>> +                folio = vm_normal_folio(vma, addr, oldpte);
>>> +
>>> +            nr_ptes = mprotect_batch(folio, addr, pte, oldpte,
>>> +                         max_nr_ptes, false);
>>>               oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>>               ptent = pte_modify(oldpte, newprot);
>>>   @@ -209,15 +273,23 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                * example, if a PTE is already dirty and no other
>>>                * COW or special handling is required.
>>>                */
>>> -            if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>> -                !pte_write(ptent) &&
>> Don't you need to keep the !pte_write(ptent) condition here? No need to sub-
>> batch if write is already set.
>>
>>> -                can_change_pte_writable(vma, addr, ptent))
>>> -                ptent = pte_mkwrite(ptent, vma);
>>> -
>>> -            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>> -            if (pte_needs_flush(oldpte, ptent))
>>> -                tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>> -            pages++;
>>> +            if (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) {
>>> +                max_len = nr_ptes;
>>> +                while (sub_batch_idx < nr_ptes) {
>>> +
>>> +                    /* Get length of sub batch */
>>> +                    len = modify_sub_batch(vma, tlb, addr, pte,
>>> +                                   oldpte, ptent, max_len,
>>> +                                   sub_batch_idx);
>>> +                    sub_batch_idx += len;
>>> +                    max_len -= len;
>>> +                }
>>> +            } else {
>>> +                modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent,
>>> nr_ptes);
>>> +                if (pte_needs_flush(oldpte, ptent))
>>> +                    tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
>>> +            }
>> This if/else block and modify_sub_block() is all pretty ugly. I wonder if
>> something like this would be neater:
>>
>>             use_sub_batch = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>                     !pte_write(ptent) &&
>>                     LIKE_can_change_pte_writable(vma, addr, ptent, folio);
>>
>>             while (nr_ptes) {
>>                 if (use_sub_batch)
>>                     sub_nr_ptes = sub_batch(...);
>>                 else
>>                     sub_nr_ptes = nr_ptes;
>>
>>                 modify_prot_commit_ptes(vma, addr, pte, oldpte,
>>                             ptent, sub_nr_ptes);
>>                 if (pte_needs_flush(oldpte, ptent))
>>                     tlb_flush_pte_range(tlb, addr,
>>                         sub_nr_ptes * PAGE_SIZE);
>>
>>                 addr += sub_nr_ptes * PAGE_SIZE;
>>                 pte += sub_nr_ptes;
>>                 oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>>                 ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>>                 nr_ptes -= sub_nr_ptes;
>>                 pages += sub_nr_ptes;
>>             }
>>
>> Where:
>>
>>   - LIKE_can_change_pte_writable() is similar to can_change_pte_writable() but
>>     does everything except checking the per-page exclusive flag. Note that we
>>     already have the folio so can pass that rather than calling vm_normal_page()
>>     again.
>>
>>   - sub_batch() can be passed the folio and the index of the first page within
>>     the folio and the max number of pages to check (nr_ptes). Then returns the
>>     number of pages where exclusive flag is the same.
>>
>> Obviously they both need better names...
> 
> I cannot figure a nice way of implementing your suggestion. We need to get the
> length
> of the sub batch and the true/false return value from can_change_ptes_writable, and
> we also need to call pte_mkwrite in case the ret is true. Doing that in your
> suggested
> way seems harder to me, along with the fact that it is causing indentation
> problem by
> one extra tab compared to my current implementation.
> 
> I don't know, my method looks pretty neat to me : )


How about something like this (compile-tested only)? Personally I find this
easier to follow:

---8<---
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 124612ce3d24..97ccc2a92a7b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -40,35 +40,47 @@

 #include "internal.h"

-bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t pte)
-{
-	struct page *page;
+enum tristate {
+	TRI_FALSE = 0,
+	TRI_TRUE = 1,
+	TRI_MAYBE = -1,
+};

+/*
+ * Returns enum tristate indicating whether the pte can be changed to writable.
+ * If TRI_MAYBE is returned, then the folio is anonymous and the user must
+ * additionally check PageAnonExclusive() for every page in the desired range.
+ */
+static int maybe_change_pte_writable(struct vm_area_struct *vma,
+				     unsigned long addr, pte_t pte,
+				     struct folio *folio)
+{
 	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
-		return false;
+		return TRI_FALSE;

 	/* Don't touch entries that are not even readable. */
 	if (pte_protnone(pte))
-		return false;
+		return TRI_FALSE;

 	/* Do we need write faults for softdirty tracking? */
 	if (pte_needs_soft_dirty_wp(vma, pte))
-		return false;
+		return TRI_FALSE;

 	/* Do we need write faults for uffd-wp tracking? */
 	if (userfaultfd_pte_wp(vma, pte))
-		return false;
+		return TRI_FALSE;

 	if (!(vma->vm_flags & VM_SHARED)) {
 		/*
 		 * Writable MAP_PRIVATE mapping: We can only special-case on
 		 * exclusive anonymous pages, because we know that our
 		 * write-fault handler similarly would map them writable without
-		 * any additional checks while holding the PT lock.
+		 * any additional checks while holding the PT lock. So if the
+		 * folio is not anonymous, we know we cannot change pte
+		 * writable. If it is anonymous then the caller must further
+		 * check that the page is AnonExclusive().
 		 */
-		page = vm_normal_page(vma, addr, pte);
-		return page && PageAnon(page) && PageAnonExclusive(page);
+		return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
 	}

 	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
@@ -80,15 +92,60 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
unsigned long addr,
 	 * FS was already notified and we can simply mark the PTE writable
 	 * just like the write-fault handler would do.
 	 */
-	return pte_dirty(pte);
+	return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
+}
+
+/*
+ * Returns the number of pages within the folio, starting from the page
+ * indicated by pgidx and up to pgidx + max_nr, that have the same value of
+ * PageAnonExclusive(). Must only be called for anonymous folios. Value of
+ * PageAnonExclusive() is returned in *exclusive.
+ */
+static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
+				bool *exclusive)
+{
+	VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
+	VM_WARN_ON_FOLIO(folio_nr_pages(folio) < pgidx + max_nr, folio);
+
+	struct page *page = folio_page(folio, pgidx++);
+	int nr = 1;
+
+	*exclusive = PageAnonExclusive(page);
+
+	while (nr < max_nr) {
+		page = folio_page(folio, pgidx++);
+		if (*exclusive != PageAnonExclusive(page))
+			break;
+		nr++;
+	}
+
+	return nr;
+}
+
+bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte)
+{
+	int ret;
+	struct page *page;
+
+	ret = maybe_change_pte_writable(vma, addr, pte, NULL);
+	if (ret == TRI_MAYBE) {
+		page = vm_normal_page(vma, addr, pte);
+		ret = page && PageAnon(page) && PageAnonExclusive(page);
+	}
+
+	return ret;
 }

 static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
-		pte_t pte, int max_nr_ptes)
+		pte_t pte, int max_nr_ptes, bool ignore_soft_dirty)
 {
-	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+	fpb_t flags = FPB_IGNORE_DIRTY;

-	if (!folio_test_large(folio) || (max_nr_ptes == 1))
+	if (ignore_soft_dirty)
+		flags |= FPB_IGNORE_SOFT_DIRTY;
+
+	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
 		return 1;

 	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
@@ -125,14 +182,17 @@ static long change_pte_range(struct mmu_gather *tlb,
 		oldpte = ptep_get(pte);
 		if (pte_present(oldpte)) {
 			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
-			pte_t ptent;
+			struct folio *folio = NULL;
+			int sub_nr_ptes, pgidx;
+			pte_t ptent, newpte;
+			bool sub_set_write;
+			int set_write;

 			/*
 			 * Avoid trapping faults against the zero or KSM
 			 * pages. See similar comment in change_huge_pmd.
 			 */
 			if (prot_numa) {
-				struct folio *folio;
 				int nid;
 				bool toptier;

@@ -180,7 +240,8 @@ static long change_pte_range(struct mmu_gather *tlb,
 				    toptier) {
 skip_batch:
 					nr_ptes = mprotect_batch(folio, addr, pte,
-								 oldpte, max_nr_ptes);
+								 oldpte, max_nr_ptes,
+								 true);
 					continue;
 				}
 				if (folio_use_access_time(folio))
@@ -188,6 +249,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 						jiffies_to_msecs(jiffies));
 			}

+			if (!folio)
+				folio = vm_normal_folio(vma, addr, oldpte);
+
+			nr_ptes = mprotect_batch(folio, addr, pte, oldpte,
+						 max_nr_ptes, false);
 			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
 			ptent = pte_modify(oldpte, newprot);

@@ -209,15 +275,38 @@ static long change_pte_range(struct mmu_gather *tlb,
 			 * example, if a PTE is already dirty and no other
 			 * COW or special handling is required.
 			 */
-			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
-			    !pte_write(ptent) &&
-			    can_change_pte_writable(vma, addr, ptent))
-				ptent = pte_mkwrite(ptent, vma);
-
-			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
-			if (pte_needs_flush(oldpte, ptent))
-				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
-			pages++;
+			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
+				    !pte_write(ptent) &&
+				    maybe_change_pte_writable(vma, addr, ptent, folio);
+
+			while (nr_ptes) {
+				if (set_write == TRI_MAYBE) {
+					pgidx = folio_pfn(folio) - pte_pfn(ptent);
+					sub_nr_ptes = anon_exclusive_batch(folio,
+						pgidx, nr_ptes, &sub_set_write);
+				} else {
+					sub_nr_ptes = nr_ptes;
+					sub_set_write = set_write == TRI_TRUE;
+				}
+
+				if (sub_set_write)
+					newpte = pte_mkwrite(ptent, vma);
+				else
+					newpte = ptent;
+
+				modify_prot_commit_ptes(vma, addr, pte, oldpte,
+							newpte, sub_nr_ptes);
+				if (pte_needs_flush(oldpte, newpte))
+					tlb_flush_pte_range(tlb, addr,
+						sub_nr_ptes * PAGE_SIZE);
+
+				addr += sub_nr_ptes * PAGE_SIZE;
+				pte += sub_nr_ptes;
+				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
+				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
+				nr_ptes -= sub_nr_ptes;
+				pages += sub_nr_ptes;
+			}
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 			pte_t newpte;
---8<---

Thanks,
Ryan



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching
  2025-06-26  8:09       ` Ryan Roberts
@ 2025-06-27  4:55         ` Dev Jain
  0 siblings, 0 replies; 34+ messages in thread
From: Dev Jain @ 2025-06-27  4:55 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 26/06/25 1:39 pm, Ryan Roberts wrote:
> On 16/06/2025 12:24, Dev Jain wrote:
>> On 21/05/25 6:56 pm, Ryan Roberts wrote:
>>> On 19/05/2025 08:48, Dev Jain wrote:
>>>> Use folio_pte_batch to batch process a large folio. Reuse the folio from
>>>> prot_numa
>>>> case if possible. Since modify_prot_start_ptes() gathers access/dirty bits,
>>>> it lets us batch around pte_needs_flush() (for parisc, the definition includes
>>>> the access bit).
>>> parisc's pte_needs_flush() is a different (static) function, just coincidentally
>>> named - it takes different arguments.
>>>
>>> But powerpc and x86 both specialize it and they consider the dirty bit. Both
>>> implementations look to me like they will function correctly but it is a bit
>>> ugly. They both conclude that if there is no change or if dirty has gone from
>>> clear to set, then no flush is needed. (flush only needed when dirty goes from
>>> set to clear). Given after your change, oldpte may have dirty set when it wasn't
>>> really set for the pte in question that means the function could detect no
>>> change when really its a clear -> set change; it gives the same answer in both
>>> cases so it's safe. Perhaps a bit fragile though...
>>>
>>>> For all cases other than the PageAnonExclusive case, if the case holds true
>>>> for one pte in the batch, one can confirm that that case will hold true for
>>>> other ptes in the batch too; for pte_needs_soft_dirty_wp(), we do not pass
>>>> FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty and access
>>>> bits
>>>> across the batch, therefore batching across pte_dirty(): this is correct since
>>>> the dirty bit on the PTE really is just an indication that the folio got written
>>>> to, so even if the PTE is not actually dirty (but one of the PTEs in the
>>>> batch is),
>>>> the wp-fault optimization can be made.
>>>> The crux now is how to batch around the PageAnonExclusive case; we must check
>>>> the corresponding condition for every single page. Therefore, from the large
>>>> folio batch, we process sub batches of ptes mapping pages with the same
>>>> PageAnonExclusive
>>>> condition, and process that sub batch, then determine and process the next
>>>> sub batch,
>>>> and so on. Note that this does not cause any extra overhead; if suppose the
>>>> size of
>>>> the folio batch is 512, then the sub batch processing in total will take 512
>>>> iterations,
>>>> which is the same as what we would have done before.
>>> This commit log could do with some reformatting; blank lines between paragraphs
>>> and break at whatever the usual git commit log char limit is (72 chars?).
>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>    include/linux/mm.h |   7 ++-
>>>>    mm/mprotect.c      | 126 +++++++++++++++++++++++++++++++++++----------
>>>>    2 files changed, 104 insertions(+), 29 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 43748c8f3454..7d5b96f005dc 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -2542,8 +2542,11 @@ int get_cmdline(struct task_struct *task, char
>>>> *buffer, int buflen);
>>>>    #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>>>                            MM_CP_UFFD_WP_RESOLVE)
>>>>    -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> -                 pte_t pte);
>>>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> +                 pte_t pte, int max_len, int *len);
>>>> +#define can_change_pte_writable(vma, addr, pte)    \
>>>> +    can_change_ptes_writable(vma, addr, pte, 1, NULL)
>>> nit: add an extra tab for readability:
>>>
>>> #define can_change_pte_writable(vma, addr, pte)    \
>>>          can_change_ptes_writable(vma, addr, pte, 1, NULL)
>>>
>>>> +
>>>>    extern long change_protection(struct mmu_gather *tlb,
>>>>                      struct vm_area_struct *vma, unsigned long start,
>>>>                      unsigned long end, unsigned long cp_flags);
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index 124612ce3d24..6cd8cdc168fa 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -40,25 +40,36 @@
>>>>      #include "internal.h"
>>>>    -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> -                 pte_t pte)
>>>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> +                 pte_t pte, int max_len, int *len)
>>>>    {
>>>>        struct page *page;
>>>> +    bool temp_ret;
>>>> +    bool ret;
>>>> +    int i;
>>>>    -    if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>>>> -        return false;
>>>> +    if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE))) {
>>>> +        ret = false;
>>>> +        goto out;
>>>> +    }
>>>>          /* Don't touch entries that are not even readable. */
>>>> -    if (pte_protnone(pte))
>>>> -        return false;
>>>> +    if (pte_protnone(pte)) {
>>>> +        ret = false;
>>>> +        goto out;
>>>> +    }
>>>>          /* Do we need write faults for softdirty tracking? */
>>>> -    if (pte_needs_soft_dirty_wp(vma, pte))
>>>> -        return false;
>>>> +    if (pte_needs_soft_dirty_wp(vma, pte)) {
>>>> +        ret = false;
>>>> +        goto out;
>>>> +    }
>>>>          /* Do we need write faults for uffd-wp tracking? */
>>>> -    if (userfaultfd_pte_wp(vma, pte))
>>>> -        return false;
>>>> +    if (userfaultfd_pte_wp(vma, pte)) {
>>>> +        ret = false;
>>>> +        goto out;
>>>> +    }
>>>>          if (!(vma->vm_flags & VM_SHARED)) {
>>>>            /*
>>>> @@ -68,7 +79,19 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
>>>> unsigned long addr,
>>>>             * any additional checks while holding the PT lock.
>>>>             */
>>>>            page = vm_normal_page(vma, addr, pte);
>>>> -        return page && PageAnon(page) && PageAnonExclusive(page);
>>>> +        ret = (page && PageAnon(page) && PageAnonExclusive(page));
>>>> +        if (!len)
>>>> +            return ret;
>>>> +
>>>> +        /* Check how many consecutive pages are AnonExclusive or not */
>>>> +        for (i = 1; i < max_len; ++i) {
>>>> +            ++page;
>>>> +            temp_ret = (page && PageAnon(page) && PageAnonExclusive(page));
>>>> +            if (temp_ret != ret)
>>>> +                break;
>>>> +        }
>>>> +        *len = i;
>>>> +        return ret;
>>>>        }
>>>>          VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>>>> @@ -80,21 +103,55 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
>>>> unsigned long addr,
>>>>         * FS was already notified and we can simply mark the PTE writable
>>>>         * just like the write-fault handler would do.
>>>>         */
>>>> -    return pte_dirty(pte);
>>>> +    ret = pte_dirty(pte);
>>>> +
>>>> +out:
>>>> +    /* The entire batch is guaranteed to have the same return value */
>>>> +    if (len)
>>>> +        *len = max_len;
>>>> +    return ret;
>>>>    }
>>>>      static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t
>>>> *ptep,
>>>> -        pte_t pte, int max_nr_ptes)
>>>> +        pte_t pte, int max_nr_ptes, bool ignore_soft_dirty)
>>> We'll almost certainly want more flags here in future. I wonder if it's cleaner
>>> just to pass "fpb_t extra_flags" here instead of the bool, then OR them in. You
>>> can pass 0 or FPB_IGNORE_SOFT_DIRTY at your 2 callsites. No strong opinion
>>> though.
>>>
>>>>    {
>>>> -    const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>> +    fpb_t flags = FPB_IGNORE_DIRTY;
>>>>    -    if (!folio_test_large(folio) || (max_nr_ptes == 1))
>>>> +    if (ignore_soft_dirty)
>>>> +        flags |= FPB_IGNORE_SOFT_DIRTY;
>>>> +
>>>> +    if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>>>>            return 1;
>>>>          return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
>>>>                       NULL, NULL, NULL);
>>>>    }
>>>>    +/**
>>>> + * modify_sub_batch - Identifies a sub-batch which has the same return value
>>>> + * of can_change_pte_writable(), from within a folio batch. max_len is the
>>>> + * max length of the possible sub-batch. sub_batch_idx is the offset from
>>>> + * the start of the original folio batch.
>>>> + */
>>>> +static int modify_sub_batch(struct vm_area_struct *vma, struct mmu_gather *tlb,
>>>> +        unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
>>>> +        int max_len, int sub_batch_idx)
>>>> +{
>>>> +    unsigned long new_addr = addr + sub_batch_idx * PAGE_SIZE;
>>>> +    pte_t new_oldpte = pte_advance_pfn(oldpte, sub_batch_idx);
>>>> +    pte_t new_ptent = pte_advance_pfn(ptent, sub_batch_idx);
>>>> +    pte_t *new_ptep = ptep + sub_batch_idx;
>>>> +    int len = 1;
>>>> +
>>>> +    if (can_change_ptes_writable(vma, new_addr, new_ptent, max_len, &len))
>>>> +        new_ptent = pte_mkwrite(new_ptent, vma);
>>>> +
>>>> +    modify_prot_commit_ptes(vma, new_addr, new_ptep, new_oldpte, new_ptent,
>>>> len);
>>>> +    if (pte_needs_flush(new_oldpte, new_ptent))
>>>> +        tlb_flush_pte_range(tlb, new_addr, len * PAGE_SIZE);
>>>> +    return len;
>>>> +}
>>>> +
>>>>    static long change_pte_range(struct mmu_gather *tlb,
>>>>            struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>>>            unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>>>> @@ -106,7 +163,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>        bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>>>        bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>>>        bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>>>> -    int nr_ptes;
>>>> +    int sub_batch_idx, max_len, len, nr_ptes;
>>>>          tlb_change_page_size(tlb, PAGE_SIZE);
>>>>        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>>>> @@ -121,10 +178,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>        flush_tlb_batched_pending(vma->vm_mm);
>>>>        arch_enter_lazy_mmu_mode();
>>>>        do {
>>>> +        sub_batch_idx = 0;
>>>>            nr_ptes = 1;
>>>>            oldpte = ptep_get(pte);
>>>>            if (pte_present(oldpte)) {
>>>>                int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>>>> +            struct folio *folio = NULL;
>>>>                pte_t ptent;
>>>>                  /*
>>>> @@ -132,7 +191,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>                 * pages. See similar comment in change_huge_pmd.
>>>>                 */
>>>>                if (prot_numa) {
>>>> -                struct folio *folio;
>>>>                    int nid;
>>>>                    bool toptier;
>>>>    @@ -180,7 +238,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>                        toptier) {
>>>>    skip_batch:
>>>>                        nr_ptes = mprotect_batch(folio, addr, pte,
>>>> -                                 oldpte, max_nr_ptes);
>>>> +                                 oldpte, max_nr_ptes,
>>>> +                                 true);
>>>>                        continue;
>>>>                    }
>>>>                    if (folio_use_access_time(folio))
>>>> @@ -188,6 +247,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>                            jiffies_to_msecs(jiffies));
>>>>                }
>>>>    +            if (!folio)
>>>> +                folio = vm_normal_folio(vma, addr, oldpte);
>>>> +
>>>> +            nr_ptes = mprotect_batch(folio, addr, pte, oldpte,
>>>> +                         max_nr_ptes, false);
>>>>                oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>>>                ptent = pte_modify(oldpte, newprot);
>>>>    @@ -209,15 +273,23 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>>                 * example, if a PTE is already dirty and no other
>>>>                 * COW or special handling is required.
>>>>                 */
>>>> -            if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>> -                !pte_write(ptent) &&
>>> Don't you need to keep the !pte_write(ptent) condition here? No need to sub-
>>> batch if write is already set.
>>>
>>>> -                can_change_pte_writable(vma, addr, ptent))
>>>> -                ptent = pte_mkwrite(ptent, vma);
>>>> -
>>>> -            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>>> -            if (pte_needs_flush(oldpte, ptent))
>>>> -                tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>>> -            pages++;
>>>> +            if (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) {
>>>> +                max_len = nr_ptes;
>>>> +                while (sub_batch_idx < nr_ptes) {
>>>> +
>>>> +                    /* Get length of sub batch */
>>>> +                    len = modify_sub_batch(vma, tlb, addr, pte,
>>>> +                                   oldpte, ptent, max_len,
>>>> +                                   sub_batch_idx);
>>>> +                    sub_batch_idx += len;
>>>> +                    max_len -= len;
>>>> +                }
>>>> +            } else {
>>>> +                modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent,
>>>> nr_ptes);
>>>> +                if (pte_needs_flush(oldpte, ptent))
>>>> +                    tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
>>>> +            }
>>> This if/else block and modify_sub_block() is all pretty ugly. I wonder if
>>> something like this would be neater:
>>>
>>>              use_sub_batch = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>                      !pte_write(ptent) &&
>>>                      LIKE_can_change_pte_writable(vma, addr, ptent, folio);
>>>
>>>              while (nr_ptes) {
>>>                  if (use_sub_batch)
>>>                      sub_nr_ptes = sub_batch(...);
>>>                  else
>>>                      sub_nr_ptes = nr_ptes;
>>>
>>>                  modify_prot_commit_ptes(vma, addr, pte, oldpte,
>>>                              ptent, sub_nr_ptes);
>>>                  if (pte_needs_flush(oldpte, ptent))
>>>                      tlb_flush_pte_range(tlb, addr,
>>>                          sub_nr_ptes * PAGE_SIZE);
>>>
>>>                  addr += sub_nr_ptes * PAGE_SIZE;
>>>                  pte += sub_nr_ptes;
>>>                  oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
>>>                  ptent = pte_advance_pfn(ptent, sub_nr_ptes);
>>>                  nr_ptes -= sub_nr_ptes;
>>>                  pages += sub_nr_ptes;
>>>              }
>>>
>>> Where:
>>>
>>>    - LIKE_can_change_pte_writable() is similar to can_change_pte_writable() but
>>>      does everything except checking the per-page exclusive flag. Note that we
>>>      already have the folio so can pass that rather than calling vm_normal_page()
>>>      again.
>>>
>>>    - sub_batch() can be passed the folio and the index of the first page within
>>>      the folio and the max number of pages to check (nr_ptes). Then returns the
>>>      number of pages where exclusive flag is the same.
>>>
>>> Obviously they both need better names...
>> I cannot figure a nice way of implementing your suggestion. We need to get the
>> length
>> of the sub batch and the true/false return value from can_change_ptes_writable, and
>> we also need to call pte_mkwrite in case the ret is true. Doing that in your
>> suggested
>> way seems harder to me, along with the fact that it is causing indentation
>> problem by
>> one extra tab compared to my current implementation.
>>
>> I don't know, my method looks pretty neat to me : )
>
> How about something like this (compile-tested only)? Personally I find this
> easier to follow:
>
> ---8<---
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 124612ce3d24..97ccc2a92a7b 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -40,35 +40,47 @@
>
>   #include "internal.h"
>
> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte)
> -{
> -	struct page *page;
> +enum tristate {
> +	TRI_FALSE = 0,
> +	TRI_TRUE = 1,
> +	TRI_MAYBE = -1,
> +};
>
> +/*
> + * Returns enum tristate indicating whether the pte can be changed to writable.
> + * If TRI_MAYBE is returned, then the folio is anonymous and the user must
> + * additionally check PageAnonExclusive() for every page in the desired range.
> + */
> +static int maybe_change_pte_writable(struct vm_area_struct *vma,
> +				     unsigned long addr, pte_t pte,
> +				     struct folio *folio)
> +{
>   	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
> -		return false;
> +		return TRI_FALSE;
>
>   	/* Don't touch entries that are not even readable. */
>   	if (pte_protnone(pte))
> -		return false;
> +		return TRI_FALSE;
>
>   	/* Do we need write faults for softdirty tracking? */
>   	if (pte_needs_soft_dirty_wp(vma, pte))
> -		return false;
> +		return TRI_FALSE;
>
>   	/* Do we need write faults for uffd-wp tracking? */
>   	if (userfaultfd_pte_wp(vma, pte))
> -		return false;
> +		return TRI_FALSE;
>
>   	if (!(vma->vm_flags & VM_SHARED)) {
>   		/*
>   		 * Writable MAP_PRIVATE mapping: We can only special-case on
>   		 * exclusive anonymous pages, because we know that our
>   		 * write-fault handler similarly would map them writable without
> -		 * any additional checks while holding the PT lock.
> +		 * any additional checks while holding the PT lock. So if the
> +		 * folio is not anonymous, we know we cannot change pte
> +		 * writable. If it is anonymous then the caller must further
> +		 * check that the page is AnonExclusive().
>   		 */
> -		page = vm_normal_page(vma, addr, pte);
> -		return page && PageAnon(page) && PageAnonExclusive(page);
> +		return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
>   	}
>
>   	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
> @@ -80,15 +92,60 @@ bool can_change_pte_writable(struct vm_area_struct *vma,
> unsigned long addr,
>   	 * FS was already notified and we can simply mark the PTE writable
>   	 * just like the write-fault handler would do.
>   	 */
> -	return pte_dirty(pte);
> +	return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
> +}
> +
> +/*
> + * Returns the number of pages within the folio, starting from the page
> + * indicated by pgidx and up to pgidx + max_nr, that have the same value of
> + * PageAnonExclusive(). Must only be called for anonymous folios. Value of
> + * PageAnonExclusive() is returned in *exclusive.
> + */
> +static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
> +				bool *exclusive)
> +{
> +	VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> +	VM_WARN_ON_FOLIO(folio_nr_pages(folio) < pgidx + max_nr, folio);
> +
> +	struct page *page = folio_page(folio, pgidx++);
> +	int nr = 1;
> +
> +	*exclusive = PageAnonExclusive(page);
> +
> +	while (nr < max_nr) {
> +		page = folio_page(folio, pgidx++);
> +		if (*exclusive != PageAnonExclusive(page))
> +			break;
> +		nr++;
> +	}
> +
> +	return nr;
> +}
> +
> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> +			     pte_t pte)
> +{
> +	int ret;
> +	struct page *page;
> +
> +	ret = maybe_change_pte_writable(vma, addr, pte, NULL);
> +	if (ret == TRI_MAYBE) {
> +		page = vm_normal_page(vma, addr, pte);
> +		ret = page && PageAnon(page) && PageAnonExclusive(page);
> +	}
> +
> +	return ret;
>   }
>
>   static int mprotect_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
> -		pte_t pte, int max_nr_ptes)
> +		pte_t pte, int max_nr_ptes, bool ignore_soft_dirty)
>   {
> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +	fpb_t flags = FPB_IGNORE_DIRTY;
>
> -	if (!folio_test_large(folio) || (max_nr_ptes == 1))
> +	if (ignore_soft_dirty)
> +		flags |= FPB_IGNORE_SOFT_DIRTY;
> +
> +	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
>   		return 1;
>
>   	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
> @@ -125,14 +182,17 @@ static long change_pte_range(struct mmu_gather *tlb,
>   		oldpte = ptep_get(pte);
>   		if (pte_present(oldpte)) {
>   			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
> -			pte_t ptent;
> +			struct folio *folio = NULL;
> +			int sub_nr_ptes, pgidx;
> +			pte_t ptent, newpte;
> +			bool sub_set_write;
> +			int set_write;
>
>   			/*
>   			 * Avoid trapping faults against the zero or KSM
>   			 * pages. See similar comment in change_huge_pmd.
>   			 */
>   			if (prot_numa) {
> -				struct folio *folio;
>   				int nid;
>   				bool toptier;
>
> @@ -180,7 +240,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>   				    toptier) {
>   skip_batch:
>   					nr_ptes = mprotect_batch(folio, addr, pte,
> -								 oldpte, max_nr_ptes);
> +								 oldpte, max_nr_ptes,
> +								 true);
>   					continue;
>   				}
>   				if (folio_use_access_time(folio))
> @@ -188,6 +249,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>   						jiffies_to_msecs(jiffies));
>   			}
>
> +			if (!folio)
> +				folio = vm_normal_folio(vma, addr, oldpte);
> +
> +			nr_ptes = mprotect_batch(folio, addr, pte, oldpte,
> +						 max_nr_ptes, false);
>   			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>   			ptent = pte_modify(oldpte, newprot);
>
> @@ -209,15 +275,38 @@ static long change_pte_range(struct mmu_gather *tlb,
>   			 * example, if a PTE is already dirty and no other
>   			 * COW or special handling is required.
>   			 */
> -			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> -			    !pte_write(ptent) &&
> -			    can_change_pte_writable(vma, addr, ptent))
> -				ptent = pte_mkwrite(ptent, vma);
> -
> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> -			if (pte_needs_flush(oldpte, ptent))
> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> -			pages++;
> +			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> +				    !pte_write(ptent) &&
> +				    maybe_change_pte_writable(vma, addr, ptent, folio);
> +
> +			while (nr_ptes) {
> +				if (set_write == TRI_MAYBE) {
> +					pgidx = folio_pfn(folio) - pte_pfn(ptent);
> +					sub_nr_ptes = anon_exclusive_batch(folio,
> +						pgidx, nr_ptes, &sub_set_write);
> +				} else {
> +					sub_nr_ptes = nr_ptes;
> +					sub_set_write = set_write == TRI_TRUE;
> +				}
> +
> +				if (sub_set_write)
> +					newpte = pte_mkwrite(ptent, vma);
> +				else
> +					newpte = ptent;
> +
> +				modify_prot_commit_ptes(vma, addr, pte, oldpte,
> +							newpte, sub_nr_ptes);
> +				if (pte_needs_flush(oldpte, newpte))
> +					tlb_flush_pte_range(tlb, addr,
> +						sub_nr_ptes * PAGE_SIZE);
> +
> +				addr += sub_nr_ptes * PAGE_SIZE;
> +				pte += sub_nr_ptes;
> +				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
> +				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
> +				nr_ptes -= sub_nr_ptes;
> +				pages += sub_nr_ptes;
> +			}
>   		} else if (is_swap_pte(oldpte)) {
>   			swp_entry_t entry = pte_to_swp_entry(oldpte);
>   			pte_t newpte;
> ---8<---

Looks quite clean! I'll try this.

> Thanks,
> Ryan
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 4/5] arm64: Add batched version of ptep_modify_prot_start
  2025-05-19  7:48 [PATCH v3 0/5] Optimize mprotect() for large folios Dev Jain
                   ` (2 preceding siblings ...)
  2025-05-19  7:48 ` [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching Dev Jain
@ 2025-05-19  7:48 ` Dev Jain
  2025-05-21 14:14   ` Ryan Roberts
  2025-05-19  7:48 ` [PATCH v3 5/5] arm64: Add batched version of ptep_modify_prot_commit Dev Jain
  4 siblings, 1 reply; 34+ messages in thread
From: Dev Jain @ 2025-05-19  7:48 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Override the generic definition to use get_and_clear_full_ptes(). This helper
does a TLBI only for the starting and ending contpte block of the range, whereas
the current implementation will call ptep_get_and_clear() for every contpte block,
thus doing a TLBI on every contpte block. Therefore, we have a performance win.
The arm64 definition of pte_accessible() allows us to batch around it in clear_flush_ptes():
#define pte_accessible(mm, pte)	\
	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))

All ptes are obviously present in the folio batch, and they are also valid.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 arch/arm64/include/asm/pgtable.h |  5 +++++
 arch/arm64/mm/mmu.c              | 12 +++++++++---
 include/linux/pgtable.h          |  4 ++++
 mm/pgtable-generic.c             | 16 +++++++++++-----
 4 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 2a77f11b78d5..8872ea5f0642 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1553,6 +1553,11 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
 				    unsigned long addr, pte_t *ptep,
 				    pte_t old_pte, pte_t new_pte);
 
+#define modify_prot_start_ptes modify_prot_start_ptes
+extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
+				    unsigned long addr, pte_t *ptep,
+				    unsigned int nr);
+
 #ifdef CONFIG_ARM64_CONTPTE
 
 /*
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 8fcf59ba39db..fe60be8774f4 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1523,7 +1523,8 @@ static int __init prevent_bootmem_remove_init(void)
 early_initcall(prevent_bootmem_remove_init);
 #endif
 
-pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
+pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t *ptep, unsigned int nr)
 {
 	if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
 		/*
@@ -1532,9 +1533,14 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
 		 * in cases where cpu is affected with errata #2645198.
 		 */
 		if (pte_user_exec(ptep_get(ptep)))
-			return ptep_clear_flush(vma, addr, ptep);
+			return clear_flush_ptes(vma, addr, ptep, nr);
 	}
-	return ptep_get_and_clear(vma->vm_mm, addr, ptep);
+	return get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);
+}
+
+pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
+{
+	return modify_prot_start_ptes(vma, addr, ptep, 1);
 }
 
 void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e40ed57e034d..41f4a8de5c28 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -828,6 +828,10 @@ extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
 			      pte_t *ptep);
 #endif
 
+extern pte_t clear_flush_ptes(struct vm_area_struct *vma,
+			      unsigned long address,
+			      pte_t *ptep, unsigned int nr);
+
 #ifndef __HAVE_ARCH_PMDP_HUGE_CLEAR_FLUSH
 extern pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
 			      unsigned long address,
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 5a882f2b10f9..e238f88c3cac 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -90,17 +90,23 @@ int ptep_clear_flush_young(struct vm_area_struct *vma,
 }
 #endif
 
-#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
-pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
-		       pte_t *ptep)
+pte_t clear_flush_ptes(struct vm_area_struct *vma, unsigned long address,
+		       pte_t *ptep, unsigned int nr)
 {
 	struct mm_struct *mm = (vma)->vm_mm;
 	pte_t pte;
-	pte = ptep_get_and_clear(mm, address, ptep);
+	pte = get_and_clear_full_ptes(mm, address, ptep, nr, 0);
 	if (pte_accessible(mm, pte))
-		flush_tlb_page(vma, address);
+		flush_tlb_range(vma, address, address + nr * PAGE_SIZE);
 	return pte;
 }
+
+#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
+pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
+		       pte_t *ptep)
+{
+	return clear_flush_ptes(vma, address, ptep, 1);
+}
 #endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 4/5] arm64: Add batched version of ptep_modify_prot_start
  2025-05-19  7:48 ` [PATCH v3 4/5] arm64: Add batched version of ptep_modify_prot_start Dev Jain
@ 2025-05-21 14:14   ` Ryan Roberts
  2025-05-22  7:13     ` Dev Jain
  0 siblings, 1 reply; 34+ messages in thread
From: Ryan Roberts @ 2025-05-21 14:14 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 19/05/2025 08:48, Dev Jain wrote:
> Override the generic definition to use get_and_clear_full_ptes(). This helper
> does a TLBI only for the starting and ending contpte block of the range, whereas
> the current implementation will call ptep_get_and_clear() for every contpte block,
> thus doing a TLBI on every contpte block. Therefore, we have a performance win.
> The arm64 definition of pte_accessible() allows us to batch around it in clear_flush_ptes():
> #define pte_accessible(mm, pte)	\
> 	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> 
> All ptes are obviously present in the folio batch, and they are also valid.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Please squash this with the 

> ---
>  arch/arm64/include/asm/pgtable.h |  5 +++++
>  arch/arm64/mm/mmu.c              | 12 +++++++++---
>  include/linux/pgtable.h          |  4 ++++
>  mm/pgtable-generic.c             | 16 +++++++++++-----
>  4 files changed, 29 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 2a77f11b78d5..8872ea5f0642 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1553,6 +1553,11 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  				    unsigned long addr, pte_t *ptep,
>  				    pte_t old_pte, pte_t new_pte);
>  
> +#define modify_prot_start_ptes modify_prot_start_ptes
> +extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +				    unsigned long addr, pte_t *ptep,
> +				    unsigned int nr);
> +
>  #ifdef CONFIG_ARM64_CONTPTE
>  
>  /*
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 8fcf59ba39db..fe60be8774f4 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1523,7 +1523,8 @@ static int __init prevent_bootmem_remove_init(void)
>  early_initcall(prevent_bootmem_remove_init);
>  #endif
>  
> -pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
> +pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
> +			     pte_t *ptep, unsigned int nr)
>  {
>  	if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
>  		/*
> @@ -1532,9 +1533,14 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
>  		 * in cases where cpu is affected with errata #2645198.
>  		 */
>  		if (pte_user_exec(ptep_get(ptep)))
> -			return ptep_clear_flush(vma, addr, ptep);
> +			return clear_flush_ptes(vma, addr, ptep, nr);
>  	}
> -	return ptep_get_and_clear(vma->vm_mm, addr, ptep);
> +	return get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);
> +}

I think we can do this more precisely with respect to tlbis and also without 
needing to create a new clear_flush_ptes() helper:


pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
			     pte_t *ptep, unsigned int nr)
{
	pte_t pte = get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);

	if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
		/*
		 * Break-before-make (BBM) is required for all user space mappings
		 * when the permission changes from executable to non-executable
		 * in cases where cpu is affected with errata #2645198.
		 */
		if (pte_accessible(vma->vm_mm, pte) && pte_user_exec(pte))
			__flush_tlb_range(vma, addr, nr * PAGE_SIZE,
					  PAGE_SIZE, true, 3);
	}

	return pte;
}


> +
> +pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
> +{
> +	return modify_prot_start_ptes(vma, addr, ptep, 1);
>  }
>  
>  void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e40ed57e034d..41f4a8de5c28 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -828,6 +828,10 @@ extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>  			      pte_t *ptep);
>  #endif
>  
> +extern pte_t clear_flush_ptes(struct vm_area_struct *vma,
> +			      unsigned long address,
> +			      pte_t *ptep, unsigned int nr);
> +
>  #ifndef __HAVE_ARCH_PMDP_HUGE_CLEAR_FLUSH
>  extern pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
>  			      unsigned long address,
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 5a882f2b10f9..e238f88c3cac 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -90,17 +90,23 @@ int ptep_clear_flush_young(struct vm_area_struct *vma,
>  }
>  #endif
>  
> -#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
> -pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
> -		       pte_t *ptep)
> +pte_t clear_flush_ptes(struct vm_area_struct *vma, unsigned long address,
> +		       pte_t *ptep, unsigned int nr)
>  {
>  	struct mm_struct *mm = (vma)->vm_mm;
>  	pte_t pte;
> -	pte = ptep_get_and_clear(mm, address, ptep);
> +	pte = get_and_clear_full_ptes(mm, address, ptep, nr, 0);
>  	if (pte_accessible(mm, pte))
> -		flush_tlb_page(vma, address);
> +		flush_tlb_range(vma, address, address + nr * PAGE_SIZE);
>  	return pte;
>  }

Let's not create a new generic helper if only arm64 is using it. We would 
also want to add a doc header to describe this helper. My proposal avoids 
this.

Thanks,
Ryan

> +
> +#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
> +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
> +		       pte_t *ptep)
> +{
> +	return clear_flush_ptes(vma, address, ptep, 1);
> +}
>  #endif
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 4/5] arm64: Add batched version of ptep_modify_prot_start
  2025-05-21 14:14   ` Ryan Roberts
@ 2025-05-22  7:13     ` Dev Jain
  0 siblings, 0 replies; 34+ messages in thread
From: Dev Jain @ 2025-05-22  7:13 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 21/05/25 7:44 pm, Ryan Roberts wrote:
> On 19/05/2025 08:48, Dev Jain wrote:
>> Override the generic definition to use get_and_clear_full_ptes(). This helper
>> does a TLBI only for the starting and ending contpte block of the range, whereas
>> the current implementation will call ptep_get_and_clear() for every contpte block,
>> thus doing a TLBI on every contpte block. Therefore, we have a performance win.
>> The arm64 definition of pte_accessible() allows us to batch around it in clear_flush_ptes():
>> #define pte_accessible(mm, pte)	\
>> 	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
>>
>> All ptes are obviously present in the folio batch, and they are also valid.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Please squash this with the
>
>> ---
>>   arch/arm64/include/asm/pgtable.h |  5 +++++
>>   arch/arm64/mm/mmu.c              | 12 +++++++++---
>>   include/linux/pgtable.h          |  4 ++++
>>   mm/pgtable-generic.c             | 16 +++++++++++-----
>>   4 files changed, 29 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 2a77f11b78d5..8872ea5f0642 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1553,6 +1553,11 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   				    unsigned long addr, pte_t *ptep,
>>   				    pte_t old_pte, pte_t new_pte);
>>   
>> +#define modify_prot_start_ptes modify_prot_start_ptes
>> +extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> +				    unsigned long addr, pte_t *ptep,
>> +				    unsigned int nr);
>> +
>>   #ifdef CONFIG_ARM64_CONTPTE
>>   
>>   /*
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 8fcf59ba39db..fe60be8774f4 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1523,7 +1523,8 @@ static int __init prevent_bootmem_remove_init(void)
>>   early_initcall(prevent_bootmem_remove_init);
>>   #endif
>>   
>> -pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
>> +pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +			     pte_t *ptep, unsigned int nr)
>>   {
>>   	if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
>>   		/*
>> @@ -1532,9 +1533,14 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
>>   		 * in cases where cpu is affected with errata #2645198.
>>   		 */
>>   		if (pte_user_exec(ptep_get(ptep)))
>> -			return ptep_clear_flush(vma, addr, ptep);
>> +			return clear_flush_ptes(vma, addr, ptep, nr);
>>   	}
>> -	return ptep_get_and_clear(vma->vm_mm, addr, ptep);
>> +	return get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);
>> +}
> I think we can do this more precisely with respect to tlbis and also without
> needing to create a new clear_flush_ptes() helper:
>
>
> pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
> 			     pte_t *ptep, unsigned int nr)
> {
> 	pte_t pte = get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, 0);
>
> 	if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
> 		/*
> 		 * Break-before-make (BBM) is required for all user space mappings
> 		 * when the permission changes from executable to non-executable
> 		 * in cases where cpu is affected with errata #2645198.
> 		 */
> 		if (pte_accessible(vma->vm_mm, pte) && pte_user_exec(pte))
> 			__flush_tlb_range(vma, addr, nr * PAGE_SIZE,
> 					  PAGE_SIZE, true, 3);
> 	}
>
> 	return pte;
> }
>
>
>> +
>> +pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
>> +{
>> +	return modify_prot_start_ptes(vma, addr, ptep, 1);
>>   }
>>   
>>   void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index e40ed57e034d..41f4a8de5c28 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -828,6 +828,10 @@ extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>>   			      pte_t *ptep);
>>   #endif
>>   
>> +extern pte_t clear_flush_ptes(struct vm_area_struct *vma,
>> +			      unsigned long address,
>> +			      pte_t *ptep, unsigned int nr);
>> +
>>   #ifndef __HAVE_ARCH_PMDP_HUGE_CLEAR_FLUSH
>>   extern pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
>>   			      unsigned long address,
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 5a882f2b10f9..e238f88c3cac 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -90,17 +90,23 @@ int ptep_clear_flush_young(struct vm_area_struct *vma,
>>   }
>>   #endif
>>   
>> -#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>> -pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
>> -		       pte_t *ptep)
>> +pte_t clear_flush_ptes(struct vm_area_struct *vma, unsigned long address,
>> +		       pte_t *ptep, unsigned int nr)
>>   {
>>   	struct mm_struct *mm = (vma)->vm_mm;
>>   	pte_t pte;
>> -	pte = ptep_get_and_clear(mm, address, ptep);
>> +	pte = get_and_clear_full_ptes(mm, address, ptep, nr, 0);
>>   	if (pte_accessible(mm, pte))
>> -		flush_tlb_page(vma, address);
>> +		flush_tlb_range(vma, address, address + nr * PAGE_SIZE);
>>   	return pte;
>>   }
> Let's not create a new generic helper if only arm64 is using it. We would
> also want to add a doc header to describe this helper. My proposal avoids
> this.


This is better, thanks!


>
> Thanks,
> Ryan
>
>> +
>> +#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>> +pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
>> +		       pte_t *ptep)
>> +{
>> +	return clear_flush_ptes(vma, address, ptep, 1);
>> +}
>>   #endif
>>   
>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 5/5] arm64: Add batched version of ptep_modify_prot_commit
  2025-05-19  7:48 [PATCH v3 0/5] Optimize mprotect() for large folios Dev Jain
                   ` (3 preceding siblings ...)
  2025-05-19  7:48 ` [PATCH v3 4/5] arm64: Add batched version of ptep_modify_prot_start Dev Jain
@ 2025-05-19  7:48 ` Dev Jain
  2025-05-21 14:17   ` Ryan Roberts
  4 siblings, 1 reply; 34+ messages in thread
From: Dev Jain @ 2025-05-19  7:48 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Override the generic definition to simply use set_ptes() to map the new
ptes into the pagetable.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 5 +++++
 arch/arm64/mm/mmu.c              | 9 ++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 8872ea5f0642..0b13ca38f80c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1558,6 +1558,11 @@ extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
 				    unsigned long addr, pte_t *ptep,
 				    unsigned int nr);
 
+#define modify_prot_commit_ptes modify_prot_commit_ptes
+extern void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+				    pte_t *ptep, pte_t old_pte, pte_t pte,
+				    unsigned int nr);
+
 #ifdef CONFIG_ARM64_CONTPTE
 
 /*
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index fe60be8774f4..5f04bcdcd946 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1543,10 +1543,17 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
 	return modify_prot_start_ptes(vma, addr, ptep, 1);
 }
 
+void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t *ptep, pte_t old_pte, pte_t pte,
+			     unsigned int nr)
+{
+	set_ptes(vma->vm_mm, addr, ptep, pte, nr);
+}
+
 void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
 			     pte_t old_pte, pte_t pte)
 {
-	set_pte_at(vma->vm_mm, addr, ptep, pte);
+	modify_prot_commit_ptes(vma, addr, ptep, old_pte, pte, 1);
 }
 
 /*
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 5/5] arm64: Add batched version of ptep_modify_prot_commit
  2025-05-19  7:48 ` [PATCH v3 5/5] arm64: Add batched version of ptep_modify_prot_commit Dev Jain
@ 2025-05-21 14:17   ` Ryan Roberts
  2025-05-22  7:12     ` Dev Jain
  0 siblings, 1 reply; 34+ messages in thread
From: Ryan Roberts @ 2025-05-21 14:17 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 19/05/2025 08:48, Dev Jain wrote:
> Override the generic definition to simply use set_ptes() to map the new
> ptes into the pagetable.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Let's squash this into the previous patch; it doesn't make sense for an arch to
implement one without the other. Otherwise, LGTM.

Thanks,
Ryan

> ---
>  arch/arm64/include/asm/pgtable.h | 5 +++++
>  arch/arm64/mm/mmu.c              | 9 ++++++++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 8872ea5f0642..0b13ca38f80c 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1558,6 +1558,11 @@ extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>  				    unsigned long addr, pte_t *ptep,
>  				    unsigned int nr);
>  
> +#define modify_prot_commit_ptes modify_prot_commit_ptes
> +extern void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> +				    pte_t *ptep, pte_t old_pte, pte_t pte,
> +				    unsigned int nr);
> +
>  #ifdef CONFIG_ARM64_CONTPTE
>  
>  /*
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index fe60be8774f4..5f04bcdcd946 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1543,10 +1543,17 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
>  	return modify_prot_start_ptes(vma, addr, ptep, 1);
>  }
>  
> +void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> +			     pte_t *ptep, pte_t old_pte, pte_t pte,
> +			     unsigned int nr)
> +{
> +	set_ptes(vma->vm_mm, addr, ptep, pte, nr);
> +}
> +
>  void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
>  			     pte_t old_pte, pte_t pte)
>  {
> -	set_pte_at(vma->vm_mm, addr, ptep, pte);
> +	modify_prot_commit_ptes(vma, addr, ptep, old_pte, pte, 1);
>  }
>  
>  /*



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 5/5] arm64: Add batched version of ptep_modify_prot_commit
  2025-05-21 14:17   ` Ryan Roberts
@ 2025-05-22  7:12     ` Dev Jain
  0 siblings, 0 replies; 34+ messages in thread
From: Dev Jain @ 2025-05-22  7:12 UTC (permalink / raw)
  To: Ryan Roberts, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy


On 21/05/25 7:47 pm, Ryan Roberts wrote:
> On 19/05/2025 08:48, Dev Jain wrote:
>> Override the generic definition to simply use set_ptes() to map the new
>> ptes into the pagetable.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Let's squash this into the previous patch; it doesn't make sense for an arch to
> implement one without the other. Otherwise, LGTM.


Sure. Thanks.


>
> Thanks,
> Ryan
>
>> ---
>>   arch/arm64/include/asm/pgtable.h | 5 +++++
>>   arch/arm64/mm/mmu.c              | 9 ++++++++-
>>   2 files changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 8872ea5f0642..0b13ca38f80c 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1558,6 +1558,11 @@ extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>   				    unsigned long addr, pte_t *ptep,
>>   				    unsigned int nr);
>>   
>> +#define modify_prot_commit_ptes modify_prot_commit_ptes
>> +extern void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +				    pte_t *ptep, pte_t old_pte, pte_t pte,
>> +				    unsigned int nr);
>> +
>>   #ifdef CONFIG_ARM64_CONTPTE
>>   
>>   /*
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index fe60be8774f4..5f04bcdcd946 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1543,10 +1543,17 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
>>   	return modify_prot_start_ptes(vma, addr, ptep, 1);
>>   }
>>   
>> +void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +			     pte_t *ptep, pte_t old_pte, pte_t pte,
>> +			     unsigned int nr)
>> +{
>> +	set_ptes(vma->vm_mm, addr, ptep, pte, nr);
>> +}
>> +
>>   void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
>>   			     pte_t old_pte, pte_t pte)
>>   {
>> -	set_pte_at(vma->vm_mm, addr, ptep, pte);
>> +	modify_prot_commit_ptes(vma, addr, ptep, old_pte, pte, 1);
>>   }
>>   
>>   /*
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2025-06-27  4:56 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-19  7:48 [PATCH v3 0/5] Optimize mprotect() for large folios Dev Jain
2025-05-19  7:48 ` [PATCH v3 1/5] mm: Optimize mprotect() by batch-skipping PTEs Dev Jain
2025-05-21  8:43   ` Ryan Roberts
2025-05-21 11:58   ` Ryan Roberts
2025-05-22  5:45     ` Dev Jain
2025-05-21 12:06   ` David Hildenbrand
2025-05-22  5:43     ` Dev Jain
2025-05-22  7:13       ` David Hildenbrand
2025-05-22  7:47         ` Dev Jain
2025-05-22 16:18           ` David Hildenbrand
2025-06-04 10:38             ` Dev Jain
2025-06-04 11:44               ` David Hildenbrand
2025-05-19  7:48 ` [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
2025-05-21 11:16   ` Ryan Roberts
2025-05-21 11:45     ` Ryan Roberts
2025-05-22  6:33       ` Dev Jain
2025-05-22  7:51         ` Ryan Roberts
2025-05-22  6:39     ` Dev Jain
2025-06-16  6:37     ` Dev Jain
2025-05-19  7:48 ` [PATCH v3 3/5] mm: Optimize mprotect() by PTE batching Dev Jain
2025-05-19  8:18   ` Barry Song
2025-05-20  9:18     ` Dev Jain
2025-05-21 13:26   ` Ryan Roberts
2025-05-22  6:59     ` Dev Jain
2025-05-22  7:11     ` Dev Jain
2025-06-16 11:24     ` Dev Jain
2025-06-26  8:09       ` Ryan Roberts
2025-06-27  4:55         ` Dev Jain
2025-05-19  7:48 ` [PATCH v3 4/5] arm64: Add batched version of ptep_modify_prot_start Dev Jain
2025-05-21 14:14   ` Ryan Roberts
2025-05-22  7:13     ` Dev Jain
2025-05-19  7:48 ` [PATCH v3 5/5] arm64: Add batched version of ptep_modify_prot_commit Dev Jain
2025-05-21 14:17   ` Ryan Roberts
2025-05-22  7:12     ` Dev Jain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).