[PATCH v5 0/7] Optimize mprotect() for large folios

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v5 0/7] Optimize mprotect() for large folios
@ 2025-07-18  9:02 Dev Jain
  2025-07-18  9:02 ` [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function Dev Jain
                   ` (7 more replies)
  0 siblings, 8 replies; 53+ messages in thread
From: Dev Jain @ 2025-07-18  9:02 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Use folio_pte_batch() to optimize change_pte_range(). On arm64, if the ptes
are painted with the contig bit, then ptep_get() will iterate through all
16 entries to collect a/d bits. Hence this optimization will result in
a 16x reduction in the number of ptep_get() calls. Next,
ptep_modify_prot_start() will eventually call contpte_try_unfold() on
every contig block, thus flushing the TLB for the complete large folio
range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on
each contig block, and only do them on the starting and ending
contig block.

For split folios, there will be no pte batching; the batch size returned
by folio_pte_batch() will be 1. For pagetable split folios, the ptes will
still point to the same large folio; for arm64, this results in the
optimization described above, and for other arches, a minor improvement
is expected due to a reduction in the number of function calls.

mm-selftests pass on arm64. I have some failing tests on my x86 VM already;
no new tests fail as a result of this patchset.

We use the following test cases to measure performance, mprotect()'ing
the mapped memory to read-only then read-write 40 times:

Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
pte-mapping those THPs
Test case 2: Mapping 1G of memory with 64K mTHPs
Test case 3: Mapping 1G of memory with 4K pages

Average execution time on arm64, Apple M3:
Before the patchset:
T1: 2.1 seconds   T2: 2 seconds   T3: 1 second

After the patchset:
T1: 0.65 seconds   T2: 0.7 seconds   T3: 1.1 seconds

Observing T1/T2 and T3 before the patchset, we also remove the regression
introduced by ptep_get() on a contpte block. And, for large folios we get
an almost 74% performance improvement, albeit the trade-off being a slight
degradation in the small folio case.

For x86:
Before the patchset:
T1: 3.75 seconds  T2: 3.7 seconds  T3: 3.85 seconds

After the patchset:
T1: 3.7 seconds  T2: 3.7 seconds  T3: 3.9 seconds

So there is a minor improvement due to reduction in number of function
calls, and a slight degradation in the small folio case due to the
overhead of vm_normal_folio() + folio_test_large().

Here is the test program:

 #define _GNU_SOURCE
 #include <sys/mman.h>
 #include <stdlib.h>
 #include <string.h>
 #include <stdio.h>
 #include <unistd.h>

 #define SIZE (1024*1024*1024)

unsigned long pmdsize = (1UL << 21);
unsigned long pagesize = (1UL << 12);

static void pte_map_thps(char *mem, size_t size)
{
	size_t offs;
	int ret = 0;


	/* PTE-map each THP by temporarily splitting the VMAs. */
	for (offs = 0; offs < size; offs += pmdsize) {
		ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
		ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
	}

	if (ret) {
		fprintf(stderr, "ERROR: mprotect() failed\n");
		exit(1);
	}
}

int main(int argc, char *argv[])
{
	char *p;
        int ret = 0;
	p = mmap((1UL << 30), SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p != (1UL << 30)) {
		perror("mmap");
		return 1;
	}



	memset(p, 0, SIZE);
	if (madvise(p, SIZE, MADV_NOHUGEPAGE))
		perror("madvise");
	explicit_bzero(p, SIZE);
	pte_map_thps(p, SIZE);

	for (int loops = 0; loops < 40; loops++) {
		if (mprotect(p, SIZE, PROT_READ))
			perror("mprotect"), exit(1);
		if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
			perror("mprotect"), exit(1);
		explicit_bzero(p, SIZE);
	}
}

---
v4->v5:
 - Add patch 4
 - Add patch 1 (Lorenzo)
 - For patch 2, instead of using nr_ptes returned from prot_numa_skip()
   as a dummy for whether to skip or not, make that function return
   boolean, and then use folio_pte_batch() to determine how much to
   skip
 - Split can_change_pte_writable() (Lorenzo)
 - Implement patch 6 in a better way

v3->v4:
 - Refactor skipping logic into a new function, edit patch 1 subject
   to highlight it is only for MM_CP_PROT_NUMA case (David H)
 - Refactor the optimization logic, add more documentation to the generic
   batched functions, do not add clear_flush_ptes, squash patch 4
   and 5 (Ryan)

v2->v3:
 - Add comments for the new APIs (Ryan, Lorenzo)
 - Instead of refactoring, use a "skip_batch" label
 - Move arm64 patches at the end (Ryan)
 - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
 - Resolve implicit declaration; tested build on x86 (Lance Yang)

v1->v2:
 - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
 - Abridge the anon-exclusive condition (Lance Yang)


Dev Jain (7):
  mm: Refactor MM_CP_PROT_NUMA skipping case into new function
  mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  mm: Add batched versions of ptep_modify_prot_start/commit
  mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure
  mm: Split can_change_pte_writable() into private and shared parts
  mm: Optimize mprotect() by PTE batching
  arm64: Add batched versions of ptep_modify_prot_start/commit

 arch/arm64/include/asm/pgtable.h |  10 ++
 arch/arm64/mm/mmu.c              |  28 ++-
 include/linux/pgtable.h          |  84 ++++++++-
 mm/internal.h                    |  11 +-
 mm/mprotect.c                    | 295 ++++++++++++++++++++++++-------
 5 files changed, 352 insertions(+), 76 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function
  2025-07-18  9:02 [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
@ 2025-07-18  9:02 ` Dev Jain
  2025-07-18 16:19   ` Lorenzo Stoakes
                     ` (3 more replies)
  2025-07-18  9:02 ` [PATCH v5 2/7] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
                   ` (6 subsequent siblings)
  7 siblings, 4 replies; 53+ messages in thread
From: Dev Jain @ 2025-07-18  9:02 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Reduce indentation by refactoring the prot_numa case into a new function.
No functional change intended.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mprotect.c | 101 +++++++++++++++++++++++++++-----------------------
 1 file changed, 55 insertions(+), 46 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 88709c01177b..2a9c73bd0778 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -83,6 +83,59 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	return pte_dirty(pte);
 }
 
+static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
+			   pte_t oldpte, pte_t *pte, int target_node)
+{
+	struct folio *folio;
+	bool toptier;
+	int nid;
+
+	/* Avoid TLB flush if possible */
+	if (pte_protnone(oldpte))
+		return true;
+
+	folio = vm_normal_folio(vma, addr, oldpte);
+	if (!folio)
+		return true;
+
+	if (folio_is_zone_device(folio) || folio_test_ksm(folio))
+		return true;
+
+	/* Also skip shared copy-on-write pages */
+	if (is_cow_mapping(vma->vm_flags) &&
+	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
+		return true;
+
+	/*
+	 * While migration can move some dirty pages,
+	 * it cannot move them all from MIGRATE_ASYNC
+	 * context.
+	 */
+	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
+		return true;
+
+	/*
+	 * Don't mess with PTEs if page is already on the node
+	 * a single-threaded process is running on.
+	 */
+	nid = folio_nid(folio);
+	if (target_node == nid)
+		return true;
+
+	toptier = node_is_toptier(nid);
+
+	/*
+	 * Skip scanning top tier node if normal numa
+	 * balancing is disabled
+	 */
+	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
+		return true;
+
+	if (folio_use_access_time(folio))
+		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
+	return false;
+}
+
 static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -117,53 +170,9 @@ static long change_pte_range(struct mmu_gather *tlb,
 			 * pages. See similar comment in change_huge_pmd.
 			 */
 			if (prot_numa) {
-				struct folio *folio;
-				int nid;
-				bool toptier;
-
-				/* Avoid TLB flush if possible */
-				if (pte_protnone(oldpte))
-					continue;
-
-				folio = vm_normal_folio(vma, addr, oldpte);
-				if (!folio || folio_is_zone_device(folio) ||
-				    folio_test_ksm(folio))
-					continue;
-
-				/* Also skip shared copy-on-write pages */
-				if (is_cow_mapping(vma->vm_flags) &&
-				    (folio_maybe_dma_pinned(folio) ||
-				     folio_maybe_mapped_shared(folio)))
-					continue;
-
-				/*
-				 * While migration can move some dirty pages,
-				 * it cannot move them all from MIGRATE_ASYNC
-				 * context.
-				 */
-				if (folio_is_file_lru(folio) &&
-				    folio_test_dirty(folio))
-					continue;
-
-				/*
-				 * Don't mess with PTEs if page is already on the node
-				 * a single-threaded process is running on.
-				 */
-				nid = folio_nid(folio);
-				if (target_node == nid)
-					continue;
-				toptier = node_is_toptier(nid);
-
-				/*
-				 * Skip scanning top tier node if normal numa
-				 * balancing is disabled
-				 */
-				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
-				    toptier)
+				if (prot_numa_skip(vma, addr, oldpte, pte,
+						   target_node))
 					continue;
-				if (folio_use_access_time(folio))
-					folio_xchg_access_time(folio,
-						jiffies_to_msecs(jiffies));
 			}
 
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function
  2025-07-18  9:02 ` [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function Dev Jain
@ 2025-07-18 16:19   ` Lorenzo Stoakes
  2025-07-20 23:44   ` Barry Song
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 16:19 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Fri, Jul 18, 2025 at 02:32:38PM +0530, Dev Jain wrote:
> Reduce indentation by refactoring the prot_numa case into a new function.
> No functional change intended.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/mprotect.c | 101 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 55 insertions(+), 46 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88709c01177b..2a9c73bd0778 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,6 +83,59 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return pte_dirty(pte);
>  }
>
> +static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
> +			   pte_t oldpte, pte_t *pte, int target_node)
> +{
> +	struct folio *folio;
> +	bool toptier;
> +	int nid;
> +
> +	/* Avoid TLB flush if possible */
> +	if (pte_protnone(oldpte))
> +		return true;
> +
> +	folio = vm_normal_folio(vma, addr, oldpte);
> +	if (!folio)
> +		return true;
> +
> +	if (folio_is_zone_device(folio) || folio_test_ksm(folio))
> +		return true;
> +
> +	/* Also skip shared copy-on-write pages */
> +	if (is_cow_mapping(vma->vm_flags) &&
> +	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
> +		return true;
> +
> +	/*
> +	 * While migration can move some dirty pages,
> +	 * it cannot move them all from MIGRATE_ASYNC
> +	 * context.
> +	 */
> +	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
> +		return true;
> +
> +	/*
> +	 * Don't mess with PTEs if page is already on the node
> +	 * a single-threaded process is running on.
> +	 */
> +	nid = folio_nid(folio);
> +	if (target_node == nid)
> +		return true;
> +
> +	toptier = node_is_toptier(nid);
> +
> +	/*
> +	 * Skip scanning top tier node if normal numa
> +	 * balancing is disabled
> +	 */
> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
> +		return true;
> +
> +	if (folio_use_access_time(folio))
> +		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
> +	return false;
> +}
> +
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -117,53 +170,9 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			 * pages. See similar comment in change_huge_pmd.
>  			 */
>  			if (prot_numa) {
> -				struct folio *folio;
> -				int nid;
> -				bool toptier;
> -
> -				/* Avoid TLB flush if possible */
> -				if (pte_protnone(oldpte))
> -					continue;
> -
> -				folio = vm_normal_folio(vma, addr, oldpte);
> -				if (!folio || folio_is_zone_device(folio) ||
> -				    folio_test_ksm(folio))
> -					continue;
> -
> -				/* Also skip shared copy-on-write pages */
> -				if (is_cow_mapping(vma->vm_flags) &&
> -				    (folio_maybe_dma_pinned(folio) ||
> -				     folio_maybe_mapped_shared(folio)))
> -					continue;
> -
> -				/*
> -				 * While migration can move some dirty pages,
> -				 * it cannot move them all from MIGRATE_ASYNC
> -				 * context.
> -				 */
> -				if (folio_is_file_lru(folio) &&
> -				    folio_test_dirty(folio))
> -					continue;
> -
> -				/*
> -				 * Don't mess with PTEs if page is already on the node
> -				 * a single-threaded process is running on.
> -				 */
> -				nid = folio_nid(folio);
> -				if (target_node == nid)
> -					continue;
> -				toptier = node_is_toptier(nid);
> -
> -				/*
> -				 * Skip scanning top tier node if normal numa
> -				 * balancing is disabled
> -				 */
> -				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> -				    toptier)
> +				if (prot_numa_skip(vma, addr, oldpte, pte,
> +						   target_node))
>  					continue;
> -				if (folio_use_access_time(folio))
> -					folio_xchg_access_time(folio,
> -						jiffies_to_msecs(jiffies));
>  			}
>
>  			oldpte = ptep_modify_prot_start(vma, addr, pte);
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function
  2025-07-18  9:02 ` [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function Dev Jain
  2025-07-18 16:19   ` Lorenzo Stoakes
@ 2025-07-20 23:44   ` Barry Song
  2025-07-21  3:44     ` Dev Jain
  2025-07-22 11:25   ` Ryan Roberts
  2025-07-23 13:57   ` Zi Yan
  3 siblings, 1 reply; 53+ messages in thread
From: Barry Song @ 2025-07-20 23:44 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Fri, Jul 18, 2025 at 5:03 PM Dev Jain <dev.jain@arm.com> wrote:
>
> Reduce indentation by refactoring the prot_numa case into a new function.
> No functional change intended.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Reviewed-by: Barry Song <baohua@kernel.org>

> ---
>  mm/mprotect.c | 101 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 55 insertions(+), 46 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88709c01177b..2a9c73bd0778 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,6 +83,59 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>         return pte_dirty(pte);
>  }
>
> +static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
> +                          pte_t oldpte, pte_t *pte, int target_node)
> +{

[...]

> +       /*
> +        * Skip scanning top tier node if normal numa
> +        * balancing is disabled
> +        */
> +       if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
> +               return true;
> +
> +       if (folio_use_access_time(folio))
> +               folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));

Nit: I wonder if this should be moved elsewhere, since this isn't
actually about 'skipping', even though the function is named
`prot_numa_skip()`.

> +       return false;
> +}
> +

Thanks
Barry

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function
  2025-07-20 23:44   ` Barry Song
@ 2025-07-21  3:44     ` Dev Jain
  2025-07-22 11:05       ` Dev Jain
  0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-07-21  3:44 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 21/07/25 5:14 am, Barry Song wrote:
> On Fri, Jul 18, 2025 at 5:03 PM Dev Jain <dev.jain@arm.com> wrote:
>> Reduce indentation by refactoring the prot_numa case into a new function.
>> No functional change intended.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Reviewed-by: Barry Song <baohua@kernel.org>
>
>> ---
>>   mm/mprotect.c | 101 +++++++++++++++++++++++++++-----------------------
>>   1 file changed, 55 insertions(+), 46 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 88709c01177b..2a9c73bd0778 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -83,6 +83,59 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>          return pte_dirty(pte);
>>   }
>>
>> +static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>> +                          pte_t oldpte, pte_t *pte, int target_node)
>> +{
> [...]
>
>> +       /*
>> +        * Skip scanning top tier node if normal numa
>> +        * balancing is disabled
>> +        */
>> +       if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
>> +               return true;
>> +
>> +       if (folio_use_access_time(folio))
>> +               folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
> Nit: I wonder if this should be moved elsewhere, since this isn't
> actually about 'skipping', even though the function is named
> `prot_numa_skip()`.

Agreed, thanks. We can use prot_numa_skip() only to return true
or false, and if returned false, we can call folio_xchg_access_time.
I will wait for 2-3 days for any more comments and respin.

>
>> +       return false;
>> +}
>> +
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function
  2025-07-21  3:44     ` Dev Jain
@ 2025-07-22 11:05       ` Dev Jain
  0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-07-22 11:05 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 21/07/25 9:14 am, Dev Jain wrote:
>
> On 21/07/25 5:14 am, Barry Song wrote:
>> On Fri, Jul 18, 2025 at 5:03 PM Dev Jain <dev.jain@arm.com> wrote:
>>> Reduce indentation by refactoring the prot_numa case into a new 
>>> function.
>>> No functional change intended.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> Reviewed-by: Barry Song <baohua@kernel.org>
>>
>>> ---
>>>   mm/mprotect.c | 101 
>>> +++++++++++++++++++++++++++-----------------------
>>>   1 file changed, 55 insertions(+), 46 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 88709c01177b..2a9c73bd0778 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -83,6 +83,59 @@ bool can_change_pte_writable(struct 
>>> vm_area_struct *vma, unsigned long addr,
>>>          return pte_dirty(pte);
>>>   }
>>>
>>> +static bool prot_numa_skip(struct vm_area_struct *vma, unsigned 
>>> long addr,
>>> +                          pte_t oldpte, pte_t *pte, int target_node)
>>> +{
>> [...]
>>
>>> +       /*
>>> +        * Skip scanning top tier node if normal numa
>>> +        * balancing is disabled
>>> +        */
>>> +       if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && 
>>> toptier)
>>> +               return true;
>>> +
>>> +       if (folio_use_access_time(folio))
>>> +               folio_xchg_access_time(folio, 
>>> jiffies_to_msecs(jiffies));
>> Nit: I wonder if this should be moved elsewhere, since this isn't
>> actually about 'skipping', even though the function is named
>> `prot_numa_skip()`.
>
> Agreed, thanks. We can use prot_numa_skip() only to return true
> or false, and if returned false, we can call folio_xchg_access_time.
> I will wait for 2-3 days for any more comments and respin.

Since Andrew has already pulled this and we are quite late into the

release cycle, I'll do this cleanup in the next cycle.


>
>>
>>> +       return false;
>>> +}
>>> +
>> Thanks
>> Barry
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function
  2025-07-18  9:02 ` [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function Dev Jain
  2025-07-18 16:19   ` Lorenzo Stoakes
  2025-07-20 23:44   ` Barry Song
@ 2025-07-22 11:25   ` Ryan Roberts
  2025-07-23 13:57   ` Zi Yan
  3 siblings, 0 replies; 53+ messages in thread
From: Ryan Roberts @ 2025-07-22 11:25 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 18/07/2025 10:02, Dev Jain wrote:
> Reduce indentation by refactoring the prot_numa case into a new function.
> No functional change intended.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mprotect.c | 101 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 55 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88709c01177b..2a9c73bd0778 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,6 +83,59 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return pte_dirty(pte);
>  }
>  
> +static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
> +			   pte_t oldpte, pte_t *pte, int target_node)
> +{
> +	struct folio *folio;
> +	bool toptier;
> +	int nid;
> +
> +	/* Avoid TLB flush if possible */
> +	if (pte_protnone(oldpte))
> +		return true;
> +
> +	folio = vm_normal_folio(vma, addr, oldpte);
> +	if (!folio)
> +		return true;
> +
> +	if (folio_is_zone_device(folio) || folio_test_ksm(folio))
> +		return true;
> +
> +	/* Also skip shared copy-on-write pages */
> +	if (is_cow_mapping(vma->vm_flags) &&
> +	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
> +		return true;
> +
> +	/*
> +	 * While migration can move some dirty pages,
> +	 * it cannot move them all from MIGRATE_ASYNC
> +	 * context.
> +	 */
> +	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
> +		return true;
> +
> +	/*
> +	 * Don't mess with PTEs if page is already on the node
> +	 * a single-threaded process is running on.
> +	 */
> +	nid = folio_nid(folio);
> +	if (target_node == nid)
> +		return true;
> +
> +	toptier = node_is_toptier(nid);
> +
> +	/*
> +	 * Skip scanning top tier node if normal numa
> +	 * balancing is disabled
> +	 */
> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
> +		return true;
> +
> +	if (folio_use_access_time(folio))
> +		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));

Perhaps this bit should be kept in the original location? It's got nothing to do
with determining if the pte should be skipped.

Thanks,
Ryan

> +	return false;
> +}
> +
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -117,53 +170,9 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			 * pages. See similar comment in change_huge_pmd.
>  			 */
>  			if (prot_numa) {
> -				struct folio *folio;
> -				int nid;
> -				bool toptier;
> -
> -				/* Avoid TLB flush if possible */
> -				if (pte_protnone(oldpte))
> -					continue;
> -
> -				folio = vm_normal_folio(vma, addr, oldpte);
> -				if (!folio || folio_is_zone_device(folio) ||
> -				    folio_test_ksm(folio))
> -					continue;
> -
> -				/* Also skip shared copy-on-write pages */
> -				if (is_cow_mapping(vma->vm_flags) &&
> -				    (folio_maybe_dma_pinned(folio) ||
> -				     folio_maybe_mapped_shared(folio)))
> -					continue;
> -
> -				/*
> -				 * While migration can move some dirty pages,
> -				 * it cannot move them all from MIGRATE_ASYNC
> -				 * context.
> -				 */
> -				if (folio_is_file_lru(folio) &&
> -				    folio_test_dirty(folio))
> -					continue;
> -
> -				/*
> -				 * Don't mess with PTEs if page is already on the node
> -				 * a single-threaded process is running on.
> -				 */
> -				nid = folio_nid(folio);
> -				if (target_node == nid)
> -					continue;
> -				toptier = node_is_toptier(nid);
> -
> -				/*
> -				 * Skip scanning top tier node if normal numa
> -				 * balancing is disabled
> -				 */
> -				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> -				    toptier)
> +				if (prot_numa_skip(vma, addr, oldpte, pte,
> +						   target_node))
>  					continue;
> -				if (folio_use_access_time(folio))
> -					folio_xchg_access_time(folio,
> -						jiffies_to_msecs(jiffies));
>  			}
>  
>  			oldpte = ptep_modify_prot_start(vma, addr, pte);


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function
  2025-07-18  9:02 ` [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function Dev Jain
                     ` (2 preceding siblings ...)
  2025-07-22 11:25   ` Ryan Roberts
@ 2025-07-23 13:57   ` Zi Yan
  3 siblings, 0 replies; 53+ messages in thread
From: Zi Yan @ 2025-07-23 13:57 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang

On 18 Jul 2025, at 5:02, Dev Jain wrote:

> Reduce indentation by refactoring the prot_numa case into a new function.
> No functional change intended.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mprotect.c | 101 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 55 insertions(+), 46 deletions(-)
>
LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v5 2/7] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-07-18  9:02 [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
  2025-07-18  9:02 ` [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function Dev Jain
@ 2025-07-18  9:02 ` Dev Jain
  2025-07-18 16:40   ` Lorenzo Stoakes
                     ` (2 more replies)
  2025-07-18  9:02 ` [PATCH v5 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 53+ messages in thread
From: Dev Jain @ 2025-07-18  9:02 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

For the MM_CP_PROT_NUMA skipping case, observe that, if we skip an
iteration due to the underlying folio satisfying any of the skip
conditions, then for all subsequent ptes which map the same folio, the
iteration will be skipped for them too. Therefore, we can optimize
by using folio_pte_batch() to batch skip the iterations.

Use prot_numa_skip() introduced in the previous patch to determine whether
we need to skip the iteration. Change its signature to have a double
pointer to a folio, which will be used by mprotect_folio_pte_batch() to
determine the number of iterations we can safely skip.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mprotect.c | 55 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 42 insertions(+), 13 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2a9c73bd0778..97adc62c50ab 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -83,28 +83,43 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	return pte_dirty(pte);
 }
 
+static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
+				    pte_t pte, int max_nr_ptes)
+{
+	/* No underlying folio, so cannot batch */
+	if (!folio)
+		return 1;
+
+	if (!folio_test_large(folio))
+		return 1;
+
+	return folio_pte_batch(folio, ptep, pte, max_nr_ptes);
+}
+
 static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
-			   pte_t oldpte, pte_t *pte, int target_node)
+			   pte_t oldpte, pte_t *pte, int target_node,
+			   struct folio **foliop)
 {
-	struct folio *folio;
+	struct folio *folio = NULL;
+	bool ret = true;
 	bool toptier;
 	int nid;
 
 	/* Avoid TLB flush if possible */
 	if (pte_protnone(oldpte))
-		return true;
+		goto skip;
 
 	folio = vm_normal_folio(vma, addr, oldpte);
 	if (!folio)
-		return true;
+		goto skip;
 
 	if (folio_is_zone_device(folio) || folio_test_ksm(folio))
-		return true;
+		goto skip;
 
 	/* Also skip shared copy-on-write pages */
 	if (is_cow_mapping(vma->vm_flags) &&
 	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
-		return true;
+		goto skip;
 
 	/*
 	 * While migration can move some dirty pages,
@@ -112,7 +127,7 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
 	 * context.
 	 */
 	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
-		return true;
+		goto skip;
 
 	/*
 	 * Don't mess with PTEs if page is already on the node
@@ -120,7 +135,7 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
 	 */
 	nid = folio_nid(folio);
 	if (target_node == nid)
-		return true;
+		goto skip;
 
 	toptier = node_is_toptier(nid);
 
@@ -129,11 +144,15 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
 	 * balancing is disabled
 	 */
 	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
-		return true;
+		goto skip;
 
+	ret = false;
 	if (folio_use_access_time(folio))
 		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
-	return false;
+
+skip:
+	*foliop = folio;
+	return ret;
 }
 
 static long change_pte_range(struct mmu_gather *tlb,
@@ -147,6 +166,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	int nr_ptes;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -161,8 +181,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 	do {
+		nr_ptes = 1;
 		oldpte = ptep_get(pte);
 		if (pte_present(oldpte)) {
+			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
+			struct folio *folio;
 			pte_t ptent;
 
 			/*
@@ -170,9 +193,15 @@ static long change_pte_range(struct mmu_gather *tlb,
 			 * pages. See similar comment in change_huge_pmd.
 			 */
 			if (prot_numa) {
-				if (prot_numa_skip(vma, addr, oldpte, pte,
-						   target_node))
+				int ret = prot_numa_skip(vma, addr, oldpte, pte,
+							 target_node, &folio);
+				if (ret) {
+
+					/* determine batch to skip */
+					nr_ptes = mprotect_folio_pte_batch(folio,
+						  pte, oldpte, max_nr_ptes);
 					continue;
+				}
 			}
 
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
@@ -289,7 +318,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 				pages++;
 			}
 		}
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 2/7] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-07-18  9:02 ` [PATCH v5 2/7] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
@ 2025-07-18 16:40   ` Lorenzo Stoakes
  2025-07-22 11:26   ` Ryan Roberts
  2025-07-23 14:25   ` Zi Yan
  2 siblings, 0 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 16:40 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Fri, Jul 18, 2025 at 02:32:39PM +0530, Dev Jain wrote:
> For the MM_CP_PROT_NUMA skipping case, observe that, if we skip an
> iteration due to the underlying folio satisfying any of the skip
> conditions, then for all subsequent ptes which map the same folio, the
> iteration will be skipped for them too. Therefore, we can optimize
> by using folio_pte_batch() to batch skip the iterations.
>
> Use prot_numa_skip() introduced in the previous patch to determine whether
> we need to skip the iteration. Change its signature to have a double
> pointer to a folio, which will be used by mprotect_folio_pte_batch() to
> determine the number of iterations we can safely skip.

On reflection I don't think this is too bad, it seems neat in this version
at least :)

>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Overall looks good to me, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/mprotect.c | 55 +++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 42 insertions(+), 13 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 2a9c73bd0778..97adc62c50ab 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,28 +83,43 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return pte_dirty(pte);
>  }
>
> +static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> +				    pte_t pte, int max_nr_ptes)
> +{
> +	/* No underlying folio, so cannot batch */
> +	if (!folio)
> +		return 1;
> +
> +	if (!folio_test_large(folio))
> +		return 1;
> +
> +	return folio_pte_batch(folio, ptep, pte, max_nr_ptes);
> +}
> +
>  static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
> -			   pte_t oldpte, pte_t *pte, int target_node)
> +			   pte_t oldpte, pte_t *pte, int target_node,
> +			   struct folio **foliop)
>  {
> -	struct folio *folio;
> +	struct folio *folio = NULL;
> +	bool ret = true;
>  	bool toptier;
>  	int nid;
>
>  	/* Avoid TLB flush if possible */
>  	if (pte_protnone(oldpte))
> -		return true;
> +		goto skip;
>
>  	folio = vm_normal_folio(vma, addr, oldpte);
>  	if (!folio)
> -		return true;
> +		goto skip;
>
>  	if (folio_is_zone_device(folio) || folio_test_ksm(folio))
> -		return true;
> +		goto skip;
>
>  	/* Also skip shared copy-on-write pages */
>  	if (is_cow_mapping(vma->vm_flags) &&
>  	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
> -		return true;
> +		goto skip;
>
>  	/*
>  	 * While migration can move some dirty pages,
> @@ -112,7 +127,7 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>  	 * context.
>  	 */
>  	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
> -		return true;
> +		goto skip;
>
>  	/*
>  	 * Don't mess with PTEs if page is already on the node
> @@ -120,7 +135,7 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>  	 */
>  	nid = folio_nid(folio);
>  	if (target_node == nid)
> -		return true;
> +		goto skip;
>
>  	toptier = node_is_toptier(nid);
>
> @@ -129,11 +144,15 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>  	 * balancing is disabled
>  	 */
>  	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
> -		return true;
> +		goto skip;
>
> +	ret = false;
>  	if (folio_use_access_time(folio))
>  		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
> -	return false;
> +
> +skip:
> +	*foliop = folio;
> +	return ret;
>  }
>
>  static long change_pte_range(struct mmu_gather *tlb,
> @@ -147,6 +166,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	int nr_ptes;
>
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -161,8 +181,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>  	do {
> +		nr_ptes = 1;
>  		oldpte = ptep_get(pte);
>  		if (pte_present(oldpte)) {
> +			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
> +			struct folio *folio;
>  			pte_t ptent;
>
>  			/*
> @@ -170,9 +193,15 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			 * pages. See similar comment in change_huge_pmd.
>  			 */
>  			if (prot_numa) {
> -				if (prot_numa_skip(vma, addr, oldpte, pte,
> -						   target_node))
> +				int ret = prot_numa_skip(vma, addr, oldpte, pte,
> +							 target_node, &folio);
> +				if (ret) {
> +
> +					/* determine batch to skip */
> +					nr_ptes = mprotect_folio_pte_batch(folio,
> +						  pte, oldpte, max_nr_ptes);
>  					continue;
> +				}
>  			}
>
>  			oldpte = ptep_modify_prot_start(vma, addr, pte);
> @@ -289,7 +318,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				pages++;
>  			}
>  		}
> -	} while (pte++, addr += PAGE_SIZE, addr != end);
> +	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);
>
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 2/7] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-07-18  9:02 ` [PATCH v5 2/7] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
  2025-07-18 16:40   ` Lorenzo Stoakes
@ 2025-07-22 11:26   ` Ryan Roberts
  2025-07-23 14:25   ` Zi Yan
  2 siblings, 0 replies; 53+ messages in thread
From: Ryan Roberts @ 2025-07-22 11:26 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 18/07/2025 10:02, Dev Jain wrote:
> For the MM_CP_PROT_NUMA skipping case, observe that, if we skip an
> iteration due to the underlying folio satisfying any of the skip
> conditions, then for all subsequent ptes which map the same folio, the
> iteration will be skipped for them too. Therefore, we can optimize
> by using folio_pte_batch() to batch skip the iterations.
> 
> Use prot_numa_skip() introduced in the previous patch to determine whether
> we need to skip the iteration. Change its signature to have a double
> pointer to a folio, which will be used by mprotect_folio_pte_batch() to
> determine the number of iterations we can safely skip.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>

LGTM:

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  mm/mprotect.c | 55 +++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 42 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 2a9c73bd0778..97adc62c50ab 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -83,28 +83,43 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return pte_dirty(pte);
>  }
>  
> +static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> +				    pte_t pte, int max_nr_ptes)
> +{
> +	/* No underlying folio, so cannot batch */
> +	if (!folio)
> +		return 1;
> +
> +	if (!folio_test_large(folio))
> +		return 1;
> +
> +	return folio_pte_batch(folio, ptep, pte, max_nr_ptes);
> +}
> +
>  static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
> -			   pte_t oldpte, pte_t *pte, int target_node)
> +			   pte_t oldpte, pte_t *pte, int target_node,
> +			   struct folio **foliop)
>  {
> -	struct folio *folio;
> +	struct folio *folio = NULL;
> +	bool ret = true;
>  	bool toptier;
>  	int nid;
>  
>  	/* Avoid TLB flush if possible */
>  	if (pte_protnone(oldpte))
> -		return true;
> +		goto skip;
>  
>  	folio = vm_normal_folio(vma, addr, oldpte);
>  	if (!folio)
> -		return true;
> +		goto skip;
>  
>  	if (folio_is_zone_device(folio) || folio_test_ksm(folio))
> -		return true;
> +		goto skip;
>  
>  	/* Also skip shared copy-on-write pages */
>  	if (is_cow_mapping(vma->vm_flags) &&
>  	    (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
> -		return true;
> +		goto skip;
>  
>  	/*
>  	 * While migration can move some dirty pages,
> @@ -112,7 +127,7 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>  	 * context.
>  	 */
>  	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
> -		return true;
> +		goto skip;
>  
>  	/*
>  	 * Don't mess with PTEs if page is already on the node
> @@ -120,7 +135,7 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>  	 */
>  	nid = folio_nid(folio);
>  	if (target_node == nid)
> -		return true;
> +		goto skip;
>  
>  	toptier = node_is_toptier(nid);
>  
> @@ -129,11 +144,15 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>  	 * balancing is disabled
>  	 */
>  	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
> -		return true;
> +		goto skip;
>  
> +	ret = false;
>  	if (folio_use_access_time(folio))
>  		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
> -	return false;
> +
> +skip:
> +	*foliop = folio;
> +	return ret;
>  }
>  
>  static long change_pte_range(struct mmu_gather *tlb,
> @@ -147,6 +166,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	int nr_ptes;
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -161,8 +181,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>  	do {
> +		nr_ptes = 1;
>  		oldpte = ptep_get(pte);
>  		if (pte_present(oldpte)) {
> +			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
> +			struct folio *folio;
>  			pte_t ptent;
>  
>  			/*
> @@ -170,9 +193,15 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			 * pages. See similar comment in change_huge_pmd.
>  			 */
>  			if (prot_numa) {
> -				if (prot_numa_skip(vma, addr, oldpte, pte,
> -						   target_node))
> +				int ret = prot_numa_skip(vma, addr, oldpte, pte,
> +							 target_node, &folio);
> +				if (ret) {
> +
> +					/* determine batch to skip */
> +					nr_ptes = mprotect_folio_pte_batch(folio,
> +						  pte, oldpte, max_nr_ptes);
>  					continue;
> +				}
>  			}
>  
>  			oldpte = ptep_modify_prot_start(vma, addr, pte);
> @@ -289,7 +318,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				pages++;
>  			}
>  		}
> -	} while (pte++, addr += PAGE_SIZE, addr != end);
> +	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);
>  


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 2/7] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  2025-07-18  9:02 ` [PATCH v5 2/7] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
  2025-07-18 16:40   ` Lorenzo Stoakes
  2025-07-22 11:26   ` Ryan Roberts
@ 2025-07-23 14:25   ` Zi Yan
  2 siblings, 0 replies; 53+ messages in thread
From: Zi Yan @ 2025-07-23 14:25 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang

On 18 Jul 2025, at 5:02, Dev Jain wrote:

> For the MM_CP_PROT_NUMA skipping case, observe that, if we skip an
> iteration due to the underlying folio satisfying any of the skip
> conditions, then for all subsequent ptes which map the same folio, the
> iteration will be skipped for them too. Therefore, we can optimize
> by using folio_pte_batch() to batch skip the iterations.
>
> Use prot_numa_skip() introduced in the previous patch to determine whether
> we need to skip the iteration. Change its signature to have a double
> pointer to a folio, which will be used by mprotect_folio_pte_batch() to
> determine the number of iterations we can safely skip.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mprotect.c | 55 +++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 42 insertions(+), 13 deletions(-)
>
LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v5 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-07-18  9:02 [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
  2025-07-18  9:02 ` [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function Dev Jain
  2025-07-18  9:02 ` [PATCH v5 2/7] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
@ 2025-07-18  9:02 ` Dev Jain
  2025-07-18 17:05   ` Lorenzo Stoakes
                     ` (3 more replies)
  2025-07-18  9:02 ` [PATCH v5 4/7] mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure Dev Jain
                   ` (4 subsequent siblings)
  7 siblings, 4 replies; 53+ messages in thread
From: Dev Jain @ 2025-07-18  9:02 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect,
implementing them as a simple loop over the corresponding single pte
helpers. Architecture may override these helpers.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/pgtable.h | 84 ++++++++++++++++++++++++++++++++++++++++-
 mm/mprotect.c           |  4 +-
 2 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cf1515c163e2..e3b99920be05 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1331,7 +1331,9 @@ static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
 
 /*
  * Commit an update to a pte, leaving any hardware-controlled bits in
- * the PTE unmodified.
+ * the PTE unmodified. The pte returned from ptep_modify_prot_start() may
+ * additionally have young and/or dirty bits set where previously they were not,
+ * so the updated pte may have these additional changes.
  */
 static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 					   unsigned long addr,
@@ -1340,6 +1342,86 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 	__ptep_modify_prot_commit(vma, addr, ptep, pte);
 }
 #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
+
+/**
+ * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
+ * over a batch of ptes, which protects against asynchronous hardware
+ * modifications to the ptes. The intention is not to prevent the hardware from
+ * making pte updates, but to prevent any updates it may make from being lost.
+ * Please see the comment above ptep_modify_prot_start() for full description.
+ *
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
+ * in the batch.
+ *
+ * Note that PTE bits in the PTE batch besides the PFN can differ.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio. All other PTE bits must be identical for
+ * all PTEs in the batch except for young and dirty bits.  The PTEs are all in
+ * the same PMD.
+ */
+#ifndef modify_prot_start_ptes
+static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, unsigned int nr)
+{
+	pte_t pte, tmp_pte;
+
+	pte = ptep_modify_prot_start(vma, addr, ptep);
+	while (--nr) {
+		ptep++;
+		addr += PAGE_SIZE;
+		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
+		if (pte_dirty(tmp_pte))
+			pte = pte_mkdirty(pte);
+		if (pte_young(tmp_pte))
+			pte = pte_mkyoung(pte);
+	}
+	return pte;
+}
+#endif
+
+/**
+ * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
+ * hardware-controlled bits in the PTE unmodified.
+ *
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @old_pte: Old page table entry (for the first entry) which is now cleared.
+ * @pte: New page table entry to be set.
+ * @nr: Number of entries.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_modify_prot_commit().
+ *
+ * Context: The caller holds the page table lock. The PTEs are all in the same
+ * PMD. On exit, the set ptes in the batch map the same folio. The ptes set by
+ * ptep_modify_prot_start() may additionally have young and/or dirty bits set
+ * where previously they were not, so the updated ptes may have these
+ * additional changes.
+ */
+#ifndef modify_prot_commit_ptes
+static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE) {
+		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
+
+		/* Advance PFN only, set same prot */
+		old_pte = pte_next_pfn(old_pte);
+		pte = pte_next_pfn(pte);
+	}
+}
+#endif
+
 #endif /* CONFIG_MMU */
 
 /*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 97adc62c50ab..4977f198168e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -204,7 +204,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 				}
 			}
 
-			oldpte = ptep_modify_prot_start(vma, addr, pte);
+			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
 			ptent = pte_modify(oldpte, newprot);
 
 			if (uffd_wp)
@@ -230,7 +230,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 			    can_change_pte_writable(vma, addr, ptent))
 				ptent = pte_mkwrite(ptent, vma);
 
-			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
 			if (pte_needs_flush(oldpte, ptent))
 				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
 			pages++;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-07-18  9:02 ` [PATCH v5 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
@ 2025-07-18 17:05   ` Lorenzo Stoakes
  2025-07-20 23:59   ` Barry Song
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 17:05 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Fri, Jul 18, 2025 at 02:32:40PM +0530, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect,
> implementing them as a simple loop over the corresponding single pte
> helpers. Architecture may override these helpers.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/pgtable.h | 84 ++++++++++++++++++++++++++++++++++++++++-
>  mm/mprotect.c           |  4 +-
>  2 files changed, 85 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index cf1515c163e2..e3b99920be05 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1331,7 +1331,9 @@ static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>
>  /*
>   * Commit an update to a pte, leaving any hardware-controlled bits in
> - * the PTE unmodified.
> + * the PTE unmodified. The pte returned from ptep_modify_prot_start() may
> + * additionally have young and/or dirty bits set where previously they were not,
> + * so the updated pte may have these additional changes.
>   */
>  static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  					   unsigned long addr,
> @@ -1340,6 +1342,86 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>  }
>  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> +
> +/**
> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
> + * over a batch of ptes, which protects against asynchronous hardware
> + * modifications to the ptes. The intention is not to prevent the hardware from
> + * making pte updates, but to prevent any updates it may make from being lost.
> + * Please see the comment above ptep_modify_prot_start() for full description.
> + *
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
> + * in the batch.
> + *
> + * Note that PTE bits in the PTE batch besides the PFN can differ.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio. All other PTE bits must be identical for
> + * all PTEs in the batch except for young and dirty bits.  The PTEs are all in
> + * the same PMD.
> + */
> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> +	while (--nr) {
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
> +	}
> +	return pte;
> +}
> +#endif
> +
> +/**
> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
> + * hardware-controlled bits in the PTE unmodified.
> + *
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
> + * @pte: New page table entry to be set.
> + * @nr: Number of entries.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_modify_prot_commit().
> + *
> + * Context: The caller holds the page table lock. The PTEs are all in the same
> + * PMD. On exit, the set ptes in the batch map the same folio. The ptes set by
> + * ptep_modify_prot_start() may additionally have young and/or dirty bits set
> + * where previously they were not, so the updated ptes may have these
> + * additional changes.
> + */
> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE) {
> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> +
> +		/* Advance PFN only, set same prot */
> +		old_pte = pte_next_pfn(old_pte);
> +		pte = pte_next_pfn(pte);
> +	}
> +}
> +#endif
> +
>  #endif /* CONFIG_MMU */
>
>  /*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 97adc62c50ab..4977f198168e 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -204,7 +204,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				}
>  			}
>
> -			oldpte = ptep_modify_prot_start(vma, addr, pte);
> +			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>  			ptent = pte_modify(oldpte, newprot);
>
>  			if (uffd_wp)
> @@ -230,7 +230,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			    can_change_pte_writable(vma, addr, ptent))
>  				ptent = pte_mkwrite(ptent, vma);
>
> -			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> +			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>  			if (pte_needs_flush(oldpte, ptent))
>  				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>  			pages++;
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-07-18  9:02 ` [PATCH v5 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
  2025-07-18 17:05   ` Lorenzo Stoakes
@ 2025-07-20 23:59   ` Barry Song
  2025-07-22 11:35   ` Ryan Roberts
  2025-07-23 15:09   ` Zi Yan
  3 siblings, 0 replies; 53+ messages in thread
From: Barry Song @ 2025-07-20 23:59 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Fri, Jul 18, 2025 at 5:03 PM Dev Jain <dev.jain@arm.com> wrote:
>
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect,
> implementing them as a simple loop over the corresponding single pte
> helpers. Architecture may override these helpers.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Reviewed-by: Barry Song <baohua@kernel.org>

> ---
>  include/linux/pgtable.h | 84 ++++++++++++++++++++++++++++++++++++++++-
>  mm/mprotect.c           |  4 +-
>  2 files changed, 85 insertions(+), 3 deletions(-)
>

[...]

> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +               unsigned long addr, pte_t *ptep, unsigned int nr)
> +{
> +       pte_t pte, tmp_pte;
> +
> +       pte = ptep_modify_prot_start(vma, addr, ptep);
> +       while (--nr) {
> +               ptep++;
> +               addr += PAGE_SIZE;
> +               tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> +               if (pte_dirty(tmp_pte))
> +                       pte = pte_mkdirty(pte);
> +               if (pte_young(tmp_pte))
> +                       pte = pte_mkyoung(pte);

It might be interesting to explore whether a similar optimization
could apply here as well:
https://lore.kernel.org/linux-mm/20250624152549.2647828-1-xavier.qyxia@gmail.com/
I suspect it would, but it's probably not worth including in this
patch.

> +       }
> +       return pte;
> +}
> +#endif
> +

Thanks
Barry

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-07-18  9:02 ` [PATCH v5 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
  2025-07-18 17:05   ` Lorenzo Stoakes
  2025-07-20 23:59   ` Barry Song
@ 2025-07-22 11:35   ` Ryan Roberts
  2025-07-23 15:09   ` Zi Yan
  3 siblings, 0 replies; 53+ messages in thread
From: Ryan Roberts @ 2025-07-22 11:35 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 18/07/2025 10:02, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect,
> implementing them as a simple loop over the corresponding single pte
> helpers. Architecture may override these helpers.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  include/linux/pgtable.h | 84 ++++++++++++++++++++++++++++++++++++++++-
>  mm/mprotect.c           |  4 +-
>  2 files changed, 85 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index cf1515c163e2..e3b99920be05 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1331,7 +1331,9 @@ static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>  
>  /*
>   * Commit an update to a pte, leaving any hardware-controlled bits in
> - * the PTE unmodified.
> + * the PTE unmodified. The pte returned from ptep_modify_prot_start() may
> + * additionally have young and/or dirty bits set where previously they were not,
> + * so the updated pte may have these additional changes.

nit: I still find this difficult to parse (Although I expect you might tell me
that this is the text I suggested last time around :) ). I think you mean that
"it is permissable for young and/or dirty bits to be set in old_pte, despite
beling clear when originally returned by ptep_modify_prot_start()".

Anyway, no big deal. I think we all know what it's getting at.

Thanks,
Ryan


>   */
>  static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  					   unsigned long addr,
> @@ -1340,6 +1342,86 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>  }
>  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> +
> +/**
> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
> + * over a batch of ptes, which protects against asynchronous hardware
> + * modifications to the ptes. The intention is not to prevent the hardware from
> + * making pte updates, but to prevent any updates it may make from being lost.
> + * Please see the comment above ptep_modify_prot_start() for full description.
> + *
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_modify_prot_start(), collecting the a/d bits from each pte
> + * in the batch.
> + *
> + * Note that PTE bits in the PTE batch besides the PFN can differ.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio. All other PTE bits must be identical for
> + * all PTEs in the batch except for young and dirty bits.  The PTEs are all in
> + * the same PMD.
> + */
> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> +	while (--nr) {
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
> +	}
> +	return pte;
> +}
> +#endif
> +
> +/**
> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
> + * hardware-controlled bits in the PTE unmodified.
> + *
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @old_pte: Old page table entry (for the first entry) which is now cleared.
> + * @pte: New page table entry to be set.
> + * @nr: Number of entries.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_modify_prot_commit().
> + *
> + * Context: The caller holds the page table lock. The PTEs are all in the same
> + * PMD. On exit, the set ptes in the batch map the same folio. The ptes set by
> + * ptep_modify_prot_start() may additionally have young and/or dirty bits set
> + * where previously they were not, so the updated ptes may have these
> + * additional changes.
> + */
> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE) {
> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> +
> +		/* Advance PFN only, set same prot */
> +		old_pte = pte_next_pfn(old_pte);
> +		pte = pte_next_pfn(pte);
> +	}
> +}
> +#endif
> +
>  #endif /* CONFIG_MMU */
>  
>  /*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 97adc62c50ab..4977f198168e 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -204,7 +204,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				}
>  			}
>  
> -			oldpte = ptep_modify_prot_start(vma, addr, pte);
> +			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>  			ptent = pte_modify(oldpte, newprot);
>  
>  			if (uffd_wp)
> @@ -230,7 +230,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			    can_change_pte_writable(vma, addr, ptent))
>  				ptent = pte_mkwrite(ptent, vma);
>  
> -			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> +			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>  			if (pte_needs_flush(oldpte, ptent))
>  				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>  			pages++;


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
  2025-07-18  9:02 ` [PATCH v5 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
                     ` (2 preceding siblings ...)
  2025-07-22 11:35   ` Ryan Roberts
@ 2025-07-23 15:09   ` Zi Yan
  3 siblings, 0 replies; 53+ messages in thread
From: Zi Yan @ 2025-07-23 15:09 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang

On 18 Jul 2025, at 5:02, Dev Jain wrote:

> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect,
> implementing them as a simple loop over the corresponding single pte
> helpers. Architecture may override these helpers.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pgtable.h | 84 ++++++++++++++++++++++++++++++++++++++++-
>  mm/mprotect.c           |  4 +-
>  2 files changed, 85 insertions(+), 3 deletions(-)
>

LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v5 4/7] mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure
  2025-07-18  9:02 [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
                   ` (2 preceding siblings ...)
  2025-07-18  9:02 ` [PATCH v5 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
@ 2025-07-18  9:02 ` Dev Jain
  2025-07-18 17:12   ` Lorenzo Stoakes
                     ` (2 more replies)
  2025-07-18  9:02 ` [PATCH v5 5/7] mm: Split can_change_pte_writable() into private and shared parts Dev Jain
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 53+ messages in thread
From: Dev Jain @ 2025-07-18  9:02 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Patch 6 optimizes mprotect() by batch clearing the ptes, masking in the new
protections, and batch setting the ptes. Suppose that the first pte
of the batch is writable - with the current implementation of
folio_pte_batch(), it is not guaranteed that the other ptes in the batch
are already writable too, so we may incorrectly end up setting the
writable bit on all ptes via modify_prot_commit_ptes().

Therefore, introduce FPB_RESPECT_WRITE so that all ptes in the batch
are writable or not.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/internal.h | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 5b0f71e5434b..28d2d5b051df 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -208,17 +208,20 @@ typedef int __bitwise fpb_t;
 /* Compare PTEs respecting the soft-dirty bit. */
 #define FPB_RESPECT_SOFT_DIRTY		((__force fpb_t)BIT(1))
 
+/* Compare PTEs respecting the writable bit. */
+#define FPB_RESPECT_WRITE		((__force fpb_t)BIT(2))
+
 /*
  * Merge PTE write bits: if any PTE in the batch is writable, modify the
  * PTE at @ptentp to be writable.
  */
-#define FPB_MERGE_WRITE			((__force fpb_t)BIT(2))
+#define FPB_MERGE_WRITE			((__force fpb_t)BIT(3))
 
 /*
  * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty,
  * modify the PTE at @ptentp to be young or dirty, respectively.
  */
-#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(3))
+#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(4))
 
 static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
 {
@@ -226,7 +229,9 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
 		pte = pte_mkclean(pte);
 	if (likely(!(flags & FPB_RESPECT_SOFT_DIRTY)))
 		pte = pte_clear_soft_dirty(pte);
-	return pte_wrprotect(pte_mkold(pte));
+	if (likely(!(flags & FPB_RESPECT_WRITE)))
+		pte = pte_wrprotect(pte);
+	return pte_mkold(pte);
 }
 
 /**
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 4/7] mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure
  2025-07-18  9:02 ` [PATCH v5 4/7] mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure Dev Jain
@ 2025-07-18 17:12   ` Lorenzo Stoakes
  2025-07-22 11:37   ` Ryan Roberts
  2025-07-23 15:28   ` Zi Yan
  2 siblings, 0 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 17:12 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Fri, Jul 18, 2025 at 02:32:41PM +0530, Dev Jain wrote:
> Patch 6 optimizes mprotect() by batch clearing the ptes, masking in the new
> protections, and batch setting the ptes. Suppose that the first pte
> of the batch is writable - with the current implementation of
> folio_pte_batch(), it is not guaranteed that the other ptes in the batch
> are already writable too, so we may incorrectly end up setting the
> writable bit on all ptes via modify_prot_commit_ptes().
>
> Therefore, introduce FPB_RESPECT_WRITE so that all ptes in the batch
> are writable or not.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/internal.h | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 5b0f71e5434b..28d2d5b051df 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -208,17 +208,20 @@ typedef int __bitwise fpb_t;
>  /* Compare PTEs respecting the soft-dirty bit. */
>  #define FPB_RESPECT_SOFT_DIRTY		((__force fpb_t)BIT(1))
>
> +/* Compare PTEs respecting the writable bit. */
> +#define FPB_RESPECT_WRITE		((__force fpb_t)BIT(2))
> +
>  /*
>   * Merge PTE write bits: if any PTE in the batch is writable, modify the
>   * PTE at @ptentp to be writable.
>   */
> -#define FPB_MERGE_WRITE			((__force fpb_t)BIT(2))
> +#define FPB_MERGE_WRITE			((__force fpb_t)BIT(3))
>
>  /*
>   * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty,
>   * modify the PTE at @ptentp to be young or dirty, respectively.
>   */
> -#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(3))
> +#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(4))
>
>  static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  {
> @@ -226,7 +229,9 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  		pte = pte_mkclean(pte);
>  	if (likely(!(flags & FPB_RESPECT_SOFT_DIRTY)))
>  		pte = pte_clear_soft_dirty(pte);
> -	return pte_wrprotect(pte_mkold(pte));
> +	if (likely(!(flags & FPB_RESPECT_WRITE)))
> +		pte = pte_wrprotect(pte);
> +	return pte_mkold(pte);
>  }
>
>  /**
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 4/7] mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure
  2025-07-18  9:02 ` [PATCH v5 4/7] mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure Dev Jain
  2025-07-18 17:12   ` Lorenzo Stoakes
@ 2025-07-22 11:37   ` Ryan Roberts
  2025-07-23 15:28   ` Zi Yan
  2 siblings, 0 replies; 53+ messages in thread
From: Ryan Roberts @ 2025-07-22 11:37 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: david, willy, linux-mm, linux-kernel, catalin.marinas, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On 18/07/2025 10:02, Dev Jain wrote:
> Patch 6 optimizes mprotect() by batch clearing the ptes, masking in the new
> protections, and batch setting the ptes. Suppose that the first pte
> of the batch is writable - with the current implementation of
> folio_pte_batch(), it is not guaranteed that the other ptes in the batch
> are already writable too, so we may incorrectly end up setting the
> writable bit on all ptes via modify_prot_commit_ptes().
> 
> Therefore, introduce FPB_RESPECT_WRITE so that all ptes in the batch
> are writable or not.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  mm/internal.h | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 5b0f71e5434b..28d2d5b051df 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -208,17 +208,20 @@ typedef int __bitwise fpb_t;
>  /* Compare PTEs respecting the soft-dirty bit. */
>  #define FPB_RESPECT_SOFT_DIRTY		((__force fpb_t)BIT(1))
>  
> +/* Compare PTEs respecting the writable bit. */
> +#define FPB_RESPECT_WRITE		((__force fpb_t)BIT(2))
> +
>  /*
>   * Merge PTE write bits: if any PTE in the batch is writable, modify the
>   * PTE at @ptentp to be writable.
>   */
> -#define FPB_MERGE_WRITE			((__force fpb_t)BIT(2))
> +#define FPB_MERGE_WRITE			((__force fpb_t)BIT(3))
>  
>  /*
>   * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty,
>   * modify the PTE at @ptentp to be young or dirty, respectively.
>   */
> -#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(3))
> +#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(4))
>  
>  static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  {
> @@ -226,7 +229,9 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  		pte = pte_mkclean(pte);
>  	if (likely(!(flags & FPB_RESPECT_SOFT_DIRTY)))
>  		pte = pte_clear_soft_dirty(pte);
> -	return pte_wrprotect(pte_mkold(pte));
> +	if (likely(!(flags & FPB_RESPECT_WRITE)))
> +		pte = pte_wrprotect(pte);
> +	return pte_mkold(pte);
>  }
>  
>  /**


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 4/7] mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure
  2025-07-18  9:02 ` [PATCH v5 4/7] mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure Dev Jain
  2025-07-18 17:12   ` Lorenzo Stoakes
  2025-07-22 11:37   ` Ryan Roberts
@ 2025-07-23 15:28   ` Zi Yan
  2025-07-23 15:32     ` Dev Jain
  2 siblings, 1 reply; 53+ messages in thread
From: Zi Yan @ 2025-07-23 15:28 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang

On 18 Jul 2025, at 5:02, Dev Jain wrote:

> Patch 6 optimizes mprotect() by batch clearing the ptes, masking in the new

“Patch 6” might not make sense when reading it in the git log. Something like
below might be better:

mprotect() will be optimized by batch clearing the ptes, masking in the new
protections, and batch setting the ptes in an upcoming commit.

No need to repin for this one.

> protections, and batch setting the ptes. Suppose that the first pte
> of the batch is writable - with the current implementation of
> folio_pte_batch(), it is not guaranteed that the other ptes in the batch
> are already writable too, so we may incorrectly end up setting the
> writable bit on all ptes via modify_prot_commit_ptes().
>
> Therefore, introduce FPB_RESPECT_WRITE so that all ptes in the batch
> are writable or not.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/internal.h | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>

LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 4/7] mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure
  2025-07-23 15:28   ` Zi Yan
@ 2025-07-23 15:32     ` Dev Jain
  0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-07-23 15:32 UTC (permalink / raw)
  To: Zi Yan
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang


On 23/07/25 8:58 pm, Zi Yan wrote:
> On 18 Jul 2025, at 5:02, Dev Jain wrote:
>
>> Patch 6 optimizes mprotect() by batch clearing the ptes, masking in the new
> “Patch 6” might not make sense when reading it in the git log. Something like
> below might be better:

Andrew has fixed that for me :)

>
> mprotect() will be optimized by batch clearing the ptes, masking in the new
> protections, and batch setting the ptes in an upcoming commit.
>
> No need to repin for this one.
>
>> protections, and batch setting the ptes. Suppose that the first pte
>> of the batch is writable - with the current implementation of
>> folio_pte_batch(), it is not guaranteed that the other ptes in the batch
>> are already writable too, so we may incorrectly end up setting the
>> writable bit on all ptes via modify_prot_commit_ptes().
>>
>> Therefore, introduce FPB_RESPECT_WRITE so that all ptes in the batch
>> are writable or not.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/internal.h | 11 ++++++++---
>>   1 file changed, 8 insertions(+), 3 deletions(-)
>>
> LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

Thanks.

>
> Best Regards,
> Yan, Zi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v5 5/7] mm: Split can_change_pte_writable() into private and shared parts
  2025-07-18  9:02 [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
                   ` (3 preceding siblings ...)
  2025-07-18  9:02 ` [PATCH v5 4/7] mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure Dev Jain
@ 2025-07-18  9:02 ` Dev Jain
  2025-07-18 17:27   ` Lorenzo Stoakes
  2025-07-23 15:40   ` Zi Yan
  2025-07-18  9:02 ` [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching Dev Jain
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 53+ messages in thread
From: Dev Jain @ 2025-07-18  9:02 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

In preparation for patch 6 and modularizing the code in general, split
can_change_pte_writable() into private and shared VMA parts. No functional
change intended.

Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mprotect.c | 50 ++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 36 insertions(+), 14 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4977f198168e..a1c7d8a4648d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -40,11 +40,8 @@
 
 #include "internal.h"
 
-bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t pte)
+static bool maybe_change_pte_writable(struct vm_area_struct *vma, pte_t pte)
 {
-	struct page *page;
-
 	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
 		return false;
 
@@ -60,16 +57,32 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	if (userfaultfd_pte_wp(vma, pte))
 		return false;
 
-	if (!(vma->vm_flags & VM_SHARED)) {
-		/*
-		 * Writable MAP_PRIVATE mapping: We can only special-case on
-		 * exclusive anonymous pages, because we know that our
-		 * write-fault handler similarly would map them writable without
-		 * any additional checks while holding the PT lock.
-		 */
-		page = vm_normal_page(vma, addr, pte);
-		return page && PageAnon(page) && PageAnonExclusive(page);
-	}
+	return true;
+}
+
+static bool can_change_private_pte_writable(struct vm_area_struct *vma,
+					    unsigned long addr, pte_t pte)
+{
+	struct page *page;
+
+	if (!maybe_change_pte_writable(vma, pte))
+		return false;
+
+	/*
+	 * Writable MAP_PRIVATE mapping: We can only special-case on
+	 * exclusive anonymous pages, because we know that our
+	 * write-fault handler similarly would map them writable without
+	 * any additional checks while holding the PT lock.
+	 */
+	page = vm_normal_page(vma, addr, pte);
+	return page && PageAnon(page) && PageAnonExclusive(page);
+}
+
+static bool can_change_shared_pte_writable(struct vm_area_struct *vma,
+					   pte_t pte)
+{
+	if (!maybe_change_pte_writable(vma, pte))
+		return false;
 
 	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
 
@@ -83,6 +96,15 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	return pte_dirty(pte);
 }
 
+bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte)
+{
+	if (!(vma->vm_flags & VM_SHARED))
+		return can_change_private_pte_writable(vma, addr, pte);
+
+	return can_change_shared_pte_writable(vma, pte);
+}
+
 static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
 				    pte_t pte, int max_nr_ptes)
 {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 5/7] mm: Split can_change_pte_writable() into private and shared parts
  2025-07-18  9:02 ` [PATCH v5 5/7] mm: Split can_change_pte_writable() into private and shared parts Dev Jain
@ 2025-07-18 17:27   ` Lorenzo Stoakes
  2025-07-23 15:40   ` Zi Yan
  1 sibling, 0 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 17:27 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Fri, Jul 18, 2025 at 02:32:42PM +0530, Dev Jain wrote:
> In preparation for patch 6 and modularizing the code in general, split
> can_change_pte_writable() into private and shared VMA parts. No functional
> change intended.
>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Great thanks! This is much clearer I think (of course being the
Suggested-by here makes me somewhat biased :P) :>)

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/mprotect.c | 50 ++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 36 insertions(+), 14 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 4977f198168e..a1c7d8a4648d 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -40,11 +40,8 @@
>
>  #include "internal.h"
>
> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte)
> +static bool maybe_change_pte_writable(struct vm_area_struct *vma, pte_t pte)
>  {
> -	struct page *page;
> -
>  	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
>  		return false;
>
> @@ -60,16 +57,32 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	if (userfaultfd_pte_wp(vma, pte))
>  		return false;
>
> -	if (!(vma->vm_flags & VM_SHARED)) {
> -		/*
> -		 * Writable MAP_PRIVATE mapping: We can only special-case on
> -		 * exclusive anonymous pages, because we know that our
> -		 * write-fault handler similarly would map them writable without
> -		 * any additional checks while holding the PT lock.
> -		 */
> -		page = vm_normal_page(vma, addr, pte);
> -		return page && PageAnon(page) && PageAnonExclusive(page);
> -	}
> +	return true;
> +}
> +
> +static bool can_change_private_pte_writable(struct vm_area_struct *vma,
> +					    unsigned long addr, pte_t pte)
> +{
> +	struct page *page;
> +
> +	if (!maybe_change_pte_writable(vma, pte))
> +		return false;
> +
> +	/*
> +	 * Writable MAP_PRIVATE mapping: We can only special-case on
> +	 * exclusive anonymous pages, because we know that our
> +	 * write-fault handler similarly would map them writable without
> +	 * any additional checks while holding the PT lock.
> +	 */
> +	page = vm_normal_page(vma, addr, pte);
> +	return page && PageAnon(page) && PageAnonExclusive(page);
> +}
> +
> +static bool can_change_shared_pte_writable(struct vm_area_struct *vma,
> +					   pte_t pte)
> +{
> +	if (!maybe_change_pte_writable(vma, pte))
> +		return false;
>
>  	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
>
> @@ -83,6 +96,15 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return pte_dirty(pte);
>  }
>
> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> +			     pte_t pte)
> +{
> +	if (!(vma->vm_flags & VM_SHARED))
> +		return can_change_private_pte_writable(vma, addr, pte);
> +
> +	return can_change_shared_pte_writable(vma, pte);
> +}
> +
>  static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>  				    pte_t pte, int max_nr_ptes)
>  {
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 5/7] mm: Split can_change_pte_writable() into private and shared parts
  2025-07-18  9:02 ` [PATCH v5 5/7] mm: Split can_change_pte_writable() into private and shared parts Dev Jain
  2025-07-18 17:27   ` Lorenzo Stoakes
@ 2025-07-23 15:40   ` Zi Yan
  1 sibling, 0 replies; 53+ messages in thread
From: Zi Yan @ 2025-07-23 15:40 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang

On 18 Jul 2025, at 5:02, Dev Jain wrote:

> In preparation for patch 6 and modularizing the code in general, split
> can_change_pte_writable() into private and shared VMA parts. No functional
> change intended.
>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mprotect.c | 50 ++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 36 insertions(+), 14 deletions(-)
>

LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-07-18  9:02 [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
                   ` (4 preceding siblings ...)
  2025-07-18  9:02 ` [PATCH v5 5/7] mm: Split can_change_pte_writable() into private and shared parts Dev Jain
@ 2025-07-18  9:02 ` Dev Jain
  2025-07-18 18:49   ` Lorenzo Stoakes
                     ` (2 more replies)
  2025-07-18  9:02 ` [PATCH v5 7/7] arm64: Add batched versions of ptep_modify_prot_start/commit Dev Jain
  2025-07-18  9:50 ` [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
  7 siblings, 3 replies; 53+ messages in thread
From: Dev Jain @ 2025-07-18  9:02 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Use folio_pte_batch to batch process a large folio. Note that, PTE
batching here will save a few function calls, and this strategy in certain
cases (not this one) batches atomic operations in general, so we have
a performance win for all arches. This patch paves the way for patch 7
which will help us elide the TLBI per contig block on arm64.

The correctness of this patch lies on the correctness of setting the
new ptes based upon information only from the first pte of the batch
(which may also have accumulated a/d bits via modify_prot_start_ptes()).

Observe that the flag combination we pass to mprotect_folio_pte_batch()
guarantees that the batch is uniform w.r.t the soft-dirty bit and the
writable bit. Therefore, the only bits which may differ are the a/d bits.
So we only need to worry about code which is concerned about the a/d bits
of the PTEs.

Setting extra a/d bits on the new ptes where previously they were not set,
is fine - setting access bit when it was not set is not an incorrectness
problem but will only possibly delay the reclaim of the page mapped by
the pte (which is in fact intended because the kernel just operated on this
region via mprotect()!). Setting dirty bit when it was not set is again
not an incorrectness problem but will only possibly force an unnecessary
writeback.

So now we need to reason whether something can go wrong via
can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp,
and userfaultfd_pte_wp cases are solved due to uniformity in the
corresponding bits guaranteed by the flag combination. The ptes all
belong to the same VMA (since callers guarantee that [start, end) will
lie within the VMA) therefore the conditional based on the VMA is also
safe to batch around.

Since the dirty bit on the PTE really is just an indication that the folio
got written to - even if the PTE is not actually dirty but one of the PTEs
in the batch is, the wp-fault optimization can be made. Therefore, it is
safe to batch around pte_dirty() in can_change_shared_pte_writable()
(in fact this is better since without batching, it may happen that
some ptes aren't changed to writable just because they are not dirty,
even though the other ptes mapping the same large folio are dirty).

To batch around the PageAnonExclusive case, we must check the corresponding
condition for every single page. Therefore, from the large folio batch,
we process sub batches of ptes mapping pages with the same
PageAnonExclusive condition, and process that sub batch, then determine
and process the next sub batch, and so on. Note that this does not cause
any extra overhead; if suppose the size of the folio batch is 512, then
the sub batch processing in total will take 512 iterations, which is the
same as what we would have done before.

For pte_needs_flush():

ppc does not care about the a/d bits.

For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits
get cleared; since we can only have extra a/d bits due to batching,
we will only have an extra flush, not a case where we elide a flush due
to batching when we shouldn't have.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 113 insertions(+), 12 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index a1c7d8a4648d..2ddd37b2f462 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 }
 
 static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
-				    pte_t pte, int max_nr_ptes)
+				    pte_t pte, int max_nr_ptes, fpb_t flags)
 {
 	/* No underlying folio, so cannot batch */
 	if (!folio)
@@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
 	if (!folio_test_large(folio))
 		return 1;
 
-	return folio_pte_batch(folio, ptep, pte, max_nr_ptes);
+	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
 }
 
 static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
@@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
 	return ret;
 }
 
+/* Set nr_ptes number of ptes, starting from idx */
+static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
+		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
+		int idx, bool set_write, struct mmu_gather *tlb)
+{
+	/*
+	 * Advance the position in the batch by idx; note that if idx > 0,
+	 * then the nr_ptes passed here is <= batch size - idx.
+	 */
+	addr += idx * PAGE_SIZE;
+	ptep += idx;
+	oldpte = pte_advance_pfn(oldpte, idx);
+	ptent = pte_advance_pfn(ptent, idx);
+
+	if (set_write)
+		ptent = pte_mkwrite(ptent, vma);
+
+	modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes);
+	if (pte_needs_flush(oldpte, ptent))
+		tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
+}
+
+/*
+ * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or
+ * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
+ * that the ptes point to consecutive pages of the same anon large folio.
+ */
+static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
+		struct page *first_page, bool expected_anon_exclusive)
+{
+	int idx;
+
+	for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
+		if (expected_anon_exclusive != PageAnonExclusive(first_page + idx))
+			break;
+	}
+	return idx - start_idx;
+}
+
+/*
+ * This function is a result of trying our very best to retain the
+ * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
+ * if the vma is a private vma, and we cannot determine whether to change
+ * the pte to writable just from the vma and the pte, we then need to look
+ * at the actual page pointed to by the pte. Unfortunately, if we have a
+ * batch of ptes pointing to consecutive pages of the same anon large folio,
+ * the anon-exclusivity (or the negation) of the first page does not guarantee
+ * the anon-exclusivity (or the negation) of the other pages corresponding to
+ * the pte batch; hence in this case it is incorrect to decide to change or
+ * not change the ptes to writable just by using information from the first
+ * pte of the batch. Therefore, we must individually check all pages and
+ * retrieve sub-batches.
+ */
+static void commit_anon_folio_batch(struct vm_area_struct *vma,
+		struct folio *folio, unsigned long addr, pte_t *ptep,
+		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
+{
+	struct page *first_page = folio_page(folio, 0);
+	bool expected_anon_exclusive;
+	int sub_batch_idx = 0;
+	int len;
+
+	while (nr_ptes) {
+		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
+		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
+					first_page, expected_anon_exclusive);
+		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
+				       sub_batch_idx, expected_anon_exclusive, tlb);
+		sub_batch_idx += len;
+		nr_ptes -= len;
+	}
+}
+
+static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
+		struct folio *folio, unsigned long addr, pte_t *ptep,
+		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
+{
+	bool set_write;
+
+	if (vma->vm_flags & VM_SHARED) {
+		set_write = can_change_shared_pte_writable(vma, ptent);
+		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
+				       /* idx = */ 0, set_write, tlb);
+		return;
+	}
+
+	set_write = maybe_change_pte_writable(vma, ptent) &&
+		    (folio && folio_test_anon(folio));
+	if (!set_write) {
+		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
+				       /* idx = */ 0, set_write, tlb);
+		return;
+	}
+	commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb);
+}
+
 static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb,
 		nr_ptes = 1;
 		oldpte = ptep_get(pte);
 		if (pte_present(oldpte)) {
+			const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
 			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
-			struct folio *folio;
+			struct folio *folio = NULL;
 			pte_t ptent;
 
 			/*
@@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb,
 
 					/* determine batch to skip */
 					nr_ptes = mprotect_folio_pte_batch(folio,
-						  pte, oldpte, max_nr_ptes);
+						  pte, oldpte, max_nr_ptes, /* flags = */ 0);
 					continue;
 				}
 			}
 
+			if (!folio)
+				folio = vm_normal_folio(vma, addr, oldpte);
+
+			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
+
 			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
 			ptent = pte_modify(oldpte, newprot);
 
@@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb,
 			 * COW or special handling is required.
 			 */
 			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
-			    !pte_write(ptent) &&
-			    can_change_pte_writable(vma, addr, ptent))
-				ptent = pte_mkwrite(ptent, vma);
-
-			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
-			if (pte_needs_flush(oldpte, ptent))
-				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
-			pages++;
+			     !pte_write(ptent))
+				set_write_prot_commit_flush_ptes(vma, folio,
+				addr, pte, oldpte, ptent, nr_ptes, tlb);
+			else
+				prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
+					nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
+			pages += nr_ptes;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 			pte_t newpte;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-07-18  9:02 ` [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching Dev Jain
@ 2025-07-18 18:49   ` Lorenzo Stoakes
  2025-07-19 13:46     ` Dev Jain
  2025-07-24 19:55   ` Zi Yan
  2025-08-06  8:08   ` David Hildenbrand
  2 siblings, 1 reply; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 18:49 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Fri, Jul 18, 2025 at 02:32:43PM +0530, Dev Jain wrote:
> Use folio_pte_batch to batch process a large folio. Note that, PTE
> batching here will save a few function calls, and this strategy in certain
> cases (not this one) batches atomic operations in general, so we have
> a performance win for all arches. This patch paves the way for patch 7
> which will help us elide the TLBI per contig block on arm64.
>
> The correctness of this patch lies on the correctness of setting the
> new ptes based upon information only from the first pte of the batch
> (which may also have accumulated a/d bits via modify_prot_start_ptes()).
>
> Observe that the flag combination we pass to mprotect_folio_pte_batch()
> guarantees that the batch is uniform w.r.t the soft-dirty bit and the
> writable bit. Therefore, the only bits which may differ are the a/d bits.
> So we only need to worry about code which is concerned about the a/d bits
> of the PTEs.
>
> Setting extra a/d bits on the new ptes where previously they were not set,
> is fine - setting access bit when it was not set is not an incorrectness
> problem but will only possibly delay the reclaim of the page mapped by
> the pte (which is in fact intended because the kernel just operated on this
> region via mprotect()!). Setting dirty bit when it was not set is again
> not an incorrectness problem but will only possibly force an unnecessary
> writeback.
>
> So now we need to reason whether something can go wrong via
> can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp,
> and userfaultfd_pte_wp cases are solved due to uniformity in the
> corresponding bits guaranteed by the flag combination. The ptes all
> belong to the same VMA (since callers guarantee that [start, end) will
> lie within the VMA) therefore the conditional based on the VMA is also
> safe to batch around.
>
> Since the dirty bit on the PTE really is just an indication that the folio
> got written to - even if the PTE is not actually dirty but one of the PTEs
> in the batch is, the wp-fault optimization can be made. Therefore, it is
> safe to batch around pte_dirty() in can_change_shared_pte_writable()
> (in fact this is better since without batching, it may happen that
> some ptes aren't changed to writable just because they are not dirty,
> even though the other ptes mapping the same large folio are dirty).
>
> To batch around the PageAnonExclusive case, we must check the corresponding
> condition for every single page. Therefore, from the large folio batch,
> we process sub batches of ptes mapping pages with the same
> PageAnonExclusive condition, and process that sub batch, then determine
> and process the next sub batch, and so on. Note that this does not cause
> any extra overhead; if suppose the size of the folio batch is 512, then
> the sub batch processing in total will take 512 iterations, which is the
> same as what we would have done before.
>
> For pte_needs_flush():
>
> ppc does not care about the a/d bits.
>
> For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits
> get cleared; since we can only have extra a/d bits due to batching,
> we will only have an extra flush, not a case where we elide a flush due
> to batching when we shouldn't have.
>

Thanks for great commit message!

> Signed-off-by: Dev Jain <dev.jain@arm.com>

This is looking MUCH better :) Thanks!

Some nits below, but I've gone through this carefully and can't find
anything that seems obviously wrong here, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 113 insertions(+), 12 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index a1c7d8a4648d..2ddd37b2f462 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  }
>
>  static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> -				    pte_t pte, int max_nr_ptes)
> +				    pte_t pte, int max_nr_ptes, fpb_t flags)
>  {
>  	/* No underlying folio, so cannot batch */
>  	if (!folio)
> @@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>  	if (!folio_test_large(folio))
>  		return 1;
>
> -	return folio_pte_batch(folio, ptep, pte, max_nr_ptes);
> +	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
>  }
>
>  static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
> @@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>  	return ret;
>  }
>
> +/* Set nr_ptes number of ptes, starting from idx */
> +static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
> +		int idx, bool set_write, struct mmu_gather *tlb)
> +{
> +	/*
> +	 * Advance the position in the batch by idx; note that if idx > 0,
> +	 * then the nr_ptes passed here is <= batch size - idx.
> +	 */
> +	addr += idx * PAGE_SIZE;
> +	ptep += idx;
> +	oldpte = pte_advance_pfn(oldpte, idx);
> +	ptent = pte_advance_pfn(ptent, idx);
> +
> +	if (set_write)
> +		ptent = pte_mkwrite(ptent, vma);
> +
> +	modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes);
> +	if (pte_needs_flush(oldpte, ptent))
> +		tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
> +}
> +
> +/*
> + * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or
> + * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
> + * that the ptes point to consecutive pages of the same anon large folio.
> + */
> +static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
> +		struct page *first_page, bool expected_anon_exclusive)
> +{
> +	int idx;

Nit but:

	int end = start_idx + max_len;

	for (idx = start_idx + 1; idx < end; idx++) {

Would be a little neater here.

> +
> +	for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {

Nitty again but the below might be a little clearer?

	struct page *page = &firstpage[idx];

	if (expected_anon_exclusive != PageAnonExclusive(page))


> +		if (expected_anon_exclusive != PageAnonExclusive(first_page + idx))
> +			break;
> +	}
> +	return idx - start_idx;
> +}
> +
> +/*
> + * This function is a result of trying our very best to retain the
> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
> + * if the vma is a private vma, and we cannot determine whether to change
> + * the pte to writable just from the vma and the pte, we then need to look
> + * at the actual page pointed to by the pte. Unfortunately, if we have a
> + * batch of ptes pointing to consecutive pages of the same anon large folio,
> + * the anon-exclusivity (or the negation) of the first page does not guarantee
> + * the anon-exclusivity (or the negation) of the other pages corresponding to
> + * the pte batch; hence in this case it is incorrect to decide to change or
> + * not change the ptes to writable just by using information from the first
> + * pte of the batch. Therefore, we must individually check all pages and
> + * retrieve sub-batches.
> + */

Nice comment thanks.

> +static void commit_anon_folio_batch(struct vm_area_struct *vma,
> +		struct folio *folio, unsigned long addr, pte_t *ptep,
> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
> +{
> +	struct page *first_page = folio_page(folio, 0);
> +	bool expected_anon_exclusive;
> +	int sub_batch_idx = 0;
> +	int len;
> +
> +	while (nr_ptes) {

I'd prefer this to be:

	int i;

	...

	for (i = 0; i < nr_ptes; i += len, sub_batch_idx += len) {

> +		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);

Nit but would prefer:

		struct page *page = &first_page[sub_batch_idx];

		expected_anon_exclusive = PageAnonExclusive(page);

> +		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
> +					first_page, expected_anon_exclusive);
> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
> +				       sub_batch_idx, expected_anon_exclusive, tlb);
> +		sub_batch_idx += len;
> +		nr_ptes -= len;
> +	}
> +}
> +
> +static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
> +		struct folio *folio, unsigned long addr, pte_t *ptep,
> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
> +{
> +	bool set_write;
> +
> +	if (vma->vm_flags & VM_SHARED) {
> +		set_write = can_change_shared_pte_writable(vma, ptent);
> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
> +				       /* idx = */ 0, set_write, tlb);
> +		return;
> +	}
> +
> +	set_write = maybe_change_pte_writable(vma, ptent) &&
> +		    (folio && folio_test_anon(folio));
> +	if (!set_write) {
> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
> +				       /* idx = */ 0, set_write, tlb);
> +		return;
> +	}
> +	commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb);
> +}
> +
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb,
>  		nr_ptes = 1;
>  		oldpte = ptep_get(pte);
>  		if (pte_present(oldpte)) {
> +			const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
>  			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
> -			struct folio *folio;
> +			struct folio *folio = NULL;
>  			pte_t ptent;
>
>  			/*
> @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb,
>
>  					/* determine batch to skip */
>  					nr_ptes = mprotect_folio_pte_batch(folio,
> -						  pte, oldpte, max_nr_ptes);
> +						  pte, oldpte, max_nr_ptes, /* flags = */ 0);
>  					continue;
>  				}
>  			}
>
> +			if (!folio)
> +				folio = vm_normal_folio(vma, addr, oldpte);
> +
> +			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
> +
>  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>  			ptent = pte_modify(oldpte, newprot);
>
> @@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			 * COW or special handling is required.
>  			 */
>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> -			    !pte_write(ptent) &&
> -			    can_change_pte_writable(vma, addr, ptent))
> -				ptent = pte_mkwrite(ptent, vma);
> -
> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> -			if (pte_needs_flush(oldpte, ptent))
> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> -			pages++;
> +			     !pte_write(ptent))
> +				set_write_prot_commit_flush_ptes(vma, folio,
> +				addr, pte, oldpte, ptent, nr_ptes, tlb);
> +			else
> +				prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
> +					nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
> +			pages += nr_ptes;
>  		} else if (is_swap_pte(oldpte)) {
>  			swp_entry_t entry = pte_to_swp_entry(oldpte);
>  			pte_t newpte;
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-07-18 18:49   ` Lorenzo Stoakes
@ 2025-07-19 13:46     ` Dev Jain
  2025-07-20 11:20       ` Lorenzo Stoakes
  0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-07-19 13:46 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 19/07/25 12:19 am, Lorenzo Stoakes wrote:
> On Fri, Jul 18, 2025 at 02:32:43PM +0530, Dev Jain wrote:
>> Use folio_pte_batch to batch process a large folio. Note that, PTE
>> batching here will save a few function calls, and this strategy in certain
>> cases (not this one) batches atomic operations in general, so we have
>> a performance win for all arches. This patch paves the way for patch 7
>> which will help us elide the TLBI per contig block on arm64.
>>
>> The correctness of this patch lies on the correctness of setting the
>> new ptes based upon information only from the first pte of the batch
>> (which may also have accumulated a/d bits via modify_prot_start_ptes()).
>>
>> Observe that the flag combination we pass to mprotect_folio_pte_batch()
>> guarantees that the batch is uniform w.r.t the soft-dirty bit and the
>> writable bit. Therefore, the only bits which may differ are the a/d bits.
>> So we only need to worry about code which is concerned about the a/d bits
>> of the PTEs.
>>
>> Setting extra a/d bits on the new ptes where previously they were not set,
>> is fine - setting access bit when it was not set is not an incorrectness
>> problem but will only possibly delay the reclaim of the page mapped by
>> the pte (which is in fact intended because the kernel just operated on this
>> region via mprotect()!). Setting dirty bit when it was not set is again
>> not an incorrectness problem but will only possibly force an unnecessary
>> writeback.
>>
>> So now we need to reason whether something can go wrong via
>> can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp,
>> and userfaultfd_pte_wp cases are solved due to uniformity in the
>> corresponding bits guaranteed by the flag combination. The ptes all
>> belong to the same VMA (since callers guarantee that [start, end) will
>> lie within the VMA) therefore the conditional based on the VMA is also
>> safe to batch around.
>>
>> Since the dirty bit on the PTE really is just an indication that the folio
>> got written to - even if the PTE is not actually dirty but one of the PTEs
>> in the batch is, the wp-fault optimization can be made. Therefore, it is
>> safe to batch around pte_dirty() in can_change_shared_pte_writable()
>> (in fact this is better since without batching, it may happen that
>> some ptes aren't changed to writable just because they are not dirty,
>> even though the other ptes mapping the same large folio are dirty).
>>
>> To batch around the PageAnonExclusive case, we must check the corresponding
>> condition for every single page. Therefore, from the large folio batch,
>> we process sub batches of ptes mapping pages with the same
>> PageAnonExclusive condition, and process that sub batch, then determine
>> and process the next sub batch, and so on. Note that this does not cause
>> any extra overhead; if suppose the size of the folio batch is 512, then
>> the sub batch processing in total will take 512 iterations, which is the
>> same as what we would have done before.
>>
>> For pte_needs_flush():
>>
>> ppc does not care about the a/d bits.
>>
>> For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits
>> get cleared; since we can only have extra a/d bits due to batching,
>> we will only have an extra flush, not a case where we elide a flush due
>> to batching when we shouldn't have.
>>
> Thanks for great commit message!
>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> This is looking MUCH better :) Thanks!
>
> Some nits below, but I've gone through this carefully and can't find
> anything that seems obviously wrong here, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks!

>
>> ---
>>   mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 113 insertions(+), 12 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index a1c7d8a4648d..2ddd37b2f462 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>   }
>>
>>   static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>> -				    pte_t pte, int max_nr_ptes)
>> +				    pte_t pte, int max_nr_ptes, fpb_t flags)
>>   {
>>   	/* No underlying folio, so cannot batch */
>>   	if (!folio)
>> @@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>>   	if (!folio_test_large(folio))
>>   		return 1;
>>
>> -	return folio_pte_batch(folio, ptep, pte, max_nr_ptes);
>> +	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
>>   }
>>
>>   static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>> @@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>>   	return ret;
>>   }
>>
>> +/* Set nr_ptes number of ptes, starting from idx */
>> +static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
>> +		int idx, bool set_write, struct mmu_gather *tlb)
>> +{
>> +	/*
>> +	 * Advance the position in the batch by idx; note that if idx > 0,
>> +	 * then the nr_ptes passed here is <= batch size - idx.
>> +	 */
>> +	addr += idx * PAGE_SIZE;
>> +	ptep += idx;
>> +	oldpte = pte_advance_pfn(oldpte, idx);
>> +	ptent = pte_advance_pfn(ptent, idx);
>> +
>> +	if (set_write)
>> +		ptent = pte_mkwrite(ptent, vma);
>> +
>> +	modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes);
>> +	if (pte_needs_flush(oldpte, ptent))
>> +		tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
>> +}
>> +
>> +/*
>> + * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or
>> + * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
>> + * that the ptes point to consecutive pages of the same anon large folio.
>> + */
>> +static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
>> +		struct page *first_page, bool expected_anon_exclusive)
>> +{
>> +	int idx;
> Nit but:
>
> 	int end = start_idx + max_len;
>
> 	for (idx = start_idx + 1; idx < end; idx++) {
>
> Would be a little neater here.

I politely disagree :) start_idx + max_len is *obviously* the
end index, no need to add one more line of code asserting that.


>
>> +
>> +	for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
> Nitty again but the below might be a little clearer?
>
> 	struct page *page = &firstpage[idx];
>
> 	if (expected_anon_exclusive != PageAnonExclusive(page))

I don't think so. first_page[idx] may confuse us into thinking that
we have an array of pages. Also, the way you define it assigns a
stack address to struct page *page; this is not a problem in theory
and the code will still be correct, but I will prefer struct page *page
containing the actual address of the linear map struct page, which is
vmemmap + PFN. The way I write it is, I initialize first_page from folio_page()
which will derive the address from folio->page, and folio was derived from
vm_normal_folio() (which was derived from the PFN in the PTE), therefore
first_page will contain the actual vmemmap address of corresponding struct page,
hence it is guaranteed that first_page + x will give me the x'th page in
the folio.


>
>
>> +		if (expected_anon_exclusive != PageAnonExclusive(first_page + idx))
>> +			break;
>> +	}
>> +	return idx - start_idx;
>> +}
>> +
>> +/*
>> + * This function is a result of trying our very best to retain the
>> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
>> + * if the vma is a private vma, and we cannot determine whether to change
>> + * the pte to writable just from the vma and the pte, we then need to look
>> + * at the actual page pointed to by the pte. Unfortunately, if we have a
>> + * batch of ptes pointing to consecutive pages of the same anon large folio,
>> + * the anon-exclusivity (or the negation) of the first page does not guarantee
>> + * the anon-exclusivity (or the negation) of the other pages corresponding to
>> + * the pte batch; hence in this case it is incorrect to decide to change or
>> + * not change the ptes to writable just by using information from the first
>> + * pte of the batch. Therefore, we must individually check all pages and
>> + * retrieve sub-batches.
>> + */
> Nice comment thanks.
>
>> +static void commit_anon_folio_batch(struct vm_area_struct *vma,
>> +		struct folio *folio, unsigned long addr, pte_t *ptep,
>> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>> +{
>> +	struct page *first_page = folio_page(folio, 0);
>> +	bool expected_anon_exclusive;
>> +	int sub_batch_idx = 0;
>> +	int len;
>> +
>> +	while (nr_ptes) {
> I'd prefer this to be:
>
> 	int i;
>
> 	...
>
> 	for (i = 0; i < nr_ptes; i += len, sub_batch_idx += len) {
>
>> +		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);

We won't be able to do nr_ptes -= len with this. And personally a while loop
is clearer to me here.


> Nit but would prefer:
>
> 		struct page *page = &first_page[sub_batch_idx];
>
> 		expected_anon_exclusive = PageAnonExclusive(page);
>
>> +		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
>> +					first_page, expected_anon_exclusive);
>> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
>> +				       sub_batch_idx, expected_anon_exclusive, tlb);
>> +		sub_batch_idx += len;
>> +		nr_ptes -= len;
>> +	}
>> +}
>> +
>> +static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
>> +		struct folio *folio, unsigned long addr, pte_t *ptep,
>> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>> +{
>> +	bool set_write;
>> +
>> +	if (vma->vm_flags & VM_SHARED) {
>> +		set_write = can_change_shared_pte_writable(vma, ptent);
>> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
>> +				       /* idx = */ 0, set_write, tlb);
>> +		return;
>> +	}
>> +
>> +	set_write = maybe_change_pte_writable(vma, ptent) &&
>> +		    (folio && folio_test_anon(folio));
>> +	if (!set_write) {
>> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
>> +				       /* idx = */ 0, set_write, tlb);
>> +		return;
>> +	}
>> +	commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb);
>> +}
>> +
>>   static long change_pte_range(struct mmu_gather *tlb,
>>   		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>   		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   		nr_ptes = 1;
>>   		oldpte = ptep_get(pte);
>>   		if (pte_present(oldpte)) {
>> +			const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
>>   			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>> -			struct folio *folio;
>> +			struct folio *folio = NULL;
>>   			pte_t ptent;
>>
>>   			/*
>> @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb,
>>
>>   					/* determine batch to skip */
>>   					nr_ptes = mprotect_folio_pte_batch(folio,
>> -						  pte, oldpte, max_nr_ptes);
>> +						  pte, oldpte, max_nr_ptes, /* flags = */ 0);
>>   					continue;
>>   				}
>>   			}
>>
>> +			if (!folio)
>> +				folio = vm_normal_folio(vma, addr, oldpte);
>> +
>> +			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
>> +
>>   			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>   			ptent = pte_modify(oldpte, newprot);
>>
>> @@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			 * COW or special handling is required.
>>   			 */
>>   			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> -			    !pte_write(ptent) &&
>> -			    can_change_pte_writable(vma, addr, ptent))
>> -				ptent = pte_mkwrite(ptent, vma);
>> -
>> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>> -			if (pte_needs_flush(oldpte, ptent))
>> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> -			pages++;
>> +			     !pte_write(ptent))
>> +				set_write_prot_commit_flush_ptes(vma, folio,
>> +				addr, pte, oldpte, ptent, nr_ptes, tlb);
>> +			else
>> +				prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
>> +					nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
>> +			pages += nr_ptes;
>>   		} else if (is_swap_pte(oldpte)) {
>>   			swp_entry_t entry = pte_to_swp_entry(oldpte);
>>   			pte_t newpte;
>> --
>> 2.30.2
>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-07-19 13:46     ` Dev Jain
@ 2025-07-20 11:20       ` Lorenzo Stoakes
  2025-07-20 14:39         ` Dev Jain
  0 siblings, 1 reply; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-07-20 11:20 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Sat, Jul 19, 2025 at 07:16:48PM +0530, Dev Jain wrote:
>
> On 19/07/25 12:19 am, Lorenzo Stoakes wrote:
> > On Fri, Jul 18, 2025 at 02:32:43PM +0530, Dev Jain wrote:
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Thanks!

You're welcome :)

>
> >
> > > ---
> > >   mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++-----
> > >   1 file changed, 113 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index a1c7d8a4648d..2ddd37b2f462 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> > >   }
> > >
> > >   static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> > > -				    pte_t pte, int max_nr_ptes)
> > > +				    pte_t pte, int max_nr_ptes, fpb_t flags)
> > >   {
> > >   	/* No underlying folio, so cannot batch */
> > >   	if (!folio)
> > > @@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> > >   	if (!folio_test_large(folio))
> > >   		return 1;
> > >
> > > -	return folio_pte_batch(folio, ptep, pte, max_nr_ptes);
> > > +	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
> > >   }
> > >
> > >   static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
> > > @@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
> > >   	return ret;
> > >   }
> > >
> > > +/* Set nr_ptes number of ptes, starting from idx */
> > > +static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
> > > +		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
> > > +		int idx, bool set_write, struct mmu_gather *tlb)
> > > +{
> > > +	/*
> > > +	 * Advance the position in the batch by idx; note that if idx > 0,
> > > +	 * then the nr_ptes passed here is <= batch size - idx.
> > > +	 */
> > > +	addr += idx * PAGE_SIZE;
> > > +	ptep += idx;
> > > +	oldpte = pte_advance_pfn(oldpte, idx);
> > > +	ptent = pte_advance_pfn(ptent, idx);
> > > +
> > > +	if (set_write)
> > > +		ptent = pte_mkwrite(ptent, vma);
> > > +
> > > +	modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes);
> > > +	if (pte_needs_flush(oldpte, ptent))
> > > +		tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
> > > +}
> > > +
> > > +/*
> > > + * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or
> > > + * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
> > > + * that the ptes point to consecutive pages of the same anon large folio.
> > > + */
> > > +static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
> > > +		struct page *first_page, bool expected_anon_exclusive)
> > > +{
> > > +	int idx;
> > Nit but:
> >
> > 	int end = start_idx + max_len;
> >
> > 	for (idx = start_idx + 1; idx < end; idx++) {
> >
> > Would be a little neater here.
>
> I politely disagree :) start_idx + max_len is *obviously* the
> end index, no need to add one more line of code asserting that.

Haha, well disagreement is permitted you know ;) as long as it's polite of
course...

That's fine, this isn't a big deal.

>
>
> >
> > > +
> > > +	for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
> > Nitty again but the below might be a little clearer?
> >
> > 	struct page *page = &firstpage[idx];
> >
> > 	if (expected_anon_exclusive != PageAnonExclusive(page))
>

> I don't think so. first_page[idx] may confuse us into thinking that
> we have an array of pages. Also, the way you define it assigns a
> stack address to struct page *page; this is not a problem in theory
> and the code will still be correct, but I will prefer struct page *page
> containing the actual address of the linear map struct page, which is
> vmemmap + PFN. The way I write it is, I initialize first_page from folio_page()
> which will derive the address from folio->page, and folio was derived from
> vm_normal_folio() (which was derived from the PFN in the PTE), therefore
> first_page will contain the actual vmemmap address of corresponding struct page,
> hence it is guaranteed that first_page + x will give me the x'th page in
> the folio.

OK, I don't think this is an issue, but I"m not going to press this, as it's a
trivial thing, it's fine as-is :)

>
>
> >
> >
> > > +		if (expected_anon_exclusive != PageAnonExclusive(first_page + idx))
> > > +			break;
> > > +	}
> > > +	return idx - start_idx;
> > > +}
> > > +
> > > +/*
> > > + * This function is a result of trying our very best to retain the
> > > + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
> > > + * if the vma is a private vma, and we cannot determine whether to change
> > > + * the pte to writable just from the vma and the pte, we then need to look
> > > + * at the actual page pointed to by the pte. Unfortunately, if we have a
> > > + * batch of ptes pointing to consecutive pages of the same anon large folio,
> > > + * the anon-exclusivity (or the negation) of the first page does not guarantee
> > > + * the anon-exclusivity (or the negation) of the other pages corresponding to
> > > + * the pte batch; hence in this case it is incorrect to decide to change or
> > > + * not change the ptes to writable just by using information from the first
> > > + * pte of the batch. Therefore, we must individually check all pages and
> > > + * retrieve sub-batches.
> > > + */
> > Nice comment thanks.
> >
> > > +static void commit_anon_folio_batch(struct vm_area_struct *vma,
> > > +		struct folio *folio, unsigned long addr, pte_t *ptep,
> > > +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
> > > +{
> > > +	struct page *first_page = folio_page(folio, 0);
> > > +	bool expected_anon_exclusive;
> > > +	int sub_batch_idx = 0;
> > > +	int len;
> > > +
> > > +	while (nr_ptes) {
> > I'd prefer this to be:
> >
> > 	int i;
> >
> > 	...
> >
> > 	for (i = 0; i < nr_ptes; i += len, sub_batch_idx += len) {
> >
> > > +		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
>
> We won't be able to do nr_ptes -= len with this. And personally a while loop
> is clearer to me here.

Well, you don't need to :) maybe rename i to pte_idx + pass nr_ptes - pte_idx.

Buuuut I'm not going to press this, it's not a big deal, and I see your point!

Overall the R-b tag still stands with the above unchanged.

Thanks for doing this series and being open to feedback, I feel we're iterated
to something nice here!

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-07-20 11:20       ` Lorenzo Stoakes
@ 2025-07-20 14:39         ` Dev Jain
  0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-07-20 14:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 20/07/25 4:50 pm, Lorenzo Stoakes wrote:
> On Sat, Jul 19, 2025 at 07:16:48PM +0530, Dev Jain wrote:
>> On 19/07/25 12:19 am, Lorenzo Stoakes wrote:
>>> On Fri, Jul 18, 2025 at 02:32:43PM +0530, Dev Jain wrote:
>>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Thanks!
> You're welcome :)
>
>>>> ---
>>>>    mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++-----
>>>>    1 file changed, 113 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index a1c7d8a4648d..2ddd37b2f462 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>>    }
>>>>
>>>>    static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>>>> -				    pte_t pte, int max_nr_ptes)
>>>> +				    pte_t pte, int max_nr_ptes, fpb_t flags)
>>>>    {
>>>>    	/* No underlying folio, so cannot batch */
>>>>    	if (!folio)
>>>> @@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>>>>    	if (!folio_test_large(folio))
>>>>    		return 1;
>>>>
>>>> -	return folio_pte_batch(folio, ptep, pte, max_nr_ptes);
>>>> +	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
>>>>    }
>>>>
>>>>    static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>>>> @@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>>>>    	return ret;
>>>>    }
>>>>
>>>> +/* Set nr_ptes number of ptes, starting from idx */
>>>> +static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
>>>> +		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
>>>> +		int idx, bool set_write, struct mmu_gather *tlb)
>>>> +{
>>>> +	/*
>>>> +	 * Advance the position in the batch by idx; note that if idx > 0,
>>>> +	 * then the nr_ptes passed here is <= batch size - idx.
>>>> +	 */
>>>> +	addr += idx * PAGE_SIZE;
>>>> +	ptep += idx;
>>>> +	oldpte = pte_advance_pfn(oldpte, idx);
>>>> +	ptent = pte_advance_pfn(ptent, idx);
>>>> +
>>>> +	if (set_write)
>>>> +		ptent = pte_mkwrite(ptent, vma);
>>>> +
>>>> +	modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes);
>>>> +	if (pte_needs_flush(oldpte, ptent))
>>>> +		tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or
>>>> + * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
>>>> + * that the ptes point to consecutive pages of the same anon large folio.
>>>> + */
>>>> +static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
>>>> +		struct page *first_page, bool expected_anon_exclusive)
>>>> +{
>>>> +	int idx;
>>> Nit but:
>>>
>>> 	int end = start_idx + max_len;
>>>
>>> 	for (idx = start_idx + 1; idx < end; idx++) {
>>>
>>> Would be a little neater here.
>> I politely disagree :) start_idx + max_len is *obviously* the
>> end index, no need to add one more line of code asserting that.
> Haha, well disagreement is permitted you know ;) as long as it's polite of
> course...
>
> That's fine, this isn't a big deal.
>
>>
>>>> +
>>>> +	for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
>>> Nitty again but the below might be a little clearer?
>>>
>>> 	struct page *page = &firstpage[idx];
>>>
>>> 	if (expected_anon_exclusive != PageAnonExclusive(page))
>> I don't think so. first_page[idx] may confuse us into thinking that
>> we have an array of pages. Also, the way you define it assigns a
>> stack address to struct page *page; this is not a problem in theory
>> and the code will still be correct, but I will prefer struct page *page
>> containing the actual address of the linear map struct page, which is
>> vmemmap + PFN. The way I write it is, I initialize first_page from folio_page()
>> which will derive the address from folio->page, and folio was derived from
>> vm_normal_folio() (which was derived from the PFN in the PTE), therefore
>> first_page will contain the actual vmemmap address of corresponding struct page,
>> hence it is guaranteed that first_page + x will give me the x'th page in
>> the folio.
> OK, I don't think this is an issue, but I"m not going to press this, as it's a
> trivial thing, it's fine as-is :)
>
>>
>>>
>>>> +		if (expected_anon_exclusive != PageAnonExclusive(first_page + idx))
>>>> +			break;
>>>> +	}
>>>> +	return idx - start_idx;
>>>> +}
>>>> +
>>>> +/*
>>>> + * This function is a result of trying our very best to retain the
>>>> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
>>>> + * if the vma is a private vma, and we cannot determine whether to change
>>>> + * the pte to writable just from the vma and the pte, we then need to look
>>>> + * at the actual page pointed to by the pte. Unfortunately, if we have a
>>>> + * batch of ptes pointing to consecutive pages of the same anon large folio,
>>>> + * the anon-exclusivity (or the negation) of the first page does not guarantee
>>>> + * the anon-exclusivity (or the negation) of the other pages corresponding to
>>>> + * the pte batch; hence in this case it is incorrect to decide to change or
>>>> + * not change the ptes to writable just by using information from the first
>>>> + * pte of the batch. Therefore, we must individually check all pages and
>>>> + * retrieve sub-batches.
>>>> + */
>>> Nice comment thanks.
>>>
>>>> +static void commit_anon_folio_batch(struct vm_area_struct *vma,
>>>> +		struct folio *folio, unsigned long addr, pte_t *ptep,
>>>> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>>>> +{
>>>> +	struct page *first_page = folio_page(folio, 0);
>>>> +	bool expected_anon_exclusive;
>>>> +	int sub_batch_idx = 0;
>>>> +	int len;
>>>> +
>>>> +	while (nr_ptes) {
>>> I'd prefer this to be:
>>>
>>> 	int i;
>>>
>>> 	...
>>>
>>> 	for (i = 0; i < nr_ptes; i += len, sub_batch_idx += len) {
>>>
>>>> +		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
>> We won't be able to do nr_ptes -= len with this. And personally a while loop
>> is clearer to me here.
> Well, you don't need to :) maybe rename i to pte_idx + pass nr_ptes - pte_idx.
>
> Buuuut I'm not going to press this, it's not a big deal, and I see your point!
>
> Overall the R-b tag still stands with the above unchanged.
>
> Thanks for doing this series and being open to feedback, I feel we're iterated
> to something nice here!

We definitely iterated to something nice! I definitely found your review very
useful; as contributors trying to solve our own problems, we simply keep
adding new features and the code becomes a mess of if-else statements - this
is what I felt when I began reading kernel code. I am reading code and
suddenly some conditional handling a uffd-wp pte appears (I know you have
some nice words about uffd :))

Thank you for actually forcing me to think into how I can implement
my ideas into code better!

>
> Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-07-18  9:02 ` [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching Dev Jain
  2025-07-18 18:49   ` Lorenzo Stoakes
@ 2025-07-24 19:55   ` Zi Yan
  2025-08-06  8:08   ` David Hildenbrand
  2 siblings, 0 replies; 53+ messages in thread
From: Zi Yan @ 2025-07-24 19:55 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang

On 18 Jul 2025, at 5:02, Dev Jain wrote:

> Use folio_pte_batch to batch process a large folio. Note that, PTE
> batching here will save a few function calls, and this strategy in certain
> cases (not this one) batches atomic operations in general, so we have
> a performance win for all arches. This patch paves the way for patch 7
> which will help us elide the TLBI per contig block on arm64.
>
> The correctness of this patch lies on the correctness of setting the
> new ptes based upon information only from the first pte of the batch
> (which may also have accumulated a/d bits via modify_prot_start_ptes()).
>
> Observe that the flag combination we pass to mprotect_folio_pte_batch()
> guarantees that the batch is uniform w.r.t the soft-dirty bit and the
> writable bit. Therefore, the only bits which may differ are the a/d bits.
> So we only need to worry about code which is concerned about the a/d bits
> of the PTEs.
>
> Setting extra a/d bits on the new ptes where previously they were not set,
> is fine - setting access bit when it was not set is not an incorrectness
> problem but will only possibly delay the reclaim of the page mapped by
> the pte (which is in fact intended because the kernel just operated on this
> region via mprotect()!). Setting dirty bit when it was not set is again
> not an incorrectness problem but will only possibly force an unnecessary
> writeback.
>
> So now we need to reason whether something can go wrong via
> can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp,
> and userfaultfd_pte_wp cases are solved due to uniformity in the
> corresponding bits guaranteed by the flag combination. The ptes all
> belong to the same VMA (since callers guarantee that [start, end) will
> lie within the VMA) therefore the conditional based on the VMA is also
> safe to batch around.
>
> Since the dirty bit on the PTE really is just an indication that the folio
> got written to - even if the PTE is not actually dirty but one of the PTEs
> in the batch is, the wp-fault optimization can be made. Therefore, it is
> safe to batch around pte_dirty() in can_change_shared_pte_writable()
> (in fact this is better since without batching, it may happen that
> some ptes aren't changed to writable just because they are not dirty,
> even though the other ptes mapping the same large folio are dirty).
>
> To batch around the PageAnonExclusive case, we must check the corresponding
> condition for every single page. Therefore, from the large folio batch,
> we process sub batches of ptes mapping pages with the same
> PageAnonExclusive condition, and process that sub batch, then determine
> and process the next sub batch, and so on. Note that this does not cause
> any extra overhead; if suppose the size of the folio batch is 512, then
> the sub batch processing in total will take 512 iterations, which is the
> same as what we would have done before.
>
> For pte_needs_flush():
>
> ppc does not care about the a/d bits.
>
> For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits
> get cleared; since we can only have extra a/d bits due to batching,
> we will only have an extra flush, not a case where we elide a flush due
> to batching when we shouldn't have.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 113 insertions(+), 12 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index a1c7d8a4648d..2ddd37b2f462 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  }
>
>  static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> -				    pte_t pte, int max_nr_ptes)
> +				    pte_t pte, int max_nr_ptes, fpb_t flags)
>  {
>  	/* No underlying folio, so cannot batch */
>  	if (!folio)
> @@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>  	if (!folio_test_large(folio))
>  		return 1;
>
> -	return folio_pte_batch(folio, ptep, pte, max_nr_ptes);
> +	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
>  }
>
>  static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
> @@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>  	return ret;
>  }
>
> +/* Set nr_ptes number of ptes, starting from idx */
> +static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
> +		int idx, bool set_write, struct mmu_gather *tlb)
> +{
> +	/*
> +	 * Advance the position in the batch by idx; note that if idx > 0,
> +	 * then the nr_ptes passed here is <= batch size - idx.
> +	 */
> +	addr += idx * PAGE_SIZE;
> +	ptep += idx;
> +	oldpte = pte_advance_pfn(oldpte, idx);
> +	ptent = pte_advance_pfn(ptent, idx);
> +
> +	if (set_write)
> +		ptent = pte_mkwrite(ptent, vma);
> +
> +	modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes);
> +	if (pte_needs_flush(oldpte, ptent))
> +		tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
> +}
> +
> +/*
> + * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or
> + * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
> + * that the ptes point to consecutive pages of the same anon large folio.
> + */
> +static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
> +		struct page *first_page, bool expected_anon_exclusive)
> +{
> +	int idx;
> +
> +	for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
> +		if (expected_anon_exclusive != PageAnonExclusive(first_page + idx))
> +			break;
> +	}
> +	return idx - start_idx;
> +}
> +
> +/*
> + * This function is a result of trying our very best to retain the
> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
> + * if the vma is a private vma, and we cannot determine whether to change
> + * the pte to writable just from the vma and the pte, we then need to look
> + * at the actual page pointed to by the pte. Unfortunately, if we have a
> + * batch of ptes pointing to consecutive pages of the same anon large folio,
> + * the anon-exclusivity (or the negation) of the first page does not guarantee
> + * the anon-exclusivity (or the negation) of the other pages corresponding to
> + * the pte batch; hence in this case it is incorrect to decide to change or
> + * not change the ptes to writable just by using information from the first
> + * pte of the batch. Therefore, we must individually check all pages and
> + * retrieve sub-batches.
> + */
> +static void commit_anon_folio_batch(struct vm_area_struct *vma,
> +		struct folio *folio, unsigned long addr, pte_t *ptep,
> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
> +{
> +	struct page *first_page = folio_page(folio, 0);
> +	bool expected_anon_exclusive;
> +	int sub_batch_idx = 0;
> +	int len;
> +
> +	while (nr_ptes) {
> +		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
> +		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
> +					first_page, expected_anon_exclusive);
> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
> +				       sub_batch_idx, expected_anon_exclusive, tlb);
> +		sub_batch_idx += len;
> +		nr_ptes -= len;
> +	}
> +}
> +
> +static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
> +		struct folio *folio, unsigned long addr, pte_t *ptep,
> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
> +{
> +	bool set_write;
> +
> +	if (vma->vm_flags & VM_SHARED) {
> +		set_write = can_change_shared_pte_writable(vma, ptent);
> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
> +				       /* idx = */ 0, set_write, tlb);
> +		return;
> +	}
> +
> +	set_write = maybe_change_pte_writable(vma, ptent) &&
> +		    (folio && folio_test_anon(folio));
> +	if (!set_write) {
> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
> +				       /* idx = */ 0, set_write, tlb);
> +		return;
> +	}
> +	commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb);
> +}
> +
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb,
>  		nr_ptes = 1;
>  		oldpte = ptep_get(pte);
>  		if (pte_present(oldpte)) {
> +			const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
>  			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
> -			struct folio *folio;
> +			struct folio *folio = NULL;
>  			pte_t ptent;
>
>  			/*
> @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb,
>
>  					/* determine batch to skip */
>  					nr_ptes = mprotect_folio_pte_batch(folio,
> -						  pte, oldpte, max_nr_ptes);
> +						  pte, oldpte, max_nr_ptes, /* flags = */ 0);
>  					continue;
>  				}
>  			}
>
> +			if (!folio)
> +				folio = vm_normal_folio(vma, addr, oldpte);
> +
> +			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
> +
>  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);

Here both mprotect_folio_pte_batch() and modify_prot_start_ptes()
can collect dirty and access bits from PTEs in the batch. But
modify_prot_start_ptes() clears PTEs to prevent concurrent hardware
modification. It took me a while to figure this out, since I was
wondering why oldpte is not used after mprotect_folio_pte_batch().

I wish folio_pte_batch_flags() used by mprotect_folio_pte_batch()
can take a function pointer to allow either read PTEs or clear
PTEs, so that there could be one function. But that could
complicate folio_pte_batch_flags(). Just a comment. :)

Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-07-18  9:02 ` [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching Dev Jain
  2025-07-18 18:49   ` Lorenzo Stoakes
  2025-07-24 19:55   ` Zi Yan
@ 2025-08-06  8:08   ` David Hildenbrand
  2025-08-06  8:12     ` David Hildenbrand
                       ` (3 more replies)
  2 siblings, 4 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-08-06  8:08 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
	will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 18.07.25 11:02, Dev Jain wrote:
> Use folio_pte_batch to batch process a large folio. Note that, PTE
> batching here will save a few function calls, and this strategy in certain
> cases (not this one) batches atomic operations in general, so we have
> a performance win for all arches. This patch paves the way for patch 7
> which will help us elide the TLBI per contig block on arm64.
> 
> The correctness of this patch lies on the correctness of setting the
> new ptes based upon information only from the first pte of the batch
> (which may also have accumulated a/d bits via modify_prot_start_ptes()).
> 
> Observe that the flag combination we pass to mprotect_folio_pte_batch()
> guarantees that the batch is uniform w.r.t the soft-dirty bit and the
> writable bit. Therefore, the only bits which may differ are the a/d bits.
> So we only need to worry about code which is concerned about the a/d bits
> of the PTEs.
> 
> Setting extra a/d bits on the new ptes where previously they were not set,
> is fine - setting access bit when it was not set is not an incorrectness
> problem but will only possibly delay the reclaim of the page mapped by
> the pte (which is in fact intended because the kernel just operated on this
> region via mprotect()!). Setting dirty bit when it was not set is again
> not an incorrectness problem but will only possibly force an unnecessary
> writeback.
> 
> So now we need to reason whether something can go wrong via
> can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp,
> and userfaultfd_pte_wp cases are solved due to uniformity in the
> corresponding bits guaranteed by the flag combination. The ptes all
> belong to the same VMA (since callers guarantee that [start, end) will
> lie within the VMA) therefore the conditional based on the VMA is also
> safe to batch around.
> 
> Since the dirty bit on the PTE really is just an indication that the folio
> got written to - even if the PTE is not actually dirty but one of the PTEs
> in the batch is, the wp-fault optimization can be made. Therefore, it is
> safe to batch around pte_dirty() in can_change_shared_pte_writable()
> (in fact this is better since without batching, it may happen that
> some ptes aren't changed to writable just because they are not dirty,
> even though the other ptes mapping the same large folio are dirty).
> 
> To batch around the PageAnonExclusive case, we must check the corresponding
> condition for every single page. Therefore, from the large folio batch,
> we process sub batches of ptes mapping pages with the same
> PageAnonExclusive condition, and process that sub batch, then determine
> and process the next sub batch, and so on. Note that this does not cause
> any extra overhead; if suppose the size of the folio batch is 512, then
> the sub batch processing in total will take 512 iterations, which is the
> same as what we would have done before.
> 
> For pte_needs_flush():
> 
> ppc does not care about the a/d bits.
> 
> For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits
> get cleared; since we can only have extra a/d bits due to batching,
> we will only have an extra flush, not a case where we elide a flush due
> to batching when we shouldn't have.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>


I wanted to review this, but looks like it's already upstream and I 
suspect it's buggy (see the upstream report I cc'ed you on)

[...]

> +
> +/*
> + * This function is a result of trying our very best to retain the
> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
> + * if the vma is a private vma, and we cannot determine whether to change
> + * the pte to writable just from the vma and the pte, we then need to look
> + * at the actual page pointed to by the pte. Unfortunately, if we have a
> + * batch of ptes pointing to consecutive pages of the same anon large folio,
> + * the anon-exclusivity (or the negation) of the first page does not guarantee
> + * the anon-exclusivity (or the negation) of the other pages corresponding to
> + * the pte batch; hence in this case it is incorrect to decide to change or
> + * not change the ptes to writable just by using information from the first
> + * pte of the batch. Therefore, we must individually check all pages and
> + * retrieve sub-batches.
> + */
> +static void commit_anon_folio_batch(struct vm_area_struct *vma,
> +		struct folio *folio, unsigned long addr, pte_t *ptep,
> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
> +{
> +	struct page *first_page = folio_page(folio, 0);

Who says that we have the first page of the folio mapped into the first 
PTE of the batch?

> +	bool expected_anon_exclusive;
> +	int sub_batch_idx = 0;
> +	int len;
> +
> +	while (nr_ptes) {
> +		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
> +		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
> +					first_page, expected_anon_exclusive);
> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
> +				       sub_batch_idx, expected_anon_exclusive, tlb);
> +		sub_batch_idx += len;
> +		nr_ptes -= len;
> +	}
> +}
> +
> +static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
> +		struct folio *folio, unsigned long addr, pte_t *ptep,
> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
> +{
> +	bool set_write;
> +
> +	if (vma->vm_flags & VM_SHARED) {
> +		set_write = can_change_shared_pte_writable(vma, ptent);
> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
> +				       /* idx = */ 0, set_write, tlb);
> +		return;
> +	}
> +
> +	set_write = maybe_change_pte_writable(vma, ptent) &&
> +		    (folio && folio_test_anon(folio));
> +	if (!set_write) {
> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
> +				       /* idx = */ 0, set_write, tlb);
> +		return;
> +	}
> +	commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb);
> +}
> +
>   static long change_pte_range(struct mmu_gather *tlb,
>   		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>   		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb,
>   		nr_ptes = 1;
>   		oldpte = ptep_get(pte);
>   		if (pte_present(oldpte)) {
> +			const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
>   			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
> -			struct folio *folio;
> +			struct folio *folio = NULL;
>   			pte_t ptent;
>   
>   			/*
> @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb,
>   
>   					/* determine batch to skip */
>   					nr_ptes = mprotect_folio_pte_batch(folio,
> -						  pte, oldpte, max_nr_ptes);
> +						  pte, oldpte, max_nr_ptes, /* flags = */ 0);
>   					continue;
>   				}
>   			}
>   
> +			if (!folio)
> +				folio = vm_normal_folio(vma, addr, oldpte);
> +
> +			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
> +
>   			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>   			ptent = pte_modify(oldpte, newprot);
>   
> @@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb,
>   			 * COW or special handling is required.
>   			 */
>   			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> -			    !pte_write(ptent) &&
> -			    can_change_pte_writable(vma, addr, ptent))
> -				ptent = pte_mkwrite(ptent, vma);
> -
> -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> -			if (pte_needs_flush(oldpte, ptent))
> -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> -			pages++;
> +			     !pte_write(ptent))
> +				set_write_prot_commit_flush_ptes(vma, folio,
> +				addr, pte, oldpte, ptent, nr_ptes, tlb);

While staring at this:

Very broken indentation.

> +			else
> +				prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
> +					nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);

Semi-broken intendation.

> +			pages += nr_ptes;
>   		} else if (is_swap_pte(oldpte)) {
>   			swp_entry_t entry = pte_to_swp_entry(oldpte);
>   			pte_t newpte;


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06  8:08   ` David Hildenbrand
@ 2025-08-06  8:12     ` David Hildenbrand
  2025-08-06  8:15     ` Will Deacon
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-08-06  8:12 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
	will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 06.08.25 10:08, David Hildenbrand wrote:
> On 18.07.25 11:02, Dev Jain wrote:
>> Use folio_pte_batch to batch process a large folio. Note that, PTE
>> batching here will save a few function calls, and this strategy in certain
>> cases (not this one) batches atomic operations in general, so we have
>> a performance win for all arches. This patch paves the way for patch 7
>> which will help us elide the TLBI per contig block on arm64.
>>
>> The correctness of this patch lies on the correctness of setting the
>> new ptes based upon information only from the first pte of the batch
>> (which may also have accumulated a/d bits via modify_prot_start_ptes()).
>>
>> Observe that the flag combination we pass to mprotect_folio_pte_batch()
>> guarantees that the batch is uniform w.r.t the soft-dirty bit and the
>> writable bit. Therefore, the only bits which may differ are the a/d bits.
>> So we only need to worry about code which is concerned about the a/d bits
>> of the PTEs.
>>
>> Setting extra a/d bits on the new ptes where previously they were not set,
>> is fine - setting access bit when it was not set is not an incorrectness
>> problem but will only possibly delay the reclaim of the page mapped by
>> the pte (which is in fact intended because the kernel just operated on this
>> region via mprotect()!). Setting dirty bit when it was not set is again
>> not an incorrectness problem but will only possibly force an unnecessary
>> writeback.
>>
>> So now we need to reason whether something can go wrong via
>> can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp,
>> and userfaultfd_pte_wp cases are solved due to uniformity in the
>> corresponding bits guaranteed by the flag combination. The ptes all
>> belong to the same VMA (since callers guarantee that [start, end) will
>> lie within the VMA) therefore the conditional based on the VMA is also
>> safe to batch around.
>>
>> Since the dirty bit on the PTE really is just an indication that the folio
>> got written to - even if the PTE is not actually dirty but one of the PTEs
>> in the batch is, the wp-fault optimization can be made. Therefore, it is
>> safe to batch around pte_dirty() in can_change_shared_pte_writable()
>> (in fact this is better since without batching, it may happen that
>> some ptes aren't changed to writable just because they are not dirty,
>> even though the other ptes mapping the same large folio are dirty).
>>
>> To batch around the PageAnonExclusive case, we must check the corresponding
>> condition for every single page. Therefore, from the large folio batch,
>> we process sub batches of ptes mapping pages with the same
>> PageAnonExclusive condition, and process that sub batch, then determine
>> and process the next sub batch, and so on. Note that this does not cause
>> any extra overhead; if suppose the size of the folio batch is 512, then
>> the sub batch processing in total will take 512 iterations, which is the
>> same as what we would have done before.
>>
>> For pte_needs_flush():
>>
>> ppc does not care about the a/d bits.
>>
>> For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits
>> get cleared; since we can only have extra a/d bits due to batching,
>> we will only have an extra flush, not a case where we elide a flush due
>> to batching when we shouldn't have.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> 
> 
> I wanted to review this, but looks like it's already upstream and I
> suspect it's buggy (see the upstream report I cc'ed you on)
> 
> [...]
> 
>> +
>> +/*
>> + * This function is a result of trying our very best to retain the
>> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
>> + * if the vma is a private vma, and we cannot determine whether to change
>> + * the pte to writable just from the vma and the pte, we then need to look
>> + * at the actual page pointed to by the pte. Unfortunately, if we have a
>> + * batch of ptes pointing to consecutive pages of the same anon large folio,
>> + * the anon-exclusivity (or the negation) of the first page does not guarantee
>> + * the anon-exclusivity (or the negation) of the other pages corresponding to
>> + * the pte batch; hence in this case it is incorrect to decide to change or
>> + * not change the ptes to writable just by using information from the first
>> + * pte of the batch. Therefore, we must individually check all pages and
>> + * retrieve sub-batches.
>> + */
>> +static void commit_anon_folio_batch(struct vm_area_struct *vma,
>> +		struct folio *folio, unsigned long addr, pte_t *ptep,
>> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>> +{
>> +	struct page *first_page = folio_page(folio, 0);
> 
> Who says that we have the first page of the folio mapped into the first
> PTE of the batch?

For the record, I *hate* that we moved from vm_normal_folio() to 
vm_normal_page(). Please undo that and forward the proper mapped page.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06  8:08   ` David Hildenbrand
  2025-08-06  8:12     ` David Hildenbrand
@ 2025-08-06  8:15     ` Will Deacon
  2025-08-06  8:19       ` David Hildenbrand
  2025-08-06  8:53     ` Dev Jain
  2025-08-06  9:12     ` Lorenzo Stoakes
  3 siblings, 1 reply; 53+ messages in thread
From: Will Deacon @ 2025-08-06  8:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dev Jain, akpm, ryan.roberts, willy, linux-mm, linux-kernel,
	catalin.marinas, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote:
> On 18.07.25 11:02, Dev Jain wrote:
> > Use folio_pte_batch to batch process a large folio. Note that, PTE
> > batching here will save a few function calls, and this strategy in certain
> > cases (not this one) batches atomic operations in general, so we have
> > a performance win for all arches. This patch paves the way for patch 7
> > which will help us elide the TLBI per contig block on arm64.
> > 
> > The correctness of this patch lies on the correctness of setting the
> > new ptes based upon information only from the first pte of the batch
> > (which may also have accumulated a/d bits via modify_prot_start_ptes()).
> > 
> > Observe that the flag combination we pass to mprotect_folio_pte_batch()
> > guarantees that the batch is uniform w.r.t the soft-dirty bit and the
> > writable bit. Therefore, the only bits which may differ are the a/d bits.
> > So we only need to worry about code which is concerned about the a/d bits
> > of the PTEs.
> > 
> > Setting extra a/d bits on the new ptes where previously they were not set,
> > is fine - setting access bit when it was not set is not an incorrectness
> > problem but will only possibly delay the reclaim of the page mapped by
> > the pte (which is in fact intended because the kernel just operated on this
> > region via mprotect()!). Setting dirty bit when it was not set is again
> > not an incorrectness problem but will only possibly force an unnecessary
> > writeback.
> > 
> > So now we need to reason whether something can go wrong via
> > can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp,
> > and userfaultfd_pte_wp cases are solved due to uniformity in the
> > corresponding bits guaranteed by the flag combination. The ptes all
> > belong to the same VMA (since callers guarantee that [start, end) will
> > lie within the VMA) therefore the conditional based on the VMA is also
> > safe to batch around.
> > 
> > Since the dirty bit on the PTE really is just an indication that the folio
> > got written to - even if the PTE is not actually dirty but one of the PTEs
> > in the batch is, the wp-fault optimization can be made. Therefore, it is
> > safe to batch around pte_dirty() in can_change_shared_pte_writable()
> > (in fact this is better since without batching, it may happen that
> > some ptes aren't changed to writable just because they are not dirty,
> > even though the other ptes mapping the same large folio are dirty).
> > 
> > To batch around the PageAnonExclusive case, we must check the corresponding
> > condition for every single page. Therefore, from the large folio batch,
> > we process sub batches of ptes mapping pages with the same
> > PageAnonExclusive condition, and process that sub batch, then determine
> > and process the next sub batch, and so on. Note that this does not cause
> > any extra overhead; if suppose the size of the folio batch is 512, then
> > the sub batch processing in total will take 512 iterations, which is the
> > same as what we would have done before.
> > 
> > For pte_needs_flush():
> > 
> > ppc does not care about the a/d bits.
> > 
> > For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits
> > get cleared; since we can only have extra a/d bits due to batching,
> > we will only have an extra flush, not a case where we elide a flush due
> > to batching when we shouldn't have.
> > 
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> 
> 
> I wanted to review this, but looks like it's already upstream and I suspect
> it's buggy (see the upstream report I cc'ed you on)

Please excuse my laziness, but do you have a link to the report? I've
been looking at some oddities on arm64 coming back from some of the CI
systems and was heading in the direction of a recent mm regression
judging by the first-known-bad-build in linux-next.

https://lore.kernel.org/r/CA+G9fYumD2MGjECCv0wx2V_96_FKNtFQpT63qVNrrCmomoPYVQ@mail.gmail.com

Will

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06  8:15     ` Will Deacon
@ 2025-08-06  8:19       ` David Hildenbrand
  0 siblings, 0 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-08-06  8:19 UTC (permalink / raw)
  To: Will Deacon
  Cc: Dev Jain, akpm, ryan.roberts, willy, linux-mm, linux-kernel,
	catalin.marinas, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 06.08.25 10:15, Will Deacon wrote:
> On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote:
>> On 18.07.25 11:02, Dev Jain wrote:
>>> Use folio_pte_batch to batch process a large folio. Note that, PTE
>>> batching here will save a few function calls, and this strategy in certain
>>> cases (not this one) batches atomic operations in general, so we have
>>> a performance win for all arches. This patch paves the way for patch 7
>>> which will help us elide the TLBI per contig block on arm64.
>>>
>>> The correctness of this patch lies on the correctness of setting the
>>> new ptes based upon information only from the first pte of the batch
>>> (which may also have accumulated a/d bits via modify_prot_start_ptes()).
>>>
>>> Observe that the flag combination we pass to mprotect_folio_pte_batch()
>>> guarantees that the batch is uniform w.r.t the soft-dirty bit and the
>>> writable bit. Therefore, the only bits which may differ are the a/d bits.
>>> So we only need to worry about code which is concerned about the a/d bits
>>> of the PTEs.
>>>
>>> Setting extra a/d bits on the new ptes where previously they were not set,
>>> is fine - setting access bit when it was not set is not an incorrectness
>>> problem but will only possibly delay the reclaim of the page mapped by
>>> the pte (which is in fact intended because the kernel just operated on this
>>> region via mprotect()!). Setting dirty bit when it was not set is again
>>> not an incorrectness problem but will only possibly force an unnecessary
>>> writeback.
>>>
>>> So now we need to reason whether something can go wrong via
>>> can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp,
>>> and userfaultfd_pte_wp cases are solved due to uniformity in the
>>> corresponding bits guaranteed by the flag combination. The ptes all
>>> belong to the same VMA (since callers guarantee that [start, end) will
>>> lie within the VMA) therefore the conditional based on the VMA is also
>>> safe to batch around.
>>>
>>> Since the dirty bit on the PTE really is just an indication that the folio
>>> got written to - even if the PTE is not actually dirty but one of the PTEs
>>> in the batch is, the wp-fault optimization can be made. Therefore, it is
>>> safe to batch around pte_dirty() in can_change_shared_pte_writable()
>>> (in fact this is better since without batching, it may happen that
>>> some ptes aren't changed to writable just because they are not dirty,
>>> even though the other ptes mapping the same large folio are dirty).
>>>
>>> To batch around the PageAnonExclusive case, we must check the corresponding
>>> condition for every single page. Therefore, from the large folio batch,
>>> we process sub batches of ptes mapping pages with the same
>>> PageAnonExclusive condition, and process that sub batch, then determine
>>> and process the next sub batch, and so on. Note that this does not cause
>>> any extra overhead; if suppose the size of the folio batch is 512, then
>>> the sub batch processing in total will take 512 iterations, which is the
>>> same as what we would have done before.
>>>
>>> For pte_needs_flush():
>>>
>>> ppc does not care about the a/d bits.
>>>
>>> For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits
>>> get cleared; since we can only have extra a/d bits due to batching,
>>> we will only have an extra flush, not a case where we elide a flush due
>>> to batching when we shouldn't have.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>
>>
>> I wanted to review this, but looks like it's already upstream and I suspect
>> it's buggy (see the upstream report I cc'ed you on)
> 
> Please excuse my laziness, but do you have a link to the report?

I was lazy :)

https://lkml.kernel.org/r/68930511.050a0220.7f033.003a.GAE@google.com

> I've
> been looking at some oddities on arm64 coming back from some of the CI
> systems and was heading in the direction of a recent mm regression
> judging by the first-known-bad-build in linux-next.
> 
> https://lore.kernel.org/r/CA+G9fYumD2MGjECCv0wx2V_96_FKNtFQpT63qVNrrCmomoPYVQ@mail.gmail.com

Hm, mprotect seems to be involved. So it might or might not correlate.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06  8:08   ` David Hildenbrand
  2025-08-06  8:12     ` David Hildenbrand
  2025-08-06  8:15     ` Will Deacon
@ 2025-08-06  8:53     ` Dev Jain
  2025-08-06  8:56       ` David Hildenbrand
  2025-08-06  9:12     ` Lorenzo Stoakes
  3 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-08-06  8:53 UTC (permalink / raw)
  To: David Hildenbrand, akpm
  Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
	will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 06/08/25 1:38 pm, David Hildenbrand wrote:
> On 18.07.25 11:02, Dev Jain wrote:
>> Use folio_pte_batch to batch process a large folio. Note that, PTE
>> batching here will save a few function calls, and this strategy in 
>> certain
>> cases (not this one) batches atomic operations in general, so we have
>> a performance win for all arches. This patch paves the way for patch 7
>> which will help us elide the TLBI per contig block on arm64.
>>
>> The correctness of this patch lies on the correctness of setting the
>> new ptes based upon information only from the first pte of the batch
>> (which may also have accumulated a/d bits via modify_prot_start_ptes()).
>>
>> Observe that the flag combination we pass to mprotect_folio_pte_batch()
>> guarantees that the batch is uniform w.r.t the soft-dirty bit and the
>> writable bit. Therefore, the only bits which may differ are the a/d 
>> bits.
>> So we only need to worry about code which is concerned about the a/d 
>> bits
>> of the PTEs.
>>
>> Setting extra a/d bits on the new ptes where previously they were not 
>> set,
>> is fine - setting access bit when it was not set is not an incorrectness
>> problem but will only possibly delay the reclaim of the page mapped by
>> the pte (which is in fact intended because the kernel just operated 
>> on this
>> region via mprotect()!). Setting dirty bit when it was not set is again
>> not an incorrectness problem but will only possibly force an unnecessary
>> writeback.
>>
>> So now we need to reason whether something can go wrong via
>> can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp,
>> and userfaultfd_pte_wp cases are solved due to uniformity in the
>> corresponding bits guaranteed by the flag combination. The ptes all
>> belong to the same VMA (since callers guarantee that [start, end) will
>> lie within the VMA) therefore the conditional based on the VMA is also
>> safe to batch around.
>>
>> Since the dirty bit on the PTE really is just an indication that the 
>> folio
>> got written to - even if the PTE is not actually dirty but one of the 
>> PTEs
>> in the batch is, the wp-fault optimization can be made. Therefore, it is
>> safe to batch around pte_dirty() in can_change_shared_pte_writable()
>> (in fact this is better since without batching, it may happen that
>> some ptes aren't changed to writable just because they are not dirty,
>> even though the other ptes mapping the same large folio are dirty).
>>
>> To batch around the PageAnonExclusive case, we must check the 
>> corresponding
>> condition for every single page. Therefore, from the large folio batch,
>> we process sub batches of ptes mapping pages with the same
>> PageAnonExclusive condition, and process that sub batch, then determine
>> and process the next sub batch, and so on. Note that this does not cause
>> any extra overhead; if suppose the size of the folio batch is 512, then
>> the sub batch processing in total will take 512 iterations, which is the
>> same as what we would have done before.
>>
>> For pte_needs_flush():
>>
>> ppc does not care about the a/d bits.
>>
>> For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits
>> get cleared; since we can only have extra a/d bits due to batching,
>> we will only have an extra flush, not a case where we elide a flush due
>> to batching when we shouldn't have.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>
>
> I wanted to review this, but looks like it's already upstream and I 
> suspect it's buggy (see the upstream report I cc'ed you on)

Thank you for CCing, and quickly spotting the problem!


>
> [...]
>
>> +
>> +/*
>> + * This function is a result of trying our very best to retain the
>> + * "avoid the write-fault handler" optimization. In 
>> can_change_pte_writable(),
>> + * if the vma is a private vma, and we cannot determine whether to 
>> change
>> + * the pte to writable just from the vma and the pte, we then need 
>> to look
>> + * at the actual page pointed to by the pte. Unfortunately, if we 
>> have a
>> + * batch of ptes pointing to consecutive pages of the same anon 
>> large folio,
>> + * the anon-exclusivity (or the negation) of the first page does not 
>> guarantee
>> + * the anon-exclusivity (or the negation) of the other pages 
>> corresponding to
>> + * the pte batch; hence in this case it is incorrect to decide to 
>> change or
>> + * not change the ptes to writable just by using information from 
>> the first
>> + * pte of the batch. Therefore, we must individually check all pages 
>> and
>> + * retrieve sub-batches.
>> + */
>> +static void commit_anon_folio_batch(struct vm_area_struct *vma,
>> +        struct folio *folio, unsigned long addr, pte_t *ptep,
>> +        pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>> +{
>> +    struct page *first_page = folio_page(folio, 0);
>
> Who says that we have the first page of the folio mapped into the 
> first PTE of the batch?

Oops, I thought I had taken care of this. We make the folio from 
vm_normal_folio(vma, addr, oldpte),

which makes the struct page from the actual pte, but then I missed that 
page_folio() will reset the folio->page

to the head page, I was under the impression it will retain the original 
page. And indeed, if it didn't do that

then the purpose of vm_normal_folio() is nullified, should have thought 
about that :(  Lemme send a fix patch.


>
>> +    bool expected_anon_exclusive;
>> +    int sub_batch_idx = 0;
>> +    int len;
>> +
>> +    while (nr_ptes) {
>> +        expected_anon_exclusive = PageAnonExclusive(first_page + 
>> sub_batch_idx);
>> +        len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
>> +                    first_page, expected_anon_exclusive);
>> +        prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
>> +                       sub_batch_idx, expected_anon_exclusive, tlb);
>> +        sub_batch_idx += len;
>> +        nr_ptes -= len;
>> +    }
>> +}
>> +
>> +static void set_write_prot_commit_flush_ptes(struct vm_area_struct 
>> *vma,
>> +        struct folio *folio, unsigned long addr, pte_t *ptep,
>> +        pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>> +{
>> +    bool set_write;
>> +
>> +    if (vma->vm_flags & VM_SHARED) {
>> +        set_write = can_change_shared_pte_writable(vma, ptent);
>> +        prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
>> +                       /* idx = */ 0, set_write, tlb);
>> +        return;
>> +    }
>> +
>> +    set_write = maybe_change_pte_writable(vma, ptent) &&
>> +            (folio && folio_test_anon(folio));
>> +    if (!set_write) {
>> +        prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
>> +                       /* idx = */ 0, set_write, tlb);
>> +        return;
>> +    }
>> +    commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, 
>> nr_ptes, tlb);
>> +}
>> +
>>   static long change_pte_range(struct mmu_gather *tlb,
>>           struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>>           unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>> @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb,
>>           nr_ptes = 1;
>>           oldpte = ptep_get(pte);
>>           if (pte_present(oldpte)) {
>> +            const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | 
>> FPB_RESPECT_WRITE;
>>               int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>> -            struct folio *folio;
>> +            struct folio *folio = NULL;
>>               pte_t ptent;
>>                 /*
>> @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather 
>> *tlb,
>>                         /* determine batch to skip */
>>                       nr_ptes = mprotect_folio_pte_batch(folio,
>> -                          pte, oldpte, max_nr_ptes);
>> +                          pte, oldpte, max_nr_ptes, /* flags = */ 0);
>>                       continue;
>>                   }
>>               }
>>   +            if (!folio)
>> +                folio = vm_normal_folio(vma, addr, oldpte);
>> +
>> +            nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, 
>> max_nr_ptes, flags);
>> +
>>               oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>               ptent = pte_modify(oldpte, newprot);
>>   @@ -248,14 +350,13 @@ static long change_pte_range(struct 
>> mmu_gather *tlb,
>>                * COW or special handling is required.
>>                */
>>               if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> -                !pte_write(ptent) &&
>> -                can_change_pte_writable(vma, addr, ptent))
>> -                ptent = pte_mkwrite(ptent, vma);
>> -
>> -            modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, 
>> nr_ptes);
>> -            if (pte_needs_flush(oldpte, ptent))
>> -                tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> -            pages++;
>> +                 !pte_write(ptent))
>> +                set_write_prot_commit_flush_ptes(vma, folio,
>> +                addr, pte, oldpte, ptent, nr_ptes, tlb);
>
> While staring at this:
>
> Very broken indentation.

Indeed, but then in order to fix the indentation if we start positioning

every argument just below the opening bracket then it will take at least 
four lines :)


>
>> +            else
>> +                prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
>> +                    nr_ptes, /* idx = */ 0, /* set_write = */ false, 
>> tlb);
>
> Semi-broken intendation.

Yup I can position nr_ptes below vma and set_write below nr_ptes. But this

should probably not be a part of the fix patch.


>
>> +            pages += nr_ptes;
>>           } else if (is_swap_pte(oldpte)) {
>>               swp_entry_t entry = pte_to_swp_entry(oldpte);
>>               pte_t newpte;
>
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06  8:53     ` Dev Jain
@ 2025-08-06  8:56       ` David Hildenbrand
  0 siblings, 0 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-08-06  8:56 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: ryan.roberts, willy, linux-mm, linux-kernel, catalin.marinas,
	will, Liam.Howlett, lorenzo.stoakes, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


>> While staring at this:
>>
>> Very broken indentation.
> 
> Indeed, but then in order to fix the indentation if we start positioning
> 
> every argument just below the opening bracket then it will take at least
> four lines :)

You are allowed to use more than 80 chars if there is good reason :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06  8:08   ` David Hildenbrand
                       ` (2 preceding siblings ...)
  2025-08-06  8:53     ` Dev Jain
@ 2025-08-06  9:12     ` Lorenzo Stoakes
  2025-08-06  9:21       ` David Hildenbrand
  3 siblings, 1 reply; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-08-06  9:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dev Jain, akpm, ryan.roberts, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote:
> On 18.07.25 11:02, Dev Jain wrote:
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
>
>
> I wanted to review this, but looks like it's already upstream and I suspect
> it's buggy (see the upstream report I cc'ed you on)
>
> [...]
>
> > +
> > +/*
> > + * This function is a result of trying our very best to retain the
> > + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
> > + * if the vma is a private vma, and we cannot determine whether to change
> > + * the pte to writable just from the vma and the pte, we then need to look
> > + * at the actual page pointed to by the pte. Unfortunately, if we have a
> > + * batch of ptes pointing to consecutive pages of the same anon large folio,
> > + * the anon-exclusivity (or the negation) of the first page does not guarantee
> > + * the anon-exclusivity (or the negation) of the other pages corresponding to
> > + * the pte batch; hence in this case it is incorrect to decide to change or
> > + * not change the ptes to writable just by using information from the first
> > + * pte of the batch. Therefore, we must individually check all pages and
> > + * retrieve sub-batches.
> > + */
> > +static void commit_anon_folio_batch(struct vm_area_struct *vma,
> > +		struct folio *folio, unsigned long addr, pte_t *ptep,
> > +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
> > +{
> > +	struct page *first_page = folio_page(folio, 0);
>
> Who says that we have the first page of the folio mapped into the first PTE
> of the batch?

Yikes, missed this sorry. Got too tied up in alogrithm here.

You mean in _this_ PTE of the batch right? As we're invoking these on each part
of the PTE table.

I mean I guess we can simply do:

	struct page *first_page = pte_page(ptent);

Right?

Presuming ptent is the PTE entry we are curently examining (and applying
the PAE sub-batching to etc.)

[snip]

> > @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb,
> >   					/* determine batch to skip */
> >   					nr_ptes = mprotect_folio_pte_batch(folio,
> > -						  pte, oldpte, max_nr_ptes);
> > +						  pte, oldpte, max_nr_ptes, /* flags = */ 0);
> >   					continue;
> >   				}
> >   			}
> > +			if (!folio)
> > +				folio = vm_normal_folio(vma, addr, oldpte);
> > +
> > +			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
> > +
> >   			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> >   			ptent = pte_modify(oldpte, newprot);
> > @@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb,
> >   			 * COW or special handling is required.
> >   			 */
> >   			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> > -			    !pte_write(ptent) &&
> > -			    can_change_pte_writable(vma, addr, ptent))
> > -				ptent = pte_mkwrite(ptent, vma);
> > -
> > -			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
> > -			if (pte_needs_flush(oldpte, ptent))
> > -				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> > -			pages++;
> > +			     !pte_write(ptent))
> > +				set_write_prot_commit_flush_ptes(vma, folio,
> > +				addr, pte, oldpte, ptent, nr_ptes, tlb);
>
> While staring at this:
>
> Very broken indentation.

Hmm... I wonder if checkpatch.pl would have picked this up actually.

Yeah that 2nd line is just wrong...

>
> > +			else
> > +				prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
> > +					nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
>
> Semi-broken intendation.

Because of else then 2 lines after?

>
> > +			pages += nr_ptes;
> >   		} else if (is_swap_pte(oldpte)) {
> >   			swp_entry_t entry = pte_to_swp_entry(oldpte);
> >   			pte_t newpte;
>
>
> --
> Cheers,
>
> David / dhildenb
>

I think on this series I was so happy to see the implementation evolve to
something that was so much nicer than it originally was, I did not
scrutinise the impl. details of this one close enough, but as Jamie
Lannister said, "there are always lessons in failures" :)

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06  9:12     ` Lorenzo Stoakes
@ 2025-08-06  9:21       ` David Hildenbrand
  2025-08-06  9:37         ` Dev Jain
  0 siblings, 1 reply; 53+ messages in thread
From: David Hildenbrand @ 2025-08-06  9:21 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Dev Jain, akpm, ryan.roberts, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 06.08.25 11:12, Lorenzo Stoakes wrote:
> On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote:
>> On 18.07.25 11:02, Dev Jain wrote:
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>
>>
>> I wanted to review this, but looks like it's already upstream and I suspect
>> it's buggy (see the upstream report I cc'ed you on)
>>
>> [...]
>>
>>> +
>>> +/*
>>> + * This function is a result of trying our very best to retain the
>>> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
>>> + * if the vma is a private vma, and we cannot determine whether to change
>>> + * the pte to writable just from the vma and the pte, we then need to look
>>> + * at the actual page pointed to by the pte. Unfortunately, if we have a
>>> + * batch of ptes pointing to consecutive pages of the same anon large folio,
>>> + * the anon-exclusivity (or the negation) of the first page does not guarantee
>>> + * the anon-exclusivity (or the negation) of the other pages corresponding to
>>> + * the pte batch; hence in this case it is incorrect to decide to change or
>>> + * not change the ptes to writable just by using information from the first
>>> + * pte of the batch. Therefore, we must individually check all pages and
>>> + * retrieve sub-batches.
>>> + */
>>> +static void commit_anon_folio_batch(struct vm_area_struct *vma,
>>> +		struct folio *folio, unsigned long addr, pte_t *ptep,
>>> +		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>>> +{
>>> +	struct page *first_page = folio_page(folio, 0);
>>
>> Who says that we have the first page of the folio mapped into the first PTE
>> of the batch?
> 
> Yikes, missed this sorry. Got too tied up in alogrithm here.
> 
> You mean in _this_ PTE of the batch right? As we're invoking these on each part
> of the PTE table.
> 
> I mean I guess we can simply do:
> 
> 	struct page *first_page = pte_page(ptent);
> 
> Right?

Yes, but we should forward the result from vm_normal_page(), which does
exactly that for you, and increment the page accordingly as required,
just like with the pte we are processing.

...

>>
>>> +			else
>>> +				prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
>>> +					nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
>>
>> Semi-broken intendation.
> 
> Because of else then 2 lines after?

prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
		       nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);

Is what I would have expected.


I think a smart man once said, that if you need more than one line per statement in
an if/else clause, a set of {} can aid readability. But I don't particularly care :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06  9:21       ` David Hildenbrand
@ 2025-08-06  9:37         ` Dev Jain
  2025-08-06  9:50           ` Lorenzo Stoakes
  0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-08-06  9:37 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes
  Cc: akpm, ryan.roberts, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 06/08/25 2:51 pm, David Hildenbrand wrote:
> On 06.08.25 11:12, Lorenzo Stoakes wrote:
>> On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote:
>>> On 18.07.25 11:02, Dev Jain wrote:
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>
>>>
>>> I wanted to review this, but looks like it's already upstream and I 
>>> suspect
>>> it's buggy (see the upstream report I cc'ed you on)
>>>
>>> [...]
>>>
>>>> +
>>>> +/*
>>>> + * This function is a result of trying our very best to retain the
>>>> + * "avoid the write-fault handler" optimization. In 
>>>> can_change_pte_writable(),
>>>> + * if the vma is a private vma, and we cannot determine whether to 
>>>> change
>>>> + * the pte to writable just from the vma and the pte, we then need 
>>>> to look
>>>> + * at the actual page pointed to by the pte. Unfortunately, if we 
>>>> have a
>>>> + * batch of ptes pointing to consecutive pages of the same anon 
>>>> large folio,
>>>> + * the anon-exclusivity (or the negation) of the first page does 
>>>> not guarantee
>>>> + * the anon-exclusivity (or the negation) of the other pages 
>>>> corresponding to
>>>> + * the pte batch; hence in this case it is incorrect to decide to 
>>>> change or
>>>> + * not change the ptes to writable just by using information from 
>>>> the first
>>>> + * pte of the batch. Therefore, we must individually check all 
>>>> pages and
>>>> + * retrieve sub-batches.
>>>> + */
>>>> +static void commit_anon_folio_batch(struct vm_area_struct *vma,
>>>> +        struct folio *folio, unsigned long addr, pte_t *ptep,
>>>> +        pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather 
>>>> *tlb)
>>>> +{
>>>> +    struct page *first_page = folio_page(folio, 0);
>>>
>>> Who says that we have the first page of the folio mapped into the 
>>> first PTE
>>> of the batch?
>>
>> Yikes, missed this sorry. Got too tied up in alogrithm here.
>>
>> You mean in _this_ PTE of the batch right? As we're invoking these on 
>> each part
>> of the PTE table.
>>
>> I mean I guess we can simply do:
>>
>>     struct page *first_page = pte_page(ptent);
>>
>> Right?
>
> Yes, but we should forward the result from vm_normal_page(), which does
> exactly that for you, and increment the page accordingly as required,
> just like with the pte we are processing.

Makes sense, so I guess I will have to change the signature of 
prot_numa_skip()

to pass a double ptr to a page instead of folio and derive the folio in 
the caller,

and pass down both the folio and the page to 
set_write_prot_commit_flush_ptes.


>
> ...
>
>>>
>>>> +            else
>>>> +                prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
>>>> +                    nr_ptes, /* idx = */ 0, /* set_write = */ 
>>>> false, tlb);
>>>
>>> Semi-broken intendation.
>>
>> Because of else then 2 lines after?
>
> prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
>                nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
>
> Is what I would have expected.
>
>
> I think a smart man once said, that if you need more than one line per 
> statement in
> an if/else clause, a set of {} can aid readability. But I don't 
> particularly care :)
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06  9:37         ` Dev Jain
@ 2025-08-06  9:50           ` Lorenzo Stoakes
  2025-08-06 10:11             ` David Hildenbrand
       [not found]             ` <1b3d4799-2a57-4f16-973b-82fc7b438862@arm.com>
  0 siblings, 2 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-08-06  9:50 UTC (permalink / raw)
  To: Dev Jain
  Cc: David Hildenbrand, akpm, ryan.roberts, willy, linux-mm,
	linux-kernel, catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote:
> > >
> > > You mean in _this_ PTE of the batch right? As we're invoking these
> > > on each part
> > > of the PTE table.
> > >
> > > I mean I guess we can simply do:
> > >
> > >     struct page *first_page = pte_page(ptent);
> > >
> > > Right?
> >
> > Yes, but we should forward the result from vm_normal_page(), which does
> > exactly that for you, and increment the page accordingly as required,
> > just like with the pte we are processing.
>
> Makes sense, so I guess I will have to change the signature of
> prot_numa_skip()
>
> to pass a double ptr to a page instead of folio and derive the folio in the
> caller,
>
> and pass down both the folio and the page to
> set_write_prot_commit_flush_ptes.

I already don't love how we psas the folio back from there for very dubious
benefit. I really hate the idea of having a struct **page parameter...

I wonder if we should just have a quick fixup for hotfix, and refine this more
later?

I foresee some debate otherwise...

>
>
> >
> > ...
> >
> > > >
> > > > > +            else
> > > > > +                prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
> > > > > +                    nr_ptes, /* idx = */ 0, /* set_write =
> > > > > */ false, tlb);
> > > >
> > > > Semi-broken intendation.
> > >
> > > Because of else then 2 lines after?
> >
> > prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
> >                nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
> >
> > Is what I would have expected.
> >
> >
> > I think a smart man once said, that if you need more than one line per
> > statement in
> > an if/else clause, a set of {} can aid readability. But I don't
> > particularly care :)
> >

Can do this in follow up too.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06  9:50           ` Lorenzo Stoakes
@ 2025-08-06 10:11             ` David Hildenbrand
  2025-08-06 10:20               ` Dev Jain
  2025-08-06 10:45               ` Lorenzo Stoakes
       [not found]             ` <1b3d4799-2a57-4f16-973b-82fc7b438862@arm.com>
  1 sibling, 2 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-08-06 10:11 UTC (permalink / raw)
  To: Lorenzo Stoakes, Dev Jain
  Cc: akpm, ryan.roberts, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 06.08.25 11:50, Lorenzo Stoakes wrote:
> On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote:
>>>>
>>>> You mean in _this_ PTE of the batch right? As we're invoking these
>>>> on each part
>>>> of the PTE table.
>>>>
>>>> I mean I guess we can simply do:
>>>>
>>>>      struct page *first_page = pte_page(ptent);
>>>>
>>>> Right?
>>>
>>> Yes, but we should forward the result from vm_normal_page(), which does
>>> exactly that for you, and increment the page accordingly as required,
>>> just like with the pte we are processing.
>>
>> Makes sense, so I guess I will have to change the signature of
>> prot_numa_skip()
>>
>> to pass a double ptr to a page instead of folio and derive the folio in the
>> caller,
>>
>> and pass down both the folio and the page to
>> set_write_prot_commit_flush_ptes.
> 
> I already don't love how we psas the folio back from there for very dubious
> benefit. I really hate the idea of having a struct **page parameter...
> 
> I wonder if we should just have a quick fixup for hotfix, and refine this more
> later?

This is not an issue in any released kernel, so we can do this properly.

We should just remove that nested vm_normal_folio().

Untested, but should give an idea what we can do.


diff --git a/mm/mprotect.c b/mm/mprotect.c
index 78bded7acf795..4e0a22f7db495 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -120,7 +120,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
  
  static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
  			   pte_t oldpte, pte_t *pte, int target_node,
-			   struct folio **foliop)
+			   struct folio *folio)
  {
  	struct folio *folio = NULL;
  	bool ret = true;
@@ -131,7 +131,6 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
  	if (pte_protnone(oldpte))
  		goto skip;
  
-	folio = vm_normal_folio(vma, addr, oldpte);
  	if (!folio)
  		goto skip;
  
@@ -172,8 +171,6 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
  	if (folio_use_access_time(folio))
  		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
  
-skip:
-	*foliop = folio;
  	return ret;
  }
  
@@ -231,10 +228,9 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
   * retrieve sub-batches.
   */
  static void commit_anon_folio_batch(struct vm_area_struct *vma,
-		struct folio *folio, unsigned long addr, pte_t *ptep,
+		struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
  		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
  {
-	struct page *first_page = folio_page(folio, 0);
  	bool expected_anon_exclusive;
  	int sub_batch_idx = 0;
  	int len;
@@ -243,7 +239,7 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma,
  		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
  		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
  					first_page, expected_anon_exclusive);
-		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
+		prot_commit_flush_ptes(vma, addr, ptep, page, oldpte, ptent, len,
  				       sub_batch_idx, expected_anon_exclusive, tlb);
  		sub_batch_idx += len;
  		nr_ptes -= len;
@@ -251,7 +247,7 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma,
  }
  
  static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
-		struct folio *folio, unsigned long addr, pte_t *ptep,
+		struct folio *folio, struct page *page, unsigned long addr, pte_t *ptep,
  		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
  {
  	bool set_write;
@@ -270,7 +266,7 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
  				       /* idx = */ 0, set_write, tlb);
  		return;
  	}
-	commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb);
+	commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
  }
  
  static long change_pte_range(struct mmu_gather *tlb,
@@ -305,15 +301,20 @@ static long change_pte_range(struct mmu_gather *tlb,
  			const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
  			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
  			struct folio *folio = NULL;
+			struct page *page;
  			pte_t ptent;
  
+			page = vm_normal_folio(vma, addr, oldpte);
+			if (page)
+				folio = page_folio(page);
+
  			/*
  			 * Avoid trapping faults against the zero or KSM
  			 * pages. See similar comment in change_huge_pmd.
  			 */
  			if (prot_numa) {
  				int ret = prot_numa_skip(vma, addr, oldpte, pte,
-							 target_node, &folio);
+							 target_node, folio);
  				if (ret) {
  
  					/* determine batch to skip */
@@ -323,9 +324,6 @@ static long change_pte_range(struct mmu_gather *tlb,
  				}
  			}
  
-			if (!folio)
-				folio = vm_normal_folio(vma, addr, oldpte);
-
  			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
  
  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
@@ -351,7 +349,7 @@ static long change_pte_range(struct mmu_gather *tlb,
  			 */
  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
  			     !pte_write(ptent))
-				set_write_prot_commit_flush_ptes(vma, folio,
+				set_write_prot_commit_flush_ptes(vma, folio, page,
  				addr, pte, oldpte, ptent, nr_ptes, tlb);
  			else
  				prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
-- 
2.50.1


-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06 10:11             ` David Hildenbrand
@ 2025-08-06 10:20               ` Dev Jain
  2025-08-06 10:28                 ` David Hildenbrand
  2025-08-06 10:45                 ` Lorenzo Stoakes
  2025-08-06 10:45               ` Lorenzo Stoakes
  1 sibling, 2 replies; 53+ messages in thread
From: Dev Jain @ 2025-08-06 10:20 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes
  Cc: akpm, ryan.roberts, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 06/08/25 3:41 pm, David Hildenbrand wrote:
> On 06.08.25 11:50, Lorenzo Stoakes wrote:
>> On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote:
>>>>>
>>>>> You mean in _this_ PTE of the batch right? As we're invoking these
>>>>> on each part
>>>>> of the PTE table.
>>>>>
>>>>> I mean I guess we can simply do:
>>>>>
>>>>>      struct page *first_page = pte_page(ptent);
>>>>>
>>>>> Right?
>>>>
>>>> Yes, but we should forward the result from vm_normal_page(), which 
>>>> does
>>>> exactly that for you, and increment the page accordingly as required,
>>>> just like with the pte we are processing.
>>>
>>> Makes sense, so I guess I will have to change the signature of
>>> prot_numa_skip()
>>>
>>> to pass a double ptr to a page instead of folio and derive the folio 
>>> in the
>>> caller,
>>>
>>> and pass down both the folio and the page to
>>> set_write_prot_commit_flush_ptes.
>>
>> I already don't love how we psas the folio back from there for very 
>> dubious
>> benefit. I really hate the idea of having a struct **page parameter...
>>
>> I wonder if we should just have a quick fixup for hotfix, and refine 
>> this more
>> later?
>
> This is not an issue in any released kernel, so we can do this properly.
>
> We should just remove that nested vm_normal_folio().
>
> Untested, but should give an idea what we can do.

This puts the overhead of vm_normal_folio() unconditionally into the 
pte_present path.

Although I am guessing that is already happening assuming prot_numa case 
is not the

hot path. This is fine by me. So I guess I shouldn't have done that 
"reuse the folio

from prot_numa case if possible" thingy at all :)


>
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 78bded7acf795..4e0a22f7db495 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -120,7 +120,7 @@ static int mprotect_folio_pte_batch(struct folio 
> *folio, pte_t *ptep,
>
>  static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long 
> addr,
>                 pte_t oldpte, pte_t *pte, int target_node,
> -               struct folio **foliop)
> +               struct folio *folio)
>  {
>      struct folio *folio = NULL;
>      bool ret = true;
> @@ -131,7 +131,6 @@ static bool prot_numa_skip(struct vm_area_struct 
> *vma, unsigned long addr,
>      if (pte_protnone(oldpte))
>          goto skip;
>
> -    folio = vm_normal_folio(vma, addr, oldpte);
>      if (!folio)
>          goto skip;
>
> @@ -172,8 +171,6 @@ static bool prot_numa_skip(struct vm_area_struct 
> *vma, unsigned long addr,
>      if (folio_use_access_time(folio))
>          folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
>
> -skip:
> -    *foliop = folio;
>      return ret;
>  }
>
> @@ -231,10 +228,9 @@ static int page_anon_exclusive_sub_batch(int 
> start_idx, int max_len,
>   * retrieve sub-batches.
>   */
>  static void commit_anon_folio_batch(struct vm_area_struct *vma,
> -        struct folio *folio, unsigned long addr, pte_t *ptep,
> +        struct folio *folio, struct page *first_page, unsigned long 
> addr, pte_t *ptep,
>          pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>  {
> -    struct page *first_page = folio_page(folio, 0);
>      bool expected_anon_exclusive;
>      int sub_batch_idx = 0;
>      int len;
> @@ -243,7 +239,7 @@ static void commit_anon_folio_batch(struct 
> vm_area_struct *vma,
>          expected_anon_exclusive = PageAnonExclusive(first_page + 
> sub_batch_idx);
>          len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
>                      first_page, expected_anon_exclusive);
> -        prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
> +        prot_commit_flush_ptes(vma, addr, ptep, page, oldpte, ptent, 
> len,
>                         sub_batch_idx, expected_anon_exclusive, tlb);
>          sub_batch_idx += len;
>          nr_ptes -= len;
> @@ -251,7 +247,7 @@ static void commit_anon_folio_batch(struct 
> vm_area_struct *vma,
>  }
>
>  static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
> -        struct folio *folio, unsigned long addr, pte_t *ptep,
> +        struct folio *folio, struct page *page, unsigned long addr, 
> pte_t *ptep,
>          pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>  {
>      bool set_write;
> @@ -270,7 +266,7 @@ static void 
> set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
>                         /* idx = */ 0, set_write, tlb);
>          return;
>      }
> -    commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, 
> nr_ptes, tlb);
> +    commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, 
> ptent, nr_ptes, tlb);
>  }
>
>  static long change_pte_range(struct mmu_gather *tlb,
> @@ -305,15 +301,20 @@ static long change_pte_range(struct mmu_gather 
> *tlb,
>              const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | 
> FPB_RESPECT_WRITE;
>              int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>              struct folio *folio = NULL;
> +            struct page *page;
>              pte_t ptent;
>
> +            page = vm_normal_folio(vma, addr, oldpte);
> +            if (page)
> +                folio = page_folio(page);
> +
>              /*
>               * Avoid trapping faults against the zero or KSM
>               * pages. See similar comment in change_huge_pmd.
>               */
>              if (prot_numa) {
>                  int ret = prot_numa_skip(vma, addr, oldpte, pte,
> -                             target_node, &folio);
> +                             target_node, folio);
>                  if (ret) {
>
>                      /* determine batch to skip */
> @@ -323,9 +324,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>                  }
>              }
>
> -            if (!folio)
> -                folio = vm_normal_folio(vma, addr, oldpte);
> -
>              nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, 
> max_nr_ptes, flags);
>
>              oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> @@ -351,7 +349,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>               */
>              if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>                   !pte_write(ptent))
> -                set_write_prot_commit_flush_ptes(vma, folio,
> +                set_write_prot_commit_flush_ptes(vma, folio, page,
>                  addr, pte, oldpte, ptent, nr_ptes, tlb);
>              else
>                  prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06 10:20               ` Dev Jain
@ 2025-08-06 10:28                 ` David Hildenbrand
  2025-08-06 10:45                 ` Lorenzo Stoakes
  1 sibling, 0 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-08-06 10:28 UTC (permalink / raw)
  To: Dev Jain, Lorenzo Stoakes
  Cc: akpm, ryan.roberts, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 06.08.25 12:20, Dev Jain wrote:
> 
> On 06/08/25 3:41 pm, David Hildenbrand wrote:
>> On 06.08.25 11:50, Lorenzo Stoakes wrote:
>>> On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote:
>>>>>>
>>>>>> You mean in _this_ PTE of the batch right? As we're invoking these
>>>>>> on each part
>>>>>> of the PTE table.
>>>>>>
>>>>>> I mean I guess we can simply do:
>>>>>>
>>>>>>       struct page *first_page = pte_page(ptent);
>>>>>>
>>>>>> Right?
>>>>>
>>>>> Yes, but we should forward the result from vm_normal_page(), which
>>>>> does
>>>>> exactly that for you, and increment the page accordingly as required,
>>>>> just like with the pte we are processing.
>>>>
>>>> Makes sense, so I guess I will have to change the signature of
>>>> prot_numa_skip()
>>>>
>>>> to pass a double ptr to a page instead of folio and derive the folio
>>>> in the
>>>> caller,
>>>>
>>>> and pass down both the folio and the page to
>>>> set_write_prot_commit_flush_ptes.
>>>
>>> I already don't love how we psas the folio back from there for very
>>> dubious
>>> benefit. I really hate the idea of having a struct **page parameter...
>>>
>>> I wonder if we should just have a quick fixup for hotfix, and refine
>>> this more
>>> later?
>>
>> This is not an issue in any released kernel, so we can do this properly.
>>
>> We should just remove that nested vm_normal_folio().
>>
>> Untested, but should give an idea what we can do.
> 
> This puts the overhead of vm_normal_folio() unconditionally into the
> pte_present path.
> 
> Although I am guessing that is already happening assuming prot_numa case
> is not the
> 
> hot path. This is fine by me. So I guess I shouldn't have done that
> "reuse the folio
> 
> from prot_numa case if possible" thingy at all :)

I mean, it only applies when trying to NUMA-protect something that is
already protected. Not sure how relevant that is in practice.

As we don't even batch these today, we could just do:

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4e0a22f7db495..2154a1a3c6656 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -127,10 +127,6 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
         bool toptier;
         int nid;
  
-       /* Avoid TLB flush if possible */
-       if (pte_protnone(oldpte))
-               goto skip;
-
         if (!folio)
                 goto skip;
  
@@ -304,6 +300,9 @@ static long change_pte_range(struct mmu_gather *tlb,
                         struct page *page;
                         pte_t ptent;
  
+                       if (prot_numa && pte_protnone(oldpte))
+                               continue;
+
                         page = vm_normal_folio(vma, addr, oldpte);
                         if (page)
                                 folio = page_folio(page);


But with my change, we could actually batch-skip such large folios,
because mprotect_folio_pte_batch() would see the folio.

-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06 10:20               ` Dev Jain
  2025-08-06 10:28                 ` David Hildenbrand
@ 2025-08-06 10:45                 ` Lorenzo Stoakes
  1 sibling, 0 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-08-06 10:45 UTC (permalink / raw)
  To: Dev Jain
  Cc: David Hildenbrand, akpm, ryan.roberts, willy, linux-mm,
	linux-kernel, catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Wed, Aug 06, 2025 at 03:50:21PM +0530, Dev Jain wrote:
>
> On 06/08/25 3:41 pm, David Hildenbrand wrote:
> > On 06.08.25 11:50, Lorenzo Stoakes wrote:
> > > On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote:
> > > > > >
> > > > > > You mean in _this_ PTE of the batch right? As we're invoking these
> > > > > > on each part
> > > > > > of the PTE table.
> > > > > >
> > > > > > I mean I guess we can simply do:
> > > > > >
> > > > > >      struct page *first_page = pte_page(ptent);
> > > > > >
> > > > > > Right?
> > > > >
> > > > > Yes, but we should forward the result from vm_normal_page(),
> > > > > which does
> > > > > exactly that for you, and increment the page accordingly as required,
> > > > > just like with the pte we are processing.
> > > >
> > > > Makes sense, so I guess I will have to change the signature of
> > > > prot_numa_skip()
> > > >
> > > > to pass a double ptr to a page instead of folio and derive the
> > > > folio in the
> > > > caller,
> > > >
> > > > and pass down both the folio and the page to
> > > > set_write_prot_commit_flush_ptes.
> > >
> > > I already don't love how we psas the folio back from there for very
> > > dubious
> > > benefit. I really hate the idea of having a struct **page parameter...
> > >
> > > I wonder if we should just have a quick fixup for hotfix, and refine
> > > this more
> > > later?
> >
> > This is not an issue in any released kernel, so we can do this properly.
> >
> > We should just remove that nested vm_normal_folio().
> >
> > Untested, but should give an idea what we can do.
>
> This puts the overhead of vm_normal_folio() unconditionally into the
> pte_present path.
>
> Although I am guessing that is already happening assuming prot_numa case is
> not the
>
> hot path. This is fine by me. So I guess I shouldn't have done that "reuse
> the folio
>
> from prot_numa case if possible" thingy at all :)

Don't worry it was a reasonable suggestion at the time, I think let's just
make our lives easier here :)

Can you check David's patch and make sure it's ok? It looks reasonable to
me (other than the eobvious typo I pointed out).

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
  2025-08-06 10:11             ` David Hildenbrand
  2025-08-06 10:20               ` Dev Jain
@ 2025-08-06 10:45               ` Lorenzo Stoakes
  1 sibling, 0 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-08-06 10:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dev Jain, akpm, ryan.roberts, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Wed, Aug 06, 2025 at 12:11:33PM +0200, David Hildenbrand wrote:
> On 06.08.25 11:50, Lorenzo Stoakes wrote:
> > On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote:
> > > > >
> > > > > You mean in _this_ PTE of the batch right? As we're invoking these
> > > > > on each part
> > > > > of the PTE table.
> > > > >
> > > > > I mean I guess we can simply do:
> > > > >
> > > > >      struct page *first_page = pte_page(ptent);
> > > > >
> > > > > Right?
> > > >
> > > > Yes, but we should forward the result from vm_normal_page(), which does
> > > > exactly that for you, and increment the page accordingly as required,
> > > > just like with the pte we are processing.
> > >
> > > Makes sense, so I guess I will have to change the signature of
> > > prot_numa_skip()
> > >
> > > to pass a double ptr to a page instead of folio and derive the folio in the
> > > caller,
> > >
> > > and pass down both the folio and the page to
> > > set_write_prot_commit_flush_ptes.
> >
> > I already don't love how we psas the folio back from there for very dubious
> > benefit. I really hate the idea of having a struct **page parameter...
> >
> > I wonder if we should just have a quick fixup for hotfix, and refine this more
> > later?
>
> This is not an issue in any released kernel, so we can do this properly.
>
> We should just remove that nested vm_normal_folio().
>
> Untested, but should give an idea what we can do.
>
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 78bded7acf795..4e0a22f7db495 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -120,7 +120,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>  static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>  			   pte_t oldpte, pte_t *pte, int target_node,
> -			   struct folio **foliop)
> +			   struct folio *folio)
>  {
>  	struct folio *folio = NULL;
>  	bool ret = true;
> @@ -131,7 +131,6 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>  	if (pte_protnone(oldpte))
>  		goto skip;
> -	folio = vm_normal_folio(vma, addr, oldpte);
>  	if (!folio)
>  		goto skip;
> @@ -172,8 +171,6 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
>  	if (folio_use_access_time(folio))
>  		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
> -skip:
> -	*foliop = folio;
>  	return ret;
>  }
> @@ -231,10 +228,9 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
>   * retrieve sub-batches.
>   */
>  static void commit_anon_folio_batch(struct vm_area_struct *vma,
> -		struct folio *folio, unsigned long addr, pte_t *ptep,
> +		struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
>  		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>  {
> -	struct page *first_page = folio_page(folio, 0);
>  	bool expected_anon_exclusive;
>  	int sub_batch_idx = 0;
>  	int len;
> @@ -243,7 +239,7 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma,
>  		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
>  		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
>  					first_page, expected_anon_exclusive);
> -		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
> +		prot_commit_flush_ptes(vma, addr, ptep, page, oldpte, ptent, len,
>  				       sub_batch_idx, expected_anon_exclusive, tlb);
>  		sub_batch_idx += len;
>  		nr_ptes -= len;
> @@ -251,7 +247,7 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma,
>  }
>  static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
> -		struct folio *folio, unsigned long addr, pte_t *ptep,
> +		struct folio *folio, struct page *page, unsigned long addr, pte_t *ptep,
>  		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>  {
>  	bool set_write;
> @@ -270,7 +266,7 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
>  				       /* idx = */ 0, set_write, tlb);
>  		return;
>  	}
> -	commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb);
> +	commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
>  }
>  static long change_pte_range(struct mmu_gather *tlb,
> @@ -305,15 +301,20 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
>  			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
>  			struct folio *folio = NULL;
> +			struct page *page;
>  			pte_t ptent;
> +			page = vm_normal_folio(vma, addr, oldpte);

Surely vm_normal_page()? :P

> +			if (page)
> +				folio = page_folio(page);
> +
>  			/*
>  			 * Avoid trapping faults against the zero or KSM
>  			 * pages. See similar comment in change_huge_pmd.
>  			 */
>  			if (prot_numa) {
>  				int ret = prot_numa_skip(vma, addr, oldpte, pte,
> -							 target_node, &folio);
> +							 target_node, folio);
>  				if (ret) {
>  					/* determine batch to skip */
> @@ -323,9 +324,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				}
>  			}
> -			if (!folio)
> -				folio = vm_normal_folio(vma, addr, oldpte);
> -

Yes :) thanks :>)

>  			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
>  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> @@ -351,7 +349,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			 */
>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>  			     !pte_write(ptent))
> -				set_write_prot_commit_flush_ptes(vma, folio,
> +				set_write_prot_commit_flush_ptes(vma, folio, page,
>  				addr, pte, oldpte, ptent, nr_ptes, tlb);
>  			else
>  				prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
> --
> 2.50.1
>
>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

[parent not found: <1b3d4799-2a57-4f16-973b-82fc7b438862@arm.com>]

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
       [not found]             ` <1b3d4799-2a57-4f16-973b-82fc7b438862@arm.com>
@ 2025-08-06 10:07               ` Dev Jain
  2025-08-06 10:12               ` David Hildenbrand
  1 sibling, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-08-06 10:07 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, akpm, ryan.roberts, willy, linux-mm,
	linux-kernel, catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

No idea why Thunderbird decides to send an HTML email whenever

I copy code into it. Will fix this for future once and for all.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching
       [not found]             ` <1b3d4799-2a57-4f16-973b-82fc7b438862@arm.com>
  2025-08-06 10:07               ` Dev Jain
@ 2025-08-06 10:12               ` David Hildenbrand
  1 sibling, 0 replies; 53+ messages in thread
From: David Hildenbrand @ 2025-08-06 10:12 UTC (permalink / raw)
  To: Dev Jain, Lorenzo Stoakes
  Cc: akpm, ryan.roberts, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On 06.08.25 12:04, Dev Jain wrote:
> 
> On 06/08/25 3:20 pm, Lorenzo Stoakes wrote:
>> On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote:
>>>>> You mean in _this_ PTE of the batch right? As we're invoking these
>>>>> on each part
>>>>> of the PTE table.
>>>>>
>>>>> I mean I guess we can simply do:
>>>>>
>>>>>      struct page *first_page = pte_page(ptent);
>>>>>
>>>>> Right?
>>>> Yes, but we should forward the result from vm_normal_page(), which does
>>>> exactly that for you, and increment the page accordingly as required,
>>>> just like with the pte we are processing.
>>> Makes sense, so I guess I will have to change the signature of
>>> prot_numa_skip()
>>>
>>> to pass a double ptr to a page instead of folio and derive the folio in the
>>> caller,
>>>
>>> and pass down both the folio and the page to
>>> set_write_prot_commit_flush_ptes.
>> I already don't love how we psas the folio back from there for very dubious
>> benefit. I really hate the idea of having a struct **page parameter...
>>
>> I wonder if we should just have a quick fixup for hotfix, and refine this more
>> later?
>>
>> I foresee some debate otherwise...
> 
> Yup I would personally prefer that. Although if you would like to see the churn, here
> it is (based on mm-hotfixes-unstable, untested):
> 
> ---
> mm/mprotect.c | 27 ++++++++++++++++-----------
> 1 file changed, 16 insertions(+), 11 deletions(-)
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 78bded7acf79..0735870e89ab 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -120,9 +120,10 @@ static int mprotect_folio_pte_batch(struct folio 
> *folio, pte_t *ptep,
> static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
> pte_t oldpte, pte_t *pte, int target_node,
> - struct folio **foliop)
> + struct page **pagep)
> {
> struct folio *folio = NULL;
> + struct page *page = NULL;
> bool ret = true;
> bool toptier;
> int nid;
> @@ -131,7 +132,9 @@ static bool prot_numa_skip(struct vm_area_struct 
> *vma, unsigned long addr,
> if (pte_protnone(oldpte))
> goto skip;
> - folio = vm_normal_folio(vma, addr, oldpte);
> + page = vm_normal_page(vma, addr, oldpte);
> + if (page)
> + folio = page_folio(page);

See the draft I just send, where we move that vm_normal_page() into the 
caller.


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v5 7/7] arm64: Add batched versions of ptep_modify_prot_start/commit
  2025-07-18  9:02 [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
                   ` (5 preceding siblings ...)
  2025-07-18  9:02 ` [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching Dev Jain
@ 2025-07-18  9:02 ` Dev Jain
  2025-07-18 18:50   ` Lorenzo Stoakes
  2025-07-21 15:57   ` Catalin Marinas
  2025-07-18  9:50 ` [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
  7 siblings, 2 replies; 53+ messages in thread
From: Dev Jain @ 2025-07-18  9:02 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy, Dev Jain

Override the generic definition of modify_prot_start_ptes() to use
get_and_clear_full_ptes(). This helper does a TLBI only for the starting
and ending contpte block of the range, whereas the current implementation
will call ptep_get_and_clear() for every contpte block, thus doing a
TLBI on every contpte block. Therefore, we have a performance win.

The arm64 definition of pte_accessible() allows us to batch in the
errata specific case:

#define pte_accessible(mm, pte)	\
	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))

All ptes are obviously present in the folio batch, and they are also valid.

Override the generic definition of modify_prot_commit_ptes() to simply
use set_ptes() to map the new ptes into the pagetable.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 10 ++++++++++
 arch/arm64/mm/mmu.c              | 28 +++++++++++++++++++++++-----
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ba63c8736666..abd2dee416b3 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1643,6 +1643,16 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
 				    unsigned long addr, pte_t *ptep,
 				    pte_t old_pte, pte_t new_pte);
 
+#define modify_prot_start_ptes modify_prot_start_ptes
+extern pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
+				    unsigned long addr, pte_t *ptep,
+				    unsigned int nr);
+
+#define modify_prot_commit_ptes modify_prot_commit_ptes
+extern void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+				    pte_t *ptep, pte_t old_pte, pte_t pte,
+				    unsigned int nr);
+
 #ifdef CONFIG_ARM64_CONTPTE
 
 /*
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 3d5fb37424ab..abd9725796e9 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -26,6 +26,7 @@
 #include <linux/set_memory.h>
 #include <linux/kfence.h>
 #include <linux/pkeys.h>
+#include <linux/mm_inline.h>
 
 #include <asm/barrier.h>
 #include <asm/cputype.h>
@@ -1524,24 +1525,41 @@ static int __init prevent_bootmem_remove_init(void)
 early_initcall(prevent_bootmem_remove_init);
 #endif
 
-pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
+pte_t modify_prot_start_ptes(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t *ptep, unsigned int nr)
 {
+	pte_t pte = get_and_clear_full_ptes(vma->vm_mm, addr, ptep, nr, /* full = */ 0);
+
 	if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
 		/*
 		 * Break-before-make (BBM) is required for all user space mappings
 		 * when the permission changes from executable to non-executable
 		 * in cases where cpu is affected with errata #2645198.
 		 */
-		if (pte_user_exec(ptep_get(ptep)))
-			return ptep_clear_flush(vma, addr, ptep);
+		if (pte_accessible(vma->vm_mm, pte) && pte_user_exec(pte))
+			__flush_tlb_range(vma, addr, nr * PAGE_SIZE,
+					  PAGE_SIZE, true, 3);
 	}
-	return ptep_get_and_clear(vma->vm_mm, addr, ptep);
+
+	return pte;
+}
+
+pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
+{
+	return modify_prot_start_ptes(vma, addr, ptep, 1);
+}
+
+void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t *ptep, pte_t old_pte, pte_t pte,
+			     unsigned int nr)
+{
+	set_ptes(vma->vm_mm, addr, ptep, pte, nr);
 }
 
 void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
 			     pte_t old_pte, pte_t pte)
 {
-	set_pte_at(vma->vm_mm, addr, ptep, pte);
+	modify_prot_commit_ptes(vma, addr, ptep, old_pte, pte, 1);
 }
 
 /*
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 7/7] arm64: Add batched versions of ptep_modify_prot_start/commit
  2025-07-18  9:02 ` [PATCH v5 7/7] arm64: Add batched versions of ptep_modify_prot_start/commit Dev Jain
@ 2025-07-18 18:50   ` Lorenzo Stoakes
  2025-07-21 15:57   ` Catalin Marinas
  1 sibling, 0 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 18:50 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

Note I'm not forgetting this patch, it's just that I see Ryan has reviewed
it, and on the arm64 side I defer to his expertise.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 7/7] arm64: Add batched versions of ptep_modify_prot_start/commit
  2025-07-18  9:02 ` [PATCH v5 7/7] arm64: Add batched versions of ptep_modify_prot_start/commit Dev Jain
  2025-07-18 18:50   ` Lorenzo Stoakes
@ 2025-07-21 15:57   ` Catalin Marinas
  1 sibling, 0 replies; 53+ messages in thread
From: Catalin Marinas @ 2025-07-21 15:57 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel, will,
	Liam.Howlett, lorenzo.stoakes, vbabka, jannh, anshuman.khandual,
	peterx, joey.gouly, ioworker0, baohua, kevin.brodsky,
	quic_zhenhuah, christophe.leroy, yangyicong, linux-arm-kernel,
	hughd, yang, ziy

On Fri, Jul 18, 2025 at 02:32:44PM +0530, Dev Jain wrote:
> Override the generic definition of modify_prot_start_ptes() to use
> get_and_clear_full_ptes(). This helper does a TLBI only for the starting
> and ending contpte block of the range, whereas the current implementation
> will call ptep_get_and_clear() for every contpte block, thus doing a
> TLBI on every contpte block. Therefore, we have a performance win.
> 
> The arm64 definition of pte_accessible() allows us to batch in the
> errata specific case:
> 
> #define pte_accessible(mm, pte)	\
> 	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> 
> All ptes are obviously present in the folio batch, and they are also valid.
> 
> Override the generic definition of modify_prot_commit_ptes() to simply
> use set_ptes() to map the new ptes into the pagetable.
> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

The arm64 changes look fine to me:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 0/7] Optimize mprotect() for large folios
  2025-07-18  9:02 [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
                   ` (6 preceding siblings ...)
  2025-07-18  9:02 ` [PATCH v5 7/7] arm64: Add batched versions of ptep_modify_prot_start/commit Dev Jain
@ 2025-07-18  9:50 ` Dev Jain
  2025-07-18 18:53   ` Lorenzo Stoakes
  7 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-07-18  9:50 UTC (permalink / raw)
  To: akpm
  Cc: ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, lorenzo.stoakes, vbabka,
	jannh, anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy


On 18/07/25 2:32 pm, Dev Jain wrote:
> Use folio_pte_batch() to optimize change_pte_range(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all
> 16 entries to collect a/d bits. Hence this optimization will result in
> a 16x reduction in the number of ptep_get() calls. Next,
> ptep_modify_prot_start() will eventually call contpte_try_unfold() on
> every contig block, thus flushing the TLB for the complete large folio
> range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on
> each contig block, and only do them on the starting and ending
> contig block.
>
> For split folios, there will be no pte batching; the batch size returned
> by folio_pte_batch() will be 1. For pagetable split folios, the ptes will
> still point to the same large folio; for arm64, this results in the
> optimization described above, and for other arches, a minor improvement
> is expected due to a reduction in the number of function calls.
>
> mm-selftests pass on arm64. I have some failing tests on my x86 VM already;
> no new tests fail as a result of this patchset.
>
> We use the following test cases to measure performance, mprotect()'ing
> the mapped memory to read-only then read-write 40 times:
>
> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> pte-mapping those THPs
> Test case 2: Mapping 1G of memory with 64K mTHPs
> Test case 3: Mapping 1G of memory with 4K pages
>
> Average execution time on arm64, Apple M3:
> Before the patchset:
> T1: 2.1 seconds   T2: 2 seconds   T3: 1 second
>
> After the patchset:
> T1: 0.65 seconds   T2: 0.7 seconds   T3: 1.1 seconds
>

For the note: the numbers are different from the previous versions.
I must have run the test for more number of iterations and then
pasted the test program here for 40 iterations, that's why the mismatch.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v5 0/7] Optimize mprotect() for large folios
  2025-07-18  9:50 ` [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
@ 2025-07-18 18:53   ` Lorenzo Stoakes
  0 siblings, 0 replies; 53+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 18:53 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, ryan.roberts, david, willy, linux-mm, linux-kernel,
	catalin.marinas, will, Liam.Howlett, vbabka, jannh,
	anshuman.khandual, peterx, joey.gouly, ioworker0, baohua,
	kevin.brodsky, quic_zhenhuah, christophe.leroy, yangyicong,
	linux-arm-kernel, hughd, yang, ziy

On Fri, Jul 18, 2025 at 03:20:16PM +0530, Dev Jain wrote:
>
> On 18/07/25 2:32 pm, Dev Jain wrote:
> > Use folio_pte_batch() to optimize change_pte_range(). On arm64, if the ptes
> > are painted with the contig bit, then ptep_get() will iterate through all
> > 16 entries to collect a/d bits. Hence this optimization will result in
> > a 16x reduction in the number of ptep_get() calls. Next,
> > ptep_modify_prot_start() will eventually call contpte_try_unfold() on
> > every contig block, thus flushing the TLB for the complete large folio
> > range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on
> > each contig block, and only do them on the starting and ending
> > contig block.
> >
> > For split folios, there will be no pte batching; the batch size returned
> > by folio_pte_batch() will be 1. For pagetable split folios, the ptes will
> > still point to the same large folio; for arm64, this results in the
> > optimization described above, and for other arches, a minor improvement
> > is expected due to a reduction in the number of function calls.
> >
> > mm-selftests pass on arm64. I have some failing tests on my x86 VM already;
> > no new tests fail as a result of this patchset.
> >
> > We use the following test cases to measure performance, mprotect()'ing
> > the mapped memory to read-only then read-write 40 times:
> >
> > Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> > pte-mapping those THPs
> > Test case 2: Mapping 1G of memory with 64K mTHPs
> > Test case 3: Mapping 1G of memory with 4K pages
> >
> > Average execution time on arm64, Apple M3:
> > Before the patchset:
> > T1: 2.1 seconds   T2: 2 seconds   T3: 1 second
> >
> > After the patchset:
> > T1: 0.65 seconds   T2: 0.7 seconds   T3: 1.1 seconds
> >
>
> For the note: the numbers are different from the previous versions.
> I must have run the test for more number of iterations and then
> pasted the test program here for 40 iterations, that's why the mismatch.
>

Thanks for this clarification!

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2025-08-06 10:46 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-18  9:02 [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
2025-07-18  9:02 ` [PATCH v5 1/7] mm: Refactor MM_CP_PROT_NUMA skipping case into new function Dev Jain
2025-07-18 16:19   ` Lorenzo Stoakes
2025-07-20 23:44   ` Barry Song
2025-07-21  3:44     ` Dev Jain
2025-07-22 11:05       ` Dev Jain
2025-07-22 11:25   ` Ryan Roberts
2025-07-23 13:57   ` Zi Yan
2025-07-18  9:02 ` [PATCH v5 2/7] mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs Dev Jain
2025-07-18 16:40   ` Lorenzo Stoakes
2025-07-22 11:26   ` Ryan Roberts
2025-07-23 14:25   ` Zi Yan
2025-07-18  9:02 ` [PATCH v5 3/7] mm: Add batched versions of ptep_modify_prot_start/commit Dev Jain
2025-07-18 17:05   ` Lorenzo Stoakes
2025-07-20 23:59   ` Barry Song
2025-07-22 11:35   ` Ryan Roberts
2025-07-23 15:09   ` Zi Yan
2025-07-18  9:02 ` [PATCH v5 4/7] mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure Dev Jain
2025-07-18 17:12   ` Lorenzo Stoakes
2025-07-22 11:37   ` Ryan Roberts
2025-07-23 15:28   ` Zi Yan
2025-07-23 15:32     ` Dev Jain
2025-07-18  9:02 ` [PATCH v5 5/7] mm: Split can_change_pte_writable() into private and shared parts Dev Jain
2025-07-18 17:27   ` Lorenzo Stoakes
2025-07-23 15:40   ` Zi Yan
2025-07-18  9:02 ` [PATCH v5 6/7] mm: Optimize mprotect() by PTE batching Dev Jain
2025-07-18 18:49   ` Lorenzo Stoakes
2025-07-19 13:46     ` Dev Jain
2025-07-20 11:20       ` Lorenzo Stoakes
2025-07-20 14:39         ` Dev Jain
2025-07-24 19:55   ` Zi Yan
2025-08-06  8:08   ` David Hildenbrand
2025-08-06  8:12     ` David Hildenbrand
2025-08-06  8:15     ` Will Deacon
2025-08-06  8:19       ` David Hildenbrand
2025-08-06  8:53     ` Dev Jain
2025-08-06  8:56       ` David Hildenbrand
2025-08-06  9:12     ` Lorenzo Stoakes
2025-08-06  9:21       ` David Hildenbrand
2025-08-06  9:37         ` Dev Jain
2025-08-06  9:50           ` Lorenzo Stoakes
2025-08-06 10:11             ` David Hildenbrand
2025-08-06 10:20               ` Dev Jain
2025-08-06 10:28                 ` David Hildenbrand
2025-08-06 10:45                 ` Lorenzo Stoakes
2025-08-06 10:45               ` Lorenzo Stoakes
     [not found]             ` <1b3d4799-2a57-4f16-973b-82fc7b438862@arm.com>
2025-08-06 10:07               ` Dev Jain
2025-08-06 10:12               ` David Hildenbrand
2025-07-18  9:02 ` [PATCH v5 7/7] arm64: Add batched versions of ptep_modify_prot_start/commit Dev Jain
2025-07-18 18:50   ` Lorenzo Stoakes
2025-07-21 15:57   ` Catalin Marinas
2025-07-18  9:50 ` [PATCH v5 0/7] Optimize mprotect() for large folios Dev Jain
2025-07-18 18:53   ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).