* + mm-optimize-mprotect-for-mm_cp_prot_numa-by-batch-skipping-ptes.patch added to mm-new branch
@ 2025-06-29 23:06 Andrew Morton
From: Andrew Morton @ 2025-06-29 23:06 UTC
To: mm-commits, ziy, yangyicong, yang, willy, will, vbabka,
ryan.roberts, quic_zhenhuah, peterx, lorenzo.stoakes,
liam.howlett, kevin.brodsky, joey.gouly, jannh, ioworker0, hughd,
david, christophe.leroy, catalin.marinas, baohua,
anshuman.khandual, dev.jain, akpm
The patch titled
Subject: mm: optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
has been added to the -mm mm-new branch. Its filename is
mm-optimize-mprotect-for-mm_cp_prot_numa-by-batch-skipping-ptes.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-optimize-mprotect-for-mm_cp_prot_numa-by-batch-skipping-ptes.patch
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Dev Jain <dev.jain@arm.com>
Subject: mm: optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
Date: Sat, 28 Jun 2025 17:04:32 +0530
Patch series "Optimize mprotect() for large folios", v4.
This patchset optimizes the mprotect() system call for large folios by
PTE-batching. No issues were observed with mm-selftests, build tested on
x86_64.
We use the following test cases to measure performance, mprotect()'ing the
mapped memory to read-only then read-write 40 times:
Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
pte-mapping those THPs
Test case 2: Mapping 1G of memory with 64K mTHPs
Test case 3: Mapping 1G of memory with 4K pages
Average execution time on arm64, Apple M3:
Before the patchset:
T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
After the patchset:
T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds
Comparing T1/T2 with T3 before the patchset shows the regression
introduced by ptep_get() on a contpte block, which this patchset also
removes. For large folios the execution time drops by about 73% (7.9s to
2.1s in T1), the trade-off being a slight degradation in the small folio
case (4.2s to 4.3s in T3).
Here is the test program:
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

#define SIZE (1024*1024*1024)

unsigned long pmdsize = (1UL << 21);
unsigned long pagesize = (1UL << 12);

static void pte_map_thps(char *mem, size_t size)
{
	size_t offs;
	int ret = 0;

	/* PTE-map each THP by temporarily splitting the VMAs. */
	for (offs = 0; offs < size; offs += pmdsize) {
		ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
		ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
	}

	if (ret) {
		fprintf(stderr, "ERROR: madvise() failed\n");
		exit(1);
	}
}

int main(int argc, char *argv[])
{
	char *p;

	/* Ask for a 1G-aligned address so the region can be backed by PMD-THPs. */
	p = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p != (char *)(1UL << 30)) {
		perror("mmap");
		return 1;
	}

	memset(p, 0, SIZE);
	if (madvise(p, SIZE, MADV_NOHUGEPAGE))
		perror("madvise");
	explicit_bzero(p, SIZE);
	pte_map_thps(p, SIZE);

	/* Toggle the mapping between read-only and read-write 40 times. */
	for (int loops = 0; loops < 40; loops++) {
		if (mprotect(p, SIZE, PROT_READ))
			perror("mprotect"), exit(1);
		if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
			perror("mprotect"), exit(1);
		explicit_bzero(p, SIZE);
	}
}
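The test program does not show how the execution times were collected. One way to measure them is to wrap the mprotect() loop in a monotonic clock measurement. This is a minimal sketch, assuming clock_gettime(CLOCK_MONOTONIC); the helpers elapsed_seconds() and time_call() are illustrative names and are not part of the posted test program:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Seconds elapsed between two timespec samples (hypothetical helper). */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Run fn(arg) once and return the wall-clock seconds spent in it. */
static double time_call(void (*fn)(void *), void *arg)
{
	struct timespec start, end;

	assert(fn != NULL);
	clock_gettime(CLOCK_MONOTONIC, &start);
	fn(arg);
	clock_gettime(CLOCK_MONOTONIC, &end);
	return elapsed_seconds(start, end);
}
```

Moving the 40-iteration loop into a function and passing it to time_call() would yield one sample; averaging several runs gives figures comparable to the ones quoted above.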
This patch (of 4):
In case of prot_numa, there are various cases in which we can skip to the
next iteration. Since the skip condition is based on the folio and not
the PTEs, we can skip a PTE batch. Additionally refactor all of this into
a new function to clean up the existing code.
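The effect of batch-skipping can be modelled in userspace. The sketch below is not kernel code: PTEs are represented by an array of folio IDs (the hypothetical folio_of[]), the skip decision by a per-folio flag (the hypothetical skip_folio[]), and the function counts how often the skip check runs when the loop advances by a whole batch instead of one entry at a time:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Length of the run of entries starting at idx that map the same folio. */
static size_t same_folio_run(const int *folio_of, size_t idx, size_t n)
{
	size_t len = 1;

	while (idx + len < n && folio_of[idx + len] == folio_of[idx])
		len++;
	return len;
}

/*
 * Walk n entries and return how many times the skip condition was evaluated.
 * With batch == true, a skipped folio advances the loop past all of its
 * entries; otherwise every entry is checked individually.
 */
static size_t count_skip_checks(const int *folio_of, const bool *skip_folio,
				size_t n, bool batch)
{
	size_t checks = 0, i = 0, nr;

	assert(n > 0);
	do {
		nr = 1;
		checks++;
		if (skip_folio[folio_of[i]] && batch)
			nr = same_folio_run(folio_of, i, n);
		i += nr;
	} while (i != n);

	return checks;
}
```

As in the patch, the loop stride defaults to one entry and only grows when the skip condition holds, so entries that are not skipped are still visited one by one.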
Link: https://lkml.kernel.org/r/20250628113435.46678-1-dev.jain@arm.com
Link: https://lkml.kernel.org/r/20250628113435.46678-2-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/mprotect.c | 134 +++++++++++++++++++++++++++++++-----------------
1 file changed, 87 insertions(+), 47 deletions(-)
--- a/mm/mprotect.c~mm-optimize-mprotect-for-mm_cp_prot_numa-by-batch-skipping-ptes
+++ a/mm/mprotect.c
@@ -83,6 +83,83 @@ bool can_change_pte_writable(struct vm_a
return pte_dirty(pte);
}
+static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
+ pte_t *ptep, pte_t pte, int max_nr_ptes)
+{
+ const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+
+ if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
+ return 1;
+
+ return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
+ NULL, NULL, NULL);
+}
+
+static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma,
+ unsigned long addr, pte_t oldpte, pte_t *pte, int target_node,
+ int max_nr_ptes)
+{
+ struct folio *folio = NULL;
+ int nr_ptes = 1;
+ bool toptier;
+ int nid;
+
+ /* Avoid TLB flush if possible */
+ if (pte_protnone(oldpte))
+ goto skip_batch;
+
+ folio = vm_normal_folio(vma, addr, oldpte);
+ if (!folio)
+ goto skip_batch;
+
+ if (folio_is_zone_device(folio) || folio_test_ksm(folio))
+ goto skip_batch;
+
+ /* Also skip shared copy-on-write pages */
+ if (is_cow_mapping(vma->vm_flags) &&
+ (folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
+ goto skip_batch;
+
+ /*
+ * While migration can move some dirty pages,
+ * it cannot move them all from MIGRATE_ASYNC
+ * context.
+ */
+ if (folio_is_file_lru(folio) && folio_test_dirty(folio))
+ goto skip_batch;
+
+ /*
+ * Don't mess with PTEs if page is already on the node
+ * a single-threaded process is running on.
+ */
+ nid = folio_nid(folio);
+ if (target_node == nid)
+ goto skip_batch;
+
+ toptier = node_is_toptier(nid);
+
+ /*
+ * Skip scanning top tier node if normal numa
+ * balancing is disabled
+ */
+ if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
+ goto skip_batch;
+
+ if (folio_use_access_time(folio)) {
+ folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
+
+ /* Do not skip in this case */
+ nr_ptes = 0;
+ goto out;
+ }
+
+skip_batch:
+ nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
+out:
+ *foliop = folio;
+ return nr_ptes;
+}
+
static long change_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -94,6 +171,7 @@ static long change_pte_range(struct mmu_
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ int nr_ptes;
tlb_change_page_size(tlb, PAGE_SIZE);
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -108,8 +186,11 @@ static long change_pte_range(struct mmu_
flush_tlb_batched_pending(vma->vm_mm);
arch_enter_lazy_mmu_mode();
do {
+ nr_ptes = 1;
oldpte = ptep_get(pte);
if (pte_present(oldpte)) {
+ int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
+ struct folio *folio = NULL;
pte_t ptent;
/*
@@ -117,53 +198,12 @@ static long change_pte_range(struct mmu_
* pages. See similar comment in change_huge_pmd.
*/
if (prot_numa) {
- struct folio *folio;
- int nid;
- bool toptier;
-
- /* Avoid TLB flush if possible */
- if (pte_protnone(oldpte))
- continue;
-
- folio = vm_normal_folio(vma, addr, oldpte);
- if (!folio || folio_is_zone_device(folio) ||
- folio_test_ksm(folio))
- continue;
-
- /* Also skip shared copy-on-write pages */
- if (is_cow_mapping(vma->vm_flags) &&
- (folio_maybe_dma_pinned(folio) ||
- folio_maybe_mapped_shared(folio)))
- continue;
-
- /*
- * While migration can move some dirty pages,
- * it cannot move them all from MIGRATE_ASYNC
- * context.
- */
- if (folio_is_file_lru(folio) &&
- folio_test_dirty(folio))
- continue;
-
- /*
- * Don't mess with PTEs if page is already on the node
- * a single-threaded process is running on.
- */
- nid = folio_nid(folio);
- if (target_node == nid)
- continue;
- toptier = node_is_toptier(nid);
-
- /*
- * Skip scanning top tier node if normal numa
- * balancing is disabled
- */
- if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
- toptier)
+ nr_ptes = prot_numa_skip_ptes(&folio, vma,
+ addr, oldpte, pte,
+ target_node,
+ max_nr_ptes);
+ if (nr_ptes)
continue;
- if (folio_use_access_time(folio))
- folio_xchg_access_time(folio,
- jiffies_to_msecs(jiffies));
}
oldpte = ptep_modify_prot_start(vma, addr, pte);
@@ -280,7 +320,7 @@ static long change_pte_range(struct mmu_
pages++;
}
}
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
_
Patches currently in -mm which might be from dev.jain@arm.com are
xarray-add-a-bug_on-to-ensure-caller-is-not-sibling.patch
mm-call-pointers-to-ptes-as-ptep.patch
mm-optimize-mremap-by-pte-batching.patch
maple-tree-use-goto-label-to-simplify-code.patch
mm-optimize-mprotect-for-mm_cp_prot_numa-by-batch-skipping-ptes.patch
mm-add-batched-versions-of-ptep_modify_prot_start-commit.patch
mm-optimize-mprotect-by-pte-batching.patch
arm64-add-batched-versions-of-ptep_modify_prot_start-commit.patch
* + mm-optimize-mprotect-for-mm_cp_prot_numa-by-batch-skipping-ptes.patch added to mm-new branch
@ 2025-07-20 0:18 Andrew Morton
From: Andrew Morton @ 2025-07-20 0:18 UTC
To: mm-commits, ziy, yangyicong, yang, willy, will, vbabka,
ryan.roberts, quic_zhenhuah, peterx, lorenzo.stoakes,
liam.howlett, kevin.brodsky, joey.gouly, jannh, ioworker0, hughd,
david, christophe.leroy, catalin.marinas, baohua,
anshuman.khandual, dev.jain, akpm
The patch titled
Subject: mm: optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
has been added to the -mm mm-new branch. Its filename is
mm-optimize-mprotect-for-mm_cp_prot_numa-by-batch-skipping-ptes.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-optimize-mprotect-for-mm_cp_prot_numa-by-batch-skipping-ptes.patch
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Dev Jain <dev.jain@arm.com>
Subject: mm: optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
Date: Fri, 18 Jul 2025 14:32:39 +0530
For the MM_CP_PROT_NUMA skipping case, observe that, if we skip an
iteration due to the underlying folio satisfying any of the skip
conditions, then for all subsequent ptes which map the same folio, the
iteration will be skipped for them too. Therefore, we can optimize by
using folio_pte_batch() to batch skip the iterations.
Use prot_numa_skip() introduced in the previous patch to determine whether
we need to skip the iteration. Change its signature to have a double
pointer to a folio, which will be used by mprotect_folio_pte_batch() to
determine the number of iterations we can safely skip.
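The hand-off described above, a boolean skip decision plus the folio returned through a double pointer and then used to size the batch, can be sketched with toy types. Everything prefixed toy_ is a hypothetical stand-in rather than the kernel definition, and the batch check compares page indexes within a folio where the real folio_pte_batch() works on PTE contents:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel types; not the real definitions. */
struct toy_folio {
	int nr_pages;		/* 1 for a small folio, >1 for a large folio */
	bool on_target_node;	/* models one of the skip conditions */
};

struct toy_pte {
	struct toy_folio *folio;	/* NULL models "no normal folio" */
	int page_idx;			/* index of the mapped page within the folio */
};

/* Returns true when the entry should be skipped; always reports the folio. */
static bool toy_numa_skip(const struct toy_pte *pte, struct toy_folio **foliop)
{
	struct toy_folio *folio = pte->folio;
	bool ret = true;

	if (!folio)
		goto skip;
	if (folio->on_target_node)
		goto skip;
	ret = false;
skip:
	*foliop = folio;
	return ret;
}

/* Number of consecutive entries that can be skipped together, at least 1. */
static int toy_pte_batch(struct toy_folio *folio, const struct toy_pte *ptep,
			 int max_nr)
{
	int nr = 1;

	assert(max_nr >= 1);
	/* No folio or a small folio: nothing to batch. */
	if (!folio || folio->nr_pages == 1)
		return 1;

	while (nr < max_nr && ptep[nr].folio == folio &&
	       ptep[nr].page_idx == ptep[0].page_idx + nr)
		nr++;
	return nr;
}
```

Writing the folio through the out-parameter on every exit path, including the NULL case, lets the caller pass it straight to the batch helper without looking it up a second time.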
Link: https://lkml.kernel.org/r/20250718090244.21092-3-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/mprotect.c | 55 ++++++++++++++++++++++++++++++++++++------------
1 file changed, 42 insertions(+), 13 deletions(-)
--- a/mm/mprotect.c~mm-optimize-mprotect-for-mm_cp_prot_numa-by-batch-skipping-ptes
+++ a/mm/mprotect.c
@@ -83,28 +83,43 @@ bool can_change_pte_writable(struct vm_a
return pte_dirty(pte);
}
+static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
+ pte_t pte, int max_nr_ptes)
+{
+ /* No underlying folio, so cannot batch */
+ if (!folio)
+ return 1;
+
+ if (!folio_test_large(folio))
+ return 1;
+
+ return folio_pte_batch(folio, ptep, pte, max_nr_ptes);
+}
+
static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
- pte_t oldpte, pte_t *pte, int target_node)
+ pte_t oldpte, pte_t *pte, int target_node,
+ struct folio **foliop)
{
- struct folio *folio;
+ struct folio *folio = NULL;
+ bool ret = true;
bool toptier;
int nid;
/* Avoid TLB flush if possible */
if (pte_protnone(oldpte))
- return true;
+ goto skip;
folio = vm_normal_folio(vma, addr, oldpte);
if (!folio)
- return true;
+ goto skip;
if (folio_is_zone_device(folio) || folio_test_ksm(folio))
- return true;
+ goto skip;
/* Also skip shared copy-on-write pages */
if (is_cow_mapping(vma->vm_flags) &&
(folio_maybe_dma_pinned(folio) || folio_maybe_mapped_shared(folio)))
- return true;
+ goto skip;
/*
* While migration can move some dirty pages,
@@ -112,7 +127,7 @@ static bool prot_numa_skip(struct vm_are
* context.
*/
if (folio_is_file_lru(folio) && folio_test_dirty(folio))
- return true;
+ goto skip;
/*
* Don't mess with PTEs if page is already on the node
@@ -120,7 +135,7 @@ static bool prot_numa_skip(struct vm_are
*/
nid = folio_nid(folio);
if (target_node == nid)
- return true;
+ goto skip;
toptier = node_is_toptier(nid);
@@ -129,11 +144,15 @@ static bool prot_numa_skip(struct vm_are
* balancing is disabled
*/
if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier)
- return true;
+ goto skip;
+ ret = false;
if (folio_use_access_time(folio))
folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
- return false;
+
+skip:
+ *foliop = folio;
+ return ret;
}
static long change_pte_range(struct mmu_gather *tlb,
@@ -147,6 +166,7 @@ static long change_pte_range(struct mmu_
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ int nr_ptes;
tlb_change_page_size(tlb, PAGE_SIZE);
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -161,8 +181,11 @@ static long change_pte_range(struct mmu_
flush_tlb_batched_pending(vma->vm_mm);
arch_enter_lazy_mmu_mode();
do {
+ nr_ptes = 1;
oldpte = ptep_get(pte);
if (pte_present(oldpte)) {
+ int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
+ struct folio *folio;
pte_t ptent;
/*
@@ -170,9 +193,15 @@ static long change_pte_range(struct mmu_
* pages. See similar comment in change_huge_pmd.
*/
if (prot_numa) {
- if (prot_numa_skip(vma, addr, oldpte, pte,
- target_node))
+ int ret = prot_numa_skip(vma, addr, oldpte, pte,
+ target_node, &folio);
+ if (ret) {
+
+ /* determine batch to skip */
+ nr_ptes = mprotect_folio_pte_batch(folio,
+ pte, oldpte, max_nr_ptes);
continue;
+ }
}
oldpte = ptep_modify_prot_start(vma, addr, pte);
@@ -289,7 +318,7 @@ static long change_pte_range(struct mmu_
pages++;
}
}
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
_
Patches currently in -mm which might be from dev.jain@arm.com are
mm-refactor-mm_cp_prot_numa-skipping-case-into-new-function.patch
mm-optimize-mprotect-for-mm_cp_prot_numa-by-batch-skipping-ptes.patch
mm-add-batched-versions-of-ptep_modify_prot_start-commit.patch
mm-introduce-fpb_respect_write-for-pte-batching-infrastructure.patch
mm-split-can_change_pte_writable-into-private-and-shared-parts.patch
mm-optimize-mprotect-by-pte-batching.patch
arm64-add-batched-versions-of-ptep_modify_prot_start-commit.patch