* [PATCH v2 0/2] mm/mprotect: micro-optimization work
@ 2026-03-24 15:43 Pedro Falcato
2026-03-24 15:43 ` [PATCH v2 1/2] mm/mprotect: move softleaf code out of the main function Pedro Falcato
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Pedro Falcato @ 2026-03-24 15:43 UTC (permalink / raw)
To: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes
Cc: Pedro Falcato, Vlastimil Babka, Jann Horn, David Hildenbrand,
Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel
Micro-optimize the change_protection() functionality and the
change_pte_range() routine. This set of functions runs in an extremely
tight loop, and even small inefficiencies become glaring when spun
hundreds, thousands, or hundreds of thousands of times.
An attempt was made to preserve the batching functionality as much as
possible; that accounts for some of the slowness, but not all of it.
Removing batching for !arm64 architectures would speed mprotect() up even
further, but could easily pessimize cases where large folios are mapped
(which is not as rare as it seems, particularly when it comes to the page
cache these days).
The micro-benchmark used for the tests was [0] (buildable with google/benchmark
and g++ -O2 repro.cpp -lbenchmark).
This resulted in the following (first entry is baseline):
---------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------
mprotect_bench 85967 ns 85967 ns 6935
mprotect_bench 73374 ns 73373 ns 9602
After the patchset we can observe a 14% speedup in mprotect. Wonderful
for the elusive mprotect-based workloads!
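For readers without google/benchmark handy, a minimal standalone sketch of
this kind of mprotect micro-benchmark (not the exact gist from [0]; region
size and iteration count here are arbitrary, in the spirit of libmicro's
mprot_tw4m) could look like:

```c
/* Minimal sketch of an mprotect micro-benchmark. Returns the average
 * nanoseconds per protect/unprotect pair over the region, or -1 on error. */
#define _GNU_SOURCE
#include <string.h>
#include <time.h>
#include <sys/mman.h>

long bench_mprotect(size_t size, int iters)
{
	char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timespec start, end;

	if (p == MAP_FAILED)
		return -1;
	memset(p, 1, size);	/* fault the pages in before timing */

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < iters; i++) {
		/* Each pair forces change_protection() over the whole range */
		if (mprotect(p, size, PROT_READ) ||
		    mprotect(p, size, PROT_READ | PROT_WRITE)) {
			munmap(p, size);
			return -1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	munmap(p, size);

	return ((end.tv_sec - start.tv_sec) * 1000000000L +
		(end.tv_nsec - start.tv_nsec)) / iters;
}
```

Timing the write-enable direction is what exercises the can_change_pte_writable()
paths touched by this series.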
Testing & more ideas welcome. I suspect there is plenty of improvement possible,
but it would require more time than I have on my hands right now. The
entire inlined function (which inlines into change_protection()) is gigantic
- I'm not surprised it is so finicky.
Note: per my profiling, the next _big_ bottleneck here is modify_prot_start_ptes(),
specifically the xchg() done by x86. ptep_get_and_clear() is _expensive_. I don't
think there's a properly safe way around it, since we depend on the dirty (D) bit
quite a lot. This might be less of an issue on other architectures.
[0]: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933
Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Luke Yang <luyang@redhat.com>
Cc: jhladky@redhat.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
v2:
- Addressed Sashiko's concerns
- Picked up Lorenzo's R-b's (thank you!)
- Squashed patch 1 and 4 into a single one (David)
- Renamed the softleaf leaf function (David)
- Dropped controversial noinlines & patch 3 (Lorenzo & David)
v1:
https://lore.kernel.org/linux-mm/20260319183108.1105090-1-pfalcato@suse.de/
Pedro Falcato (2):
mm/mprotect: move softleaf code out of the main function
mm/mprotect: special-case small folios when applying write permissions
mm/mprotect.c | 146 ++++++++++++++++++++++++++++----------------------
1 file changed, 81 insertions(+), 65 deletions(-)
--
2.53.0
^ permalink raw reply [flat|nested] 17+ messages in thread* [PATCH v2 1/2] mm/mprotect: move softleaf code out of the main function 2026-03-24 15:43 [PATCH v2 0/2] mm/mprotect: micro-optimization work Pedro Falcato @ 2026-03-24 15:43 ` Pedro Falcato 2026-03-24 20:12 ` David Hildenbrand (Arm) 2026-03-24 15:43 ` [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions Pedro Falcato 2026-03-30 19:55 ` [PATCH v2 0/2] mm/mprotect: micro-optimization work Luke Yang 2 siblings, 1 reply; 17+ messages in thread From: Pedro Falcato @ 2026-03-24 15:43 UTC (permalink / raw) To: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes Cc: Pedro Falcato, Vlastimil Babka, Jann Horn, David Hildenbrand, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel Move softleaf change_pte_range code into a separate function. This makes the change_pte_range() function a good bit smaller, and lessens cognitive load when reading through the function. Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Signed-off-by: Pedro Falcato <pfalcato@suse.de> --- mm/mprotect.c | 129 +++++++++++++++++++++++++++----------------------- 1 file changed, 69 insertions(+), 60 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 9681f055b9fc..2eaf862e5734 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -211,6 +211,74 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb); } +static long change_softleaf_pte(struct vm_area_struct *vma, + unsigned long addr, pte_t *pte, pte_t oldpte, unsigned long cp_flags) +{ + const bool uffd_wp = cp_flags & MM_CP_UFFD_WP; + const bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; + softleaf_t entry = softleaf_from_pte(oldpte); + pte_t newpte; + + if (softleaf_is_migration_write(entry)) { + const struct folio *folio = softleaf_to_folio(entry); + + /* + * A protection check is difficult so + * just be safe and disable write + */ + if 
(folio_test_anon(folio)) + entry = make_readable_exclusive_migration_entry( + swp_offset(entry)); + else + entry = make_readable_migration_entry(swp_offset(entry)); + newpte = swp_entry_to_pte(entry); + if (pte_swp_soft_dirty(oldpte)) + newpte = pte_swp_mksoft_dirty(newpte); + } else if (softleaf_is_device_private_write(entry)) { + /* + * We do not preserve soft-dirtiness. See + * copy_nonpresent_pte() for explanation. + */ + entry = make_readable_device_private_entry( + swp_offset(entry)); + newpte = swp_entry_to_pte(entry); + if (pte_swp_uffd_wp(oldpte)) + newpte = pte_swp_mkuffd_wp(newpte); + } else if (softleaf_is_marker(entry)) { + /* + * Ignore error swap entries unconditionally, + * because any access should sigbus/sigsegv + * anyway. + */ + if (softleaf_is_poison_marker(entry) || + softleaf_is_guard_marker(entry)) + return 0; + /* + * If this is uffd-wp pte marker and we'd like + * to unprotect it, drop it; the next page + * fault will trigger without uffd trapping. + */ + if (uffd_wp_resolve) { + pte_clear(vma->vm_mm, addr, pte); + return 1; + } + return 0; + } else { + newpte = oldpte; + } + + if (uffd_wp) + newpte = pte_swp_mkuffd_wp(newpte); + else if (uffd_wp_resolve) + newpte = pte_swp_clear_uffd_wp(newpte); + + if (!pte_same(oldpte, newpte)) { + set_pte_at(vma->vm_mm, addr, pte, newpte); + return 1; + } + return 0; +} + static long change_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, unsigned long cp_flags) @@ -317,66 +385,7 @@ static long change_pte_range(struct mmu_gather *tlb, pages++; } } else { - softleaf_t entry = softleaf_from_pte(oldpte); - pte_t newpte; - - if (softleaf_is_migration_write(entry)) { - const struct folio *folio = softleaf_to_folio(entry); - - /* - * A protection check is difficult so - * just be safe and disable write - */ - if (folio_test_anon(folio)) - entry = make_readable_exclusive_migration_entry( - swp_offset(entry)); - else - entry = 
make_readable_migration_entry(swp_offset(entry)); - newpte = swp_entry_to_pte(entry); - if (pte_swp_soft_dirty(oldpte)) - newpte = pte_swp_mksoft_dirty(newpte); - } else if (softleaf_is_device_private_write(entry)) { - /* - * We do not preserve soft-dirtiness. See - * copy_nonpresent_pte() for explanation. - */ - entry = make_readable_device_private_entry( - swp_offset(entry)); - newpte = swp_entry_to_pte(entry); - if (pte_swp_uffd_wp(oldpte)) - newpte = pte_swp_mkuffd_wp(newpte); - } else if (softleaf_is_marker(entry)) { - /* - * Ignore error swap entries unconditionally, - * because any access should sigbus/sigsegv - * anyway. - */ - if (softleaf_is_poison_marker(entry) || - softleaf_is_guard_marker(entry)) - continue; - /* - * If this is uffd-wp pte marker and we'd like - * to unprotect it, drop it; the next page - * fault will trigger without uffd trapping. - */ - if (uffd_wp_resolve) { - pte_clear(vma->vm_mm, addr, pte); - pages++; - } - continue; - } else { - newpte = oldpte; - } - - if (uffd_wp) - newpte = pte_swp_mkuffd_wp(newpte); - else if (uffd_wp_resolve) - newpte = pte_swp_clear_uffd_wp(newpte); - - if (!pte_same(oldpte, newpte)) { - set_pte_at(vma->vm_mm, addr, pte, newpte); - pages++; - } + pages += change_softleaf_pte(vma, addr, pte, oldpte, cp_flags); } } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end); lazy_mmu_mode_disable(); -- 2.53.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/2] mm/mprotect: move softleaf code out of the main function 2026-03-24 15:43 ` [PATCH v2 1/2] mm/mprotect: move softleaf code out of the main function Pedro Falcato @ 2026-03-24 20:12 ` David Hildenbrand (Arm) 0 siblings, 0 replies; 17+ messages in thread From: David Hildenbrand (Arm) @ 2026-03-24 20:12 UTC (permalink / raw) To: Pedro Falcato, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes Cc: Vlastimil Babka, Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel On 3/24/26 16:43, Pedro Falcato wrote: > Move softleaf change_pte_range code into a separate function. This makes > the change_pte_range() function a good bit smaller, and lessens cognitive > load when reading through the function. > > Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> > Signed-off-by: Pedro Falcato <pfalcato@suse.de> > --- > mm/mprotect.c | 129 +++++++++++++++++++++++++++----------------------- > 1 file changed, 69 insertions(+), 60 deletions(-) > > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 9681f055b9fc..2eaf862e5734 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -211,6 +211,74 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, > commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb); > } > > +static long change_softleaf_pte(struct vm_area_struct *vma, > + unsigned long addr, pte_t *pte, pte_t oldpte, unsigned long cp_flags) > +{ > + const bool uffd_wp = cp_flags & MM_CP_UFFD_WP; > + const bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; > + softleaf_t entry = softleaf_from_pte(oldpte); > + pte_t newpte; > + > + if (softleaf_is_migration_write(entry)) { > + const struct folio *folio = softleaf_to_folio(entry); > + > + /* > + * A protection check is difficult so > + * just be safe and disable write > + */ > + if (folio_test_anon(folio)) > + entry = make_readable_exclusive_migration_entry( > + swp_offset(entry)); We can now even fit this into a single line without feeling bad :) > 
+ else > + entry = make_readable_migration_entry(swp_offset(entry)); > + newpte = swp_entry_to_pte(entry); > + if (pte_swp_soft_dirty(oldpte)) > + newpte = pte_swp_mksoft_dirty(newpte); > + } else if (softleaf_is_device_private_write(entry)) { > + /* > + * We do not preserve soft-dirtiness. See > + * copy_nonpresent_pte() for explanation. > + */ > + entry = make_readable_device_private_entry( > + swp_offset(entry)); Same here. > + newpte = swp_entry_to_pte(entry); > + if (pte_swp_uffd_wp(oldpte)) > + newpte = pte_swp_mkuffd_wp(newpte); > + } else if (softleaf_is_marker(entry)) { > + /* Acked-by: David Hildenbrand (Arm) <david@kernel.org> -- Cheers, David ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions 2026-03-24 15:43 [PATCH v2 0/2] mm/mprotect: micro-optimization work Pedro Falcato 2026-03-24 15:43 ` [PATCH v2 1/2] mm/mprotect: move softleaf code out of the main function Pedro Falcato @ 2026-03-24 15:43 ` Pedro Falcato 2026-03-24 20:18 ` David Hildenbrand (Arm) 2026-03-30 19:55 ` [PATCH v2 0/2] mm/mprotect: micro-optimization work Luke Yang 2 siblings, 1 reply; 17+ messages in thread From: Pedro Falcato @ 2026-03-24 15:43 UTC (permalink / raw) To: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes Cc: Pedro Falcato, Vlastimil Babka, Jann Horn, David Hildenbrand, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel The common order-0 case is important enough to want its own branch, and avoids the hairy, large loop logic that the CPU does not seem to handle particularly well. While at it, encourage the compiler to inline batch PTE logic and resolve constant branches by adding __always_inline strategically. 
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Signed-off-by: Pedro Falcato <pfalcato@suse.de> --- mm/mprotect.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 2eaf862e5734..2fda26107066 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, return can_change_shared_pte_writable(vma, pte); } -static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, +static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t flags) { /* No underlying folio, so cannot batch */ @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, } /* Set nr_ptes number of ptes, starting from idx */ -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, - pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, - int idx, bool set_write, struct mmu_gather *tlb) +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, + int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb) { /* * Advance the position in the batch by idx; note that if idx > 0, @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len, * pte of the batch. Therefore, we must individually check all pages and * retrieve sub-batches. */ -static void commit_anon_folio_batch(struct vm_area_struct *vma, +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma, struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) { @@ -177,6 +177,13 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma, int sub_batch_idx = 0; int len; + /* Optimize for the common order-0 case. 
*/ + if (likely(nr_ptes == 1)) { + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1, + 0, PageAnonExclusive(first_page), tlb); + return; + } + while (nr_ptes) { expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx); len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes, -- 2.53.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions 2026-03-24 15:43 ` [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions Pedro Falcato @ 2026-03-24 20:18 ` David Hildenbrand (Arm) 2026-03-25 11:37 ` Pedro Falcato 0 siblings, 1 reply; 17+ messages in thread From: David Hildenbrand (Arm) @ 2026-03-24 20:18 UTC (permalink / raw) To: Pedro Falcato, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes Cc: Vlastimil Babka, Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel On 3/24/26 16:43, Pedro Falcato wrote: > The common order-0 case is important enough to want its own branch, and > avoids the hairy, large loop logic that the CPU does not seem to handle > particularly well. > > While at it, encourage the compiler to inline batch PTE logic and resolve > constant branches by adding __always_inline strategically. > > Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> > Signed-off-by: Pedro Falcato <pfalcato@suse.de> > --- > mm/mprotect.c | 17 ++++++++++++----- > 1 file changed, 12 insertions(+), 5 deletions(-) > > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 2eaf862e5734..2fda26107066 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, > return can_change_shared_pte_writable(vma, pte); > } > > -static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > +static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > pte_t pte, int max_nr_ptes, fpb_t flags) > { > /* No underlying folio, so cannot batch */ > @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > } > > /* Set nr_ptes number of ptes, starting from idx */ > -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, > - pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, > - int idx, bool set_write, struct mmu_gather *tlb) 
> +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma, > + unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, > + int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb) > { > /* > * Advance the position in the batch by idx; note that if idx > 0, > @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len, > * pte of the batch. Therefore, we must individually check all pages and > * retrieve sub-batches. > */ > -static void commit_anon_folio_batch(struct vm_area_struct *vma, > +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma, > struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep, > pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > { > @@ -177,6 +177,13 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma, > int sub_batch_idx = 0; > int len; > > + /* Optimize for the common order-0 case. */ > + if (likely(nr_ptes == 1)) { > + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1, > + 0, PageAnonExclusive(first_page), tlb); To optimize that one, inlining prot_commit_flush_ptes() would be sufficient. Does inlining the other two really help? I don't think we can optimize out loops etc. for them? I would have thought that specializing on nr_ptes==0 on an even higher level--where we call set_write_prot_commit_flush_ptes/prot_commit_flush_ptes() would allow for optimizing the loops entirely for nr_ptes==0? -- Cheers, David ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions 2026-03-24 20:18 ` David Hildenbrand (Arm) @ 2026-03-25 11:37 ` Pedro Falcato 2026-03-30 15:16 ` David Hildenbrand (Arm) 0 siblings, 1 reply; 17+ messages in thread From: Pedro Falcato @ 2026-03-25 11:37 UTC (permalink / raw) To: David Hildenbrand (Arm) Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel On Tue, Mar 24, 2026 at 09:18:42PM +0100, David Hildenbrand (Arm) wrote: > On 3/24/26 16:43, Pedro Falcato wrote: > > The common order-0 case is important enough to want its own branch, and > > avoids the hairy, large loop logic that the CPU does not seem to handle > > particularly well. > > > > While at it, encourage the compiler to inline batch PTE logic and resolve > > constant branches by adding __always_inline strategically. > > > > Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> > > Signed-off-by: Pedro Falcato <pfalcato@suse.de> > > --- > > mm/mprotect.c | 17 ++++++++++++----- > > 1 file changed, 12 insertions(+), 5 deletions(-) > > > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > index 2eaf862e5734..2fda26107066 100644 > > --- a/mm/mprotect.c > > +++ b/mm/mprotect.c > > @@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, > > return can_change_shared_pte_writable(vma, pte); > > } > > > > -static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > > +static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > > pte_t pte, int max_nr_ptes, fpb_t flags) > > { > > /* No underlying folio, so cannot batch */ > > @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > > } > > > > /* Set nr_ptes number of ptes, starting from idx */ > > -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, > > - pte_t *ptep, pte_t oldpte, pte_t ptent, int 
nr_ptes, > > - int idx, bool set_write, struct mmu_gather *tlb) > > +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma, > > + unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, > > + int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb) > > { > > /* > > * Advance the position in the batch by idx; note that if idx > 0, > > @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len, > > * pte of the batch. Therefore, we must individually check all pages and > > * retrieve sub-batches. > > */ > > -static void commit_anon_folio_batch(struct vm_area_struct *vma, > > +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma, > > struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep, > > pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > > { > > @@ -177,6 +177,13 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma, > > int sub_batch_idx = 0; > > int len; > > > > + /* Optimize for the common order-0 case. */ > > + if (likely(nr_ptes == 1)) { > > + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1, > > + 0, PageAnonExclusive(first_page), tlb); > > To optimize that one, inlining prot_commit_flush_ptes() would be > sufficient. Does inlining the other two really help? I don't think we > can optimize out loops etc. for them? Well, I'm getting meaningful (smaller) wins from adding those __always_inline's. (and I also get a small win for __always_inline on set_write_prot_commit_flush_ptes, but I didn't realize that until now). > > I would have thought that specializing on nr_ptes==0 on an even higher > level--where we call > set_write_prot_commit_flush_ptes/prot_commit_flush_ptes() would allow > for optimizing the loops entirely for nr_ptes==0? That could also work, but then set_write_prot_commit_flush_ptes (holy cow what a long name) would definitely need inlining. And might be a little uglier overall. 
This is the part where having data points other than my giga-fast-giga-powerful
zen5 could prove handy :/

-- 
Pedro

^ permalink raw reply	[flat|nested] 17+ messages in thread
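To illustrate why the __always_inline annotations under discussion pay off
(a simplified, hypothetical sketch, not the kernel helpers themselves): when a
loop-carrying helper is force-inlined at a call site that passes a constant
nr_ptes of 1, the compiler can resolve the loop and the constant branches at
compile time, emitting straight-line code for the common order-0 case:

```c
#include <stddef.h>

/* Hypothetical stand-in for a PTE-batch helper: processes nr entries.
 * Forcing inlining lets constant arguments propagate into the loop. */
static __attribute__((always_inline)) inline long
process_batch(const long *vals, int nr, int set_write)
{
	long acc = 0;

	for (int i = 0; i < nr; i++)	/* with nr == 1, this loop folds away */
		acc += set_write ? vals[i] * 2 : vals[i];
	return acc;
}

long process_one(const long *vals, int set_write)
{
	/* Constant nr == 1: no loop setup survives in the generated code,
	 * and set_write can be folded per call site when it is constant. */
	return process_batch(vals, 1, set_write);
}
```

Without the attribute, the compiler may keep the helper out of line and retain
the generic loop even for the single-entry call.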
* Re: [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions 2026-03-25 11:37 ` Pedro Falcato @ 2026-03-30 15:16 ` David Hildenbrand (Arm) 2026-04-02 0:09 ` Andrew Morton 0 siblings, 1 reply; 17+ messages in thread From: David Hildenbrand (Arm) @ 2026-03-30 15:16 UTC (permalink / raw) To: Pedro Falcato Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel On 3/25/26 12:37, Pedro Falcato wrote: > On Tue, Mar 24, 2026 at 09:18:42PM +0100, David Hildenbrand (Arm) wrote: >> On 3/24/26 16:43, Pedro Falcato wrote: >>> The common order-0 case is important enough to want its own branch, and >>> avoids the hairy, large loop logic that the CPU does not seem to handle >>> particularly well. >>> >>> While at it, encourage the compiler to inline batch PTE logic and resolve >>> constant branches by adding __always_inline strategically. >>> >>> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> >>> Signed-off-by: Pedro Falcato <pfalcato@suse.de> >>> --- >>> mm/mprotect.c | 17 ++++++++++++----- >>> 1 file changed, 12 insertions(+), 5 deletions(-) >>> >>> diff --git a/mm/mprotect.c b/mm/mprotect.c >>> index 2eaf862e5734..2fda26107066 100644 >>> --- a/mm/mprotect.c >>> +++ b/mm/mprotect.c >>> @@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, >>> return can_change_shared_pte_writable(vma, pte); >>> } >>> >>> -static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, >>> +static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, >>> pte_t pte, int max_nr_ptes, fpb_t flags) >>> { >>> /* No underlying folio, so cannot batch */ >>> @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, >>> } >>> >>> /* Set nr_ptes number of ptes, starting from idx */ >>> -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, >>> - pte_t *ptep, 
pte_t oldpte, pte_t ptent, int nr_ptes, >>> - int idx, bool set_write, struct mmu_gather *tlb) >>> +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma, >>> + unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, >>> + int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb) >>> { >>> /* >>> * Advance the position in the batch by idx; note that if idx > 0, >>> @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len, >>> * pte of the batch. Therefore, we must individually check all pages and >>> * retrieve sub-batches. >>> */ >>> -static void commit_anon_folio_batch(struct vm_area_struct *vma, >>> +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma, >>> struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep, >>> pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) >>> { >>> @@ -177,6 +177,13 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma, >>> int sub_batch_idx = 0; >>> int len; >>> >>> + /* Optimize for the common order-0 case. */ >>> + if (likely(nr_ptes == 1)) { >>> + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1, >>> + 0, PageAnonExclusive(first_page), tlb); >> >> To optimize that one, inlining prot_commit_flush_ptes() would be >> sufficient. Does inlining the other two really help? I don't think we >> can optimize out loops etc. for them? > > Well, I'm getting meaningful (smaller) wins from adding those > __always_inline's. (and I also get a small win for __always_inline on > set_write_prot_commit_flush_ptes, but I didn't realize that until now). > >> >> I would have thought that specializing on nr_ptes==0 on an even higher >> level--where we call >> set_write_prot_commit_flush_ptes/prot_commit_flush_ptes() would allow >> for optimizing the loops entirely for nr_ptes==0? > > That could also work, but then set_write_prot_commit_flush_ptes (holy cow > what a long name) would definitely need inlining. 
And might be a little uglier
> overall.

Right. The idea is that you __always_inline any code that has PTE
loops, such that all loops for nr_ptes == 1 get optimized out.

We do that for the zap and fork logic.

>
> This is the part where having data points other than my giga-fast-giga-powerful
> zen5 could prove handy :/

I just recently lost access to my reliable, well-tuned system ...

Is it just the following benchmark?

https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933

?

I can easily extend

https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/pte-mapped-folio-benchmarks.c

to have an "mprotect" mode. I had that in the past but discarded it.

Then, we can easily measure the effect on various folio sizes when
mprotect'ing a larger memory area.

With order-0 we can then benchmark small folios exclusively.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions 2026-03-30 15:16 ` David Hildenbrand (Arm) @ 2026-04-02 0:09 ` Andrew Morton 2026-04-02 3:44 ` Andrew Morton 0 siblings, 1 reply; 17+ messages in thread From: Andrew Morton @ 2026-04-02 0:09 UTC (permalink / raw) To: David Hildenbrand (Arm) Cc: Pedro Falcato, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel On Mon, 30 Mar 2026 17:16:51 +0200 "David Hildenbrand (Arm)" <david@kernel.org> wrote: > > That could also work, but then set_write_prot_commit_flush_ptes (holy cow > > what a long name) would definitely need inlining. And might be a little uglier > > overall. > > Right. The idea is that you __always__inline any code that has PTE > loops, such that all loops for nr_pages == 1 gets optimized out. > > We do that for zap and fork logic. > > > > > This is the part where having data points other than my giga-fast-giga-powerful > > zen5 could prove handy :/ > I just recently lost access to my reliably, well tunes, system ... > > Is it just the following benchmark? > > https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933 > > ? > > > I can easily extending > > https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/pte-mapped-folio-benchmarks.c > > to have an "mprotect" mode. I had that in the past bit discarded it. > > Then, we can easily measure the effect on various folio sizes when > mprotect'ing a larger memory area. > > With order-0 we can then benchmark small folios exclusively. It sounds like this is all possible future work? We have Lorenzo's R-b on this [2/2]. I'm reading this discussion as "upstream both"? 
--- a/mm/mprotect.c~mm-mprotect-special-case-small-folios-when-applying-write-permissions +++ a/mm/mprotect.c @@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_a return can_change_shared_pte_writable(vma, pte); } -static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, +static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t flags) { /* No underlying folio, so cannot batch */ @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(stru } /* Set nr_ptes number of ptes, starting from idx */ -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, - pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, - int idx, bool set_write, struct mmu_gather *tlb) +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, + int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb) { /* * Advance the position in the batch by idx; note that if idx > 0, @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch * pte of the batch. Therefore, we must individually check all pages and * retrieve sub-batches. */ -static void commit_anon_folio_batch(struct vm_area_struct *vma, +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma, struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) { @@ -177,6 +177,13 @@ static void commit_anon_folio_batch(stru int sub_batch_idx = 0; int len; + /* Optimize for the common order-0 case. */ + if (likely(nr_ptes == 1)) { + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1, + 0, PageAnonExclusive(first_page), tlb); + return; + } + while (nr_ptes) { expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx); len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes, _ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions 2026-04-02 0:09 ` Andrew Morton @ 2026-04-02 3:44 ` Andrew Morton 2026-04-02 7:11 ` David Hildenbrand (Arm) 0 siblings, 1 reply; 17+ messages in thread From: Andrew Morton @ 2026-04-02 3:44 UTC (permalink / raw) To: David Hildenbrand (Arm), Pedro Falcato, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel On Wed, 1 Apr 2026 17:09:31 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > > I can easily extending > > > > https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/pte-mapped-folio-benchmarks.c > > > > to have an "mprotect" mode. I had that in the past bit discarded it. > > > > Then, we can easily measure the effect on various folio sizes when > > mprotect'ing a larger memory area. > > > > With order-0 we can then benchmark small folios exclusively. > > It sounds like this is all possible future work? > > We have Lorenzo's R-b on this [2/2]. I'm reading this discussion as > "upstream both"? Sorry, please ignore. Reading skills. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions
  2026-04-02  3:44               ` Andrew Morton
@ 2026-04-02  7:11                 ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-02  7:11 UTC (permalink / raw)
  To: Andrew Morton, Pedro Falcato, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Dev Jain, Luke Yang, jhladky,
	linux-mm, linux-kernel

On 4/2/26 05:44, Andrew Morton wrote:
> On Wed, 1 Apr 2026 17:09:31 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
>
>>> I can easily extend
>>>
>>> https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/pte-mapped-folio-benchmarks.c
>>>
>>> to have an "mprotect" mode. I had that in the past but discarded it.
>>>
>>> Then, we can easily measure the effect on various folio sizes when
>>> mprotect'ing a larger memory area.
>>>
>>> With order-0 we can then benchmark small folios exclusively.
>>
>> It sounds like this is all possible future work?
>>
>> We have Lorenzo's R-b on this [2/2]. I'm reading this discussion as
>> "upstream both"?
>
> Sorry, please ignore. Reading skills.

Maybe writing skills on my end. I think we should do #2 properly. #1 is
good to go.

--
Cheers,

David

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/2] mm/mprotect: micro-optimization work
  2026-03-24 15:43 [PATCH v2 0/2] mm/mprotect: micro-optimization work Pedro Falcato
  2026-03-24 15:43 ` [PATCH v2 1/2] mm/mprotect: move softleaf code out of the main function Pedro Falcato
  2026-03-24 15:43 ` [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions Pedro Falcato
@ 2026-03-30 19:55 ` Luke Yang
  2026-03-30 20:06   ` Andrew Morton
  2026-04-01 14:10   ` Pedro Falcato
  2 siblings, 2 replies; 17+ messages in thread
From: Luke Yang @ 2026-03-30 19:55 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, David Hildenbrand, Dev Jain, jhladky, linux-mm,
	linux-kernel, Nico Pache

Hi Pedro,

Thanks for working on this. I just wanted to share that we've created a
test kernel with your patches and tested on the following CPUs:

--- aarch64 ---
Ampere Altra
Ampere Altra Max

--- x86_64 ---
AMD EPYC 7713
AMD EPYC 7351
AMD EPYC 7542
AMD EPYC 7573X
AMD EPYC 7702
AMD EPYC 9754
Intel Xeon Gold 6126
Intel Xeon Gold 6330
Intel Xeon Gold 6530
Intel Xeon Platinum 8351N
Intel Core i7-6820HQ

--- ppc64le ---
IBM Power 10

On average, we see improvements ranging from a minimum of 5% to a
maximum of 55%, with most improvements showing around a 25% speedup in
the libmicro/mprot_tw4m micro-benchmark.

Thanks,
Luke

On Tue, Mar 24, 2026 at 11:44 AM Pedro Falcato <pfalcato@suse.de> wrote:
>
> Micro-optimize the change_protection functionality and the
> change_pte_range() routine. This set of functions works in an incredibly
> tight loop, and even small inefficiencies become very evident when spun
> hundreds, thousands or hundreds of thousands of times.
>
> There was an attempt to keep as much of the batching functionality as
> possible, which accounts for some of the remaining slowness, but not all
> of it. Removing it for !arm64 architectures would speed mprotect() up
> even further, but could easily pessimize cases where large folios are
> mapped (which is not as rare as it seems, particularly when it comes to
> the page cache these days).
>
> The micro-benchmark used for the tests was [0] (buildable with
> google/benchmark and g++ -O2 repro.cpp -lbenchmark).
>
> This resulted in the following (first entry is the baseline):
>
> ---------------------------------------------------------
> Benchmark           Time             CPU       Iterations
> ---------------------------------------------------------
> mprotect_bench  85967 ns         85967 ns          6935
> mprotect_bench  73374 ns         73373 ns          9602
>
> After the patchset we can observe a 14% speedup in mprotect. Wonderful
> for the elusive mprotect-based workloads!
>
> Testing & more ideas welcome. I suspect there is plenty of improvement
> possible, but it would require more time than what I have on my hands
> right now. The entire inlined function (which inlines into
> change_protection()) is gigantic - I'm not surprised this is so finicky.
>
> Note: per my profiling, the next _big_ bottleneck here is
> modify_prot_start_ptes, exactly on the xchg() done by x86.
> ptep_get_and_clear() is _expensive_. I don't think there's a properly
> safe way to go about it since we do depend on the D bit quite a lot.
> This might not be such an issue on other architectures.
>
> [0]: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933
> Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/
>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Luke Yang <luyang@redhat.com>
> Cc: jhladky@redhat.com
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
>
> v2:
> - Addressed Sashiko's concerns
> - Picked up Lorenzo's R-b's (thank you!)
> - Squashed patch 1 and 4 into a single one (David)
> - Renamed the softleaf leaf function (David)
> - Dropped controversial noinlines & patch 3 (Lorenzo & David)
>
> v1:
> https://lore.kernel.org/linux-mm/20260319183108.1105090-1-pfalcato@suse.de/
>
> Pedro Falcato (2):
>   mm/mprotect: move softleaf code out of the main function
>   mm/mprotect: special-case small folios when applying write permissions
>
>  mm/mprotect.c | 146 ++++++++++++++++++++++++++++----------------------
>  1 file changed, 81 insertions(+), 65 deletions(-)
>
> --
> 2.53.0
>

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/2] mm/mprotect: micro-optimization work
  2026-03-30 19:55 ` [PATCH v2 0/2] mm/mprotect: micro-optimization work Luke Yang
@ 2026-03-30 20:06   ` Andrew Morton
  2026-04-01  8:25     ` David Hildenbrand (Arm)
  2026-04-01 14:10   ` Pedro Falcato
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2026-03-30 20:06 UTC (permalink / raw)
  To: Luke Yang
  Cc: Pedro Falcato, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, David Hildenbrand, Dev Jain, jhladky, linux-mm,
	linux-kernel, Nico Pache

On Mon, 30 Mar 2026 15:55:51 -0400 Luke Yang <luyang@redhat.com> wrote:

> Thanks for working on this. I just wanted to share that we've created a
> test kernel with your patches and tested on the following CPUs:
>
> --- aarch64 ---
> Ampere Altra
> Ampere Altra Max
>
> --- x86_64 ---
> AMD EPYC 7713
> AMD EPYC 7351
> AMD EPYC 7542
> AMD EPYC 7573X
> AMD EPYC 7702
> AMD EPYC 9754
> Intel Xeon Gold 6126
> Intel Xeon Gold 6330
> Intel Xeon Gold 6530
> Intel Xeon Platinum 8351N
> Intel Core i7-6820HQ
>
> --- ppc64le ---
> IBM Power 10
>
> On average, we see improvements ranging from a minimum of 5% to a
> maximum of 55%, with most improvements showing around a 25% speed up in
> the libmicro/mprot_tw4m micro benchmark.

Thanks, that's nice. I've added some of the above into the changelog
and I took the liberty of adding your Tested-by: to both patches.

fyi, regarding [2/2]: it's unclear to me whether the discussion with
David will result in any alterations. If there's something I need to
do, it always helps to lmk ;)

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/2] mm/mprotect: micro-optimization work
  2026-03-30 20:06   ` Andrew Morton
@ 2026-04-01  8:25     ` David Hildenbrand (Arm)
  2026-04-01 14:14       ` Pedro Falcato
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-01  8:25 UTC (permalink / raw)
  To: Andrew Morton, Luke Yang
  Cc: Pedro Falcato, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Dev Jain, jhladky, linux-mm, linux-kernel, Nico Pache

On 3/30/26 22:06, Andrew Morton wrote:
> On Mon, 30 Mar 2026 15:55:51 -0400 Luke Yang <luyang@redhat.com> wrote:
>
>> Thanks for working on this. I just wanted to share that we've created a
>> test kernel with your patches and tested on the following CPUs:
>>
>> --- aarch64 ---
>> Ampere Altra
>> Ampere Altra Max
>>
>> --- x86_64 ---
>> AMD EPYC 7713
>> AMD EPYC 7351
>> AMD EPYC 7542
>> AMD EPYC 7573X
>> AMD EPYC 7702
>> AMD EPYC 9754
>> Intel Xeon Gold 6126
>> Intel Xeon Gold 6330
>> Intel Xeon Gold 6530
>> Intel Xeon Platinum 8351N
>> Intel Core i7-6820HQ
>>
>> --- ppc64le ---
>> IBM Power 10
>>
>> On average, we see improvements ranging from a minimum of 5% to a
>> maximum of 55%, with most improvements showing around a 25% speed up in
>> the libmicro/mprot_tw4m micro benchmark.
>
> Thanks, that's nice. I've added some of the above into the changelog
> and I took the liberty of adding your Tested-by: to both patches.
>
> fyi, regarding [2/2]: it's unclear to me whether the discussion with
> David will result in any alterations. If there's something I need to
> it always helps to lmk ;)

I think we want to get a better understanding of which exact
__always_inline is really helpful in patch #2, and where to apply the
nr_ptes == 1 forced optimization.

I updated the microbenchmark I use for fork+unmap etc. to measure
mprotect as well:

https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?ref_type=heads

Running some simple tests with order-0 on 1 GiB of memory:

Upstream Linus:

./pte-mapped-folio-benchmarks 0 write-protect 5
0.005779
...
./pte-mapped-folio-benchmarks 0 write-unprotect 5
0.009113
...

With Pedro's patch #2:

$ ./pte-mapped-folio-benchmarks 0 write-protect 5
0.003941
...
$ ./pte-mapped-folio-benchmarks 0 write-unprotect 5
0.006163
...

With the patch below:

$ ./pte-mapped-folio-benchmarks 0 write-protect 5
0.003364

$ ./pte-mapped-folio-benchmarks 0 write-unprotect 5
0.005729

So patch #2 might be improved. And the forced inlining of
mprotect_folio_pte_batch() should likely not go into the same patch.

---

From cf1a2a4a6ef95ed541947f2fd9d8351bef664426 Mon Sep 17 00:00:00 2001
From: "David Hildenbrand (Arm)" <david@kernel.org>
Date: Wed, 1 Apr 2026 08:15:44 +0000
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
 mm/mprotect.c | 79 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 48 insertions(+), 31 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index c0571445bef7..8d14c05a11a2 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -117,7 +117,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
 }
 
 /* Set nr_ptes number of ptes, starting from idx */
-static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
+static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
 		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
 		int idx, bool set_write, struct mmu_gather *tlb)
 {
@@ -143,7 +143,7 @@ static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long add
  * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
  * that the ptes point to consecutive pages of the same anon large folio.
  */
-static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
+static __always_inline int page_anon_exclusive_sub_batch(int start_idx, int max_len,
 		struct page *first_page, bool expected_anon_exclusive)
 {
 	int idx;
@@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
  * pte of the batch. Therefore, we must individually check all pages and
  * retrieve sub-batches.
  */
-static void commit_anon_folio_batch(struct vm_area_struct *vma,
+static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
 		struct folio *folio, struct page *first_page, unsigned long addr,
 		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
 {
@@ -188,7 +188,7 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma,
 	}
 }
 
-static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
+static __always_inline void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
 		struct folio *folio, struct page *page, unsigned long addr,
 		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
 {
@@ -211,6 +211,41 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
 	commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
 }
 
+static __always_inline void change_present_ptes(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, unsigned long addr,
+		pgprot_t newprot, unsigned long cp_flags,
+		struct folio *folio, struct page *page, pte_t *pte,
+		unsigned int nr_ptes)
+{
+	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	pte_t oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
+	pte_t ptent = pte_modify(oldpte, newprot);
+
+	if (uffd_wp)
+		ptent = pte_mkuffd_wp(ptent);
+	else if (uffd_wp_resolve)
+		ptent = pte_clear_uffd_wp(ptent);
+
+	/*
+	 * In some writable, shared mappings, we might want to catch actual
+	 * write access -- see vma_wants_writenotify().
+	 *
+	 * In all writable, private mappings, we have to properly handle COW.
+	 *
+	 * In both cases, we can sometimes still change PTEs writable and avoid
+	 * the write-fault handler, for example, if a PTE is already dirty and
+	 * no other COW or special handling is required.
+	 */
+	if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
+	    !pte_write(ptent))
+		set_write_prot_commit_flush_ptes(vma, folio, page, addr, pte,
+				oldpte, ptent, nr_ptes, tlb);
+	else
+		prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
+				nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
+}
+
 static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -242,7 +277,6 @@ static long change_pte_range(struct mmu_gather *tlb,
 			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
 			struct folio *folio = NULL;
 			struct page *page;
-			pte_t ptent;
 
 			/* Already in the desired state. */
 			if (prot_numa && pte_protnone(oldpte))
@@ -268,34 +302,17 @@ static long change_pte_range(struct mmu_gather *tlb,
 			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
 
-			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
-			ptent = pte_modify(oldpte, newprot);
-
-			if (uffd_wp)
-				ptent = pte_mkuffd_wp(ptent);
-			else if (uffd_wp_resolve)
-				ptent = pte_clear_uffd_wp(ptent);
-
 			/*
-			 * In some writable, shared mappings, we might want
-			 * to catch actual write access -- see
-			 * vma_wants_writenotify().
-			 *
-			 * In all writable, private mappings, we have to
-			 * properly handle COW.
-			 *
-			 * In both cases, we can sometimes still change PTEs
-			 * writable and avoid the write-fault handler, for
-			 * example, if a PTE is already dirty and no other
-			 * COW or special handling is required.
+			 * Optimize for order-0 folios by optimizing out all
+			 * loops.
 			 */
-			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
-			    !pte_write(ptent))
-				set_write_prot_commit_flush_ptes(vma, folio, page,
-					addr, pte, oldpte, ptent, nr_ptes, tlb);
-			else
-				prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
-					nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
+			if (nr_ptes == 1) {
+				change_present_ptes(tlb, vma, addr, newprot,
+						cp_flags, folio, page, pte, 1);
+			} else {
+				change_present_ptes(tlb, vma, addr, newprot,
+						cp_flags, folio, page, pte, nr_ptes);
+			}
 			pages += nr_ptes;
 		} else if (pte_none(oldpte)) {
 			/*
-- 
2.53.0

--
Cheers,

David

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/2] mm/mprotect: micro-optimization work
  2026-04-01  8:25     ` David Hildenbrand (Arm)
@ 2026-04-01 14:14       ` Pedro Falcato
  0 siblings, 0 replies; 17+ messages in thread
From: Pedro Falcato @ 2026-04-01 14:14 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, Luke Yang, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Dev Jain, jhladky, linux-mm,
	linux-kernel, Nico Pache

On Wed, Apr 01, 2026 at 10:25:40AM +0200, David Hildenbrand (Arm) wrote:
> On 3/30/26 22:06, Andrew Morton wrote:
> > On Mon, 30 Mar 2026 15:55:51 -0400 Luke Yang <luyang@redhat.com> wrote:
> >
> >> Thanks for working on this. I just wanted to share that we've created a
> >> test kernel with your patches and tested on the following CPUs:
> >>
> >> --- aarch64 ---
> >> Ampere Altra
> >> Ampere Altra Max
> >>
> >> --- x86_64 ---
> >> AMD EPYC 7713
> >> AMD EPYC 7351
> >> AMD EPYC 7542
> >> AMD EPYC 7573X
> >> AMD EPYC 7702
> >> AMD EPYC 9754
> >> Intel Xeon Gold 6126
> >> Intel Xeon Gold 6330
> >> Intel Xeon Gold 6530
> >> Intel Xeon Platinum 8351N
> >> Intel Core i7-6820HQ
> >>
> >> --- ppc64le ---
> >> IBM Power 10
> >>
> >> On average, we see improvements ranging from a minimum of 5% to a
> >> maximum of 55%, with most improvements showing around a 25% speed up in
> >> the libmicro/mprot_tw4m micro benchmark.
> >
> > Thanks, that's nice. I've added some of the above into the changelog
> > and I took the liberty of adding your Tested-by: to both patches.
> >
> > fyi, regarding [2/2]: it's unclear to me whether the discussion with
> > David will result in any alterations. If there's something I need to
> > it always helps to lmk ;)
>
> I think we want to get a better understanding of which exact __always_inline
> is really helpful in patch #2, and where to apply the nr_ptes == 1 forced
> optimization.
>
> I updated the microbenchmark I use for fork+unmap etc. to measure
> mprotect as well:
>
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?ref_type=heads
>
> Running some simple tests with order-0 on 1 GiB of memory:
>
> Upstream Linus:
>
> ./pte-mapped-folio-benchmarks 0 write-protect 5
> 0.005779
> ...
> ./pte-mapped-folio-benchmarks 0 write-unprotect 5
> 0.009113
> ...
>
> With Pedro's patch #2:
> $ ./pte-mapped-folio-benchmarks 0 write-protect 5
> 0.003941
> ...
> $ ./pte-mapped-folio-benchmarks 0 write-unprotect 5
> 0.006163
> ...
>
> With the patch below:
>
> $ ./pte-mapped-folio-benchmarks 0 write-protect 5
> 0.003364
>
> $ ./pte-mapped-folio-benchmarks 0 write-unprotect 5
> 0.005729

Hmm. Thanks for the testing. Interesting. I'll give it a shot.

I'll have results and/or a possible v3 by tomorrow, if need be.
Apologies for the slight delay here! :)

--
Pedro

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/2] mm/mprotect: micro-optimization work
  2026-03-30 19:55 ` [PATCH v2 0/2] mm/mprotect: micro-optimization work Luke Yang
  2026-03-30 20:06   ` Andrew Morton
@ 2026-04-01 14:10   ` Pedro Falcato
  2026-04-02 13:55     ` Luke Yang
  1 sibling, 1 reply; 17+ messages in thread
From: Pedro Falcato @ 2026-04-01 14:10 UTC (permalink / raw)
  To: Luke Yang
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, David Hildenbrand, Dev Jain, jhladky, linux-mm,
	linux-kernel, Nico Pache

On Mon, Mar 30, 2026 at 03:55:51PM -0400, Luke Yang wrote:
> Hi Pedro,
>
> Thanks for working on this. I just wanted to share that we've created a
> test kernel with your patches and tested on the following CPUs:
>
> --- aarch64 ---
> Ampere Altra
> Ampere Altra Max
>
> --- x86_64 ---
> AMD EPYC 7713
> AMD EPYC 7351
> AMD EPYC 7542
> AMD EPYC 7573X
> AMD EPYC 7702
> AMD EPYC 9754
> Intel Xeon Gold 6126
> Intel Xeon Gold 6330
> Intel Xeon Gold 6530
> Intel Xeon Platinum 8351N
> Intel Core i7-6820HQ
>
> --- ppc64le ---
> IBM Power 10
>
> On average, we see improvements ranging from a minimum of 5% to a
> maximum of 55%, with most improvements showing around a 25% speed up in
> the libmicro/mprot_tw4m micro benchmark.

Nice! Thanks for the tests. I'm wondering, what CPU saw 5% and what CPU
saw 55%? Or was it just inter-run variance?

--
Pedro

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/2] mm/mprotect: micro-optimization work
  2026-04-01 14:10   ` Pedro Falcato
@ 2026-04-02 13:55     ` Luke Yang
  2026-04-06 14:32       ` Luke Yang
  0 siblings, 1 reply; 17+ messages in thread
From: Luke Yang @ 2026-04-02 13:55 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, David Hildenbrand, Dev Jain, jhladky, linux-mm,
	linux-kernel, Nico Pache

On Wed, Apr 1, 2026 at 10:11 AM Pedro Falcato <pfalcato@suse.de> wrote:
>
> On Mon, Mar 30, 2026 at 03:55:51PM -0400, Luke Yang wrote:
> > Hi Pedro,
> >
> > Thanks for working on this. I just wanted to share that we've created a
> > test kernel with your patches and tested on the following CPUs:
> >
> > --- aarch64 ---
> > Ampere Altra
> > Ampere Altra Max
> >
> > --- x86_64 ---
> > AMD EPYC 7713
> > AMD EPYC 7351
> > AMD EPYC 7542
> > AMD EPYC 7573X
> > AMD EPYC 7702
> > AMD EPYC 9754
> > Intel Xeon Gold 6126
> > Intel Xeon Gold 6330
> > Intel Xeon Gold 6530
> > Intel Xeon Platinum 8351N
> > Intel Core i7-6820HQ
> >
> > --- ppc64le ---
> > IBM Power 10
> >
> > On average, we see improvements ranging from a minimum of 5% to a
> > maximum of 55%, with most improvements showing around a 25% speed up in
> > the libmicro/mprot_tw4m micro benchmark.
>
> Nice! Thanks for the tests. I'm wondering, what CPU saw 5% and what CPU
> saw 55%? Or was it just inter-run variance?
>
> --
> Pedro
>

5% -> Ampere Altra Max
55% -> Ampere Altra

Personally, I can't conclude at the moment if this is just inter-run
variance. However, let me re-run the tests a few times on these two
machines to see if this is consistent.

Luke

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/2] mm/mprotect: micro-optimization work
  2026-04-02 13:55     ` Luke Yang
@ 2026-04-06 14:32       ` Luke Yang
  0 siblings, 0 replies; 17+ messages in thread
From: Luke Yang @ 2026-04-06 14:32 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, David Hildenbrand, Dev Jain, jhladky, linux-mm,
	linux-kernel, Nico Pache

On Thu, Apr 2, 2026 at 9:55 AM Luke Yang <luyang@redhat.com> wrote:
>
> On Wed, Apr 1, 2026 at 10:11 AM Pedro Falcato <pfalcato@suse.de> wrote:
> >
> > On Mon, Mar 30, 2026 at 03:55:51PM -0400, Luke Yang wrote:
> > > Hi Pedro,
> > >
> > > Thanks for working on this. I just wanted to share that we've created a
> > > test kernel with your patches and tested on the following CPUs:
> > >
> > > --- aarch64 ---
> > > Ampere Altra
> > > Ampere Altra Max
> > >
> > > --- x86_64 ---
> > > AMD EPYC 7713
> > > AMD EPYC 7351
> > > AMD EPYC 7542
> > > AMD EPYC 7573X
> > > AMD EPYC 7702
> > > AMD EPYC 9754
> > > Intel Xeon Gold 6126
> > > Intel Xeon Gold 6330
> > > Intel Xeon Gold 6530
> > > Intel Xeon Platinum 8351N
> > > Intel Core i7-6820HQ
> > >
> > > --- ppc64le ---
> > > IBM Power 10
> > >
> > > On average, we see improvements ranging from a minimum of 5% to a
> > > maximum of 55%, with most improvements showing around a 25% speed up in
> > > the libmicro/mprot_tw4m micro benchmark.
> >
> > Nice! Thanks for the tests. I'm wondering, what CPU saw 5% and what CPU
> > saw 55%? Or was it just inter-run variance?
> >
> > --
> > Pedro
>
> 5% -> Ampere Altra Max
> 55% -> Ampere Altra
>
> Personally, I can't conclude at the moment if this is just inter-run
> variance. However, let me re-run the tests a few times on these two
> machines to see if this is consistent.
>
> Luke

Ampere Altra Max seems to consistently report ~5%. Ampere Altra seems
to exhibit more test variance. I now have some test runs that report a
5% or 10% gain.

Luke

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2026-04-06 14:32 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-24 15:43 [PATCH v2 0/2] mm/mprotect: micro-optimization work Pedro Falcato
2026-03-24 15:43 ` [PATCH v2 1/2] mm/mprotect: move softleaf code out of the main function Pedro Falcato
2026-03-24 20:12   ` David Hildenbrand (Arm)
2026-03-24 15:43 ` [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions Pedro Falcato
2026-03-24 20:18   ` David Hildenbrand (Arm)
2026-03-25 11:37     ` Pedro Falcato
2026-03-30 15:16       ` David Hildenbrand (Arm)
2026-04-02  0:09         ` Andrew Morton
2026-04-02  3:44           ` Andrew Morton
2026-04-02  7:11             ` David Hildenbrand (Arm)
2026-03-30 19:55 ` [PATCH v2 0/2] mm/mprotect: micro-optimization work Luke Yang
2026-03-30 20:06   ` Andrew Morton
2026-04-01  8:25     ` David Hildenbrand (Arm)
2026-04-01 14:14       ` Pedro Falcato
2026-04-01 14:10   ` Pedro Falcato
2026-04-02 13:55     ` Luke Yang
2026-04-06 14:32       ` Luke Yang