[PATCH 0/4] mm/mprotect: micro-optimization work

public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed

* [PATCH 0/4] mm/mprotect: micro-optimization work
@ 2026-03-19 18:31 Pedro Falcato
  2026-03-19 18:31 ` [PATCH 1/4] mm/mprotect: encourage inlining with __always_inline Pedro Falcato
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Pedro Falcato @ 2026-03-19 18:31 UTC (permalink / raw)
  To: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes
  Cc: Pedro Falcato, Vlastimil Babka, Jann Horn, David Hildenbrand,
	Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

After a long session of performance-cat herding, here's the first version
I am relatively ok with.

Micro-optimize the change_protection functionality and the
change_pte_range() routine. This set of functions works in an incredibly
tight loop, and even small inefficiencies are incredibly evident when spun
hundreds, thousands or hundreds of thousands of times.

There was an attempt to keep the batching functionality as much as possible,
which introduced some part of the slowness, but not all of it. Removing it
for !arm64 architectures would speed mprotect() up even further, but could
easily pessimize cases where large folios are mapped (which is not as rare
as it seems, particularly when it comes to the page cache these days).

The micro-benchmark used for the tests was [0] (usable using google/benchmark
and g++ -O2 -lbenchmark repro.cpp)

This resulted in the following (first entry is baseline):

---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
mprotect_bench      85967 ns        85967 ns         6935
mprotect_bench      82402 ns        82402 ns         6745
mprotect_bench      86776 ns        86776 ns         8100
mprotect_bench      86463 ns        86463 ns         8087
mprotect_bench      73374 ns        73373 ns         9602


After the patchset we can observe a 14% speedup in mprotect. Wonderful
for the elusive mprotect-based workloads!

Testing & more ideas welcome. I suspect there is plenty of improvement possible
but it would require more time than what I have on my hands right now. The
entire inlined function (which inlines into change_protection()) is gigantic
- I'm not surprised this is so finnicky.


[0]: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933
Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/

Cc: Vlastimil Babka <vbabka@kernel.org> 
Cc: Jann Horn <jannh@google.com> 
Cc: David Hildenbrand <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com> 
Cc: Luke Yang <luyang@redhat.com>
Cc: jhladky@redhat.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

Pedro Falcato (4):
  mm/mprotect: encourage inlining with __always_inline
  mm/mprotect: move softleaf code out of the main function
  mm/mprotect: un-inline folio_pte_batch_flags()
  mm/mprotect: special-case small folios when applying write permissions

 mm/mprotect.c | 158 +++++++++++++++++++++++++++-----------------------
 1 file changed, 85 insertions(+), 73 deletions(-)

-- 
2.53.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/4] mm/mprotect: encourage inlining with __always_inline
  2026-03-19 18:31 [PATCH 0/4] mm/mprotect: micro-optimization work Pedro Falcato
@ 2026-03-19 18:31 ` Pedro Falcato
  2026-03-19 18:59   ` Lorenzo Stoakes (Oracle)
  2026-03-19 21:28   ` David Hildenbrand (Arm)
  2026-03-19 18:31 ` [PATCH 2/4] mm/mprotect: move softleaf code out of the main function Pedro Falcato
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 32+ messages in thread
From: Pedro Falcato @ 2026-03-19 18:31 UTC (permalink / raw)
  To: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes
  Cc: Pedro Falcato, Vlastimil Babka, Jann Horn, David Hildenbrand,
	Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

Encourage the compiler to inline batch PTE logic and resolve constant
branches by adding __always_inline strategically.

Signed-off-by: Pedro Falcato <pfalcato@suse.de>
---
 mm/mprotect.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9681f055b9fc..1bd0d4aa07c2 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	return can_change_shared_pte_writable(vma, pte);
 }
 
-static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
+static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
 				    pte_t pte, int max_nr_ptes, fpb_t flags)
 {
 	/* No underlying folio, so cannot batch */
@@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
 }
 
 /* Set nr_ptes number of ptes, starting from idx */
-static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
-		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
-		int idx, bool set_write, struct mmu_gather *tlb)
+static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
+		int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb)
 {
 	/*
 	 * Advance the position in the batch by idx; note that if idx > 0,
@@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
  * pte of the batch. Therefore, we must individually check all pages and
  * retrieve sub-batches.
  */
-static void commit_anon_folio_batch(struct vm_area_struct *vma,
+static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
 		struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
 		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
 {
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm/mprotect: encourage inlining with __always_inline
  2026-03-19 18:31 ` [PATCH 1/4] mm/mprotect: encourage inlining with __always_inline Pedro Falcato
@ 2026-03-19 18:59   ` Lorenzo Stoakes (Oracle)
  2026-03-19 19:00     ` Lorenzo Stoakes (Oracle)
  2026-03-19 21:28   ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 18:59 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	David Hildenbrand, Dev Jain, Luke Yang, jhladky, linux-mm,
	linux-kernel

On Thu, Mar 19, 2026 at 06:31:05PM +0000, Pedro Falcato wrote:
> Encourage the compiler to inline batch PTE logic and resolve constant
> branches by adding __always_inline strategically.

>
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>

Does this vary by compiler/arch that much?

I wonder about how much ends up on the stack here too given the HUGE number of
arguments passed around but I guess you'd be pushing and popping some even if
these weren't inlined.

I wonder if we wouldn't want to carefully check different arches for this
though!

> ---
>  mm/mprotect.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 9681f055b9fc..1bd0d4aa07c2 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return can_change_shared_pte_writable(vma, pte);
>  }
>
> -static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> +static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>  				    pte_t pte, int max_nr_ptes, fpb_t flags)
>  {
>  	/* No underlying folio, so cannot batch */
> @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>  }
>
>  /* Set nr_ptes number of ptes, starting from idx */
> -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
> -		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
> -		int idx, bool set_write, struct mmu_gather *tlb)
> +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
> +		int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb)
>  {
>  	/*
>  	 * Advance the position in the batch by idx; note that if idx > 0,
> @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
>   * pte of the batch. Therefore, we must individually check all pages and
>   * retrieve sub-batches.
>   */
> -static void commit_anon_folio_batch(struct vm_area_struct *vma,
> +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
>  		struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
>  		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>  {
> --
> 2.53.0
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm/mprotect: encourage inlining with __always_inline
  2026-03-19 18:59   ` Lorenzo Stoakes (Oracle)
@ 2026-03-19 19:00     ` Lorenzo Stoakes (Oracle)
  0 siblings, 0 replies; 32+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 19:00 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	David Hildenbrand, Dev Jain, Luke Yang, jhladky, linux-mm,
	linux-kernel

On Thu, Mar 19, 2026 at 06:59:37PM +0000, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 19, 2026 at 06:31:05PM +0000, Pedro Falcato wrote:
> > Encourage the compiler to inline batch PTE logic and resolve constant
> > branches by adding __always_inline strategically.
>
> >
> > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
>
> Does this vary by compiler/arch that much?
>
> I wonder about how much ends up on the stack here too given the HUGE number of
> arguments passed around but I guess you'd be pushing and popping some even if
> these weren't inlined.
>
> I wonder if we wouldn't want to carefully check different arches for this
> though!

BTW have previously seen how compilers can be suuuuper picky as to what inlines
or not based on hueristics so can see why doing this would move the needle if we
were sure it'd universally help/at least not cause issues.

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm/mprotect: encourage inlining with __always_inline
  2026-03-19 18:31 ` [PATCH 1/4] mm/mprotect: encourage inlining with __always_inline Pedro Falcato
  2026-03-19 18:59   ` Lorenzo Stoakes (Oracle)
@ 2026-03-19 21:28   ` David Hildenbrand (Arm)
  2026-03-20  9:59     ` Pedro Falcato
  1 sibling, 1 reply; 32+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-19 21:28 UTC (permalink / raw)
  To: Pedro Falcato, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes
  Cc: Vlastimil Babka, Jann Horn, Dev Jain, Luke Yang, jhladky,
	linux-mm, linux-kernel

On 3/19/26 19:31, Pedro Falcato wrote:
> Encourage the compiler to inline batch PTE logic and resolve constant
> branches by adding __always_inline strategically.
> 
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> ---
>  mm/mprotect.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 9681f055b9fc..1bd0d4aa07c2 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return can_change_shared_pte_writable(vma, pte);
>  }
>  
> -static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> +static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>  				    pte_t pte, int max_nr_ptes, fpb_t flags)
>  {
>  	/* No underlying folio, so cannot batch */
> @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>  }
>  
>  /* Set nr_ptes number of ptes, starting from idx */
> -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
> -		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
> -		int idx, bool set_write, struct mmu_gather *tlb)
> +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
> +		int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb)
>  {
>  	/*
>  	 * Advance the position in the batch by idx; note that if idx > 0,
> @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
>   * pte of the batch. Therefore, we must individually check all pages and
>   * retrieve sub-batches.
>   */
> -static void commit_anon_folio_batch(struct vm_area_struct *vma,
> +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
>  		struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
>  		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>  {

From my micro-optimization work on zapping and fork, I learned that
these batching functions are best optimized for order-0 page by
explicitly calling them from the code with "nr_ptes == 1" and then
force-inlining them. nr_ptes and all loops will essentially be optimized
out.

With no such explicit constants, is there really a real benefit to be
had here?

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm/mprotect: encourage inlining with __always_inline
  2026-03-19 21:28   ` David Hildenbrand (Arm)
@ 2026-03-20  9:59     ` Pedro Falcato
  2026-03-20 10:08       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 32+ messages in thread
From: Pedro Falcato @ 2026-03-20  9:59 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Lorenzo Stoakes
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

On Thu, Mar 19, 2026 at 10:28:47PM +0100, David Hildenbrand (Arm) wrote:
> On 3/19/26 19:31, Pedro Falcato wrote:
> > Encourage the compiler to inline batch PTE logic and resolve constant
> > branches by adding __always_inline strategically.
> > 
> > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > ---
> >  mm/mprotect.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 9681f055b9fc..1bd0d4aa07c2 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> >  	return can_change_shared_pte_writable(vma, pte);
> >  }
> >  
> > -static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> > +static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> >  				    pte_t pte, int max_nr_ptes, fpb_t flags)
> >  {
> >  	/* No underlying folio, so cannot batch */
> > @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> >  }
> >  
> >  /* Set nr_ptes number of ptes, starting from idx */
> > -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
> > -		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
> > -		int idx, bool set_write, struct mmu_gather *tlb)
> > +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma,
> > +		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
> > +		int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb)
> >  {
> >  	/*
> >  	 * Advance the position in the batch by idx; note that if idx > 0,
> > @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
> >   * pte of the batch. Therefore, we must individually check all pages and
> >   * retrieve sub-batches.
> >   */
> > -static void commit_anon_folio_batch(struct vm_area_struct *vma,
> > +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
> >  		struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
> >  		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
> >  {
> 
> From my micro-optimization work on zapping and fork, I learned that
> these batching functions are best optimized for order-0 page by
> explicitly calling them from the code with "nr_ptes == 1" and then
> force-inlining them. nr_ptes and all loops will essentially be optimized
> out.
> 
> With no such explicit constants, is there really a real benefit to be
> had here?

Per my measurements, I could measure a real speedup here. Of course things
may heavily depend on the microarchitecture you use. I want to note that
these three functions are part of the hot loop and thus we definitely want
them inlined. Particularly if we start special-casing stuff. You can cut
down _a lot_ of code if you simply tell it "yeah don't bother you're looking
at 1 pte only".

Of course a lot of this is just codegen fengshui but I tried sticking to
good fundamentals and inlining things that matter, while noinlining
things that aren't frequent. As-is the compiler seems to make poor
inlining decisions on its own (and basically every static function is
inlined, except e.g prot_commit_flush_ptes, FOR SOME REASON).

-- 
Pedro


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm/mprotect: encourage inlining with __always_inline
  2026-03-20  9:59     ` Pedro Falcato
@ 2026-03-20 10:08       ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-20 10:08 UTC (permalink / raw)
  To: Pedro Falcato, Lorenzo Stoakes
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

On 3/20/26 10:59, Pedro Falcato wrote:
> On Thu, Mar 19, 2026 at 10:28:47PM +0100, David Hildenbrand (Arm) wrote:
>> On 3/19/26 19:31, Pedro Falcato wrote:
>>> Encourage the compiler to inline batch PTE logic and resolve constant
>>> branches by adding __always_inline strategically.
>>>
>>> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
>>> ---
>>>  mm/mprotect.c | 10 +++++-----
>>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 9681f055b9fc..1bd0d4aa07c2 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>  	return can_change_shared_pte_writable(vma, pte);
>>>  }
>>>  
>>> -static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>>> +static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>>>  				    pte_t pte, int max_nr_ptes, fpb_t flags)
>>>  {
>>>  	/* No underlying folio, so cannot batch */
>>> @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>>>  }
>>>  
>>>  /* Set nr_ptes number of ptes, starting from idx */
>>> -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
>>> -		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
>>> -		int idx, bool set_write, struct mmu_gather *tlb)
>>> +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma,
>>> +		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
>>> +		int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb)
>>>  {
>>>  	/*
>>>  	 * Advance the position in the batch by idx; note that if idx > 0,
>>> @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
>>>   * pte of the batch. Therefore, we must individually check all pages and
>>>   * retrieve sub-batches.
>>>   */
>>> -static void commit_anon_folio_batch(struct vm_area_struct *vma,
>>> +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
>>>  		struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
>>>  		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>>>  {
>>
>> From my micro-optimization work on zapping and fork, I learned that
>> these batching functions are best optimized for order-0 page by
>> explicitly calling them from the code with "nr_ptes == 1" and then
>> force-inlining them. nr_ptes and all loops will essentially be optimized
>> out.
>>
>> With no such explicit constants, is there really a real benefit to be
>> had here?
> 
> Per my measurements, I could measure a real speedup here. Of course things
> may heavily depend on the microarchitecture you use. I want to note that
> these three functions are part of the hot loop and thus we definitely want
> them inlined. Particularly if we start special-casing stuff. You can cut
> down _a lot_ of code if you simply tell it "yeah don't bother you're looking
> at 1 pte only".

That's why I think this change is a lot more valuable when squashing
patch #4.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 2/4] mm/mprotect: move softleaf code out of the main function
  2026-03-19 18:31 [PATCH 0/4] mm/mprotect: micro-optimization work Pedro Falcato
  2026-03-19 18:31 ` [PATCH 1/4] mm/mprotect: encourage inlining with __always_inline Pedro Falcato
@ 2026-03-19 18:31 ` Pedro Falcato
  2026-03-19 19:06   ` Lorenzo Stoakes (Oracle)
  2026-03-19 21:33   ` David Hildenbrand (Arm)
  2026-03-19 18:31 ` [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags() Pedro Falcato
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 32+ messages in thread
From: Pedro Falcato @ 2026-03-19 18:31 UTC (permalink / raw)
  To: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes
  Cc: Pedro Falcato, Vlastimil Babka, Jann Horn, David Hildenbrand,
	Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

Move softleaf change_pte_range code into a separate function. This makes
the change_pte_range() function (or where it inlines) a good bit
smaller. Plus it lessens cognitive load when reading through the
function.

Signed-off-by: Pedro Falcato <pfalcato@suse.de>
---
 mm/mprotect.c | 128 +++++++++++++++++++++++++++-----------------------
 1 file changed, 68 insertions(+), 60 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1bd0d4aa07c2..8d4fa38a8a26 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -211,6 +211,73 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
 	commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
 }
 
+static noinline long change_pte_softleaf(struct vm_area_struct *vma,
+	unsigned long addr, pte_t *pte, pte_t oldpte, unsigned long cp_flags)
+{
+	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	softleaf_t entry = softleaf_from_pte(oldpte);
+	pte_t newpte;
+
+	if (softleaf_is_migration_write(entry)) {
+		const struct folio *folio = softleaf_to_folio(entry);
+
+		/*
+		 * A protection check is difficult so
+		 * just be safe and disable write
+		 */
+		if (folio_test_anon(folio))
+			entry = make_readable_exclusive_migration_entry(
+					     swp_offset(entry));
+		else
+			entry = make_readable_migration_entry(swp_offset(entry));
+		newpte = swp_entry_to_pte(entry);
+		if (pte_swp_soft_dirty(oldpte))
+			newpte = pte_swp_mksoft_dirty(newpte);
+	} else if (softleaf_is_device_private_write(entry)) {
+		/*
+		 * We do not preserve soft-dirtiness. See
+		 * copy_nonpresent_pte() for explanation.
+		 */
+		entry = make_readable_device_private_entry(
+					swp_offset(entry));
+		newpte = swp_entry_to_pte(entry);
+		if (pte_swp_uffd_wp(oldpte))
+			newpte = pte_swp_mkuffd_wp(newpte);
+	} else if (softleaf_is_marker(entry)) {
+		/*
+		 * Ignore error swap entries unconditionally,
+		 * because any access should sigbus/sigsegv
+		 * anyway.
+		 */
+		if (softleaf_is_poison_marker(entry) ||
+		    softleaf_is_guard_marker(entry))
+			return 0;
+		/*
+		 * If this is uffd-wp pte marker and we'd like
+		 * to unprotect it, drop it; the next page
+		 * fault will trigger without uffd trapping.
+		 */
+		if (uffd_wp_resolve) {
+			pte_clear(vma->vm_mm, addr, pte);
+			return 1;
+		}
+	} else {
+		newpte = oldpte;
+	}
+
+	if (uffd_wp)
+		newpte = pte_swp_mkuffd_wp(newpte);
+	else if (uffd_wp_resolve)
+		newpte = pte_swp_clear_uffd_wp(newpte);
+
+	if (!pte_same(oldpte, newpte)) {
+		set_pte_at(vma->vm_mm, addr, pte, newpte);
+		return 1;
+	}
+	return 0;
+}
+
 static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -317,66 +384,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 				pages++;
 			}
 		} else  {
-			softleaf_t entry = softleaf_from_pte(oldpte);
-			pte_t newpte;
-
-			if (softleaf_is_migration_write(entry)) {
-				const struct folio *folio = softleaf_to_folio(entry);
-
-				/*
-				 * A protection check is difficult so
-				 * just be safe and disable write
-				 */
-				if (folio_test_anon(folio))
-					entry = make_readable_exclusive_migration_entry(
-							     swp_offset(entry));
-				else
-					entry = make_readable_migration_entry(swp_offset(entry));
-				newpte = swp_entry_to_pte(entry);
-				if (pte_swp_soft_dirty(oldpte))
-					newpte = pte_swp_mksoft_dirty(newpte);
-			} else if (softleaf_is_device_private_write(entry)) {
-				/*
-				 * We do not preserve soft-dirtiness. See
-				 * copy_nonpresent_pte() for explanation.
-				 */
-				entry = make_readable_device_private_entry(
-							swp_offset(entry));
-				newpte = swp_entry_to_pte(entry);
-				if (pte_swp_uffd_wp(oldpte))
-					newpte = pte_swp_mkuffd_wp(newpte);
-			} else if (softleaf_is_marker(entry)) {
-				/*
-				 * Ignore error swap entries unconditionally,
-				 * because any access should sigbus/sigsegv
-				 * anyway.
-				 */
-				if (softleaf_is_poison_marker(entry) ||
-				    softleaf_is_guard_marker(entry))
-					continue;
-				/*
-				 * If this is uffd-wp pte marker and we'd like
-				 * to unprotect it, drop it; the next page
-				 * fault will trigger without uffd trapping.
-				 */
-				if (uffd_wp_resolve) {
-					pte_clear(vma->vm_mm, addr, pte);
-					pages++;
-				}
-				continue;
-			} else {
-				newpte = oldpte;
-			}
-
-			if (uffd_wp)
-				newpte = pte_swp_mkuffd_wp(newpte);
-			else if (uffd_wp_resolve)
-				newpte = pte_swp_clear_uffd_wp(newpte);
-
-			if (!pte_same(oldpte, newpte)) {
-				set_pte_at(vma->vm_mm, addr, pte, newpte);
-				pages++;
-			}
+			pages += change_pte_softleaf(vma, addr, pte, oldpte, cp_flags);
 		}
 	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
 	lazy_mmu_mode_disable();
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] mm/mprotect: move softleaf code out of the main function
  2026-03-19 18:31 ` [PATCH 2/4] mm/mprotect: move softleaf code out of the main function Pedro Falcato
@ 2026-03-19 19:06   ` Lorenzo Stoakes (Oracle)
  2026-03-19 21:33   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 32+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 19:06 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	David Hildenbrand, Dev Jain, Luke Yang, jhladky, linux-mm,
	linux-kernel

On Thu, Mar 19, 2026 at 06:31:06PM +0000, Pedro Falcato wrote:
> Move softleaf change_pte_range code into a separate function. This makes
> the change_pte_range() function (or where it inlines) a good bit
> smaller. Plus it lessens cognitive load when reading through the
> function.
>
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>

Honestly I like this as a refactoring, the noinline notwithstanding, so:

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

> ---
>  mm/mprotect.c | 128 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 68 insertions(+), 60 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 1bd0d4aa07c2..8d4fa38a8a26 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -211,6 +211,73 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
>  	commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
>  }
>
> +static noinline long change_pte_softleaf(struct vm_area_struct *vma,
> +	unsigned long addr, pte_t *pte, pte_t oldpte, unsigned long cp_flags)
> +{
> +	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> +	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	softleaf_t entry = softleaf_from_pte(oldpte);
> +	pte_t newpte;
> +
> +	if (softleaf_is_migration_write(entry)) {
> +		const struct folio *folio = softleaf_to_folio(entry);
> +
> +		/*
> +		 * A protection check is difficult so
> +		 * just be safe and disable write
> +		 */
> +		if (folio_test_anon(folio))
> +			entry = make_readable_exclusive_migration_entry(
> +					     swp_offset(entry));
> +		else
> +			entry = make_readable_migration_entry(swp_offset(entry));
> +		newpte = swp_entry_to_pte(entry);
> +		if (pte_swp_soft_dirty(oldpte))
> +			newpte = pte_swp_mksoft_dirty(newpte);
> +	} else if (softleaf_is_device_private_write(entry)) {
> +		/*
> +		 * We do not preserve soft-dirtiness. See
> +		 * copy_nonpresent_pte() for explanation.
> +		 */
> +		entry = make_readable_device_private_entry(
> +					swp_offset(entry));
> +		newpte = swp_entry_to_pte(entry);
> +		if (pte_swp_uffd_wp(oldpte))
> +			newpte = pte_swp_mkuffd_wp(newpte);
> +	} else if (softleaf_is_marker(entry)) {
> +		/*
> +		 * Ignore error swap entries unconditionally,
> +		 * because any access should sigbus/sigsegv
> +		 * anyway.
> +		 */
> +		if (softleaf_is_poison_marker(entry) ||
> +		    softleaf_is_guard_marker(entry))
> +			return 0;

This just continues in the original:

				if (softleaf_is_poison_marker(entry) ||
				    softleaf_is_guard_marker(entry))
					continue;
So this is correct.


> +		/*
> +		 * If this is uffd-wp pte marker and we'd like
> +		 * to unprotect it, drop it; the next page
> +		 * fault will trigger without uffd trapping.
> +		 */
> +		if (uffd_wp_resolve) {
> +			pte_clear(vma->vm_mm, addr, pte);
> +			return 1;

This incrmements pages and continues in the orignal:

				if (uffd_wp_resolve) {
					pte_clear(vma->vm_mm, addr, pte);
					pages++;
				}
				continue;

So this is correct.

> +		}
> +	} else {
> +		newpte = oldpte;
> +	}
> +
> +	if (uffd_wp)
> +		newpte = pte_swp_mkuffd_wp(newpte);
> +	else if (uffd_wp_resolve)
> +		newpte = pte_swp_clear_uffd_wp(newpte);
> +
> +	if (!pte_same(oldpte, newpte)) {
> +		set_pte_at(vma->vm_mm, addr, pte, newpte);
> +		return 1;

This increments pages and is at the end of the loop in the original:

			if (!pte_same(oldpte, newpte)) {
				set_pte_at(vma->vm_mm, addr, pte, newpte);
				pages++;
			}

So this is correct.

> +	}
> +	return 0;
> +}
> +
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -317,66 +384,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				pages++;
>  			}
>  		} else  {
> -			softleaf_t entry = softleaf_from_pte(oldpte);
> -			pte_t newpte;
> -
> -			if (softleaf_is_migration_write(entry)) {
> -				const struct folio *folio = softleaf_to_folio(entry);
> -
> -				/*
> -				 * A protection check is difficult so
> -				 * just be safe and disable write
> -				 */
> -				if (folio_test_anon(folio))
> -					entry = make_readable_exclusive_migration_entry(
> -							     swp_offset(entry));
> -				else
> -					entry = make_readable_migration_entry(swp_offset(entry));
> -				newpte = swp_entry_to_pte(entry);
> -				if (pte_swp_soft_dirty(oldpte))
> -					newpte = pte_swp_mksoft_dirty(newpte);
> -			} else if (softleaf_is_device_private_write(entry)) {
> -				/*
> -				 * We do not preserve soft-dirtiness. See
> -				 * copy_nonpresent_pte() for explanation.
> -				 */
> -				entry = make_readable_device_private_entry(
> -							swp_offset(entry));
> -				newpte = swp_entry_to_pte(entry);
> -				if (pte_swp_uffd_wp(oldpte))
> -					newpte = pte_swp_mkuffd_wp(newpte);
> -			} else if (softleaf_is_marker(entry)) {
> -				/*
> -				 * Ignore error swap entries unconditionally,
> -				 * because any access should sigbus/sigsegv
> -				 * anyway.
> -				 */
> -				if (softleaf_is_poison_marker(entry) ||
> -				    softleaf_is_guard_marker(entry))
> -					continue;
> -				/*
> -				 * If this is uffd-wp pte marker and we'd like
> -				 * to unprotect it, drop it; the next page
> -				 * fault will trigger without uffd trapping.
> -				 */
> -				if (uffd_wp_resolve) {
> -					pte_clear(vma->vm_mm, addr, pte);
> -					pages++;
> -				}
> -				continue;
> -			} else {
> -				newpte = oldpte;
> -			}
> -
> -			if (uffd_wp)
> -				newpte = pte_swp_mkuffd_wp(newpte);
> -			else if (uffd_wp_resolve)
> -				newpte = pte_swp_clear_uffd_wp(newpte);
> -
> -			if (!pte_same(oldpte, newpte)) {
> -				set_pte_at(vma->vm_mm, addr, pte, newpte);
> -				pages++;
> -			}
> +			pages += change_pte_softleaf(vma, addr, pte, oldpte, cp_flags);
>  		}
>  	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
>  	lazy_mmu_mode_disable();
> --
> 2.53.0
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] mm/mprotect: move softleaf code out of the main function
  2026-03-19 18:31 ` [PATCH 2/4] mm/mprotect: move softleaf code out of the main function Pedro Falcato
  2026-03-19 19:06   ` Lorenzo Stoakes (Oracle)
@ 2026-03-19 21:33   ` David Hildenbrand (Arm)
  2026-03-20 10:04     ` Pedro Falcato
  1 sibling, 1 reply; 32+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-19 21:33 UTC (permalink / raw)
  To: Pedro Falcato, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes
  Cc: Vlastimil Babka, Jann Horn, Dev Jain, Luke Yang, jhladky,
	linux-mm, linux-kernel

On 3/19/26 19:31, Pedro Falcato wrote:
> Move softleaf change_pte_range code into a separate function. This makes
> the change_pte_range() function (or where it inlines) a good bit
> smaller. Plus it lessens cognitive load when reading through the
> function.
> 
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> ---
>  mm/mprotect.c | 128 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 68 insertions(+), 60 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 1bd0d4aa07c2..8d4fa38a8a26 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -211,6 +211,73 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
>  	commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
>  }
>  
> +static noinline long change_pte_softleaf(struct vm_area_struct *vma,

Why the noinline? This sounds like something that works good on some
CPUs and bad on others, no?

I like the cleanup, but the noinline really is odd.

> +	unsigned long addr, pte_t *pte, pte_t oldpte, unsigned long cp_flags)

I'd call that "change_softleaf_pte"

> +{
> +	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> +	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

both can be const.

> +	softleaf_t entry = softleaf_from_pte(oldpte);
> +	pte_t newpte;
> +

[...]

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] mm/mprotect: move softleaf code out of the main function
  2026-03-19 21:33   ` David Hildenbrand (Arm)
@ 2026-03-20 10:04     ` Pedro Falcato
  2026-03-20 10:07       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 32+ messages in thread
From: Pedro Falcato @ 2026-03-20 10:04 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

On Thu, Mar 19, 2026 at 10:33:47PM +0100, David Hildenbrand (Arm) wrote:
> On 3/19/26 19:31, Pedro Falcato wrote:
> > Move softleaf change_pte_range code into a separate function. This makes
> > the change_pte_range() function (or where it inlines) a good bit
> > smaller. Plus it lessens cognitive load when reading through the
> > function.
> > 
> > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > ---
> >  mm/mprotect.c | 128 +++++++++++++++++++++++++++-----------------------
> >  1 file changed, 68 insertions(+), 60 deletions(-)
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 1bd0d4aa07c2..8d4fa38a8a26 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -211,6 +211,73 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
> >  	commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
> >  }
> >  
> > +static noinline long change_pte_softleaf(struct vm_area_struct *vma,
> 
> Why the noinline? This sounds like something that works good on some
> CPUs and bad on others, no?
>

If you don't like the noinline I can always remove it, but see my other email
for the justification for most code movement here. In this case, I don't see
how softleaf is the common case (at all) and it's a sizeable amount of code.

Note that the results don't show improvement here (in fact, the opposite),
but we are also spinning on the mprotect() system call, so it's hard for
the icache, branch predictors not to be primed.

> I like the cleanup, but the noinline really is odd.
> 
> > +	unsigned long addr, pte_t *pte, pte_t oldpte, unsigned long cp_flags)
> 
> I'd call that "change_softleaf_pte"
> 
> > +{
> > +	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> > +	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> 
> both can be const.

ACK, will do!

-- 
Pedro


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] mm/mprotect: move softleaf code out of the main function
  2026-03-20 10:04     ` Pedro Falcato
@ 2026-03-20 10:07       ` David Hildenbrand (Arm)
  2026-03-20 10:54         ` Lorenzo Stoakes (Oracle)
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-20 10:07 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

On 3/20/26 11:04, Pedro Falcato wrote:
> On Thu, Mar 19, 2026 at 10:33:47PM +0100, David Hildenbrand (Arm) wrote:
>> On 3/19/26 19:31, Pedro Falcato wrote:
>>> Move softleaf change_pte_range code into a separate function. This makes
>>> the change_pte_range() function (or where it inlines) a good bit
>>> smaller. Plus it lessens cognitive load when reading through the
>>> function.
>>>
>>> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
>>> ---
>>>  mm/mprotect.c | 128 +++++++++++++++++++++++++++-----------------------
>>>  1 file changed, 68 insertions(+), 60 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 1bd0d4aa07c2..8d4fa38a8a26 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -211,6 +211,73 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
>>>  	commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
>>>  }
>>>  
>>> +static noinline long change_pte_softleaf(struct vm_area_struct *vma,
>>
>> Why the noinline? This sounds like something that works good on some
>> CPUs and bad on others, no?
>>
> 
> If you don't like the noinline I can always remove it,

Yes, please. It's easier to argue about __always_inline and constant
propagation than "this code is too scary big for my CPU so I better do
an expensive function call if the code is actually needed".

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] mm/mprotect: move softleaf code out of the main function
  2026-03-20 10:07       ` David Hildenbrand (Arm)
@ 2026-03-20 10:54         ` Lorenzo Stoakes (Oracle)
  0 siblings, 0 replies; 32+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-20 10:54 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Pedro Falcato, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

On Fri, Mar 20, 2026 at 11:07:26AM +0100, David Hildenbrand (Arm) wrote:
> On 3/20/26 11:04, Pedro Falcato wrote:
> > On Thu, Mar 19, 2026 at 10:33:47PM +0100, David Hildenbrand (Arm) wrote:
> >> On 3/19/26 19:31, Pedro Falcato wrote:
> >>> Move softleaf change_pte_range code into a separate function. This makes
> >>> the change_pte_range() function (or where it inlines) a good bit
> >>> smaller. Plus it lessens cognitive load when reading through the
> >>> function.
> >>>
> >>> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> >>> ---
> >>>  mm/mprotect.c | 128 +++++++++++++++++++++++++++-----------------------
> >>>  1 file changed, 68 insertions(+), 60 deletions(-)
> >>>
> >>> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >>> index 1bd0d4aa07c2..8d4fa38a8a26 100644
> >>> --- a/mm/mprotect.c
> >>> +++ b/mm/mprotect.c
> >>> @@ -211,6 +211,73 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
> >>>  	commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
> >>>  }
> >>>
> >>> +static noinline long change_pte_softleaf(struct vm_area_struct *vma,
> >>
> >> Why the noinline? This sounds like something that works good on some
> >> CPUs and bad on others, no?
> >>
> >
> > If you don't like the noinline I can always remove it,
>
> Yes, please. It's easier to argue about __always_inline and constant
> propagation than "this code is too scary big for my CPU so I better do
> an expensive function call if the code is actually needed".

It would make this much more acceptable as a change in general for sure, as it
is already good from a code readability point of view.

Sometimes that and perf align nicely it seems... :)

I do worry about some of these things that might be fine on x86-64, but will
somehow be a total problem on other real architectures (i.e. 64 bit).

>
> --
> Cheers,
>
> David

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()
  2026-03-19 18:31 [PATCH 0/4] mm/mprotect: micro-optimization work Pedro Falcato
  2026-03-19 18:31 ` [PATCH 1/4] mm/mprotect: encourage inlining with __always_inline Pedro Falcato
  2026-03-19 18:31 ` [PATCH 2/4] mm/mprotect: move softleaf code out of the main function Pedro Falcato
@ 2026-03-19 18:31 ` Pedro Falcato
  2026-03-19 19:14   ` Lorenzo Stoakes (Oracle)
  2026-03-19 18:31 ` [PATCH 4/4] mm/mprotect: special-case small folios when applying write permissions Pedro Falcato
  2026-03-20  2:42 ` [PATCH 0/4] mm/mprotect: micro-optimization work Andrew Morton
  4 siblings, 1 reply; 32+ messages in thread
From: Pedro Falcato @ 2026-03-19 18:31 UTC (permalink / raw)
  To: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes
  Cc: Pedro Falcato, Vlastimil Babka, Jann Horn, David Hildenbrand,
	Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

Hoist some previous, important to be inlined checks to the main loop
and noinline the entirety of folio_pte_batch_flags(). The loop itself is
quite large and whether it is inlined or not should not matter, as we
are ideally dealing with larger orders.

Signed-off-by: Pedro Falcato <pfalcato@suse.de>
---
 mm/mprotect.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8d4fa38a8a26..aa845f5bf14d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -103,16 +103,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	return can_change_shared_pte_writable(vma, pte);
 }
 
-static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
+static noinline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
 				    pte_t pte, int max_nr_ptes, fpb_t flags)
 {
-	/* No underlying folio, so cannot batch */
-	if (!folio)
-		return 1;
-
-	if (!folio_test_large(folio))
-		return 1;
-
 	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
 }
 
@@ -333,7 +326,12 @@ static long change_pte_range(struct mmu_gather *tlb,
 				continue;
 			}
 
-			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
+			/* No underlying folio (or not large), so cannot batch */
+			if (likely(!folio || !folio_test_large(folio)))
+				nr_ptes = 1;
+			else
+				nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
+					  max_nr_ptes, flags);
 
 			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
 			ptent = pte_modify(oldpte, newprot);
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()
  2026-03-19 18:31 ` [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags() Pedro Falcato
@ 2026-03-19 19:14   ` Lorenzo Stoakes (Oracle)
  2026-03-19 21:41     ` David Hildenbrand (Arm)
  2026-03-20 10:34     ` Pedro Falcato
  0 siblings, 2 replies; 32+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 19:14 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	David Hildenbrand, Dev Jain, Luke Yang, jhladky, linux-mm,
	linux-kernel

On Thu, Mar 19, 2026 at 06:31:07PM +0000, Pedro Falcato wrote:
> Hoist some previous, important to be inlined checks to the main loop
> and noinline the entirety of folio_pte_batch_flags(). The loop itself is
> quite large and whether it is inlined or not should not matter, as we
> are ideally dealing with larger orders.
>
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> ---
>  mm/mprotect.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8d4fa38a8a26..aa845f5bf14d 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -103,16 +103,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return can_change_shared_pte_writable(vma, pte);
>  }
>
> -static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> +static noinline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,

Hmm I'm iffy about noinline-ing a 1 line function now?

And now yu're noinlining something that contains an inline (but not guaranteed
to be function, it's a bit strange overall?

>  				    pte_t pte, int max_nr_ptes, fpb_t flags)
>  {
> -	/* No underlying folio, so cannot batch */
> -	if (!folio)
> -		return 1;
> -
> -	if (!folio_test_large(folio))
> -		return 1;
> -
>  	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
>  }
>
> @@ -333,7 +326,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				continue;
>  			}
>

^^^ up here is a mprotect_folio_pte_batch() invocation for the prot_numa case,
which now won't be doing:

	if (!folio)
		return 1;

	if (!folio_test_large(folio))
		return 1;

At all, so isn't that potentially a pessimisation in itself?


> -			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
> +			/* No underlying folio (or not large), so cannot batch */
> +			if (likely(!folio || !folio_test_large(folio)))

We actually sure of this likely()? Is there data to support it? Am not a fan of
using likely()/unlikey() without something to back it.

> +				nr_ptes = 1;
> +			else
> +				nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> +					  max_nr_ptes, flags);

It's also pretty gross to throw this out into a massive function.

>
>  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>  			ptent = pte_modify(oldpte, newprot);
> --
> 2.53.0
>

This is all seems VERY delicate, and subject to somebody else coming along and
breaking it/causing some of these noinline/__always_inline invocations to make
things far worse.

I also reserve the right to seriously rework this pile of crap software.

I'd rather we try to find less fragile ways to optimise!

Maybe there's some steps that are bigger wins than others?

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()
  2026-03-19 19:14   ` Lorenzo Stoakes (Oracle)
@ 2026-03-19 21:41     ` David Hildenbrand (Arm)
  2026-03-20 10:36       ` Lorenzo Stoakes (Oracle)
  2026-03-20 10:34     ` Pedro Falcato
  1 sibling, 1 reply; 32+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-19 21:41 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle), Pedro Falcato
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel


> 
> This is all seems VERY delicate, and subject to somebody else coming along and
> breaking it/causing some of these noinline/__always_inline invocations to make
> things far worse.
> 
> I also reserve the right to seriously rework this pile of crap software.
> 
> I'd rather we try to find less fragile ways to optimise!
> 
> Maybe there's some steps that are bigger wins than others?

What we can do is, collect similar folio_pte_batch_*() variants and
centralize them in mm/utils.c.

For

	nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
					   max_nr_ptes, /* flags = */ 0)

We might just be able to use folio_pte_batch()?

For the other variant (soft-dirt+write) we'd have to create a helper like

	folio_pte_batch_sd_w() [better name suggestion welcome]

That will reduce the code footprint overall I guess.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()
  2026-03-19 21:41     ` David Hildenbrand (Arm)
@ 2026-03-20 10:36       ` Lorenzo Stoakes (Oracle)
  2026-03-20 10:59         ` Pedro Falcato
  2026-03-20 11:01         ` David Hildenbrand (Arm)
  0 siblings, 2 replies; 32+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-20 10:36 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Pedro Falcato, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

On Thu, Mar 19, 2026 at 10:41:54PM +0100, David Hildenbrand (Arm) wrote:
>
> >
> > This is all seems VERY delicate, and subject to somebody else coming along and
> > breaking it/causing some of these noinline/__always_inline invocations to make
> > things far worse.
> >
> > I also reserve the right to seriously rework this pile of crap software.
> >
> > I'd rather we try to find less fragile ways to optimise!
> >
> > Maybe there's some steps that are bigger wins than others?
>
> What we can do is, collect similar folio_pte_batch_*() variants and
> centralize them in mm/utils.c.

I'm not sure that addresses any of the comments above?

Putting logic specific to components of mm away from where those components
are and into mm/util.c seems like a complete regression in terms of
fragility and code separation.

And for what reason would you want to do that? To force a noinline of an
inline and people 'just have to know' that's why you randomly separated the
two?

Doesn't sound appealing overall.

I'd rather we find a way to implement the batching such that it doesn't
exhibit bad inlining decisions in the first place.

I mean mprotect_folio_batch() having !folio, !folio_test_large() checks
only there is already silly, we should have a _general_ function that does
optimisations like that.

Isn't the issue rather than folio_pte_batch_flags() shouldn't be an inline
function in internal.h but rather a function in mm/util.c?

>
> For
>
> 	nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> 					   max_nr_ptes, /* flags = */ 0)
>
> We might just be able to use folio_pte_batch()?

Yup.

>
> For the other variant (soft-dirt+write) we'd have to create a helper like
>
> 	folio_pte_batch_sd_w() [better name suggestion welcome]
>
> That will reduce the code footprint overall I guess.

I mean yeah that's a terrible name so obviously it'd have to be something
better.

But again, this seems pretty stupid, now we're writing a bunch of duplicate
per-case code to force noinline because we made the central function inline
no?

That's also super fragile, because an engineer might later decide that
pattern is horrible and fix it, and we regress this.

But I mean overall, is the perf here really all that important? Are people
really that dependent on mprotect() et al. performing brilliantly fast?

Couldn't we do this with any mm interface and end up making efforts that
degrade code quality, increase fragility for dubious benefit?

>
> --
> Cheers,
>
> David

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()
  2026-03-20 10:36       ` Lorenzo Stoakes (Oracle)
@ 2026-03-20 10:59         ` Pedro Falcato
  2026-03-20 11:02           ` David Hildenbrand (Arm)
  2026-03-20 11:27           ` Lorenzo Stoakes (Oracle)
  2026-03-20 11:01         ` David Hildenbrand (Arm)
  1 sibling, 2 replies; 32+ messages in thread
From: Pedro Falcato @ 2026-03-20 10:59 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: David Hildenbrand (Arm), Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Dev Jain, Luke Yang, jhladky,
	linux-mm, linux-kernel

On Fri, Mar 20, 2026 at 10:36:59AM +0000, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 19, 2026 at 10:41:54PM +0100, David Hildenbrand (Arm) wrote:
> >
> > >
> > > This is all seems VERY delicate, and subject to somebody else coming along and
> > > breaking it/causing some of these noinline/__always_inline invocations to make
> > > things far worse.
> > >
> > > I also reserve the right to seriously rework this pile of crap software.
> > >
> > > I'd rather we try to find less fragile ways to optimise!
> > >
> > > Maybe there's some steps that are bigger wins than others?
> >
> > What we can do is, collect similar folio_pte_batch_*() variants and
> > centralize them in mm/utils.c.
> 
> I'm not sure that addresses any of the comments above?
> 
> Putting logic specific to components of mm away from where those components
> are and into mm/util.c seems like a complete regression in terms of
> fragility and code separation.
> 
> And for what reason would you want to do that? To force a noinline of an
> inline and people 'just have to know' that's why you randomly separated the
> two?
> 
> Doesn't sound appealing overall.
> 
> I'd rather we find a way to implement the batching such that it doesn't
> exhibit bad inlining decisions in the first place.

Yes, you make a good point. At the end of the day (taking change_protection()
as the example at hand here), after these changes:
change_protection()
 loop over p4ds
   loop over puds
     loop over pmds
       loop over ptes
         nr_ptes = loop over ptes and find out how many we have
         if (making write) {
           loop over nr_ptes
              loop over ptes and find out how many are anonexclusive or not, in a row
              loop over ptes and set them
         } else {
	      loop over ptes and set them
         }

which the compiler FWIW tries to inline it all into one function, but then
does a poor job at figuring things out. And the CPU gets confused too. It
was frankly shocking how much performance I could squeeze out of a

if (nr_ptes == 1) {
	/* don't bother with the loops and the funny logic */
}

I would not be surprised if the other syscalls have similar problems.

> 
> I mean mprotect_folio_batch() having !folio, !folio_test_large() checks
> only there is already silly, we should have a _general_ function that does
> optimisations like that.
> 
> Isn't the issue rather than folio_pte_batch_flags() shouldn't be an inline
> function in internal.h but rather a function in mm/util.c?
> 
> >
> > For
> >
> > 	nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> > 					   max_nr_ptes, /* flags = */ 0)
> >
> > We might just be able to use folio_pte_batch()?
> 
> Yup.
> 
> >
> > For the other variant (soft-dirt+write) we'd have to create a helper like
> >
> > 	folio_pte_batch_sd_w() [better name suggestion welcome]
> >
> > That will reduce the code footprint overall I guess.
> 
> I mean yeah that's a terrible name so obviously it'd have to be something
> better.
> 
> But again, this seems pretty stupid, now we're writing a bunch of duplicate
> per-case code to force noinline because we made the central function inline
> no?

Yeah.

> 
> That's also super fragile, because an engineer might later decide that
> pattern is horrible and fix it, and we regress this.
> 
> But I mean overall, is the perf here really all that important? Are people
> really that dependent on mprotect() et al. performing brilliantly fast?

Obviously no one truly depends on mprotect, but I believe in fast primitives :)

> Couldn't we do this with any mm interface and end up making efforts that
> degrade code quality, increase fragility for dubious benefit?

Yes, which is why I don't want to degrade code quality :) It would be ideal to
find something that works both ways. Per my description of the existing code
above, you can tell that it's neither fast, nor beautiful :p

-- 
Pedro


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()
  2026-03-20 10:59         ` Pedro Falcato
@ 2026-03-20 11:02           ` David Hildenbrand (Arm)
  2026-03-20 11:27           ` Lorenzo Stoakes (Oracle)
  1 sibling, 0 replies; 32+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-20 11:02 UTC (permalink / raw)
  To: Pedro Falcato, Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel


> 
> which the compiler FWIW tries to inline it all into one function, but then
> does a poor job at figuring things out. And the CPU gets confused too. It
> was frankly shocking how much performance I could squeeze out of a
> 
> if (nr_ptes == 1) {
> 	/* don't bother with the loops and the funny logic */
> }

Yes, same experience with fork/munmap.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()
  2026-03-20 10:59         ` Pedro Falcato
  2026-03-20 11:02           ` David Hildenbrand (Arm)
@ 2026-03-20 11:27           ` Lorenzo Stoakes (Oracle)
  1 sibling, 0 replies; 32+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-20 11:27 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: David Hildenbrand (Arm), Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Dev Jain, Luke Yang, jhladky,
	linux-mm, linux-kernel

On Fri, Mar 20, 2026 at 10:59:55AM +0000, Pedro Falcato wrote:
> On Fri, Mar 20, 2026 at 10:36:59AM +0000, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Mar 19, 2026 at 10:41:54PM +0100, David Hildenbrand (Arm) wrote:
> > >
> > > >
> > > > This is all seems VERY delicate, and subject to somebody else coming along and
> > > > breaking it/causing some of these noinline/__always_inline invocations to make
> > > > things far worse.
> > > >
> > > > I also reserve the right to seriously rework this pile of crap software.
> > > >
> > > > I'd rather we try to find less fragile ways to optimise!
> > > >
> > > > Maybe there's some steps that are bigger wins than others?
> > >
> > > What we can do is, collect similar folio_pte_batch_*() variants and
> > > centralize them in mm/utils.c.
> >
> > I'm not sure that addresses any of the comments above?
> >
> > Putting logic specific to components of mm away from where those components
> > are and into mm/util.c seems like a complete regression in terms of
> > fragility and code separation.
> >
> > And for what reason would you want to do that? To force a noinline of an
> > inline and people 'just have to know' that's why you randomly separated the
> > two?
> >
> > Doesn't sound appealing overall.
> >
> > I'd rather we find a way to implement the batching such that it doesn't
> > exhibit bad inlining decisions in the first place.
>
> Yes, you make a good point. At the end of the day (taking change_protection()
> as the example at hand here), after these changes:
> change_protection()
>  loop over p4ds
>    loop over puds
>      loop over pmds
>        loop over ptes
>          nr_ptes = loop over ptes and find out how many we have
>          if (making write) {
>            loop over nr_ptes
>               loop over ptes and find out how many are anonexclusive or not, in a row
>               loop over ptes and set them
>          } else {
> 	      loop over ptes and set them
>          }
>
> which the compiler FWIW tries to inline it all into one function, but then
> does a poor job at figuring things out. And the CPU gets confused too. It
> was frankly shocking how much performance I could squeeze out of a
>
> if (nr_ptes == 1) {
> 	/* don't bother with the loops and the funny logic */
> }

Let's maybe focus on generalising this then to start?

I think David + I are in agreement that these are obvious (TM) improvements. But
let's see if we can generalise them?

>
> I would not be surprised if the other syscalls have similar problems.
>
> >
> > I mean mprotect_folio_batch() having !folio, !folio_test_large() checks
> > only there is already silly, we should have a _general_ function that does
> > optimisations like that.
> >
> > Isn't the issue rather than folio_pte_batch_flags() shouldn't be an inline
> > function in internal.h but rather a function in mm/util.c?
> >
> > >
> > > For
> > >
> > > 	nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> > > 					   max_nr_ptes, /* flags = */ 0)
> > >
> > > We might just be able to use folio_pte_batch()?
> >
> > Yup.
> >
> > >
> > > For the other variant (soft-dirt+write) we'd have to create a helper like
> > >
> > > 	folio_pte_batch_sd_w() [better name suggestion welcome]
> > >
> > > That will reduce the code footprint overall I guess.
> >
> > I mean yeah that's a terrible name so obviously it'd have to be something
> > better.
> >
> > But again, this seems pretty stupid, now we're writing a bunch of duplicate
> > per-case code to force noinline because we made the central function inline
> > no?
>
> Yeah.

I guess we can discuss on other part of thread :)

>
> >
> > That's also super fragile, because an engineer might later decide that
> > pattern is horrible and fix it, and we regress this.
> >
> > But I mean overall, is the perf here really all that important? Are people
> > really that dependent on mprotect() et al. performing brilliantly fast?
>
> Obviously no one truly depends on mprotect, but I believe in fast primitives :)
>
> > Couldn't we do this with any mm interface and end up making efforts that
> > degrade code quality, increase fragility for dubious benefit?
>
> Yes, which is why I don't want to degrade code quality :) It would be ideal to
> find something that works both ways. Per my description of the existing code
> above, you can tell that it's neither fast, nor beautiful :p

I mean the code is terrible in most of these places (mremap() is much better
because I put a lot of time into de-insane-it-ising it), but it's less degrading
quality but rather:

static noinline void blahdy_blah(a trillion parameters)
{
	... lots of code ...
}

Developer comes along, modifies code/params/etc. or uses function in another
place and degrades perf somehow and suddenly what made sense at X point of time
no longer makes sense at Y point of time, but develoeprs don't understand it so
are scared to remove it and etc.

Which is why it's better to try to abstract as much as possible and have some
way of then putting the sensitive stuff in a specific place like:

/*
 * Lorenzo-style overly long comment with lots of exposition, a beginning,
 * middle and end, ASCII diagrams and all sorts...
 *
 * ...
 */
static noinline void __super_special_perf_bit_of_whatever_dont_touch_please(...)
{
}

That's wrapped up in some saner interface that people can use at will with the
perf-y component safely locked away in an insane asylum^W^W another compilation
unit or header.

>
> --
> Pedro

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()
  2026-03-20 10:36       ` Lorenzo Stoakes (Oracle)
  2026-03-20 10:59         ` Pedro Falcato
@ 2026-03-20 11:01         ` David Hildenbrand (Arm)
  2026-03-20 11:45           ` Lorenzo Stoakes (Oracle)
  1 sibling, 1 reply; 32+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-20 11:01 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Pedro Falcato, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

On 3/20/26 11:36, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 19, 2026 at 10:41:54PM +0100, David Hildenbrand (Arm) wrote:
>>
>>>
>>> This is all seems VERY delicate, and subject to somebody else coming along and
>>> breaking it/causing some of these noinline/__always_inline invocations to make
>>> things far worse.
>>>
>>> I also reserve the right to seriously rework this pile of crap software.
>>>
>>> I'd rather we try to find less fragile ways to optimise!
>>>
>>> Maybe there's some steps that are bigger wins than others?
>>
>> What we can do is, collect similar folio_pte_batch_*() variants and
>> centralize them in mm/utils.c.
> 
> I'm not sure that addresses any of the comments above?

Well, the point is that you still end up with an optimized variant
regarding flags, and you don't have to play games in this file with
"inline".

> 
> Putting logic specific to components of mm away from where those components
> are and into mm/util.c seems like a complete regression in terms of
> fragility and code separation.

My point is that in mm/rmap.c we have another caller that passes
FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY, that could use the same
optimized function.

For folio_pte_batch_flags() we primarily care about propagating the
flags such that __pte_batch_clear_ignored() will become as short and
small as possible.

How much that matters in practice? For fork() and unmap() it was
measurable. So I'd assume for mprotect() it would matter as well.

> 
> And for what reason would you want to do that? To force a noinline of an
> inline and people 'just have to know' that's why you randomly separated the
> two?

Again, I don't want the noinline. But providing a single optimized
instance for two users that care about the same flags makes perfect
sense to me.

> 
> Doesn't sound appealing overall.
> 
> I'd rather we find a way to implement the batching such that it doesn't
> exhibit bad inlining decisions in the first place.
> 
> I mean mprotect_folio_batch() having !folio, !folio_test_large() checks
> only there is already silly, we should have a _general_ function that does
> optimisations like that.
> 
> Isn't the issue rather than folio_pte_batch_flags() shouldn't be an inline
> function in internal.h but rather a function in mm/util.c?

No, we want specialized code for fork and zapping. That's the whole
purpose: optimize out the flag checks in the loop.

> 
>>
>> For
>>
>> 	nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
>> 					   max_nr_ptes, /* flags = */ 0)
>>
>> We might just be able to use folio_pte_batch()?
> 
> Yup.
> 
>>
>> For the other variant (soft-dirt+write) we'd have to create a helper like
>>
>> 	folio_pte_batch_sd_w() [better name suggestion welcome]
>>
>> That will reduce the code footprint overall I guess.
> 
> I mean yeah that's a terrible name so obviously it'd have to be something
> better.
> 
> But again, this seems pretty stupid, now we're writing a bunch of duplicate
> per-case code to force noinline because we made the central function inline
> no?
> 
> That's also super fragile, because an engineer might later decide that
> pattern is horrible and fix it, and we regress this.
> 
> But I mean overall, is the perf here really all that important? Are people
> really that dependent on mprotect() et al. performing brilliantly fast?

For basic primitives like mprotect/zap/fork I think yes. For other stuff
like rmap.c, maybe not.

> 
> Couldn't we do this with any mm interface and end up making efforts that
> degrade code quality, increase fragility for dubious benefit?

I don't see a big issue here like you do.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()
  2026-03-20 11:01         ` David Hildenbrand (Arm)
@ 2026-03-20 11:45           ` Lorenzo Stoakes (Oracle)
  2026-03-23 12:56             ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-20 11:45 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Pedro Falcato, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

On Fri, Mar 20, 2026 at 12:01:09PM +0100, David Hildenbrand (Arm) wrote:
> On 3/20/26 11:36, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Mar 19, 2026 at 10:41:54PM +0100, David Hildenbrand (Arm) wrote:
> >>
> >>>
> >>> This is all seems VERY delicate, and subject to somebody else coming along and
> >>> breaking it/causing some of these noinline/__always_inline invocations to make
> >>> things far worse.
> >>>
> >>> I also reserve the right to seriously rework this pile of crap software.
> >>>
> >>> I'd rather we try to find less fragile ways to optimise!
> >>>
> >>> Maybe there's some steps that are bigger wins than others?
> >>
> >> What we can do is, collect similar folio_pte_batch_*() variants and
> >> centralize them in mm/utils.c.
> >
> > I'm not sure that addresses any of the comments above?
>
> Well, the point is that you still end up with an optimized variant
> regarding flags, and you don't have to play games in this file with
> "inline".

Right, I mean maybe we're talking across purposes here, if the variants are say
generalised specific sets of behaviour rather than 'mprotect folio pte batch' or
'mremap folio pte batch' or whatever then that's fine.

>
> >
> > Putting logic specific to components of mm away from where those components
> > are and into mm/util.c seems like a complete regression in terms of
> > fragility and code separation.
>
> My point is that in mm/rmap.c we have another caller that passes
> FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY, that could use the same
> optimized function.
>
> For folio_pte_batch_flags() we primarily care about propagating the
> flags such that __pte_batch_clear_ignored() will become as short and
> small as possible.
>
> How much that matters in practice? For fork() and unmap() it was
> measurable. So I'd assume for mprotect() it would matter as well.

Yeah I think this was a misunderstanding, that's fine.

>
> >
> > And for what reason would you want to do that? To force a noinline of an
> > inline and people 'just have to know' that's why you randomly separated the
> > two?
>
> Again, I don't want the noinline. But providing a single optimized
> instance for two users that care about the same flags makes perfect
> sense to me.

Yup, that's fine! I agree with you now I realise this was a
misunderstanding :p

>
> >
> > Doesn't sound appealing overall.
> >
> > I'd rather we find a way to implement the batching such that it doesn't
> > exhibit bad inlining decisions in the first place.
> >
> > I mean mprotect_folio_batch() having !folio, !folio_test_large() checks
> > only there is already silly, we should have a _general_ function that does
> > optimisations like that.
> >
> > Isn't the issue rather than folio_pte_batch_flags() shouldn't be an inline
> > function in internal.h but rather a function in mm/util.c?
>
> No, we want specialized code for fork and zapping. That's the whole
> purpose: optimize out the flag checks in the loop.

Yeah, I think something isn't quite working right there though if we're
seeing measureable improvements by doing:

static noinline unsigned int blahdy_blah()
{
	...
	return folio_pte_batch_flags(...);
}

But we need to figure out exactly what via actual perf analysis, as perf is
an area that totally confounds developer expectation and 'going with your
gut' is a perfect way of doing completely the wrong thing.

(And I've repeatedly seen otherwise smart developers do this OVER and OVER
again, it's like a honey trap for the clever).

Not saying any of that applies to you or Pedro :) just a general point.

I don't _love_ that function being inline like that, but I will cede to the
data, but right now it seems, at least for I guess order-0 folios that it's
not quite doing what we want.

Anyway we're agreed the best way forwards here, for now, is to specialise
order-0 which just seems like an all-round win, and maybe the rest is not
as much of a contributor?

>
> >
> >>
> >> For
> >>
> >> 	nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> >> 					   max_nr_ptes, /* flags = */ 0)
> >>
> >> We might just be able to use folio_pte_batch()?
> >
> > Yup.
> >
> >>
> >> For the other variant (soft-dirt+write) we'd have to create a helper like
> >>
> >> 	folio_pte_batch_sd_w() [better name suggestion welcome]
> >>
> >> That will reduce the code footprint overall I guess.
> >
> > I mean yeah that's a terrible name so obviously it'd have to be something
> > better.
> >
> > But again, this seems pretty stupid, now we're writing a bunch of duplicate
> > per-case code to force noinline because we made the central function inline
> > no?
> >
> > That's also super fragile, because an engineer might later decide that
> > pattern is horrible and fix it, and we regress this.
> >
> > But I mean overall, is the perf here really all that important? Are people
> > really that dependent on mprotect() et al. performing brilliantly fast?
>
> For basic primitives like mprotect/zap/fork I think yes. For other stuff
> like rmap.c, maybe not.

Well on big ranges of mprotect() it could be, and I know often databases
like to do this kind of thing potentially, so yeah sure.

But more so the microbenchmark stuff of *a million protect() invocations*
is not something to optimise for so much.

Rather I'd say mprotect() over larger ranges is what we should look to.

Fork of course is utterly critical despite fork being both a terrible
mistake and a great idea at the exact same time in OS development history
:)

>
> >
> > Couldn't we do this with any mm interface and end up making efforts that
> > degrade code quality, increase fragility for dubious benefit?
>
> I don't see a big issue here like you do.

As I've said to Pedro elsewhere here, I guess my concern is nuanced:

So if we introduce stuff like carefully chosen __always_inline or noinline
or other things that have characteristics like:

- They're beneficial for the code AS-IS.
- They're based on compiler codegen that can easily be altered by other
  changes.
- It is not obvious how other changes to the code might break them.

We are asking for trouble - because people WILL change that code and WILL
break that, OR a possibly worse outcome - something like a noinline sticks
around when it makes sense, but everybody's scared to remove it + _doesn't
know why it's there_ - so it becomes a part of 'oh yeah we don't touch
that' lore that exists for a lot of 'weird' stuff in the kernel.

Then it might end up actually _worsening_ the performance in future
accidentally because nobody dare touch it.

Or another hellish future is one in which such things cause bot perf
regression reports for otherwise fine series, on microoptimisations we're
not even clear matter, and cause developers to have to spend hours figuring
out how to avoid them, meanwhile potentially making it even more difficult
to understand why the code is the way it is.

So what is the solution?

1. Focus on the changes that are NOT brittle like this, e.g. special casing
   order-0 is fine, adding profile/benchmark-proven likely()/unlikely(),
   etc. - these are not things that have the above characteristics and are
   just wins.

2. For cases where things MIGHT have the characteristics listed above,
   avoid the issue by abstracting it as much as possible, adding lengthily
   comments and making it as hard as possible to screw it up/misunderstand
   it.

3. Often times perf issues coming up might be an indication that the
   underlying mechanism is itself not well abstracted/already adding
   unnecessary complexity that manifests in perf issues, so in that case -
   rework first.

mm/mprotect.c is a good example of case 3 I think in that it's a big ball
of mud with overly long functions (e.g. change_pte_range(),
do_mprotect_pkey()) that likely refactoring code actually see perf gains
just from the compiler not having to heuristically determine inlining/other
optimisations due to functions being in smalle rchunks.

Anwyay in this case, we can pull out an example of approach 3 (the softleaf
specialisation), and approach 1 (order-0) handling easily, and defer
considering taking approach 2 if it makes sense to later, if we get most of
the wins by doing the former 2 things.

>
> --
> Cheers,
>
> David

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()
  2026-03-20 11:45           ` Lorenzo Stoakes (Oracle)
@ 2026-03-23 12:56             ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-23 12:56 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Pedro Falcato, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

>>> I mean yeah that's a terrible name so obviously it'd have to be something
>>> better.
>>>
>>> But again, this seems pretty stupid, now we're writing a bunch of duplicate
>>> per-case code to force noinline because we made the central function inline
>>> no?
>>>
>>> That's also super fragile, because an engineer might later decide that
>>> pattern is horrible and fix it, and we regress this.
>>>
>>> But I mean overall, is the perf here really all that important? Are people
>>> really that dependent on mprotect() et al. performing brilliantly fast?
>>
>> For basic primitives like mprotect/zap/fork I think yes. For other stuff
>> like rmap.c, maybe not.
> 
> Well on big ranges of mprotect() it could be, and I know often databases
> like to do this kind of thing potentially, so yeah sure.
> 
> But more so the microbenchmark stuff of *a million protect() invocations*
> is not something to optimise for so much.
> 
> Rather I'd say mprotect() over larger ranges is what we should look to.

I tend to agree (and I think I made a similar point in previous
discussions around mprotect() performance).

There is the use case for userspace jits etc to call mprotect() on
individual pages. I suspected that TLB flushing and syscall overhead
would overshadow most micro-optimizations. :)

[...]

> 
> As I've said to Pedro elsewhere here, I guess my concern is nuanced:
> 
> So if we introduce stuff like carefully chosen __always_inline or noinline
> or other things that have characteristics like:
> 
> - They're beneficial for the code AS-IS.
> - They're based on compiler codegen that can easily be altered by other
>   changes.
> - It is not obvious how other changes to the code might break them.
> 
> We are asking for trouble - because people WILL change that code and WILL
> break that, OR a possibly worse outcome - something like a noinline sticks
> around when it makes sense, but everybody's scared to remove it + _doesn't
> know why it's there_ - so it becomes a part of 'oh yeah we don't touch
> that' lore that exists for a lot of 'weird' stuff in the kernel.
> 
> Then it might end up actually _worsening_ the performance in future
> accidentally because nobody dare touch it.
> 
> Or another hellish future is one in which such things cause bot perf
> regression reports for otherwise fine series, on microoptimisations we're
> not even clear matter, and cause developers to have to spend hours figuring
> out how to avoid them, meanwhile potentially making it even more difficult
> to understand why the code is the way it is.
> 
> So what is the solution?
> 
> 1. Focus on the changes that are NOT brittle like this, e.g. special casing
>    order-0 is fine, adding profile/benchmark-proven likely()/unlikely(),
>    etc. - these are not things that have the above characteristics and are
>    just wins.

Agreed.

> 
> 2. For cases where things MIGHT have the characteristics listed above,
>    avoid the issue by abstracting it as much as possible, adding lengthily
>    comments and making it as hard as possible to screw it up/misunderstand
>    it.

Agreed.

> 
> 3. Often times perf issues coming up might be an indication that the
>    underlying mechanism is itself not well abstracted/already adding
>    unnecessary complexity that manifests in perf issues, so in that case -
>    rework first.
Agreed.


I think the usage of noinline for micro-performance optimization is
really questionable and should be avoided at all costs.

The folio_pte_patch() stuff likely really should just be a set of
mm/util.c helpers that specialize on the flags only to make the inner
loop as efficient as possible.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()
  2026-03-19 19:14   ` Lorenzo Stoakes (Oracle)
  2026-03-19 21:41     ` David Hildenbrand (Arm)
@ 2026-03-20 10:34     ` Pedro Falcato
  2026-03-20 10:51       ` Lorenzo Stoakes (Oracle)
  1 sibling, 1 reply; 32+ messages in thread
From: Pedro Falcato @ 2026-03-20 10:34 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	David Hildenbrand, Dev Jain, Luke Yang, jhladky, linux-mm,
	linux-kernel

On Thu, Mar 19, 2026 at 07:14:38PM +0000, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 19, 2026 at 06:31:07PM +0000, Pedro Falcato wrote:
> > Hoist some previous, important to be inlined checks to the main loop
> > and noinline the entirety of folio_pte_batch_flags(). The loop itself is
> > quite large and whether it is inlined or not should not matter, as we
> > are ideally dealing with larger orders.
> >
> > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > ---
> >  mm/mprotect.c | 16 +++++++---------
> >  1 file changed, 7 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 8d4fa38a8a26..aa845f5bf14d 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -103,16 +103,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> >  	return can_change_shared_pte_writable(vma, pte);
> >  }
> >
> > -static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> > +static noinline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> 
> Hmm I'm iffy about noinline-ing a 1 line function now?

Well, it's a 1 line function that itself (probably) has an inlined
folio_pte_batch_flags() with a bunch of inlined functions, calls, etc. But yes, I
see your point.

> 
> And now yu're noinlining something that contains an inline (but not guaranteed
> to be function, it's a bit strange overall?
> 
> >  				    pte_t pte, int max_nr_ptes, fpb_t flags)
> >  {
> > -	/* No underlying folio, so cannot batch */
> > -	if (!folio)
> > -		return 1;
> > -
> > -	if (!folio_test_large(folio))
> > -		return 1;
> > -
> >  	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
> >  }
> >
> > @@ -333,7 +326,12 @@ static long change_pte_range(struct mmu_gather *tlb,
> >  				continue;
> >  			}
> >
> 
> ^^^ up here is a mprotect_folio_pte_batch() invocation for the prot_numa case,
> which now won't be doing:

Yep, this is a bug (that sashiko also caught!)

> 
> 	if (!folio)
> 		return 1;
> 
> 	if (!folio_test_large(folio))
> 		return 1;
> 
> At all, so isn't that potentially a pessimisation in itself?
> 
> 
> > -			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
> > +			/* No underlying folio (or not large), so cannot batch */
> > +			if (likely(!folio || !folio_test_large(folio)))
> 
> We actually sure of this likely()? Is there data to support it? Am not a fan of
> using likely()/unlikey() without something to back it.

I did a small little test yesterday with bpftrace + a kernel build +
filemap_get_folio. I got a staggering majority of order-0 folios, with some
smaller folios spread out across a bunch of orders (up to 8, iirc). Of course
I'm on x86 so I don't have mTHP, etc enabled, so those will all be order-0 or
PMD_ORDER folios (which we won't see here).

> 
> > +				nr_ptes = 1;
> > +			else
> > +				nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> > +					  max_nr_ptes, flags);
> 
> It's also pretty gross to throw this out into a massive function.
> 
> >
> >  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> >  			ptent = pte_modify(oldpte, newprot);
> > --
> > 2.53.0
> >
> 
> This is all seems VERY delicate, and subject to somebody else coming along and
> breaking it/causing some of these noinline/__always_inline invocations to make
> things far worse.

Who told you optimization isn't delicate?!

> 
> I also reserve the right to seriously rework this pile of crap software.
> 
> I'd rather we try to find less fragile ways to optimise!

But yes, I understand your point. Hm. I'll need to think about this some more.
There's nothing I would love more than to simply slap a

if (pte_batch_hint(ptep, pte) == 1)
	nr_ptes = 1;

on !contpte archs. I don't see where most of the wins for those architectures
would exist, but apparently they do. Confusing :/

> 
> Maybe there's some steps that are bigger wins than others?

Definitely :) So yeah I'll probably be dropping this patch, or at least
reworking this one a good bit.

-- 
Pedro


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags()
  2026-03-20 10:34     ` Pedro Falcato
@ 2026-03-20 10:51       ` Lorenzo Stoakes (Oracle)
  0 siblings, 0 replies; 32+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-20 10:51 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	David Hildenbrand, Dev Jain, Luke Yang, jhladky, linux-mm,
	linux-kernel

On Fri, Mar 20, 2026 at 10:34:29AM +0000, Pedro Falcato wrote:
> On Thu, Mar 19, 2026 at 07:14:38PM +0000, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Mar 19, 2026 at 06:31:07PM +0000, Pedro Falcato wrote:
> > > Hoist some previous, important to be inlined checks to the main loop
> > > and noinline the entirety of folio_pte_batch_flags(). The loop itself is
> > > quite large and whether it is inlined or not should not matter, as we
> > > are ideally dealing with larger orders.
> > >
> > > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > > ---
> > >  mm/mprotect.c | 16 +++++++---------
> > >  1 file changed, 7 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 8d4fa38a8a26..aa845f5bf14d 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -103,16 +103,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> > >  	return can_change_shared_pte_writable(vma, pte);
> > >  }
> > >
> > > -static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> > > +static noinline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> >
> > Hmm I'm iffy about noinline-ing a 1 line function now?
>
> Well, it's a 1 line function that itself (probably) has an inlined
> folio_pte_batch_flags() with a bunch of inlined functions, calls, etc. But yes, I
> see your point.

I mean again the weird thing is folio_pte_batch_flags() being inline in a header
file in the first place. It even says:

 * This function will be inlined to optimize based on the input parameters;
 * consider using folio_pte_batch() instead if applicable.

In the comment.

noinline'ing to not inline an inline function that explicitly says it's inline
for performance seems... silly.

>
> >
> > And now yu're noinlining something that contains an inline (but not guaranteed
> > to be function, it's a bit strange overall?
> >
> > >  				    pte_t pte, int max_nr_ptes, fpb_t flags)
> > >  {
> > > -	/* No underlying folio, so cannot batch */
> > > -	if (!folio)
> > > -		return 1;
> > > -
> > > -	if (!folio_test_large(folio))
> > > -		return 1;
> > > -
> > >  	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
> > >  }
> > >
> > > @@ -333,7 +326,12 @@ static long change_pte_range(struct mmu_gather *tlb,
> > >  				continue;
> > >  			}
> > >
> >
> > ^^^ up here is a mprotect_folio_pte_batch() invocation for the prot_numa case,
> > which now won't be doing:
>
> Yep, this is a bug (that sashiko also caught!)
>
> >
> > 	if (!folio)
> > 		return 1;
> >
> > 	if (!folio_test_large(folio))
> > 		return 1;
> >
> > At all, so isn't that potentially a pessimisation in itself?
> >
> >
> > > -			nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
> > > +			/* No underlying folio (or not large), so cannot batch */
> > > +			if (likely(!folio || !folio_test_large(folio)))
> >
> > We actually sure of this likely()? Is there data to support it? Am not a fan of
> > using likely()/unlikey() without something to back it.
>
> I did a small little test yesterday with bpftrace + a kernel build +
> filemap_get_folio. I got a staggering majority of order-0 folios, with some
> smaller folios spread out across a bunch of orders (up to 8, iirc). Of course
> I'm on x86 so I don't have mTHP, etc enabled, so those will all be order-0 or
> PMD_ORDER folios (which we won't see here).
>
> >
> > > +				nr_ptes = 1;
> > > +			else
> > > +				nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> > > +					  max_nr_ptes, flags);
> >
> > It's also pretty gross to throw this out into a massive function.
> >
> > >
> > >  			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> > >  			ptent = pte_modify(oldpte, newprot);
> > > --
> > > 2.53.0
> > >
> >
> > This is all seems VERY delicate, and subject to somebody else coming along and
> > breaking it/causing some of these noinline/__always_inline invocations to make
> > things far worse.
>
> Who told you optimization isn't delicate?!

I don't want to accept changes that will randomly break in future, it's fairly
pointless, because I guarantee somebody will break things at some point if
they're fragile.

And we _really_ do not want or need to have functions where a bot will shout at
you and you spend X hours figuring out why you reduced the time on a
microbenchmark that does something nobody cares about.

BUT, if we can _abstract_ the bigger wins by rethinking batching somewhat then
we could achieve something here, and that'd definitely be worthwhile.

I think that really the batching implementation is a bit of a mess because it
was done without fixing any existing technical debt first.

We need to stop accepting series that pile more code on top of existing code
that is already a mess like that.

>
> >
> > I also reserve the right to seriously rework this pile of crap software.
> >
> > I'd rather we try to find less fragile ways to optimise!
>
> But yes, I understand your point. Hm. I'll need to think about this some more.
> There's nothing I would love more than to simply slap a
>
> if (pte_batch_hint(ptep, pte) == 1)
> 	nr_ptes = 1;
>
> on !contpte archs. I don't see where most of the wins for those architectures
> would exist, but apparently they do. Confusing :/
>
> >
> > Maybe there's some steps that are bigger wins than others?
>
> Definitely :) So yeah I'll probably be dropping this patch, or at least
> reworking this one a good bit.

Thanks, sorry to be negative - I appreciate the detailed perf analysis, but I
want to make sure we do this in such a way that future Lorenzo and future Pedro
and future David and future whoever can remember WHY we did X and NOT to do Y
that breaks it :)

So that means abstracting it in a way where that's made easy for us (abstraction
stays stable and handles the perf details for us which are documented in e.g. a
comment, callers call, everybody's happy-ish).

>
> --
> Pedro

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 4/4] mm/mprotect: special-case small folios when applying write permissions
  2026-03-19 18:31 [PATCH 0/4] mm/mprotect: micro-optimization work Pedro Falcato
                   ` (2 preceding siblings ...)
  2026-03-19 18:31 ` [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags() Pedro Falcato
@ 2026-03-19 18:31 ` Pedro Falcato
  2026-03-19 19:17   ` Lorenzo Stoakes (Oracle)
  2026-03-19 21:43   ` David Hildenbrand (Arm)
  2026-03-20  2:42 ` [PATCH 0/4] mm/mprotect: micro-optimization work Andrew Morton
  4 siblings, 2 replies; 32+ messages in thread
From: Pedro Falcato @ 2026-03-19 18:31 UTC (permalink / raw)
  To: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes
  Cc: Pedro Falcato, Vlastimil Babka, Jann Horn, David Hildenbrand,
	Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

The common order-0 case is important enough to want its own branch, and
avoids the hairy, large loop logic that the CPU does not seem to handle
particularly well.

Signed-off-by: Pedro Falcato <pfalcato@suse.de>
---
 mm/mprotect.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index aa845f5bf14d..23a741f9a4c8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -170,6 +170,13 @@ static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
 	int sub_batch_idx = 0;
 	int len;
 
+	/* Optimize for the common order-0 case. */
+	if (likely(nr_ptes == 1)) {
+		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1,
+				       0, PageAnonExclusive(first_page), tlb);
+		return;
+	}
+
 	while (nr_ptes) {
 		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
 		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] mm/mprotect: special-case small folios when applying write permissions
  2026-03-19 18:31 ` [PATCH 4/4] mm/mprotect: special-case small folios when applying write permissions Pedro Falcato
@ 2026-03-19 19:17   ` Lorenzo Stoakes (Oracle)
  2026-03-20 10:36     ` Pedro Falcato
  2026-03-19 21:43   ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 19:17 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	David Hildenbrand, Dev Jain, Luke Yang, jhladky, linux-mm,
	linux-kernel

On Thu, Mar 19, 2026 at 06:31:08PM +0000, Pedro Falcato wrote:
> The common order-0 case is important enough to want its own branch, and
> avoids the hairy, large loop logic that the CPU does not seem to handle
> particularly well.
>

I think it'd be good to get a sense per-commit what the perf impact is.

> Signed-off-by: Pedro Falcato <pfalcato@suse.de>

I am iffy on the likely() in the same way I am always iffy on non-profile backed
likely()/unlikely() but this change seems sensible enough so:

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

> ---
>  mm/mprotect.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index aa845f5bf14d..23a741f9a4c8 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -170,6 +170,13 @@ static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
>  	int sub_batch_idx = 0;
>  	int len;
>
> +	/* Optimize for the common order-0 case. */
> +	if (likely(nr_ptes == 1)) {

I mean I we're likely()'ing here without data to back it, but I guess I believe
this one :)

> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1,
> +				       0, PageAnonExclusive(first_page), tlb);
> +		return;
> +	}
> +
>  	while (nr_ptes) {
>  		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
>  		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
> --
> 2.53.0
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] mm/mprotect: special-case small folios when applying write permissions
  2026-03-19 19:17   ` Lorenzo Stoakes (Oracle)
@ 2026-03-20 10:36     ` Pedro Falcato
  2026-03-20 10:42       ` Lorenzo Stoakes (Oracle)
  0 siblings, 1 reply; 32+ messages in thread
From: Pedro Falcato @ 2026-03-20 10:36 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	David Hildenbrand, Dev Jain, Luke Yang, jhladky, linux-mm,
	linux-kernel

On Thu, Mar 19, 2026 at 07:17:31PM +0000, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 19, 2026 at 06:31:08PM +0000, Pedro Falcato wrote:
> > The common order-0 case is important enough to want its own branch, and
> > avoids the hairy, large loop logic that the CPU does not seem to handle
> > particularly well.
> >
> 
> I think it'd be good to get a sense per-commit what the perf impact is.

I added some numbers on the cover letter - this patch + the __always_inline
one do the heavy lifting.

> 
> > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> 
> I am iffy on the likely() in the same way I am always iffy on non-profile backed
> likely()/unlikely() but this change seems sensible enough so:
> 
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Thanks for the review!

-- 
Pedro


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] mm/mprotect: special-case small folios when applying write permissions
  2026-03-20 10:36     ` Pedro Falcato
@ 2026-03-20 10:42       ` Lorenzo Stoakes (Oracle)
  0 siblings, 0 replies; 32+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-20 10:42 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	David Hildenbrand, Dev Jain, Luke Yang, jhladky, linux-mm,
	linux-kernel

On Fri, Mar 20, 2026 at 10:36:17AM +0000, Pedro Falcato wrote:
> On Thu, Mar 19, 2026 at 07:17:31PM +0000, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Mar 19, 2026 at 06:31:08PM +0000, Pedro Falcato wrote:
> > > The common order-0 case is important enough to want its own branch, and
> > > avoids the hairy, large loop logic that the CPU does not seem to handle
> > > particularly well.
> > >
> >
> > I think it'd be good to get a sense per-commit what the perf impact is.
>
> I added some numbers on the cover letter - this patch + the __always_inline
> one do the heavy lifting.

I mean again this makes me wonder if we shouldn't have some generalised batch
logic to handle order-0 cases.

>
> >
> > > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> >
> > I am iffy on the likely() in the same way I am always iffy on non-profile backed
> > likely()/unlikely() but this change seems sensible enough so:
> >
> > Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>
> Thanks for the review!
>
> --
> Pedro


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] mm/mprotect: special-case small folios when applying write permissions
  2026-03-19 18:31 ` [PATCH 4/4] mm/mprotect: special-case small folios when applying write permissions Pedro Falcato
  2026-03-19 19:17   ` Lorenzo Stoakes (Oracle)
@ 2026-03-19 21:43   ` David Hildenbrand (Arm)
  2026-03-20 10:37     ` Pedro Falcato
  1 sibling, 1 reply; 32+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-19 21:43 UTC (permalink / raw)
  To: Pedro Falcato, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes
  Cc: Vlastimil Babka, Jann Horn, Dev Jain, Luke Yang, jhladky,
	linux-mm, linux-kernel

On 3/19/26 19:31, Pedro Falcato wrote:
> The common order-0 case is important enough to want its own branch, and
> avoids the hairy, large loop logic that the CPU does not seem to handle
> particularly well.
> 
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> ---
>  mm/mprotect.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index aa845f5bf14d..23a741f9a4c8 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -170,6 +170,13 @@ static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
>  	int sub_batch_idx = 0;
>  	int len;
>  
> +	/* Optimize for the common order-0 case. */
> +	if (likely(nr_ptes == 1)) {
> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1,
> +				       0, PageAnonExclusive(first_page), tlb);
> +		return;
> +	}
> +
>  	while (nr_ptes) {
>  		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
>  		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,

That should likely be squashed into patch #1, because only then the
inline makes more sense.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] mm/mprotect: special-case small folios when applying write permissions
  2026-03-19 21:43   ` David Hildenbrand (Arm)
@ 2026-03-20 10:37     ` Pedro Falcato
  0 siblings, 0 replies; 32+ messages in thread
From: Pedro Falcato @ 2026-03-20 10:37 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Dev Jain, Luke Yang, jhladky, linux-mm, linux-kernel

On Thu, Mar 19, 2026 at 10:43:35PM +0100, David Hildenbrand (Arm) wrote:
> On 3/19/26 19:31, Pedro Falcato wrote:
> > The common order-0 case is important enough to want its own branch, and
> > avoids the hairy, large loop logic that the CPU does not seem to handle
> > particularly well.
> > 
> > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > ---
> >  mm/mprotect.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index aa845f5bf14d..23a741f9a4c8 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -170,6 +170,13 @@ static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
> >  	int sub_batch_idx = 0;
> >  	int len;
> >  
> > +	/* Optimize for the common order-0 case. */
> > +	if (likely(nr_ptes == 1)) {
> > +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1,
> > +				       0, PageAnonExclusive(first_page), tlb);
> > +		return;
> > +	}
> > +
> >  	while (nr_ptes) {
> >  		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
> >  		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
> 
> That should likely be squashed into patch #1, because only then the
> inline makes more sense.

Will do, thanks.

-- 
Pedro


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] mm/mprotect: micro-optimization work
  2026-03-19 18:31 [PATCH 0/4] mm/mprotect: micro-optimization work Pedro Falcato
                   ` (3 preceding siblings ...)
  2026-03-19 18:31 ` [PATCH 4/4] mm/mprotect: special-case small folios when applying write permissions Pedro Falcato
@ 2026-03-20  2:42 ` Andrew Morton
  4 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2026-03-20  2:42 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn,
	David Hildenbrand, Dev Jain, Luke Yang, jhladky, linux-mm,
	linux-kernel

On Thu, 19 Mar 2026 18:31:04 +0000 Pedro Falcato <pfalcato@suse.de> wrote:

> After a long session of performance-cat herding, here's the first version
> I am relatively ok with.
> 
> Micro-optimize the change_protection functionality and the
> change_pte_range() routine. This set of functions works in an incredibly
> tight loop, and even small inefficiencies are incredibly evident when spun
> hundreds, thousands or hundreds of thousands of times.

AI review has questions:
	https://sashiko.dev/#/patchset/20260319183108.1105090-1-pfalcato%40suse.de


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2026-03-23 12:56 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-19 18:31 [PATCH 0/4] mm/mprotect: micro-optimization work Pedro Falcato
2026-03-19 18:31 ` [PATCH 1/4] mm/mprotect: encourage inlining with __always_inline Pedro Falcato
2026-03-19 18:59   ` Lorenzo Stoakes (Oracle)
2026-03-19 19:00     ` Lorenzo Stoakes (Oracle)
2026-03-19 21:28   ` David Hildenbrand (Arm)
2026-03-20  9:59     ` Pedro Falcato
2026-03-20 10:08       ` David Hildenbrand (Arm)
2026-03-19 18:31 ` [PATCH 2/4] mm/mprotect: move softleaf code out of the main function Pedro Falcato
2026-03-19 19:06   ` Lorenzo Stoakes (Oracle)
2026-03-19 21:33   ` David Hildenbrand (Arm)
2026-03-20 10:04     ` Pedro Falcato
2026-03-20 10:07       ` David Hildenbrand (Arm)
2026-03-20 10:54         ` Lorenzo Stoakes (Oracle)
2026-03-19 18:31 ` [PATCH 3/4] mm/mprotect: un-inline folio_pte_batch_flags() Pedro Falcato
2026-03-19 19:14   ` Lorenzo Stoakes (Oracle)
2026-03-19 21:41     ` David Hildenbrand (Arm)
2026-03-20 10:36       ` Lorenzo Stoakes (Oracle)
2026-03-20 10:59         ` Pedro Falcato
2026-03-20 11:02           ` David Hildenbrand (Arm)
2026-03-20 11:27           ` Lorenzo Stoakes (Oracle)
2026-03-20 11:01         ` David Hildenbrand (Arm)
2026-03-20 11:45           ` Lorenzo Stoakes (Oracle)
2026-03-23 12:56             ` David Hildenbrand (Arm)
2026-03-20 10:34     ` Pedro Falcato
2026-03-20 10:51       ` Lorenzo Stoakes (Oracle)
2026-03-19 18:31 ` [PATCH 4/4] mm/mprotect: special-case small folios when applying write permissions Pedro Falcato
2026-03-19 19:17   ` Lorenzo Stoakes (Oracle)
2026-03-20 10:36     ` Pedro Falcato
2026-03-20 10:42       ` Lorenzo Stoakes (Oracle)
2026-03-19 21:43   ` David Hildenbrand (Arm)
2026-03-20 10:37     ` Pedro Falcato
2026-03-20  2:42 ` [PATCH 0/4] mm/mprotect: micro-optimization work Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox