* [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
@ 2025-08-07 18:58 Lorenzo Stoakes
2025-08-07 19:10 ` David Hildenbrand
` (4 more replies)
0 siblings, 5 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 18:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Barry Song, Dev Jain, linux-mm, linux-kernel, David Hildenbrand
It was discovered in the attached report that commit f822a9a81a31 ("mm:
optimize mremap() by PTE batching") introduced a significant performance
regression on a number of metrics on x86-64, most notably
stress-ng.bigheap.realloc_calls_per_sec - indicating a 37.3% regression in
number of mremap() calls per second.
I was able to reproduce this locally on an intel x86-64 raptor lake system,
noting an average of 143,857 realloc calls/sec (with a stddev of 4,531 or
3.1%) prior to this patch being applied, and 81,503 afterwards (stddev of
2,131 or 2.6%) - a 43.3% regression.
During testing I was able to determine that neither efforts to optimise the
folio_pte_batch() operation nor the folio_test_large() check made any
meaningful difference.
This is within expectation, as a regression this large is likely to
indicate we are accessing memory that is not yet in a cache line (and
perhaps may even cause a main memory fetch).
The expectation by those discussing this from the start was that
vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the
culprit due to having to retrieve memory from the vmemmap (which mremap()
page table moves does not otherwise do, meaning this is inevitably cold
memory).
I was able to definitively determine that this theory is indeed correct and
the cause of the issue.
The solution is to restore part of an approach previously discarded on
review, that is to invoke pte_batch_hint() which explicitly determines,
through reference to the PTE alone (thus no vmemmap lookup), what the PTE
batch size may be.
On platforms other than arm64 this is currently hardcoded to return 1, so
this naturally resolves the issue for x86-64, and for arm64 introduces
little to no overhead as the pte cache line will be hot.
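For reference, the generic fallback is trivial - a paraphrased sketch of the
include/linux/pgtable.h definition (not quoted verbatim):

#ifndef pte_batch_hint
/* Architectures without a cheap batch hint always report a batch of 1. */
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	return 1;
}
#endif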
With this patch applied, we move from 81,503 realloc calls/sec to
138,701 (stddev of 496.1 or 0.4%), which is a -3.6% regression, however
accounting for the variance in the original result, this is broadly
restoring performance to its prior state.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
Fixes: f822a9a81a31 ("mm: optimize mremap() by PTE batching")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mremap.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/mremap.c b/mm/mremap.c
index 677a4d744df9..9afa8cd524f5 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -179,6 +179,10 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
if (max_nr == 1)
return 1;
+ /* Avoid expensive folio lookup if we stand no chance of benefit. */
+ if (pte_batch_hint(ptep, pte) == 1)
+ return 1;
+
folio = vm_normal_folio(vma, addr, pte);
if (!folio || !folio_test_large(folio))
return 1;
--
2.50.1
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 18:58 [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch Lorenzo Stoakes
@ 2025-08-07 19:10 ` David Hildenbrand
2025-08-07 19:20 ` Lorenzo Stoakes
2025-08-07 19:14 ` Pedro Falcato
` (3 subsequent siblings)
4 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-08-07 19:10 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Barry Song, Dev Jain, linux-mm, linux-kernel
On 07.08.25 20:58, Lorenzo Stoakes wrote:
> It was discovered in the attached report that commit f822a9a81a31 ("mm:
> optimize mremap() by PTE batching") introduced a significant performance
> regression on a number of metrics on x86-64, most notably
> stress-ng.bigheap.realloc_calls_per_sec - indicating a 37.3% regression in
> number of mremap() calls per second.
>
> I was able to reproduce this locally on an intel x86-64 raptor lake system,
> noting an average of 143,857 realloc calls/sec (with a stddev of 4,531 or
> 3.1%) prior to this patch being applied, and 81,503 afterwards (stddev of
> 2,131 or 2.6%) - a 43.3% regression.
>
> During testing I was able to determine that there was no meaningful
> difference in efforts to optimise the folio_pte_batch() operation, nor
> checking folio_test_large().
>
> This is within expectation, as a regression this large is likely to
> indicate we are accessing memory that is not yet in a cache line (and
> perhaps may even cause a main memory fetch).
>
> The expectation by those discussing this from the start was that
> vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the
> culprit due to having to retrieve memory from the vmemmap (which mremap()
> page table moves does not otherwise do, meaning this is inevitably cold
> memory).
>
> I was able to definitively determine that this theory is indeed correct and
> the cause of the issue.
>
> The solution is to restore part of an approach previously discarded on
> review, that is to invoke pte_batch_hint() which explicitly determines,
> through reference to the PTE alone (thus no vmemmap lookup), what the PTE
> batch size may be.
>
> On platforms other than arm64 this is currently hardcoded to return 1, so
> this naturally resolves the issue for x86-64, and for arm64 introduces
> little to no overhead as the pte cache line will be hot.
>
> With this patch applied, we move from 81,503 realloc calls/sec to
> 138,701 (stddev of 496.1 or 0.4%), which is a -3.6% regression, however
> accounting for the variance in the original result, this is broadly
> restoring performance to its prior state.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
> Fixes: f822a9a81a31 ("mm: optimize mremap() by PTE batching")
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/mremap.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 677a4d744df9..9afa8cd524f5 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -179,6 +179,10 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
> if (max_nr == 1)
> return 1;
>
> + /* Avoid expensive folio lookup if we stand no chance of benefit. */
> + if (pte_batch_hint(ptep, pte) == 1)
> + return 1;
> +
> folio = vm_normal_folio(vma, addr, pte);
> if (!folio || !folio_test_large(folio))
> return 1;
Acked-by: David Hildenbrand <david@redhat.com>
Wondering whether we could then just use the batch hint instead of going
via the folio.
IOW,
return pte_batch_hint(ptep, pte);
Not sure if that was discussed at some point before we went into the
direction of using folios. But there really doesn't seem to be anything
gained for other architectures here (as raised by Jann).
--
Cheers,
David / dhildenb
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 18:58 [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch Lorenzo Stoakes
2025-08-07 19:10 ` David Hildenbrand
@ 2025-08-07 19:14 ` Pedro Falcato
2025-08-07 19:22 ` Lorenzo Stoakes
2025-08-08 5:19 ` Dev Jain
` (2 subsequent siblings)
4 siblings, 1 reply; 26+ messages in thread
From: Pedro Falcato @ 2025-08-07 19:14 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Barry Song, Dev Jain, linux-mm, linux-kernel, David Hildenbrand
On Thu, Aug 07, 2025 at 07:58:19PM +0100, Lorenzo Stoakes wrote:
> It was discovered in the attached report that commit f822a9a81a31 ("mm:
> optimize mremap() by PTE batching") introduced a significant performance
> regression on a number of metrics on x86-64, most notably
> stress-ng.bigheap.realloc_calls_per_sec - indicating a 37.3% regression in
> number of mremap() calls per second.
>
> I was able to reproduce this locally on an intel x86-64 raptor lake system,
> noting an average of 143,857 realloc calls/sec (with a stddev of 4,531 or
> 3.1%) prior to this patch being applied, and 81,503 afterwards (stddev of
> 2,131 or 2.6%) - a 43.3% regression.
>
> During testing I was able to determine that there was no meaningful
> difference in efforts to optimise the folio_pte_batch() operation, nor
> checking folio_test_large().
>
> This is within expectation, as a regression this large is likely to
> indicate we are accessing memory that is not yet in a cache line (and
> perhaps may even cause a main memory fetch).
>
> The expectation by those discussing this from the start was that
> vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the
> culprit due to having to retrieve memory from the vmemmap (which mremap()
> page table moves does not otherwise do, meaning this is inevitably cold
> memory).
>
> I was able to definitively determine that this theory is indeed correct and
> the cause of the issue.
>
> The solution is to restore part of an approach previously discarded on
> review, that is to invoke pte_batch_hint() which explicitly determines,
> through reference to the PTE alone (thus no vmemmap lookup), what the PTE
> batch size may be.
>
> On platforms other than arm64 this is currently hardcoded to return 1, so
> this naturally resolves the issue for x86-64, and for arm64 introduces
> little to no overhead as the pte cache line will be hot.
>
> With this patch applied, we move from 81,503 realloc calls/sec to
> 138,701 (stddev of 496.1 or 0.4%), which is a -3.6% regression, however
> accounting for the variance in the original result, this is broadly
> restoring performance to its prior state.
>
So, do we still have a regression then? If so, do we have any idea why?
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
> Fixes: f822a9a81a31 ("mm: optimize mremap() by PTE batching")
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Fix looks great, thanks!
Acked-by: Pedro Falcato <pfalcato@suse.de>
--
Pedro
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 19:10 ` David Hildenbrand
@ 2025-08-07 19:20 ` Lorenzo Stoakes
2025-08-07 19:41 ` David Hildenbrand
2025-08-07 19:56 ` Ryan Roberts
0 siblings, 2 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 19:20 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Barry Song, Dev Jain, linux-mm, linux-kernel,
Ryan Roberts
+cc Ryan for ContPTE stuff.
On Thu, Aug 07, 2025 at 09:10:52PM +0200, David Hildenbrand wrote:
> Acked-by: David Hildenbrand <david@redhat.com>
Thanks!
>
> Wondering whether we could then just use the patch hint instead of going via
> the folio.
>
> IOW,
>
> return pte_batch_hint(ptep, pte);
Wouldn't that break the A/D stuff? Also this doesn't mean that the PTE won't
have some conflicting flags potentially. The check is empirical:
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
if (!pte_valid_cont(pte))
return 1;
return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
}
So it's 'the most number of PTEs that _might_ coalesce'.
(note that a bit grossly we'll call it _again_ in folio_pte_batch_flags()).
I suppose we could not even bother with checking if same folio and _just_ check
if PTEs have consecutive PFNs, which is not very likely if different folio
but... could that break something?
It seems the 'magic' is in set_ptes() on arm64 where it'll know to do the 'right
thing' for a contPTE batch (I may be missing something - please correct me if so
Dev/Ryan).
So actually do we even really care that much about folio?
>
>
> Not sure if that was discussed at some point before we went into the
> direction of using folios. But there really doesn't seem to be anything
> gained for other architectures here (as raised by Jann).
Yup... I wonder about the other instances of this... ruh roh.
>
> --
> Cheers,
>
> David / dhildenb
>
>
Cheers, Lorenzo
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 19:14 ` Pedro Falcato
@ 2025-08-07 19:22 ` Lorenzo Stoakes
2025-08-07 19:33 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 19:22 UTC (permalink / raw)
To: Pedro Falcato
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Barry Song, Dev Jain, linux-mm, linux-kernel, David Hildenbrand
On Thu, Aug 07, 2025 at 08:14:09PM +0100, Pedro Falcato wrote:
> On Thu, Aug 07, 2025 at 07:58:19PM +0100, Lorenzo Stoakes wrote:
> > It was discovered in the attached report that commit f822a9a81a31 ("mm:
> > optimize mremap() by PTE batching") introduced a significant performance
> > regression on a number of metrics on x86-64, most notably
> > stress-ng.bigheap.realloc_calls_per_sec - indicating a 37.3% regression in
> > number of mremap() calls per second.
> >
> > I was able to reproduce this locally on an intel x86-64 raptor lake system,
> > noting an average of 143,857 realloc calls/sec (with a stddev of 4,531 or
> > 3.1%) prior to this patch being applied, and 81,503 afterwards (stddev of
> > 2,131 or 2.6%) - a 43.3% regression.
> >
> > During testing I was able to determine that there was no meaningful
> > difference in efforts to optimise the folio_pte_batch() operation, nor
> > checking folio_test_large().
> >
> > This is within expectation, as a regression this large is likely to
> > indicate we are accessing memory that is not yet in a cache line (and
> > perhaps may even cause a main memory fetch).
> >
> > The expectation by those discussing this from the start was that
> > vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the
> > culprit due to having to retrieve memory from the vmemmap (which mremap()
> > page table moves does not otherwise do, meaning this is inevitably cold
> > memory).
> >
> > I was able to definitively determine that this theory is indeed correct and
> > the cause of the issue.
> >
> > The solution is to restore part of an approach previously discarded on
> > review, that is to invoke pte_batch_hint() which explicitly determines,
> > through reference to the PTE alone (thus no vmemmap lookup), what the PTE
> > batch size may be.
> >
> > On platforms other than arm64 this is currently hardcoded to return 1, so
> > this naturally resolves the issue for x86-64, and for arm64 introduces
> > little to no overhead as the pte cache line will be hot.
> >
> > With this patch applied, we move from 81,503 realloc calls/sec to
> > 138,701 (stddev of 496.1 or 0.4%), which is a -3.6% regression, however
> > accounting for the variance in the original result, this is broadly
> > restoring performance to its prior state.
> >
>
> So, do we still have a regression then? If so, do we have any idea why?
It's within 1 stddev of the original results, so I'd say it's possibly
noise.
Let's see what the bots say. If there's something else, we can obviously
take a look. I think Jann's point about cold cache is the key one here,
and the delta here is indicative.
>
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
> > Fixes: f822a9a81a31 ("mm: optimize mremap() by PTE batching")
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Fix looks great, thanks!
>
> Acked-by: Pedro Falcato <pfalcato@suse.de>
Thanks!
>
> --
> Pedro
Cheers, Lorenzo
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 19:22 ` Lorenzo Stoakes
@ 2025-08-07 19:33 ` David Hildenbrand
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-08-07 19:33 UTC (permalink / raw)
To: Lorenzo Stoakes, Pedro Falcato
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Barry Song, Dev Jain, linux-mm, linux-kernel
On 07.08.25 21:22, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 08:14:09PM +0100, Pedro Falcato wrote:
>> On Thu, Aug 07, 2025 at 07:58:19PM +0100, Lorenzo Stoakes wrote:
>>> It was discovered in the attached report that commit f822a9a81a31 ("mm:
>>> optimize mremap() by PTE batching") introduced a significant performance
>>> regression on a number of metrics on x86-64, most notably
>>> stress-ng.bigheap.realloc_calls_per_sec - indicating a 37.3% regression in
>>> number of mremap() calls per second.
>>>
>>> I was able to reproduce this locally on an intel x86-64 raptor lake system,
>>> noting an average of 143,857 realloc calls/sec (with a stddev of 4,531 or
>>> 3.1%) prior to this patch being applied, and 81,503 afterwards (stddev of
>>> 2,131 or 2.6%) - a 43.3% regression.
>>>
>>> During testing I was able to determine that there was no meaningful
>>> difference in efforts to optimise the folio_pte_batch() operation, nor
>>> checking folio_test_large().
>>>
>>> This is within expectation, as a regression this large is likely to
>>> indicate we are accessing memory that is not yet in a cache line (and
>>> perhaps may even cause a main memory fetch).
>>>
>>> The expectation by those discussing this from the start was that
>>> vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the
>>> culprit due to having to retrieve memory from the vmemmap (which mremap()
>>> page table moves does not otherwise do, meaning this is inevitably cold
>>> memory).
>>>
>>> I was able to definitively determine that this theory is indeed correct and
>>> the cause of the issue.
>>>
>>> The solution is to restore part of an approach previously discarded on
>>> review, that is to invoke pte_batch_hint() which explicitly determines,
>>> through reference to the PTE alone (thus no vmemmap lookup), what the PTE
>>> batch size may be.
>>>
>>> On platforms other than arm64 this is currently hardcoded to return 1, so
>>> this naturally resolves the issue for x86-64, and for arm64 introduces
>>> little to no overhead as the pte cache line will be hot.
>>>
>>> With this patch applied, we move from 81,503 realloc calls/sec to
>>> 138,701 (stddev of 496.1 or 0.4%), which is a -3.6% regression, however
>>> accounting for the variance in the original result, this is broadly
>>> restoring performance to its prior state.
>>>
>>
>> So, do we still have a regression then? If so, do we have any idea why?
>
> It's within 1 stddev of the original results, so I'd say it's possibly
> noise.
It's very likely noise. And even if it's not, even a simple code layout
change by the compiler can provoke something like that.
--
Cheers,
David / dhildenb
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 19:20 ` Lorenzo Stoakes
@ 2025-08-07 19:41 ` David Hildenbrand
2025-08-07 20:11 ` Lorenzo Stoakes
2025-08-07 19:56 ` Ryan Roberts
1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-08-07 19:41 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Barry Song, Dev Jain, linux-mm, linux-kernel,
Ryan Roberts
On 07.08.25 21:20, Lorenzo Stoakes wrote:
> +cc Ryan for ContPTE stuff.
>
> On Thu, Aug 07, 2025 at 09:10:52PM +0200, David Hildenbrand wrote:
>> Acked-by: David Hildenbrand <david@redhat.com>
>
> Thanks!
>
>>
>> Wondering whether we could then just use the patch hint instead of going via
>> the folio.
>>
>> IOW,
>>
>> return pte_batch_hint(ptep, pte);
>
> Wouldn't that break the A/D stuff? Also this doesn't mean that the PTE won't
> have some conflicting flags potentially. The check is empirical:
>
> static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
> {
> if (!pte_valid_cont(pte))
> return 1;
>
> return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
> }
>
> So it's 'the most number of PTEs that _might_ coalesce'.
No. If the bit is set, all PTEs in the aligned range (e.g., 64 KiB
block) are coalesced. It's literally the bit telling the hardware that
it can coalesce in that range because all PTEs are alike.
The function tells you exactly how many PTEs you can batch from the
given PTEP in that 64 KiB block.
That's also why folio_pte_batch_flags() just jumps over that.
All you have to do is limit it by the maximum number.
So likely you would have to do something like this here:
diff --git a/mm/mremap.c b/mm/mremap.c
index 677a4d744df9c..58f9cf52eb6bd 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -174,16 +174,7 @@ static pte_t move_soft_dirty_pte(pte_t pte)
static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
		pte_t *ptep, pte_t pte, int max_nr)
{
- struct folio *folio;
-
- if (max_nr == 1)
- return 1;
-
- folio = vm_normal_folio(vma, addr, pte);
- if (!folio || !folio_test_large(folio))
- return 1;
-
- return folio_pte_batch(folio, ptep, pte, max_nr);
+ return min_t(unsigned int, max_nr, pte_batch_hint(ptep, pte));
}
static int move_ptes(struct pagetable_move_control *pmc,
And make sure that the compiler realizes that max_nr >= 1 and optimizes
away the min_t on !arm64.
>
> (note that a bit grossly we'll call it _again_ in folio_pte_batch_flags()).
>
> I suppose we could not even bother with checking if same folio and _just_ check
> if PTEs have consecutive PFNs, which is not very likely if different folio
> but... could that break something?
>
> It seems the 'magic' is in set_ptes() on arm64 where it'll know to do the 'right
> thing' for a contPTE batch (I may be missing something - please correct me if so
> Dev/Ryan).
>
> So actually do we even really care that much about folio?
I don't think so. Not in this case here where we don't use the folio for
anything else.
--
Cheers,
David / dhildenb
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 19:20 ` Lorenzo Stoakes
2025-08-07 19:41 ` David Hildenbrand
@ 2025-08-07 19:56 ` Ryan Roberts
2025-08-07 20:58 ` Lorenzo Stoakes
1 sibling, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2025-08-07 19:56 UTC (permalink / raw)
To: Lorenzo Stoakes, David Hildenbrand
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Barry Song, Dev Jain, linux-mm, linux-kernel
On 07/08/2025 20:20, Lorenzo Stoakes wrote:
> +cc Ryan for ContPTE stuff.
Apologies, I was aware of the other thread and ongoing issues but haven't had
the bandwidth to follow too closely.
>
> On Thu, Aug 07, 2025 at 09:10:52PM +0200, David Hildenbrand wrote:
>> Acked-by: David Hildenbrand <david@redhat.com>
>
> Thanks!
>
>>
>> Wondering whether we could then just use the patch hint instead of going via
>> the folio.
>>
>> IOW,
>>
>> return pte_batch_hint(ptep, pte);
>
> Wouldn't that break the A/D stuff? Also this doesn't mean that the PTE won't
> have some conflicting flags potentially. The check is empirical:
>
> static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
> {
> if (!pte_valid_cont(pte))
> return 1;
>
> return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
> }
>
> So it's 'the most number of PTEs that _might_ coalesce'.
No, that's not correct; it's "at least this number of ptes _do_ coalesce".
folio_pte_batch() may end up returning a larger batch, but never smaller.
This function is looking to see if ptep is inside a contpte mapping, and if it
is, it's returning the number of ptes to the end of the contpte mapping (which
is of 64K size and alignment on 4K kernels). A contpte mapping will only exist
if the physical memory is appropriately aligned/sized and all belongs to a
single folio.
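As a worked illustration of the arithmetic - a minimal standalone userspace
sketch (assuming CONT_PTES == 16, i.e. 4K pages with 64K contpte blocks, and
8-byte PTEs; not kernel code):

#include <stdio.h>

#define CONT_PTES 16	/* assumption: 4K pages, 64K contpte blocks */

/*
 * Mirrors the arm64 expression: ptep >> 3 turns the pointer into an
 * 8-byte-entry index, masking by (CONT_PTES - 1) gives the position
 * within the contpte block, and subtracting from CONT_PTES yields the
 * number of entries from here to the end of that block.
 */
static unsigned int hint_at(unsigned long ptep)
{
	return CONT_PTES - ((ptep >> 3) & (CONT_PTES - 1));
}

int main(void)
{
	printf("%u\n", hint_at(5 * 8));	/* 5 entries in: batches the remaining 11 */
	printf("%u\n", hint_at(0));	/* block-aligned: batches the full 16 */
	return 0;
}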
>
> (note that a bit grossly we'll call it _again_ in folio_pte_batch_flags()).
>
> I suppose we could not even bother with checking if same folio and _just_ check
> if PTEs have consecutive PFNs, which is not very likely if different folio
> but... could that break something?
Yes something could break; the batch must *all* belong to the same folio.
Functions like set_ptes() require that in their documentation, and arm64 depends
upon it in order not to screw up the access/dirty bits.
>
> It seems the 'magic' is in set_ptes() on arm64 where it'll know to do the 'right
> thing' for a contPTE batch (I may be missing something - please correct me if so
> Dev/Ryan).
It will all do the right thing functionally no matter how you call it. But if
you can set_ptes() (and friends) on full contpte mappings, things are more
efficient.
>
> So actually do we even really care that much about folio?
From arm64's perspective, we're happy enough with batches the size of
pte_batch_hint(). folio_pte_batch() is a bonus, but certainly not a deal-breaker
for this location.
For the record, I'm pretty sure I was the person pushing for protecting
vm_normal_folio() with pte_batch_hint() right at the start of this process :)
Thanks,
Ryan
>
>>
>>
>> Not sure if that was discussed at some point before we went into the
>> direction of using folios. But there really doesn't seem to be anything
>> gained for other architectures here (as raised by Jann).
>
> Yup... I wonder about the other instances of this... ruh roh.
IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
already needed the folio. I haven't actually looked at how mprotect ended up,
but maybe worth checking to see if it should protect with pte_batch_hint() too.
Thanks,
Ryan
>
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
>>
>
> Cheers, Lorenzo
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 19:41 ` David Hildenbrand
@ 2025-08-07 20:11 ` Lorenzo Stoakes
2025-08-07 21:01 ` Lorenzo Stoakes
0 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 20:11 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Barry Song, Dev Jain, linux-mm, linux-kernel,
Ryan Roberts
On Thu, Aug 07, 2025 at 09:41:24PM +0200, David Hildenbrand wrote:
> On 07.08.25 21:20, Lorenzo Stoakes wrote:
> > +cc Ryan for ContPTE stuff.
> >
> > On Thu, Aug 07, 2025 at 09:10:52PM +0200, David Hildenbrand wrote:
> > > Acked-by: David Hildenbrand <david@redhat.com>
> >
> > Thanks!
> >
> > >
> > > Wondering whether we could then just use the patch hint instead of going via
> > > the folio.
> > >
> > > IOW,
> > >
> > > return pte_batch_hint(ptep, pte);
> >
> > Wouldn't that break the A/D stuff? Also this doesn't mean that the PTE won't
> > have some conflicting flags potentially. The check is empirical:
> >
> > static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
> > {
> > if (!pte_valid_cont(pte))
> > return 1;
> >
> > return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
> > }
> >
> > So it's 'the most number of PTEs that _might_ coalesce'.
>
> No. If the bit is set, all PTEs in the aligned range (e.g., 64 KiB block)
> are coalesced. It's literally the bit telling the hardware that it can
> coalesce in that range because all PTEs are alike.
Sigh. So this is just yet another horribly named function then. I was pretty
certain somebody explained it to me this way, but it's another reminder to never
trust anything you're told and to check everything...
My understanding of the word 'hint' does not align with what this function
does... perhaps there's some deeper meaning I'm missing...?
>
> The function tells you exactly how many PTEs you can batch from the given
> PTEP in that 64 KiB block.
>
> That's also why folio_pte_batch_flags() just jumps over that.
It would still be the case if it were the maximum it _could_ be, provided you
could ascertain whether it actually _was_. But of course we don't do that, indeed.
>
> All you have to do is limit it by the maximum number.
>
> So likely you would have to do here
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 677a4d744df9c..58f9cf52eb6bd 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -174,16 +174,7 @@ static pte_t move_soft_dirty_pte(pte_t pte)
> static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> 		pte_t *ptep, pte_t pte, int max_nr)
> {
> - struct folio *folio;
> -
> - if (max_nr == 1)
> - return 1;
> -
> - folio = vm_normal_folio(vma, addr, pte);
> - if (!folio || !folio_test_large(folio))
> - return 1;
> -
> - return folio_pte_batch(folio, ptep, pte, max_nr);
> + return min_t(unsigned int, max_nr, pte_batch_hint(ptep, pte));
> }
Right except you're ignoring A/D bits no? Or what's the point of
folio_pte_batch()?
Why don't they matter here? I thought they did?
>
> static int move_ptes(struct pagetable_move_control *pmc,
>
>
> And make sure that the compiler realizes that max_nr >= 1 and optimized away
> the min_t on !arm64.
>
> >
> > (note that a bit grossly we'll call it _again_ in folio_pte_batch_flags()).
> >
> > I suppose we could not even bother with checking if same folio and _just_ check
> > if PTEs have consecutive PFNs, which is not very likely if different folio
> > but... could that break something?
> >
> > It seems the 'magic' is in set_ptes() on arm64 where it'll know to do the 'right
> > thing' for a contPTE batch (I may be missing something - please correct me if so
> > Dev/Ryan).
> >
> > So actually do we even really care that much about folio?
>
> I don't think so. Not in this case here where we don't use the folio for
> anything else.
I mean my suggestion is that we don't actually need the folio at all - it's
very unlikely we'll get contiguous PFNs across different folios. So we could
have a version of folio_pte_batch_flags() that doesn't need the folio...
Anyway, it strikes me that all this is stuff we should look at after the
hotfix - better to get this landed so the regression is resolved.
>
>
> --
> Cheers,
>
> David / dhildenb
>
Cheers, Lorenzo
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 19:56 ` Ryan Roberts
@ 2025-08-07 20:58 ` Lorenzo Stoakes
2025-08-08 5:18 ` Dev Jain
2025-08-08 7:19 ` Ryan Roberts
0 siblings, 2 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 20:58 UTC (permalink / raw)
To: Ryan Roberts
Cc: David Hildenbrand, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Barry Song, Dev Jain,
linux-mm, linux-kernel
On Thu, Aug 07, 2025 at 08:56:44PM +0100, Ryan Roberts wrote:
> On 07/08/2025 20:20, Lorenzo Stoakes wrote:
> > +cc Ryan for ContPTE stuff.
>
> Appologies, I was aware of the other thread and on-going issues but haven't had
> the bandwidth to follow too closely.
>
> >
> > On Thu, Aug 07, 2025 at 09:10:52PM +0200, David Hildenbrand wrote:
> >> Acked-by: David Hildenbrand <david@redhat.com>
> >
> > Thanks!
> >
> >>
> >> Wondering whether we could then just use the patch hint instead of going via
> >> the folio.
> >>
> >> IOW,
> >>
> >> return pte_batch_hint(ptep, pte);
> >
> > Wouldn't that break the A/D stuff? Also this doesn't mean that the PTE won't
> > have some conflicting flags potentially. The check is empirical:
> >
> > static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
> > {
> > if (!pte_valid_cont(pte))
> > return 1;
> >
> > return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
> > }
> >
> > So it's 'the most number of PTEs that _might_ coalesce'.
>
> No that's not correct; It's "at least this number of ptes _do_ coalesce".
> folio_pte_batch() may end up returning a larger batch, but never smaller.
Yup David explained.
I suggest you rename this from 'hint', because that's not what hint means
:) unless I'm really misunderstanding what this word means (it's 10pm and I
started work at 6am so it's currently rather possible).
I understand the contPTE bit is a 'hint', but I recall you saying at
LSF/MM 'modern CPUs take the hint', which presumably is where this comes
from - but that's kinda deceptive.
Anyway, the reason I was emphatic here is that I believe I had this
explained to me this way, which obviously I or whoever it was (don't
recall) must have misunderstood. Or perhaps I hallucinated it... :)
I see that folio_pte_batch() can get _more_, is this on the basis of there
being adjacent, physically contiguous contPTE entries that can also be
batched up?
>
> This function is looking to see if ptep is inside a conpte mapping, and if it
> is, it's returning the number of ptes to the end of the contpte mapping (which
> is of 64K size and alignment on 4K kernels). A contpte mapping will only exist
> if the physical memory is appropriately aligned/sized and all belongs to a
> single folio.
>
> >
> > (note that a bit grossly we'll call it _again_ in folio_pte_batch_flags()).
> >
> > I suppose we could not even bother with checking if same folio and _just_ check
> > if PTEs have consecutive PFNs, which is not very likely if different folio
> > but... could that break something?
>
> Yes something could break; the batch must *all* belong to the same folio.
> Functions like set_ptes() require that in their documentation, and arm64 depends
> upon it in order not to screw up the access/dirty bits.
Turning this around - is a cont pte range guaranteed to belong to only one
folio?
If so then we can just limit the range to one batched block for the sake of
mremap that perhaps doesn't necessarily hugely benefit from further
batching anyway?
Let's take the time to check performance on arm64 hardware.
Are we able to check how things behave if we have small folios only
in the tested range on arm64?
>
> >
> > It seems the 'magic' is in set_ptes() on arm64 where it'll know to do the 'right
> > thing' for a contPTE batch (I may be missing something - please correct me if so
> > Dev/Ryan).
>
> It will all do the right thing functionally no matter how you call it. But if
> you can set_ptes() (and friends) on full contpte mappings, things are more
> efficient.
Yup this is what I was... hinting at ;)
>
> >
> > So actually do we even really care that much about folio?
>
> From arm64's perspective, we're happy enough with batches the size of
> pte_batch_hint(). folio_pte_batch() is a bonus, but certainly not a deal-breaker
> for this location.
OK, so I think we should definitely refactor this.
David pointed out off-list we are duplicating the A/D handling _anyway_ in
get_and_clear_ptes(). So that bit is just wasted effort - there's really
no need to worry much about that.
>
> For the record, I'm pretty sure I was the person pushing for protecting
> vm_normal_folio() with pte_batch_hint() right at the start of this process :)
I think you didn't give your hint clearly enough ;)
>
> Thanks,
> Ryan
>
> >
> >>
> >>
> >> Not sure if that was discussed at some point before we went into the
> >> direction of using folios. But there really doesn't seem to be anything
> >> gained for other architectures here (as raised by Jann).
> >
> > Yup... I wonder about the other instances of this... ruh roh.
>
> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
> already needed the folio. I haven't actually looked at how mprotect ended up,
> but maybe worth checking to see if it should protect with pte_batch_hint() too.
mprotect didn't? I mean let's check.
We definitely need to be careful about other arches.
>
> Thanks,
> Ryan
Cheers, Lorenzo
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 20:11 ` Lorenzo Stoakes
@ 2025-08-07 21:01 ` Lorenzo Stoakes
0 siblings, 0 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 21:01 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Barry Song, Dev Jain, linux-mm, linux-kernel,
Ryan Roberts
On Thu, Aug 07, 2025 at 09:11:12PM +0100, Lorenzo Stoakes wrote:
> Right except you're ignoring A/D bits no? Or what's the point of
> folio_pte_batch()?
>
> Why don't they matter here? I thought they did?
(Discussed with David off-list - it turns out get_and_clear_ptes() handles this
anyway, so it's not a concern).
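(For reference - a paraphrased sketch of that helper; the key property is that
the dirty/accessed bits of every cleared entry are folded into the returned
PTE, so batch-level A/D state is preserved without us tracking it here:)

/*
 * Sketch, paraphrased from include/linux/pgtable.h: clear nr consecutive
 * PTEs and return the first entry with the dirty/accessed bits of all
 * cleared entries accumulated.
 */
static inline pte_t get_and_clear_ptes(struct mm_struct *mm, unsigned long addr,
		pte_t *ptep, unsigned int nr)
{
	return get_and_clear_full_ptes(mm, addr, ptep, nr, 0);
}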
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 20:58 ` Lorenzo Stoakes
@ 2025-08-08 5:18 ` Dev Jain
2025-08-08 7:19 ` Ryan Roberts
1 sibling, 0 replies; 26+ messages in thread
From: Dev Jain @ 2025-08-08 5:18 UTC (permalink / raw)
To: Lorenzo Stoakes, Ryan Roberts
Cc: David Hildenbrand, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Barry Song, linux-mm,
linux-kernel
On 08/08/25 2:28 am, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 08:56:44PM +0100, Ryan Roberts wrote:
>> On 07/08/2025 20:20, Lorenzo Stoakes wrote:
>>> +cc Ryan for ContPTE stuff.
>> Appologies, I was aware of the other thread and on-going issues but haven't had
>> the bandwidth to follow too closely.
>>
>>> On Thu, Aug 07, 2025 at 09:10:52PM +0200, David Hildenbrand wrote:
>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>> Thanks!
>>>
>>>> Wondering whether we could then just use the patch hint instead of going via
>>>> the folio.
>>>>
>>>> IOW,
>>>>
>>>> return pte_batch_hint(ptep, pte);
>>> Wouldn't that break the A/D stuff? Also this doesn't mean that the PTE won't
>>> have some conflicting flags potentially. The check is empirical:
>>>
>>> static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
>>> {
>>> if (!pte_valid_cont(pte))
>>> return 1;
>>>
>>> return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
>>> }
>>>
>>> So it's 'the most number of PTEs that _might_ coalesce'.
>> No that's not correct; It's "at least this number of ptes _do_ coalesce".
>> folio_pte_batch() may end up returning a larger batch, but never smaller.
> Yup David explained.
>
> I suggest you rename this from 'hint', because that's not what hint means
> :) unless I'm really misunderstanding what this word means (it's 10pm and I
> started work at 6am so it's currently rather possible).
>
> I understand the con PTE bit is a 'hint' but as I recall you saying at
> LSF/MM 'modern CPUs take the hint'. Which presumably is where this comes
> from, but that's kinda deceptive.
>
> Anyway the reason I was emphatic here is on the basis that I believe I had
> this explained to met his way, which obviously I or whoever it was (don't
> recall) must have misunderstood. Or perhaps I hallucinated it... :)
>
> I see that folio_pte_batch() can get _more_, is this on the basis of there
> being adjacent, physically contiguous contPTE entries that can also be
> batched up?
>
>> This function is looking to see if ptep is inside a conpte mapping, and if it
>> is, it's returning the number of ptes to the end of the contpte mapping (which
>> is of 64K size and alignment on 4K kernels). A contpte mapping will only exist
>> if the physical memory is appropriately aligned/sized and all belongs to a
>> single folio.
>>
>>> (note that a bit grossly we'll call it _again_ in folio_pte_batch_flags()).
>>>
>>> I suppose we could not even bother with checking if same folio and _just_ check
>>> if PTEs have consecutive PFNs, which is not very likely if different folio
>>> but... could that break something?
>> Yes something could break; the batch must *all* belong to the same folio.
>> Functions like set_ptes() require that in their documentation, and arm64 depends
>> upon it in order not to screw up the access/dirty bits.
> Turning this around - is a cont pte range guaranteed to belong to only one
> folio?
>
> If so then we can just limit the range to one batched block for the sake of
> mremap that perhaps doesn't necessarily hugely benefit from further
> batching anyway?
>
> Let's take the time to check performance on arm64 hardware.
>
> Are we able to check to see how things behave if we have small folios only
> in the tested range on arm64
>
>>> It seems the 'magic' is in set_ptes() on arm64 where it'll know to do the 'right
>>> thing' for a contPTE batch (I may be missing something - please correct me if so
>>> Dev/Ryan).
>> It will all do the right thing functionally no matter how you call it. But if
>> you can set_ptes() (and friends) on full contpte mappings, things are more
>> efficient.
> Yup this is what I was... hinting at ;)
>
>>> So actually do we even really care that much about folio?
>> From arm64's perspective, we're happy enough with batches the size of
>> pte_batch_hint(). folio_pte_batch() is a bonus, but certainly not a deal-breaker
>> for this location.
> OK, so I think we should definitely refactor this.
>
> David pointed out off-list we are duplicating the a/d handing _anyway_ in
> get_and_clear_ptes(). So that bit is just wasted effort, so there's really
> no need to do much that.
>
>> For the record, I'm pretty sure I was the person pushing for protecting
>> vm_normal_folio() with pte_batch_hint() right at the start of this process :)
> I think you didn't give your hint clearly enough ;)
>
>> Thanks,
>> Ryan
>>
>>>>
>>>> Not sure if that was discussed at some point before we went into the
>>>> direction of using folios. But there really doesn't seem to be anything
>>>> gained for other architectures here (as raised by Jann).
>>> Yup... I wonder about the other instances of this... ruh roh.
>> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
>> already needed the folio. I haven't actually looked at how mprotect ended up,
>> but maybe worth checking to see if it should protect with pte_batch_hint() too.
> mprotect didn't? I mean let's check.
Yeah, it didn't - it took the folio only for the prot_numa case. For that reason I had
first come up with that maybe_contiguous_pte_pfns() [1] thingy for mremap (a sketch
follows below); not sure how useful that is - it will potentially increase ptep_get()
calls and also won't work for large folios split into small folios, since those will
have contiguous PFNs but aren't useful for batching.
But I think even that won't work - the regression we see here is because batching
isn't actually saving anything here, as Jann mentions, so maybe_contiguous_pte_pfns()
would still regress for large folios due to retrieving the folio.
So fundamentally the optimization to be made in this specific case is arm64-only -
saving on ptep_get() calls and TLB flushes.
[1] https://lore.kernel.org/all/20250507060256.78278-3-dev.jain@arm.com/
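Roughly, the idea in [1] was along these lines - a hypothetical sketch, with
the name and shape assumed for illustration rather than quoted from the posted
patch:

/*
 * Hypothetical sketch of the [1] approach: peek at the next PTE and only
 * do the folio lookup when the PFNs are adjacent, at the cost of an extra
 * ptep_get(). As noted above, split large folios defeat this, since
 * adjacent PFNs don't imply a batchable (same-folio) range.
 */
static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
{
	pte_t next = ptep_get(ptep + 1);

	return pte_present(next) && pte_pfn(next) == pte_pfn(pte) + 1;
}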
>
> We definitely need to be careful about other arches.
>
>> Thanks,
>> Ryan
> Cheers, Lorenzo
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 18:58 [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch Lorenzo Stoakes
2025-08-07 19:10 ` David Hildenbrand
2025-08-07 19:14 ` Pedro Falcato
@ 2025-08-08 5:19 ` Dev Jain
2025-08-08 9:56 ` Vlastimil Babka
2025-08-11 2:40 ` Barry Song
4 siblings, 0 replies; 26+ messages in thread
From: Dev Jain @ 2025-08-08 5:19 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
Barry Song, linux-mm, linux-kernel, David Hildenbrand
On 08/08/25 12:28 am, Lorenzo Stoakes wrote:
> It was discovered in the attached report that commit f822a9a81a31 ("mm:
> optimize mremap() by PTE batching") introduced a significant performance
> regression on a number of metrics on x86-64, most notably
> stress-ng.bigheap.realloc_calls_per_sec - indicating a 37.3% regression in
> number of mremap() calls per second.
>
> I was able to reproduce this locally on an intel x86-64 raptor lake system,
> noting an average of 143,857 realloc calls/sec (with a stddev of 4,531 or
> 3.1%) prior to this patch being applied, and 81,503 afterwards (stddev of
> 2,131 or 2.6%) - a 43.3% regression.
>
> During testing I was able to determine that there was no meaningful
> difference in efforts to optimise the folio_pte_batch() operation, nor
> checking folio_test_large().
>
> This is within expectation, as a regression this large is likely to
> indicate we are accessing memory that is not yet in a cache line (and
> perhaps may even cause a main memory fetch).
>
> The expectation by those discussing this from the start was that
> vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the
> culprit due to having to retrieve memory from the vmemmap (which mremap()
> page table moves does not otherwise do, meaning this is inevitably cold
> memory).
>
> I was able to definitively determine that this theory is indeed correct and
> the cause of the issue.
>
> The solution is to restore part of an approach previously discarded on
> review, that is to invoke pte_batch_hint() which explicitly determines,
> through reference to the PTE alone (thus no vmemmap lookup), what the PTE
> batch size may be.
>
> On platforms other than arm64 this is currently hardcoded to return 1, so
> this naturally resolves the issue for x86-64, and for arm64 introduces
> little to no overhead as the pte cache line will be hot.
>
> With this patch applied, we move from 81,503 realloc calls/sec to
> 138,701 (stddev of 496.1 or 0.4%), which is a -3.6% regression, however
> accounting for the variance in the original result, this is broadly
> restoring performance to its prior state.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
> Fixes: f822a9a81a31 ("mm: optimize mremap() by PTE batching")
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/mremap.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 677a4d744df9..9afa8cd524f5 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -179,6 +179,10 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
> if (max_nr == 1)
> return 1;
>
> + /* Avoid expensive folio lookup if we stand no chance of benefit. */
> + if (pte_batch_hint(ptep, pte) == 1)
> + return 1;
> +
> folio = vm_normal_folio(vma, addr, pte);
> if (!folio || !folio_test_large(folio))
> return 1;
> --
> 2.50.1
Thanks for debugging this!
Reviewed-by: Dev Jain <dev.jain@arm.com>
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 20:58 ` Lorenzo Stoakes
2025-08-08 5:18 ` Dev Jain
@ 2025-08-08 7:19 ` Ryan Roberts
2025-08-08 7:45 ` David Hildenbrand
2025-08-08 9:40 ` Lorenzo Stoakes
1 sibling, 2 replies; 26+ messages in thread
From: Ryan Roberts @ 2025-08-08 7:19 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Barry Song, Dev Jain,
linux-mm, linux-kernel
On 07/08/2025 21:58, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 08:56:44PM +0100, Ryan Roberts wrote:
>> On 07/08/2025 20:20, Lorenzo Stoakes wrote:
>>> +cc Ryan for ContPTE stuff.
>>
>> Appologies, I was aware of the other thread and on-going issues but haven't had
>> the bandwidth to follow too closely.
>>
>>>
>>> On Thu, Aug 07, 2025 at 09:10:52PM +0200, David Hildenbrand wrote:
>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>>
>>> Thanks!
>>>
>>>>
>>>> Wondering whether we could then just use the patch hint instead of going via
>>>> the folio.
>>>>
>>>> IOW,
>>>>
>>>> return pte_batch_hint(ptep, pte);
>>>
>>> Wouldn't that break the A/D stuff? Also this doesn't mean that the PTE won't
>>> have some conflicting flags potentially. The check is empirical:
>>>
>>> static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
>>> {
>>> if (!pte_valid_cont(pte))
>>> return 1;
>>>
>>> return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
>>> }
>>>
>>> So it's 'the most number of PTEs that _might_ coalesce'.
>>
>> No that's not correct; It's "at least this number of ptes _do_ coalesce".
>> folio_pte_batch() may end up returning a larger batch, but never smaller.
>
> Yup David explained.
>
> I suggest you rename this from 'hint', because that's not what hint means
> :) unless I'm really misunderstanding what this word means (it's 10pm and I
> started work at 6am so it's currently rather possible).
That's a lot of hours; I certainly appreciate you putting the effort in and
figuring out the root cause so quickly.
Not sure if some sleep has changed your mind on what "hint" means? I'm pretty
sure David named this function, but for me the name makes sense. The arch is
saying "I know that the pte batch is at least N ptes. It's up to you if you use
that information. I'll still work correctly if you ignore it".
For me, your interpretation of 'the most number of PTEs that _might_ coalesce'
would be a guess, not a hint.
>
> I understand the con PTE bit is a 'hint' but as I recall you saying at
> LSF/MM 'modern CPUs take the hint'. Which presumably is where this comes
> from, but that's kinda deceptive.
>
> Anyway the reason I was emphatic here is on the basis that I believe I had
> this explained to met his way, which obviously I or whoever it was (don't
> recall) must have misunderstood. Or perhaps I hallucinated it... :)
FWIW, this is the documentation for the function:
/**
* pte_batch_hint - Number of pages that can be added to batch without scanning.
* @ptep: Page table pointer for the entry.
* @pte: Page table entry.
*
* Some architectures know that a set of contiguous ptes all map the same
* contiguous memory with the same permissions. In this case, it can provide a
* hint to aid pte batching without the core code needing to scan every pte.
*
* An architecture implementation may ignore the PTE accessed state. Further,
* the dirty state must apply atomically to all the PTEs described by the hint.
*
* May be overridden by the architecture, else pte_batch_hint is always 1.
*/
>
> I see that folio_pte_batch() can get _more_, is this on the basis of there
> being adjacent, physically contiguous contPTE entries that can also be
> batched up?
From folio_pte_batch()'s perspective, they just have to be physically contiguous
and part of the same folio; they are not *required* to be contpte. They could
even have different read/write permissions in the middle of the batch (if the
flags permit); that's one reason why such a batch wouldn't be mapped contpte (a
contpte mapping is logically a single mapping so the permissions must all be the
same).
>
>>
>> This function is looking to see if ptep is inside a conpte mapping, and if it
>> is, it's returning the number of ptes to the end of the contpte mapping (which
>> is of 64K size and alignment on 4K kernels). A contpte mapping will only exist
>> if the physical memory is appropriately aligned/sized and all belongs to a
>> single folio.
>>
>>>
>>> (note that a bit grossly we'll call it _again_ in folio_pte_batch_flags()).
>>>
>>> I suppose we could not even bother with checking if same folio and _just_ check
>>> if PTEs have consecutive PFNs, which is not very likely if different folio
>>> but... could that break something?
>>
>> Yes something could break; the batch must *all* belong to the same folio.
>> Functions like set_ptes() require that in their documentation, and arm64 depends
>> upon it in order not to screw up the access/dirty bits.
>
> Turning this around - is a cont pte range guaranteed to belong to only one
> folio?
Yes.
>
> If so then we can just limit the range to one batched block for the sake of
> mremap that perhaps doesn't necessarily hugely benefit from further
> batching anyway?
Yes.
>
> Let's take the time to check performance on arm64 hardware.
>
> Are we able to check to see how things behave if we have small folios only
> in the tested range on arm64
I thought Dev provided numbers for that, but I'll chat with him and ensure we
re-test (and broaden the testing if needed) with the new patch.
>
>>
>>>
>>> It seems the 'magic' is in set_ptes() on arm64 where it'll know to do the 'right
>>> thing' for a contPTE batch (I may be missing something - please correct me if so
>>> Dev/Ryan).
>>
>> It will all do the right thing functionally no matter how you call it. But if
>> you can set_ptes() (and friends) on full contpte mappings, things are more
>> efficient.
>
> Yup this is what I was... hinting at ;)
Bravo.
>
>>
>>>
>>> So actually do we even really care that much about folio?
>>
>> From arm64's perspective, we're happy enough with batches the size of
>> pte_batch_hint(). folio_pte_batch() is a bonus, but certainly not a deal-breaker
>> for this location.
>
> OK, so I think we should definitely refactor this.
>
> David pointed out off-list we are duplicating the a/d handing _anyway_ in
> get_and_clear_ptes(). So that bit is just wasted effort, so there's really
> no need to do much that.
>
>>
>> For the record, I'm pretty sure I was the person pushing for protecting
>> vm_normal_folio() with pte_batch_hint() right at the start of this process :)
>
> I think you didn't give your hint clearly enough ;)
>
>>
>> Thanks,
>> Ryan
>>
>>>
>>>>
>>>>
>>>> Not sure if that was discussed at some point before we went into the
>>>> direction of using folios. But there really doesn't seem to be anything
>>>> gained for other architectures here (as raised by Jann).
>>>
>>> Yup... I wonder about the other instances of this... ruh roh.
>>
>> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
>> already needed the folio. I haven't actually looked at how mprotect ended up,
>> but maybe worth checking to see if it should protect with pte_batch_hint() too.
>
> mprotect didn't? I mean let's check.
I think for mprotect, the folio was only previously needed for the numa case. I
have a vague memory that either Dev or I proposed wrapping folio_pte_batch() to
only get the folio and call it if the next PTE had an adjacent PFN (or something
like that). But it was deemed too complex. I might be misremembering... could
have been an internal conversation. I'll chat with Dev about it and revisit.
Thanks,
Ryan
>
> We definitely need to be careful about other arches.
>
>>
>> Thanks,
>> Ryan
>
> Cheers, Lorenzo
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-08 7:19 ` Ryan Roberts
@ 2025-08-08 7:45 ` David Hildenbrand
2025-08-08 7:56 ` Ryan Roberts
2025-08-08 9:45 ` Lorenzo Stoakes
2025-08-08 9:40 ` Lorenzo Stoakes
1 sibling, 2 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-08-08 7:45 UTC (permalink / raw)
To: Ryan Roberts, Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Barry Song, Dev Jain, linux-mm, linux-kernel
>
> Not sure if some sleep has changed your mind on what "hint" means? I'm pretty
> sure David named this function, but for me the name makes sense. The arch is
> saying "I know that the pte batch is at least N ptes. It's up to you if you use
> that information. I'll still work correctly if you ignore it".
The last one is the important bit I think.
>
> For me, your interpretation of 'the most number of PTEs that _might_ coalesce'
> would be a guess, not a hint.
I'm not a native speaker, so I'll let both of you figure that out. To me
it makes sense as well ... but well, I was involved when creating that
function. :)
>
>>
>> I understand the con PTE bit is a 'hint' but as I recall you saying at
>> LSF/MM 'modern CPUs take the hint'. Which presumably is where this comes
>> from, but that's kinda deceptive.
>>
>> Anyway the reason I was emphatic here is on the basis that I believe I had
>> this explained to met his way, which obviously I or whoever it was (don't
>> recall) must have misunderstood. Or perhaps I hallucinated it... :)
>
> FWIW, this is the documentation for the function:
>
> /**
> * pte_batch_hint - Number of pages that can be added to batch without scanning.
> * @ptep: Page table pointer for the entry.
> * @pte: Page table entry.
> *
> * Some architectures know that a set of contiguous ptes all map the same
> * contiguous memory with the same permissions. In this case, it can provide a
> * hint to aid pte batching without the core code needing to scan every pte.
> *
> * An architecture implementation may ignore the PTE accessed state. Further,
> * the dirty state must apply atomically to all the PTEs described by the hint.
> *
> * May be overridden by the architecture, else pte_batch_hint is always 1.
> */
It's actually ... surprisingly good after reading it again after at
least a year.
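For reference, the generic fallback simply returns 1, while arm64 derives the
hint from the PTE and its position alone - roughly the following (paraphrased
from include/linux/pgtable.h and arch/arm64/include/asm/pgtable.h; exact
details may vary by kernel version):

/* Generic fallback: the arch has no batching knowledge, so never hint. */
#ifndef pte_batch_hint
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	return 1;
}
#endif

/*
 * arm64: if the contig bit is set, report the number of PTEs remaining in
 * this CONT_PTES-aligned block - computed purely from the entry and the
 * pointer, with no folio/vmemmap access.
 */
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	if (!pte_valid_cont(pte))
		return 1;

	return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
}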
>
>>
>> I see that folio_pte_batch() can get _more_, is this on the basis of there
>> being adjacent, physically contiguous contPTE entries that can also be
>> batched up?
[...]
>>>>
>>>>>
>>>>>
>>>>> Not sure if that was discussed at some point before we went into the
>>>>> direction of using folios. But there really doesn't seem to be anything
>>>>> gained for other architectures here (as raised by Jann).
>>>>
>>>> Yup... I wonder about the other instances of this... ruh roh.
>>>
>>> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
>>> already needed the folio. I haven't actually looked at how mprotect ended up,
>>> but maybe worth checking to see if it should protect with pte_batch_hint() too.
>>
>> mprotect didn't? I mean let's check.
>
> I think for mprotect, the folio was only previously needed for the numa case. I
> have a vague memory that either Dev or I proposed wrapping folio_pte_batch() to
> only get the folio and call it if the next PTE had an adjacent PFN (or something
> like that). But it was deemed too complex. I might be misremembering... could
> have been an internal conversation. I'll chat with Dev about it and revisit.
>
I am probably to blame here, because I think I rejected an arm64-only
optimization early on, assuming other arches could benefit from batching
here as well. But as it turns out, batching in the mremap() code really
only serves the cont-pte managing code, and the folio_pte_batch() is
entirely unnecessary.
In case of mprotect(), I think really only (a) NUMA and (b) anon-folio
write-upgrade required the folio. So it's a bit more tricky than
mremap() here where ... the folio is entirely irrelevant.
One could detect the "anon write-upgrade possible" case early as well,
and only look up the folio in that case, and otherwise use the straight pte hint.
So I think there is some room for further improvement.
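To sketch the shape this could take (hypothetical only - the helper name and
the need_folio parameter are illustrative, this is not the actual mprotect
code):

/* Hypothetical sketch: gate the folio lookup as the mremap hotfix does. */
static int prot_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
				pte_t *ptep, pte_t pte, int max_nr,
				bool need_folio)
{
	struct folio *folio;

	if (max_nr == 1)
		return 1;

	/*
	 * If neither NUMA nor anon write-upgrade needs the folio, and there is
	 * no contpte hint, there is no chance of a batch - skip the cold
	 * vmemmap access entirely.
	 */
	if (!need_folio && pte_batch_hint(ptep, pte) == 1)
		return 1;

	folio = vm_normal_folio(vma, addr, pte);
	if (!folio || !folio_test_large(folio))
		return 1;

	return folio_pte_batch(folio, ptep, pte, max_nr);
}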
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-08 7:45 ` David Hildenbrand
@ 2025-08-08 7:56 ` Ryan Roberts
2025-08-08 8:44 ` Dev Jain
2025-08-08 9:45 ` Lorenzo Stoakes
1 sibling, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2025-08-08 7:56 UTC (permalink / raw)
To: David Hildenbrand, Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Barry Song, Dev Jain, linux-mm, linux-kernel
On 08/08/2025 08:45, David Hildenbrand wrote:
>>
>> Not sure if some sleep has changed your mind on what "hint" means? I'm pretty
>> sure David named this function, but for me the name makes sense. The arch is
>> saying "I know that the pte batch is at least N ptes. It's up to you if you use
>> that information. I'll still work correctly if you ignore it".
>
> The last one is the important bit I think.
>
>>
>> For me, your interpretation of 'the most number of PTEs that _might_ coalesce'
>> would be a guess, not a hint.
>
> I'm not a native speaker, so I'll let both of you figure that out. To me it
> makes sense as well ... but well, I was involved when creating that function. :)
>
>>
>>>
>>> I understand the cont PTE bit is a 'hint' but as I recall you saying at
>>> LSF/MM 'modern CPUs take the hint'. Which presumably is where this comes
>>> from, but that's kinda deceptive.
>>>
>>> Anyway the reason I was emphatic here is on the basis that I believe I had
>>> this explained to me this way, which obviously I or whoever it was (don't
>>> recall) must have misunderstood. Or perhaps I hallucinated it... :)
>>
>> FWIW, this is the documentation for the function:
>>
>> /**
>> * pte_batch_hint - Number of pages that can be added to batch without scanning.
>> * @ptep: Page table pointer for the entry.
>> * @pte: Page table entry.
>> *
>> * Some architectures know that a set of contiguous ptes all map the same
>> * contiguous memory with the same permissions. In this case, it can provide a
>> * hint to aid pte batching without the core code needing to scan every pte.
>> *
>> * An architecture implementation may ignore the PTE accessed state. Further,
>> * the dirty state must apply atomically to all the PTEs described by the hint.
>> *
>> * May be overridden by the architecture, else pte_batch_hint is always 1.
>> */
>
> It's actually ... surprisingly good after reading it again after at least a year.
>
>>
>>>
>>> I see that folio_pte_batch() can get _more_, is this on the basis of there
>>> being adjacent, physically contiguous contPTE entries that can also be
>>> batched up?
>
> [...]
>
>>>>>
>>>>>>
>>>>>>
>>>>>> Not sure if that was discussed at some point before we went into the
>>>>>> direction of using folios. But there really doesn't seem to be anything
>>>>>> gained for other architectures here (as raised by Jann).
>>>>>
>>>>> Yup... I wonder about the other instances of this... ruh roh.
>>>>
>>>> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
>>>> already needed the folio. I haven't actually looked at how mprotect ended up,
>>>> but maybe worth checking to see if it should protect with pte_batch_hint() too.
>>>
>>> mprotect didn't? I mean let's check.
>>
>> I think for mprotect, the folio was only previously needed for the numa case. I
>> have a vague memory that either Dev or I proposed wrapping folio_pte_batch() to
>> only get the folio and call it if the next PTE had an adjacent PFN (or something
>> like that). But it was deemed too complex. I might be misremembering... could
>> have been an internal conversation. I'll chat with Dev about it and revisit.
>>
>
> I am probably to blame here, because I think I rejected an arm64-only
> optimization early on, assuming other arches could benefit from batching here
> as well. But as it turns out, batching in the mremap() code really only serves
> the cont-pte managing code, and the folio_pte_batch() is entirely unnecessary.
>
> In case of mprotect(), I think really only (a) NUMA and (b) anon-folio write-
> upgrade required the folio. So it's a bit more tricky than mremap() here
> where ... the folio is entirely irrelevant.
>
> One could detect the "anon write-upgrade possible" case early as well, and only
> look up the folio in that case, and otherwise use the straight pte hint.
>
> So I think there is some room for further improvement.
ACK; Dev, perhaps you can take another look at this and work up a patch to more
aggressively avoid vm_normal_folio() for mprotect?
Thanks,
Ryan
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-08 7:56 ` Ryan Roberts
@ 2025-08-08 8:44 ` Dev Jain
2025-08-08 9:50 ` Lorenzo Stoakes
0 siblings, 1 reply; 26+ messages in thread
From: Dev Jain @ 2025-08-08 8:44 UTC (permalink / raw)
To: Ryan Roberts, David Hildenbrand, Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Barry Song, linux-mm, linux-kernel
On 08/08/25 1:26 pm, Ryan Roberts wrote:
> On 08/08/2025 08:45, David Hildenbrand wrote:
>>> Not sure if some sleep has changed your mind on what "hint" means? I'm pretty
>>> sure David named this function, but for me the name makes sense. The arch is
>>> saying "I know that the pte batch is at least N ptes. It's up to you if you use
>>> that information. I'll still work correctly if you ignore it".
>> The last one is the important bit I think.
>>
>>> For me, your interpretation of 'the most number of PTEs that _might_ coalesce'
>>> would be a guess, not a hint.
>> I'm not a native speaker, so I'll let both of you figure that out. To me it
>> makes sense as well ... but well, I was involved when creating that function. :)
>>
>>>> I understand the cont PTE bit is a 'hint' but as I recall you saying at
>>>> LSF/MM 'modern CPUs take the hint'. Which presumably is where this comes
>>>> from, but that's kinda deceptive.
>>>>
>>>> Anyway the reason I was emphatic here is on the basis that I believe I had
>>>> this explained to me this way, which obviously I or whoever it was (don't
>>>> recall) must have misunderstood. Or perhaps I hallucinated it... :)
>>> FWIW, this is the documentation for the function:
>>>
>>> /**
>>> * pte_batch_hint - Number of pages that can be added to batch without scanning.
>>> * @ptep: Page table pointer for the entry.
>>> * @pte: Page table entry.
>>> *
>>> * Some architectures know that a set of contiguous ptes all map the same
>>> * contiguous memory with the same permissions. In this case, it can provide a
>>> * hint to aid pte batching without the core code needing to scan every pte.
>>> *
>>> * An architecture implementation may ignore the PTE accessed state. Further,
>>> * the dirty state must apply atomically to all the PTEs described by the hint.
>>> *
>>> * May be overridden by the architecture, else pte_batch_hint is always 1.
>>> */
>> It's actually ... surprisingly good after reading it again after at least a year.
>>
>>>> I see that folio_pte_batch() can get _more_, is this on the basis of there
>>>> being adjacent, physically contiguous contPTE entries that can also be
>>>> batched up?
>> [...]
>>
>>>>>>>
>>>>>>> Not sure if that was discussed at some point before we went into the
>>>>>>> direction of using folios. But there really doesn't seem to be anything
>>>>>>> gained for other architectures here (as raised by Jann).
>>>>>> Yup... I wonder about the other instances of this... ruh roh.
>>>>> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
>>>>> already needed the folio. I haven't actually looked at how mprotect ended up,
>>>>> but maybe worth checking to see if it should protect with pte_batch_hint() too.
>>>> mprotect didn't? I mean let's check.
>>> I think for mprotect, the folio was only previously needed for the numa case. I
>>> have a vague memory that either Dev or I proposed wrapping folio_pte_batch() to
>>> only get the folio and call it if the next PTE had an adjacent PFN (or something
>>> like that). But it was deemed too complex. I might be misremembering... could
>>> have been an internal conversation. I'll chat with Dev about it and revisit.
>>>
>> I am probably to blame here, because I think I rejected an arm64-only
>> optimization early on, assuming other arches could benefit from batching here
>> as well. But as it turns out, batching in the mremap() code really only serves
>> the cont-pte managing code, and the folio_pte_batch() is entirely unnecessary.
>>
>> In case of mprotect(), I think really only (a) NUMA and (b) anon-folio write-
>> upgrade required the folio. So it's a bit more tricky than mremap() here
>> where ... the folio is entirely irrelevant.
>>
>> One could detect the "anon write-upgrade possible" case early as well, and only
>> look up the folio in that case, and otherwise use the straight pte hint.
>>
>> So I think there is some room for further improvement.
> ACK; Dev, perhaps you can take another look at this and work up a patch to more
> aggressively avoid vm_normal_folio() for mprotect?
Yup, I'll investigate this in a few weeks' time perhaps - try to use pte_batch_hint(),
and when we have to unconditionally retrieve the folio, then use that instead.
I'll also look into whether, even for arm64, there is any use in retrieving
the folio at all - the only benefit we get is to batch across contig blocks, but I don't
think there are any atomic operations (refcount/mapcount manipulation) which will be saved.
By batching across a single contig block we save on ptep_get calls and TLBIs, which was
our objective.
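(For context, the consuming loop in move_ptes() is shaped roughly as below - a
heavily simplified sketch that elides the pte_none()/non-present handling and
locking; the names follow the actual code but this is not the literal source:)

	int nr = 1;

	for (; old_addr < old_end; old_ptep += nr, old_addr += nr * PAGE_SIZE,
				   new_ptep += nr, new_addr += nr * PAGE_SIZE) {
		int max_nr = (old_end - old_addr) >> PAGE_SHIFT;
		pte_t pte = ptep_get(old_ptep);

		nr = mremap_folio_pte_batch(vma, old_addr, old_ptep, pte, max_nr);
		/* One ptep_get(), one clear and one set per batch rather than
		 * per PTE - and on arm64 a contpte block moves as a whole
		 * instead of being unfolded and refolded. */
		pte = get_and_clear_ptes(mm, old_addr, old_ptep, nr);
		set_ptes(mm, new_addr, new_ptep, pte, nr);
	}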
>
> Thanks,
> Ryan
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-08 7:19 ` Ryan Roberts
2025-08-08 7:45 ` David Hildenbrand
@ 2025-08-08 9:40 ` Lorenzo Stoakes
1 sibling, 0 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-08 9:40 UTC (permalink / raw)
To: Ryan Roberts
Cc: David Hildenbrand, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Barry Song, Dev Jain,
linux-mm, linux-kernel
On Fri, Aug 08, 2025 at 08:19:47AM +0100, Ryan Roberts wrote:
> > Yup David explained.
> >
> > I suggest you rename this from 'hint', because that's not what hint means
> > :) unless I'm really misunderstanding what this word means (it's 10pm and I
> > started work at 6am so it's currently rather possible).
>
> That's a lot of hours; I certainly appreciate you for putting the effort in and
> figuring out the root cause so quickly.
Thanks. Of course not all those hours were spent on this; it was really my
very sincere attempt to help out Dev, who might not have an x86-64 bare
metal box as readily available as I did (mine is literally, physically next to me).
I'm glad I could help identify this! Jann, Dev and David also gave very
insightful analysis, and all happened to be entirely correct!
>
> Not sure if some sleep has changed your mind on what "hint" means? I'm pretty
> sure David named this function, but for me the name makes sense. The arch is
> saying "I know that the pte batch is at least N ptes. It's up to you if you use
> that information. I'll still work correctly if you ignore it".
>
> For me, your interpretation of 'the most number of PTEs that _might_ coalesce'
> would be a guess, not a hint.
I feel we need to drop this subject for sanity's sake... :)
In my view the crux of this is that a reasonable definition of hint (and
the first that appears on Google) is that _partial_ information is provided,
which I interpreted as a bound.
Of course it's 'partial' in that there may be adjacent physically
contiguous PTE entries, and the whole thing is made murkier by the fact
that the cont PTE bit itself is a hint, which is again presumably why we
named it so.
However, if you, David, Dev and possibly _everybody else_ interpret this
differently, then I'm happy to concede that perhaps it's just me getting hung up
on semantics here.
So I won't insist on this changing, though personally I'd still prefer it.
At any rate what it does is now abundantly clear :)
The comment is good by the way, I kicked myself for only reading it after
the fact.
From my perspective, I think a misinterpreted conversation (from whichever
side) was the underlying issue, and it has again reinforced for me the need
to always work from first principles on review.
[snip]
>
> >
> > I see that folio_pte_batch() can get _more_, is this on the basis of there
> > being adjacent, physically contiguous contPTE entries that can also be
> > batched up?
>
> From folio_pte_batch()'s perspective, they just have to be physically contiguous
> and part of the same folio; they are not *required* to be contpte. They could
> even have different read/write permissions in the middle of the batch (if the
> flags permit); that's one reason why such a batch wouldn't be mapped contpte (a
> contpte mapping is logically a single mapping so the permissions must all be the
> same).
Yeah I suspected this might be the case.
Sooo. Now that we're rejecting the batch if the first PTE isn't contPTE, is
this a problem?
I remember Dev's first attempt at this checked the hint; if it was == 1, I
believe it would then manually look ahead to see if there was a possible batch.
>
> >
> >>
> >> This function is looking to see if ptep is inside a contpte mapping, and if it
> >> is, it's returning the number of ptes to the end of the contpte mapping (which
> >> is of 64K size and alignment on 4K kernels). A contpte mapping will only exist
> >> if the physical memory is appropriately aligned/sized and all belongs to a
> >> single folio.
> >>
> >>>
> >>> (note that a bit grossly we'll call it _again_ in folio_pte_batch_flags()).
> >>>
> >>> I suppose we could not even bother with checking if same folio and _just_ check
> >>> if PTEs have consecutive PFNs, which is not very likely if different folio
> >>> but... could that break something?
> >>
> >> Yes something could break; the batch must *all* belong to the same folio.
> >> Functions like set_ptes() require that in their documentation, and arm64 depends
> >> upon it in order not to screw up the access/dirty bits.
> >
> > Turning this around - is a cont pte range guaranteed to belong to only one
> > folio?
>
> Yes.
Great
>
> >
> > If so then we can just limit the range to one batched block for the sake of
> > mremap that perhaps doesn't necessarily hugely benefit from further
> > batching anyway?
>
> Yes.
Also great.
>
> >
> > Let's take the time to check performance on arm64 hardware.
> >
> > Are we able to check to see how things behave if we have small folios only
> > in the tested range on arm64?
>
> I thought Dev provided numbers for that, but I'll chat with him and ensure we
> re-test (and broaden the testing if needed) with the new patch.
Thanks.
> >>>> Not sure if that was discussed at some point before we went into the
> >>>> direction of using folios. But there really doesn't seem to be anything
> >>>> gained for other architectures here (as raised by Jann).
> >>>
> >>> Yup... I wonder about the other instances of this... ruh roh.
> >>
> >> IIRC prior to Dev's mprotect and mremap optimizations, I believe all sites
> >> already needed the folio. I haven't actually looked at how mprotect ended up,
> >> but maybe worth checking to see if it should protect with pte_batch_hint() too.
> >
> > mprotect didn't? I mean let's check.
>
> I think for mprotect, the folio was only previously needed for the numa case. I
> have a vague memory that either Dev or I proposed wrapping folio_pte_batch() to
> only get the folio and call it if the next PTE had an adjacent PFN (or something
> like that). But it was deemed too complex. I might be misremembering... could
> have been an internal conversation. I'll chat with Dev about it and revisit.
OK thanks.
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-08 7:45 ` David Hildenbrand
2025-08-08 7:56 ` Ryan Roberts
@ 2025-08-08 9:45 ` Lorenzo Stoakes
1 sibling, 0 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-08 9:45 UTC (permalink / raw)
To: David Hildenbrand
Cc: Ryan Roberts, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, Pedro Falcato, Barry Song, Dev Jain, linux-mm,
linux-kernel
On Fri, Aug 08, 2025 at 09:45:39AM +0200, David Hildenbrand wrote:
> I am probably to blame here, because I think I rejected early to have
> arm64-only optimization, assuming other arch could benefit here as well with
> batching. But as it seems, batching in mremap() code really only serves the
> cont-pte managing code, and the folio_pte_batch() is really entirely
> unnecessary.
>
> In case of mprotect(), I think really only (a) NUMA and (b) anon-folio
> write-upgrade required the folio. So it's a bit more tricky than mremap()
> here where ... the folio is entirely irrelevant.
>
> One could detect the "anon write-upgrade possible" case early as well, and
> only lookup the folio in that case, otherwise use the straight pte hint.
>
> So I think there is some room for further improvement.
These are all great ideas. Given we handle A/D correctly in the mremap case due
to get_and_clear_ptes(), and additionally Ryan has indicated that it shouldn't
be a big deal to lose the 'what if there are other batches/physically contiguous
entries' logic, your suggestion for simplification - that is, literally
having mremap_folio_pte_batch() figure out the result from the pte batch hint -
is probably the best way.
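Something like this, as a minimal sketch (assuming we're happy to cap each
batch at a single contpte block and never look up the folio at all):

static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
				  pte_t *ptep, pte_t pte, int max_nr)
{
	/* Derive the batch purely from the arch hint - no folio, no vmemmap
	 * access (the vma/addr parameters would become unused in this form). */
	return min_t(int, max_nr, pte_batch_hint(ptep, pte));
}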
We could RFC it and Dev could go check arm64 perf perhaps?
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-08 8:44 ` Dev Jain
@ 2025-08-08 9:50 ` Lorenzo Stoakes
0 siblings, 0 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-08 9:50 UTC (permalink / raw)
To: Dev Jain
Cc: Ryan Roberts, David Hildenbrand, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Barry Song, linux-mm,
linux-kernel
On Fri, Aug 08, 2025 at 02:14:08PM +0530, Dev Jain wrote:
>
> On 08/08/25 1:26 pm, Ryan Roberts wrote:
> > ACK; Dev, perhaps you can take another look at this and work up a patch to more
> > aggressively avoid vm_normal_folio() for mprotect?
>
> Yup, I'll investigate this in a few weeks' time perhaps - try to use pte_batch_hint(),
> and when we have to unconditionally retrieve the folio, then use that instead.
>
> I'll also look into whether, even for arm64, there is any use in retrieving
> the folio at all - the only benefit we get is to batch across contig blocks, but I don't
> think there are any atomic operations (refcount/mapcount manipulation) which will be saved.
> By batching across a single contig block we save on ptep_get calls and TLBIs, which was
> our objective.
Ah nice, if we only ever need to consider a single contig block this makes life
a lot easier I think.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 18:58 [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch Lorenzo Stoakes
` (2 preceding siblings ...)
2025-08-08 5:19 ` Dev Jain
@ 2025-08-08 9:56 ` Vlastimil Babka
2025-08-11 2:40 ` Barry Song
4 siblings, 0 replies; 26+ messages in thread
From: Vlastimil Babka @ 2025-08-08 9:56 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Jann Horn, Pedro Falcato, Barry Song, Dev Jain,
linux-mm, linux-kernel, David Hildenbrand
On 8/7/25 20:58, Lorenzo Stoakes wrote:
> It was discovered in the attached report that commit f822a9a81a31 ("mm:
> optimize mremap() by PTE batching") introduced a significant performance
> regression on a number of metrics on x86-64, most notably
> stress-ng.bigheap.realloc_calls_per_sec - indicating a 37.3% regression in
> number of mremap() calls per second.
>
> I was able to reproduce this locally on an intel x86-64 raptor lake system,
> noting an average of 143,857 realloc calls/sec (with a stddev of 4,531 or
> 3.1%) prior to this patch being applied, and 81,503 afterwards (stddev of
> 2,131 or 2.6%) - a 43.3% regression.
>
> During testing I was able to determine that there was no meaningful
> difference in efforts to optimise the folio_pte_batch() operation, nor
> checking folio_test_large().
>
> This is within expectation, as a regression this large is likely to
> indicate we are accessing memory that is not yet in a cache line (and
> perhaps may even cause a main memory fetch).
>
> The expectation by those discussing this from the start was that
> vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the
> culprit due to having to retrieve memory from the vmemmap (which mremap()
> page table moves does not otherwise do, meaning this is inevitably cold
> memory).
>
> I was able to definitively determine that this theory is indeed correct and
> the cause of the issue.
>
> The solution is to restore part of an approach previously discarded on
> review, that is to invoke pte_batch_hint() which explicitly determines,
> through reference to the PTE alone (thus no vmemmap lookup), what the PTE
> batch size may be.
>
> On platforms other than arm64 this is currently hardcoded to return 1, so
> this naturally resolves the issue for x86-64, and for arm64 introduces
> little to no overhead as the pte cache line will be hot.
>
> With this patch applied, we move from 81,503 realloc calls/sec to
> 138,701 (stddev of 496.1 or 0.4%), which is a -3.6% regression, however
> accounting for the variance in the original result, this is broadly
> restoring performance to its prior state.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
> Fixes: f822a9a81a31 ("mm: optimize mremap() by PTE batching")
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Sadly, the improvement will be far from 3888.9% :(
> ---
> mm/mremap.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 677a4d744df9..9afa8cd524f5 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -179,6 +179,10 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
> if (max_nr == 1)
> return 1;
>
> + /* Avoid expensive folio lookup if we stand no chance of benefit. */
> + if (pte_batch_hint(ptep, pte) == 1)
> + return 1;
> +
> folio = vm_normal_folio(vma, addr, pte);
> if (!folio || !folio_test_large(folio))
> return 1;
> --
> 2.50.1
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-07 18:58 [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch Lorenzo Stoakes
` (3 preceding siblings ...)
2025-08-08 9:56 ` Vlastimil Babka
@ 2025-08-11 2:40 ` Barry Song
2025-08-11 4:57 ` Lorenzo Stoakes
4 siblings, 1 reply; 26+ messages in thread
From: Barry Song @ 2025-08-11 2:40 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Dev Jain, linux-mm, linux-kernel,
David Hildenbrand
On Fri, Aug 8, 2025 at 2:59 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> It was discovered in the attached report that commit f822a9a81a31 ("mm:
> optimize mremap() by PTE batching") introduced a significant performance
> regression on a number of metrics on x86-64, most notably
> stress-ng.bigheap.realloc_calls_per_sec - indicating a 37.3% regression in
> number of mremap() calls per second.
>
> I was able to reproduce this locally on an intel x86-64 raptor lake system,
> noting an average of 143,857 realloc calls/sec (with a stddev of 4,531 or
> 3.1%) prior to this patch being applied, and 81,503 afterwards (stddev of
> 2,131 or 2.6%) - a 43.3% regression.
>
> During testing I was able to determine that there was no meaningful
> difference in efforts to optimise the folio_pte_batch() operation, nor
> checking folio_test_large().
>
> This is within expectation, as a regression this large is likely to
> indicate we are accessing memory that is not yet in a cache line (and
> perhaps may even cause a main memory fetch).
>
> The expectation by those discussing this from the start was that
> vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the
> culprit due to having to retrieve memory from the vmemmap (which mremap()
> page table moves does not otherwise do, meaning this is inevitably cold
> memory).
If vm_normal_folio() is so expensive, does that mean it negates the
benefits that commit f822a9a81a31 (“mm: optimize mremap() by PTE
batching”) was originally intended to achieve through PTE batching?
>
> I was able to definitively determine that this theory is indeed correct and
> the cause of the issue.
>
> The solution is to restore part of an approach previously discarded on
> review, that is to invoke pte_batch_hint() which explicitly determines,
> through reference to the PTE alone (thus no vmemmap lookup), what the PTE
> batch size may be.
>
> On platforms other than arm64 this is currently hardcoded to return 1, so
> this naturally resolves the issue for x86-64, and for arm64 introduces
> little to no overhead as the pte cache line will be hot.
>
> With this patch applied, we move from 81,503 realloc calls/sec to
> 138,701 (stddev of 496.1 or 0.4%), which is a -3.6% regression, however
> accounting for the variance in the original result, this is broadly
> restoring performance to its prior state.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
> Fixes: f822a9a81a31 ("mm: optimize mremap() by PTE batching")
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Barry Song <baohua@kernel.org>
> ---
> mm/mremap.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 677a4d744df9..9afa8cd524f5 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -179,6 +179,10 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
> if (max_nr == 1)
> return 1;
>
> + /* Avoid expensive folio lookup if we stand no chance of benefit. */
> + if (pte_batch_hint(ptep, pte) == 1)
> + return 1;
> +
> folio = vm_normal_folio(vma, addr, pte);
> if (!folio || !folio_test_large(folio))
> return 1;
> --
> 2.50.1
Thanks
Barry
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-11 2:40 ` Barry Song
@ 2025-08-11 4:57 ` Lorenzo Stoakes
2025-08-11 6:52 ` Barry Song
0 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-11 4:57 UTC (permalink / raw)
To: Barry Song
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Dev Jain, linux-mm, linux-kernel,
David Hildenbrand
On Mon, Aug 11, 2025 at 10:40:50AM +0800, Barry Song wrote:
> On Fri, Aug 8, 2025 at 2:59 AM Lorenzo Stoakes
> > The expectation by those discussing this from the start was that
> > vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the
> > culprit due to having to retrieve memory from the vmemmap (which mremap()
> > page table moves does not otherwise do, meaning this is inevitably cold
> > memory).
>
> If vm_normal_folio() is so expensive, does that mean it negates the
> benefits that commit f822a9a81a31 (“mm: optimize mremap() by PTE
> batching”) was originally intended to achieve through PTE batching?
Not for arm64 apparently. And the hint check introduced here should avoid
regressions even there when small folios are in place.
In similar series in other areas, it appears we need the folio anyway, so
there is no additional overhead to deal with; in mremap() you'd otherwise
just be looking at page tables, which is what makes this so egregious here.
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
> > Fixes: f822a9a81a31 ("mm: optimize mremap() by PTE batching")
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Barry Song <baohua@kernel.org>
Thanks!
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-11 4:57 ` Lorenzo Stoakes
@ 2025-08-11 6:52 ` Barry Song
2025-08-11 15:08 ` Lorenzo Stoakes
0 siblings, 1 reply; 26+ messages in thread
From: Barry Song @ 2025-08-11 6:52 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Dev Jain, linux-mm, linux-kernel,
David Hildenbrand
On Mon, Aug 11, 2025 at 12:57 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Aug 11, 2025 at 10:40:50AM +0800, Barry Song wrote:
> > On Fri, Aug 8, 2025 at 2:59 AM Lorenzo Stoakes
> > > The expectation by those discussing this from the start was that
> > > vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the
> > > culprit due to having to retrieve memory from the vmemmap (which mremap()
> > > page table moves does not otherwise do, meaning this is inevitably cold
> > > memory).
> >
> > If vm_normal_folio() is so expensive, does that mean it negates the
> > benefits that commit f822a9a81a31 (“mm: optimize mremap() by PTE
> > batching”) was originally intended to achieve through PTE batching?
>
> Not for arm64 apparently. And the hint check introduced here should avoid
> regressions even there when small folios are in place.
I still don’t understand why this is fine on arm64. We do have faster
folio_pte_batch(), get_and_clear_ptes(), and set_ptes() with contpte, but
are those benefits really enough to outweigh the disadvantage of
vm_normal_folio(), given those PTEs are likely in the same cacheline?
Unless the previous contpte_try_unfold() was very costly and removing it yielded
a significant improvement, it’s difficult to see how the benefits would outweigh
the drawbacks of vm_normal_folio(). Does this imply that there was already a
regression in mremap() caused by contpte_try_unfold() before?
And that Dev’s patch is essentially a fix for this regression on arm64?
Sorry, maybe I’m talking too much, but I’m curious about the whole story :-)
>
> In similar series in other areas, it appears we need the folio anyway, so
> there is no additional overhead to deal with; in mremap() you'd otherwise
> just be looking at page tables, which is what makes this so egregious
> here.
>
Thanks
Barry
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-11 6:52 ` Barry Song
@ 2025-08-11 15:08 ` Lorenzo Stoakes
2025-08-11 15:19 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-11 15:08 UTC (permalink / raw)
To: Barry Song
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Dev Jain, linux-mm, linux-kernel,
David Hildenbrand
On Mon, Aug 11, 2025 at 02:52:51PM +0800, Barry Song wrote:
> On Mon, Aug 11, 2025 at 12:57 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Mon, Aug 11, 2025 at 10:40:50AM +0800, Barry Song wrote:
> > > On Fri, Aug 8, 2025 at 2:59 AM Lorenzo Stoakes
> > > > The expectation by those discussing this from the start was that
> > > > vm_normal_folio() (invoked by mremap_folio_pte_batch()) would likely be the
> > > > culprit due to having to retrieve memory from the vmemmap (which mremap()
> > > > page table moves does not otherwise do, meaning this is inevitably cold
> > > > memory).
> > >
> > > If vm_normal_folio() is so expensive, does that mean it negates the
> > > benefits that commit f822a9a81a31 (“mm: optimize mremap() by PTE
> > > batching”) was originally intended to achieve through PTE batching?
> >
> > Not for arm64 apparently. And the hint check introduced here should avoid
> > regressions even there when small folios are in place.
>
> I still don’t understand why this is fine on arm64. We do have faster
> folio_pte_batch(), get_and_clear_ptes(), and set_ptes() with contpte, but
> are those benefits really enough to outweigh the disadvantage of
> vm_normal_folio(), given those PTEs are likely in the same cacheline?
Well, in operations that already need a folio it's not really an extra cost.
For mremap(), where we don't, note that since we're gating on the hint now,
we'd have to have contPTE entries, and this would mean we're only looking up
the folio once per hinted batch, not for each and every PTE.
So this is a significant reduction in time taken, in theory.
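Concretely, assuming 4K pages where a contpte block is 16 PTEs: moving 512
PTEs previously meant up to 512 vm_normal_folio() calls touching cold vmemmap;
gated on the hint, a fully contpte-mapped range costs at most 512 / 16 = 32
folio lookups, and a range of small folios costs zero.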
In practice - well I'll let Dev handle that :)
>
> Unless the previous contpte_try_unfold() was very costly and removing it yielded
> a significant improvement, it’s difficult to see how the benefits would outweigh
> the drawbacks of vm_normal_folio(). Does this imply that there was already a
> regression in mremap() caused by contpte_try_unfold() before?
> And that Dev’s patch is essentially a fix for this regression on arm64?
Yeah maybe, and that'd be interesting - Dev/Ryan?
>
> Sorry, maybe I’m talking too much, but I’m curious about the whole story:-)
No please always query things, it's important stuff!
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch
2025-08-11 15:08 ` Lorenzo Stoakes
@ 2025-08-11 15:19 ` David Hildenbrand
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-08-11 15:19 UTC (permalink / raw)
To: Lorenzo Stoakes, Barry Song
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, Dev Jain, linux-mm, linux-kernel
>>
>> Unless the previous contpte_try_unfold() was very costly and removing it yielded
>> a significant improvement, it’s difficult to see how the benefits would outweigh
>> the drawbacks of vm_normal_folio(). Does this imply that there was already a
>> regression in mremap() caused by contpte_try_unfold() before?
>> And that Dev’s patch is essentially a fix for this regression on arm64?
Fix/regression might be the wrong words here IMHO, but yes, this code
was not playing nicely with automatic cont-pte folding/unfolding, and
now it's optimized for that.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2025-08-11 15:19 UTC | newest]
Thread overview: 26+ messages
2025-08-07 18:58 [PATCH HOTFIX 6.17] mm/mremap: avoid expensive folio lookup on mremap folio pte batch Lorenzo Stoakes
2025-08-07 19:10 ` David Hildenbrand
2025-08-07 19:20 ` Lorenzo Stoakes
2025-08-07 19:41 ` David Hildenbrand
2025-08-07 20:11 ` Lorenzo Stoakes
2025-08-07 21:01 ` Lorenzo Stoakes
2025-08-07 19:56 ` Ryan Roberts
2025-08-07 20:58 ` Lorenzo Stoakes
2025-08-08 5:18 ` Dev Jain
2025-08-08 7:19 ` Ryan Roberts
2025-08-08 7:45 ` David Hildenbrand
2025-08-08 7:56 ` Ryan Roberts
2025-08-08 8:44 ` Dev Jain
2025-08-08 9:50 ` Lorenzo Stoakes
2025-08-08 9:45 ` Lorenzo Stoakes
2025-08-08 9:40 ` Lorenzo Stoakes
2025-08-07 19:14 ` Pedro Falcato
2025-08-07 19:22 ` Lorenzo Stoakes
2025-08-07 19:33 ` David Hildenbrand
2025-08-08 5:19 ` Dev Jain
2025-08-08 9:56 ` Vlastimil Babka
2025-08-11 2:40 ` Barry Song
2025-08-11 4:57 ` Lorenzo Stoakes
2025-08-11 6:52 ` Barry Song
2025-08-11 15:08 ` Lorenzo Stoakes
2025-08-11 15:19 ` David Hildenbrand