* [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching
@ 2025-06-18 10:26 Dev Jain
  2025-06-18 16:14 ` David Hildenbrand
  2025-06-18 17:26 ` Lorenzo Stoakes
  0 siblings, 2 replies; 11+ messages in thread
From: Dev Jain @ 2025-06-18 10:26 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel, Dev Jain

Use PTE batching to optimize __collapse_huge_page_copy_succeeded().

On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
Then, calling ptep_clear() for every pte will cause a TLB flush for every
contpte block. Instead, clear_full_ptes() does a
contpte_try_unfold_partial() which will flush the TLB only for the (if any)
starting and ending contpte block, if they partially overlap with the range
khugepaged is looking at.

For all arches, there should be a benefit due to batching atomic operations
on mapcounts due to folio_remove_rmap_ptes().

No issues were observed with mm-selftests.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/khugepaged.c | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d45d08b521f6..649ccb2670f8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -700,12 +700,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 						spinlock_t *ptl,
 						struct list_head *compound_pagelist)
 {
+	unsigned long end = address + HPAGE_PMD_SIZE;
 	struct folio *src, *tmp;
-	pte_t *_pte;
+	pte_t *_pte = pte;
 	pte_t pteval;
+	int nr_ptes;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, address += PAGE_SIZE) {
+	do {
+		nr_ptes = 1;
 		pteval = ptep_get(_pte);
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
@@ -719,23 +721,36 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 				ksm_might_unmap_zero_page(vma->vm_mm, pteval);
 			}
 		} else {
+			const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+			int max_nr_ptes;
+			bool is_large;
+
 			struct page *src_page = pte_page(pteval);
 
 			src = page_folio(src_page);
-			if (!folio_test_large(src))
+			is_large = folio_test_large(src);
+			if (!is_large)
 				release_pte_folio(src);
+
+			max_nr_ptes = (end - address) >> PAGE_SHIFT;
+			if (is_large && max_nr_ptes != 1)
+				nr_ptes = folio_pte_batch(src, address, _pte,
+							  pteval, max_nr_ptes,
+							  flags, NULL, NULL, NULL);
+
 			/*
 			 * ptl mostly unnecessary, but preempt has to
 			 * be disabled to update the per-cpu stats
 			 * inside folio_remove_rmap_pte().
 			 */
 			spin_lock(ptl);
-			ptep_clear(vma->vm_mm, address, _pte);
-			folio_remove_rmap_pte(src, src_page, vma);
+			clear_full_ptes(vma->vm_mm, address, _pte, nr_ptes, false);
+			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
 			spin_unlock(ptl);
-			free_folio_and_swap_cache(src);
+			free_swap_cache(src);
+			folio_put_refs(src, nr_ptes);
 		}
-	}
+	} while (_pte += nr_ptes, address += nr_ptes * PAGE_SIZE, address != end);
 
 	list_for_each_entry_safe(src, tmp, compound_pagelist, lru) {
 		list_del(&src->lru);
-- 
2.30.2




* Re: [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching
  2025-06-18 10:26 [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching Dev Jain
@ 2025-06-18 16:14 ` David Hildenbrand
  2025-06-18 17:10   ` Lorenzo Stoakes
  2025-06-19  3:54   ` Dev Jain
  2025-06-18 17:26 ` Lorenzo Stoakes
  1 sibling, 2 replies; 11+ messages in thread
From: David Hildenbrand @ 2025-06-18 16:14 UTC (permalink / raw)
  To: Dev Jain, akpm
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel

On 18.06.25 12:26, Dev Jain wrote:
> Use PTE batching to optimize __collapse_huge_page_copy_succeeded().
> 
> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
> Then, calling ptep_clear() for every pte will cause a TLB flush for every
> contpte block. Instead, clear_full_ptes() does a
> contpte_try_unfold_partial() which will flush the TLB only for the (if any)
> starting and ending contpte block, if they partially overlap with the range
> khugepaged is looking at.
> 
> For all arches, there should be a benefit due to batching atomic operations
> on mapcounts due to folio_remove_rmap_ptes().
> 
> No issues were observed with mm-selftests.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>   mm/khugepaged.c | 31 +++++++++++++++++++++++--------
>   1 file changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d45d08b521f6..649ccb2670f8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -700,12 +700,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>   						spinlock_t *ptl,
>   						struct list_head *compound_pagelist)
>   {
> +	unsigned long end = address + HPAGE_PMD_SIZE;
>   	struct folio *src, *tmp;
> -	pte_t *_pte;
> +	pte_t *_pte = pte;
>   	pte_t pteval;
> +	int nr_ptes;
>   
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, address += PAGE_SIZE) {
> +	do {
> +		nr_ptes = 1;
>   		pteval = ptep_get(_pte);
>   		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>   			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> @@ -719,23 +721,36 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>   				ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>   			}
>   		} else {
> +			const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +			int max_nr_ptes;
> +			bool is_large;

folio_test_large() should be cheap, no need for the temporary variable 
(the compiler will likely optimize this either way).

> +
>   			struct page *src_page = pte_page(pteval);
>   
>   			src = page_folio(src_page);
> -			if (!folio_test_large(src))
> +			is_large = folio_test_large(src);
> +			if (!is_large)
>   				release_pte_folio(src);
> +
> +			max_nr_ptes = (end - address) >> PAGE_SHIFT;
> +			if (is_large && max_nr_ptes != 1)
> +				nr_ptes = folio_pte_batch(src, address, _pte,
> +							  pteval, max_nr_ptes,
> +							  flags, NULL, NULL, NULL);

Starting to wonder if we want a simplified, non-inlined version of 
folio_pte_batch() in mm/util.c (e.g., without the 3 NULL parameters), 
renaming existing folio_pte_batch to __folio_pte_batch() and only using 
it where required (performance like in fork/zap, or because the other 
parameters are relevant).

Let me see if I find time for a quick patch later. Have to look at what 
other similar code needs.
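
Roughly something like the below, perhaps (just a sketch of the idea, not a
tested patch; the out-of-line variant would simply drop the three
out-parameters):

int folio_pte_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
		    pte_t pte, int max_nr, fpb_t flags)
{
	/* Defer to the inlined helper, without any_writable/any_young/any_dirty. */
	return __folio_pte_batch(folio, addr, ptep, pte, max_nr, flags,
				 NULL, NULL, NULL);
}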

> +
>   			/*
>   			 * ptl mostly unnecessary, but preempt has to
>   			 * be disabled to update the per-cpu stats
>   			 * inside folio_remove_rmap_pte().
>   			 */
>   			spin_lock(ptl);

Existing code: The PTL locking should just be moved outside of the loop.

> -			ptep_clear(vma->vm_mm, address, _pte);
> -			folio_remove_rmap_pte(src, src_page, vma);
> +			clear_full_ptes(vma->vm_mm, address, _pte, nr_ptes, false);

Starting to wonder if we want a shortcut

#define clear_ptes(__mm, __addr, __pte, __nr_ptes) \
	clear_full_ptes(__mm, __addr, __pte, __nr_ptes, false)

> +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
>   			spin_unlock(ptl);
> -			free_folio_and_swap_cache(src);
> +			free_swap_cache(src);
> +			folio_put_refs(src, nr_ptes);
>   		}
> -	}
> +	} while (_pte += nr_ptes, address += nr_ptes * PAGE_SIZE, address != end);
>   
>   	list_for_each_entry_safe(src, tmp, compound_pagelist, lru) {
>   		list_del(&src->lru);

I think this should just work.

-- 
Cheers,

David / dhildenb




* Re: [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching
  2025-06-18 16:14 ` David Hildenbrand
@ 2025-06-18 17:10   ` Lorenzo Stoakes
  2025-06-18 17:23     ` David Hildenbrand
  2025-06-19  3:54   ` Dev Jain
  1 sibling, 1 reply; 11+ messages in thread
From: Lorenzo Stoakes @ 2025-06-18 17:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dev Jain, akpm, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel

On Wed, Jun 18, 2025 at 06:14:22PM +0200, David Hildenbrand wrote:
> On 18.06.25 12:26, Dev Jain wrote:
> > +
> >   			/*
> >   			 * ptl mostly unnecessary, but preempt has to
> >   			 * be disabled to update the per-cpu stats
> >   			 * inside folio_remove_rmap_pte().
> >   			 */
> >   			spin_lock(ptl);
>
> Existing code: The PTL locking should just be moved outside of the loop.

Do we really want to hold the PTL for the duration of the loop? Are we sure
it's safe to do so? Are there any locks taken in other functions that might
sleep that'd mean holding a spinlock would be a problem?



* Re: [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching
  2025-06-18 17:10   ` Lorenzo Stoakes
@ 2025-06-18 17:23     ` David Hildenbrand
  2025-06-18 17:26       ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2025-06-18 17:23 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Dev Jain, akpm, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel

On 18.06.25 19:10, Lorenzo Stoakes wrote:
> On Wed, Jun 18, 2025 at 06:14:22PM +0200, David Hildenbrand wrote:
>> On 18.06.25 12:26, Dev Jain wrote:
>>> +
>>>    			/*
>>>    			 * ptl mostly unnecessary, but preempt has to
>>>    			 * be disabled to update the per-cpu stats
>>>    			 * inside folio_remove_rmap_pte().
>>>    			 */
>>>    			spin_lock(ptl);
>>
>> Existing code: The PTL locking should just be moved outside of the loop.
> 
> Do we really want to hold the PTL for the duration of the loop? Are we sure
> it's safe to do so? Are there any locks taken in other functions that might
> sleep that'd mean holding a spinlock would be a problem?

It's a very weird thing to not hold the PTL while walking page tables, 
and then only grabbing it for clearing entries just to make selected 
functions happy ...

I mostly spotted the release_pte_folio(), which I think should be fine 
with a spinlock held. I missed the free_folio_and_swap_cache(), not sure 
if that is problematic.

Interestingly, release_pte_folio() does a

a) node_stat_mod_folio
b) folio_unlock
c) folio_putback_lru

... and folio_putback_lru() is documented to "lru_lock must not be held, 
interrupts must be enabled". Hmmmm. I suspect that doc is wrong.

So yeah, maybe better keep that weird looking locking like it is :)

-- 
Cheers,

David / dhildenb




* Re: [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching
  2025-06-18 17:23     ` David Hildenbrand
@ 2025-06-18 17:26       ` David Hildenbrand
  0 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2025-06-18 17:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Dev Jain, akpm, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel

On 18.06.25 19:23, David Hildenbrand wrote:
> On 18.06.25 19:10, Lorenzo Stoakes wrote:
>> On Wed, Jun 18, 2025 at 06:14:22PM +0200, David Hildenbrand wrote:
>>> On 18.06.25 12:26, Dev Jain wrote:
>>>> +
>>>>     			/*
>>>>     			 * ptl mostly unnecessary, but preempt has to
>>>>     			 * be disabled to update the per-cpu stats
>>>>     			 * inside folio_remove_rmap_pte().
>>>>     			 */
>>>>     			spin_lock(ptl);
>>>
>>> Existing code: The PTL locking should just be moved outside of the loop.
>>
>> Do we really want to hold the PTL for the duration of the loop? Are we sure
>> it's safe to do so? Are there any locks taken in other functions that might
>> sleep that'd mean holding a spinlock would be a problem?
> 
> It's a very weird thing to not hold the PTL while walking page tables,
> and then only grabbing it for clearing entries just to make selected
> functions happy ...
> 
> I mostly spotted the release_pte_folio(), which I think should be fine
> with a spinlock held. I missed the free_folio_and_swap_cache(), not sure
> if that is problematic.
> 
> Interestingly, release_pte_folio() does a
> 
> a) node_stat_mod_folio
> b) folio_unlock
> c) folio_putback_lru
> 
> ... and folio_putback_lru() is documented to "lru_lock must not be held,
> interrupts must be enabled". Hmmmm. I suspect that doc is wrong.

It's late ... doc is correct. Was reading "must be held".

So yeah, let's leave it as is.

-- 
Cheers,

David / dhildenb




* Re: [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching
  2025-06-18 10:26 [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching Dev Jain
  2025-06-18 16:14 ` David Hildenbrand
@ 2025-06-18 17:26 ` Lorenzo Stoakes
  2025-06-18 17:29   ` David Hildenbrand
  2025-06-19  3:22   ` Dev Jain
  1 sibling, 2 replies; 11+ messages in thread
From: Lorenzo Stoakes @ 2025-06-18 17:26 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	baohua, linux-mm, linux-kernel

On Wed, Jun 18, 2025 at 03:56:07PM +0530, Dev Jain wrote:
> Use PTE batching to optimize __collapse_huge_page_copy_succeeded().
>
> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
> Then, calling ptep_clear() for every pte will cause a TLB flush for every
> contpte block. Instead, clear_full_ptes() does a
> contpte_try_unfold_partial() which will flush the TLB only for the (if any)
> starting and ending contpte block, if they partially overlap with the range
> khugepaged is looking at.
>
> For all arches, there should be a benefit due to batching atomic operations
> on mapcounts due to folio_remove_rmap_ptes().
>
> No issues were observed with mm-selftests.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/khugepaged.c | 31 +++++++++++++++++++++++--------
>  1 file changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d45d08b521f6..649ccb2670f8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -700,12 +700,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  						spinlock_t *ptl,
>  						struct list_head *compound_pagelist)
>  {
> +	unsigned long end = address + HPAGE_PMD_SIZE;

I assume we always enter here with aligned address...

>  	struct folio *src, *tmp;
> -	pte_t *_pte;
> +	pte_t *_pte = pte;
>  	pte_t pteval;
> +	int nr_ptes;
>
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, address += PAGE_SIZE) {
> +	do {
> +		nr_ptes = 1;
>  		pteval = ptep_get(_pte);
>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>  			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> @@ -719,23 +721,36 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  				ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>  			}
>  		} else {

Existing code but hate this level of indentation.

The code before was (barely) sort of ok-ish, but now it's really out of hand.

On the other hand, I look at __collapse_huge_page_isolate() and want to cry so I
guess this maybe is something that needs addressing outside of this patch.


> +			const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +			int max_nr_ptes;
> +			bool is_large;
> +
>  			struct page *src_page = pte_page(pteval);
>
>  			src = page_folio(src_page);
> -			if (!folio_test_large(src))
> +			is_large = folio_test_large(src);
> +			if (!is_large)
>  				release_pte_folio(src);

Hm, in this case right, release_pte_folio() does a folio_unlock().

Where does a large folio get unlocked?

I mean this must have been existing code because I don't see where this
happens previously either.

> +
> +			max_nr_ptes = (end - address) >> PAGE_SHIFT;
> +			if (is_large && max_nr_ptes != 1)

Is it really that harmful if max_nr_ptes == 1? Doesn't folio_pte_batch()
figure it out?

> +				nr_ptes = folio_pte_batch(src, address, _pte,
> +							  pteval, max_nr_ptes,
> +							  flags, NULL, NULL, NULL);
> +

It'd be nice(r) if this was:

if (folio_test_large(src))
	nr_ptes = folio_pte_batch(src, address, _pte,
		pteval, max_nr_ptes,
		flags, NULL, NULL, NULL);
else
	release_pte_folio(src);

But even that is horrid because of the asymmetry.

>  			/*
>  			 * ptl mostly unnecessary, but preempt has to
>  			 * be disabled to update the per-cpu stats
>  			 * inside folio_remove_rmap_pte().
>  			 */
>  			spin_lock(ptl);
> -			ptep_clear(vma->vm_mm, address, _pte);
> -			folio_remove_rmap_pte(src, src_page, vma);
> +			clear_full_ptes(vma->vm_mm, address, _pte, nr_ptes, false);

Be nice to use 'Liam's convention' of sticking `/* full = */ false)` on the
end here so we know what the false refers to.
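
I.e. something like this (illustrative only):

			clear_full_ptes(vma->vm_mm, address, _pte, nr_ptes,
					/* full = */ false);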

> +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);

Kinda neat that folio_remove_rmap_pte() is just a define onto this with
nr_ptes == 1 :)

>  			spin_unlock(ptl);
> -			free_folio_and_swap_cache(src);
> +			free_swap_cache(src);
> +			folio_put_refs(src, nr_ptes);
>  		}
> -	}
> +	} while (_pte += nr_ptes, address += nr_ptes * PAGE_SIZE, address != end);
>
>  	list_for_each_entry_safe(src, tmp, compound_pagelist, lru) {
>  		list_del(&src->lru);
> --
> 2.30.2
>

I can't see much wrong with this though, just 'yuck' at existing code
really :)



* Re: [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching
  2025-06-18 17:26 ` Lorenzo Stoakes
@ 2025-06-18 17:29   ` David Hildenbrand
  2025-06-19 12:52     ` Lorenzo Stoakes
  2025-06-19  3:22   ` Dev Jain
  1 sibling, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2025-06-18 17:29 UTC (permalink / raw)
  To: Lorenzo Stoakes, Dev Jain
  Cc: akpm, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	baohua, linux-mm, linux-kernel

On 18.06.25 19:26, Lorenzo Stoakes wrote:
> On Wed, Jun 18, 2025 at 03:56:07PM +0530, Dev Jain wrote:
>> Use PTE batching to optimize __collapse_huge_page_copy_succeeded().
>>
>> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
>> Then, calling ptep_clear() for every pte will cause a TLB flush for every
>> contpte block. Instead, clear_full_ptes() does a
>> contpte_try_unfold_partial() which will flush the TLB only for the (if any)
>> starting and ending contpte block, if they partially overlap with the range
>> khugepaged is looking at.
>>
>> For all arches, there should be a benefit due to batching atomic operations
>> on mapcounts due to folio_remove_rmap_ptes().
>>
>> No issues were observed with mm-selftests.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/khugepaged.c | 31 +++++++++++++++++++++++--------
>>   1 file changed, 23 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index d45d08b521f6..649ccb2670f8 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -700,12 +700,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>   						spinlock_t *ptl,
>>   						struct list_head *compound_pagelist)
>>   {
>> +	unsigned long end = address + HPAGE_PMD_SIZE;
> 
> I assume we always enter here with aligned address...
> 
>>   	struct folio *src, *tmp;
>> -	pte_t *_pte;
>> +	pte_t *_pte = pte;
>>   	pte_t pteval;
>> +	int nr_ptes;
>>
>> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> -	     _pte++, address += PAGE_SIZE) {
>> +	do {
>> +		nr_ptes = 1;
>>   		pteval = ptep_get(_pte);
>>   		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>   			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>> @@ -719,23 +721,36 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>   				ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>   			}
>>   		} else {
> 
> Existing code but hate this level of indentation.
> 
> The code before was (barely) sort of ok-ish, but now it's really out of hand.
> 
> On the other hand, I look at __collapse_huge_page_isolate() and want to cry so I
> guess this maybe is something that needs addressing outside of this patch.
> 
> 
>> +			const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +			int max_nr_ptes;
>> +			bool is_large;
>> +
>>   			struct page *src_page = pte_page(pteval);
>>
>>   			src = page_folio(src_page);
>> -			if (!folio_test_large(src))
>> +			is_large = folio_test_large(src);
>> +			if (!is_large)
>>   				release_pte_folio(src);
> 
> Hm, in this case right, release_pte_folio() does a folio_unlock().
> 
> Where does a large folio get unlocked?

Through the "compound_pagelist" below. ... this code is so ugly.

"large_folio_list" ...

-- 
Cheers,

David / dhildenb




* Re: [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching
  2025-06-18 17:26 ` Lorenzo Stoakes
  2025-06-18 17:29   ` David Hildenbrand
@ 2025-06-19  3:22   ` Dev Jain
  2025-06-19 12:53     ` Lorenzo Stoakes
  1 sibling, 1 reply; 11+ messages in thread
From: Dev Jain @ 2025-06-19  3:22 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	baohua, linux-mm, linux-kernel


On 18/06/25 10:56 pm, Lorenzo Stoakes wrote:
> On Wed, Jun 18, 2025 at 03:56:07PM +0530, Dev Jain wrote:
>> Use PTE batching to optimize __collapse_huge_page_copy_succeeded().
>>
>> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
>> Then, calling ptep_clear() for every pte will cause a TLB flush for every
>> contpte block. Instead, clear_full_ptes() does a
>> contpte_try_unfold_partial() which will flush the TLB only for the (if any)
>> starting and ending contpte block, if they partially overlap with the range
>> khugepaged is looking at.
>>
>> For all arches, there should be a benefit due to batching atomic operations
>> on mapcounts due to folio_remove_rmap_ptes().
>>
>> No issues were observed with mm-selftests.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/khugepaged.c | 31 +++++++++++++++++++++++--------
>>   1 file changed, 23 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index d45d08b521f6..649ccb2670f8 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -700,12 +700,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>   						spinlock_t *ptl,
>>   						struct list_head *compound_pagelist)
>>   {
>> +	unsigned long end = address + HPAGE_PMD_SIZE;
> I assume we always enter here with aligned address...

Yes.

>
>>   	struct folio *src, *tmp;
>> -	pte_t *_pte;
>> +	pte_t *_pte = pte;
>>   	pte_t pteval;
>> +	int nr_ptes;
>>
>> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> -	     _pte++, address += PAGE_SIZE) {
>> +	do {
>> +		nr_ptes = 1;
>>   		pteval = ptep_get(_pte);
>>   		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>   			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>> @@ -719,23 +721,36 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>   				ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>   			}
>>   		} else {
> Existing code but hate this level of indentation.
>
> The code before was (barely) sort of ok-ish, but now it's really out of hand.
>
> On the other hand, I look at __collapse_huge_page_isolate() and want to cry so I
> guess this maybe is something that needs addressing outside of this patch.

Trust me I have already cried a lot before while doing the khugepaged mTHP stuff :)

>
>
>> +			const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +			int max_nr_ptes;
>> +			bool is_large;
>> +
>>   			struct page *src_page = pte_page(pteval);
>>
>>   			src = page_folio(src_page);
>> -			if (!folio_test_large(src))
>> +			is_large = folio_test_large(src);
>> +			if (!is_large)
>>   				release_pte_folio(src);
> Hm, in this case right, release_pte_folio() does a folio_unlock().
>
> Where does a large folio get unlocked?
>
> I mean this must have been existing code because I don't see where this
> happens previously either.
>
>> +
>> +			max_nr_ptes = (end - address) >> PAGE_SHIFT;
>> +			if (is_large && max_nr_ptes != 1)
> Is it really that harmful if max_nr_ptes == 1? Doesn't folio_pte_batch()
> figure it out?

Yup it will figure that out, was just following the pattern of zap_present_ptes
and copy_present_ptes. Will drop this.

>
>> +				nr_ptes = folio_pte_batch(src, address, _pte,
>> +							  pteval, max_nr_ptes,
>> +							  flags, NULL, NULL, NULL);
>> +
> It'd be nice(r) if this was:
>
> if (folio_test_large(src))
> 	nr_ptes = folio_pte_batch(src, address, _pte,
> 		pteval, max_nr_ptes,
> 		flags, NULL, NULL, NULL);
> else
> 	release_pte_folio(src);
>
> But even that is horrid because of the asymmetry.
>
>>   			/*
>>   			 * ptl mostly unnecessary, but preempt has to
>>   			 * be disabled to update the per-cpu stats
>>   			 * inside folio_remove_rmap_pte().
>>   			 */
>>   			spin_lock(ptl);
>> -			ptep_clear(vma->vm_mm, address, _pte);
>> -			folio_remove_rmap_pte(src, src_page, vma);
>> +			clear_full_ptes(vma->vm_mm, address, _pte, nr_ptes, false);
> Be nice to use 'Liam's convention' of sticking `/* full = */ false)` on the
> end here so we know what the false refers to.

Sounds good, although in the other mail David mentioned a way to elide this
so I will prefer that.

>
>> +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
> Kinda neat that folio_remove_rmap_pte() is just a define onto this with
> nr_ptes == 1 :)
>
>>   			spin_unlock(ptl);
>> -			free_folio_and_swap_cache(src);
>> +			free_swap_cache(src);
>> +			folio_put_refs(src, nr_ptes);
>>   		}
>> -	}
>> +	} while (_pte += nr_ptes, address += nr_ptes * PAGE_SIZE, address != end);
>>
>>   	list_for_each_entry_safe(src, tmp, compound_pagelist, lru) {
>>   		list_del(&src->lru);
>> --
>> 2.30.2
>>
> I can't see much wrong with this though, just 'yuck' at existing code
> really :)



* Re: [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching
  2025-06-18 16:14 ` David Hildenbrand
  2025-06-18 17:10   ` Lorenzo Stoakes
@ 2025-06-19  3:54   ` Dev Jain
  1 sibling, 0 replies; 11+ messages in thread
From: Dev Jain @ 2025-06-19  3:54 UTC (permalink / raw)
  To: David Hildenbrand, akpm
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel


On 18/06/25 9:44 pm, David Hildenbrand wrote:
> On 18.06.25 12:26, Dev Jain wrote:
>> Use PTE batching to optimize __collapse_huge_page_copy_succeeded().
>>
>> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for 
>> collapse.
>> Then, calling ptep_clear() for every pte will cause a TLB flush for 
>> every
>> contpte block. Instead, clear_full_ptes() does a
>> contpte_try_unfold_partial() which will flush the TLB only for the 
>> (if any)
>> starting and ending contpte block, if they partially overlap with the 
>> range
>> khugepaged is looking at.
>>
>> For all arches, there should be a benefit due to batching atomic 
>> operations
>> on mapcounts due to folio_remove_rmap_ptes().
>>
>> No issues were observed with mm-selftests.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/khugepaged.c | 31 +++++++++++++++++++++++--------
>>   1 file changed, 23 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index d45d08b521f6..649ccb2670f8 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -700,12 +700,14 @@ static void 
>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>>                           spinlock_t *ptl,
>>                           struct list_head *compound_pagelist)
>>   {
>> +    unsigned long end = address + HPAGE_PMD_SIZE;
>>       struct folio *src, *tmp;
>> -    pte_t *_pte;
>> +    pte_t *_pte = pte;
>>       pte_t pteval;
>> +    int nr_ptes;
>>   -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>> -         _pte++, address += PAGE_SIZE) {
>> +    do {
>> +        nr_ptes = 1;
>>           pteval = ptep_get(_pte);
>>           if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>               add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>> @@ -719,23 +721,36 @@ static void 
>> __collapse_huge_page_copy_succeeded(pte_t *pte,
>>                   ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>               }
>>           } else {
>> +            const fpb_t flags = FPB_IGNORE_DIRTY | 
>> FPB_IGNORE_SOFT_DIRTY;
>> +            int max_nr_ptes;
>> +            bool is_large;
>
> folio_test_large() should be cheap, no need for the temporary variable 
> (the compiler will likely optimize this either way).

Okay.

>
>> +
>>               struct page *src_page = pte_page(pteval);
>>                 src = page_folio(src_page);
>> -            if (!folio_test_large(src))
>> +            is_large = folio_test_large(src);
>> +            if (!is_large)
>>                   release_pte_folio(src);
>> +
>> +            max_nr_ptes = (end - address) >> PAGE_SHIFT;
>> +            if (is_large && max_nr_ptes != 1)
>> +                nr_ptes = folio_pte_batch(src, address, _pte,
>> +                              pteval, max_nr_ptes,
>> +                              flags, NULL, NULL, NULL);
>
> Starting to wonder if we want a simplified, non-inlined version of 
> folio_pte_batch() in mm/util.c (e.g., without the 3 NULL parameters), 
> renaming existing folio_pte_batch to __folio_pte_batch() and only 
> using it where required (performance like in fork/zap, or because the 
> other parameters are relevant).
>
> Let me see if I find time for a quick patch later. Have to look at 
> what other similar code needs.

Perhaps that version can also have the default fpb_flags ignoring dirty
and soft-dirty, since that is what most code will do. So the wrapper can
pass which flags to remove.
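
Something like this, maybe (sketch only; the parameter name is made up):

int folio_pte_batch(struct folio *folio, unsigned long addr, pte_t *ptep,
		    pte_t pte, int max_nr, fpb_t flags_to_clear)
{
	const fpb_t flags = (FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY) &
			    ~flags_to_clear;

	return __folio_pte_batch(folio, addr, ptep, pte, max_nr, flags,
				 NULL, NULL, NULL);
}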

>
>> +
>>               /*
>>                * ptl mostly unnecessary, but preempt has to
>>                * be disabled to update the per-cpu stats
>>                * inside folio_remove_rmap_pte().
>>                */
>>               spin_lock(ptl);
>
> Existing code: The PTL locking should just be moved outside of the loop.
>
>> -            ptep_clear(vma->vm_mm, address, _pte);
>> -            folio_remove_rmap_pte(src, src_page, vma);
>> +            clear_full_ptes(vma->vm_mm, address, _pte, nr_ptes, false);
>
> Starting to wonder if we want a shortcut
>
> #define clear_ptes(__mm, __addr, __pte, __nr_ptes) \
>     clear_full_ptes(__mm, __addr, __pte, __nr_ptes, false)

Thanks for the suggestion! I will definitely do this cleanup as part of this
series. It is very confusing, and if someone does not already know about it,
it becomes hard to find the batched version of ptep_clear, because this does
not follow the convention we have right now - ptep_set_wrprotect ->
wrprotect_ptes and so on; the "full" comes out of nowhere.
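
I.e. something like (sketch, untested; could equally be the #define you
suggested):

static inline void clear_ptes(struct mm_struct *mm, unsigned long addr,
			      pte_t *ptep, unsigned int nr)
{
	clear_full_ptes(mm, addr, ptep, nr, /* full = */ false);
}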

>
>> +            folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
>>               spin_unlock(ptl);
>> -            free_folio_and_swap_cache(src);
>> +            free_swap_cache(src);
>> +            folio_put_refs(src, nr_ptes);
>>           }
>> -    }
>> +    } while (_pte += nr_ptes, address += nr_ptes * PAGE_SIZE, 
>> address != end);
>>         list_for_each_entry_safe(src, tmp, compound_pagelist, lru) {
>>           list_del(&src->lru);
>
> I think this should just work.
>



* Re: [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching
  2025-06-18 17:29   ` David Hildenbrand
@ 2025-06-19 12:52     ` Lorenzo Stoakes
  0 siblings, 0 replies; 11+ messages in thread
From: Lorenzo Stoakes @ 2025-06-19 12:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dev Jain, akpm, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel

On Wed, Jun 18, 2025 at 07:29:42PM +0200, David Hildenbrand wrote:
> On 18.06.25 19:26, Lorenzo Stoakes wrote:
> > Where does a large folio get unlocked?
>
> Through the "compound_pagelist" below. ... this code is so ugly.
>
> "large_folio_list" ...
>

Yeah I did wonder if that was what it was... yuck indeed!



* Re: [PATCH] khugepaged: Optimize __collapse_huge_page_copy_succeeded() for large folios by PTE batching
  2025-06-19  3:22   ` Dev Jain
@ 2025-06-19 12:53     ` Lorenzo Stoakes
  0 siblings, 0 replies; 11+ messages in thread
From: Lorenzo Stoakes @ 2025-06-19 12:53 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	baohua, linux-mm, linux-kernel

On Thu, Jun 19, 2025 at 08:52:51AM +0530, Dev Jain wrote:
>
> On 18/06/25 10:56 pm, Lorenzo Stoakes wrote:
> > On Wed, Jun 18, 2025 at 03:56:07PM +0530, Dev Jain wrote:
> > > Use PTE batching to optimize __collapse_huge_page_copy_succeeded().
> > >
> > > On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
> > > Then, calling ptep_clear() for every pte will cause a TLB flush for every
> > > contpte block. Instead, clear_full_ptes() does a
> > > contpte_try_unfold_partial() which will flush the TLB only for the (if any)
> > > starting and ending contpte block, if they partially overlap with the range
> > > khugepaged is looking at.
> > >
> > > For all arches, there should be a benefit due to batching atomic operations
> > > on mapcounts due to folio_remove_rmap_ptes().
> > >
> > > No issues were observed with mm-selftests.
> > >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > >   mm/khugepaged.c | 31 +++++++++++++++++++++++--------
> > >   1 file changed, 23 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index d45d08b521f6..649ccb2670f8 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -700,12 +700,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> > >   						spinlock_t *ptl,
> > >   						struct list_head *compound_pagelist)
> > >   {
> > > +	unsigned long end = address + HPAGE_PMD_SIZE;
> > I assume we always enter here with aligned address...
>
> Yes.

OK cool would be weird otherwise :)

>
> >
> > >   	struct folio *src, *tmp;
> > > -	pte_t *_pte;
> > > +	pte_t *_pte = pte;
> > >   	pte_t pteval;
> > > +	int nr_ptes;
> > >
> > > -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > > -	     _pte++, address += PAGE_SIZE) {
> > > +	do {
> > > +		nr_ptes = 1;
> > >   		pteval = ptep_get(_pte);
> > >   		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > >   			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > @@ -719,23 +721,36 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> > >   				ksm_might_unmap_zero_page(vma->vm_mm, pteval);
> > >   			}
> > >   		} else {
> > Existing code but hate this level of indentation.
> >
> > The code before was (barely) sort of ok-ish, but now it's really out of hand.
> >
> > On the other hand, I look at __collapse_huge_page_isolate() and want to cry so I
> > guess this maybe is something that needs addressing outside of this patch.
>
> Trust me I have already cried a lot before while doing the khugepaged mTHP stuff :)

Seems we all cry together about this code ;)

>
> >
> >
> > > +			const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > > +			int max_nr_ptes;
> > > +			bool is_large;
> > > +
> > >   			struct page *src_page = pte_page(pteval);
> > >
> > >   			src = page_folio(src_page);
> > > -			if (!folio_test_large(src))
> > > +			is_large = folio_test_large(src);
> > > +			if (!is_large)
> > >   				release_pte_folio(src);
> > Hm, in this case right, release_pte_folio() does a folio_unlock().
> >
> > Where does a large folio get unlocked?
> >
> > I mean this must have been existing code because I don't see where this
> > happens previously either.
> >
> > > +
> > > +			max_nr_ptes = (end - address) >> PAGE_SHIFT;
> > > +			if (is_large && max_nr_ptes != 1)
> > Is it really that harmful if max_nr_ptes == 1? Doesn't folio_pte_batch()
> > figure it out?
>
> Yup it will figure that out, was just following the pattern of zap_present_ptes
> and copy_present_ptes. Will drop this.

Thanks

>
> >
> > > +				nr_ptes = folio_pte_batch(src, address, _pte,
> > > +							  pteval, max_nr_ptes,
> > > +							  flags, NULL, NULL, NULL);
> > > +
> > It'd be nice(r) if this was:
> >
> > if (folio_test_large(src))
> > 	nr_ptes = folio_pte_batch(src, address, _pte,
> > 		pteval, max_nr_ptes,
> > 		flags, NULL, NULL, NULL);
> > else
> > 	release_pte_folio(src);
> >
> > But even that is horrid because of the asymmetry.
> >
> > >   			/*
> > >   			 * ptl mostly unnecessary, but preempt has to
> > >   			 * be disabled to update the per-cpu stats
> > >   			 * inside folio_remove_rmap_pte().
> > >   			 */
> > >   			spin_lock(ptl);
> > > -			ptep_clear(vma->vm_mm, address, _pte);
> > > -			folio_remove_rmap_pte(src, src_page, vma);
> > > +			clear_full_ptes(vma->vm_mm, address, _pte, nr_ptes, false);
> > Be nice to use 'Liam's convention' of sticking `/* full = */ false)` on the
> > end here so we know what the false refers to.
>
> Sounds good, although in the other mail David mentioned a way to elide this
> so I will prefer that.

OK

>
> >
> > > +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
> > Kinda neat that folio_remove_rmap_pte() is just a define onto this with
> > nr_ptes == 1 :)
> >
> > >   			spin_unlock(ptl);
> > > -			free_folio_and_swap_cache(src);
> > > +			free_swap_cache(src);
> > > +			folio_put_refs(src, nr_ptes);
> > >   		}
> > > -	}
> > > +	} while (_pte += nr_ptes, address += nr_ptes * PAGE_SIZE, address != end);
> > >
> > >   	list_for_each_entry_safe(src, tmp, compound_pagelist, lru) {
> > >   		list_del(&src->lru);
> > > --
> > > 2.30.2
> > >
> > I can't see much wrong with this though, just 'yuck' at existing code
> > really :)

