* [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching
@ 2025-06-18 15:56 Dev Jain
2025-06-18 17:50 ` Lorenzo Stoakes
2025-06-23 6:40 ` Baolin Wang
0 siblings, 2 replies; 9+ messages in thread
From: Dev Jain @ 2025-06-18 15:56 UTC (permalink / raw)
To: akpm, david
Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, baohua, linux-mm, linux-kernel, Dev Jain
Use PTE batching to optimize collapse_pte_mapped_thp().
On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
Then, calling ptep_clear() for every pte will cause a TLB flush for every
contpte block. Instead, clear_full_ptes() does a
contpte_try_unfold_partial(), which will flush the TLB only for the
starting and ending contpte blocks (if any), should they partially overlap
with the range khugepaged is looking at.
For all arches, there should be a benefit from batching the atomic
operations on the mapcounts via folio_remove_rmap_ptes().
Note that we do not need to change the check
"if (folio_page(folio, i) != page)": if the i'th page of the folio is equal
to the first page of our batch, then pages i + 1, ..., i + nr_batch_ptes - 1
of the folio will be equal to the corresponding pages of our batch, since
the batch maps consecutive pages.
No issues were observed with mm-selftests.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
This is rebased on:
https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
If there is a v2 of either patch, I'll send them together.
mm/khugepaged.c | 38 +++++++++++++++++++++++++-------------
1 file changed, 25 insertions(+), 13 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 649ccb2670f8..7d37058eda5b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1499,15 +1499,16 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
bool install_pmd)
{
+ int nr_mapped_ptes = 0, nr_batch_ptes, result = SCAN_FAIL;
struct mmu_notifier_range range;
bool notified = false;
unsigned long haddr = addr & HPAGE_PMD_MASK;
+ unsigned long end = haddr + HPAGE_PMD_SIZE;
struct vm_area_struct *vma = vma_lookup(mm, haddr);
struct folio *folio;
pte_t *start_pte, *pte;
pmd_t *pmd, pgt_pmd;
spinlock_t *pml = NULL, *ptl;
- int nr_ptes = 0, result = SCAN_FAIL;
int i;
mmap_assert_locked(mm);
@@ -1620,12 +1621,17 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
goto abort;
+ i = 0, addr = haddr, pte = start_pte;
/* step 2: clear page table and adjust rmap */
- for (i = 0, addr = haddr, pte = start_pte;
- i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
+ do {
+ const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+ int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
+ struct folio *this_folio;
struct page *page;
pte_t ptent = ptep_get(pte);
+ nr_batch_ptes = 1;
+
if (pte_none(ptent))
continue;
/*
@@ -1639,6 +1645,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto abort;
}
page = vm_normal_page(vma, addr, ptent);
+ this_folio = page_folio(page);
+ if (folio_test_large(this_folio) && max_nr_batch_ptes != 1)
+ nr_batch_ptes = folio_pte_batch(this_folio, addr, pte, ptent,
+ max_nr_batch_ptes, flags, NULL, NULL, NULL);
+
if (folio_page(folio, i) != page)
goto abort;
@@ -1647,18 +1658,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
* TLB flush can be left until pmdp_collapse_flush() does it.
* PTE dirty? Shmem page is already dirty; file is read-only.
*/
- ptep_clear(mm, addr, pte);
- folio_remove_rmap_pte(folio, page, vma);
- nr_ptes++;
- }
+ clear_full_ptes(mm, addr, pte, nr_batch_ptes, false);
+ folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
+ nr_mapped_ptes += nr_batch_ptes;
+ } while (i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
+ pte += nr_batch_ptes, i < HPAGE_PMD_NR);
if (!pml)
spin_unlock(ptl);
/* step 3: set proper refcount and mm_counters. */
- if (nr_ptes) {
- folio_ref_sub(folio, nr_ptes);
- add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
+ if (nr_mapped_ptes) {
+ folio_ref_sub(folio, nr_mapped_ptes);
+ add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
}
/* step 4: remove empty page table */
@@ -1691,10 +1703,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
: SCAN_SUCCEED;
goto drop_folio;
abort:
- if (nr_ptes) {
+ if (nr_mapped_ptes) {
flush_tlb_mm(mm);
- folio_ref_sub(folio, nr_ptes);
- add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
+ folio_ref_sub(folio, nr_mapped_ptes);
+ add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
}
unlock:
if (start_pte)
--
2.30.2
* Re: [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching
2025-06-18 15:56 [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching Dev Jain
@ 2025-06-18 17:50 ` Lorenzo Stoakes
2025-06-19 3:48 ` Dev Jain
2025-06-23 6:40 ` Baolin Wang
1 sibling, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-06-18 17:50 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
baohua, linux-mm, linux-kernel
This series has a lot of duplication in it, especially vs. your other series [0],
but perhaps that's something we can tackle in a follow-up.
It'd be nice if we could find a way to de-duplicate some of the near-identical
code though.
But it's a 'maybe' on that because hey, the code in this file is hideous anyway
and needs a mega rework in any case...
[0]: https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
On Wed, Jun 18, 2025 at 09:26:08PM +0530, Dev Jain wrote:
> Use PTE batching to optimize collapse_pte_mapped_thp().
>
> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
> Then, calling ptep_clear() for every pte will cause a TLB flush for every
> contpte block. Instead, clear_full_ptes() does a
> contpte_try_unfold_partial() which will flush the TLB only for the (if any)
> starting and ending contpte block, if they partially overlap with the range
> khugepaged is looking at.
>
> For all arches, there should be a benefit due to batching atomic operations
> on mapcounts due to folio_remove_rmap_ptes().
>
> Note that we do not need to make a change to the check
> "if (folio_page(folio, i) != page)"; if i'th page of the folio is equal
> to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
> pages of the folio will be equal to the corresponding pages of our
> batch mapping consecutive pages.
>
> No issues were observed with mm-selftests.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>
> This is rebased on:
> https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
> If there will be a v2 of either version I'll send them together.
Hmmm I say again - slow down a bit :) there's no need to shoot out multiple
patches in a single day and you'd maybe avoid some of this kind of thing.
It's really preferable to avoid possible conflicts like this or at least reduce
the chance by having review on one thing done first.
I mean, why not just put both of these in a series for the respin? Just a
thought ;) in fact this is probably an ideal use of a series for that as you can
ensure you deal with both if any conflicts arise.
>
> mm/khugepaged.c | 38 +++++++++++++++++++++++++-------------
> 1 file changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 649ccb2670f8..7d37058eda5b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1499,15 +1499,16 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> bool install_pmd)
> {
> + int nr_mapped_ptes = 0, nr_batch_ptes, result = SCAN_FAIL;
NIT: I don't know why you're moving this, and while y'know it's kind of the fun
of subjective stuff I'd rather the assigned values and unassigned values be on
different lines (yes I know this codebase violates this with the pml, ptl below
but hey :P)
> struct mmu_notifier_range range;
> bool notified = false;
> unsigned long haddr = addr & HPAGE_PMD_MASK;
> + unsigned long end = haddr + HPAGE_PMD_SIZE;
> struct vm_area_struct *vma = vma_lookup(mm, haddr);
> struct folio *folio;
> pte_t *start_pte, *pte;
> pmd_t *pmd, pgt_pmd;
> spinlock_t *pml = NULL, *ptl;
> - int nr_ptes = 0, result = SCAN_FAIL;
> int i;
>
> mmap_assert_locked(mm);
> @@ -1620,12 +1621,17 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
> goto abort;
>
> + i = 0, addr = haddr, pte = start_pte;
This is horrid, no absolutely not. This is not how we do assignment in arbitrary
C code.
I don't know why we need a do/while here in general, I think the for loop should
still work ok no?
> /* step 2: clear page table and adjust rmap */
> - for (i = 0, addr = haddr, pte = start_pte;
> - i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> + do {
> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> + int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
> + struct folio *this_folio;
Hate this name. We are not C#... ;)
Just call it folio no? The 'this_' is redundant.
> struct page *page;
> pte_t ptent = ptep_get(pte);
>
> + nr_batch_ptes = 1;
> +
> if (pte_none(ptent))
> continue;
> /*
> @@ -1639,6 +1645,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> goto abort;
> }
> page = vm_normal_page(vma, addr, ptent);
> + this_folio = page_folio(page);
> + if (folio_test_large(this_folio) && max_nr_batch_ptes != 1)
> + nr_batch_ptes = folio_pte_batch(this_folio, addr, pte, ptent,
> + max_nr_batch_ptes, flags, NULL, NULL, NULL);
> +
> if (folio_page(folio, i) != page)
> goto abort;
>
> @@ -1647,18 +1658,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> * TLB flush can be left until pmdp_collapse_flush() does it.
> * PTE dirty? Shmem page is already dirty; file is read-only.
> */
> - ptep_clear(mm, addr, pte);
> - folio_remove_rmap_pte(folio, page, vma);
> - nr_ptes++;
> - }
> + clear_full_ptes(mm, addr, pte, nr_batch_ptes, false);
> + folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
> + nr_mapped_ptes += nr_batch_ptes;
> + } while (i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
> + pte += nr_batch_ptes, i < HPAGE_PMD_NR);
>
> if (!pml)
> spin_unlock(ptl);
>
> /* step 3: set proper refcount and mm_counters. */
> - if (nr_ptes) {
> - folio_ref_sub(folio, nr_ptes);
> - add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
> + if (nr_mapped_ptes) {
> + folio_ref_sub(folio, nr_mapped_ptes);
> + add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
> }
>
> /* step 4: remove empty page table */
> @@ -1691,10 +1703,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> : SCAN_SUCCEED;
> goto drop_folio;
> abort:
> - if (nr_ptes) {
> + if (nr_mapped_ptes) {
> flush_tlb_mm(mm);
> - folio_ref_sub(folio, nr_ptes);
> - add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
> + folio_ref_sub(folio, nr_mapped_ptes);
> + add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
> }
> unlock:
> if (start_pte)
> --
> 2.30.2
>
Logic looks generally sane though... :)
* Re: [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching
2025-06-18 17:50 ` Lorenzo Stoakes
@ 2025-06-19 3:48 ` Dev Jain
2025-06-19 12:55 ` Lorenzo Stoakes
0 siblings, 1 reply; 9+ messages in thread
From: Dev Jain @ 2025-06-19 3:48 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
baohua, linux-mm, linux-kernel
On 18/06/25 11:20 pm, Lorenzo Stoakes wrote:
> This series has a lot of duplication in it esp vs. your other series [0], but
> perhaps something we can tackle in a follow up.
>
> It'd be nice if we could find a way to de-duplicate some of the near-identical
> code though.
>
> But it's a 'maybe' on that because hey, the code in this file is hideous anyway
> and needs a mega rework in any case...
>
> [0]: https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
>
> On Wed, Jun 18, 2025 at 09:26:08PM +0530, Dev Jain wrote:
>> Use PTE batching to optimize collapse_pte_mapped_thp().
>>
>> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
>> Then, calling ptep_clear() for every pte will cause a TLB flush for every
>> contpte block. Instead, clear_full_ptes() does a
>> contpte_try_unfold_partial() which will flush the TLB only for the (if any)
>> starting and ending contpte block, if they partially overlap with the range
>> khugepaged is looking at.
>>
>> For all arches, there should be a benefit due to batching atomic operations
>> on mapcounts due to folio_remove_rmap_ptes().
>>
>> Note that we do not need to make a change to the check
>> "if (folio_page(folio, i) != page)"; if i'th page of the folio is equal
>> to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
>> pages of the folio will be equal to the corresponding pages of our
>> batch mapping consecutive pages.
>>
>> No issues were observed with mm-selftests.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>
>> This is rebased on:
>> https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
>> If there will be a v2 of either version I'll send them together.
> Hmmm I say again - slow down a bit :) there's no need to shoot out multiple
> patches in a single day and you'd maybe avoid some of this kind of thing.
>
> It's really preferable to avoid possible conflicts like this or at least reduce
> the chance by having review on one thing done first.
>
> I mean, why not just put both of these in a series for the respin? Just a
> thought ;) in fact this is probably an ideal use of a series for that as you can
> ensure you deal with both if any conflicts arise.
Sorry for this. I found these two patches independently, a couple of hours
apart, and I posted this one yesterday because, stupid me, I thought someone
would, after seeing my first patch, point out that the optimization could be
made in one more place. So I will send this and the other patch as a series
for v2 anyway.
>
>> mm/khugepaged.c | 38 +++++++++++++++++++++++++-------------
>> 1 file changed, 25 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 649ccb2670f8..7d37058eda5b 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1499,15 +1499,16 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>> int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>> bool install_pmd)
>> {
>> + int nr_mapped_ptes = 0, nr_batch_ptes, result = SCAN_FAIL;
> NIT: I don't know why you're moving this, and while y'know it's kind of the fun
> of subjective stuff I'd rather the assigned values and unassigned values be on
> different lines (yes I know this codebase violates this with the pml, ptl below
> but hey :P)
To maintain reverse-Xmas-tree order. I know the declarations are already
not in that order, but I wanted to make sure the part I am changing
maintains it :)
>
>> struct mmu_notifier_range range;
>> bool notified = false;
>> unsigned long haddr = addr & HPAGE_PMD_MASK;
>> + unsigned long end = haddr + HPAGE_PMD_SIZE;
>> struct vm_area_struct *vma = vma_lookup(mm, haddr);
>> struct folio *folio;
>> pte_t *start_pte, *pte;
>> pmd_t *pmd, pgt_pmd;
>> spinlock_t *pml = NULL, *ptl;
>> - int nr_ptes = 0, result = SCAN_FAIL;
>> int i;
>>
>> mmap_assert_locked(mm);
>> @@ -1620,12 +1621,17 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>> if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>> goto abort;
>>
>> + i = 0, addr = haddr, pte = start_pte;
> This is horrid, no absolutely not. This is not how we do assignment in arbitrary
> C code.
>
> I don't know why we need a do/while here in general, I think the for loop should
> still work ok no?
I don't recall now and I cannot even find it, but I have been following this
pattern for some time; by god, I cannot remember which pattern I am copying
from. Anyhow, I also hate the do/while thing, so I will change this to a
for loop.
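Something like the following, perhaps (completely untested, just to show
the loop shape I have in mind - the continue still goes through the
batched increments in the loop header):

	/* step 2: clear page table and adjust rmap */
	for (i = 0, addr = haddr, pte = start_pte; i < HPAGE_PMD_NR;
	     i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
	     pte += nr_batch_ptes) {
		int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
		struct page *page;
		pte_t ptent = ptep_get(pte);

		/* a none pte must still advance by exactly one entry */
		nr_batch_ptes = 1;

		if (pte_none(ptent))
			continue;

		/* ... same checks and batching as in this patch ... */

		clear_full_ptes(mm, addr, pte, nr_batch_ptes, false);
		folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
		nr_mapped_ptes += nr_batch_ptes;
	}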
>
>> /* step 2: clear page table and adjust rmap */
>> - for (i = 0, addr = haddr, pte = start_pte;
>> - i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>> + do {
>> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> + int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
>> + struct folio *this_folio;
> Hate this name. We are not C#... ;)
>
> Just call it folio no? The 'this_' is redundant.
There is already a struct folio *folio, which we retrieve from
filemap_lock_folio(). So I wanted to differentiate; I didn't think hard
about the name. How about mapped_folio?
>
>
>> struct page *page;
>> pte_t ptent = ptep_get(pte);
>>
>> + nr_batch_ptes = 1;
>> +
>> if (pte_none(ptent))
>> continue;
>> /*
>> @@ -1639,6 +1645,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>> goto abort;
>> }
>> page = vm_normal_page(vma, addr, ptent);
>> + this_folio = page_folio(page);
>> + if (folio_test_large(this_folio) && max_nr_batch_ptes != 1)
>> + nr_batch_ptes = folio_pte_batch(this_folio, addr, pte, ptent,
>> + max_nr_batch_ptes, flags, NULL, NULL, NULL);
>> +
>> if (folio_page(folio, i) != page)
>> goto abort;
>>
>> @@ -1647,18 +1658,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>> * TLB flush can be left until pmdp_collapse_flush() does it.
>> * PTE dirty? Shmem page is already dirty; file is read-only.
>> */
>> - ptep_clear(mm, addr, pte);
>> - folio_remove_rmap_pte(folio, page, vma);
>> - nr_ptes++;
>> - }
>> + clear_full_ptes(mm, addr, pte, nr_batch_ptes, false);
>> + folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
>> + nr_mapped_ptes += nr_batch_ptes;
>> + } while (i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
>> + pte += nr_batch_ptes, i < HPAGE_PMD_NR);
>>
>> if (!pml)
>> spin_unlock(ptl);
>>
>> /* step 3: set proper refcount and mm_counters. */
>> - if (nr_ptes) {
>> - folio_ref_sub(folio, nr_ptes);
>> - add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
>> + if (nr_mapped_ptes) {
>> + folio_ref_sub(folio, nr_mapped_ptes);
>> + add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>> }
>>
>> /* step 4: remove empty page table */
>> @@ -1691,10 +1703,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>> : SCAN_SUCCEED;
>> goto drop_folio;
>> abort:
>> - if (nr_ptes) {
>> + if (nr_mapped_ptes) {
>> flush_tlb_mm(mm);
>> - folio_ref_sub(folio, nr_ptes);
>> - add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
>> + folio_ref_sub(folio, nr_mapped_ptes);
>> + add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>> }
>> unlock:
>> if (start_pte)
>> --
>> 2.30.2
>>
> Logic looks generally sane though... :)
* Re: [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching
2025-06-19 3:48 ` Dev Jain
@ 2025-06-19 12:55 ` Lorenzo Stoakes
2025-06-23 14:01 ` David Hildenbrand
0 siblings, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-06-19 12:55 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
baohua, linux-mm, linux-kernel
On Thu, Jun 19, 2025 at 09:18:39AM +0530, Dev Jain wrote:
>
> On 18/06/25 11:20 pm, Lorenzo Stoakes wrote:
> > This series has a lot of duplication in it esp vs. your other series [0], but
> > perhaps something we can tackle in a follow up.
> >
> > It'd be nice if we could find a way to de-duplicate some of the near-identical
> > code though.
> >
> > But it's a 'maybe' on that because hey, the code in this file is hideous anyway
> > and needs a mega rework in any case...
> >
> > [0]: https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
> >
> > On Wed, Jun 18, 2025 at 09:26:08PM +0530, Dev Jain wrote:
> > > Use PTE batching to optimize collapse_pte_mapped_thp().
> > >
> > > On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
> > > Then, calling ptep_clear() for every pte will cause a TLB flush for every
> > > contpte block. Instead, clear_full_ptes() does a
> > > contpte_try_unfold_partial() which will flush the TLB only for the (if any)
> > > starting and ending contpte block, if they partially overlap with the range
> > > khugepaged is looking at.
> > >
> > > For all arches, there should be a benefit due to batching atomic operations
> > > on mapcounts due to folio_remove_rmap_ptes().
> > >
> > > Note that we do not need to make a change to the check
> > > "if (folio_page(folio, i) != page)"; if i'th page of the folio is equal
> > > to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
> > > pages of the folio will be equal to the corresponding pages of our
> > > batch mapping consecutive pages.
> > >
> > > No issues were observed with mm-selftests.
> > >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > >
> > > This is rebased on:
> > > https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
> > > If there will be a v2 of either version I'll send them together.
> > Hmmm I say again - slow down a bit :) there's no need to shoot out multiple
> > patches in a single day and you'd maybe avoid some of this kind of thing.
> >
> > It's really preferable to avoid possible conflicts like this or at least reduce
> > the chance by having review on one thing done first.
> >
> > I mean, why not just put both of these in a series for the respin? Just a
> > thought ;) in fact this is probably an ideal use of a series for that as you can
> > ensure you deal with both if any conflicts arise.
>
> Sorry for this. I found these two patches independently with a couple of
> hours difference, and I posted this one yesterday because stupid me thought
> someone will, after seeing my first patch, point out that the optimization
> can be made at one more place. So I will send this and the other patch as
> a series anyway for v2.
Sure, thanks, it'll just make life easier.
>
> >
> > > mm/khugepaged.c | 38 +++++++++++++++++++++++++-------------
> > > 1 file changed, 25 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 649ccb2670f8..7d37058eda5b 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1499,15 +1499,16 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > bool install_pmd)
> > > {
> > > + int nr_mapped_ptes = 0, nr_batch_ptes, result = SCAN_FAIL;
> > NIT: I don't know why you're moving this, and while y'know it's kind of the fun
> > of subjective stuff I'd rather the assigned values and unassigned values be on
> > different lines (yes I know this codebase violates this with the pml, ptl below
> > but hey :P)
>
> To maintain a reverse Xmas fashion. Now I know that the declarations are already
> not in an Xmas fashion, but I wanted to make sure the code I change maintains
> that for the part I am changing :)
We have no such requirement in mm nor do we particularly want to establish any
conventions around this.
I've already read enough stupid articles about unreasonable kernel devs insisting
on yada yada... let's just keep it sensible and logical! :)
>
> >
> > > struct mmu_notifier_range range;
> > > bool notified = false;
> > > unsigned long haddr = addr & HPAGE_PMD_MASK;
> > > + unsigned long end = haddr + HPAGE_PMD_SIZE;
> > > struct vm_area_struct *vma = vma_lookup(mm, haddr);
> > > struct folio *folio;
> > > pte_t *start_pte, *pte;
> > > pmd_t *pmd, pgt_pmd;
> > > spinlock_t *pml = NULL, *ptl;
> > > - int nr_ptes = 0, result = SCAN_FAIL;
> > > int i;
> > >
> > > mmap_assert_locked(mm);
> > > @@ -1620,12 +1621,17 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
> > > goto abort;
> > >
> > > + i = 0, addr = haddr, pte = start_pte;
> > This is horrid, no absolutely not. This is not how we do assignment in arbitrary
> > C code.
> >
> > I don't know why we need a do/while here in general, I think the for loop should
> > still work ok no?
>
> I don't recall now and I cannot even find it, but I have been following this
> pattern for some time, by god I cannot remember which pattern I am copying
> from. Anyhow I also hate the do/while thingy so I will change this to a
> for loop.
Thanks
>
> >
> > > /* step 2: clear page table and adjust rmap */
> > > - for (i = 0, addr = haddr, pte = start_pte;
> > > - i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> > > + do {
> > > + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > > + int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
> > > + struct folio *this_folio;
> > Hate this name. We are not C#... ;)
> >
> > Just call it folio no? The 'this_' is redundant.
>
> There is already a struct folio *folio which we retrieve from filemap_lock_folio.
> So wanted to differentiate, I didn't think hard about the name. How about mapped_folio?
Ah damn, ok that sounds better, thanks!
* Re: [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching
2025-06-18 15:56 [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching Dev Jain
2025-06-18 17:50 ` Lorenzo Stoakes
@ 2025-06-23 6:40 ` Baolin Wang
2025-06-23 7:16 ` Dev Jain
1 sibling, 1 reply; 9+ messages in thread
From: Baolin Wang @ 2025-06-23 6:40 UTC (permalink / raw)
To: Dev Jain, akpm, david
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, baohua,
linux-mm, linux-kernel
On 2025/6/18 23:56, Dev Jain wrote:
> Use PTE batching to optimize collapse_pte_mapped_thp().
>
> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
> Then, calling ptep_clear() for every pte will cause a TLB flush for every
> contpte block. Instead, clear_full_ptes() does a
> contpte_try_unfold_partial() which will flush the TLB only for the (if any)
> starting and ending contpte block, if they partially overlap with the range
> khugepaged is looking at.
>
> For all arches, there should be a benefit due to batching atomic operations
> on mapcounts due to folio_remove_rmap_ptes().
>
> Note that we do not need to make a change to the check
> "if (folio_page(folio, i) != page)"; if i'th page of the folio is equal
> to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
> pages of the folio will be equal to the corresponding pages of our
> batch mapping consecutive pages.
>
> No issues were observed with mm-selftests.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>
> This is rebased on:
> https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
> If there will be a v2 of either version I'll send them together.
>
> mm/khugepaged.c | 38 +++++++++++++++++++++++++-------------
> 1 file changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 649ccb2670f8..7d37058eda5b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1499,15 +1499,16 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> bool install_pmd)
> {
> + int nr_mapped_ptes = 0, nr_batch_ptes, result = SCAN_FAIL;
> struct mmu_notifier_range range;
> bool notified = false;
> unsigned long haddr = addr & HPAGE_PMD_MASK;
> + unsigned long end = haddr + HPAGE_PMD_SIZE;
> struct vm_area_struct *vma = vma_lookup(mm, haddr);
> struct folio *folio;
> pte_t *start_pte, *pte;
> pmd_t *pmd, pgt_pmd;
> spinlock_t *pml = NULL, *ptl;
> - int nr_ptes = 0, result = SCAN_FAIL;
> int i;
>
> mmap_assert_locked(mm);
> @@ -1620,12 +1621,17 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
> goto abort;
>
> + i = 0, addr = haddr, pte = start_pte;
> /* step 2: clear page table and adjust rmap */
> - for (i = 0, addr = haddr, pte = start_pte;
> - i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> + do {
> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> + int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
> + struct folio *this_folio;
> struct page *page;
> pte_t ptent = ptep_get(pte);
>
> + nr_batch_ptes = 1;
> +
> if (pte_none(ptent))
> continue;
> /*
> @@ -1639,6 +1645,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> goto abort;
> }
> page = vm_normal_page(vma, addr, ptent);
> + this_folio = page_folio(page);
> + if (folio_test_large(this_folio) && max_nr_batch_ptes != 1)
> + nr_batch_ptes = folio_pte_batch(this_folio, addr, pte, ptent,
> + max_nr_batch_ptes, flags, NULL, NULL, NULL);
> +
> if (folio_page(folio, i) != page)
> goto abort;
IMO, 'this_folio' is always equal to 'folio', right? Can't we just use 'folio'?
In addition, I think the folio_test_large() and max_nr_batch_ptes checks
are redundant, since 'folio' must be a PMD-sized large folio after the
'folio_page(folio, i) != page' check.
So I think we can move the 'nr_batch_ptes' calculation after the
folio_page() check; it would then be:
nr_batch_ptes = folio_pte_batch(folio, addr, pte, ptent,
max_nr_batch_ptes, flags, NULL, NULL, NULL);
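In context, that would look something like (untested):

	page = vm_normal_page(vma, addr, ptent);
	if (folio_page(folio, i) != page)
		goto abort;

	/* 'folio' is known to be a PMD-sized large folio at this point */
	nr_batch_ptes = folio_pte_batch(folio, addr, pte, ptent,
			max_nr_batch_ptes, flags, NULL, NULL, NULL);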
> @@ -1647,18 +1658,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> * TLB flush can be left until pmdp_collapse_flush() does it.
> * PTE dirty? Shmem page is already dirty; file is read-only.
> */
> - ptep_clear(mm, addr, pte);
> - folio_remove_rmap_pte(folio, page, vma);
> - nr_ptes++;
> - }
> + clear_full_ptes(mm, addr, pte, nr_batch_ptes, false);
> + folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
> + nr_mapped_ptes += nr_batch_ptes;
> + } while (i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
> + pte += nr_batch_ptes, i < HPAGE_PMD_NR);
>
> if (!pml)
> spin_unlock(ptl);
>
> /* step 3: set proper refcount and mm_counters. */
> - if (nr_ptes) {
> - folio_ref_sub(folio, nr_ptes);
> - add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
> + if (nr_mapped_ptes) {
> + folio_ref_sub(folio, nr_mapped_ptes);
> + add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
> }
>
> /* step 4: remove empty page table */
> @@ -1691,10 +1703,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> : SCAN_SUCCEED;
> goto drop_folio;
> abort:
> - if (nr_ptes) {
> + if (nr_mapped_ptes) {
> flush_tlb_mm(mm);
> - folio_ref_sub(folio, nr_ptes);
> - add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
> + folio_ref_sub(folio, nr_mapped_ptes);
> + add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
> }
> unlock:
> if (start_pte)
* Re: [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching
2025-06-23 6:40 ` Baolin Wang
@ 2025-06-23 7:16 ` Dev Jain
2025-06-23 7:21 ` Baolin Wang
0 siblings, 1 reply; 9+ messages in thread
From: Dev Jain @ 2025-06-23 7:16 UTC (permalink / raw)
To: Baolin Wang, akpm, david
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, baohua,
linux-mm, linux-kernel
On 23/06/25 12:10 pm, Baolin Wang wrote:
>
>
> On 2025/6/18 23:56, Dev Jain wrote:
>> Use PTE batching to optimize collapse_pte_mapped_thp().
>>
>> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for
>> collapse.
>> Then, calling ptep_clear() for every pte will cause a TLB flush for
>> every
>> contpte block. Instead, clear_full_ptes() does a
>> contpte_try_unfold_partial() which will flush the TLB only for the
>> (if any)
>> starting and ending contpte block, if they partially overlap with the
>> range
>> khugepaged is looking at.
>>
>> For all arches, there should be a benefit due to batching atomic
>> operations
>> on mapcounts due to folio_remove_rmap_ptes().
>>
>> Note that we do not need to make a change to the check
>> "if (folio_page(folio, i) != page)"; if i'th page of the folio is equal
>> to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
>> pages of the folio will be equal to the corresponding pages of our
>> batch mapping consecutive pages.
>>
>> No issues were observed with mm-selftests.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>
>> This is rebased on:
>> https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
>> If there will be a v2 of either version I'll send them together.
>>
>> mm/khugepaged.c | 38 +++++++++++++++++++++++++-------------
>> 1 file changed, 25 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 649ccb2670f8..7d37058eda5b 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1499,15 +1499,16 @@ static int set_huge_pmd(struct vm_area_struct
>> *vma, unsigned long addr,
>> int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>> bool install_pmd)
>> {
>> + int nr_mapped_ptes = 0, nr_batch_ptes, result = SCAN_FAIL;
>> struct mmu_notifier_range range;
>> bool notified = false;
>> unsigned long haddr = addr & HPAGE_PMD_MASK;
>> + unsigned long end = haddr + HPAGE_PMD_SIZE;
>> struct vm_area_struct *vma = vma_lookup(mm, haddr);
>> struct folio *folio;
>> pte_t *start_pte, *pte;
>> pmd_t *pmd, pgt_pmd;
>> spinlock_t *pml = NULL, *ptl;
>> - int nr_ptes = 0, result = SCAN_FAIL;
>> int i;
>> mmap_assert_locked(mm);
>> @@ -1620,12 +1621,17 @@ int collapse_pte_mapped_thp(struct mm_struct
>> *mm, unsigned long addr,
>> if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>> goto abort;
>> + i = 0, addr = haddr, pte = start_pte;
>> /* step 2: clear page table and adjust rmap */
>> - for (i = 0, addr = haddr, pte = start_pte;
>> - i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>> + do {
>> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> + int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
>> + struct folio *this_folio;
>> struct page *page;
>> pte_t ptent = ptep_get(pte);
>> + nr_batch_ptes = 1;
>> +
>> if (pte_none(ptent))
>> continue;
>> /*
>> @@ -1639,6 +1645,11 @@ int collapse_pte_mapped_thp(struct mm_struct
>> *mm, unsigned long addr,
>> goto abort;
>> }
>> page = vm_normal_page(vma, addr, ptent);
>> + this_folio = page_folio(page);
>> + if (folio_test_large(this_folio) && max_nr_batch_ptes != 1)
>> + nr_batch_ptes = folio_pte_batch(this_folio, addr, pte,
>> ptent,
>> + max_nr_batch_ptes, flags, NULL, NULL, NULL);
>> +
>> if (folio_page(folio, i) != page)
>> goto abort;
>
> IMO, 'this_folio' is always equal 'folio', right? Can't we just use
> 'folio'?
I don't think so. What if we have mremapped some bytes of this PMD range
to point to another folio?
>
> In addition, I think the folio_test_large() and max_nr_batch_ptes
> checks are redundant, since the 'folio' must be PMD-sized large folio
> after 'folio_page(folio, i) != page' check.
As an improvement, we can at least do likely(folio_test_large()), since
this is very likely.
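i.e. something like (with 'this_folio' renamed to 'mapped_folio' as
discussed in the other subthread):

	if (likely(folio_test_large(mapped_folio)) && max_nr_batch_ptes != 1)
		nr_batch_ptes = folio_pte_batch(mapped_folio, addr, pte, ptent,
				max_nr_batch_ptes, flags, NULL, NULL, NULL);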
>
> So I think we can move the 'nr_batch_ptes' calculation after the
> folio_page() check, then shoule be:
>
> nr_batch_ptes = folio_pte_batch(folio, addr, pte, ptent,
> max_nr_batch_ptes, flags, NULL, NULL, NULL);
>
>> @@ -1647,18 +1658,19 @@ int collapse_pte_mapped_thp(struct mm_struct
>> *mm, unsigned long addr,
>> * TLB flush can be left until pmdp_collapse_flush() does it.
>> * PTE dirty? Shmem page is already dirty; file is read-only.
>> */
>> - ptep_clear(mm, addr, pte);
>> - folio_remove_rmap_pte(folio, page, vma);
>> - nr_ptes++;
>> - }
>> + clear_full_ptes(mm, addr, pte, nr_batch_ptes, false);
>> + folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
>> + nr_mapped_ptes += nr_batch_ptes;
>> + } while (i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
>> + pte += nr_batch_ptes, i < HPAGE_PMD_NR);
>> if (!pml)
>> spin_unlock(ptl);
>> /* step 3: set proper refcount and mm_counters. */
>> - if (nr_ptes) {
>> - folio_ref_sub(folio, nr_ptes);
>> - add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
>> + if (nr_mapped_ptes) {
>> + folio_ref_sub(folio, nr_mapped_ptes);
>> + add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>> }
>> /* step 4: remove empty page table */
>> @@ -1691,10 +1703,10 @@ int collapse_pte_mapped_thp(struct mm_struct
>> *mm, unsigned long addr,
>> : SCAN_SUCCEED;
>> goto drop_folio;
>> abort:
>> - if (nr_ptes) {
>> + if (nr_mapped_ptes) {
>> flush_tlb_mm(mm);
>> - folio_ref_sub(folio, nr_ptes);
>> - add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
>> + folio_ref_sub(folio, nr_mapped_ptes);
>> + add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>> }
>> unlock:
>> if (start_pte)
>
* Re: [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching
2025-06-23 7:16 ` Dev Jain
@ 2025-06-23 7:21 ` Baolin Wang
2025-06-23 7:25 ` Dev Jain
0 siblings, 1 reply; 9+ messages in thread
From: Baolin Wang @ 2025-06-23 7:21 UTC (permalink / raw)
To: Dev Jain, akpm, david
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, baohua,
linux-mm, linux-kernel
On 2025/6/23 15:16, Dev Jain wrote:
>
> On 23/06/25 12:10 pm, Baolin Wang wrote:
>>
>>
>> On 2025/6/18 23:56, Dev Jain wrote:
>>> Use PTE batching to optimize collapse_pte_mapped_thp().
>>>
>>> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for
>>> collapse.
>>> Then, calling ptep_clear() for every pte will cause a TLB flush for
>>> every
>>> contpte block. Instead, clear_full_ptes() does a
>>> contpte_try_unfold_partial() which will flush the TLB only for the
>>> (if any)
>>> starting and ending contpte block, if they partially overlap with the
>>> range
>>> khugepaged is looking at.
>>>
>>> For all arches, there should be a benefit due to batching atomic
>>> operations
>>> on mapcounts due to folio_remove_rmap_ptes().
>>>
>>> Note that we do not need to make a change to the check
>>> "if (folio_page(folio, i) != page)"; if i'th page of the folio is equal
>>> to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
>>> pages of the folio will be equal to the corresponding pages of our
>>> batch mapping consecutive pages.
>>>
>>> No issues were observed with mm-selftests.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>
>>> This is rebased on:
>>> https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
>>> If there will be a v2 of either version I'll send them together.
>>>
>>> mm/khugepaged.c | 38 +++++++++++++++++++++++++-------------
>>> 1 file changed, 25 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 649ccb2670f8..7d37058eda5b 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1499,15 +1499,16 @@ static int set_huge_pmd(struct vm_area_struct
>>> *vma, unsigned long addr,
>>> int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>> bool install_pmd)
>>> {
>>> + int nr_mapped_ptes = 0, nr_batch_ptes, result = SCAN_FAIL;
>>> struct mmu_notifier_range range;
>>> bool notified = false;
>>> unsigned long haddr = addr & HPAGE_PMD_MASK;
>>> + unsigned long end = haddr + HPAGE_PMD_SIZE;
>>> struct vm_area_struct *vma = vma_lookup(mm, haddr);
>>> struct folio *folio;
>>> pte_t *start_pte, *pte;
>>> pmd_t *pmd, pgt_pmd;
>>> spinlock_t *pml = NULL, *ptl;
>>> - int nr_ptes = 0, result = SCAN_FAIL;
>>> int i;
>>> mmap_assert_locked(mm);
>>> @@ -1620,12 +1621,17 @@ int collapse_pte_mapped_thp(struct mm_struct
>>> *mm, unsigned long addr,
>>> if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>>> goto abort;
>>> + i = 0, addr = haddr, pte = start_pte;
>>> /* step 2: clear page table and adjust rmap */
>>> - for (i = 0, addr = haddr, pte = start_pte;
>>> - i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>>> + do {
>>> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>> + int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
>>> + struct folio *this_folio;
>>> struct page *page;
>>> pte_t ptent = ptep_get(pte);
>>> + nr_batch_ptes = 1;
>>> +
>>> if (pte_none(ptent))
>>> continue;
>>> /*
>>> @@ -1639,6 +1645,11 @@ int collapse_pte_mapped_thp(struct mm_struct
>>> *mm, unsigned long addr,
>>> goto abort;
>>> }
>>> page = vm_normal_page(vma, addr, ptent);
>>> + this_folio = page_folio(page);
>>> + if (folio_test_large(this_folio) && max_nr_batch_ptes != 1)
>>> + nr_batch_ptes = folio_pte_batch(this_folio, addr, pte,
>>> ptent,
>>> + max_nr_batch_ptes, flags, NULL, NULL, NULL);
>>> +
>>> if (folio_page(folio, i) != page)
>>> goto abort;
>>
>> IMO, 'this_folio' is always equal 'folio', right? Can't we just use
>> 'folio'?
>
> I don't think so. What if we have mremapped some bytes of this PMD range
>
> to point to another folio.
Then 'folio_page(folio, i) != page' can catch this, which is why I
suggest you move the 'nr_batch_ptes' calculation after the folio_page()
check.
>> In addition, I think the folio_test_large() and max_nr_batch_ptes
>> checks are redundant, since the 'folio' must be PMD-sized large folio
>> after 'folio_page(folio, i) != page' check.
>
> As an improvement we can at least do likely(folio_test_large()) since
> this is very likely.
>
>
>>
>> So I think we can move the 'nr_batch_ptes' calculation after the
>> folio_page() check, then shoule be:
>>
>> nr_batch_ptes = folio_pte_batch(folio, addr, pte, ptent,
>> max_nr_batch_ptes, flags, NULL, NULL, NULL);
>>
>>> @@ -1647,18 +1658,19 @@ int collapse_pte_mapped_thp(struct mm_struct
>>> *mm, unsigned long addr,
>>> * TLB flush can be left until pmdp_collapse_flush() does it.
>>> * PTE dirty? Shmem page is already dirty; file is read-only.
>>> */
>>> - ptep_clear(mm, addr, pte);
>>> - folio_remove_rmap_pte(folio, page, vma);
>>> - nr_ptes++;
>>> - }
>>> + clear_full_ptes(mm, addr, pte, nr_batch_ptes, false);
>>> + folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
>>> + nr_mapped_ptes += nr_batch_ptes;
>>> + } while (i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
>>> + pte += nr_batch_ptes, i < HPAGE_PMD_NR);
>>> if (!pml)
>>> spin_unlock(ptl);
>>> /* step 3: set proper refcount and mm_counters. */
>>> - if (nr_ptes) {
>>> - folio_ref_sub(folio, nr_ptes);
>>> - add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
>>> + if (nr_mapped_ptes) {
>>> + folio_ref_sub(folio, nr_mapped_ptes);
>>> + add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>>> }
>>> /* step 4: remove empty page table */
>>> @@ -1691,10 +1703,10 @@ int collapse_pte_mapped_thp(struct mm_struct
>>> *mm, unsigned long addr,
>>> : SCAN_SUCCEED;
>>> goto drop_folio;
>>> abort:
>>> - if (nr_ptes) {
>>> + if (nr_mapped_ptes) {
>>> flush_tlb_mm(mm);
>>> - folio_ref_sub(folio, nr_ptes);
>>> - add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
>>> + folio_ref_sub(folio, nr_mapped_ptes);
>>> + add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>>> }
>>> unlock:
>>> if (start_pte)
>>
* Re: [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching
2025-06-23 7:21 ` Baolin Wang
@ 2025-06-23 7:25 ` Dev Jain
0 siblings, 0 replies; 9+ messages in thread
From: Dev Jain @ 2025-06-23 7:25 UTC (permalink / raw)
To: Baolin Wang, akpm, david
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, baohua,
linux-mm, linux-kernel
On 23/06/25 12:51 pm, Baolin Wang wrote:
>
>
> On 2025/6/23 15:16, Dev Jain wrote:
>>
>> On 23/06/25 12:10 pm, Baolin Wang wrote:
>>>
>>>
>>> On 2025/6/18 23:56, Dev Jain wrote:
>>>> Use PTE batching to optimize collapse_pte_mapped_thp().
>>>>
>>>> On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for
>>>> collapse.
>>>> Then, calling ptep_clear() for every pte will cause a TLB flush for
>>>> every
>>>> contpte block. Instead, clear_full_ptes() does a
>>>> contpte_try_unfold_partial() which will flush the TLB only for the
>>>> (if any)
>>>> starting and ending contpte block, if they partially overlap with
>>>> the range
>>>> khugepaged is looking at.
>>>>
>>>> For all arches, there should be a benefit due to batching atomic
>>>> operations
>>>> on mapcounts due to folio_remove_rmap_ptes().
>>>>
>>>> Note that we do not need to make a change to the check
>>>> "if (folio_page(folio, i) != page)"; if i'th page of the folio is
>>>> equal
>>>> to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
>>>> pages of the folio will be equal to the corresponding pages of our
>>>> batch mapping consecutive pages.
>>>>
>>>> No issues were observed with mm-selftests.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>
>>>> This is rebased on:
>>>> https://lore.kernel.org/all/20250618102607.10551-1-dev.jain@arm.com/
>>>> If there will be a v2 of either version I'll send them together.
>>>>
>>>> mm/khugepaged.c | 38 +++++++++++++++++++++++++-------------
>>>> 1 file changed, 25 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 649ccb2670f8..7d37058eda5b 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1499,15 +1499,16 @@ static int set_huge_pmd(struct
>>>> vm_area_struct *vma, unsigned long addr,
>>>> int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long
>>>> addr,
>>>> bool install_pmd)
>>>> {
>>>> + int nr_mapped_ptes = 0, nr_batch_ptes, result = SCAN_FAIL;
>>>> struct mmu_notifier_range range;
>>>> bool notified = false;
>>>> unsigned long haddr = addr & HPAGE_PMD_MASK;
>>>> + unsigned long end = haddr + HPAGE_PMD_SIZE;
>>>> struct vm_area_struct *vma = vma_lookup(mm, haddr);
>>>> struct folio *folio;
>>>> pte_t *start_pte, *pte;
>>>> pmd_t *pmd, pgt_pmd;
>>>> spinlock_t *pml = NULL, *ptl;
>>>> - int nr_ptes = 0, result = SCAN_FAIL;
>>>> int i;
>>>> mmap_assert_locked(mm);
>>>> @@ -1620,12 +1621,17 @@ int collapse_pte_mapped_thp(struct
>>>> mm_struct *mm, unsigned long addr,
>>>> if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>>>> goto abort;
>>>> + i = 0, addr = haddr, pte = start_pte;
>>>> /* step 2: clear page table and adjust rmap */
>>>> - for (i = 0, addr = haddr, pte = start_pte;
>>>> - i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>>>> + do {
>>>> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>> + int max_nr_batch_ptes = (end - addr) >> PAGE_SHIFT;
>>>> + struct folio *this_folio;
>>>> struct page *page;
>>>> pte_t ptent = ptep_get(pte);
>>>> + nr_batch_ptes = 1;
>>>> +
>>>> if (pte_none(ptent))
>>>> continue;
>>>> /*
>>>> @@ -1639,6 +1645,11 @@ int collapse_pte_mapped_thp(struct mm_struct
>>>> *mm, unsigned long addr,
>>>> goto abort;
>>>> }
>>>> page = vm_normal_page(vma, addr, ptent);
>>>> + this_folio = page_folio(page);
>>>> + if (folio_test_large(this_folio) && max_nr_batch_ptes != 1)
>>>> + nr_batch_ptes = folio_pte_batch(this_folio, addr, pte,
>>>> ptent,
>>>> + max_nr_batch_ptes, flags, NULL, NULL, NULL);
>>>> +
>>>> if (folio_page(folio, i) != page)
>>>> goto abort;
>>>
>>> IMO, 'this_folio' is always equal 'folio', right? Can't we just use
>>> 'folio'?
>>
>> I don't think so. What if we have mremapped some bytes of this PMD range
>>
>> to point to another folio.
>
> Then 'folio_page(folio, i) != page' can catch this, which is why I
> suggest you move the 'nr_batch_ptes' calculation after the
> folio_page() check.
Ah. 'page' will be part of the folio, so the folio derived from 'page' is
equal to 'folio', which is already known to be large. Thanks.
>
>>> In addition, I think the folio_test_large() and max_nr_batch_ptes
>>> checks are redundant, since the 'folio' must be PMD-sized large
>>> folio after 'folio_page(folio, i) != page' check.
>>
>> As an improvement we can at least do likely(folio_test_large()) since
>> this is very likely.
>>
>>
>>>
>>> So I think we can move the 'nr_batch_ptes' calculation after the
>>> folio_page() check, then shoule be:
>>>
>>> nr_batch_ptes = folio_pte_batch(folio, addr, pte, ptent,
>>> max_nr_batch_ptes, flags, NULL, NULL, NULL);
>>>
>>>> @@ -1647,18 +1658,19 @@ int collapse_pte_mapped_thp(struct
>>>> mm_struct *mm, unsigned long addr,
>>>> * TLB flush can be left until pmdp_collapse_flush() does
>>>> it.
>>>> * PTE dirty? Shmem page is already dirty; file is
>>>> read-only.
>>>> */
>>>> - ptep_clear(mm, addr, pte);
>>>> - folio_remove_rmap_pte(folio, page, vma);
>>>> - nr_ptes++;
>>>> - }
>>>> + clear_full_ptes(mm, addr, pte, nr_batch_ptes, false);
>>>> + folio_remove_rmap_ptes(folio, page, nr_batch_ptes, vma);
>>>> + nr_mapped_ptes += nr_batch_ptes;
>>>> + } while (i += nr_batch_ptes, addr += nr_batch_ptes * PAGE_SIZE,
>>>> + pte += nr_batch_ptes, i < HPAGE_PMD_NR);
>>>> if (!pml)
>>>> spin_unlock(ptl);
>>>> /* step 3: set proper refcount and mm_counters. */
>>>> - if (nr_ptes) {
>>>> - folio_ref_sub(folio, nr_ptes);
>>>> - add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
>>>> + if (nr_mapped_ptes) {
>>>> + folio_ref_sub(folio, nr_mapped_ptes);
>>>> + add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>>>> }
>>>> /* step 4: remove empty page table */
>>>> @@ -1691,10 +1703,10 @@ int collapse_pte_mapped_thp(struct
>>>> mm_struct *mm, unsigned long addr,
>>>> : SCAN_SUCCEED;
>>>> goto drop_folio;
>>>> abort:
>>>> - if (nr_ptes) {
>>>> + if (nr_mapped_ptes) {
>>>> flush_tlb_mm(mm);
>>>> - folio_ref_sub(folio, nr_ptes);
>>>> - add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
>>>> + folio_ref_sub(folio, nr_mapped_ptes);
>>>> + add_mm_counter(mm, mm_counter_file(folio), -nr_mapped_ptes);
>>>> }
>>>> unlock:
>>>> if (start_pte)
>>>
>
* Re: [PATCH] khugepaged: Optimize collapse_pte_mapped_thp() for large folios by PTE batching
2025-06-19 12:55 ` Lorenzo Stoakes
@ 2025-06-23 14:01 ` David Hildenbrand
0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2025-06-23 14:01 UTC (permalink / raw)
To: Lorenzo Stoakes, Dev Jain
Cc: akpm, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
baohua, linux-mm, linux-kernel
>>>
>>>> mm/khugepaged.c | 38 +++++++++++++++++++++++++-------------
>>>> 1 file changed, 25 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 649ccb2670f8..7d37058eda5b 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1499,15 +1499,16 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>>>> int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>> bool install_pmd)
>>>> {
>>>> + int nr_mapped_ptes = 0, nr_batch_ptes, result = SCAN_FAIL;
>>> NIT: I don't know why you're moving this, and while y'know it's kind of the fun
>>> of subjective stuff I'd rather the assigned values and unassigned values be on
>>> different lines (yes I know this codebase violates this with the pml, ptl below
>>> but hey :P)
>>
>> To maintain a reverse Xmas fashion. Now I know that the declarations are already
>> not in an Xmas fashion, but I wanted to make sure the code I change maintains
>> that for the part I am changing :)
>
> We have no such requirement in mm nor do we particularly want to establish any
> conventions around this.
Well, if we already have reverse xmas tree ordering, we tend to maintain it
... or use it when adding new code.
So in MM it's actually very common to do that. But we avoid doing it
just for the sake of it when touching some code.
--
Cheers,
David / dhildenb