* [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
@ 2026-02-09 14:07 ` Baolin Wang
2026-02-09 15:25 ` David Hildenbrand (Arm)
2026-03-06 21:07 ` Barry Song
2026-02-09 14:07 ` [PATCH v6 2/5] arm64: mm: factor out the address and ptep alignment into a new helper Baolin Wang
` (4 subsequent siblings)
5 siblings, 2 replies; 38+ messages in thread
From: Baolin Wang @ 2026-02-09 14:07 UTC (permalink / raw)
To: akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
baolin.wang, linux-mm, linux-arm-kernel, linux-kernel
Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.
Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
an optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. We can extend this to perform batched operations
for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
of the young flags and flushing TLB entries, thereby improving performance
during large folio reclamation. And it will be overridden by the architecture
that implements a more efficient batch operation in the following patches.
While we are at it, rename ptep_clear_flush_young_notify() to
clear_flush_young_ptes_notify() to indicate that this is a batch operation.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
include/linux/mmu_notifier.h | 9 +++++----
include/linux/pgtable.h | 35 +++++++++++++++++++++++++++++++++++
mm/rmap.c | 28 +++++++++++++++++++++++++---
3 files changed, 65 insertions(+), 7 deletions(-)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d1094c2d5fb6..07a2bbaf86e9 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
range->owner = owner;
}
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
+#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr) \
({ \
int __young; \
struct vm_area_struct *___vma = __vma; \
unsigned long ___address = __address; \
- __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
+ unsigned int ___nr = __nr; \
+ __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
__young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
___address, \
___address + \
- PAGE_SIZE); \
+ ___nr * PAGE_SIZE); \
__young; \
})
@@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
#define mmu_notifier_range_update_to_read_only(r) false
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define clear_flush_young_ptes_notify clear_flush_young_ptes
#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
#define ptep_clear_young_notify ptep_test_and_clear_young
#define pmdp_clear_young_notify pmdp_test_and_clear_young
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 21b67d937555..a50df42a893f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1068,6 +1068,41 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
}
#endif
+#ifndef clear_flush_young_ptes
+/**
+ * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
+ * folio as old and flush the TLB.
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear access bit.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_clear_flush_young().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock. The PTEs map consecutive
+ * pages that belong to the same folio. The PTEs are all in the same PMD.
+ */
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, unsigned int nr)
+{
+ int young = 0;
+
+ for (;;) {
+ young |= ptep_clear_flush_young(vma, addr, ptep);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ }
+
+ return young;
+}
+#endif
+
/*
* On some architectures hardware does not set page access bit when accessing
* memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/rmap.c b/mm/rmap.c
index a5a284f2a83d..8807f8a7df28 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -913,9 +913,11 @@ static bool folio_referenced_one(struct folio *folio,
struct folio_referenced_arg *pra = arg;
DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
int ptes = 0, referenced = 0;
+ unsigned int nr;
while (page_vma_mapped_walk(&pvmw)) {
address = pvmw.address;
+ nr = 1;
if (vma->vm_flags & VM_LOCKED) {
ptes++;
@@ -960,9 +962,21 @@ static bool folio_referenced_one(struct folio *folio,
if (lru_gen_look_around(&pvmw))
referenced++;
} else if (pvmw.pte) {
- if (ptep_clear_flush_young_notify(vma, address,
- pvmw.pte))
+ if (folio_test_large(folio)) {
+ unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
+ unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
+ pte_t pteval = ptep_get(pvmw.pte);
+
+ nr = folio_pte_batch(folio, pvmw.pte,
+ pteval, max_nr);
+ }
+
+ ptes += nr;
+ if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr))
referenced++;
+ /* Skip the batched PTEs */
+ pvmw.pte += nr - 1;
+ pvmw.address += (nr - 1) * PAGE_SIZE;
} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
if (pmdp_clear_flush_young_notify(vma, address,
pvmw.pmd))
@@ -972,7 +986,15 @@ static bool folio_referenced_one(struct folio *folio,
WARN_ON_ONCE(1);
}
- pra->mapcount--;
+ pra->mapcount -= nr;
+ /*
+ * If we are sure that we batched the entire folio,
+ * we can just optimize and stop right here.
+ */
+ if (ptes == pvmw.nr_pages) {
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
}
if (referenced)
--
2.47.3
^ permalink raw reply related [flat|nested] 38+ messages in thread* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-02-09 14:07 ` [PATCH v6 1/5] mm: rmap: support batched checks of the references " Baolin Wang
@ 2026-02-09 15:25 ` David Hildenbrand (Arm)
2026-03-06 21:07 ` Barry Song
1 sibling, 0 replies; 38+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09 15:25 UTC (permalink / raw)
To: Baolin Wang, akpm, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On 2/9/26 15:07, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> of the young flags and flushing TLB entries, thereby improving performance
> during large folio reclamation. And it will be overridden by the architecture
> that implements a more efficient batch operation in the following patches.
>
> While we are at it, rename ptep_clear_flush_young_notify() to
> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
Thanks!
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-02-09 14:07 ` [PATCH v6 1/5] mm: rmap: support batched checks of the references " Baolin Wang
2026-02-09 15:25 ` David Hildenbrand (Arm)
@ 2026-03-06 21:07 ` Barry Song
2026-03-07 2:22 ` Baolin Wang
1 sibling, 1 reply; 38+ messages in thread
From: Barry Song @ 2026-03-06 21:07 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, dev.jain, linux-mm, linux-arm-kernel, linux-kernel
On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> of the young flags and flushing TLB entries, thereby improving performance
> during large folio reclamation. And it will be overridden by the architecture
> that implements a more efficient batch operation in the following patches.
>
> While we are at it, rename ptep_clear_flush_young_notify() to
> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
LGTM,
Reviewed-by: Barry Song <baohua@kernel.org>
> ---
> include/linux/mmu_notifier.h | 9 +++++----
> include/linux/pgtable.h | 35 +++++++++++++++++++++++++++++++++++
> mm/rmap.c | 28 +++++++++++++++++++++++++---
> 3 files changed, 65 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d1094c2d5fb6..07a2bbaf86e9 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
> range->owner = owner;
> }
>
> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr) \
> ({ \
> int __young; \
> struct vm_area_struct *___vma = __vma; \
> unsigned long ___address = __address; \
> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
> + unsigned int ___nr = __nr; \
> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
> ___address, \
> ___address + \
> - PAGE_SIZE); \
> + ___nr * PAGE_SIZE); \
> __young; \
> })
>
> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>
> #define mmu_notifier_range_update_to_read_only(r) false
>
> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
> +#define clear_flush_young_ptes_notify clear_flush_young_ptes
> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
> #define ptep_clear_young_notify ptep_test_and_clear_young
> #define pmdp_clear_young_notify pmdp_test_and_clear_young
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 21b67d937555..a50df42a893f 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1068,6 +1068,41 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> }
> #endif
>
> +#ifndef clear_flush_young_ptes
> +/**
> + * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
> + * folio as old and flush the TLB.
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear access bit.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_clear_flush_young().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock. The PTEs map consecutive
> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> + */
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr)
> +{
> + int young = 0;
> +
> + for (;;) {
> + young |= ptep_clear_flush_young(vma, addr, ptep);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +
> + return young;
> +}
> +#endif
We might have an opportunity to batch the TLB synchronization,
using flush_tlb_range() instead of calling flush_tlb_page()
one by one. Not sure the benefit would be significant though,
especially if only one entry among nr has the young bit set.
Best Regards
Barry
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-06 21:07 ` Barry Song
@ 2026-03-07 2:22 ` Baolin Wang
2026-03-07 8:02 ` Barry Song
0 siblings, 1 reply; 38+ messages in thread
From: Baolin Wang @ 2026-03-07 2:22 UTC (permalink / raw)
To: Barry Song
Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, dev.jain, linux-mm, linux-arm-kernel, linux-kernel
On 3/7/26 5:07 AM, Barry Song wrote:
> On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> Currently, folio_referenced_one() always checks the young flag for each PTE
>> sequentially, which is inefficient for large folios. This inefficiency is
>> especially noticeable when reclaiming clean file-backed large folios, where
>> folio_referenced() is observed as a significant performance hotspot.
>>
>> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
>> an optimization to clear the young flags for PTEs within a contiguous range.
>> However, this is not sufficient. We can extend this to perform batched operations
>> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>>
>> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
>> of the young flags and flushing TLB entries, thereby improving performance
>> during large folio reclamation. And it will be overridden by the architecture
>> that implements a more efficient batch operation in the following patches.
>>
>> While we are at it, rename ptep_clear_flush_young_notify() to
>> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
>>
>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> LGTM,
>
> Reviewed-by: Barry Song <baohua@kernel.org>
Thanks.
>> ---
>> include/linux/mmu_notifier.h | 9 +++++----
>> include/linux/pgtable.h | 35 +++++++++++++++++++++++++++++++++++
>> mm/rmap.c | 28 +++++++++++++++++++++++++---
>> 3 files changed, 65 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d1094c2d5fb6..07a2bbaf86e9 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>> range->owner = owner;
>> }
>>
>> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
>> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr) \
>> ({ \
>> int __young; \
>> struct vm_area_struct *___vma = __vma; \
>> unsigned long ___address = __address; \
>> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
>> + unsigned int ___nr = __nr; \
>> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
>> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
>> ___address, \
>> ___address + \
>> - PAGE_SIZE); \
>> + ___nr * PAGE_SIZE); \
>> __young; \
>> })
>>
>> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>>
>> #define mmu_notifier_range_update_to_read_only(r) false
>>
>> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
>> +#define clear_flush_young_ptes_notify clear_flush_young_ptes
>> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
>> #define ptep_clear_young_notify ptep_test_and_clear_young
>> #define pmdp_clear_young_notify pmdp_test_and_clear_young
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 21b67d937555..a50df42a893f 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1068,6 +1068,41 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> }
>> #endif
>>
>> +#ifndef clear_flush_young_ptes
>> +/**
>> + * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
>> + * folio as old and flush the TLB.
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear access bit.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_clear_flush_young().
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>> + * some PTEs might be write-protected.
>> + *
>> + * Context: The caller holds the page table lock. The PTEs map consecutive
>> + * pages that belong to the same folio. The PTEs are all in the same PMD.
>> + */
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep, unsigned int nr)
>> +{
>> + int young = 0;
>> +
>> + for (;;) {
>> + young |= ptep_clear_flush_young(vma, addr, ptep);
>> + if (--nr == 0)
>> + break;
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + }
>> +
>> + return young;
>> +}
>> +#endif
>
> We might have an opportunity to batch the TLB synchronization,
> using flush_tlb_range() instead of calling flush_tlb_page()
> one by one. Not sure the benefit would be significant though,
> especially if only one entry among nr has the young bit set.
Yes. In addition, this will involve many architectures’ implementations
and their differing TLB flush mechanisms, so it’s difficult to make a
reasonable per-architecture measurement. If any architecture has a more
efficient flush method, I’d prefer to implement an architecture‑specific
clear_flush_young_ptes().
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-07 2:22 ` Baolin Wang
@ 2026-03-07 8:02 ` Barry Song
2026-03-10 1:37 ` Baolin Wang
0 siblings, 1 reply; 38+ messages in thread
From: Barry Song @ 2026-03-07 8:02 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, dev.jain, linux-mm, linux-arm-kernel, linux-kernel
On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 3/7/26 5:07 AM, Barry Song wrote:
> > On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >>
> >> Currently, folio_referenced_one() always checks the young flag for each PTE
> >> sequentially, which is inefficient for large folios. This inefficiency is
> >> especially noticeable when reclaiming clean file-backed large folios, where
> >> folio_referenced() is observed as a significant performance hotspot.
> >>
> >> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
> >> an optimization to clear the young flags for PTEs within a contiguous range.
> >> However, this is not sufficient. We can extend this to perform batched operations
> >> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
> >>
> >> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> >> of the young flags and flushing TLB entries, thereby improving performance
> >> during large folio reclamation. And it will be overridden by the architecture
> >> that implements a more efficient batch operation in the following patches.
> >>
> >> While we are at it, rename ptep_clear_flush_young_notify() to
> >> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
> >>
> >> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >
> > LGTM,
> >
> > Reviewed-by: Barry Song <baohua@kernel.org>
>
> Thanks.
>
> >> ---
> >> include/linux/mmu_notifier.h | 9 +++++----
> >> include/linux/pgtable.h | 35 +++++++++++++++++++++++++++++++++++
> >> mm/rmap.c | 28 +++++++++++++++++++++++++---
> >> 3 files changed, 65 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> >> index d1094c2d5fb6..07a2bbaf86e9 100644
> >> --- a/include/linux/mmu_notifier.h
> >> +++ b/include/linux/mmu_notifier.h
> >> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
> >> range->owner = owner;
> >> }
> >>
> >> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
> >> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr) \
> >> ({ \
> >> int __young; \
> >> struct vm_area_struct *___vma = __vma; \
> >> unsigned long ___address = __address; \
> >> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
> >> + unsigned int ___nr = __nr; \
> >> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
> >> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
> >> ___address, \
> >> ___address + \
> >> - PAGE_SIZE); \
> >> + ___nr * PAGE_SIZE); \
> >> __young; \
> >> })
> >>
> >> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
> >>
> >> #define mmu_notifier_range_update_to_read_only(r) false
> >>
> >> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
> >> +#define clear_flush_young_ptes_notify clear_flush_young_ptes
> >> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
> >> #define ptep_clear_young_notify ptep_test_and_clear_young
> >> #define pmdp_clear_young_notify pmdp_test_and_clear_young
> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >> index 21b67d937555..a50df42a893f 100644
> >> --- a/include/linux/pgtable.h
> >> +++ b/include/linux/pgtable.h
> >> @@ -1068,6 +1068,41 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> >> }
> >> #endif
> >>
> >> +#ifndef clear_flush_young_ptes
> >> +/**
> >> + * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
> >> + * folio as old and flush the TLB.
> >> + * @vma: The virtual memory area the pages are mapped into.
> >> + * @addr: Address the first page is mapped at.
> >> + * @ptep: Page table pointer for the first entry.
> >> + * @nr: Number of entries to clear access bit.
> >> + *
> >> + * May be overridden by the architecture; otherwise, implemented as a simple
> >> + * loop over ptep_clear_flush_young().
> >> + *
> >> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> >> + * some PTEs might be write-protected.
> >> + *
> >> + * Context: The caller holds the page table lock. The PTEs map consecutive
> >> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> >> + */
> >> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> >> + unsigned long addr, pte_t *ptep, unsigned int nr)
> >> +{
> >> + int young = 0;
> >> +
> >> + for (;;) {
> >> + young |= ptep_clear_flush_young(vma, addr, ptep);
> >> + if (--nr == 0)
> >> + break;
> >> + ptep++;
> >> + addr += PAGE_SIZE;
> >> + }
> >> +
> >> + return young;
> >> +}
> >> +#endif
> >
> > We might have an opportunity to batch the TLB synchronization,
> > using flush_tlb_range() instead of calling flush_tlb_page()
> > one by one. Not sure the benefit would be significant though,
> > especially if only one entry among nr has the young bit set.
>
> Yes. In addition, this will involve many architectures’ implementations
> and their differing TLB flush mechanisms, so it’s difficult to make a
> reasonable per-architecture measurement. If any architecture has a more
> efficient flush method, I’d prefer to implement an architecture‑specific
> clear_flush_young_ptes().
Right! Since TLBI is usually quite expensive, I wonder if a generic
implementation for architectures lacking clear_flush_young_ptes()
might benefit from something like the below (just a very rough idea):
int clear_flush_young_ptes(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep, unsigned int nr)
{
unsigned long curr_addr = addr;
int young = 0;
while (nr--) {
young |= ptep_test_and_clear_young(vma, curr_addr, ptep);
ptep++;
curr_addr += PAGE_SIZE;
}
if (young)
flush_tlb_range(vma, addr, curr_addr);
return young;
}
Thanks
Barry
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-07 8:02 ` Barry Song
@ 2026-03-10 1:37 ` Baolin Wang
2026-03-10 8:17 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 38+ messages in thread
From: Baolin Wang @ 2026-03-10 1:37 UTC (permalink / raw)
To: Barry Song
Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, dev.jain, linux-mm, linux-arm-kernel, linux-kernel
On 3/7/26 4:02 PM, Barry Song wrote:
> On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 3/7/26 5:07 AM, Barry Song wrote:
>>> On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>
>>>> Currently, folio_referenced_one() always checks the young flag for each PTE
>>>> sequentially, which is inefficient for large folios. This inefficiency is
>>>> especially noticeable when reclaiming clean file-backed large folios, where
>>>> folio_referenced() is observed as a significant performance hotspot.
>>>>
>>>> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
>>>> an optimization to clear the young flags for PTEs within a contiguous range.
>>>> However, this is not sufficient. We can extend this to perform batched operations
>>>> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>>>>
>>>> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
>>>> of the young flags and flushing TLB entries, thereby improving performance
>>>> during large folio reclamation. And it will be overridden by the architecture
>>>> that implements a more efficient batch operation in the following patches.
>>>>
>>>> While we are at it, rename ptep_clear_flush_young_notify() to
>>>> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
>>>>
>>>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>
>>> LGTM,
>>>
>>> Reviewed-by: Barry Song <baohua@kernel.org>
>>
>> Thanks.
>>
>>>> ---
>>>> include/linux/mmu_notifier.h | 9 +++++----
>>>> include/linux/pgtable.h | 35 +++++++++++++++++++++++++++++++++++
>>>> mm/rmap.c | 28 +++++++++++++++++++++++++---
>>>> 3 files changed, 65 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>>>> index d1094c2d5fb6..07a2bbaf86e9 100644
>>>> --- a/include/linux/mmu_notifier.h
>>>> +++ b/include/linux/mmu_notifier.h
>>>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>>>> range->owner = owner;
>>>> }
>>>>
>>>> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
>>>> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr) \
>>>> ({ \
>>>> int __young; \
>>>> struct vm_area_struct *___vma = __vma; \
>>>> unsigned long ___address = __address; \
>>>> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
>>>> + unsigned int ___nr = __nr; \
>>>> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
>>>> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
>>>> ___address, \
>>>> ___address + \
>>>> - PAGE_SIZE); \
>>>> + ___nr * PAGE_SIZE); \
>>>> __young; \
>>>> })
>>>>
>>>> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>>>>
>>>> #define mmu_notifier_range_update_to_read_only(r) false
>>>>
>>>> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
>>>> +#define clear_flush_young_ptes_notify clear_flush_young_ptes
>>>> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
>>>> #define ptep_clear_young_notify ptep_test_and_clear_young
>>>> #define pmdp_clear_young_notify pmdp_test_and_clear_young
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index 21b67d937555..a50df42a893f 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -1068,6 +1068,41 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>> }
>>>> #endif
>>>>
>>>> +#ifndef clear_flush_young_ptes
>>>> +/**
>>>> + * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
>>>> + * folio as old and flush the TLB.
>>>> + * @vma: The virtual memory area the pages are mapped into.
>>>> + * @addr: Address the first page is mapped at.
>>>> + * @ptep: Page table pointer for the first entry.
>>>> + * @nr: Number of entries to clear access bit.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>>> + * loop over ptep_clear_flush_young().
>>>> + *
>>>> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>>>> + * some PTEs might be write-protected.
>>>> + *
>>>> + * Context: The caller holds the page table lock. The PTEs map consecutive
>>>> + * pages that belong to the same folio. The PTEs are all in the same PMD.
>>>> + */
>>>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr)
>>>> +{
>>>> + int young = 0;
>>>> +
>>>> + for (;;) {
>>>> + young |= ptep_clear_flush_young(vma, addr, ptep);
>>>> + if (--nr == 0)
>>>> + break;
>>>> + ptep++;
>>>> + addr += PAGE_SIZE;
>>>> + }
>>>> +
>>>> + return young;
>>>> +}
>>>> +#endif
>>>
>>> We might have an opportunity to batch the TLB synchronization,
>>> using flush_tlb_range() instead of calling flush_tlb_page()
>>> one by one. Not sure the benefit would be significant though,
>>> especially if only one entry among nr has the young bit set.
>>
>> Yes. In addition, this will involve many architectures’ implementations
>> and their differing TLB flush mechanisms, so it’s difficult to make a
>> reasonable per-architecture measurement. If any architecture has a more
>> efficient flush method, I’d prefer to implement an architecture‑specific
>> clear_flush_young_ptes().
>
> Right! Since TLBI is usually quite expensive, I wonder if a generic
> implementation for architectures lacking clear_flush_young_ptes()
> might benefit from something like the below (just a very rough idea):
>
> int clear_flush_young_ptes(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep, unsigned int nr)
> {
> unsigned long curr_addr = addr;
> int young = 0;
>
> while (nr--) {
> young |= ptep_test_and_clear_young(vma, curr_addr, ptep);
> ptep++;
> curr_addr += PAGE_SIZE;
> }
>
> if (young)
> flush_tlb_range(vma, addr, curr_addr);
> return young;
> }
I understand your point. I’m concerned that I can’t test this patch on
every architecture to validate the benefits. Anyway, let me try this on
my X86 machine first.
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-10 1:37 ` Baolin Wang
@ 2026-03-10 8:17 ` David Hildenbrand (Arm)
2026-03-16 6:25 ` Baolin Wang
0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-10 8:17 UTC (permalink / raw)
To: Baolin Wang, Barry Song
Cc: akpm, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, dev.jain, linux-mm, linux-arm-kernel, linux-kernel
On 3/10/26 02:37, Baolin Wang wrote:
>
>
> On 3/7/26 4:02 PM, Barry Song wrote:
>> On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>>
>>>
>>>
>>> Thanks.
>>>
>>>
>>> Yes. In addition, this will involve many architectures’ implementations
>>> and their differing TLB flush mechanisms, so it’s difficult to make a
>>> reasonable per-architecture measurement. If any architecture has a more
>>> efficient flush method, I’d prefer to implement an architecture‑specific
>>> clear_flush_young_ptes().
>>
>> Right! Since TLBI is usually quite expensive, I wonder if a generic
>> implementation for architectures lacking clear_flush_young_ptes()
>> might benefit from something like the below (just a very rough idea):
>>
>> int clear_flush_young_ptes(struct vm_area_struct *vma,
>> unsigned long addr, pte_t *ptep, unsigned int nr)
>> {
>> unsigned long curr_addr = addr;
>> int young = 0;
>>
>> while (nr--) {
>> young |= ptep_test_and_clear_young(vma, curr_addr,
>> ptep);
>> ptep++;
>> curr_addr += PAGE_SIZE;
>> }
>>
>> if (young)
>> flush_tlb_range(vma, addr, curr_addr);
>> return young;
>> }
>
> I understand your point. I’m concerned that I can’t test this patch on
> every architecture to validate the benefits. Anyway, let me try this on
> my X86 machine first.
In any case, please make that a follow-up patch :)
--
Cheers,
David
^ permalink raw reply [flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-10 8:17 ` David Hildenbrand (Arm)
@ 2026-03-16 6:25 ` Baolin Wang
2026-03-16 14:15 ` David Hildenbrand (Arm)
2026-03-17 7:30 ` Barry Song
0 siblings, 2 replies; 38+ messages in thread
From: Baolin Wang @ 2026-03-16 6:25 UTC (permalink / raw)
To: David Hildenbrand (Arm), Barry Song
Cc: akpm, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, dev.jain, linux-mm, linux-arm-kernel, linux-kernel
On 3/10/26 4:17 PM, David Hildenbrand (Arm) wrote:
> On 3/10/26 02:37, Baolin Wang wrote:
>>
>>
>> On 3/7/26 4:02 PM, Barry Song wrote:
>>> On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> Yes. In addition, this will involve many architectures’ implementations
>>>> and their differing TLB flush mechanisms, so it’s difficult to make a
>>>> reasonable per-architecture measurement. If any architecture has a more
>>>> efficient flush method, I’d prefer to implement an architecture‑specific
>>>> clear_flush_young_ptes().
>>>
>>> Right! Since TLBI is usually quite expensive, I wonder if a generic
>>> implementation for architectures lacking clear_flush_young_ptes()
>>> might benefit from something like the below (just a very rough idea):
>>>
>>> int clear_flush_young_ptes(struct vm_area_struct *vma,
>>> unsigned long addr, pte_t *ptep, unsigned int nr)
>>> {
>>> unsigned long curr_addr = addr;
>>> int young = 0;
>>>
>>> while (nr--) {
>>> young |= ptep_test_and_clear_young(vma, curr_addr,
>>> ptep);
>>> ptep++;
>>> curr_addr += PAGE_SIZE;
>>> }
>>>
>>> if (young)
>>> flush_tlb_range(vma, addr, curr_addr);
>>> return young;
>>> }
>>
>> I understand your point. I’m concerned that I can’t test this patch on
>> every architecture to validate the benefits. Anyway, let me try this on
>> my X86 machine first.
>
> In any case, please make that a follow-up patch :)
Sure. However, after investigating RISC‑V and x86, I found that
ptep_clear_flush_young() does not flush the TLB on these architectures:
int ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
/*
* On x86 CPUs, clearing the accessed bit without a TLB flush
* doesn't cause data corruption. [ It could cause incorrect
* page aging and the (mistaken) reclaim of hot pages, but the
* chance of that should be relatively low. ]
*
* So as a performance optimization don't flush the TLB when
* clearing the accessed bit, it will eventually be flushed by
* a context switch or a VM operation anyway. [ In the rare
* event of it not getting flushed for a long time the delay
* shouldn't really matter because there's no real memory
* pressure for swapout to react to. ]
*/
return ptep_test_and_clear_young(vma, address, ptep);
}
I don't have access to other architectures, so I think we can postpone
this optimization unless someone is interested in optimizing the TLB flush.
^ permalink raw reply [flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-16 6:25 ` Baolin Wang
@ 2026-03-16 14:15 ` David Hildenbrand (Arm)
2026-03-25 14:36 ` Lorenzo Stoakes (Oracle)
2026-03-17 7:30 ` Barry Song
1 sibling, 1 reply; 38+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-16 14:15 UTC (permalink / raw)
To: Baolin Wang, Barry Song
Cc: akpm, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, dev.jain, linux-mm, linux-arm-kernel, linux-kernel
On 3/16/26 07:25, Baolin Wang wrote:
>
>
> On 3/10/26 4:17 PM, David Hildenbrand (Arm) wrote:
>> On 3/10/26 02:37, Baolin Wang wrote:
>>>
>>>
>>>
>>> I understand your point. I’m concerned that I can’t test this patch on
>>> every architecture to validate the benefits. Anyway, let me try this on
>>> my X86 machine first.
>>
>> In any case, please make that a follow-up patch :)
>
> Sure. However, after investigating RISC‑V and x86, I found that
> ptep_clear_flush_young() does not flush the TLB on these architectures:
>
> int ptep_clear_flush_young(struct vm_area_struct *vma,
> unsigned long address, pte_t *ptep)
> {
> /*
> * On x86 CPUs, clearing the accessed bit without a TLB flush
> * doesn't cause data corruption. [ It could cause incorrect
> * page aging and the (mistaken) reclaim of hot pages, but the
> * chance of that should be relatively low. ]
> *
> * So as a performance optimization don't flush the TLB when
> * clearing the accessed bit, it will eventually be flushed by
> * a context switch or a VM operation anyway. [ In the rare
> * event of it not getting flushed for a long time the delay
> * shouldn't really matter because there's no real memory
> * pressure for swapout to react to. ]
> */
> return ptep_test_and_clear_young(vma, address, ptep);
> }
You'd probably want an arch helper then, that tells you whether
a flush_tlb_range() after ptep_test_and_clear_young() is required.
Or some special flush_tlb_range() helper.
I agree that it requires more work.
--
Cheers,
David
^ permalink raw reply [flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-16 14:15 ` David Hildenbrand (Arm)
@ 2026-03-25 14:36 ` Lorenzo Stoakes (Oracle)
2026-03-25 14:58 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-25 14:36 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Baolin Wang, Barry Song, akpm, catalin.marinas, will,
lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On Mon, Mar 16, 2026 at 03:15:18PM +0100, David Hildenbrand (Arm) wrote:
> On 3/16/26 07:25, Baolin Wang wrote:
> >
> >
> > On 3/10/26 4:17 PM, David Hildenbrand (Arm) wrote:
> >> On 3/10/26 02:37, Baolin Wang wrote:
> >>>
> >>>
> >>>
> >>> I understand your point. I’m concerned that I can’t test this patch on
> >>> every architecture to validate the benefits. Anyway, let me try this on
> >>> my X86 machine first.
> >>
> >> In any case, please make that a follow-up patch :)
> >
> > Sure. However, after investigating RISC‑V and x86, I found that
> > ptep_clear_flush_young() does not flush the TLB on these architectures:
> >
> > int ptep_clear_flush_young(struct vm_area_struct *vma,
> > unsigned long address, pte_t *ptep)
> > {
> > /*
> > * On x86 CPUs, clearing the accessed bit without a TLB flush
> > * doesn't cause data corruption. [ It could cause incorrect
> > * page aging and the (mistaken) reclaim of hot pages, but the
> > * chance of that should be relatively low. ]
> > *
> > * So as a performance optimization don't flush the TLB when
> > * clearing the accessed bit, it will eventually be flushed by
> > * a context switch or a VM operation anyway. [ In the rare
> > * event of it not getting flushed for a long time the delay
> > * shouldn't really matter because there's no real memory
> > * pressure for swapout to react to. ]
> > */
> > return ptep_test_and_clear_young(vma, address, ptep);
> > }
>
> You'd probably want an arch helper then, that tells you whether
> a flush_tlb_range() after ptep_test_and_clear_young() is required.
>
> Or some special flush_tlb_range() helper.
>
> I agree that it requires more work.
Sorry unclear here - does the series need more work or does a follow up patch
need more work?
As this is in mm-stable afaict.
Thanks, Lorenzo
>
> --
> Cheers,
>
> David
^ permalink raw reply [flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-25 14:36 ` Lorenzo Stoakes (Oracle)
@ 2026-03-25 14:58 ` David Hildenbrand (Arm)
2026-03-25 15:06 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-25 14:58 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Baolin Wang, Barry Song, akpm, catalin.marinas, will,
lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On 3/25/26 15:36, Lorenzo Stoakes (Oracle) wrote:
> On Mon, Mar 16, 2026 at 03:15:18PM +0100, David Hildenbrand (Arm) wrote:
>> On 3/16/26 07:25, Baolin Wang wrote:
>>>
>>>
>>>
>>> Sure. However, after investigating RISC‑V and x86, I found that
>>> ptep_clear_flush_young() does not flush the TLB on these architectures:
>>>
>>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>>> unsigned long address, pte_t *ptep)
>>> {
>>> /*
>>> * On x86 CPUs, clearing the accessed bit without a TLB flush
>>> * doesn't cause data corruption. [ It could cause incorrect
>>> * page aging and the (mistaken) reclaim of hot pages, but the
>>> * chance of that should be relatively low. ]
>>> *
>>> * So as a performance optimization don't flush the TLB when
>>> * clearing the accessed bit, it will eventually be flushed by
>>> * a context switch or a VM operation anyway. [ In the rare
>>> * event of it not getting flushed for a long time the delay
>>> * shouldn't really matter because there's no real memory
>>> * pressure for swapout to react to. ]
>>> */
>>> return ptep_test_and_clear_young(vma, address, ptep);
>>> }
>>
>> You'd probably want an arch helper then, that tells you whether
>> a flush_tlb_range() after ptep_test_and_clear_young() is required.
>>
>> Or some special flush_tlb_range() helper.
>>
>> I agree that it requires more work.
>
> Sorry unclear here - does the series need more work or does a follow up patch
> need more work?
Follow up!
--
Cheers,
David
^ permalink raw reply [flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-25 14:58 ` David Hildenbrand (Arm)
@ 2026-03-25 15:06 ` Lorenzo Stoakes (Oracle)
2026-03-25 15:30 ` Andrew Morton
2026-03-26 1:47 ` Baolin Wang
0 siblings, 2 replies; 38+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-25 15:06 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Baolin Wang, Barry Song, akpm, catalin.marinas, will,
lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On Wed, Mar 25, 2026 at 03:58:36PM +0100, David Hildenbrand (Arm) wrote:
> On 3/25/26 15:36, Lorenzo Stoakes (Oracle) wrote:
> > On Mon, Mar 16, 2026 at 03:15:18PM +0100, David Hildenbrand (Arm) wrote:
> >> On 3/16/26 07:25, Baolin Wang wrote:
> >>>
> >>>
> >>>
> >>> Sure. However, after investigating RISC‑V and x86, I found that
> >>> ptep_clear_flush_young() does not flush the TLB on these architectures:
> >>>
> >>> int ptep_clear_flush_young(struct vm_area_struct *vma,
> >>> unsigned long address, pte_t *ptep)
> >>> {
> >>> /*
> >>> * On x86 CPUs, clearing the accessed bit without a TLB flush
> >>> * doesn't cause data corruption. [ It could cause incorrect
> >>> * page aging and the (mistaken) reclaim of hot pages, but the
> >>> * chance of that should be relatively low. ]
> >>> *
> >>> * So as a performance optimization don't flush the TLB when
> >>> * clearing the accessed bit, it will eventually be flushed by
> >>> * a context switch or a VM operation anyway. [ In the rare
> >>> * event of it not getting flushed for a long time the delay
> >>> * shouldn't really matter because there's no real memory
> >>> * pressure for swapout to react to. ]
> >>> */
> >>> return ptep_test_and_clear_young(vma, address, ptep);
> >>> }
> >>
> >> You'd probably want an arch helper then, that tells you whether
> >> a flush_tlb_range() after ptep_test_and_clear_young() is required.
> >>
> >> Or some special flush_tlb_range() helper.
> >>
> >> I agree that it requires more work.
> >
> > Sorry unclear here - does the series need more work or does a follow up patch
> > need more work?
>
> Follow up!
Ok good as in mm-stable now. Sadly means I don't get to review it but there we
go.
>
> --
> Cheers,
>
> David
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-25 15:06 ` Lorenzo Stoakes (Oracle)
@ 2026-03-25 15:30 ` Andrew Morton
2026-03-25 15:32 ` Lorenzo Stoakes (Oracle)
2026-03-26 1:47 ` Baolin Wang
1 sibling, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2026-03-25 15:30 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: David Hildenbrand (Arm), Baolin Wang, Barry Song, catalin.marinas,
will, lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
surenb, mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On Wed, 25 Mar 2026 15:06:26 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
> > > Sorry unclear here - does the series need more work or does a follow up patch
> > > need more work?
> >
> > Follow up!
>
> Ok good as in mm-stable now. Sadly means I don't get to review it but there we
> go.
Well, this was sent 6 weeks ago, at v6!
But please go ahead. If review is very bad, there's git-rebase
(happens sometimes) or some followup fix and a minor bisection hole
(happens a bit more often).
And those followup fixes will trickle in all the way out to 7.1-rc7,
via the hotfixes path. This happens a great deal - about half(?) of
hotfixes pertain to material we added in the current -rc cycle.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-25 15:30 ` Andrew Morton
@ 2026-03-25 15:32 ` Lorenzo Stoakes (Oracle)
2026-03-25 16:23 ` Andrew Morton
0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-25 15:32 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand (Arm), Baolin Wang, Barry Song, catalin.marinas,
will, lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
surenb, mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On Wed, Mar 25, 2026 at 08:30:16AM -0700, Andrew Morton wrote:
> On Wed, 25 Mar 2026 15:06:26 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
>
> > > > Sorry unclear here - does the series need more work or does a follow up patch
> > > > need more work?
> > >
> > > Follow up!
> >
> > Ok good as in mm-stable now. Sadly means I don't get to review it but there we
> > go.
>
> Well, this was sent 6 weeks ago, at v6!
Yup I know, I've struggled to clear my backlog.
>
> But please go ahead. If review is very bad, there's git-rebase
> (happens sometimes) or some followup fix and a minor bisection hole
> (happens a bit more often).
>
> And those followup fixes will trickle in all the way out to 7.1-rc7,
> via the hotfixes path. This happens a great deal - about half(?) of
> hotfixes pertain to material we added in the current -rc cycle.
>
Well given the other tags it's unlikely that would happen, so it doesn't feel
really worthwhile at this point.
Really I guess I'd like to know the timing of when mm-stable is taken so I can
know the deadline for stuff like this.
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-25 15:32 ` Lorenzo Stoakes (Oracle)
@ 2026-03-25 16:23 ` Andrew Morton
2026-03-25 16:28 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2026-03-25 16:23 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: David Hildenbrand (Arm), Baolin Wang, Barry Song, catalin.marinas,
will, lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
surenb, mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On Wed, 25 Mar 2026 15:32:50 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
> Really I guess I'd like to know the timing of when mm-stable is taken so I can
> know the deadline for stuff like this.
Yeah, as mentioned elsewhere I'll add some words into the series file to
try to describe this.
But the start of the migration is quite variable (-rc4 through -rc6) so
I'll aim to keep people updated in the weekly "mm.git review status"
emails.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-25 16:23 ` Andrew Morton
@ 2026-03-25 16:28 ` Lorenzo Stoakes (Oracle)
2026-03-25 18:43 ` Andrew Morton
0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-25 16:28 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand (Arm), Baolin Wang, Barry Song, catalin.marinas,
will, lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
surenb, mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On Wed, Mar 25, 2026 at 09:23:00AM -0700, Andrew Morton wrote:
> On Wed, 25 Mar 2026 15:32:50 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
>
> > Really I guess I'd like to know the timing of when mm-stable is taken so I can
> > know the deadline for stuff like this.
>
> Yeah, as mentioned elsewhere I'll add some words into the series file to
> try to describe this.
>
> But the start of the migration is quite variable (-rc4 through -rc6) so
> I'll aim to keep people updated in the weekly "mm.git review status"
> emails.
>
Thanks. Wasn't aware that was a thing actually, but I see
https://lore.kernel.org/linux-mm/20260323202941.08ddf2b0411501cae801ab4c@linux-foundation.org/
now.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-25 16:28 ` Lorenzo Stoakes (Oracle)
@ 2026-03-25 18:43 ` Andrew Morton
2026-03-25 18:58 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2026-03-25 18:43 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: David Hildenbrand (Arm), Baolin Wang, Barry Song, catalin.marinas,
will, lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
surenb, mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On Wed, 25 Mar 2026 16:28:18 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
> On Wed, Mar 25, 2026 at 09:23:00AM -0700, Andrew Morton wrote:
> > On Wed, 25 Mar 2026 15:32:50 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
> >
> > > Really I guess I'd like to know the timing of when mm-stable is taken so I can
> > > know the deadline for stuff like this.
> >
> > Yeah, as mentioned elsewhere I'll add some words into the series file to
> > try to describe this.
> >
> > But the start of the migration is quite variable (-rc4 through -rc6) so
> > I'll aim to keep people updated in the weekly "mm.git review status"
> > emails.
> >
>
> Thanks. Wasn't aware that was a thing actually, but I see
> https://lore.kernel.org/linux-mm/20260323202941.08ddf2b0411501cae801ab4c@linux-foundation.org/
> now.
Well now I'm all sad.
Yeah, I started doing this Feb 2, shall attempt to sustain this weekly
after we've hit -rc4 ish. I expect I'll be adding more waffle to these as
things occur to me. A general overview of where the tree is at.
I can Bcc: people if asked ;) lmk.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-25 18:43 ` Andrew Morton
@ 2026-03-25 18:58 ` Lorenzo Stoakes (Oracle)
0 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-25 18:58 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand (Arm), Baolin Wang, Barry Song, catalin.marinas,
will, lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
surenb, mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On Wed, Mar 25, 2026 at 11:43:05AM -0700, Andrew Morton wrote:
> On Wed, 25 Mar 2026 16:28:18 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
>
> > On Wed, Mar 25, 2026 at 09:23:00AM -0700, Andrew Morton wrote:
> > > On Wed, 25 Mar 2026 15:32:50 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
> > >
> > > > Really I guess I'd like to know the timing of when mm-stable is taken so I can
> > > > know the deadline for stuff like this.
> > >
> > > Yeah, as mentioned elsewhere I'll add some words into the series file to
> > > try to describe this.
> > >
> > > But the start of the migration is quite variable (-rc4 through -rc6) so
> > > I'll aim to keep people updated in the weekly "mm.git review status"
> > > emails.
> > >
> >
> > Thanks. Wasn't aware that was a thing actually, but I see
> > https://lore.kernel.org/linux-mm/20260323202941.08ddf2b0411501cae801ab4c@linux-foundation.org/
> > now.
>
> Well now I'm all sad.
>
> Yeah, I started doing this Feb 2, shall attempt to sustain this weekly
> after we've hit -rc4 ish. I expect I'll be adding more waffle to these as
> things occur to me. A general overview of where the tree is at.
>
> I can Bcc: people if asked ;) lmk.
>
Yes please!
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-25 15:06 ` Lorenzo Stoakes (Oracle)
2026-03-25 15:30 ` Andrew Morton
@ 2026-03-26 1:47 ` Baolin Wang
2026-03-26 5:31 ` Barry Song
2026-03-26 11:10 ` Lorenzo Stoakes (Oracle)
1 sibling, 2 replies; 38+ messages in thread
From: Baolin Wang @ 2026-03-26 1:47 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle), David Hildenbrand (Arm)
Cc: Barry Song, akpm, catalin.marinas, will, lorenzo.stoakes,
ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, willy, dev.jain, linux-mm, linux-arm-kernel,
linux-kernel
On 3/25/26 11:06 PM, Lorenzo Stoakes (Oracle) wrote:
> On Wed, Mar 25, 2026 at 03:58:36PM +0100, David Hildenbrand (Arm) wrote:
>> On 3/25/26 15:36, Lorenzo Stoakes (Oracle) wrote:
>>> On Mon, Mar 16, 2026 at 03:15:18PM +0100, David Hildenbrand (Arm) wrote:
>>>> On 3/16/26 07:25, Baolin Wang wrote:
>>>>>
>>>>>
>>>>>
>>>>> Sure. However, after investigating RISC‑V and x86, I found that
>>>>> ptep_clear_flush_young() does not flush the TLB on these architectures:
>>>>>
>>>>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>> unsigned long address, pte_t *ptep)
>>>>> {
>>>>> /*
>>>>> * On x86 CPUs, clearing the accessed bit without a TLB flush
>>>>> * doesn't cause data corruption. [ It could cause incorrect
>>>>> * page aging and the (mistaken) reclaim of hot pages, but the
>>>>> * chance of that should be relatively low. ]
>>>>> *
>>>>> * So as a performance optimization don't flush the TLB when
>>>>> * clearing the accessed bit, it will eventually be flushed by
>>>>> * a context switch or a VM operation anyway. [ In the rare
>>>>> * event of it not getting flushed for a long time the delay
>>>>> * shouldn't really matter because there's no real memory
>>>>> * pressure for swapout to react to. ]
>>>>> */
>>>>> return ptep_test_and_clear_young(vma, address, ptep);
>>>>> }
>>>>
>>>> You'd probably want an arch helper then, that tells you whether
>>>> a flush_tlb_range() after ptep_test_and_clear_young() is required.
>>>>
>>>> Or some special flush_tlb_range() helper.
>>>>
>>>> I agree that it requires more work.
(Sorry, David. I forgot to reply to your email because I've had a lot to
sort out recently.)
Rather than adding more arch helpers (we already have plenty for the
young flag check), I think we should try removing the TLB flush, as I
mentioned to Barry[1]. MGLRU reclaim already skips the TLB flush, and it
seems to work fine. What do you think?
Here are our previous attempts to remove the TLB flush:
My patch: https://lkml.org/lkml/2023/10/24/533
Barry's patch:
https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
[1]
https://lore.kernel.org/all/6bdc4b03-9631-4717-a3fa-2785a7930aba@linux.alibaba.com/
>>> Sorry unclear here - does the series need more work or does a follow up patch
>>> need more work?
>>
>> Follow up!
>
> Ok good as in mm-stable now. Sadly means I don't get to review it but there we
> go.
Actually this patchset has already been merged upstream:)
^ permalink raw reply [flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-26 1:47 ` Baolin Wang
@ 2026-03-26 5:31 ` Barry Song
2026-03-26 11:10 ` Lorenzo Stoakes (Oracle)
1 sibling, 0 replies; 38+ messages in thread
From: Barry Song @ 2026-03-26 5:31 UTC (permalink / raw)
To: Baolin Wang
Cc: Lorenzo Stoakes (Oracle), David Hildenbrand (Arm), akpm,
catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, dev.jain, linux-mm, linux-arm-kernel, linux-kernel
On Thu, Mar 26, 2026 at 9:47 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 3/25/26 11:06 PM, Lorenzo Stoakes (Oracle) wrote:
> > On Wed, Mar 25, 2026 at 03:58:36PM +0100, David Hildenbrand (Arm) wrote:
> >> On 3/25/26 15:36, Lorenzo Stoakes (Oracle) wrote:
> >>> On Mon, Mar 16, 2026 at 03:15:18PM +0100, David Hildenbrand (Arm) wrote:
> >>>> On 3/16/26 07:25, Baolin Wang wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> Sure. However, after investigating RISC‑V and x86, I found that
> >>>>> ptep_clear_flush_young() does not flush the TLB on these architectures:
> >>>>>
> >>>>> int ptep_clear_flush_young(struct vm_area_struct *vma,
> >>>>> unsigned long address, pte_t *ptep)
> >>>>> {
> >>>>> /*
> >>>>> * On x86 CPUs, clearing the accessed bit without a TLB flush
> >>>>> * doesn't cause data corruption. [ It could cause incorrect
> >>>>> * page aging and the (mistaken) reclaim of hot pages, but the
> >>>>> * chance of that should be relatively low. ]
> >>>>> *
> >>>>> * So as a performance optimization don't flush the TLB when
> >>>>> * clearing the accessed bit, it will eventually be flushed by
> >>>>> * a context switch or a VM operation anyway. [ In the rare
> >>>>> * event of it not getting flushed for a long time the delay
> >>>>> * shouldn't really matter because there's no real memory
> >>>>> * pressure for swapout to react to. ]
> >>>>> */
> >>>>> return ptep_test_and_clear_young(vma, address, ptep);
> >>>>> }
> >>>>
> >>>> You'd probably want an arch helper then, that tells you whether
> >>>> a flush_tlb_range() after ptep_test_and_clear_young() is required.
> >>>>
> >>>> Or some special flush_tlb_range() helper.
> >>>>
> >>>> I agree that it requires more work.
>
> (Sorry, David. I forgot to reply to your email because I've had a lot to
> sort out recently.)
>
> Rather than adding more arch helpers (we already have plenty for the
> young flag check), I think we should try removing the TLB flush, as I
> mentioned to Barry[1]. MGLRU reclaim already skips the TLB flush, and it
> seems to work fine. What do you think?
>
> Here are our previous attempts to remove the TLB flush:
>
> My patch: https://lkml.org/lkml/2023/10/24/533
> Barry's patch:
> https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>
> [1]
> https://lore.kernel.org/all/6bdc4b03-9631-4717-a3fa-2785a7930aba@linux.alibaba.com/
x86: ptep_clear_flush_young does not perform any TLB
invalidation. simply, calling ptep_test_and_clear_young()
RISC-V: follows the exact same behavior as x86.
S390:
simply, calling ptep_test_and_clear_young()
powerpc:
simply, calling ptep_test_and_clear_young();
parisc:
set_pte + __flush_cache_page
but ptep_test_and_clear_young() doesn't need __flush_cache_page()
arm64:
ptep_test_and_clear_young() followed by
flush_tlb_page_nosync() can still be expensive,
based on my previous observations.
others:
ptep_test_and_clear_young + flush_tlb_page
revisiting the comment for x86:
/*
* On x86 CPUs, clearing the accessed bit without a TLB flush
* doesn't cause data corruption. [ It could cause incorrect
* page aging and the (mistaken) reclaim of hot pages, but the
* chance of that should be relatively low. ]
*
* So as a performance optimization don't flush the TLB when
* clearing the accessed bit, it will eventually be flushed by
* a context switch or a VM operation anyway. [ In the rare
* event of it not getting flushed for a long time the delay
* shouldn't really matter because there's no real memory
* pressure for swapout to react to. ]
*/
At least I feel this also applies to ARM64?
Maybe Ryan, Will, or Catalin can clarify why ARM64 requires a
nosync TLBI, whereas x86 does not?
Thanks
Barry
^ permalink raw reply [flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-26 1:47 ` Baolin Wang
2026-03-26 5:31 ` Barry Song
@ 2026-03-26 11:10 ` Lorenzo Stoakes (Oracle)
2026-03-26 12:04 ` Baolin Wang
1 sibling, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-26 11:10 UTC (permalink / raw)
To: Baolin Wang
Cc: David Hildenbrand (Arm), Barry Song, akpm, catalin.marinas, will,
lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On Thu, Mar 26, 2026 at 09:47:51AM +0800, Baolin Wang wrote:
>
>
> On 3/25/26 11:06 PM, Lorenzo Stoakes (Oracle) wrote:
> > On Wed, Mar 25, 2026 at 03:58:36PM +0100, David Hildenbrand (Arm) wrote:
> > > On 3/25/26 15:36, Lorenzo Stoakes (Oracle) wrote:
> > > > On Mon, Mar 16, 2026 at 03:15:18PM +0100, David Hildenbrand (Arm) wrote:
> > > > > On 3/16/26 07:25, Baolin Wang wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > Sure. However, after investigating RISC‑V and x86, I found that
> > > > > > ptep_clear_flush_young() does not flush the TLB on these architectures:
> > > > > >
> > > > > > int ptep_clear_flush_young(struct vm_area_struct *vma,
> > > > > > unsigned long address, pte_t *ptep)
> > > > > > {
> > > > > > /*
> > > > > > * On x86 CPUs, clearing the accessed bit without a TLB flush
> > > > > > * doesn't cause data corruption. [ It could cause incorrect
> > > > > > * page aging and the (mistaken) reclaim of hot pages, but the
> > > > > > * chance of that should be relatively low. ]
> > > > > > *
> > > > > > * So as a performance optimization don't flush the TLB when
> > > > > > * clearing the accessed bit, it will eventually be flushed by
> > > > > > * a context switch or a VM operation anyway. [ In the rare
> > > > > > * event of it not getting flushed for a long time the delay
> > > > > > * shouldn't really matter because there's no real memory
> > > > > > * pressure for swapout to react to. ]
> > > > > > */
> > > > > > return ptep_test_and_clear_young(vma, address, ptep);
> > > > > > }
> > > > >
> > > > > You'd probably want an arch helper then, that tells you whether
> > > > > a flush_tlb_range() after ptep_test_and_clear_young() is required.
> > > > >
> > > > > Or some special flush_tlb_range() helper.
> > > > >
> > > > > I agree that it requires more work.
>
> (Sorry, David. I forgot to reply to your email because I've had a lot to
> sort out recently.)
>
> Rather than adding more arch helpers (we already have plenty for the young
> flag check), I think we should try removing the TLB flush, as I mentioned to
> Barry[1]. MGLRU reclaim already skips the TLB flush, and it seems to work
> fine. What do you think?
>
> Here are our previous attempts to remove the TLB flush:
>
> My patch: https://lkml.org/lkml/2023/10/24/533
> Barry's patch:
> https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>
> [1] https://lore.kernel.org/all/6bdc4b03-9631-4717-a3fa-2785a7930aba@linux.alibaba.com/
>
> > > > Sorry unclear here - does the series need more work or does a follow up patch
> > > > need more work?
> > >
> > > Follow up!
> >
> > Ok good as in mm-stable now. Sadly means I don't get to review it but there we
> > go.
>
> Actually this patchset has already been merged upstream:)
Err but this revision was sent _during_ the merge window...?
Was sent on 9th Feb on Monday in merge window week 1, with a functional change
listed:
- Skip batched unmapping for uffd case, reported by Dev. Thanks.
And then sent in 2nd batch on 18th Feb (see [0]).
So we were ok with 1 week of 'testing' (does anybody actually test -next during
the merge window? Was it even sent to -next?) for what appears to be a
functional change?
And there was ongoing feedback on this and the v5 series (at [1])?
This doesn't really feel sane?
And now I'm confused as to whether mm-stable patches can collect tags, since
presumably this was in mm-stable at the point this respin was done?
Maybe I'm missing something here but this doesn't feel like a sane process?
Thanks, Lorenzo
[0]:https://lore.kernel.org/all/20260218200016.8906fb904af9439e7b496327@linux-foundation.org/
[1]:https://lore.kernel.org/linux-mm/cover.1766631066.git.baolin.wang@linux.alibaba.com/
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-26 11:10 ` Lorenzo Stoakes (Oracle)
@ 2026-03-26 12:04 ` Baolin Wang
2026-03-26 12:21 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 38+ messages in thread
From: Baolin Wang @ 2026-03-26 12:04 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: David Hildenbrand (Arm), Barry Song, akpm, catalin.marinas, will,
lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On 3/26/26 7:10 PM, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 26, 2026 at 09:47:51AM +0800, Baolin Wang wrote:
>>
>>
>> On 3/25/26 11:06 PM, Lorenzo Stoakes (Oracle) wrote:
>>> On Wed, Mar 25, 2026 at 03:58:36PM +0100, David Hildenbrand (Arm) wrote:
>>>> On 3/25/26 15:36, Lorenzo Stoakes (Oracle) wrote:
>>>>> On Mon, Mar 16, 2026 at 03:15:18PM +0100, David Hildenbrand (Arm) wrote:
>>>>>> On 3/16/26 07:25, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Sure. However, after investigating RISC‑V and x86, I found that
>>>>>>> ptep_clear_flush_young() does not flush the TLB on these architectures:
>>>>>>>
>>>>>>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>>> unsigned long address, pte_t *ptep)
>>>>>>> {
>>>>>>> /*
>>>>>>> * On x86 CPUs, clearing the accessed bit without a TLB flush
>>>>>>> * doesn't cause data corruption. [ It could cause incorrect
>>>>>>> * page aging and the (mistaken) reclaim of hot pages, but the
>>>>>>> * chance of that should be relatively low. ]
>>>>>>> *
>>>>>>> * So as a performance optimization don't flush the TLB when
>>>>>>> * clearing the accessed bit, it will eventually be flushed by
>>>>>>> * a context switch or a VM operation anyway. [ In the rare
>>>>>>> * event of it not getting flushed for a long time the delay
>>>>>>> * shouldn't really matter because there's no real memory
>>>>>>> * pressure for swapout to react to. ]
>>>>>>> */
>>>>>>> return ptep_test_and_clear_young(vma, address, ptep);
>>>>>>> }
>>>>>>
>>>>>> You'd probably want an arch helper then, that tells you whether
>>>>>> a flush_tlb_range() after ptep_test_and_clear_young() is required.
>>>>>>
>>>>>> Or some special flush_tlb_range() helper.
>>>>>>
>>>>>> I agree that it requires more work.
>>
>> (Sorry, David. I forgot to reply to your email because I've had a lot to
>> sort out recently.)
>>
>> Rather than adding more arch helpers (we already have plenty for the young
>> flag check), I think we should try removing the TLB flush, as I mentioned to
>> Barry[1]. MGLRU reclaim already skips the TLB flush, and it seems to work
>> fine. What do you think?
>>
>> Here are our previous attempts to remove the TLB flush:
>>
>> My patch: https://lkml.org/lkml/2023/10/24/533
>> Barry's patch:
>> https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
>>
>> [1] https://lore.kernel.org/all/6bdc4b03-9631-4717-a3fa-2785a7930aba@linux.alibaba.com/
>>
>>>>> Sorry unclear here - does the series need more work or does a follow up patch
>>>>> need more work?
>>>>
>>>> Follow up!
>>>
>>> Ok good as in mm-stable now. Sadly means I don't get to review it but there we
>>> go.
>>
>> Actually this patchset has already been merged upstream:)
Let me try to make things clear.
> Err but this revision was sent _during_ the merge window...?
>
> Was sent on 9th Feb on Monday in merge window week 1, with a functional change
> listed:
>
> - Skip batched unmapping for uffd case, reported by Dev. Thanks.
>
> And then sent in 2nd batch on 18th Feb (see [0]).
>
> So we were ok with 1 week of 'testing' (does anybody actually test -next during
> the merge window? Was it even sent to -next?) for what appears to be a
> functional change?
I posted v5 on Dec 26th[0], and it collected quite a few Reviewed-by
tags and sat in mm-unstable for testing.
Later, Dev reported a uffd-related issue (I hope you recall that
discussion). I posted a fix[1] for it on Jan 16th, which Andrew accepted.
Since then, the v5 series (plus the fix) continued to be tested in
mm-unstable. We kept it there mainly because David mentioned he wanted
to review the series, so we were waiting for his time.
On Feb 9th, after returning from vacation, David reviewed the series
(thanks, David!). I replied to and addressed all his comments, then
posted v6 on the same day[2].
Additionally, v6 had no functional changes compared to v5 + the fix, and
it mainly addressed some coding style issues pointed out by David. I
also discussed this with David off-list, and since there were no
functional changes, my expectation was that it could still make it into
the merge window. That is why v6 was merged.
[0]
https://lore.kernel.org/linux-mm/cover.1766631066.git.baolin.wang@linux.alibaba.com/#t
[1]
https://lore.kernel.org/linux-mm/20260116162652.176054-1-baolin.wang@linux.alibaba.com/
[2]
https://lore.kernel.org/all/cover.1770645603.git.baolin.wang@linux.alibaba.com/
> And there was ongoing feedback on this and the v5 series (at [1])?
Regarding the feedback on v5, I believe everything has been addressed.
> This doesn't really feel sane?
>
> And now I'm confused as to whether mm-stable patches can collect tags, since
> presumably this was in mm-stable at the point this respin was done?
>
> Maybe I'm missing something here but this doesn't feel like a sane process?
Andrew, David, please correct me if I've missed anything. Also, please
let me know if there's anything in the process that needs to be
improved. Thanks.
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-26 12:04 ` Baolin Wang
@ 2026-03-26 12:21 ` Lorenzo Stoakes (Oracle)
0 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-26 12:21 UTC (permalink / raw)
To: Baolin Wang
Cc: David Hildenbrand (Arm), Barry Song, akpm, catalin.marinas, will,
lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On Thu, Mar 26, 2026 at 08:04:10PM +0800, Baolin Wang wrote:
>
>
> On 3/26/26 7:10 PM, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Mar 26, 2026 at 09:47:51AM +0800, Baolin Wang wrote:
> > >
> > >
> > > On 3/25/26 11:06 PM, Lorenzo Stoakes (Oracle) wrote:
> > > > On Wed, Mar 25, 2026 at 03:58:36PM +0100, David Hildenbrand (Arm) wrote:
> > > > > On 3/25/26 15:36, Lorenzo Stoakes (Oracle) wrote:
> > > > > > On Mon, Mar 16, 2026 at 03:15:18PM +0100, David Hildenbrand (Arm) wrote:
> > > > > > > On 3/16/26 07:25, Baolin Wang wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Sure. However, after investigating RISC‑V and x86, I found that
> > > > > > > > ptep_clear_flush_young() does not flush the TLB on these architectures:
> > > > > > > >
> > > > > > > > int ptep_clear_flush_young(struct vm_area_struct *vma,
> > > > > > > > unsigned long address, pte_t *ptep)
> > > > > > > > {
> > > > > > > > /*
> > > > > > > > * On x86 CPUs, clearing the accessed bit without a TLB flush
> > > > > > > > * doesn't cause data corruption. [ It could cause incorrect
> > > > > > > > * page aging and the (mistaken) reclaim of hot pages, but the
> > > > > > > > * chance of that should be relatively low. ]
> > > > > > > > *
> > > > > > > > * So as a performance optimization don't flush the TLB when
> > > > > > > > * clearing the accessed bit, it will eventually be flushed by
> > > > > > > > * a context switch or a VM operation anyway. [ In the rare
> > > > > > > > * event of it not getting flushed for a long time the delay
> > > > > > > > * shouldn't really matter because there's no real memory
> > > > > > > > * pressure for swapout to react to. ]
> > > > > > > > */
> > > > > > > > return ptep_test_and_clear_young(vma, address, ptep);
> > > > > > > > }
> > > > > > >
> > > > > > > You'd probably want an arch helper then, that tells you whether
> > > > > > > a flush_tlb_range() after ptep_test_and_clear_young() is required.
> > > > > > >
> > > > > > > Or some special flush_tlb_range() helper.
> > > > > > >
> > > > > > > I agree that it requires more work.
> > >
> > > (Sorry, David. I forgot to reply to your email because I've had a lot to
> > > sort out recently.)
> > >
> > > Rather than adding more arch helpers (we already have plenty for the young
> > > flag check), I think we should try removing the TLB flush, as I mentioned to
> > > Barry[1]. MGLRU reclaim already skips the TLB flush, and it seems to work
> > > fine. What do you think?
> > >
> > > Here are our previous attempts to remove the TLB flush:
> > >
> > > My patch: https://lkml.org/lkml/2023/10/24/533
> > > Barry's patch:
> > > https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
> > >
> > > [1] https://lore.kernel.org/all/6bdc4b03-9631-4717-a3fa-2785a7930aba@linux.alibaba.com/
> > >
> > > > > > Sorry unclear here - does the series need more work or does a follow up patch
> > > > > > need more work?
> > > > >
> > > > > Follow up!
> > > >
> > > > Ok good as in mm-stable now. Sadly means I don't get to review it but there we
> > > > go.
> > >
> > > Actually this patchset has already been merged upstream:)
>
> Let me try to make things clear.
>
> > Err but this revision was sent _during_ the merge window...?
> >
> > Was sent on 9th Feb on Monday in merge window week 1, with a functional change
> > listed:
> >
> > - Skip batched unmapping for uffd case, reported by Dev. Thanks.
> >
> > And then sent in 2nd batch on 18th Feb (see [0]).
> >
> > So we were ok with 1 week of 'testing' (does anybody actually test -next during
> > the merge window? Was it even sent to -next?) for what appears to be a
> > functional change?
>
> I posted v5 on Dec 26th[0], and it collected quite a few Reviewed-by tags
> and sat in mm-unstable for testing.
>
> Later, Dev reported a uffd-related issue (I hope you recall that
> discussion). I posted a fix[1] for it on Jan 16th, which Andrew accepted.
>
> Since then, the v5 series (plus the fix) continued to be tested in
> mm-unstable. We kept it there mainly because David mentioned he wanted to
> review the series, so we were waiting for his time.
>
> On Feb 9th, after returning from vacation, David reviewed the series
> (thanks, David!). I replied to and addressed all his comments, then posted
> v6 on the same day[2].
OK thanks, I see that now.
I still don't think we should have made any changes _during_ the merge window,
even if they were simple code quality things.
Changing patches then seems just crazy to me, as even code quality stuff can
cause unexpected bugs, and now we're having upstream take it.
Also this speaks to -fix patches just being broken in general.
If you'd just respun with the fix as a v6, then we'd know 'v6 sent on 16th Jan
addressed this' and there'd be no issue.
Now v5 isn't v5, there's v5 and something-not-v5 and to have a sense of the
testing you have to go read a bunch of email chains.
It also means change logs are now really inaccurate:
Changes from v5:
- Collect reviewed tags from Ryan, Harry and David. Thanks.
- Fix some coding style issues (per David).
- Skip batched unmapping for uffd case, reported by Dev. Thanks.
And that to me means 'v5 didn't have this, v6 does'.
And it's really hard to track timelines for testing.
>
> Additionally, v6 had no functional changes compared to v5 + the fix, and it
> mainly addressed some coding style issues pointed out by David. I also
> discussed this with David off-list, and since there were no functional
> changes, my expectation was that it could still make it into the merge
> window. That is why v6 was merged.
Yeah, we still shouldn't have taken changes to a series DURING the merge window,
it's just crazy.
>
> [0] https://lore.kernel.org/linux-mm/cover.1766631066.git.baolin.wang@linux.alibaba.com/#t
> [1] https://lore.kernel.org/linux-mm/20260116162652.176054-1-baolin.wang@linux.alibaba.com/
> [2] https://lore.kernel.org/all/cover.1770645603.git.baolin.wang@linux.alibaba.com/
>
> > And there was ongoing feedback on this and the v5 series (at [1])?
>
> Regarding the feedback on v5, I believe everything has been addressed.
>
> > This doesn't really feel sane?
> >
> > And now I'm confused as to whether mm-stable patches can collect tags, since
> > presumably this was in mm-stable at the point this respin was done?
> >
> > Maybe I'm missing something here but this doesn't feel like a sane process?
>
> Andrew, David, please correct me if I've missed anything. Also, please let
> me know if there's anything in the process that needs to be improved.
> Thanks.
This isn't on you, it's about the process as a whole. We need clear rules about
when changes will be accepted and when not.
And frankly I think we need to do away with fix patches as a whole based on
this, or at least anything even vaguely non-trivial or that potentially impacts
code.
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-16 6:25 ` Baolin Wang
2026-03-16 14:15 ` David Hildenbrand (Arm)
@ 2026-03-17 7:30 ` Barry Song
2026-03-18 1:37 ` Baolin Wang
1 sibling, 1 reply; 38+ messages in thread
From: Barry Song @ 2026-03-17 7:30 UTC (permalink / raw)
To: Baolin Wang
Cc: David Hildenbrand (Arm), akpm, catalin.marinas, will,
lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On Mon, Mar 16, 2026 at 2:25 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 3/10/26 4:17 PM, David Hildenbrand (Arm) wrote:
> > On 3/10/26 02:37, Baolin Wang wrote:
> >>
> >>
> >> On 3/7/26 4:02 PM, Barry Song wrote:
> >>> On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
> >>> <baolin.wang@linux.alibaba.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Thanks.
> >>>>
> >>>>
> >>>> Yes. In addition, this will involve many architectures’ implementations
> >>>> and their differing TLB flush mechanisms, so it’s difficult to make a
> >>>> reasonable per-architecture measurement. If any architecture has a more
> >>>> efficient flush method, I’d prefer to implement an architecture‑specific
> >>>> clear_flush_young_ptes().
> >>>
> >>> Right! Since TLBI is usually quite expensive, I wonder if a generic
> >>> implementation for architectures lacking clear_flush_young_ptes()
> >>> might benefit from something like the below (just a very rough idea):
> >>>
> >>> int clear_flush_young_ptes(struct vm_area_struct *vma,
> >>> unsigned long addr, pte_t *ptep, unsigned int nr)
> >>> {
> >>> unsigned long curr_addr = addr;
> >>> int young = 0;
> >>>
> >>> while (nr--) {
> >>> young |= ptep_test_and_clear_young(vma, curr_addr,
> >>> ptep);
> >>> ptep++;
> >>> curr_addr += PAGE_SIZE;
> >>> }
> >>>
> >>> if (young)
> >>> flush_tlb_range(vma, addr, curr_addr);
> >>> return young;
> >>> }
> >>
> >> I understand your point. I’m concerned that I can’t test this patch on
> >> every architecture to validate the benefits. Anyway, let me try this on
> >> my X86 machine first.
> >
> > In any case, please make that a follow-up patch :)
>
> Sure. However, after investigating RISC‑V and x86, I found that
> ptep_clear_flush_young() does not flush the TLB on these architectures:
>
> int ptep_clear_flush_young(struct vm_area_struct *vma,
> unsigned long address, pte_t *ptep)
> {
> /*
> * On x86 CPUs, clearing the accessed bit without a TLB flush
> * doesn't cause data corruption. [ It could cause incorrect
> * page aging and the (mistaken) reclaim of hot pages, but the
> * chance of that should be relatively low. ]
> *
> * So as a performance optimization don't flush the TLB when
> * clearing the accessed bit, it will eventually be flushed by
> * a context switch or a VM operation anyway. [ In the rare
> * event of it not getting flushed for a long time the delay
> * shouldn't really matter because there's no real memory
> * pressure for swapout to react to. ]
> */
> return ptep_test_and_clear_young(vma, address, ptep);
> }
>
> I don't have access to other architectures, so I think we can postpone
> this optimization unless someone is interested in optimizing the TLB flush.
The comment is interesting. I think it likely applies to most
architectures, including ARM64. The main reason ARM64 doesn’t use
this approach is probably that it can issue tlbi_nosync and then
rely on a final dsb to ensure all invalidations are completed—
and tlbi_nosync itself is relatively cheap.
Thanks
Barry
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
2026-03-17 7:30 ` Barry Song
@ 2026-03-18 1:37 ` Baolin Wang
0 siblings, 0 replies; 38+ messages in thread
From: Baolin Wang @ 2026-03-18 1:37 UTC (permalink / raw)
To: Barry Song
Cc: David Hildenbrand (Arm), akpm, catalin.marinas, will,
lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On 3/17/26 3:30 PM, Barry Song wrote:
> On Mon, Mar 16, 2026 at 2:25 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 3/10/26 4:17 PM, David Hildenbrand (Arm) wrote:
>>> On 3/10/26 02:37, Baolin Wang wrote:
>>>>
>>>>
>>>> On 3/7/26 4:02 PM, Barry Song wrote:
>>>>> On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
>>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>> Yes. In addition, this will involve many architectures’ implementations
>>>>>> and their differing TLB flush mechanisms, so it’s difficult to make a
>>>>>> reasonable per-architecture measurement. If any architecture has a more
>>>>>> efficient flush method, I’d prefer to implement an architecture‑specific
>>>>>> clear_flush_young_ptes().
>>>>>
>>>>> Right! Since TLBI is usually quite expensive, I wonder if a generic
>>>>> implementation for architectures lacking clear_flush_young_ptes()
>>>>> might benefit from something like the below (just a very rough idea):
>>>>>
>>>>> int clear_flush_young_ptes(struct vm_area_struct *vma,
>>>>> unsigned long addr, pte_t *ptep, unsigned int nr)
>>>>> {
>>>>> unsigned long curr_addr = addr;
>>>>> int young = 0;
>>>>>
>>>>> while (nr--) {
>>>>> young |= ptep_test_and_clear_young(vma, curr_addr,
>>>>> ptep);
>>>>> ptep++;
>>>>> curr_addr += PAGE_SIZE;
>>>>> }
>>>>>
>>>>> if (young)
>>>>> flush_tlb_range(vma, addr, curr_addr);
>>>>> return young;
>>>>> }
>>>>
>>>> I understand your point. I’m concerned that I can’t test this patch on
>>>> every architecture to validate the benefits. Anyway, let me try this on
>>>> my X86 machine first.
>>>
>>> In any case, please make that a follow-up patch :)
>>
>> Sure. However, after investigating RISC‑V and x86, I found that
>> ptep_clear_flush_young() does not flush the TLB on these architectures:
>>
>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>> unsigned long address, pte_t *ptep)
>> {
>> /*
>> * On x86 CPUs, clearing the accessed bit without a TLB flush
>> * doesn't cause data corruption. [ It could cause incorrect
>> * page aging and the (mistaken) reclaim of hot pages, but the
>> * chance of that should be relatively low. ]
>> *
>> * So as a performance optimization don't flush the TLB when
>> * clearing the accessed bit, it will eventually be flushed by
>> * a context switch or a VM operation anyway. [ In the rare
>> * event of it not getting flushed for a long time the delay
>> * shouldn't really matter because there's no real memory
>> * pressure for swapout to react to. ]
>> */
>> return ptep_test_and_clear_young(vma, address, ptep);
>> }
>>
>> I don't have access to other architectures, so I think we can postpone
>> this optimization unless someone is interested in optimizing the TLB flush.
>
> The comment is interesting. I think it likely applies to most
> architectures, including ARM64. The main reason ARM64 doesn’t use
> this approach is probably that it can issue tlbi_nosync and then
> rely on a final dsb to ensure all invalidations are completed—
> and tlbi_nosync itself is relatively cheap.
Actually, we both tried this a few years ago, but neither succeeded :).
My patch: https://lkml.org/lkml/2023/10/24/533
Your patch:
https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/
Now I’m more inclined toward your approach, to align with MGLRU. It’s
time to restart the discussion on this patch? :)
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v6 2/5] arm64: mm: factor out the address and ptep alignment into a new helper
2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
2026-02-09 14:07 ` [PATCH v6 1/5] mm: rmap: support batched checks of the references " Baolin Wang
@ 2026-02-09 14:07 ` Baolin Wang
2026-02-09 14:07 ` [PATCH v6 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
` (3 subsequent siblings)
5 siblings, 0 replies; 38+ messages in thread
From: Baolin Wang @ 2026-02-09 14:07 UTC (permalink / raw)
To: akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
baolin.wang, linux-mm, linux-arm-kernel, linux-kernel
Factor out the contpte block's address and ptep alignment into a new helper,
and will be reused in the following patch.
No functional changes.
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
arch/arm64/mm/contpte.c | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 589bcf878938..e4ddeb46f25d 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -26,6 +26,26 @@ static inline pte_t *contpte_align_down(pte_t *ptep)
return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
}
+static inline pte_t *contpte_align_addr_ptep(unsigned long *start,
+ unsigned long *end, pte_t *ptep,
+ unsigned int nr)
+{
+ /*
+ * Note: caller must ensure these nr PTEs are consecutive (present)
+ * PTEs that map consecutive pages of the same large folio within a
+ * single VMA and a single page table.
+ */
+ if (pte_cont(__ptep_get(ptep + nr - 1)))
+ *end = ALIGN(*end, CONT_PTE_SIZE);
+
+ if (pte_cont(__ptep_get(ptep))) {
+ *start = ALIGN_DOWN(*start, CONT_PTE_SIZE);
+ ptep = contpte_align_down(ptep);
+ }
+
+ return ptep;
+}
+
static void contpte_try_unfold_partial(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned int nr)
{
@@ -569,14 +589,7 @@ void contpte_clear_young_dirty_ptes(struct vm_area_struct *vma,
unsigned long start = addr;
unsigned long end = start + nr * PAGE_SIZE;
- if (pte_cont(__ptep_get(ptep + nr - 1)))
- end = ALIGN(end, CONT_PTE_SIZE);
-
- if (pte_cont(__ptep_get(ptep))) {
- start = ALIGN_DOWN(start, CONT_PTE_SIZE);
- ptep = contpte_align_down(ptep);
- }
-
+ ptep = contpte_align_addr_ptep(&start, &end, ptep, nr);
__clear_young_dirty_ptes(vma, start, ptep, (end - start) / PAGE_SIZE, flags);
}
EXPORT_SYMBOL_GPL(contpte_clear_young_dirty_ptes);
--
2.47.3
^ permalink raw reply related [flat|nested] 38+ messages in thread* [PATCH v6 3/5] arm64: mm: support batch clearing of the young flag for large folios
2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
2026-02-09 14:07 ` [PATCH v6 1/5] mm: rmap: support batched checks of the references " Baolin Wang
2026-02-09 14:07 ` [PATCH v6 2/5] arm64: mm: factor out the address and ptep alignment into a new helper Baolin Wang
@ 2026-02-09 14:07 ` Baolin Wang
2026-02-09 14:07 ` [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
` (2 subsequent siblings)
5 siblings, 0 replies; 38+ messages in thread
From: Baolin Wang @ 2026-02-09 14:07 UTC (permalink / raw)
To: akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
baolin.wang, linux-mm, linux-arm-kernel, linux-kernel
Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
only clear the young flag and flush TLBs for PTEs within the contiguous range.
To support batch PTE operations for other sized large folios in the following
patches, adding a new parameter to specify the number of PTEs that map consecutive
pages of the same large folio in a single VMA and a single page table.
While we are at it, rename the functions to maintain consistency with other
contpte_*() functions.
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
arch/arm64/include/asm/pgtable.h | 12 ++++++------
arch/arm64/mm/contpte.c | 33 ++++++++++++++++++--------------
2 files changed, 25 insertions(+), 20 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d94445b4f3df..3dabf5ea17fa 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1648,10 +1648,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
unsigned long addr, pte_t *ptep,
unsigned int nr, int full);
-extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep);
-extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep);
+int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, unsigned int nr);
+int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, unsigned int nr);
extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned int nr);
extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
@@ -1823,7 +1823,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
if (likely(!pte_valid_cont(orig_pte)))
return __ptep_test_and_clear_young(vma, addr, ptep);
- return contpte_ptep_test_and_clear_young(vma, addr, ptep);
+ return contpte_test_and_clear_young_ptes(vma, addr, ptep, 1);
}
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -1835,7 +1835,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
if (likely(!pte_valid_cont(orig_pte)))
return __ptep_clear_flush_young(vma, addr, ptep);
- return contpte_ptep_clear_flush_young(vma, addr, ptep);
+ return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
}
#define wrprotect_ptes wrprotect_ptes
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index e4ddeb46f25d..b929a455103f 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -508,8 +508,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
}
EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
-int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep)
+int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr)
{
/*
* ptep_clear_flush_young() technically requires us to clear the access
@@ -518,41 +519,45 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
* contig range when the range is covered by a single folio, we can get
* away with clearing young for the whole contig range here, so we avoid
* having to unfold.
+ *
+ * The 'nr' means consecutive (present) PTEs that map consecutive pages
+ * of the same large folio in a single VMA and a single page table.
*/
+ unsigned long end = addr + nr * PAGE_SIZE;
int young = 0;
- int i;
- ptep = contpte_align_down(ptep);
- addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
-
- for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+ ptep = contpte_align_addr_ptep(&addr, &end, ptep, nr);
+ for (; addr != end; ptep++, addr += PAGE_SIZE)
young |= __ptep_test_and_clear_young(vma, addr, ptep);
return young;
}
-EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
+EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
-int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep)
+int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr)
{
int young;
- young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+ young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
if (young) {
+ unsigned long end = addr + nr * PAGE_SIZE;
+
+ contpte_align_addr_ptep(&addr, &end, ptep, nr);
/*
* See comment in __ptep_clear_flush_young(); same rationale for
* eliding the trailing DSB applies here.
*/
- addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
- __flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
+ __flush_tlb_range_nosync(vma->vm_mm, addr, end,
PAGE_SIZE, true, 3);
}
return young;
}
-EXPORT_SYMBOL_GPL(contpte_ptep_clear_flush_young);
+EXPORT_SYMBOL_GPL(contpte_clear_flush_young_ptes);
void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned int nr)
--
2.47.3
^ permalink raw reply related [flat|nested] 38+ messages in thread* [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
` (2 preceding siblings ...)
2026-02-09 14:07 ` [PATCH v6 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
@ 2026-02-09 14:07 ` Baolin Wang
2026-02-09 15:30 ` David Hildenbrand (Arm)
2026-03-06 21:20 ` Barry Song
2026-02-09 14:07 ` [PATCH v6 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
2026-02-10 1:53 ` [PATCH v6 0/5] support batch checking of references and unmapping for " Andrew Morton
5 siblings, 2 replies; 38+ messages in thread
From: Baolin Wang @ 2026-02-09 14:07 UTC (permalink / raw)
To: akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
baolin.wang, linux-mm, linux-arm-kernel, linux-kernel
Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
batched checking of young flags and TLB flushing, improving performance during
large folio reclamation.
Performance testing:
Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
33% performance improvement on my Arm64 32-core server (and 10%+ improvement
on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
from approximately 35% to around 5%.
W/o patchset:
real 0m1.518s
user 0m0.000s
sys 0m1.518s
W/ patchset:
real 0m1.018s
user 0m0.000s
sys 0m1.018s
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
arch/arm64/include/asm/pgtable.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3dabf5ea17fa..a17eb8a76788 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
}
+#define clear_flush_young_ptes clear_flush_young_ptes
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr)
+{
+ if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
+ return __ptep_clear_flush_young(vma, addr, ptep);
+
+ return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
+}
+
#define wrprotect_ptes wrprotect_ptes
static __always_inline void wrprotect_ptes(struct mm_struct *mm,
unsigned long addr, pte_t *ptep, unsigned int nr)
--
2.47.3
^ permalink raw reply related [flat|nested] 38+ messages in thread* Re: [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
2026-02-09 14:07 ` [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
@ 2026-02-09 15:30 ` David Hildenbrand (Arm)
2026-02-10 0:39 ` Baolin Wang
2026-03-06 21:20 ` Barry Song
1 sibling, 1 reply; 38+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09 15:30 UTC (permalink / raw)
To: Baolin Wang, akpm, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On 2/9/26 15:07, Baolin Wang wrote:
> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
> batched checking of young flags and TLB flushing, improving performance during
> large folio reclamation.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> from approximately 35% to around 5%.
>
> W/o patchset:
> real 0m1.518s
> user 0m0.000s
> sys 0m1.518s
>
> W/ patchset:
> real 0m1.018s
> user 0m0.000s
> sys 0m1.018s
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 3dabf5ea17fa..a17eb8a76788 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
> }
>
> +#define clear_flush_young_ptes clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> +{
> + if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
I guess similar cases where we should never end up with non-present ptes
should be updated accordingly.
ptep_test_and_clear_young(), for example, should never be called on
non-present ptes.
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
2026-02-09 15:30 ` David Hildenbrand (Arm)
@ 2026-02-10 0:39 ` Baolin Wang
0 siblings, 0 replies; 38+ messages in thread
From: Baolin Wang @ 2026-02-10 0:39 UTC (permalink / raw)
To: David Hildenbrand (Arm), akpm, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On 2/9/26 11:30 PM, David Hildenbrand (Arm) wrote:
> On 2/9/26 15:07, Baolin Wang wrote:
>> Implement the Arm64 architecture-specific clear_flush_young_ptes() to
>> enable
>> batched checking of young flags and TLB flushing, improving
>> performance during
>> large folio reclamation.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup,
>> and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can
>> observe
>> 33% performance improvement on my Arm64 32-core server (and 10%+
>> improvement
>> on my X86 machine). Meanwhile, the hotspot folio_check_references()
>> dropped
>> from approximately 35% to around 5%.
>>
>> W/o patchset:
>> real 0m1.518s
>> user 0m0.000s
>> sys 0m1.518s
>>
>> W/ patchset:
>> real 0m1.018s
>> user 0m0.000s
>> sys 0m1.018s
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/
>> asm/pgtable.h
>> index 3dabf5ea17fa..a17eb8a76788 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct
>> vm_area_struct *vma,
>> return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>> }
>> +#define clear_flush_young_ptes clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> +{
>> + if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
> I guess similar cases where we should never end up with non-present ptes
> should be updated accordingly.
>
> ptep_test_and_clear_young(), for example, should never be called on non-
> present ptes.
Yes. I already addressed this in my follow-up patchset.
> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Thanks.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
2026-02-09 14:07 ` [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
2026-02-09 15:30 ` David Hildenbrand (Arm)
@ 2026-03-06 21:20 ` Barry Song
2026-03-07 2:14 ` Baolin Wang
1 sibling, 1 reply; 38+ messages in thread
From: Barry Song @ 2026-03-06 21:20 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, dev.jain, linux-mm, linux-arm-kernel, linux-kernel
On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
> batched checking of young flags and TLB flushing, improving performance during
> large folio reclamation.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> from approximately 35% to around 5%.
>
> W/o patchset:
> real 0m1.518s
> user 0m0.000s
> sys 0m1.518s
>
> W/ patchset:
> real 0m1.018s
> user 0m0.000s
> sys 0m1.018s
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
> ---
> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 3dabf5ea17fa..a17eb8a76788 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
> }
>
> +#define clear_flush_young_ptes clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> +{
> + if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
> + return __ptep_clear_flush_young(vma, addr, ptep);
> +
> + return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> +}
A similar question arises here:
If nr = 4 for 16KB large folios and one of those entries is young,
we end up flushing the TLB for all 4 PTEs.
If all four entries are young, we win; if only one is young, it seems
we flush 3 redundant pages. but arm64 has TLB coalescing, so
maybe they are just one TLB?
> +
> #define wrprotect_ptes wrprotect_ptes
> static __always_inline void wrprotect_ptes(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep, unsigned int nr)
> --
> 2.47.3
Thanks
Barry
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
2026-03-06 21:20 ` Barry Song
@ 2026-03-07 2:14 ` Baolin Wang
2026-03-07 7:41 ` Barry Song
0 siblings, 1 reply; 38+ messages in thread
From: Baolin Wang @ 2026-03-07 2:14 UTC (permalink / raw)
To: Barry Song
Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, dev.jain, linux-mm, linux-arm-kernel, linux-kernel
On 3/7/26 5:20 AM, Barry Song wrote:
> On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
>> batched checking of young flags and TLB flushing, improving performance during
>> large folio reclamation.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
>> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
>> from approximately 35% to around 5%.
>>
>> W/o patchset:
>> real 0m1.518s
>> user 0m0.000s
>> sys 0m1.518s
>>
>> W/ patchset:
>> real 0m1.018s
>> user 0m0.000s
>> sys 0m1.018s
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> Reviewed-by: Barry Song <baohua@kernel.org>
Thanks Barry. But this series has been upstreamed, so I cannot add your
reviewed tag.
>
>> ---
>> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 3dabf5ea17fa..a17eb8a76788 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>> }
>>
>> +#define clear_flush_young_ptes clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> +{
>> + if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
>> + return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
>> +}
>
> A similar question arises here:
>
> If nr = 4 for 16KB large folios and one of those entries is young,
> we end up flushing the TLB for all 4 PTEs.
>
> If all four entries are young, we win; if only one is young, it seems
> we flush 3 redundant pages. but arm64 has TLB coalescing, so
> maybe they are just one TLB?
We discussed a similar issue in the previous thread [1], and I quote
some comments from Ryan:
"
My concern was the opportunity cost of evicting the entries for all the
non-accessed parts of the folio from the TLB. But of course, I'm talking
nonsense because the architecture does not allow caching non-accessed
entries in the TLB.
"
[1]
https://lore.kernel.org/all/02239ca7-9701-4bfa-af0f-dcf0d05a3e89@linux.alibaba.com/
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
2026-03-07 2:14 ` Baolin Wang
@ 2026-03-07 7:41 ` Barry Song
0 siblings, 0 replies; 38+ messages in thread
From: Barry Song @ 2026-03-07 7:41 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, dev.jain, linux-mm, linux-arm-kernel, linux-kernel
On Sat, Mar 7, 2026 at 10:14 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 3/7/26 5:20 AM, Barry Song wrote:
> > On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >>
> >> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
> >> batched checking of young flags and TLB flushing, improving performance during
> >> large folio reclamation.
> >>
> >> Performance testing:
> >> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> >> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> >> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> >> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> >> from approximately 35% to around 5%.
> >>
> >> W/o patchset:
> >> real 0m1.518s
> >> user 0m0.000s
> >> sys 0m1.518s
> >>
> >> W/ patchset:
> >> real 0m1.018s
> >> user 0m0.000s
> >> sys 0m1.018s
> >>
> >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >
> > Reviewed-by: Barry Song <baohua@kernel.org>
>
> Thanks Barry. But this series has been upstreamed, I can not add your
> reviewed tag.
>
> >
> >> ---
> >> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
> >> 1 file changed, 11 insertions(+)
> >>
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index 3dabf5ea17fa..a17eb8a76788 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> >> return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
> >> }
> >>
> >> +#define clear_flush_young_ptes clear_flush_young_ptes
> >> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> >> + unsigned long addr, pte_t *ptep,
> >> + unsigned int nr)
> >> +{
> >> + if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
> >> + return __ptep_clear_flush_young(vma, addr, ptep);
> >> +
> >> + return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> >> +}
> >
> > A similar question arises here:
> >
> > If nr = 4 for 16KB large folios and one of those entries is young,
> > we end up flushing the TLB for all 4 PTEs.
> >
> > If all four entries are young, we win; if only one is young, it seems
> > we flush 3 redundant pages. but arm64 has TLB coalescing, so
> > maybe they are just one TLB?
>
> We discussed a similar issue in the previous thread [1], and I quote
> some comments from Ryan:
>
> "
> My concern was the opportunity cost of evicting the entries for all the
> non-accessed parts of the folio from the TLB. But of course, I'm talking
> nonsense because the architecture does not allow caching non-accessed
> entries in the TLB.
> "
You and Ryan are clearly smarter than me :-) Thinking about it
again, worrying about shooting down the TLBs of non-accessed
PTEs seems to be nonsense.
>
> [1]
> https://lore.kernel.org/all/02239ca7-9701-4bfa-af0f-dcf0d05a3e89@linux.alibaba.com/
>
Thanks
Barry
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v6 5/5] mm: rmap: support batched unmapping for file large folios
2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
` (3 preceding siblings ...)
2026-02-09 14:07 ` [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
@ 2026-02-09 14:07 ` Baolin Wang
2026-02-09 15:31 ` David Hildenbrand (Arm)
2026-02-10 1:53 ` [PATCH v6 0/5] support batch checking of references and unmapping for " Andrew Morton
5 siblings, 1 reply; 38+ messages in thread
From: Baolin Wang @ 2026-02-09 14:07 UTC (permalink / raw)
To: akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
baolin.wang, linux-mm, linux-arm-kernel, linux-kernel
Similar to folio_referenced_one(), we can apply batched unmapping for file
large folios to optimize the performance of file folios reclamation.
Barry previously implemented batched unmapping for lazyfree anonymous large
folios[1] and did not further optimize anonymous large folios or file-backed
large folios at that stage. As for file-backed large folios, the batched
unmapping support is relatively straightforward, as we only need to clear
the consecutive (present) PTE entries for file-backed large folios.
Note that it's not ready to support batched unmapping for the uffd case, so
let's still fall back to per-page unmapping for the uffd case.
Performance testing:
Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
75% performance improvement on my Arm64 32-core server (and 50%+ improvement
on my X86 machine) with this patch.
W/o patch:
real 0m1.018s
user 0m0.000s
sys 0m1.018s
W/ patch:
real 0m0.249s
user 0m0.000s
sys 0m0.249s
[1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Barry Song <baohua@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/rmap.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 8807f8a7df28..43cb9ac6f523 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1945,12 +1945,16 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
end_addr = pmd_addr_end(addr, vma->vm_end);
max_nr = (end_addr - addr) >> PAGE_SHIFT;
- /* We only support lazyfree batching for now ... */
- if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
+ /* We only support lazyfree or file folios batching for now ... */
+ if (folio_test_anon(folio) && folio_test_swapbacked(folio))
return 1;
+
if (pte_unused(pte))
return 1;
+ if (userfaultfd_wp(vma))
+ return 1;
+
return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
}
@@ -2313,7 +2317,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*
* See Documentation/mm/mmu_notifier.rst
*/
- dec_mm_counter(mm, mm_counter_file(folio));
+ add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
}
discard:
if (unlikely(folio_test_hugetlb(folio))) {
--
2.47.3
^ permalink raw reply related [flat|nested] 38+ messages in thread* Re: [PATCH v6 5/5] mm: rmap: support batched unmapping for file large folios
2026-02-09 14:07 ` [PATCH v6 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
@ 2026-02-09 15:31 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09 15:31 UTC (permalink / raw)
To: Baolin Wang, akpm, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain, linux-mm,
linux-arm-kernel, linux-kernel
On 2/9/26 15:07, Baolin Wang wrote:
> Similar to folio_referenced_one(), we can apply batched unmapping for file
> large folios to optimize the performance of file folios reclamation.
>
> Barry previously implemented batched unmapping for lazyfree anonymous large
> folios[1] and did not further optimize anonymous large folios or file-backed
> large folios at that stage. As for file-backed large folios, the batched
> unmapping support is relatively straightforward, as we only need to clear
> the consecutive (present) PTE entries for file-backed large folios.
>
> Note that it's not ready to support batched unmapping for uffd case, so
> let's still fallback to per-page unmapping for the uffd case.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
> on my X86 machine) with this patch.
>
> W/o patch:
> real 0m1.018s
> user 0m0.000s
> sys 0m1.018s
>
> W/ patch:
> real 0m0.249s
> user 0m0.000s
> sys 0m0.249s
>
> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Acked-by: Barry Song <baohua@kernel.org>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v6 0/5] support batch checking of references and unmapping for large folios
2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
` (4 preceding siblings ...)
2026-02-09 14:07 ` [PATCH v6 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
@ 2026-02-10 1:53 ` Andrew Morton
2026-02-10 2:01 ` Baolin Wang
5 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2026-02-10 1:53 UTC (permalink / raw)
To: Baolin Wang
Cc: david, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, baohua, dev.jain, linux-mm, linux-arm-kernel,
linux-kernel
On Mon, 9 Feb 2026 22:07:23 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> Similar to folio_referenced_one(), we can also apply batched unmapping for large
> file folios to optimize the performance of file folio reclamation. By supporting
> batched checking of the young flags, flushing TLB entries, and unmapping, I
> observed significant performance improvements in my performance tests for file
> folio reclamation. Please check the performance data in the commit message of
> each patch.
>
Thanks, I updated mm.git to this version. Below is how v6 altered
mm.git.
I notice that this fix:
https://lore.kernel.org/all/de141225-a0c1-41fd-b3e1-bcab09827ddd@linux.alibaba.com/T/#u
was not carried forward. Was this deliberate?
Also, regarding the 80-column tricks in folio_referenced_one(): we're
allowed to do this ;)
unsigned long end_addr;
unsigned int max_nr;
end_addr = pmd_addr_end(address, vma->vm_end);
max_nr = (end_addr - address) >> PAGE_SHIFT;
arch/arm64/include/asm/pgtable.h | 2 +-
include/linux/pgtable.h | 16 ++++++++++------
mm/rmap.c | 9 +++------
3 files changed, 14 insertions(+), 13 deletions(-)
--- a/arch/arm64/include/asm/pgtable.h~b
+++ a/arch/arm64/include/asm/pgtable.h
@@ -1843,7 +1843,7 @@ static inline int clear_flush_young_ptes
unsigned long addr, pte_t *ptep,
unsigned int nr)
{
- if (likely(nr == 1 && !pte_valid_cont(__ptep_get(ptep))))
+ if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
return __ptep_clear_flush_young(vma, addr, ptep);
return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
--- a/include/linux/pgtable.h~b
+++ a/include/linux/pgtable.h
@@ -1070,8 +1070,8 @@ static inline void wrprotect_ptes(struct
#ifndef clear_flush_young_ptes
/**
- * clear_flush_young_ptes - Clear the access bit and perform a TLB flush for PTEs
- * that map consecutive pages of the same folio.
+ * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
+ * folio as old and flush the TLB.
* @vma: The virtual memory area the pages are mapped into.
* @addr: Address the first page is mapped at.
* @ptep: Page table pointer for the first entry.
@@ -1087,13 +1087,17 @@ static inline void wrprotect_ptes(struct
* pages that belong to the same folio. The PTEs are all in the same PMD.
*/
static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep,
- unsigned int nr)
+ unsigned long addr, pte_t *ptep, unsigned int nr)
{
- int i, young = 0;
+ int young = 0;
- for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
+ for (;;) {
young |= ptep_clear_flush_young(vma, addr, ptep);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ }
return young;
}
--- a/mm/rmap.c~b
+++ a/mm/rmap.c
@@ -963,10 +963,8 @@ static bool folio_referenced_one(struct
referenced++;
} else if (pvmw.pte) {
if (folio_test_large(folio)) {
- unsigned long end_addr =
- pmd_addr_end(address, vma->vm_end);
- unsigned int max_nr =
- (end_addr - address) >> PAGE_SHIFT;
+ unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
+ unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
pte_t pteval = ptep_get(pvmw.pte);
nr = folio_pte_batch(folio, pvmw.pte,
@@ -974,8 +972,7 @@ static bool folio_referenced_one(struct
}
ptes += nr;
- if (clear_flush_young_ptes_notify(vma, address,
- pvmw.pte, nr))
+ if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr))
referenced++;
/* Skip the batched PTEs */
pvmw.pte += nr - 1;
_
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [PATCH v6 0/5] support batch checking of references and unmapping for large folios
2026-02-10 1:53 ` [PATCH v6 0/5] support batch checking of references and unmapping for " Andrew Morton
@ 2026-02-10 2:01 ` Baolin Wang
0 siblings, 0 replies; 38+ messages in thread
From: Baolin Wang @ 2026-02-10 2:01 UTC (permalink / raw)
To: Andrew Morton
Cc: david, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, baohua, dev.jain, linux-mm, linux-arm-kernel,
linux-kernel
On 2/10/26 9:53 AM, Andrew Morton wrote:
> On Mon, 9 Feb 2026 22:07:23 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>
>> Currently, folio_referenced_one() always checks the young flag for each PTE
>> sequentially, which is inefficient for large folios. This inefficiency is
>> especially noticeable when reclaiming clean file-backed large folios, where
>> folio_referenced() is observed as a significant performance hotspot.
>>
>> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
>> an optimization to clear the young flags for PTEs within a contiguous range.
>> However, this is not sufficient. We can extend this to perform batched operations
>> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>>
>> Similar to folio_referenced_one(), we can also apply batched unmapping for large
>> file folios to optimize the performance of file folio reclamation. By supporting
>> batched checking of the young flags, flushing TLB entries, and unmapping, I can
>> observed a significant performance improvements in my performance tests for file
>> folios reclamation. Please check the performance data in the commit message of
>> each patch.
>>
>
> Thanks, I updated mm.git to this version. Below is how v6 altered
> mm.git.
>
> I notice that this fix:
>
> https://lore.kernel.org/all/de141225-a0c1-41fd-b3e1-bcab09827ddd@linux.alibaba.com/T/#u
>
> was not carried forward. Was this deliberate?
Yes. After discussing with David[1], we believe the original patch is
correct, so the 'fix' is unnecessary.
[1]
https://lore.kernel.org/all/280ae63e-d66e-438f-8045-6c870420fe76@linux.alibaba.com/
The following diff looks good to me. Thanks.
> Also, regarding the 80-column tricks in folio_referenced_one(): we're
> allowed to do this ;)
>
>
> unsigned long end_addr;
> unsigned int max_nr;
>
> end_addr = pmd_addr_end(address, vma->vm_end);
> max_nr = (end_addr - address) >> PAGE_SHIFT;
>
>
>
>
> arch/arm64/include/asm/pgtable.h | 2 +-
> include/linux/pgtable.h | 16 ++++++++++------
> mm/rmap.c | 9 +++------
> 3 files changed, 14 insertions(+), 13 deletions(-)
>
> --- a/arch/arm64/include/asm/pgtable.h~b
> +++ a/arch/arm64/include/asm/pgtable.h
> @@ -1843,7 +1843,7 @@ static inline int clear_flush_young_ptes
> unsigned long addr, pte_t *ptep,
> unsigned int nr)
> {
> - if (likely(nr == 1 && !pte_valid_cont(__ptep_get(ptep))))
> + if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
> return __ptep_clear_flush_young(vma, addr, ptep);
>
> return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> --- a/include/linux/pgtable.h~b
> +++ a/include/linux/pgtable.h
> @@ -1070,8 +1070,8 @@ static inline void wrprotect_ptes(struct
>
> #ifndef clear_flush_young_ptes
> /**
> - * clear_flush_young_ptes - Clear the access bit and perform a TLB flush for PTEs
> - * that map consecutive pages of the same folio.
> + * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
> + * folio as old and flush the TLB.
> * @vma: The virtual memory area the pages are mapped into.
> * @addr: Address the first page is mapped at.
> * @ptep: Page table pointer for the first entry.
> @@ -1087,13 +1087,17 @@ static inline void wrprotect_ptes(struct
> * pages that belong to the same folio. The PTEs are all in the same PMD.
> */
> static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep,
> - unsigned int nr)
> + unsigned long addr, pte_t *ptep, unsigned int nr)
> {
> - int i, young = 0;
> + int young = 0;
>
> - for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
> + for (;;) {
> young |= ptep_clear_flush_young(vma, addr, ptep);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
>
> return young;
> }
> --- a/mm/rmap.c~b
> +++ a/mm/rmap.c
> @@ -963,10 +963,8 @@ static bool folio_referenced_one(struct
> referenced++;
> } else if (pvmw.pte) {
> if (folio_test_large(folio)) {
> - unsigned long end_addr =
> - pmd_addr_end(address, vma->vm_end);
> - unsigned int max_nr =
> - (end_addr - address) >> PAGE_SHIFT;
> + unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
> + unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
> pte_t pteval = ptep_get(pvmw.pte);
>
> nr = folio_pte_batch(folio, pvmw.pte,
> @@ -974,8 +972,7 @@ static bool folio_referenced_one(struct
> }
>
> ptes += nr;
> - if (clear_flush_young_ptes_notify(vma, address,
> - pvmw.pte, nr))
> + if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr))
> referenced++;
> /* Skip the batched PTEs */
> pvmw.pte += nr - 1;
> _
^ permalink raw reply [flat|nested] 38+ messages in thread