* [PATCH v2 0/3] support batch checking of references and unmapping for large folios
@ 2025-12-11 8:16 Baolin Wang
2025-12-11 8:16 ` [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag " Baolin Wang
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Baolin Wang @ 2025-12-11 8:16 UTC (permalink / raw)
To: akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, baolin.wang,
linux-mm, linux-arm-kernel, linux-kernel
Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.
Moreover, on Arm architecture, which supports contiguous PTEs, there is already
an optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. We can extend this to perform batched operations
for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
Similar to folio_referenced_one(), we can also apply batched unmapping for large
file folios to optimize the performance of file folio reclamation. With batched
checking of the young flags, batched TLB flushing, and batched unmapping, I
observed significant performance improvements in my tests of file folio
reclamation. Please check the performance data in the commit message of each
patch.
I ran stress-ng and the mm selftests; no issues were found.
Changes from v1:
- Add a new patch to support batched unmapping for file large folios.
- Update the cover letter.
Baolin Wang (3):
arm64: mm: support batch clearing of the young flag for large folios
mm: rmap: support batched checks of the references for large folios
mm: rmap: support batched unmapping for file large folios
arch/arm64/include/asm/pgtable.h | 23 ++++++++++++-----
arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
include/linux/mmu_notifier.h | 9 ++++---
include/linux/pgtable.h | 19 ++++++++++++++
mm/rmap.c | 29 +++++++++++++++++----
5 files changed, 96 insertions(+), 28 deletions(-)
--
2.47.3
* [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag for large folios
2025-12-11 8:16 [PATCH v2 0/3] support batch checking of references and unmapping for large folios Baolin Wang
@ 2025-12-11 8:16 ` Baolin Wang
2025-12-15 11:36 ` Lorenzo Stoakes
2025-12-17 15:43 ` Ryan Roberts
2025-12-11 8:16 ` [PATCH v2 2/3] mm: rmap: support batched checks of the references " Baolin Wang
2025-12-11 8:16 ` [PATCH v2 3/3] mm: rmap: support batched unmapping for file " Baolin Wang
2 siblings, 2 replies; 35+ messages in thread
From: Baolin Wang @ 2025-12-11 8:16 UTC (permalink / raw)
To: akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, baolin.wang,
linux-mm, linux-arm-kernel, linux-kernel
Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
only clear the young flag and flush TLBs for PTEs within the contiguous range.
To support batched PTE operations for large folios of other sizes in the following
patches, add a new parameter to specify the number of PTEs.
While we are at it, rename the functions to maintain consistency with other
contpte_*() functions.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
arch/arm64/include/asm/pgtable.h | 12 ++++-----
arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
2 files changed, 37 insertions(+), 19 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0944e296dd4a..e03034683156 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
unsigned long addr, pte_t *ptep,
unsigned int nr, int full);
-extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep);
-extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep);
+extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, unsigned int nr);
+extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, unsigned int nr);
extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned int nr);
extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
@@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
if (likely(!pte_valid_cont(orig_pte)))
return __ptep_test_and_clear_young(vma, addr, ptep);
- return contpte_ptep_test_and_clear_young(vma, addr, ptep);
+ return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
}
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
if (likely(!pte_valid_cont(orig_pte)))
return __ptep_clear_flush_young(vma, addr, ptep);
- return contpte_ptep_clear_flush_young(vma, addr, ptep);
+ return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
}
#define wrprotect_ptes wrprotect_ptes
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index c0557945939c..19b122441be3 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
}
EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
-int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep)
+int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr)
{
/*
* ptep_clear_flush_young() technically requires us to clear the access
@@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
* having to unfold.
*/
+ unsigned long start = addr;
+ unsigned long end = start + nr * PAGE_SIZE;
int young = 0;
int i;
- ptep = contpte_align_down(ptep);
- addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+ if (pte_cont(__ptep_get(ptep + nr - 1)))
+ end = ALIGN(end, CONT_PTE_SIZE);
- for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
- young |= __ptep_test_and_clear_young(vma, addr, ptep);
+ if (pte_cont(__ptep_get(ptep))) {
+ start = ALIGN_DOWN(start, CONT_PTE_SIZE);
+ ptep = contpte_align_down(ptep);
+ }
+
+ nr = (end - start) / PAGE_SIZE;
+ for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
+ young |= __ptep_test_and_clear_young(vma, start, ptep);
return young;
}
-EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
+EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
-int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep)
+int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr)
{
int young;
- young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+ young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
if (young) {
+ unsigned long start = addr;
+ unsigned long end = start + nr * PAGE_SIZE;
+
+ if (pte_cont(__ptep_get(ptep + nr - 1)))
+ end = ALIGN(end, CONT_PTE_SIZE);
+
+ if (pte_cont(__ptep_get(ptep)))
+ start = ALIGN_DOWN(start, CONT_PTE_SIZE);
+
/*
* See comment in __ptep_clear_flush_young(); same rationale for
* eliding the trailing DSB applies here.
*/
- addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
- __flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
+ __flush_tlb_range_nosync(vma->vm_mm, start, end,
PAGE_SIZE, true, 3);
}
return young;
}
-EXPORT_SYMBOL_GPL(contpte_ptep_clear_flush_young);
+EXPORT_SYMBOL_GPL(contpte_clear_flush_young_ptes);
void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned int nr)
--
2.47.3
* [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
2025-12-11 8:16 [PATCH v2 0/3] support batch checking of references and unmapping for large folios Baolin Wang
2025-12-11 8:16 ` [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag " Baolin Wang
@ 2025-12-11 8:16 ` Baolin Wang
2025-12-15 12:22 ` Lorenzo Stoakes
` (3 more replies)
2025-12-11 8:16 ` [PATCH v2 3/3] mm: rmap: support batched unmapping for file " Baolin Wang
2 siblings, 4 replies; 35+ messages in thread
From: Baolin Wang @ 2025-12-11 8:16 UTC (permalink / raw)
To: akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, baolin.wang,
linux-mm, linux-arm-kernel, linux-kernel
Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.
Moreover, on Arm architecture, which supports contiguous PTEs, there is already
an optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. We can extend this to perform batched operations
for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
of the young flags and flushing TLB entries, thereby improving performance
during large folio reclamation.
Performance testing:
Allocate 10G of clean file-backed folios by mmap() in a memory cgroup, and try to
reclaim 8G of file-backed folios via the memory.reclaim interface. I observed a
33% performance improvement on my Arm64 32-core server (and a 10%+ improvement
on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
from approximately 35% to around 5%.
W/o patchset:
real 0m1.518s
user 0m0.000s
sys 0m1.518s
W/ patchset:
real 0m1.018s
user 0m0.000s
sys 0m1.018s
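For reference, a minimal sketch of the kind of test workload described above
(purely illustrative: the file path, cgroup path and exact sizes are assumptions,
and the file is assumed to live on a filesystem with large folio support):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE	(10UL << 30)	/* 10G of clean file-backed memory */

int main(void)
{
	int fd = open("/mnt/test/file-10g", O_RDONLY);
	char *p = mmap(NULL, MAP_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
	volatile char sink;
	size_t i;

	/* Fault the whole mapping in so the page cache is populated. */
	for (i = 0; i < MAP_SIZE; i += 4096)
		sink = p[i];

	/* Ask the test cgroup to reclaim 8G; this is the step being timed. */
	int cg = open("/sys/fs/cgroup/test/memory.reclaim", O_WRONLY);

	write(cg, "8G", 2);
	close(cg);
	munmap(p, MAP_SIZE);
	close(fd);
	return 0;
}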
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
arch/arm64/include/asm/pgtable.h | 11 +++++++++++
include/linux/mmu_notifier.h | 9 +++++----
include/linux/pgtable.h | 19 +++++++++++++++++++
mm/rmap.c | 22 ++++++++++++++++++++--
4 files changed, 55 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e03034683156..a865bd8c46a3 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1869,6 +1869,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
}
+#define clear_flush_young_ptes clear_flush_young_ptes
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr)
+{
+ if (likely(nr == 1))
+ return __ptep_clear_flush_young(vma, addr, ptep);
+
+ return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
+}
+
#define wrprotect_ptes wrprotect_ptes
static __always_inline void wrprotect_ptes(struct mm_struct *mm,
unsigned long addr, pte_t *ptep, unsigned int nr)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d1094c2d5fb6..be594b274729 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
range->owner = owner;
}
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
+#define ptep_clear_flush_young_notify(__vma, __address, __ptep, __nr) \
({ \
int __young; \
struct vm_area_struct *___vma = __vma; \
unsigned long ___address = __address; \
- __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
+ unsigned int ___nr = __nr; \
+ __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
__young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
___address, \
___address + \
- PAGE_SIZE); \
+ ___nr * PAGE_SIZE); \
__young; \
})
@@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
#define mmu_notifier_range_update_to_read_only(r) false
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define ptep_clear_flush_young_notify clear_flush_young_ptes
#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
#define ptep_clear_young_notify ptep_test_and_clear_young
#define pmdp_clear_young_notify pmdp_test_and_clear_young
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b13b6f42be3c..c7d0fd228cb7 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -947,6 +947,25 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
}
#endif
+#ifndef clear_flush_young_ptes
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr)
+{
+ int young = 0;
+
+ for (;;) {
+ young |= ptep_clear_flush_young(vma, addr, ptep);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ }
+
+ return young;
+}
+#endif
+
/*
* On some architectures hardware does not set page access bit when accessing
* memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/rmap.c b/mm/rmap.c
index d6799afe1114..ec232165c47d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -827,9 +827,11 @@ static bool folio_referenced_one(struct folio *folio,
struct folio_referenced_arg *pra = arg;
DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
int ptes = 0, referenced = 0;
+ unsigned int nr;
while (page_vma_mapped_walk(&pvmw)) {
address = pvmw.address;
+ nr = 1;
if (vma->vm_flags & VM_LOCKED) {
ptes++;
@@ -874,9 +876,21 @@ static bool folio_referenced_one(struct folio *folio,
if (lru_gen_look_around(&pvmw))
referenced++;
} else if (pvmw.pte) {
+ if (folio_test_large(folio)) {
+ unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
+ unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
+ pte_t pteval = ptep_get(pvmw.pte);
+
+ nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
+ }
+
+ ptes += nr;
if (ptep_clear_flush_young_notify(vma, address,
- pvmw.pte))
+ pvmw.pte, nr))
referenced++;
+ /* Skip the batched PTEs */
+ pvmw.pte += nr - 1;
+ pvmw.address += (nr - 1) * PAGE_SIZE;
} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
if (pmdp_clear_flush_young_notify(vma, address,
pvmw.pmd))
@@ -886,7 +900,11 @@ static bool folio_referenced_one(struct folio *folio,
WARN_ON_ONCE(1);
}
- pra->mapcount--;
+ pra->mapcount -= nr;
+ if (ptes == pvmw.nr_pages) {
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
}
if (referenced)
--
2.47.3
* [PATCH v2 3/3] mm: rmap: support batched unmapping for file large folios
2025-12-11 8:16 [PATCH v2 0/3] support batch checking of references and unmapping for large folios Baolin Wang
2025-12-11 8:16 ` [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag " Baolin Wang
2025-12-11 8:16 ` [PATCH v2 2/3] mm: rmap: support batched checks of the references " Baolin Wang
@ 2025-12-11 8:16 ` Baolin Wang
2025-12-11 12:36 ` Barry Song
2025-12-15 12:38 ` Lorenzo Stoakes
2 siblings, 2 replies; 35+ messages in thread
From: Baolin Wang @ 2025-12-11 8:16 UTC (permalink / raw)
To: akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, baolin.wang,
linux-mm, linux-arm-kernel, linux-kernel
Similar to folio_referenced_one(), we can apply batched unmapping for file
large folios to optimize the performance of file folio reclamation.
Performance testing:
Allocate 10G of clean file-backed folios by mmap() in a memory cgroup, and try to
reclaim 8G of file-backed folios via the memory.reclaim interface. I observed a
75% performance improvement on my Arm64 32-core server.
W/o patch:
real 0m1.018s
user 0m0.000s
sys 0m1.018s
W/ patch:
real 0m0.249s
user 0m0.000s
sys 0m0.249s
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/rmap.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index ec232165c47d..4c9d5777c8da 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1855,9 +1855,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
end_addr = pmd_addr_end(addr, vma->vm_end);
max_nr = (end_addr - addr) >> PAGE_SHIFT;
- /* We only support lazyfree batching for now ... */
- if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
+ /* We only support lazyfree or file folios batching for now ... */
+ if (folio_test_anon(folio) && folio_test_swapbacked(folio))
return 1;
+
if (pte_unused(pte))
return 1;
@@ -2223,7 +2224,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*
* See Documentation/mm/mmu_notifier.rst
*/
- dec_mm_counter(mm, mm_counter_file(folio));
+ add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
}
discard:
if (unlikely(folio_test_hugetlb(folio))) {
--
2.47.3
* Re: [PATCH v2 3/3] mm: rmap: support batched unmapping for file large folios
2025-12-11 8:16 ` [PATCH v2 3/3] mm: rmap: support batched unmapping for file " Baolin Wang
@ 2025-12-11 12:36 ` Barry Song
2025-12-15 12:38 ` Lorenzo Stoakes
1 sibling, 0 replies; 35+ messages in thread
From: Barry Song @ 2025-12-11 12:36 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, linux-mm, linux-arm-kernel, linux-kernel
On Thu, Dec 11, 2025 at 4:17 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Similar to folio_referenced_one(), we can apply batched unmapping for file
> large folios to optimize the performance of file folios reclamation.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 75% performance improvement on my Arm64 32-core server.
>
> W/o patch:
> real 0m1.018s
> user 0m0.000s
> sys 0m1.018s
>
> W/ patch:
> real 0m0.249s
> user 0m0.000s
> sys 0m0.249s
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
It appears quite straightforward to introduce file folio support based on the
current lazyfree implementation.
Acked-by: Barry Song <baohua@kernel.org>
* Re: [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag for large folios
2025-12-11 8:16 ` [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag " Baolin Wang
@ 2025-12-15 11:36 ` Lorenzo Stoakes
2025-12-16 3:32 ` Baolin Wang
2025-12-17 15:43 ` Ryan Roberts
1 sibling, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-12-15 11:36 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, david, catalin.marinas, will, ryan.roberts, Liam.Howlett,
vbabka, rppt, surenb, mhocko, riel, harry.yoo, jannh, willy,
baohua, linux-mm, linux-arm-kernel, linux-kernel
On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
> only clear the young flag and flush TLBs for PTEs within the contiguous range.
> To support batch PTE operations for other sized large folios in the following
> patches, adding a new parameter to specify the number of PTEs.
>
> While we are at it, rename the functions to maintain consistency with other
> contpte_*() functions.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> arch/arm64/include/asm/pgtable.h | 12 ++++-----
> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
> 2 files changed, 37 insertions(+), 19 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0944e296dd4a..e03034683156 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep,
> unsigned int nr, int full);
> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep);
> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep);
> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr);
> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr);
In core mm at least, as a convention, we strip 'extern' from header declarations
as we go as they're not necessary.
I wonder about this naming convention also. The contpte_xxx_() prefix
obviously implies we are operating upon PTEs in the contiguous range; now
we are doing something... different, and 'nr' is a bit vague.
Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
now we're not saying they're necessarily contiguous in the sense of contpte...
I wonder whether we'd be better off introducing a new function that
explicitly has 'batch' or 'batched' in the name, and have the usual
contpte_***() stuff and callers that need batching using a new underlying helper?
Obviously I defer to Ryan on all this, just my thoughts...
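Something like the below, purely as a hypothetical sketch of the split I have in
mind (names are illustrative, not an existing API):

/* Batched helper that does the real work for an arbitrary number of PTEs. */
int contpte_clear_flush_young_ptes_batched(struct vm_area_struct *vma,
					   unsigned long addr, pte_t *ptep,
					   unsigned int nr);

/* The existing-style entry point keeps its contpte-only meaning. */
static inline int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
						 unsigned long addr, pte_t *ptep)
{
	return contpte_clear_flush_young_ptes_batched(vma, addr, ptep, CONT_PTES);
}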
> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned int nr);
> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> if (likely(!pte_valid_cont(orig_pte)))
> return __ptep_test_and_clear_young(vma, addr, ptep);
>
> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
> }
>
> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> if (likely(!pte_valid_cont(orig_pte)))
> return __ptep_clear_flush_young(vma, addr, ptep);
>
> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
> }
>
> #define wrprotect_ptes wrprotect_ptes
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index c0557945939c..19b122441be3 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> }
> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>
> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep)
> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> {
> /*
> * ptep_clear_flush_young() technically requires us to clear the access
> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> * having to unfold.
> */
Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
"And since we only create a contig range when the range is covered by a single
folio, we can get away with clearing young for the whole contig range here, so
we avoid having to unfold."
However now you're allowing for large folios that are not contpte?
Overall feeling pretty iffy about implementing this this way.
>
> + unsigned long start = addr;
> + unsigned long end = start + nr * PAGE_SIZE;
So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
table?
> int young = 0;
> int i;
>
> - ptep = contpte_align_down(ptep);
> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + if (pte_cont(__ptep_get(ptep + nr - 1)))
> + end = ALIGN(end, CONT_PTE_SIZE);
OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
with other PTE entries present there not having PTE_CONT set (Ryan - do
correct me if I'm wrong!)
I guess then no worry about running off the end of the PTE table here.
I wonder about using pte_cont_addr_end() here, but I guess you are not
intending to limit to a range here but rather capture the entire contpte
range of the end.
I don't love that now a function prefixed with contpte_... can have a
condition where part of the input range is _not_ a contpte entry, or is
specified partially...
>
> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
> + if (pte_cont(__ptep_get(ptep))) {
> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
> + ptep = contpte_align_down(ptep);
> + }
> +
> + nr = (end - start) / PAGE_SIZE;
Hm don't love this reuse of input param.
> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
> + young |= __ptep_test_and_clear_young(vma, start, ptep);
Again, overall find it a bit iffy that we are essentially overloading the
code with non-contpte behaviour.
>
> return young;
> }
> -EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
> +EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
>
> -int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep)
> +int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> {
> int young;
>
> - young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> + young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
>
> if (young) {
> + unsigned long start = addr;
> + unsigned long end = start + nr * PAGE_SIZE;
> +
> + if (pte_cont(__ptep_get(ptep + nr - 1)))
> + end = ALIGN(end, CONT_PTE_SIZE);
> +
> + if (pte_cont(__ptep_get(ptep)))
> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
> +
> /*
> * See comment in __ptep_clear_flush_young(); same rationale for
> * eliding the trailing DSB applies here.
> */
> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> - __flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
> + __flush_tlb_range_nosync(vma->vm_mm, start, end,
> PAGE_SIZE, true, 3);
Similar comments to above.
Am curious as to Ryan's opinion.
> }
>
> return young;
> }
> -EXPORT_SYMBOL_GPL(contpte_ptep_clear_flush_young);
> +EXPORT_SYMBOL_GPL(contpte_clear_flush_young_ptes);
>
> void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned int nr)
> --
> 2.47.3
>
* Re: [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
2025-12-11 8:16 ` [PATCH v2 2/3] mm: rmap: support batched checks of the references " Baolin Wang
@ 2025-12-15 12:22 ` Lorenzo Stoakes
2025-12-16 3:47 ` Baolin Wang
2025-12-17 6:23 ` Dev Jain
` (2 subsequent siblings)
3 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-12-15 12:22 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, david, catalin.marinas, will, ryan.roberts, Liam.Howlett,
vbabka, rppt, surenb, mhocko, riel, harry.yoo, jannh, willy,
baohua, linux-mm, linux-arm-kernel, linux-kernel
On Thu, Dec 11, 2025 at 04:16:55PM +0800, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
arm64 you mean :)
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> of the young flags and flushing TLB entries, thereby improving performance
> during large folio reclamation.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> from approximately 35% to around 5%.
>
> W/o patchset:
> real 0m1.518s
> user 0m0.000s
> sys 0m1.518s
>
> W/ patchset:
> real 0m1.018s
> user 0m0.000s
> sys 0m1.018s
That's nice!
Have you performed the same kind of performance testing on non-arm64? As in the
past we've had a batch optimisation go horribly wrong on non-arm64 even if it
was ok on arm64 :)
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
> include/linux/mmu_notifier.h | 9 +++++----
> include/linux/pgtable.h | 19 +++++++++++++++++++
> mm/rmap.c | 22 ++++++++++++++++++++--
> 4 files changed, 55 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index e03034683156..a865bd8c46a3 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1869,6 +1869,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
> }
>
> +#define clear_flush_young_ptes clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> +{
> + if (likely(nr == 1))
> + return __ptep_clear_flush_young(vma, addr, ptep);
> +
> + return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> +}
Hmm again this is a weird way of exposing a contepte-specific function, you
really need to rework that as discussed in patch 1/3.
It seems to me we can share code to avoid this.
> +
> #define wrprotect_ptes wrprotect_ptes
> static __always_inline void wrprotect_ptes(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep, unsigned int nr)
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d1094c2d5fb6..be594b274729 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
> range->owner = owner;
> }
>
> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
> +#define ptep_clear_flush_young_notify(__vma, __address, __ptep, __nr) \
> ({ \
> int __young; \
> struct vm_area_struct *___vma = __vma; \
> unsigned long ___address = __address; \
> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
> + unsigned int ___nr = __nr; \
> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
> ___address, \
> ___address + \
> - PAGE_SIZE); \
> + nr * PAGE_SIZE); \
> __young; \
> })
An aside, but I wonder why this needs to be a (pretty disgusting) macro?
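Purely for illustration, a hypothetical static-inline equivalent (not an existing
helper, and the !CONFIG_MMU_NOTIFIER case isn't considered here):

static inline int clear_flush_young_ptes_notify(struct vm_area_struct *vma,
						unsigned long address,
						pte_t *ptep, unsigned int nr)
{
	int young = clear_flush_young_ptes(vma, address, ptep, nr);

	/* Notify for the whole batched range, mirroring the macro above. */
	young |= mmu_notifier_clear_flush_young(vma->vm_mm, address,
						address + nr * PAGE_SIZE);
	return young;
}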
>
> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>
> #define mmu_notifier_range_update_to_read_only(r) false
>
> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
> +#define ptep_clear_flush_young_notify clear_flush_young_ptes
> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
> #define ptep_clear_young_notify ptep_test_and_clear_young
> #define pmdp_clear_young_notify pmdp_test_and_clear_young
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b13b6f42be3c..c7d0fd228cb7 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -947,6 +947,25 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> }
> #endif
>
> +#ifndef clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> +{
> + int young = 0;
> +
> + for (;;) {
> + young |= ptep_clear_flush_young(vma, addr, ptep);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +
> + return young;
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d6799afe1114..ec232165c47d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -827,9 +827,11 @@ static bool folio_referenced_one(struct folio *folio,
> struct folio_referenced_arg *pra = arg;
> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> int ptes = 0, referenced = 0;
> + unsigned int nr;
>
> while (page_vma_mapped_walk(&pvmw)) {
> address = pvmw.address;
> + nr = 1;
>
> if (vma->vm_flags & VM_LOCKED) {
> ptes++;
> @@ -874,9 +876,21 @@ static bool folio_referenced_one(struct folio *folio,
> if (lru_gen_look_around(&pvmw))
> referenced++;
> } else if (pvmw.pte) {
> + if (folio_test_large(folio)) {
> + unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
> + unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
> + pte_t pteval = ptep_get(pvmw.pte);
> +
> + nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
I do wish we could put this fiddly logic into a helper for each place in
which we do this kind of 'end of the PTE table, maximum number we could
have' logic.
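E.g. something along these lines, as a purely hypothetical helper (name and
placement are illustrative only):

static inline unsigned int folio_pte_batch_in_pmd(struct folio *folio,
		struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
{
	/* Clamp the batch to the end of the current PTE table / VMA. */
	unsigned long end_addr = pmd_addr_end(addr, vma->vm_end);
	unsigned int max_nr = (end_addr - addr) >> PAGE_SHIFT;

	return folio_pte_batch(folio, ptep, ptep_get(ptep), max_nr);
}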
> + }
NIT but we're running into pretty long lines here.
> +
> + ptes += nr;
> if (ptep_clear_flush_young_notify(vma, address,
> - pvmw.pte))
> + pvmw.pte, nr))
> referenced++;
I find this referenced logic weird, it seems like it should be a boolean,
but this is outside the scope of your patch here :)
> + /* Skip the batched PTEs */
> + pvmw.pte += nr - 1;
> + pvmw.address += (nr - 1) * PAGE_SIZE;
> } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> if (pmdp_clear_flush_young_notify(vma, address,
> pvmw.pmd))
> @@ -886,7 +900,11 @@ static bool folio_referenced_one(struct folio *folio,
> WARN_ON_ONCE(1);
> }
>
> - pra->mapcount--;
> + pra->mapcount -= nr;
> + if (ptes == pvmw.nr_pages) {
> + page_vma_mapped_walk_done(&pvmw);
> + break;
> + }
> }
>
> if (referenced)
> --
> 2.47.3
>
* Re: [PATCH v2 3/3] mm: rmap: support batched unmapping for file large folios
2025-12-11 8:16 ` [PATCH v2 3/3] mm: rmap: support batched unmapping for file " Baolin Wang
2025-12-11 12:36 ` Barry Song
@ 2025-12-15 12:38 ` Lorenzo Stoakes
2025-12-16 5:48 ` Baolin Wang
1 sibling, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-12-15 12:38 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, david, catalin.marinas, will, ryan.roberts, Liam.Howlett,
vbabka, rppt, surenb, mhocko, riel, harry.yoo, jannh, willy,
baohua, linux-mm, linux-arm-kernel, linux-kernel
On Thu, Dec 11, 2025 at 04:16:56PM +0800, Baolin Wang wrote:
> Similar to folio_referenced_one(), we can apply batched unmapping for file
> large folios to optimize the performance of file folios reclamation.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 75% performance improvement on my Arm64 32-core server.
Again, you must test on non-arm64 architectures and report the numbers for this
also.
>
> W/o patch:
> real 0m1.018s
> user 0m0.000s
> sys 0m1.018s
>
> W/ patch:
> real 0m0.249s
> user 0m0.000s
> sys 0m0.249s
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> mm/rmap.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ec232165c47d..4c9d5777c8da 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1855,9 +1855,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> end_addr = pmd_addr_end(addr, vma->vm_end);
> max_nr = (end_addr - addr) >> PAGE_SHIFT;
>
> - /* We only support lazyfree batching for now ... */
> - if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> + /* We only support lazyfree or file folios batching for now ... */
> + if (folio_test_anon(folio) && folio_test_swapbacked(folio))
Why is it now ok to support file-backed batched unmapping when it wasn't in
Barry's series (see [0])? You don't seem to be justifying this?
[0]:https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
> return 1;
> +
> if (pte_unused(pte))
> return 1;
>
> @@ -2223,7 +2224,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> *
> * See Documentation/mm/mmu_notifier.rst
> */
> - dec_mm_counter(mm, mm_counter_file(folio));
> + add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
Was this just a bug before?
> }
> discard:
> if (unlikely(folio_test_hugetlb(folio))) {
> --
> 2.47.3
>
Thanks, Lorenzo
* Re: [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag for large folios
2025-12-15 11:36 ` Lorenzo Stoakes
@ 2025-12-16 3:32 ` Baolin Wang
2025-12-16 11:11 ` Lorenzo Stoakes
0 siblings, 1 reply; 35+ messages in thread
From: Baolin Wang @ 2025-12-16 3:32 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, david, catalin.marinas, will, ryan.roberts, Liam.Howlett,
vbabka, rppt, surenb, mhocko, riel, harry.yoo, jannh, willy,
baohua, linux-mm, linux-arm-kernel, linux-kernel
On 2025/12/15 19:36, Lorenzo Stoakes wrote:
> On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
>> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>> To support batch PTE operations for other sized large folios in the following
>> patches, adding a new parameter to specify the number of PTEs.
>>
>> While we are at it, rename the functions to maintain consistency with other
>> contpte_*() functions.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 0944e296dd4a..e03034683156 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep,
>> unsigned int nr, int full);
>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep);
>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep);
>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>
> In core mm at least, as a convention, we strip 'extern' from header declarations
> as we go as they're not necessary.
Sure. I can remove the 'extern'.
> I wonder about this naming convention also. The contpte_xxx_() prefix
> obviously implies we are operating upon PTEs in the contiguous range, now
> we are doing something... different and 'nr' is a bit vague.
>
> Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
Yes, 'nr' here means nr_pages, which follows the convention of the
parameters in contpte_xxx_() functions.
> now we're not saying they're necessarily contiguous in the sense of contpte...
>
> I wonder whether we'd be better off introducing a new function that
> explicitly has 'batch' or 'batched' in the name, and have the usual
> contpte_***() stuff and callers that need batching using a new underlying helper?
OK. I get your point. However, this is outside the scope of my patchset.
Perhaps we can clean up all the contpte_xxx_() functions if everyone
agrees that this is confusing.
> Obviously I defer to Ryan on all this, just my thoughts...
>
>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep, unsigned int nr);
>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> if (likely(!pte_valid_cont(orig_pte)))
>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>
>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>> }
>>
>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> if (likely(!pte_valid_cont(orig_pte)))
>> return __ptep_clear_flush_young(vma, addr, ptep);
>>
>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>> }
>>
>> #define wrprotect_ptes wrprotect_ptes
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index c0557945939c..19b122441be3 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>> }
>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>
>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep)
>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> {
>> /*
>> * ptep_clear_flush_young() technically requires us to clear the access
>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> * having to unfold.
>> */
>
> Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
>
> "And since we only create a contig range when the range is covered by a single
> folio, we can get away with clearing young for the whole contig range here, so
> we avoid having to unfold."
I think the original comment is still clear:
"
/*
* ptep_clear_flush_young() technically requires us to clear the access
* flag for a _single_ pte. However, the core-mm code actually tracks
* access/dirty per folio, not per page. And since we only create a
* contig range when the range is covered by a single folio, we can get
* away with clearing young for the whole contig range here, so we avoid
* having to unfold.
*/
"
> However now you're allowing for large folios that are not contpte?
>
> Overall feeling pretty iffy about implementing this this way.
Please see the above comments, which explain why we should do that.
>> + unsigned long start = addr;
>> + unsigned long end = start + nr * PAGE_SIZE;
>
> So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
> table?
The caller has made sure of that (see folio_referenced_one()).
>> int young = 0;
>> int i;
>>
>> - ptep = contpte_align_down(ptep);
>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>> + end = ALIGN(end, CONT_PTE_SIZE);
>
> OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
> with other PTE entries present there not having PTE_CONT set (Ryan - do
> correct me if I'm wrong!)
>
> I guess then no worry about running off the end of the PTE table here.
>
> I wonder about using pte_cont_addr_end() here, but I guess you are not
> intending to limit to a range here but rather capture the entire contpte
> range of the end.
Right.
> I don't love that now a function prefixed with contpte_... can have a
> condition where part of the input range is _not_ a contpte entry, or is
> specified partially...
Like I said above, this is outside the scope of my patchset. If Ryan
also thinks we need to do some cleanups for all the contpte_xxx()
functions in the contpte.c file, we can send a follow-up patchset to
rename all the contpte_xxx() functions to make them clearer.
>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>> + if (pte_cont(__ptep_get(ptep))) {
>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>> + ptep = contpte_align_down(ptep);
>> + }
>> +
>> + nr = (end - start) / PAGE_SIZE;
>
> Hm don't love this reuse of input param.
OK. Will use another local variable instead.
>
>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>
> Again, overall find it a bit iffy that we are essentially overloading the
> code with non-contpte behaviour.
Let me clarify further: this is already the handling logic of the contpte_xxx()
functions in the contpte.c file. For example, contpte_set_ptes() has a
parameter 'nr' which may be greater than CONT_PTES, meaning it can handle a
large folio spanning multiple CONT_PTE-sized ranges (see the sketch below).
It might be a bit confusing, and I understand this is why you want to
change the 'contpte_' prefix to a 'batch_' prefix. Of course, we can
hope Ryan will provide some input on this.
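For reference, the dispatch pattern I am describing, in simplified form (a sketch,
not the verbatim arm64 code):

static inline void example_set_ptes(struct mm_struct *mm, unsigned long addr,
				    pte_t *ptep, pte_t pte, unsigned int nr)
{
	if (likely(nr == 1))
		__set_ptes(mm, addr, ptep, pte, 1);		/* order-0 fast path */
	else
		contpte_set_ptes(mm, addr, ptep, pte, nr);	/* nr may exceed CONT_PTES */
}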
Thanks for reviewing.
* Re: [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
2025-12-15 12:22 ` Lorenzo Stoakes
@ 2025-12-16 3:47 ` Baolin Wang
0 siblings, 0 replies; 35+ messages in thread
From: Baolin Wang @ 2025-12-16 3:47 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, david, catalin.marinas, will, ryan.roberts, Liam.Howlett,
vbabka, rppt, surenb, mhocko, riel, harry.yoo, jannh, willy,
baohua, linux-mm, linux-arm-kernel, linux-kernel
On 2025/12/15 20:22, Lorenzo Stoakes wrote:
> On Thu, Dec 11, 2025 at 04:16:55PM +0800, Baolin Wang wrote:
>> Currently, folio_referenced_one() always checks the young flag for each PTE
>> sequentially, which is inefficient for large folios. This inefficiency is
>> especially noticeable when reclaiming clean file-backed large folios, where
>> folio_referenced() is observed as a significant performance hotspot.
>>
>> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
>
> arm64 you mean :)
Right. Will make it clear.
>> an optimization to clear the young flags for PTEs within a contiguous range.
>> However, this is not sufficient. We can extend this to perform batched operations
>> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>>
>> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
>> of the young flags and flushing TLB entries, thereby improving performance
>> during large folio reclamation.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
>> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
>> from approximately 35% to around 5%.
>>
>> W/o patchset:
>> real 0m1.518s
>> user 0m0.000s
>> sys 0m1.518s
>>
>> W/ patchset:
>> real 0m1.018s
>> user 0m0.000s
>> sys 0m1.018s
>
> That's nice!
>
> Have you performed the same kind of performance testing on non-arm64? As in the
> past we've had a batch optimisation go horribly wrong on non-arm64 even if it
> was ok on arm64 :)
Yes, seems you missed my test results for the x86 machine in the commit
message :)
"I can observe 33% performance improvement on my Arm64 32-core server
(and 10%+ improvement on my X86 machine)."
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
>> include/linux/mmu_notifier.h | 9 +++++----
>> include/linux/pgtable.h | 19 +++++++++++++++++++
>> mm/rmap.c | 22 ++++++++++++++++++++--
>> 4 files changed, 55 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index e03034683156..a865bd8c46a3 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1869,6 +1869,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>> }
>>
>> +#define clear_flush_young_ptes clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> +{
>> + if (likely(nr == 1))
>> + return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
>> +}
>
> Hmm again this is a weird way of exposing a contepte-specific function, you
> really need to rework that as discussed in patch 1/3.
>
> It seems to me we can share code to avoid this.
Sorry, I don't think so. This is the current way of exposing the contpte
helpers on arm64. Please take a look at set_ptes(), clear_full_ptes(),
wrprotect_ptes() and so on (in this file).
>> #define wrprotect_ptes wrprotect_ptes
>> static __always_inline void wrprotect_ptes(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep, unsigned int nr)
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d1094c2d5fb6..be594b274729 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>> range->owner = owner;
>> }
>>
>> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
>> +#define ptep_clear_flush_young_notify(__vma, __address, __ptep, __nr) \
>> ({ \
>> int __young; \
>> struct vm_area_struct *___vma = __vma; \
>> unsigned long ___address = __address; \
>> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
>> + unsigned int ___nr = __nr; \
>> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
>> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
>> ___address, \
>> ___address + \
>> - PAGE_SIZE); \
>> + ___nr * PAGE_SIZE); \
>> __young; \
>> })
>
> An aside, but I wonder why this needs to be a (pretty disgusting) macro?
Um, I can send a follow-up to clean up all these related macros.
>> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>>
>> #define mmu_notifier_range_update_to_read_only(r) false
>>
>> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
>> +#define ptep_clear_flush_young_notify clear_flush_young_ptes
>> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
>> #define ptep_clear_young_notify ptep_test_and_clear_young
>> #define pmdp_clear_young_notify pmdp_test_and_clear_young
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b13b6f42be3c..c7d0fd228cb7 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -947,6 +947,25 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> }
>> #endif
>>
>> +#ifndef clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> +{
>> + int young = 0;
>> +
>> + for (;;) {
>> + young |= ptep_clear_flush_young(vma, addr, ptep);
>> + if (--nr == 0)
>> + break;
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + }
>> +
>> + return young;
>> +}
>> +#endif
>> +
>> /*
>> * On some architectures hardware does not set page access bit when accessing
>> * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index d6799afe1114..ec232165c47d 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -827,9 +827,11 @@ static bool folio_referenced_one(struct folio *folio,
>> struct folio_referenced_arg *pra = arg;
>> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>> int ptes = 0, referenced = 0;
>> + unsigned int nr;
>>
>> while (page_vma_mapped_walk(&pvmw)) {
>> address = pvmw.address;
>> + nr = 1;
>>
>> if (vma->vm_flags & VM_LOCKED) {
>> ptes++;
>> @@ -874,9 +876,21 @@ static bool folio_referenced_one(struct folio *folio,
>> if (lru_gen_look_around(&pvmw))
>> referenced++;
>> } else if (pvmw.pte) {
>> + if (folio_test_large(folio)) {
>> + unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
>> + unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
>> + pte_t pteval = ptep_get(pvmw.pte);
>> +
>> + nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
>
> I do wish we could put this fiddly logic into a helper for each place in
> which we do similar kind 'end of the PTE table, maximum number we could
> have' logic.
Um, the logic is already simple, and I don’t think adding a new helper
would improve readability. If some code can reuse this logic, we can
factor it out into a helper at that point.
>> + }
>
> NIT but we're running into pretty long lines here.
OK. Will fix this.
>> +
>> + ptes += nr;
>> if (ptep_clear_flush_young_notify(vma, address,
>> - pvmw.pte))
>> + pvmw.pte, nr))
>> referenced++;
>
> I find this referenced logic weird, it seems like it should be a boolean,
> but this is outside the scope of your patch here :)
Right. Thanks for reviewing.
* Re: [PATCH v2 3/3] mm: rmap: support batched unmapping for file large folios
2025-12-15 12:38 ` Lorenzo Stoakes
@ 2025-12-16 5:48 ` Baolin Wang
2025-12-16 6:13 ` Barry Song
2025-12-16 10:53 ` Lorenzo Stoakes
0 siblings, 2 replies; 35+ messages in thread
From: Baolin Wang @ 2025-12-16 5:48 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, david, catalin.marinas, will, ryan.roberts, Liam.Howlett,
vbabka, rppt, surenb, mhocko, riel, harry.yoo, jannh, willy,
baohua, linux-mm, linux-arm-kernel, linux-kernel
On 2025/12/15 20:38, Lorenzo Stoakes wrote:
> On Thu, Dec 11, 2025 at 04:16:56PM +0800, Baolin Wang wrote:
>> Similar to folio_referenced_one(), we can apply batched unmapping for file
>> large folios to optimize the performance of file folios reclamation.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> 75% performance improvement on my Arm64 32-core server.
>
> Again, you must test on non-arm64 architectures and report the numbers for this
> also.
Yes, I've tested on the x86 machine, and will add the data in the commit
message.
>> W/o patch:
>> real 0m1.018s
>> user 0m0.000s
>> sys 0m1.018s
>>
>> W/ patch:
>> real 0m0.249s
>> user 0m0.000s
>> sys 0m0.249s
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> mm/rmap.c | 7 ++++---
>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index ec232165c47d..4c9d5777c8da 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1855,9 +1855,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>> end_addr = pmd_addr_end(addr, vma->vm_end);
>> max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>
>> - /* We only support lazyfree batching for now ... */
>> - if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>> + /* We only support lazyfree or file folios batching for now ... */
>> + if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>
> Why is it now ok to support file-backed batched unmapping when it wasn't in
> Barry's series (see [0])? You don't seem to be justifying this?
Barry's series [0] was aimed only at optimizing lazyfree anonymous large
folios; it did not go on to optimize swap-backed anonymous large folios or
file-backed large folios at that point.
Subsequently, Barry sent out a new patch (see [1]) to optimize anonymous
large folios. As for file-backed large folios, the batched unmapping
support is relatively simple, since we only need to clear the PTE
entries.
> [0]:https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
[1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/
>> return 1;
>> +
>> if (pte_unused(pte))
>> return 1;
>>
>> @@ -2223,7 +2224,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> *
>> * See Documentation/mm/mmu_notifier.rst
>> */
>> - dec_mm_counter(mm, mm_counter_file(folio));
>> + add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>
> Was this just a bug before?
Nope. Before this patch, we never supported batched unmapping for
file-backed large folios, so 'nr_pages' was always 1. After this
patch, we should subtract the number of pages in the batch of the
file-backed large folio.
* Re: [PATCH v2 3/3] mm: rmap: support batched unmapping for file large folios
2025-12-16 5:48 ` Baolin Wang
@ 2025-12-16 6:13 ` Barry Song
2025-12-16 6:22 ` Baolin Wang
2025-12-16 10:53 ` Lorenzo Stoakes
1 sibling, 1 reply; 35+ messages in thread
From: Barry Song @ 2025-12-16 6:13 UTC (permalink / raw)
To: Baolin Wang
Cc: Lorenzo Stoakes, akpm, david, catalin.marinas, will, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, linux-mm, linux-arm-kernel, linux-kernel
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index ec232165c47d..4c9d5777c8da 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1855,9 +1855,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> >> end_addr = pmd_addr_end(addr, vma->vm_end);
> >> max_nr = (end_addr - addr) >> PAGE_SHIFT;
> >>
> >> - /* We only support lazyfree batching for now ... */
> >> - if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> >> + /* We only support lazyfree or file folios batching for now ... */
> >> + if (folio_test_anon(folio) && folio_test_swapbacked(folio))
> >
> > Why is it now ok to support file-backed batched unmapping when it wasn't in
> > Barry's series (see [0])? You don't seem to be justifying this?
>
> Barry's series[0] is merely aimed at optimizing lazyfree anonymous large
> folios and does not continue to optimize anonymous large folios or
> file-backed large folios at that point.
Yep. At that time, I didn’t have an Android machine with a filesystem
that supported large folios, so I focused on lazyfree. But I
agree that lazyfree anon folios and file folios are quite
similar.
>
> Subsequently, Barry sent out a new patch (see [1]) to optimize anonymous
> large folios. As for file-backed large folios, the batched unmapping
> support is relatively simple, since we only need to clear the PTE
> entries for file-backed large folios.
Yep. It is actually quite straightforward to go from lazyfree
anon folios to file folios. Swap-backed anon folios are much
more tricky, though.
>
> > [0]:https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>
> [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/
Thanks
Barry
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 3/3] mm: rmap: support batched unmapping for file large folios
2025-12-16 6:13 ` Barry Song
@ 2025-12-16 6:22 ` Baolin Wang
2025-12-16 10:54 ` Lorenzo Stoakes
0 siblings, 1 reply; 35+ messages in thread
From: Baolin Wang @ 2025-12-16 6:22 UTC (permalink / raw)
To: Barry Song
Cc: Lorenzo Stoakes, akpm, david, catalin.marinas, will, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, linux-mm, linux-arm-kernel, linux-kernel
On 2025/12/16 14:13, Barry Song wrote:
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index ec232165c47d..4c9d5777c8da 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1855,9 +1855,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>> end_addr = pmd_addr_end(addr, vma->vm_end);
>>>> max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>
>>>> - /* We only support lazyfree batching for now ... */
>>>> - if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>> + /* We only support lazyfree or file folios batching for now ... */
>>>> + if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>
>>> Why is it now ok to support file-backed batched unmapping when it wasn't in
>>> Barry's series (see [0])? You don't seem to be justifying this?
>>
>> Barry's series[0] is merely aimed at optimizing lazyfree anonymous large
>> folios and does not continue to optimize anonymous large folios or
>> file-backed large folios at that point.
>
> Yep. At that time, I didn’t have an Android machine with a filesystem
> that supported large folios, so I focused on lazyfree. But I
> agree that lazyfree anon folios and file folios are quite
> similar.
>
>>
>> Subsequently, Barry sent out a new patch (see [1]) to optimize anonymous
>> large folios. As for file-backed large folios, the batched unmapping
>> support is relatively simple, since we only need to clear the PTE
>> entries for file-backed large folios.
>
> Yep. It is actually quite straightforward to go from lazyfree
> anon folios to file folios. Swap-backed anon folios are much
> more tricky, though.
Agree. Thanks Barry for reviewing and confirming.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 3/3] mm: rmap: support batched unmapping for file large folios
2025-12-16 5:48 ` Baolin Wang
2025-12-16 6:13 ` Barry Song
@ 2025-12-16 10:53 ` Lorenzo Stoakes
1 sibling, 0 replies; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-12-16 10:53 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, david, catalin.marinas, will, ryan.roberts, Liam.Howlett,
vbabka, rppt, surenb, mhocko, riel, harry.yoo, jannh, willy,
baohua, linux-mm, linux-arm-kernel, linux-kernel
On Tue, Dec 16, 2025 at 01:48:52PM +0800, Baolin Wang wrote:
>
>
> On 2025/12/15 20:38, Lorenzo Stoakes wrote:
> > On Thu, Dec 11, 2025 at 04:16:56PM +0800, Baolin Wang wrote:
> > > Similar to folio_referenced_one(), we can apply batched unmapping for file
> > > large folios to optimize the performance of file folios reclamation.
> > >
> > > Performance testing:
> > > Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> > > reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> > > 75% performance improvement on my Arm64 32-core server.
> >
> > Again, you must test on non-arm64 architectures and report the numbers for this
> > also.
>
> Yes, I've tested on the x86 machine, and will add the data in the commit
> message.
>
> > > W/o patch:
> > > real 0m1.018s
> > > user 0m0.000s
> > > sys 0m1.018s
> > >
> > > W/ patch:
> > > real 0m0.249s
> > > user 0m0.000s
> > > sys 0m0.249s
> > >
> > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > ---
> > > mm/rmap.c | 7 ++++---
> > > 1 file changed, 4 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index ec232165c47d..4c9d5777c8da 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -1855,9 +1855,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> > > end_addr = pmd_addr_end(addr, vma->vm_end);
> > > max_nr = (end_addr - addr) >> PAGE_SHIFT;
> > >
> > > - /* We only support lazyfree batching for now ... */
> > > - if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> > > + /* We only support lazyfree or file folios batching for now ... */
> > > + if (folio_test_anon(folio) && folio_test_swapbacked(folio))
> >
> > Why is it now ok to support file-backed batched unmapping when it wasn't in
> > Barry's series (see [0])? You don't seem to be justifying this?
>
> Barry's series[0] is merely aimed at optimizing lazyfree anonymous large
> folios and does not continue to optimize anonymous large folios or
> file-backed large folios at that point.
>
> Subsequently, Barry sent out a new patch (see [1]) to optimize anonymous
> large folios. As for file-backed large folios, the batched unmapping support
> is relatively simple, since we only need to clear the PTE entries for
> file-backed large folios.
Yeah, but he sent an entire patch changing a bunch of logic to accommodate this,
you're just changing the conditional and not really justifying it in the commit
message?
It really needs an 'it is safe to allow this for file-backed because blah blah
blah'.
Is this relying on the prior commits you've added? If so you should say so.
If it was as easy as just changing the conditional then it begs the question as
to why Barry didn't do that in the first place :)
>
> > [0]:https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>
> [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/
>
> > > return 1;
> > > +
> > > if (pte_unused(pte))
> > > return 1;
> > >
> > > @@ -2223,7 +2224,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > > *
> > > * See Documentation/mm/mmu_notifier.rst
> > > */
> > > - dec_mm_counter(mm, mm_counter_file(folio));
> > > + add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
> >
> > Was this just a bug before?
>
> Nope. Before this patch, we never supported batched unmapping for
> file-backed large folios, so the 'nr_pages' was always 1. After this patch,
> we should use the number of pages in this file-backed large folio.
Right ok :)
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 3/3] mm: rmap: support batched unmapping for file large folios
2025-12-16 6:22 ` Baolin Wang
@ 2025-12-16 10:54 ` Lorenzo Stoakes
2025-12-17 3:11 ` Baolin Wang
0 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-12-16 10:54 UTC (permalink / raw)
To: Baolin Wang
Cc: Barry Song, akpm, david, catalin.marinas, will, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, linux-mm, linux-arm-kernel, linux-kernel
On Tue, Dec 16, 2025 at 02:22:11PM +0800, Baolin Wang wrote:
>
>
> On 2025/12/16 14:13, Barry Song wrote:
> > > > >
> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > index ec232165c47d..4c9d5777c8da 100644
> > > > > --- a/mm/rmap.c
> > > > > +++ b/mm/rmap.c
> > > > > @@ -1855,9 +1855,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> > > > > end_addr = pmd_addr_end(addr, vma->vm_end);
> > > > > max_nr = (end_addr - addr) >> PAGE_SHIFT;
> > > > >
> > > > > - /* We only support lazyfree batching for now ... */
> > > > > - if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> > > > > + /* We only support lazyfree or file folios batching for now ... */
> > > > > + if (folio_test_anon(folio) && folio_test_swapbacked(folio))
> > > >
> > > > Why is it now ok to support file-backed batched unmapping when it wasn't in
> > > > Barry's series (see [0])? You don't seem to be justifying this?
> > >
> > > Barry's series[0] is merely aimed at optimizing lazyfree anonymous large
> > > folios and does not continue to optimize anonymous large folios or
> > > file-backed large folios at that point.
> >
> > Yep. At that time, I didn’t have an Android machine with a filesystem
> > that supported large folios, so I focused on lazyfree. But I
> > agree that lazyfree anon folios and file folios are quite
> > similar.
> >
> > >
> > > Subsequently, Barry sent out a new patch (see [1]) to optimize anonymous
> > > large folios. As for file-backed large folios, the batched unmapping
> > > support is relatively simple, since we only need to clear the PTE
> > > entries for file-backed large folios.
> >
> > Yep. It is actually quite straightforward to go from lazyfree
> > anon folios to file folios. Swap-backed anon folios are much
> > more tricky, though.
>
> Agree. Thanks Barry for reviewing and confirming.
OK that makes me less concerned, but you do need to put some more justification
in the commit message.
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag for large folios
2025-12-16 3:32 ` Baolin Wang
@ 2025-12-16 11:11 ` Lorenzo Stoakes
2025-12-17 3:53 ` Baolin Wang
0 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-12-16 11:11 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, david, catalin.marinas, will, ryan.roberts, Liam.Howlett,
vbabka, rppt, surenb, mhocko, riel, harry.yoo, jannh, willy,
baohua, linux-mm, linux-arm-kernel, linux-kernel
On Tue, Dec 16, 2025 at 11:32:58AM +0800, Baolin Wang wrote:
>
>
> On 2025/12/15 19:36, Lorenzo Stoakes wrote:
> > On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
> > > Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
> > > only clear the young flag and flush TLBs for PTEs within the contiguous range.
> > > To support batch PTE operations for other sized large folios in the following
> > > patches, adding a new parameter to specify the number of PTEs.
> > >
> > > While we are at it, rename the functions to maintain consistency with other
> > > contpte_*() functions.
> > >
> > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > ---
> > > arch/arm64/include/asm/pgtable.h | 12 ++++-----
> > > arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
> > > 2 files changed, 37 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > index 0944e296dd4a..e03034683156 100644
> > > --- a/arch/arm64/include/asm/pgtable.h
> > > +++ b/arch/arm64/include/asm/pgtable.h
> > > @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> > > extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> > > unsigned long addr, pte_t *ptep,
> > > unsigned int nr, int full);
> > > -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > - unsigned long addr, pte_t *ptep);
> > > -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> > > - unsigned long addr, pte_t *ptep);
> > > +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> > > + unsigned long addr, pte_t *ptep, unsigned int nr);
> > > +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> > > + unsigned long addr, pte_t *ptep, unsigned int nr);
> >
> > In core mm at least, as a convention, we strip 'extern' from header declarations
> > as we go as they're not necessary.
>
> Sure. I can remove the 'extern'.
Thanks.
>
> > I wonder about this naming convention also. The contpte_xxx_() prefix
> > obviously implies we are operating upon PTEs in the contiguous range, now
> > we are doing something... different and 'nr' is a bit vague.
> >
> > Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
>
> Yes, 'nr' here means nr_pages, which follows the convention of the
> parameters in contpte_xxx_() functions.
Except you're violating the convention already by doing non-contpte stuff in
contpte functions?
The confusion is between nr of contpte ranges and nr of pages, so I think we
should be explicit.
>
> > now we're not saying they're necessarily contiguous in the sense of contpte...
> >
> > I wonder whether we'd be better off introducing a new function that
> > explicitly has 'batch' or 'batched' in the name, and have the usual
> > contpte_***() stuff and callers that need batching using a new underlying helper?
>
> OK. I get your point. However, this is outside the scope of my patchset.
Well, no, this is review feedback that needs to be addressed, I really don't
like this patch as-is.
So for this to be upstreamable you really need to separate this out.
>
> Perhaps we can clean up all the contpte_xxx_() functions if everyone agrees
> that this is confusing.
I mean, unless Ryan explicitly makes a case for this being fine, I'm inclined to
reject the patch unless we do this.
The contpte logic is already very confusing, and this is only adding to it.
>
> > Obviously I defer to Ryan on all this, just my thoughts...
> >
> > > extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> > > pte_t *ptep, unsigned int nr);
> > > extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> > > @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > if (likely(!pte_valid_cont(orig_pte)))
> > > return __ptep_test_and_clear_young(vma, addr, ptep);
> > >
> > > - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> > > + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
> > > }
> > >
> > > #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> > > @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> > > if (likely(!pte_valid_cont(orig_pte)))
> > > return __ptep_clear_flush_young(vma, addr, ptep);
> > >
> > > - return contpte_ptep_clear_flush_young(vma, addr, ptep);
> > > + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
> > > }
> > >
> > > #define wrprotect_ptes wrprotect_ptes
> > > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > > index c0557945939c..19b122441be3 100644
> > > --- a/arch/arm64/mm/contpte.c
> > > +++ b/arch/arm64/mm/contpte.c
> > > @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> > > }
> > > EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
> > >
> > > -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > - unsigned long addr, pte_t *ptep)
> > > +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> > > + unsigned long addr, pte_t *ptep,
> > > + unsigned int nr)
> > > {
> > > /*
> > > * ptep_clear_flush_young() technically requires us to clear the access
> > > @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > * having to unfold.
> > > */
> >
> > Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
> >
> > "And since we only create a contig range when the range is covered by a single
> > folio, we can get away with clearing young for the whole contig range here, so
> > we avoid having to unfold."
>
> I think the original comments are still clear:
> "
> /*
> * ptep_clear_flush_young() technically requires us to clear the access
> * flag for a _single_ pte. However, the core-mm code actually tracks
> * access/dirty per folio, not per page. And since we only create a
> * contig range when the range is covered by a single folio, we can get
> * away with clearing young for the whole contig range here, so we avoid
> * having to unfold.
> */
> "
Except nr means you can span more than a single contig pte range? So this
comment is no longer applicable as-is?
You've changed the function, the comment needs to change too.
>
> > However now you're allowing for large folios that are not contpte?
> >
> > Overall feeling pretty iffy about implementing this this way.
>
> Please see the above comments, which explain why we should do that.
You haven't explained anything? Maybe I missed it?
Can you explain clearly what nr might actually be in relation to contpte's? I'm
confused even as to the utility here.
Is it batched ranges of contptes? But then why are you trying to see if end is a
contpte + expand the range if so?
>
> > > + unsigned long start = addr;
> > > + unsigned long end = start + nr * PAGE_SIZE;
> >
> > So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
> > table?
>
> Caller has made sure that (see folio_referenced_one()).
You should comment this somehow, I hate this nested assumptions stuff 'function
X which calls function Y which calls function Z makes sure that parameter A
fulfils requirements B but we don't mention it anywhere'.
>
> > > int young = 0;
> > > int i;
> > >
> > > - ptep = contpte_align_down(ptep);
> > > - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> > > + if (pte_cont(__ptep_get(ptep + nr - 1)))
> > > + end = ALIGN(end, CONT_PTE_SIZE);
> >
> > OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
> > with other PTE entries present there not having PTE_CONT set (Ryan - do
> > correct me if I'm wrong!)
> >
> > I guess then no worry about running off the end of the PTE table here.
> >
> > I wonder about using pte_cont_addr_end() here, but I guess you are not
> > intending to limit to a range here but rather capture the entire contpte
> > range of the end.
>
> Right.
>
> > I don't love that now a function prefixed with contpte_... can have a
> > condition where part of the input range is _not_ a contpte entry, or is
> > specified partially...
>
> Like I said above, this is outside the scope of my patchset. If Ryan also
> thinks we need to do some cleanups for all the contpte_xxx() functions in
> the contpte.c file, we can send a follow-up patchset to rename all the
> contpte_xxx() functions to make it clear.
It's really not outside of the scope of the patch set, this is review feedback
you have to address :) trying not to be a grumpy kernel dev here :>)
In general in the kernel we don't accept confusing patches then defer making
them not-confusing until later.
We review until the series is at the correct standard to be accepted before we
merge.
As always I'm happy to be convinced that I'm mistaken or something's not
necessary, however!
>
> > > - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> > > - young |= __ptep_test_and_clear_young(vma, addr, ptep);
> > > + if (pte_cont(__ptep_get(ptep))) {
> > > + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
> > > + ptep = contpte_align_down(ptep);
> > > + }
> > > +
> > > + nr = (end - start) / PAGE_SIZE;
> >
> > Hm don't love this reuse of input param.
>
> OK. Will use another local variable instead.
Thanks.
>
> >
> > > + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
> > > + young |= __ptep_test_and_clear_young(vma, start, ptep);
> >
> > Again, overall find it a bit iffy that we are essentially overloading the
> > code with non-contpte behaviour.
>
> Let me clarify further: this is the handling logic of the contpte_xxx()
> functions in the contpte.c file. For example, the function
> contpte_set_ptes() has a parameter 'nr', which may be greater than
> CONT_PTES, meaning that it can handle multiple CONT_PTES range sizes of a
> large folio.
>
> It might be a bit confusing, and I understand this is why you want to change
> the 'contpte_' prefix to the 'batch_' prefix. Of course, we can hope Ryan
> can provide some input on this.
>
> Thanks for reviewing.
OK so we have some reeeeal confusion here. And this speaks to the patch needing
work.
This seems to answer what I asked above - basically - are we doing _batches_ of
_contpte_ entries, or are we just accepting arbitrary large folios here and
batching up operations on them, including non-contpte entries.
Presumably from the above you're saying - this is dealing with batches of
contpte entries.
In which case this has to be made _super_ clear. Not just adding an arbitrary
'nr' parameter and letting people guess.
The contpte code is _already_ confusing and somewhat surprising if you're not
aware of it, we need to be going in the other direction.
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 3/3] mm: rmap: support batched unmapping for file large folios
2025-12-16 10:54 ` Lorenzo Stoakes
@ 2025-12-17 3:11 ` Baolin Wang
2025-12-17 14:28 ` Lorenzo Stoakes
0 siblings, 1 reply; 35+ messages in thread
From: Baolin Wang @ 2025-12-17 3:11 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Barry Song, akpm, david, catalin.marinas, will, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, linux-mm, linux-arm-kernel, linux-kernel
On 2025/12/16 18:54, Lorenzo Stoakes wrote:
> On Tue, Dec 16, 2025 at 02:22:11PM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/12/16 14:13, Barry Song wrote:
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index ec232165c47d..4c9d5777c8da 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1855,9 +1855,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>>> end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>> max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>
>>>>>> - /* We only support lazyfree batching for now ... */
>>>>>> - if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>> + /* We only support lazyfree or file folios batching for now ... */
>>>>>> + if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>
>>>>> Why is it now ok to support file-backed batched unmapping when it wasn't in
>>>>> Barry's series (see [0])? You don't seem to be justifying this?
>>>>
>>>> Barry's series[0] is merely aimed at optimizing lazyfree anonymous large
>>>> folios and does not continue to optimize anonymous large folios or
>>>> file-backed large folios at that point.
>>>
>>> Yep. At that time, I didn’t have an Android machine with a filesystem
>>> that supported large folios, so I focused on lazyfree. But I
>>> agree that lazyfree anon folios and file folios are quite
>>> similar.
>>>
>>>>
>>>> Subsequently, Barry sent out a new patch (see [1]) to optimize anonymous
>>>> large folios. As for file-backed large folios, the batched unmapping
>>>> support is relatively simple, since we only need to clear the PTE
>>>> entries for file-backed large folios.
>>>
>>> Yep. It is actually quite straightforward to go from lazyfree
>>> anon folios to file folios. Swap-backed anon folios are much
>>> more tricky, though.
>>
>> Agree. Thanks Barry for reviewing and confirming.
>
> OK that makes me less concerned, but you do need to put some more justification
> in the commit message.
Sure. Will do.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag for large folios
2025-12-16 11:11 ` Lorenzo Stoakes
@ 2025-12-17 3:53 ` Baolin Wang
2025-12-17 14:50 ` Lorenzo Stoakes
0 siblings, 1 reply; 35+ messages in thread
From: Baolin Wang @ 2025-12-17 3:53 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, david, catalin.marinas, will, ryan.roberts, Liam.Howlett,
vbabka, rppt, surenb, mhocko, riel, harry.yoo, jannh, willy,
baohua, linux-mm, linux-arm-kernel, linux-kernel
On 2025/12/16 19:11, Lorenzo Stoakes wrote:
> On Tue, Dec 16, 2025 at 11:32:58AM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/12/15 19:36, Lorenzo Stoakes wrote:
>>> On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
>>>> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
>>>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>>>> To support batch PTE operations for other sized large folios in the following
>>>> patches, adding a new parameter to specify the number of PTEs.
>>>>
>>>> While we are at it, rename the functions to maintain consistency with other
>>>> contpte_*() functions.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>>>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>>>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 0944e296dd4a..e03034683156 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>>>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>> unsigned long addr, pte_t *ptep,
>>>> unsigned int nr, int full);
>>>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep);
>>>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep);
>>>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>
>>> In core mm at least, as a convention, we strip 'extern' from header declarations
>>> as we go as they're not necessary.
>>
>> Sure. I can remove the 'extern'.
>
> Thanks.
>
>>
>>> I wonder about this naming convention also. The contpte_xxx_() prefix
>>> obviously implies we are operating upon PTEs in the contiguous range, now
>>> we are doing something... different and 'nr' is a bit vague.
>>>
>>> Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
>>
>> Yes, 'nr' here means nr_pages, which follows the convention of the
>> parameters in contpte_xxx_() functions.
>
> Except you're violating the convention already by doing non-contpte stuff in
> contpte functions?
Nope. Sorry for my poor English, which might have caused some confusion.
I followed the contpte convention; let me try to explain it again below.
> The confusion is between nr of contpte ranges and nr of pages, so I think we
> should be explicit.
>
>>
>>> now we're not saying they're necessarily contiguous in the sense of contpte...
>>>
>>> I wonder whether we'd be better off introducing a new function that
>>> explicitly has 'batch' or 'batched' in the name, and have the usual
>>> contpte_***() stuff and callers that need batching using a new underlying helper?
>>
>> OK. I get your point. However, this is outside the scope of my patchset.
>
> Well, no, this is review feedback that needs to be addressed, I really don't
> like this patch as-is.
>
> So for this to be upstreamable you really need to separate this out.
>
>>
>> Perhaps we can clean up all the contpte_xxx_() functions if everyone agrees
>> that this is confusing.
>
> I mean, unless Ryan explicitly makes a case for this being fine, I'm inclined to
> reject the patch unless we do this.
>
> The contpte logic is already very confusing, and this is only adding to it.
>
>>
>>> Obviously I defer to Ryan on all this, just my thoughts...
>>>
>>>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>> pte_t *ptep, unsigned int nr);
>>>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>>>
>>>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>>>> }
>>>>
>>>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>> return __ptep_clear_flush_young(vma, addr, ptep);
>>>>
>>>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>>> }
>>>>
>>>> #define wrprotect_ptes wrprotect_ptes
>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>> index c0557945939c..19b122441be3 100644
>>>> --- a/arch/arm64/mm/contpte.c
>>>> +++ b/arch/arm64/mm/contpte.c
>>>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>> }
>>>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>>>
>>>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep)
>>>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep,
>>>> + unsigned int nr)
>>>> {
>>>> /*
>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> * having to unfold.
>>>> */
>>>
>>> Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
>>>
>>> "And since we only create a contig range when the range is covered by a single
>>> folio, we can get away with clearing young for the whole contig range here, so
>>> we avoid having to unfold."
>>
>> I think the original comments are still clear:
>> "
>> /*
>> * ptep_clear_flush_young() technically requires us to clear the access
>> * flag for a _single_ pte. However, the core-mm code actually tracks
>> * access/dirty per folio, not per page. And since we only create a
>> * contig range when the range is covered by a single folio, we can get
>> * away with clearing young for the whole contig range here, so we avoid
>> * having to unfold.
>> */
>> "
>
> Except nr means you can span more than a single contig pte range? So this
> comment is no longer applicable as-is?
Here 'nr' means the number of consecutive (present) PTEs that map
consecutive pages of the same large folio within a single VMA and a single
page table.
I consider this a prerequisite for the callers. If you think it's necessary
to spell it out, I can add it to the comments.
> You've changed the function, the comment needs to change too.
>
>>
>>> However now you're allowing for large folios that are not contpte?
>>>
>>> Overall feeling pretty iffy about implementing this this way.
>>
>> Please see the above comments, which explain why we should do that.
>
> You haven't explained anything? Maybe I missed it?
>
> Can you explain clearly what nr might actually be in relation to contpte's? I'm
> confused even as to the utility here.
>
> Is it batched ranges of contptes? But then why are you trying to see if end is a
> contpte + expand the range if so?
See below.
>>>> + unsigned long start = addr;
>>>> + unsigned long end = start + nr * PAGE_SIZE;
>>>
>>> So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
>>> table?
>>
>> Caller has made sure that (see folio_referenced_one()).
>
> You should comment this somehow, I hate this nested assumptions stuff 'function
> X which calls function Y which calls function Z makes sure that parameter A
> fulfils requirements B but we don't mention it anywhere'.
Sure.
>>>> int young = 0;
>>>> int i;
>>>>
>>>> - ptep = contpte_align_down(ptep);
>>>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>>
>>> OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
>>> with other PTE entries present there not having PTE_CONT set (Ryan - do
>>> correct me if I'm wrong!)
>>>
>>> I guess then no worry about running off the end of the PTE table here.
>>>
>>> I wonder about using pte_cont_addr_end() here, but I guess you are not
>>> intending to limit to a range here but rather capture the entire contpte
>>> range of the end.
>>
>> Right.
>>
>>> I don't love that now a function prefixed with contpte_... can have a
>>> condition where part of the input range is _not_ a contpte entry, or is
>>> specified partially...
>>
>> Like I said above, this is outside the scope of my patchset. If Ryan also
>> thinks we need to do some cleanups for all the contpte_xxx() functions in
>> the contpte.c file, we can send a follow-up patchset to rename all the
>> contpte_xxx() functions to make it clear.
>
> It's really not outside of the scope of the patch set, this is review feedback
> you have to address :) trying not to be a grumpy kernel dev here :>)
>
> In general in the kernel we don't accept confusing patches then defer making
> them not-confusing until later.
>
> We review until the series is at the correct standard to be accepted before we
> merge.
>
> As always I'm happy to be convinced that I'm mistaken or something's not
> necesary, however!
Yes, let me try to explain it again. See below.
>>>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>>> + if (pte_cont(__ptep_get(ptep))) {
>>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>>> + ptep = contpte_align_down(ptep);
>>>> + }
>>>> +
>>>> + nr = (end - start) / PAGE_SIZE;
>>>
>>> Hm don't love this reuse of input param.
>>
>> OK. Will use another local variable instead.
>
> Thanks.
>
>>
>>>
>>>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>>>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>>
>>> Again, overall find it a bit iffy that we are essentially overloading the
>>> code with non-contpte behaviour.
>>
>> Let me clarify further: this is the handling logic of the contpte_xxx()
>> functions in the contpte.c file. For example, the function
>> contpte_set_ptes() has a parameter 'nr', which may be greater than
>> CONT_PTES, meaning that it can handle multiple CONT_PTES range sizes of a
>> large folio.
>>
>> It might be a bit confusing, and I understand this is why you want to change
>> the 'contpte_' prefix to the 'batch_' prefix. Of course, we can hope Ryan
>> can provide some input on this.
>>
>> Thanks for reviewing.
>
> OK so we have some reeeeal confusion here. And this speaks to the patch needing
> work.
>
> This seems to answer what I asked above - basically - are we doing _batches_ of
> _contpte_ entries, or are we just accepting arbitrary large folios here and
> batching up operations on them, including non-contpte entries.
>
> Presumably from the above you're saying - this is dealing with batches of
> contpte entries.
>
> In which case this has to be made _super_ clear. Not just adding an arbitrary
> 'nr' parameter and letting people guess.
>
> The contpte code is _already_ confusing and somewhat surprising if you're not
> aware of it, we need to be going in the other direction.
Sure. Sorry for causing confusion. I'll try to explain it again to make it
clear.
First, it is necessary to explain how the contpte implementation on Arm64 is
exposed, using set_ptes() as an example.
Arm64 defines an arch-specific implementation of set_ptes():
"
static __always_inline void set_ptes(struct mm_struct *mm, unsigned long addr,
                                     pte_t *ptep, pte_t pte, unsigned int nr)
{
        pte = pte_mknoncont(pte);

        if (likely(nr == 1)) {
                contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
                __set_ptes(mm, addr, ptep, pte, 1);
                contpte_try_fold(mm, addr, ptep, pte);
        } else {
                contpte_set_ptes(mm, addr, ptep, pte, nr);
        }
}
"
Here, 'nr' refers to the number of consecutive (present) PTEs that map
consecutive pages of the same large folio within a single VMA and a single
page table. Based on 'nr', set_ptes() determines whether the PTE_CONT
attribute should be set.
Second, the underlying implementation of contpte_set_ptes() also
calculates 'end' based on 'nr' and checks whether the PTE_CONT attribute
should be set. For example:
For a large folio with order = 3 (nr = 8), the PTE_CONT attribute will not
be set, since 'next' is not aligned to CONT_PTE_MASK.
For a large folio with order = 4 (nr = 16), the range aligns perfectly,
allowing the PTE_CONT attribute to be set for all the PTE entries of the
large folio.
For a large folio with order = 5 (nr = 32), this involves two CONT_PTE_SIZE
ranges (16 PTEs in each), and the PTE_CONT attribute is set separately for
each range.
"
void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
                      pte_t *ptep, pte_t pte, unsigned int nr)
{
        .......
        end = addr + (nr << PAGE_SHIFT);
        pfn = pte_pfn(pte);
        prot = pte_pgprot(pte);

        do {
                next = pte_cont_addr_end(addr, end);
                nr = (next - addr) >> PAGE_SHIFT;
                pte = pfn_pte(pfn, prot);

                if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
                        pte = pte_mkcont(pte);
                else
                        pte = pte_mknoncont(pte);

                __set_ptes(mm, addr, ptep, pte, nr);

                addr = next;
                ptep += nr;
                pfn += nr;
        } while (addr != end);
}
"
Third, in my patch I have followed this contpte logic by adding an 'nr'
parameter to contpte_ptep_test_and_clear_young() to batch-process
consecutive (present) PTEs. The semantics of 'nr' remain consistent with the
original contpte functions, and the existing contpte implementation logic is
not changed.
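To make the 'nr' semantics concrete, assuming 4K pages (so CONT_PTES == 16
and CONT_PTE_SIZE == 64K), the examples above work out as follows
(illustration only):
"
/*
 * nr = 8  (order-3 folio):  one sub-range of 8 PTEs; PTE_CONT is not set,
 *                           since the range does not cover a full
 *                           CONT_PTE_SIZE block.
 * nr = 16 (order-4 folio):  one sub-range of 16 PTEs; PTE_CONT is set when
 *                           addr and pfn are CONT_PTE aligned.
 * nr = 32 (order-5 folio):  two CONT_PTE_SIZE sub-ranges; PTE_CONT is set
 *                           for each sub-range separately.
 *
 * The new contpte_*_young_ptes() helpers take 'nr' with exactly the same
 * meaning: consecutive PTEs of one large folio within one page table.
 */
"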
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
2025-12-11 8:16 ` [PATCH v2 2/3] mm: rmap: support batched checks of the references " Baolin Wang
2025-12-15 12:22 ` Lorenzo Stoakes
@ 2025-12-17 6:23 ` Dev Jain
2025-12-17 6:44 ` Baolin Wang
2025-12-17 6:49 ` Dev Jain
2025-12-17 16:39 ` Ryan Roberts
3 siblings, 1 reply; 35+ messages in thread
From: Dev Jain @ 2025-12-17 6:23 UTC (permalink / raw)
To: Baolin Wang, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, linux-mm,
linux-arm-kernel, linux-kernel
On 11/12/25 1:46 pm, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> of the young flags and flushing TLB entries, thereby improving performance
> during large folio reclamation.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> from approximately 35% to around 5%.
>
> W/o patchset:
> real 0m1.518s
> user 0m0.000s
> sys 0m1.518s
>
> W/ patchset:
> real 0m1.018s
> user 0m0.000s
> sys 0m1.018s
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
> include/linux/mmu_notifier.h | 9 +++++----
> include/linux/pgtable.h | 19 +++++++++++++++++++
> mm/rmap.c | 22 ++++++++++++++++++++--
> 4 files changed, 55 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index e03034683156..a865bd8c46a3 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1869,6 +1869,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
> }
>
> +#define clear_flush_young_ptes clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> +{
> + if (likely(nr == 1))
> + return __ptep_clear_flush_young(vma, addr, ptep);
> +
> + return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> +}
> +
> #define wrprotect_ptes wrprotect_ptes
> static __always_inline void wrprotect_ptes(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep, unsigned int nr)
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d1094c2d5fb6..be594b274729 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
> range->owner = owner;
> }
>
> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
> +#define ptep_clear_flush_young_notify(__vma, __address, __ptep, __nr) \
> ({ \
> int __young; \
> struct vm_area_struct *___vma = __vma; \
> unsigned long ___address = __address; \
> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
> + unsigned int ___nr = __nr; \
> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
> ___address, \
> ___address + \
> - PAGE_SIZE); \
> + nr * PAGE_SIZE); \
> __young; \
> })
Do we have an existing bug here, in that mmu_notifier_clear_flush_young() should
have been called for CONT_PTES length if the folio was contpte mapped?
>
> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>
> #define mmu_notifier_range_update_to_read_only(r) false
>
> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
> +#define ptep_clear_flush_young_notify clear_flush_young_ptes
> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
> #define ptep_clear_young_notify ptep_test_and_clear_young
> #define pmdp_clear_young_notify pmdp_test_and_clear_young
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b13b6f42be3c..c7d0fd228cb7 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -947,6 +947,25 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> }
> #endif
>
> +#ifndef clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> +{
> + int young = 0;
> +
> + for (;;) {
> + young |= ptep_clear_flush_young(vma, addr, ptep);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +
> + return young;
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d6799afe1114..ec232165c47d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -827,9 +827,11 @@ static bool folio_referenced_one(struct folio *folio,
> struct folio_referenced_arg *pra = arg;
> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> int ptes = 0, referenced = 0;
> + unsigned int nr;
>
> while (page_vma_mapped_walk(&pvmw)) {
> address = pvmw.address;
> + nr = 1;
>
> if (vma->vm_flags & VM_LOCKED) {
> ptes++;
> @@ -874,9 +876,21 @@ static bool folio_referenced_one(struct folio *folio,
> if (lru_gen_look_around(&pvmw))
> referenced++;
> } else if (pvmw.pte) {
> + if (folio_test_large(folio)) {
> + unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
> + unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
> + pte_t pteval = ptep_get(pvmw.pte);
> +
> + nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
> + }
> +
> + ptes += nr;
> if (ptep_clear_flush_young_notify(vma, address,
> - pvmw.pte))
> + pvmw.pte, nr))
> referenced++;
> + /* Skip the batched PTEs */
> + pvmw.pte += nr - 1;
> + pvmw.address += (nr - 1) * PAGE_SIZE;
> } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> if (pmdp_clear_flush_young_notify(vma, address,
> pvmw.pmd))
> @@ -886,7 +900,11 @@ static bool folio_referenced_one(struct folio *folio,
> WARN_ON_ONCE(1);
> }
>
> - pra->mapcount--;
> + pra->mapcount -= nr;
> + if (ptes == pvmw.nr_pages) {
> + page_vma_mapped_walk_done(&pvmw);
> + break;
> + }
> }
>
> if (referenced)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
2025-12-17 6:23 ` Dev Jain
@ 2025-12-17 6:44 ` Baolin Wang
0 siblings, 0 replies; 35+ messages in thread
From: Baolin Wang @ 2025-12-17 6:44 UTC (permalink / raw)
To: Dev Jain, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, linux-mm,
linux-arm-kernel, linux-kernel
On 2025/12/17 14:23, Dev Jain wrote:
>
> On 11/12/25 1:46 pm, Baolin Wang wrote:
>> Currently, folio_referenced_one() always checks the young flag for each PTE
>> sequentially, which is inefficient for large folios. This inefficiency is
>> especially noticeable when reclaiming clean file-backed large folios, where
>> folio_referenced() is observed as a significant performance hotspot.
>>
>> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
>> an optimization to clear the young flags for PTEs within a contiguous range.
>> However, this is not sufficient. We can extend this to perform batched operations
>> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>>
>> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
>> of the young flags and flushing TLB entries, thereby improving performance
>> during large folio reclamation.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
>> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
>> from approximately 35% to around 5%.
>>
>> W/o patchset:
>> real 0m1.518s
>> user 0m0.000s
>> sys 0m1.518s
>>
>> W/ patchset:
>> real 0m1.018s
>> user 0m0.000s
>> sys 0m1.018s
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
>> include/linux/mmu_notifier.h | 9 +++++----
>> include/linux/pgtable.h | 19 +++++++++++++++++++
>> mm/rmap.c | 22 ++++++++++++++++++++--
>> 4 files changed, 55 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index e03034683156..a865bd8c46a3 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1869,6 +1869,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>> }
>>
>> +#define clear_flush_young_ptes clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> +{
>> + if (likely(nr == 1))
>> + return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
>> +}
>> +
>> #define wrprotect_ptes wrprotect_ptes
>> static __always_inline void wrprotect_ptes(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep, unsigned int nr)
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d1094c2d5fb6..be594b274729 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>> range->owner = owner;
>> }
>>
>> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
>> +#define ptep_clear_flush_young_notify(__vma, __address, __ptep, __nr) \
>> ({ \
>> int __young; \
>> struct vm_area_struct *___vma = __vma; \
>> unsigned long ___address = __address; \
>> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
>> + unsigned int ___nr = __nr; \
>> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
>> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
>> ___address, \
>> ___address + \
>> - PAGE_SIZE); \
>> + nr * PAGE_SIZE); \
>> __young; \
>> })
>
> Do we have an existing bug here, in that mmu_notifier_clear_flush_young() should
> have been called for CONT_PTES length if the folio was contpte mapped?
I wouldn't call it a bug, because folio_referenced_one() does iterate over
each PTE of the large folio; it is just inefficient.
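In other words, the change only batches the work that was already being done
per PTE. A simplified before/after sketch (not the literal macro):
"
/* Before: folio_referenced_one() effectively did one notifier call per PTE. */
for (i = 0; i < nr; i++)
        young |= ptep_clear_flush_young_notify(vma, addr + i * PAGE_SIZE,
                                               pte + i);   /* old 3-arg form */

/* After: a single call handles the whole [addr, addr + nr * PAGE_SIZE) range. */
young |= ptep_clear_flush_young_notify(vma, addr, pte, nr);
"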
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
2025-12-11 8:16 ` [PATCH v2 2/3] mm: rmap: support batched checks of the references " Baolin Wang
2025-12-15 12:22 ` Lorenzo Stoakes
2025-12-17 6:23 ` Dev Jain
@ 2025-12-17 6:49 ` Dev Jain
2025-12-17 7:09 ` Baolin Wang
2025-12-17 16:39 ` Ryan Roberts
3 siblings, 1 reply; 35+ messages in thread
From: Dev Jain @ 2025-12-17 6:49 UTC (permalink / raw)
To: Baolin Wang, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, linux-mm,
linux-arm-kernel, linux-kernel
On 11/12/25 1:46 pm, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> of the young flags and flushing TLB entries, thereby improving performance
> during large folio reclamation.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> from approximately 35% to around 5%.
>
> W/o patchset:
> real 0m1.518s
> user 0m0.000s
> sys 0m1.518s
>
> W/ patchset:
> real 0m1.018s
> user 0m0.000s
> sys 0m1.018s
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
> include/linux/mmu_notifier.h | 9 +++++----
> include/linux/pgtable.h | 19 +++++++++++++++++++
> mm/rmap.c | 22 ++++++++++++++++++++--
> 4 files changed, 55 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index e03034683156..a865bd8c46a3 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1869,6 +1869,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
> }
>
> +#define clear_flush_young_ptes clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> +{
> + if (likely(nr == 1))
> + return __ptep_clear_flush_young(vma, addr, ptep);
> +
> + return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> +}
> +
> #define wrprotect_ptes wrprotect_ptes
> static __always_inline void wrprotect_ptes(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep, unsigned int nr)
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d1094c2d5fb6..be594b274729 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
> range->owner = owner;
> }
>
> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
> +#define ptep_clear_flush_young_notify(__vma, __address, __ptep, __nr) \
> ({ \
> int __young; \
> struct vm_area_struct *___vma = __vma; \
> unsigned long ___address = __address; \
> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
> + unsigned int ___nr = __nr; \
> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
> ___address, \
> ___address + \
> - PAGE_SIZE); \
> + nr * PAGE_SIZE); \
> __young; \
> })
>
> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>
> #define mmu_notifier_range_update_to_read_only(r) false
>
> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
> +#define ptep_clear_flush_young_notify clear_flush_young_ptes
> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
> #define ptep_clear_young_notify ptep_test_and_clear_young
> #define pmdp_clear_young_notify pmdp_test_and_clear_young
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b13b6f42be3c..c7d0fd228cb7 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -947,6 +947,25 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> }
> #endif
>
> +#ifndef clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> +{
> + int young = 0;
> +
> + for (;;) {
> + young |= ptep_clear_flush_young(vma, addr, ptep);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +
> + return young;
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d6799afe1114..ec232165c47d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -827,9 +827,11 @@ static bool folio_referenced_one(struct folio *folio,
> struct folio_referenced_arg *pra = arg;
> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> int ptes = 0, referenced = 0;
> + unsigned int nr;
>
> while (page_vma_mapped_walk(&pvmw)) {
> address = pvmw.address;
> + nr = 1;
>
> if (vma->vm_flags & VM_LOCKED) {
> ptes++;
> @@ -874,9 +876,21 @@ static bool folio_referenced_one(struct folio *folio,
> if (lru_gen_look_around(&pvmw))
> referenced++;
> } else if (pvmw.pte) {
> + if (folio_test_large(folio)) {
> + unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
I may be hallucinating here but I am just trying to recall things - is this a bug in
folio_pte_batch_flags()? A folio may not be naturally aligned in virtual space and hence
we may cross the PTE table while batching across it, which can be fixed by taking into
account pmd_addr_end() while computing max_nr.
> + unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
> + pte_t pteval = ptep_get(pvmw.pte);
> +
> + nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
> + }
> +
> + ptes += nr;
> if (ptep_clear_flush_young_notify(vma, address,
> - pvmw.pte))
> + pvmw.pte, nr))
> referenced++;
> + /* Skip the batched PTEs */
> + pvmw.pte += nr - 1;
> + pvmw.address += (nr - 1) * PAGE_SIZE;
> } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> if (pmdp_clear_flush_young_notify(vma, address,
> pvmw.pmd))
> @@ -886,7 +900,11 @@ static bool folio_referenced_one(struct folio *folio,
> WARN_ON_ONCE(1);
> }
>
> - pra->mapcount--;
> + pra->mapcount -= nr;
> + if (ptes == pvmw.nr_pages) {
> + page_vma_mapped_walk_done(&pvmw);
> + break;
> + }
> }
>
> if (referenced)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
2025-12-17 6:49 ` Dev Jain
@ 2025-12-17 7:09 ` Baolin Wang
2025-12-17 7:23 ` Dev Jain
0 siblings, 1 reply; 35+ messages in thread
From: Baolin Wang @ 2025-12-17 7:09 UTC (permalink / raw)
To: Dev Jain, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, linux-mm,
linux-arm-kernel, linux-kernel
On 2025/12/17 14:49, Dev Jain wrote:
>
> On 11/12/25 1:46 pm, Baolin Wang wrote:
>> Currently, folio_referenced_one() always checks the young flag for each PTE
>> sequentially, which is inefficient for large folios. This inefficiency is
>> especially noticeable when reclaiming clean file-backed large folios, where
>> folio_referenced() is observed as a significant performance hotspot.
>>
>> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
>> an optimization to clear the young flags for PTEs within a contiguous range.
>> However, this is not sufficient. We can extend this to perform batched operations
>> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>>
>> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
>> of the young flags and flushing TLB entries, thereby improving performance
>> during large folio reclamation.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
>> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
>> from approximately 35% to around 5%.
>>
>> W/o patchset:
>> real 0m1.518s
>> user 0m0.000s
>> sys 0m1.518s
>>
>> W/ patchset:
>> real 0m1.018s
>> user 0m0.000s
>> sys 0m1.018s
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
>> include/linux/mmu_notifier.h | 9 +++++----
>> include/linux/pgtable.h | 19 +++++++++++++++++++
>> mm/rmap.c | 22 ++++++++++++++++++++--
>> 4 files changed, 55 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index e03034683156..a865bd8c46a3 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1869,6 +1869,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>> }
>>
>> +#define clear_flush_young_ptes clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> +{
>> + if (likely(nr == 1))
>> + return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
>> +}
>> +
>> #define wrprotect_ptes wrprotect_ptes
>> static __always_inline void wrprotect_ptes(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep, unsigned int nr)
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d1094c2d5fb6..be594b274729 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>> range->owner = owner;
>> }
>>
>> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
>> +#define ptep_clear_flush_young_notify(__vma, __address, __ptep, __nr) \
>> ({ \
>> int __young; \
>> struct vm_area_struct *___vma = __vma; \
>> unsigned long ___address = __address; \
>> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
>> + unsigned int ___nr = __nr; \
>> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
>> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
>> ___address, \
>> ___address + \
>> - PAGE_SIZE); \
>> + nr * PAGE_SIZE); \
>> __young; \
>> })
>>
>> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>>
>> #define mmu_notifier_range_update_to_read_only(r) false
>>
>> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
>> +#define ptep_clear_flush_young_notify clear_flush_young_ptes
>> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
>> #define ptep_clear_young_notify ptep_test_and_clear_young
>> #define pmdp_clear_young_notify pmdp_test_and_clear_young
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b13b6f42be3c..c7d0fd228cb7 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -947,6 +947,25 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> }
>> #endif
>>
>> +#ifndef clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> +{
>> + int young = 0;
>> +
>> + for (;;) {
>> + young |= ptep_clear_flush_young(vma, addr, ptep);
>> + if (--nr == 0)
>> + break;
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + }
>> +
>> + return young;
>> +}
>> +#endif
>> +
>> /*
>> * On some architectures hardware does not set page access bit when accessing
>> * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index d6799afe1114..ec232165c47d 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -827,9 +827,11 @@ static bool folio_referenced_one(struct folio *folio,
>> struct folio_referenced_arg *pra = arg;
>> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>> int ptes = 0, referenced = 0;
>> + unsigned int nr;
>>
>> while (page_vma_mapped_walk(&pvmw)) {
>> address = pvmw.address;
>> + nr = 1;
>>
>> if (vma->vm_flags & VM_LOCKED) {
>> ptes++;
>> @@ -874,9 +876,21 @@ static bool folio_referenced_one(struct folio *folio,
>> if (lru_gen_look_around(&pvmw))
>> referenced++;
>> } else if (pvmw.pte) {
>> + if (folio_test_large(folio)) {
>> + unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
>
> I may be hallucinating here but I am just trying to recall things - is this a bug in
> folio_pte_batch_flags()? A folio may not be naturally aligned in virtual space and hence
> we may cross the PTE table while batching across it, which can be fixed by taking into
> account pmd_addr_end() while computing max_nr.
IMHO, the comments for the folio_pte_batch_flags() function already make the
requirements on the caller clear, in order to avoid such situations:
"
* @ptep must map any page of the folio. max_nr must be at least one and
* must be limited by the caller so scanning cannot exceed a single VMA and
* a single page table.
"
Additionally, Lance recently fixed a similar issue, see commit
ddd05742b45b ("mm/rmap: fix potential out-of-bounds page table access
during batched unmap").
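To make the caller-side requirement concrete, the clamp in folio_referenced_one()
(patch 2/3) follows the same pattern as folio_unmap_pte_batch(), roughly:
"
/* Never let the batch scan beyond the current page table or the VMA */
unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;

nr = folio_pte_batch(folio, pvmw.pte, ptep_get(pvmw.pte), max_nr);
"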
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
2025-12-17 7:09 ` Baolin Wang
@ 2025-12-17 7:23 ` Dev Jain
0 siblings, 0 replies; 35+ messages in thread
From: Dev Jain @ 2025-12-17 7:23 UTC (permalink / raw)
To: Baolin Wang, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, willy, baohua, linux-mm,
linux-arm-kernel, linux-kernel
On 17/12/25 12:39 pm, Baolin Wang wrote:
>
>
> On 2025/12/17 14:49, Dev Jain wrote:
>>
>> On 11/12/25 1:46 pm, Baolin Wang wrote:
>>> Currently, folio_referenced_one() always checks the young flag for each PTE
>>> sequentially, which is inefficient for large folios. This inefficiency is
>>> especially noticeable when reclaiming clean file-backed large folios, where
>>> folio_referenced() is observed as a significant performance hotspot.
>>>
>>> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
>>> an optimization to clear the young flags for PTEs within a contiguous range.
>>> However, this is not sufficient. We can extend this to perform batched
>>> operations
>>> for the entire large folio (which might exceed the contiguous range:
>>> CONT_PTE_SIZE).
>>>
>>> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
>>> of the young flags and flushing TLB entries, thereby improving performance
>>> during large folio reclamation.
>>>
>>> Performance testing:
>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>>> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
>>> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
>>> from approximately 35% to around 5%.
>>>
>>> W/o patchset:
>>> real 0m1.518s
>>> user 0m0.000s
>>> sys 0m1.518s
>>>
>>> W/ patchset:
>>> real 0m1.018s
>>> user 0m0.000s
>>> sys 0m1.018s
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
>>> include/linux/mmu_notifier.h | 9 +++++----
>>> include/linux/pgtable.h | 19 +++++++++++++++++++
>>> mm/rmap.c | 22 ++++++++++++++++++++--
>>> 4 files changed, 55 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h
>>> b/arch/arm64/include/asm/pgtable.h
>>> index e03034683156..a865bd8c46a3 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -1869,6 +1869,17 @@ static inline int ptep_clear_flush_young(struct
>>> vm_area_struct *vma,
>>> return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>> }
>>> +#define clear_flush_young_ptes clear_flush_young_ptes
>>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep,
>>> + unsigned int nr)
>>> +{
>>> + if (likely(nr == 1))
>>> + return __ptep_clear_flush_young(vma, addr, ptep);
>>> +
>>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
>>> +}
>>> +
>>> #define wrprotect_ptes wrprotect_ptes
>>> static __always_inline void wrprotect_ptes(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep, unsigned int nr)
>>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>>> index d1094c2d5fb6..be594b274729 100644
>>> --- a/include/linux/mmu_notifier.h
>>> +++ b/include/linux/mmu_notifier.h
>>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>>> range->owner = owner;
>>> }
>>> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
>>> +#define ptep_clear_flush_young_notify(__vma, __address, __ptep, __nr) \
>>> ({ \
>>> int __young; \
>>> struct vm_area_struct *___vma = __vma; \
>>> unsigned long ___address = __address; \
>>> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
>>> + unsigned int ___nr = __nr; \
>>> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
>>> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
>>> ___address, \
>>> ___address + \
>>> - PAGE_SIZE); \
>>> + nr * PAGE_SIZE); \
>>> __young; \
>>> })
>>> @@ -650,7 +651,7 @@ static inline void
>>> mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>>> #define mmu_notifier_range_update_to_read_only(r) false
>>> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
>>> +#define ptep_clear_flush_young_notify clear_flush_young_ptes
>>> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
>>> #define ptep_clear_young_notify ptep_test_and_clear_young
>>> #define pmdp_clear_young_notify pmdp_test_and_clear_young
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index b13b6f42be3c..c7d0fd228cb7 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -947,6 +947,25 @@ static inline void wrprotect_ptes(struct mm_struct *mm,
>>> unsigned long addr,
>>> }
>>> #endif
>>> +#ifndef clear_flush_young_ptes
>>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep,
>>> + unsigned int nr)
>>> +{
>>> + int young = 0;
>>> +
>>> + for (;;) {
>>> + young |= ptep_clear_flush_young(vma, addr, ptep);
>>> + if (--nr == 0)
>>> + break;
>>> + ptep++;
>>> + addr += PAGE_SIZE;
>>> + }
>>> +
>>> + return young;
>>> +}
>>> +#endif
>>> +
>>> /*
>>> * On some architectures hardware does not set page access bit when accessing
>>> * memory page, it is responsibility of software setting this bit. It brings
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index d6799afe1114..ec232165c47d 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -827,9 +827,11 @@ static bool folio_referenced_one(struct folio *folio,
>>> struct folio_referenced_arg *pra = arg;
>>> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>>> int ptes = 0, referenced = 0;
>>> + unsigned int nr;
>>> while (page_vma_mapped_walk(&pvmw)) {
>>> address = pvmw.address;
>>> + nr = 1;
>>> if (vma->vm_flags & VM_LOCKED) {
>>> ptes++;
>>> @@ -874,9 +876,21 @@ static bool folio_referenced_one(struct folio *folio,
>>> if (lru_gen_look_around(&pvmw))
>>> referenced++;
>>> } else if (pvmw.pte) {
>>> + if (folio_test_large(folio)) {
>>> + unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
>>
>> I may be hallucinating here but I am just trying to recall things - is this a
>> bug in
>> folio_pte_batch_flags()? A folio may not be naturally aligned in virtual
>> space and hence
>> we may cross the PTE table while batching across it, which can be fixed by
>> taking into
>> account pmd_addr_end() while computing max_nr.
>
> IMHO, the comments for the folio_pte_batch_flags() function already make the
> requirements on the caller clear, in order to avoid such situations:
>
> "
> * @ptep must map any page of the folio. max_nr must be at least one and
> * must be limited by the caller so scanning cannot exceed a single VMA and
> * a single page table.
> "
>
> Additionally, Lance recently fixed a similar issue, see commit ddd05742b45b
> ("mm/rmap: fix potential out-of-bounds page table access during batched unmap").
Ah I see, all other users of the folio_pte_batch API constrain start and end
because they are already operating on a single PTE table. But for rmap code
this may not be the case.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 3/3] mm: rmap: support batched unmapping for file large folios
2025-12-17 3:11 ` Baolin Wang
@ 2025-12-17 14:28 ` Lorenzo Stoakes
0 siblings, 0 replies; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-12-17 14:28 UTC (permalink / raw)
To: Baolin Wang
Cc: Barry Song, akpm, david, catalin.marinas, will, ryan.roberts,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, willy, linux-mm, linux-arm-kernel, linux-kernel
On Wed, Dec 17, 2025 at 11:11:35AM +0800, Baolin Wang wrote:
>
>
> On 2025/12/16 18:54, Lorenzo Stoakes wrote:
> > On Tue, Dec 16, 2025 at 02:22:11PM +0800, Baolin Wang wrote:
> > >
> > >
> > > On 2025/12/16 14:13, Barry Song wrote:
> > > > > > >
> > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > > > index ec232165c47d..4c9d5777c8da 100644
> > > > > > > --- a/mm/rmap.c
> > > > > > > +++ b/mm/rmap.c
> > > > > > > @@ -1855,9 +1855,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> > > > > > > end_addr = pmd_addr_end(addr, vma->vm_end);
> > > > > > > max_nr = (end_addr - addr) >> PAGE_SHIFT;
> > > > > > >
> > > > > > > - /* We only support lazyfree batching for now ... */
> > > > > > > - if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> > > > > > > + /* We only support lazyfree or file folios batching for now ... */
> > > > > > > + if (folio_test_anon(folio) && folio_test_swapbacked(folio))
> > > > > >
> > > > > > Why is it now ok to support file-backed batched unmapping when it wasn't in
> > > > > > Barry's series (see [0])? You don't seem to be justifying this?
> > > > >
> > > > > Barry's series[0] is merely aimed at optimizing lazyfree anonymous large
> > > > > folios and does not continue to optimize anonymous large folios or
> > > > > file-backed large folios at that point.
> > > >
> > > > Yep. At that time, I didn’t have an Android machine with a filesystem
> > > > that supported large folios, so I focused on lazyfree. But I
> > > > agree that lazyfree anon folios and file folios are quite
> > > > similar.
> > > >
> > > > >
> > > > > Subsequently, Barry sent out a new patch (see [1]) to optimize anonymous
> > > > > large folios. As for file-backed large folios, the batched unmapping
> > > > > support is relatively simple, since we only need to clear the PTE
> > > > > entries for file-backed large folios.
> > > >
> > > > Yep. It is actually quite straightforward to go from lazyfree
> > > > anon folios to file folios. Swap-backed anon folios are much
> > > > more tricky, though.
> > >
> > > Agree. Thanks Barry for reviewing and confirming.
> >
> > OK that makes me less concerned, but you do need to put some more justification
> > in the commit message.
>
> Sure. Will do.
Thanks!
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag for large folios
2025-12-17 3:53 ` Baolin Wang
@ 2025-12-17 14:50 ` Lorenzo Stoakes
2025-12-17 16:06 ` Ryan Roberts
0 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-12-17 14:50 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, david, catalin.marinas, will, ryan.roberts, Liam.Howlett,
vbabka, rppt, surenb, mhocko, riel, harry.yoo, jannh, willy,
baohua, linux-mm, linux-arm-kernel, linux-kernel
On Wed, Dec 17, 2025 at 11:53:31AM +0800, Baolin Wang wrote:
>
>
> On 2025/12/16 19:11, Lorenzo Stoakes wrote:
> > On Tue, Dec 16, 2025 at 11:32:58AM +0800, Baolin Wang wrote:
> > >
> > >
> > > On 2025/12/15 19:36, Lorenzo Stoakes wrote:
> > > > On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
> > > > > Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
> > > > > only clear the young flag and flush TLBs for PTEs within the contiguous range.
> > > > > To support batch PTE operations for other sized large folios in the following
> > > > > patches, adding a new parameter to specify the number of PTEs.
> > > > >
> > > > > While we are at it, rename the functions to maintain consistency with other
> > > > > contpte_*() functions.
> > > > >
> > > > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > > > ---
> > > > > arch/arm64/include/asm/pgtable.h | 12 ++++-----
> > > > > arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
> > > > > 2 files changed, 37 insertions(+), 19 deletions(-)
> > > > >
> > > > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > > > index 0944e296dd4a..e03034683156 100644
> > > > > --- a/arch/arm64/include/asm/pgtable.h
> > > > > +++ b/arch/arm64/include/asm/pgtable.h
> > > > > @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> > > > > extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> > > > > unsigned long addr, pte_t *ptep,
> > > > > unsigned int nr, int full);
> > > > > -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > > > - unsigned long addr, pte_t *ptep);
> > > > > -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> > > > > - unsigned long addr, pte_t *ptep);
> > > > > +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> > > > > + unsigned long addr, pte_t *ptep, unsigned int nr);
> > > > > +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> > > > > + unsigned long addr, pte_t *ptep, unsigned int nr);
> > > >
> > > > In core mm at least, as a convention, we strip 'extern' from header declarations
> > > > as we go as they're not necessary.
> > >
> > > Sure. I can remove the 'extern'.
> >
> > Thanks.
> >
> > >
> > > > I wonder about this naming convention also. The contpte_xxx_() prefix
> > > > obviously implies we are operating upon PTEs in the contiguous range, now
> > > > we are doing something... different and 'nr' is a bit vague.
> > > >
> > > > Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
> > >
> > > Yes, 'nr' here means nr_pages, which follows the convention of the
> > > parameters in contpte_xxx_() functions.
> >
> > Except you're violating the convention already by doing non-contpte stuff in
> > contpte functions?
>
> Nope. Sorry for my poor English, which might have caused some confusion. I
> followed the contpte's convention, and let me try to explain it again below.
>
> > The confusion is between nr of contpte ranges and nr of pages, so I think we
> > should be explicit.
> >
> > >
> > > > now we're not saying they're necessarily contiguous in the sense of contpte...
> > > >
> > > > I wonder whether we'd be better off introducing a new function that
> > > > explicitly has 'batch' or 'batched' in the name, and have the usual
> > > > contpte_***() stuff and callers that need batching using a new underlying helper?
> > >
> > > OK. I get your point. However, this is outside the scope of my patchset.
> >
> > Well, no, this is review feedback that needs to be addressed, I really don't
> > like this patch as-is.
> >
> > So for this to be upstreamable you really need to separate this out.
> >
> > >
> > > Perhaps we can clean up all the contpte_xxx_() functions if everyone agrees
> > > that this is confusing.
> >
> > I mean, unless Ryan explicitly makes a case for this being fine, I'm inclined to
> > reject the patch unless we do this.
> >
> > The contpte logic is already very confusing, and this is only adding to it.
> >
> > >
> > > > Obviously I defer to Ryan on all this, just my thoughts...
> > > >
> > > > > extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> > > > > pte_t *ptep, unsigned int nr);
> > > > > extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> > > > > @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > > > if (likely(!pte_valid_cont(orig_pte)))
> > > > > return __ptep_test_and_clear_young(vma, addr, ptep);
> > > > >
> > > > > - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> > > > > + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
> > > > > }
> > > > >
> > > > > #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> > > > > @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> > > > > if (likely(!pte_valid_cont(orig_pte)))
> > > > > return __ptep_clear_flush_young(vma, addr, ptep);
> > > > >
> > > > > - return contpte_ptep_clear_flush_young(vma, addr, ptep);
> > > > > + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
> > > > > }
> > > > >
> > > > > #define wrprotect_ptes wrprotect_ptes
> > > > > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > > > > index c0557945939c..19b122441be3 100644
> > > > > --- a/arch/arm64/mm/contpte.c
> > > > > +++ b/arch/arm64/mm/contpte.c
> > > > > @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
> > > > >
> > > > > -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > > > - unsigned long addr, pte_t *ptep)
> > > > > +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> > > > > + unsigned long addr, pte_t *ptep,
> > > > > + unsigned int nr)
> > > > > {
> > > > > /*
> > > > > * ptep_clear_flush_young() technically requires us to clear the access
> > > > > @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > > > * having to unfold.
> > > > > */
> > > >
> > > > Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
> > > >
> > > > "And since we only create a contig range when the range is covered by a single
> > > > folio, we can get away with clearing young for the whole contig range here, so
> > > > we avoid having to unfold."
> > >
> > > I think the original comments are still clear:
> > > "
> > > /*
> > > * ptep_clear_flush_young() technically requires us to clear the access
> > > * flag for a _single_ pte. However, the core-mm code actually tracks
> > > * access/dirty per folio, not per page. And since we only create a
> > > * contig range when the range is covered by a single folio, we can get
> > > * away with clearing young for the whole contig range here, so we avoid
> > > * having to unfold.
> > > */
> > > "
> >
> > Except nr means you can span more than a single contig pte range? So this
> > comment is no longer applicable as-is?
>
> The nr means consecutive (present) PTEs that map consecutive pages of the
> same large folio in a single VMA and a single page table.
>
> I think this is a prerequisite. If you think it's necessary to explain it, I
> can add it to the comments.
>
> > You've changed the function, the comment needs to change too.
> >
> > >
> > > > However now you're allowing for large folios that are not contpte?
> > > >
> > > > Overall feeling pretty iffy about implementing this this way.
> > >
> > > Please see the above comments, which explain why we should do that.
> >
> > You haven't explained anything? Maybe I missed it?
> >
> > Can you explain clearly what nr might actually be in relation to contpte's? I'm
> > confused even as to the utility here.
> >
> > Is it batched ranges of contptes? But then why are you trying to see if end is a
> > contpte + expand the range if so?
>
> See below.
>
> > > > > + unsigned long start = addr;
> > > > > + unsigned long end = start + nr * PAGE_SIZE;
> > > >
> > > > So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
> > > > table?
> > >
> > > Caller has made sure that (see folio_referenced_one()).
> >
> > You should comment this somehow, I hate this nested assumptions stuff 'function
> > X which calls function Y which calls function Z makes sure that parameter A
> > fulfils requirements B but we don't mention it anywhere'.
>
> Sure.
>
> > > > > int young = 0;
> > > > > int i;
> > > > >
> > > > > - ptep = contpte_align_down(ptep);
> > > > > - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> > > > > + if (pte_cont(__ptep_get(ptep + nr - 1)))
> > > > > + end = ALIGN(end, CONT_PTE_SIZE);
> > > >
> > > > OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
> > > > with other PTE entries present there not having PTE_CONT set (Ryan - do
> > > > correct me if I'm wrong!)
> > > >
> > > > I guess then no worry about running off the end of the PTE table here.
> > > >
> > > > I wonder about using pte_cont_addr_end() here, but I guess you are not
> > > > intending to limit to a range here but rather capture the entire contpte
> > > > range of the end.
> > >
> > > Right.
> > >
> > > > I don't love that now a function prefixed with contpte_... can have a
> > > > condition where part of the input range is _not_ a contpte entry, or is
> > > > specified partially...
> > >
> > > Like I said above, this is outside the scope of my patchset. If Ryan also
> > > thinks we need to do some cleanups for all the contpte_xxx() functions in
> > > the contpte.c file, we can send a follow-up patchset to rename all the
> > > contpte_xxx() functions to make it clear.
> >
> > It's really not outside of the scope of the patch set, this is review feedback
> > you have to address :) trying not to be a grumpy kernel dev here :>)
> >
> > In general in the kernel we don't accept confusing patches then defer making
> > them not-confusing until later.
> >
> > We review until the series is at the correct standard to be accepted before we
> > merge.
> >
> > As always I'm happy to be convinced that I'm mistaken or something's not
> > necesary, however!
>
> Yes, let me try to explain it again. See below.
>
> > > > > - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> > > > > - young |= __ptep_test_and_clear_young(vma, addr, ptep);
> > > > > + if (pte_cont(__ptep_get(ptep))) {
> > > > > + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
> > > > > + ptep = contpte_align_down(ptep);
> > > > > + }
> > > > > +
> > > > > + nr = (end - start) / PAGE_SIZE;
> > > >
> > > > Hm don't love this reuse of input param.
> > >
> > > OK. Will use another local variable instead.
> >
> > Thanks.
> >
> > >
> > > >
> > > > > + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
> > > > > + young |= __ptep_test_and_clear_young(vma, start, ptep);
> > > >
> > > > Again, overall find it a bit iffy that we are essentially overloading the
> > > > code with non-contpte behaviour.
> > >
> > > Let me clarify further: this is the handling logic of the contpte_xxx()
> > > functions in the contpte.c file. For example, the function
> > > contpte_set_ptes() has a parameter 'nr', which may be greater than
> > > CONT_PTES, meaning that it can handle multiple CONT_PTES range sizes of a
> > > large folio.
> > >
> > > It might be a bit confusing, and I understand this is why you want to change
> > > the 'contpte_' prefix to the 'batch_' prefix. Of course, we can hope Ryan
> > > can provide some input on this.
> > >
> > > Thanks for reviewing.
> >
> > OK so we have some reeeeal confusion here. And this speaks to the patch needing
> > work.
> >
> > This seems to answer what I asked above - basically - are we doing _batches_ of
> > _contpte_ entries, or are we just accepting arbitrary large folios here and
> > batching up operations on them, including non-contpte entries.
> >
> > Presumably from the above you're saying - this is dealing with batches of
> > contpte entries.
> >
> > In which case this has to be made _super_ clear. Not just adding an arbitrary
> > 'nr' parameter and letting people guess.
> >
> > The contpte code is _already_ confusing and somewhat surprising if you're not
> > aware of it, we need to be going in the other direction.
>
> Sure. Sorry for causing confusion. I try to explain it again to make it
> clear.
>
> First, it is necessary to explain how the contpte implementation in Arm64 is
> exposed, using set_ptes() as an example.
>
> Arm64 defines an arch-specific implementation of set_ptes():
> "
> static __always_inline void set_ptes(struct mm_struct *mm, unsigned long
> addr,
> pte_t *ptep, pte_t pte, unsigned int nr)
> {
> pte = pte_mknoncont(pte);
>
> if (likely(nr == 1)) {
> contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> __set_ptes(mm, addr, ptep, pte, 1);
> contpte_try_fold(mm, addr, ptep, pte);
> } else {
> contpte_set_ptes(mm, addr, ptep, pte, nr);
> }
> }
> "
>
> Here, 'nr' refers to consecutive (present) PTEs that map consecutive pages
> of the same large folio within a single VMA and a single page table. Based
> on 'nr', it determines whether the PTE_CONT attribute should be set or not.
>
> Second, the underlying implementation of contpte_set_ptes() also calculates
> 'end' based on 'nr' and checks whether the PTE_CONT attribute should be set.
> For example:
>
> For a large folio with order = 3 (nr = 8), the PTE_CONT attribute will not
> be set since the 'next' is not aligned with CONT_PTE_MASK.
>
> For a large folio with order = 4 (nr = 16), it aligns perfectly, allowing
> the PTE_CONT attribute to be set for the entire PTE entries of the large
> folio.
>
> For a large folio with order = 5 (nr = 32), this involves two ranges of
> CONT_PTE_SIZE (16 PTEs in each range), requiring the PTE_CONT attribute to
> be set separately for each range.
>
> "
> void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pte, unsigned int nr)
> {
> .......
>
> end = addr + (nr << PAGE_SHIFT);
> pfn = pte_pfn(pte);
> prot = pte_pgprot(pte);
>
> do {
> next = pte_cont_addr_end(addr, end);
> nr = (next - addr) >> PAGE_SHIFT;
> pte = pfn_pte(pfn, prot);
>
> if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
> pte = pte_mkcont(pte);
> else
> pte = pte_mknoncont(pte);
>
> __set_ptes(mm, addr, ptep, pte, nr);
>
> addr = next;
> ptep += nr;
> pfn += nr;
>
> } while (addr != end);
> }
> "
>
> Third, regarding the implementation of my patch, I have followed the contpte
> logic here by adding the 'nr' parameter to
> contpte_ptep_test_and_clear_young() for batch processing consecutive
> (present) PTEs. Moreover, the semantics of 'nr' remain consistent with the
> original contpte functions, and I have not broken the existing contpte
> implementation logic.
OK that helps a lot thanks.
So nr will be the number of PTEs that are consecutive, belong to the same folio,
and have the same PTE bits set, excluding accessed, writable, dirty and soft-dirty.
But since we are retrieving accessed bit state (young), isn't it a problem we're
ignoring this bit when considering what goes into the batch?
I mean now it's going to return true even if some do not have the accessed flag
set?
I am more convinced things are correct, at least in style, done this way since, as
you say, this is the approach taken in set_ptes().
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag for large folios
2025-12-11 8:16 ` [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag " Baolin Wang
2025-12-15 11:36 ` Lorenzo Stoakes
@ 2025-12-17 15:43 ` Ryan Roberts
2025-12-18 7:15 ` Baolin Wang
1 sibling, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2025-12-17 15:43 UTC (permalink / raw)
To: Baolin Wang, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, willy, baohua, linux-mm, linux-arm-kernel,
linux-kernel
Sorry I'm a bit late to the party...
On 11/12/2025 08:16, Baolin Wang wrote:
> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
> only clear the young flag and flush TLBs for PTEs within the contiguous range.
> To support batch PTE operations for other sized large folios in the following
> patches, adding a new parameter to specify the number of PTEs.
>
> While we are at it, rename the functions to maintain consistency with other
> contpte_*() functions.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> arch/arm64/include/asm/pgtable.h | 12 ++++-----
> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
> 2 files changed, 37 insertions(+), 19 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0944e296dd4a..e03034683156 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep,
> unsigned int nr, int full);
> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep);
> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep);
> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr);
> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr);
The "contpte_" functions are intended to be private to the arm64 arch and should
be exposed via the generic APIs. But I don't see any generic batched API for
this, so you're only actually able to pass CONT_PTES as nr. Perhaps you're
planning to define "test_and_clear_young_ptes()" and "clear_flush_young_ptes()"
in later patches?
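I.e. something along the lines of (signatures purely illustrative here, mirroring
the contpte variants above):

int test_and_clear_young_ptes(struct vm_area_struct *vma, unsigned long addr,
			      pte_t *ptep, unsigned int nr);
int clear_flush_young_ptes(struct vm_area_struct *vma, unsigned long addr,
			   pte_t *ptep, unsigned int nr);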
> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned int nr);
> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> if (likely(!pte_valid_cont(orig_pte)))
> return __ptep_test_and_clear_young(vma, addr, ptep);
>
> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
> }
>
> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> if (likely(!pte_valid_cont(orig_pte)))
> return __ptep_clear_flush_young(vma, addr, ptep);
>
> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
> }
>
> #define wrprotect_ptes wrprotect_ptes
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index c0557945939c..19b122441be3 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> }
> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>
> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep)
> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> {
> /*
> * ptep_clear_flush_young() technically requires us to clear the access
> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> * having to unfold.
> */
>
> + unsigned long start = addr;
Personally I wouldn't bother defining start - just reuse addr. You're
incrementing start in the below loop, so it's more appropriate to call it addr
anyway.
> + unsigned long end = start + nr * PAGE_SIZE;
> int young = 0;
> int i;
>
> - ptep = contpte_align_down(ptep);
> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + if (pte_cont(__ptep_get(ptep + nr - 1)))
> + end = ALIGN(end, CONT_PTE_SIZE);
>
> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
> + if (pte_cont(__ptep_get(ptep))) {
> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
> + ptep = contpte_align_down(ptep);
> + }
> +
> + nr = (end - start) / PAGE_SIZE;
> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
Given you're now defining end, perhaps we don't need nr?
for (; addr != end; ptep++, addr += PAGE_SIZE)
	young |= __ptep_test_and_clear_young(vma, addr, ptep);
> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>
> return young;
> }
> -EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
> +EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
>
> -int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep)
> +int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> {
> int young;
>
> - young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> + young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
>
> if (young) {
> + unsigned long start = addr;
> + unsigned long end = start + nr * PAGE_SIZE;
> +
> + if (pte_cont(__ptep_get(ptep + nr - 1)))
> + end = ALIGN(end, CONT_PTE_SIZE);
> +
> + if (pte_cont(__ptep_get(ptep)))
> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
> +
We now have this pattern of expanding contpte blocks up and down in 3 places.
Perhaps create a helper?
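Something along these lines perhaps (name and exact shape purely illustrative,
untested):

static void contpte_expand_cont_range(pte_t **ptep, unsigned long *start,
				      unsigned long *end, unsigned int nr)
{
	/* If the last PTE of the batch is in a contpte block, cover the whole block. */
	if (pte_cont(__ptep_get(*ptep + nr - 1)))
		*end = ALIGN(*end, CONT_PTE_SIZE);

	/* Likewise expand start (and ptep) downwards. */
	if (pte_cont(__ptep_get(*ptep))) {
		*start = ALIGN_DOWN(*start, CONT_PTE_SIZE);
		*ptep = contpte_align_down(*ptep);
	}
}

Then both the test_and_clear and clear_flush variants just call it before walking
or flushing the expanded [start, end) range.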
Thanks,
Ryan
> /*
> * See comment in __ptep_clear_flush_young(); same rationale for
> * eliding the trailing DSB applies here.
> */
> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> - __flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
> + __flush_tlb_range_nosync(vma->vm_mm, start, end,
> PAGE_SIZE, true, 3);
> }
>
> return young;
> }
> -EXPORT_SYMBOL_GPL(contpte_ptep_clear_flush_young);
> +EXPORT_SYMBOL_GPL(contpte_clear_flush_young_ptes);
>
> void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned int nr)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag for large folios
2025-12-17 14:50 ` Lorenzo Stoakes
@ 2025-12-17 16:06 ` Ryan Roberts
2025-12-18 7:56 ` Baolin Wang
0 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2025-12-17 16:06 UTC (permalink / raw)
To: Lorenzo Stoakes, Baolin Wang
Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, linux-mm,
linux-arm-kernel, linux-kernel
On 17/12/2025 14:50, Lorenzo Stoakes wrote:
> On Wed, Dec 17, 2025 at 11:53:31AM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/12/16 19:11, Lorenzo Stoakes wrote:
>>> On Tue, Dec 16, 2025 at 11:32:58AM +0800, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/12/15 19:36, Lorenzo Stoakes wrote:
>>>>> On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
>>>>>> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
>>>>>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>>>>>> To support batch PTE operations for other sized large folios in the following
>>>>>> patches, adding a new parameter to specify the number of PTEs.
>>>>>>
>>>>>> While we are at it, rename the functions to maintain consistency with other
>>>>>> contpte_*() functions.
>>>>>>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> ---
>>>>>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>>>>>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>>>>>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>>>> index 0944e296dd4a..e03034683156 100644
>>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>>>> unsigned long addr, pte_t *ptep,
>>>>>> unsigned int nr, int full);
>>>>>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>> - unsigned long addr, pte_t *ptep);
>>>>>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>> - unsigned long addr, pte_t *ptep);
>>>>>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>>>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>>>
>>>>> In core mm at least, as a convention, we strip 'extern' from header declarations
>>>>> as we go as they're not necessary.
>>>>
>>>> Sure. I can remove the 'extern'.
>>>
>>> Thanks.
>>>
>>>>
>>>>> I wonder about this naming convention also. The contpte_xxx_() prefix
>>>>> obviously implies we are operating upon PTEs in the contiguous range, now
>>>>> we are doing something... different and 'nr' is a bit vague.
>>>>>
>>>>> Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
>>>>
>>>> Yes, 'nr' here means nr_pages, which follows the convention of the
>>>> parameters in contpte_xxx_() functions.
>>>
>>> Except you're violating the convention already by doing non-contpte stuff in
>>> contpte functions?
>>
>> Nope. Sorry for my poor English, which might have caused some confusion. I
>> followed the contpte's convention, and let me try to explain it again below.
>>
>>> The confusion is between nr of contpte ranges and nr of pages, so I think we
>>> should be explicit.
>>>
>>>>
>>>>> now we're not saying they're necessarily contiguous in the sense of contpte...
>>>>>
>>>>> I wonder whether we'd be better off introducing a new function that
>>>>> explicitly has 'batch' or 'batched' in the name, and have the usual
>>>>> contpte_***() stuff and callers that need batching using a new underlying helper?
>>>>
>>>> OK. I get your point. However, this is outside the scope of my patchset.
>>>
>>> Well, no, this is review feedback that needs to be addressed, I really don't
>>> like this patch as-is.
>>>
>>> So for this to be upstreamable you really need to separate this out.
>>>
>>>>
>>>> Perhaps we can clean up all the contpte_xxx_() functions if everyone agrees
>>>> that this is confusing.
>>>
>>> I mean, unless Ryan explicitly makes a case for this being fine, I'm inclined to
>>> reject the patch unless we do this.
>>>
>>> The contpte logic is already very confusing, and this is only adding to it.
>>>
>>>>
>>>>> Obviously I defer to Ryan on all this, just my thoughts...
>>>>>
>>>>>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>> pte_t *ptep, unsigned int nr);
>>>>>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>>>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>>>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>>>>>
>>>>>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>>>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>>>>>> }
>>>>>>
>>>>>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>>>>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>>>> return __ptep_clear_flush_young(vma, addr, ptep);
>>>>>>
>>>>>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>>>>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>>>>> }
>>>>>>
>>>>>> #define wrprotect_ptes wrprotect_ptes
>>>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>>>> index c0557945939c..19b122441be3 100644
>>>>>> --- a/arch/arm64/mm/contpte.c
>>>>>> +++ b/arch/arm64/mm/contpte.c
>>>>>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>>>> }
>>>>>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>>>>>
>>>>>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>> - unsigned long addr, pte_t *ptep)
>>>>>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>>>> + unsigned long addr, pte_t *ptep,
>>>>>> + unsigned int nr)
>>>>>> {
>>>>>> /*
>>>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>>>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>> * having to unfold.
>>>>>> */
>>>>>
>>>>> Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
>>>>>
>>>>> "And since we only create a contig range when the range is covered by a single
>>>>> folio, we can get away with clearing young for the whole contig range here, so
>>>>> we avoid having to unfold."
>>>>
>>>> I think the original comments are still clear:
>>>> "
>>>> /*
>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>> * flag for a _single_ pte. However, the core-mm code actually tracks
>>>> * access/dirty per folio, not per page. And since we only create a
>>>> * contig range when the range is covered by a single folio, we can get
>>>> * away with clearing young for the whole contig range here, so we avoid
>>>> * having to unfold.
>>>> */
>>>> "
>>>
>>> Except nr means you can span more than a single contig pte range? So this
>>> comment is no longer applicable as-is?
>>
>> The nr means consecutive (present) PTEs that map consecutive pages of the
>> same large folio in a single VMA and a single page table.
>>
>> I think this is a prerequisite. If you think it's necessary to explain it, I
>> can add it to the comments.
>>
>>> You've changed the function, the comment needs to change too.
>>>
>>>>
>>>>> However now you're allowing for large folios that are not contpte?
>>>>>
>>>>> Overall feeling pretty iffy about implementing this this way.
>>>>
>>>> Please see the above comments, which explain why we should do that.
>>>
>>> You haven't explained anything? Maybe I missed it?
>>>
>>> Can you explain clearly what nr might actually be in relation to contpte's? I'm
>>> confused even as to the utility here.
>>>
>>> Is it batched ranges of contptes? But then why are you trying to see if end is a
>>> contpte + expand the range if so?
>>
>> See below.
>>
>>>>>> + unsigned long start = addr;
>>>>>> + unsigned long end = start + nr * PAGE_SIZE;
>>>>>
>>>>> So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
>>>>> table?
>>>>
>>>> Caller has made sure that (see folio_referenced_one()).
>>>
>>> You should comment this somehow, I hate this nested assumptions stuff 'function
>>> X which calls function Y which calls function Z makes sure that parameter A
>>> fulfils requirements B but we don't mention it anywhere'.
>>
>> Sure.
>>
>>>>>> int young = 0;
>>>>>> int i;
>>>>>>
>>>>>> - ptep = contpte_align_down(ptep);
>>>>>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>>>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>>>>
>>>>> OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
>>>>> with other PTE entries present there not having PTE_CONT set (Ryan - do
>>>>> correct me if I'm wrong!)
>>>>>
>>>>> I guess then no worry about running off the end of the PTE table here.
>>>>>
>>>>> I wonder about using pte_cont_addr_end() here, but I guess you are not
>>>>> intending to limit to a range here but rather capture the entire contpte
>>>>> range of the end.
>>>>
>>>> Right.
>>>>
>>>>> I don't love that now a function prefixed with contpte_... can have a
>>>>> condition where part of the input range is _not_ a contpte entry, or is
>>>>> specified partially...
>>>>
>>>> Like I said above, this is outside the scope of my patchset. If Ryan also
>>>> thinks we need to do some cleanups for all the contpte_xxx() functions in
>>>> the contpte.c file, we can send a follow-up patchset to rename all the
>>>> contpte_xxx() functions to make it clear.
>>>
>>> It's really not outside of the scope of the patch set, this is review feedback
>>> you have to address :) trying not to be a grumpy kernel dev here :>)
>>>
>>> In general in the kernel we don't accept confusing patches then defer making
>>> them not-confusing until later.
>>>
>>> We review until the series is at the correct standard to be accepted before we
>>> merge.
>>>
>>> As always I'm happy to be convinced that I'm mistaken or something's not
>>> necesary, however!
>>
>> Yes, let me try to explain it again. See below.
>>
>>>>>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>>>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>>>>> + if (pte_cont(__ptep_get(ptep))) {
>>>>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>>>>> + ptep = contpte_align_down(ptep);
>>>>>> + }
>>>>>> +
>>>>>> + nr = (end - start) / PAGE_SIZE;
>>>>>
>>>>> Hm don't love this reuse of input param.
>>>>
>>>> OK. Will use another local variable instead.
>>>
>>> Thanks.
>>>
>>>>
>>>>>
>>>>>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>>>>>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>>>>
>>>>> Again, overall find it a bit iffy that we are essentially overloading the
>>>>> code with non-contpte behaviour.
>>>>
>>>> Let me clarify further: this is the handling logic of the contpte_xxx()
>>>> functions in the contpte.c file. For example, the function
>>>> contpte_set_ptes() has a parameter 'nr', which may be greater than
>>>> CONT_PTES, meaning that it can handle multiple CONT_PTES range sizes of a
>>>> large folio.
>>>>
>>>> It might be a bit confusing, and I understand this is why you want to change
>>>> the 'contpte_' prefix to the 'batch_' prefix. Of course, we can hope Ryan
>>>> can provide some input on this.
>>>>
>>>> Thanks for reviewing.
>>>
>>> OK so we have some reeeeal confusion here. And this speaks to the patch needing
>>> work.
>>>
>>> This seems to answer what I asked above - basically - are we doing _batches_ of
>>> _contpte_ entries, or are we just accepting arbitrary large folios here and
>>> batching up operations on them, including non-contpte entries.
>>>
>>> Presumably from the above you're saying - this is dealing with batches of
>>> contpte entries.
>>>
>>> In which case this has to be made _super_ clear. Not just adding an arbitrary
>>> 'nr' parameter and letting people guess.
>>>
>>> The contpte code is _already_ confusing and somewhat surprising if you're not
>>> aware of it, we need to be going in the other direction.
>>
>> Sure. Sorry for causing confusion. I try to explain it again to make it
>> clear.
>>
>> First, it is necessary to explain how the contpte implementation in Arm64 is
>> exposed, using set_ptes() as an example.
>>
>> Arm64 defines an arch-specific implementation of set_ptes():
>> "
>> static __always_inline void set_ptes(struct mm_struct *mm, unsigned long
>> addr,
>> pte_t *ptep, pte_t pte, unsigned int nr)
>> {
>> pte = pte_mknoncont(pte);
>>
>> if (likely(nr == 1)) {
>> contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> __set_ptes(mm, addr, ptep, pte, 1);
>> contpte_try_fold(mm, addr, ptep, pte);
>> } else {
>> contpte_set_ptes(mm, addr, ptep, pte, nr);
>> }
>> }
>> "
>>
>> Here, 'nr' refers to consecutive (present) PTEs that map consecutive pages
>> of the same large folio within a single VMA and a single page table. Based
>> on 'nr', it determines whether the PTE_CONT attribute should be set or not.
>>
>> Second, the underlying implementation of contpte_set_ptes() also calculates
>> 'end' based on 'nr' and checks whether the PTE_CONT attribute should be set.
>> For example:
>>
>> For a large folio with order = 3 (nr = 8), the PTE_CONT attribute will not
>> be set since the 'next' is not aligned with CONT_PTE_MASK.
>>
>> For a large folio with order = 4 (nr = 16), it aligns perfectly, allowing
>> the PTE_CONT attribute to be set for the entire PTE entries of the large
>> folio.
>>
>> For a large folio with order = 5 (nr = 32), this involves two ranges of
>> CONT_PTE_SIZE (16 PTEs in each range), requiring the PTE_CONT attribute to
>> be set separately for each range.
>>
>> "
>> void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep, pte_t pte, unsigned int nr)
>> {
>> .......
>>
>> end = addr + (nr << PAGE_SHIFT);
>> pfn = pte_pfn(pte);
>> prot = pte_pgprot(pte);
>>
>> do {
>> next = pte_cont_addr_end(addr, end);
>> nr = (next - addr) >> PAGE_SHIFT;
>> pte = pfn_pte(pfn, prot);
>>
>> if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>> pte = pte_mkcont(pte);
>> else
>> pte = pte_mknoncont(pte);
>>
>> __set_ptes(mm, addr, ptep, pte, nr);
>>
>> addr = next;
>> ptep += nr;
>> pfn += nr;
>>
>> } while (addr != end);
>> }
>> "
>>
>> Third, regarding the implementation of my patch, I have followed the contpte
>> logic here by adding the 'nr' parameter to
>> contpte_ptep_test_and_clear_young() for batch processing consecutive
>> (present) PTEs. Moreover, the semantics of 'nr' remain consistent with the
>> original contpte functions, and I have not broken the existing contpte
>> implementation logic.
>
> OK that helps a lot thanks.
>
> So nr will be the number of PTEs that are consecutive, belong to the same folio,
> and have the same PTE bits set, excluding accessed, writable, dirty and soft-dirty.
I'm not sure test_and_clear_young_ptes() would technically require all of these
constraints. I think it would be sufficient to just ensure that all the PTEs in
the batch are pte_present(). And I assume the return value is intended to
indicate if *any* of the PTEs in the batch were previously young?
The usual order for adding new batched PTE helpers is to define the generic
function along with its spec in a comment block and provide a default
implementation that simply loops over the existing non-batched
version of the interface. Then you can convert the core-mm to use that API. And
finally you can opt arm64 into overriding the API with a more efficient
implementation.
If you were to do it in that order, it tells a better story and is easier to
follow, I think.
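For illustration only (the spec wording below is mine, not from the series), the
generic helper plus its default could look roughly like:

#ifndef clear_flush_young_ptes
/*
 * clear_flush_young_ptes - test and clear the accessed bit, and flush the TLB,
 * for a batch of present PTEs mapping consecutive pages of the same folio,
 * within a single VMA and a single page table. Returns whether *any* PTE in
 * the batch was young. Architectures may override this with a more efficient
 * implementation; the default just loops over ptep_clear_flush_young().
 */
static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
					 unsigned long addr, pte_t *ptep,
					 unsigned int nr)
{
	int young = 0;

	for (; nr; nr--, ptep++, addr += PAGE_SIZE)
		young |= ptep_clear_flush_young(vma, addr, ptep);

	return young;
}
#endif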
>
> But since we are retrieving accessed bit state (young), isn't it a problem we're
> ignoring this bit when considering what goes into the batch?
>
> I mean now it's going to return true even if some do not have the accessed flag
> set?
test_and_clear_young_ptes() is exploiting the fact that the kernel tracks young
(and dirty) per *folio*. So having a young and dirty bit per PTE is providing
more information than the kernel needs if a large folio is mapped across
multiple PTEs. Which is why returning the logical OR of the access bits is ok here.
>
> I am more convinced things are correct in style at least then this way as, as
> you say, this is the approach taken in set_ptes().
I'm personally ok with the style and I believe the implementation is correct.
But it would help to reformulate the series in the order I suggest above since
then we can review against the spec of what the core-mm helper is supposed to do.
Thanks,
Ryan
>
> Thanks, Lorenzo
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
2025-12-11 8:16 ` [PATCH v2 2/3] mm: rmap: support batched checks of the references " Baolin Wang
` (2 preceding siblings ...)
2025-12-17 6:49 ` Dev Jain
@ 2025-12-17 16:39 ` Ryan Roberts
2025-12-18 7:47 ` Baolin Wang
3 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2025-12-17 16:39 UTC (permalink / raw)
To: Baolin Wang, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, willy, baohua, linux-mm, linux-arm-kernel,
linux-kernel
On 11/12/2025 08:16, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> of the young flags and flushing TLB entries, thereby improving performance
> during large folio reclamation.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> from approximately 35% to around 5%.
>
> W/o patchset:
> real 0m1.518s
> user 0m0.000s
> sys 0m1.518s
>
> W/ patchset:
> real 0m1.018s
> user 0m0.000s
> sys 0m1.018s
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
> include/linux/mmu_notifier.h | 9 +++++----
> include/linux/pgtable.h | 19 +++++++++++++++++++
> mm/rmap.c | 22 ++++++++++++++++++++--
> 4 files changed, 55 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index e03034683156..a865bd8c46a3 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1869,6 +1869,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
> }
>
> +#define clear_flush_young_ptes clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> +{
> + if (likely(nr == 1))
> + return __ptep_clear_flush_young(vma, addr, ptep);
Bug: This is broken if core-mm tries to call this for nr=1 on a pte that is part
of a contpte mapping.
The similar fastpaths are here to prevent regressing the common small folio case.
I guess here the best approach is (note no leading underscores):
if (likely(nr == 1))
return ptep_clear_flush_young(vma, addr, ptep);
> +
> + return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> +}
> +
> #define wrprotect_ptes wrprotect_ptes
> static __always_inline void wrprotect_ptes(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep, unsigned int nr)
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d1094c2d5fb6..be594b274729 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
> range->owner = owner;
> }
>
> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
> +#define ptep_clear_flush_young_notify(__vma, __address, __ptep, __nr) \
Shouldn't we rename this macro to clear_flush_young_ptes_notify()?
And potentially:
#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
clear_flush_young_ptes_notify(__vma, __address, __ptep, 1)
if there are other non-batched users remaining.
> ({ \
> int __young; \
> struct vm_area_struct *___vma = __vma; \
> unsigned long ___address = __address; \
> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
> + unsigned int ___nr = __nr; \
> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
> ___address, \
> ___address + \
> - PAGE_SIZE); \
> + nr * PAGE_SIZE); \
> __young; \
> })
> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>
> #define mmu_notifier_range_update_to_read_only(r) false
>
> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
> +#define ptep_clear_flush_young_notify clear_flush_young_ptes
> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
> #define ptep_clear_young_notify ptep_test_and_clear_young
> #define pmdp_clear_young_notify pmdp_test_and_clear_young
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b13b6f42be3c..c7d0fd228cb7 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -947,6 +947,25 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> }
> #endif
>
> +#ifndef clear_flush_young_ptes
Let's have some function documentation here please.
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> +{
> + int young = 0;
> +
> + for (;;) {
I know Lorenzo is pretty allergic to this style of looping :)
He's right of course, we should probably just do this the idiomatic way and not
worry about it looking a bit different to the others.
> + young |= ptep_clear_flush_young(vma, addr, ptep);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +
> + return young;
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d6799afe1114..ec232165c47d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -827,9 +827,11 @@ static bool folio_referenced_one(struct folio *folio,
> struct folio_referenced_arg *pra = arg;
> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> int ptes = 0, referenced = 0;
> + unsigned int nr;
>
> while (page_vma_mapped_walk(&pvmw)) {
> address = pvmw.address;
> + nr = 1;
>
> if (vma->vm_flags & VM_LOCKED) {
> ptes++;
> @@ -874,9 +876,21 @@ static bool folio_referenced_one(struct folio *folio,
> if (lru_gen_look_around(&pvmw))
> referenced++;
> } else if (pvmw.pte) {
> + if (folio_test_large(folio)) {
> + unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
> + unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
> + pte_t pteval = ptep_get(pvmw.pte);
> +
> + nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
> + }
> +
> + ptes += nr;
> if (ptep_clear_flush_young_notify(vma, address,
> - pvmw.pte))
> + pvmw.pte, nr))
> referenced++;
> + /* Skip the batched PTEs */
> + pvmw.pte += nr - 1;
> + pvmw.address += (nr - 1) * PAGE_SIZE;
The -1 part is because the walker will increment by 1 I'm guessing?
> } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> if (pmdp_clear_flush_young_notify(vma, address,
> pvmw.pmd))
> @@ -886,7 +900,11 @@ static bool folio_referenced_one(struct folio *folio,
> WARN_ON_ONCE(1);
> }
>
> - pra->mapcount--;
> + pra->mapcount -= nr;
> + if (ptes == pvmw.nr_pages) {
> + page_vma_mapped_walk_done(&pvmw);
> + break;
What's this needed for? I'm suspicious because there wasn't an equivalent here
before.
Thanks,
Ryan
> + }
> }
>
> if (referenced)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag for large folios
2025-12-17 15:43 ` Ryan Roberts
@ 2025-12-18 7:15 ` Baolin Wang
2025-12-18 12:20 ` Ryan Roberts
0 siblings, 1 reply; 35+ messages in thread
From: Baolin Wang @ 2025-12-18 7:15 UTC (permalink / raw)
To: Ryan Roberts, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, willy, baohua, linux-mm, linux-arm-kernel,
linux-kernel
On 2025/12/17 23:43, Ryan Roberts wrote:
> Sorry I'm a bit late to the party...
Never mind. It's not late and comments are always welcome :)
> On 11/12/2025 08:16, Baolin Wang wrote:
>> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>> To support batch PTE operations for other sized large folios in the following
>> patches, adding a new parameter to specify the number of PTEs.
>>
>> While we are at it, rename the functions to maintain consistency with other
>> contpte_*() functions.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 0944e296dd4a..e03034683156 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep,
>> unsigned int nr, int full);
>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep);
>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep);
>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>
> The "contpte_" functions are intended to be private to the arm64 arch and should
> be exposed via the generic APIs. But I don't see any generic batched API for
> this, so you're only actually able to pass CONT_PTES as nr. Perhaps you're
> planning to define "test_and_clear_young_ptes()" and "clear_flush_young_ptes()"
> in later patches?
Right. This is a preparation patch, and will be used in patch 2.
>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep, unsigned int nr);
>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> if (likely(!pte_valid_cont(orig_pte)))
>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>
>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>> }
>>
>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> if (likely(!pte_valid_cont(orig_pte)))
>> return __ptep_clear_flush_young(vma, addr, ptep);
>>
>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>> }
>>
>> #define wrprotect_ptes wrprotect_ptes
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index c0557945939c..19b122441be3 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>> }
>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>
>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep)
>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> {
>> /*
>> * ptep_clear_flush_young() technically requires us to clear the access
>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> * having to unfold.
>> */
>>
>> + unsigned long start = addr;
>
> Personally I wouldn't bother defining start - just reuse addr. You're
> incrementing start in the below loop, so it's more appropriate to call it addr
> anyway.
OK.
>> + unsigned long end = start + nr * PAGE_SIZE;
>> int young = 0;
>> int i;
>>
>> - ptep = contpte_align_down(ptep);
>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>> + end = ALIGN(end, CONT_PTE_SIZE);
>>
>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>> + if (pte_cont(__ptep_get(ptep))) {
>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>> + ptep = contpte_align_down(ptep);
>> + }
>> +
>> + nr = (end - start) / PAGE_SIZE;
>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>
> Given you're now defining end, perhaps we don't need nr?
>
> for (; addr != end; ptep++, addr += PAGE_SIZE)
> young |= __ptep_test_and_clear_young(vma, addr, ptep);
Yes, good point.
>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>
>> return young;
>> }
>> -EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
>> +EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
>>
>> -int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep)
>> +int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> {
>> int young;
>>
>> - young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> + young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
>>
>> if (young) {
>> + unsigned long start = addr;
>> + unsigned long end = start + nr * PAGE_SIZE;
>> +
>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>> + end = ALIGN(end, CONT_PTE_SIZE);
>> +
>> + if (pte_cont(__ptep_get(ptep)))
>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>> +
>
> We now have this pattern of expanding contpte blocks up and down in 3 places.
> Perhaps create a helper?
Sounds reasonable. How about the following helper?
static pte_t *contpte_align_addr_ptep(unsigned long *start, unsigned long *end,
				      pte_t *ptep, unsigned int nr)
{
	unsigned long end_addr = *start + nr * PAGE_SIZE;

	if (pte_cont(__ptep_get(ptep + nr - 1)))
		*end = ALIGN(end_addr, CONT_PTE_SIZE);

	if (pte_cont(__ptep_get(ptep))) {
		*start = ALIGN_DOWN(*start, CONT_PTE_SIZE);
		ptep = contpte_align_down(ptep);
	}

	return ptep;
}
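The callers (e.g. contpte_test_and_clear_young_ptes()) would then reduce to
something like the sketch below, with 'end' pre-initialised since the helper
only extends it when the last PTE is contiguous:
"
	unsigned long end = addr + nr * PAGE_SIZE;

	ptep = contpte_align_addr_ptep(&addr, &end, ptep, nr);
	for (; addr != end; ptep++, addr += PAGE_SIZE)
		young |= __ptep_test_and_clear_young(vma, addr, ptep);
"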
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
2025-12-17 16:39 ` Ryan Roberts
@ 2025-12-18 7:47 ` Baolin Wang
2025-12-18 12:08 ` Ryan Roberts
0 siblings, 1 reply; 35+ messages in thread
From: Baolin Wang @ 2025-12-18 7:47 UTC (permalink / raw)
To: Ryan Roberts, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, willy, baohua, linux-mm, linux-arm-kernel,
linux-kernel
On 2025/12/18 00:39, Ryan Roberts wrote:
> On 11/12/2025 08:16, Baolin Wang wrote:
>> Currently, folio_referenced_one() always checks the young flag for each PTE
>> sequentially, which is inefficient for large folios. This inefficiency is
>> especially noticeable when reclaiming clean file-backed large folios, where
>> folio_referenced() is observed as a significant performance hotspot.
>>
>> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
>> an optimization to clear the young flags for PTEs within a contiguous range.
>> However, this is not sufficient. We can extend this to perform batched operations
>> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>>
>> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
>> of the young flags and flushing TLB entries, thereby improving performance
>> during large folio reclamation.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
>> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
>> from approximately 35% to around 5%.
>>
>> W/o patchset:
>> real 0m1.518s
>> user 0m0.000s
>> sys 0m1.518s
>>
>> W/ patchset:
>> real 0m1.018s
>> user 0m0.000s
>> sys 0m1.018s
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
>> include/linux/mmu_notifier.h | 9 +++++----
>> include/linux/pgtable.h | 19 +++++++++++++++++++
>> mm/rmap.c | 22 ++++++++++++++++++++--
>> 4 files changed, 55 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index e03034683156..a865bd8c46a3 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1869,6 +1869,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>> }
>>
>> +#define clear_flush_young_ptes clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> +{
>> + if (likely(nr == 1))
>> + return __ptep_clear_flush_young(vma, addr, ptep);
>
> Bug: This is broken if core-mm tries to call this for nr=1 on a pte that is part
> of a contpte mapping.
>
> The similar fastpaths are here to prevent regressing the common small folio case.
Thanks for catching this. I had considered this before, but I still
missed it.
> I guess here the best approach is (note no leading underscores):
>
> if (likely(nr == 1))
> return ptep_clear_flush_young(vma, addr, ptep);
However, I prefer to use pte_cont() to check it. Later, I plan to clean
up the ptep_clear_flush_young().
if (nr == 1 && !pte_cont(__ptep_get(ptep)))
	return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
>> +}
>> +
>> #define wrprotect_ptes wrprotect_ptes
>> static __always_inline void wrprotect_ptes(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep, unsigned int nr)
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d1094c2d5fb6..be594b274729 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>> range->owner = owner;
>> }
>>
>> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
>> +#define ptep_clear_flush_young_notify(__vma, __address, __ptep, __nr) \
>
> Shouldn't we rename this macro to clear_flush_young_ptes_notify()?
>
> And potentially:
>
> #define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
> clear_flush_young_ptes_notify(__vma, __address, __ptep, 1)
>
> if there are other non-batched users remaining.
There are no other non-batched users now, so it seems there is no need to
add another redundant API.
>> ({ \
>> int __young; \
>> struct vm_area_struct *___vma = __vma; \
>> unsigned long ___address = __address; \
>> - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
>> + unsigned int ___nr = __nr; \
>> + __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
>> __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
>> ___address, \
>> ___address + \
>> - PAGE_SIZE); \
>> + nr * PAGE_SIZE); \
>> __young; \
>> })
>> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>>
>> #define mmu_notifier_range_update_to_read_only(r) false
>>
>> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
>> +#define ptep_clear_flush_young_notify clear_flush_young_ptes
>> #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
>> #define ptep_clear_young_notify ptep_test_and_clear_young
>> #define pmdp_clear_young_notify pmdp_test_and_clear_young
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b13b6f42be3c..c7d0fd228cb7 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -947,6 +947,25 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> }
>> #endif
>>
>> +#ifndef clear_flush_young_ptes
>
> Let's have some function documentation here please.
Sure. Will do.
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> +{
>> + int young = 0;
>> +
>> + for (;;) {
>
> I know Lorenzo is pretty allergic to this style of looping :)
>
> He's right of course, we should probably just do this the idiomatic way and not
> worry about it looking a bit different to the others.
Let me use the 'while (--nr) { }' instead.
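i.e. presumably as a do/while so that all nr entries are still visited,
along the lines of:
"
do {
	young |= ptep_clear_flush_young(vma, addr, ptep);
	ptep++;
	addr += PAGE_SIZE;
} while (--nr);
"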
>
>> + young |= ptep_clear_flush_young(vma, addr, ptep);
>> + if (--nr == 0)
>> + break;
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + }
>> +
>> + return young;
>> +}
>> +#endif
>> +
>> /*
>> * On some architectures hardware does not set page access bit when accessing
>> * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index d6799afe1114..ec232165c47d 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -827,9 +827,11 @@ static bool folio_referenced_one(struct folio *folio,
>> struct folio_referenced_arg *pra = arg;
>> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>> int ptes = 0, referenced = 0;
>> + unsigned int nr;
>>
>> while (page_vma_mapped_walk(&pvmw)) {
>> address = pvmw.address;
>> + nr = 1;
>>
>> if (vma->vm_flags & VM_LOCKED) {
>> ptes++;
>> @@ -874,9 +876,21 @@ static bool folio_referenced_one(struct folio *folio,
>> if (lru_gen_look_around(&pvmw))
>> referenced++;
>> } else if (pvmw.pte) {
>> + if (folio_test_large(folio)) {
>> + unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
>> + unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
>> + pte_t pteval = ptep_get(pvmw.pte);
>> +
>> + nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
>> + }
>> +
>> + ptes += nr;
>> if (ptep_clear_flush_young_notify(vma, address,
>> - pvmw.pte))
>> + pvmw.pte, nr))
>> referenced++;
>> + /* Skip the batched PTEs */
>> + pvmw.pte += nr - 1;
>> + pvmw.address += (nr - 1) * PAGE_SIZE;
>
> The -1 part is because the walker will increment by 1 I'm guessing?
Right.
>
>> } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> if (pmdp_clear_flush_young_notify(vma, address,
>> pvmw.pmd))
>> @@ -886,7 +900,11 @@ static bool folio_referenced_one(struct folio *folio,
>> WARN_ON_ONCE(1);
>> }
>>
>> - pra->mapcount--;
>> + pra->mapcount -= nr;
>> + if (ptes == pvmw.nr_pages) {
>> + page_vma_mapped_walk_done(&pvmw);
>> + break;
>
> What's this needed for? I'm suspicious because there wasn't an equivalent here
> before.
If we are sure that we batched the entire folio, we can just optimize
and stop right here.
Thanks for reviewing.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag for large folios
2025-12-17 16:06 ` Ryan Roberts
@ 2025-12-18 7:56 ` Baolin Wang
0 siblings, 0 replies; 35+ messages in thread
From: Baolin Wang @ 2025-12-18 7:56 UTC (permalink / raw)
To: Ryan Roberts, Lorenzo Stoakes
Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, linux-mm,
linux-arm-kernel, linux-kernel
On 2025/12/18 00:06, Ryan Roberts wrote:
> On 17/12/2025 14:50, Lorenzo Stoakes wrote:
>> On Wed, Dec 17, 2025 at 11:53:31AM +0800, Baolin Wang wrote:
>>>
>>>
>>> On 2025/12/16 19:11, Lorenzo Stoakes wrote:
>>>> On Tue, Dec 16, 2025 at 11:32:58AM +0800, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2025/12/15 19:36, Lorenzo Stoakes wrote:
>>>>>> On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
>>>>>>> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
>>>>>>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>>>>>>> To support batch PTE operations for other sized large folios in the following
>>>>>>> patches, adding a new parameter to specify the number of PTEs.
>>>>>>>
>>>>>>> While we are at it, rename the functions to maintain consistency with other
>>>>>>> contpte_*() functions.
>>>>>>>
>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>> ---
>>>>>>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>>>>>>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>>>>>>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>>>>> index 0944e296dd4a..e03034683156 100644
>>>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>>>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>>>>> unsigned long addr, pte_t *ptep,
>>>>>>> unsigned int nr, int full);
>>>>>>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>>> - unsigned long addr, pte_t *ptep);
>>>>>>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>>> - unsigned long addr, pte_t *ptep);
>>>>>>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>>>>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>>>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>>>>
>>>>>> In core mm at least, as a convention, we strip 'extern' from header declarations
>>>>>> as we go as they're not necessary.
>>>>>
>>>>> Sure. I can remove the 'extern'.
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>>> I wonder about this naming convention also. The contpte_xxx_() prefix
>>>>>> obviously implies we are operating upon PTEs in the contiguous range, now
>>>>>> we are doing something... different and 'nr' is a bit vague.
>>>>>>
>>>>>> Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
>>>>>
>>>>> Yes, 'nr' here means nr_pages, which follows the convention of the
>>>>> parameters in contpte_xxx_() functions.
>>>>
>>>> Except you're violating the convention already by doing non-contpte stuff in
>>>> contpte functions?
>>>
>>> Nope. Sorry for my poor English, which might have caused some confusion. I
>>> followed the contpte's convention, and let me try to explain it again below.
>>>
>>>> The confusion is between nr of contpte ranges and nr of pages, so I think we
>>>> should be explicit.
>>>>
>>>>>
>>>>>> now we're not saying they're necessarily contiguous in the sense of contpte...
>>>>>>
>>>>>> I wonder whether we'd be better off introducing a new function that
>>>>>> explicitly has 'batch' or 'batched' in the name, and have the usual
>>>>>> contpte_***() stuff and callers that need batching using a new underlying helper?
>>>>>
>>>>> OK. I get your point. However, this is outside the scope of my patchset.
>>>>
>>>> Well, no, this is review feedback that needs to be addressed, I really don't
>>>> like this patch as-is.
>>>>
>>>> So for this to be upstreamable you really need to separate this out.
>>>>
>>>>>
>>>>> Perhaps we can clean up all the contpte_xxx_() functions if everyone agrees
>>>>> that this is confusing.
>>>>
>>>> I mean, unless Ryan explicitly makes a case for this being fine, I'm inclined to
>>>> reject the patch unless we do this.
>>>>
>>>> The contpte logic is already very confusing, and this is only adding to it.
>>>>
>>>>>
>>>>>> Obviously I defer to Ryan on all this, just my thoughts...
>>>>>>
>>>>>>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>>> pte_t *ptep, unsigned int nr);
>>>>>>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>>>>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>>>>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>>>>>>
>>>>>>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>>>>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>>>>>>> }
>>>>>>>
>>>>>>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>>>>>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>>>>> return __ptep_clear_flush_young(vma, addr, ptep);
>>>>>>>
>>>>>>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>>>>>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>>>>>> }
>>>>>>>
>>>>>>> #define wrprotect_ptes wrprotect_ptes
>>>>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>>>>> index c0557945939c..19b122441be3 100644
>>>>>>> --- a/arch/arm64/mm/contpte.c
>>>>>>> +++ b/arch/arm64/mm/contpte.c
>>>>>>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>>>>> }
>>>>>>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>>>>>>
>>>>>>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>>> - unsigned long addr, pte_t *ptep)
>>>>>>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>>>>> + unsigned long addr, pte_t *ptep,
>>>>>>> + unsigned int nr)
>>>>>>> {
>>>>>>> /*
>>>>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>>>>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>>> * having to unfold.
>>>>>>> */
>>>>>>
>>>>>> Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
>>>>>>
>>>>>> "And since we only create a contig range when the range is covered by a single
>>>>>> folio, we can get away with clearing young for the whole contig range here, so
>>>>>> we avoid having to unfold."
>>>>>
>>>>> I think the original comments are still clear:
>>>>> "
>>>>> /*
>>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>>> * flag for a _single_ pte. However, the core-mm code actually tracks
>>>>> * access/dirty per folio, not per page. And since we only create a
>>>>> * contig range when the range is covered by a single folio, we can get
>>>>> * away with clearing young for the whole contig range here, so we avoid
>>>>> * having to unfold.
>>>>> */
>>>>> "
>>>>
>>>> Except nr means you can span more than a single contig pte range? So this
>>>> comment is no longer applicable as-is?
>>>
>>> The nr means consecutive (present) PTEs that map consecutive pages of the
>>> same large folio in a single VMA and a single page table.
>>>
>>> I think this is a prerequisite. If you think it's necessary to explain it, I
>>> can add it to the comments.
>>>
>>>> You've changed the function, the comment needs to change too.
>>>>
>>>>>
>>>>>> However now you're allowing for large folios that are not contpte?
>>>>>>
>>>>>> Overall feeling pretty iffy about implementing this this way.
>>>>>
>>>>> Please see the above comments, which explain why we should do that.
>>>>
>>>> You haven't explained anything? Maybe I missed it?
>>>>
>>>> Can you explain clearly what nr might actually be in relation to contpte's? I'm
>>>> confused even as to the utility here.
>>>>
>>>> Is it batched ranges of contptes? But then why are you trying to see if end is a
>>>> contpte + expand the range if so?
>>>
>>> See below.
>>>
>>>>>>> + unsigned long start = addr;
>>>>>>> + unsigned long end = start + nr * PAGE_SIZE;
>>>>>>
>>>>>> So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
>>>>>> table?
>>>>>
>>>>> Caller has made sure that (see folio_referenced_one()).
>>>>
>>>> You should comment this somehow, I hate this nested assumptions stuff 'function
>>>> X which calls function Y which calls function Z makes sure that parameter A
>>>> fulfils requirements B but we don't mention it anywhere'.
>>>
>>> Sure.
>>>
>>>>>>> int young = 0;
>>>>>>> int i;
>>>>>>>
>>>>>>> - ptep = contpte_align_down(ptep);
>>>>>>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>>>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>>>>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>>>>>
>>>>>> OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
>>>>>> with other PTE entries present there not having PTE_CONT set (Ryan - do
>>>>>> correct me if I'm wrong!)
>>>>>>
>>>>>> I guess then no worry about running off the end of the PTE table here.
>>>>>>
>>>>>> I wonder about using pte_cont_addr_end() here, but I guess you are not
>>>>>> intending to limit to a range here but rather capture the entire contpte
>>>>>> range of the end.
>>>>>
>>>>> Right.
>>>>>
>>>>>> I don't love that now a function prefixed with contpte_... can have a
>>>>>> condition where part of the input range is _not_ a contpte entry, or is
>>>>>> specified partially...
>>>>>
>>>>> Like I said above, this is outside the scope of my patchset. If Ryan also
>>>>> thinks we need to do some cleanups for all the contpte_xxx() functions in
>>>>> the contpte.c file, we can send a follow-up patchset to rename all the
>>>>> contpte_xxx() functions to make it clear.
>>>>
>>>> It's really not outside of the scope of the patch set, this is review feedback
>>>> you have to address :) trying not to be a grumpy kernel dev here :>)
>>>>
>>>> In general in the kernel we don't accept confusing patches then defer making
>>>> them not-confusing until later.
>>>>
>>>> We review until the series is at the correct standard to be accepted before we
>>>> merge.
>>>>
>>>> As always I'm happy to be convinced that I'm mistaken or something's not
>>>> necessary, however!
>>>
>>> Yes, let me try to explain it again. See below.
>>>
>>>>>>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>>>>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>>>>>> + if (pte_cont(__ptep_get(ptep))) {
>>>>>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>>>>>> + ptep = contpte_align_down(ptep);
>>>>>>> + }
>>>>>>> +
>>>>>>> + nr = (end - start) / PAGE_SIZE;
>>>>>>
>>>>>> Hm don't love this reuse of input param.
>>>>>
>>>>> OK. Will use another local variable instead.
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>>>
>>>>>>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>>>>>>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>>>>>
>>>>>> Again, overall find it a bit iffy that we are essentially overloading the
>>>>>> code with non-contpte behaviour.
>>>>>
>>>>> Let me clarify further: this is the handling logic of the contpte_xxx()
>>>>> functions in the contpte.c file. For example, the function
>>>>> contpte_set_ptes() has a parameter 'nr', which may be greater than
>>>>> CONT_PTES, meaning that it can handle multiple CONT_PTES range sizes of a
>>>>> large folio.
>>>>>
>>>>> It might be a bit confusing, and I understand this is why you want to change
>>>>> the 'contpte_' prefix to the 'batch_' prefix. Of course, we can hope Ryan
>>>>> can provide some input on this.
>>>>>
>>>>> Thanks for reviewing.
>>>>
>>>> OK so we have some reeeeal confusion here. And this speaks to the patch needing
>>>> work.
>>>>
>>>> This seems to answer what I asked above - basically - are we doing _batches_ of
>>>> _contpte_ entries, or are we just accepting arbitrary large folios here and
>>>> batching up operations on them, including non-contpte entries.
>>>>
>>>> Presumably from the above you're saying - this is dealing with batches of
>>>> contpte entries.
>>>>
>>>> In which case this has to be made _super_ clear. Not just adding an arbitrary
>>>> 'nr' parameter and letting people guess.
>>>>
>>>> The contpte code is _already_ confusing and somewhat surprising if you're not
>>>> aware of it, we need to be going in the other direction.
>>>
>>> Sure. Sorry for causing confusion. I try to explain it again to make it
>>> clear.
>>>
>>> First, it is necessary to explain how the contpte implementation in Arm64 is
>>> exposed, using set_ptes() as an example.
>>>
>>> Arm64 defines an arch-specific implementation of set_ptes():
>>> "
>>> static __always_inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>> 				     pte_t *ptep, pte_t pte, unsigned int nr)
>>> {
>>> 	pte = pte_mknoncont(pte);
>>>
>>> 	if (likely(nr == 1)) {
>>> 		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> 		__set_ptes(mm, addr, ptep, pte, 1);
>>> 		contpte_try_fold(mm, addr, ptep, pte);
>>> 	} else {
>>> 		contpte_set_ptes(mm, addr, ptep, pte, nr);
>>> 	}
>>> }
>>> "
>>>
>>> Here, 'nr' refers to consecutive (present) PTEs that map consecutive pages
>>> of the same large folio within a single VMA and a single page table. Based
>>> on 'nr', it determines whether the PTE_CONT attribute should be set or not.
>>>
>>> Second, the underlying implementation of contpte_set_ptes() also calculates
>>> 'end' based on 'nr' and checks whether the PTE_CONT attribute should be set.
>>> For example:
>>>
>>> For a large folio with order = 3 (nr = 8), the PTE_CONT attribute will not
>>> be set since the 'next' is not aligned with CONT_PTE_MASK.
>>>
>>> For a large folio with order = 4 (nr = 16), it aligns perfectly, allowing
>>> the PTE_CONT attribute to be set for the entire PTE entries of the large
>>> folio.
>>>
>>> For a large folio with order = 5 (nr = 32), this involves two ranges of
>>> CONT_PTE_SIZE (16 PTEs in each range), requiring the PTE_CONT attribute to
>>> be set separately for each range.
>>>
>>> "
>>> void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> 		      pte_t *ptep, pte_t pte, unsigned int nr)
>>> {
>>> 	.......
>>>
>>> 	end = addr + (nr << PAGE_SHIFT);
>>> 	pfn = pte_pfn(pte);
>>> 	prot = pte_pgprot(pte);
>>>
>>> 	do {
>>> 		next = pte_cont_addr_end(addr, end);
>>> 		nr = (next - addr) >> PAGE_SHIFT;
>>> 		pte = pfn_pte(pfn, prot);
>>>
>>> 		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>>> 			pte = pte_mkcont(pte);
>>> 		else
>>> 			pte = pte_mknoncont(pte);
>>>
>>> 		__set_ptes(mm, addr, ptep, pte, nr);
>>>
>>> 		addr = next;
>>> 		ptep += nr;
>>> 		pfn += nr;
>>>
>>> 	} while (addr != end);
>>> }
>>> "
>>>
>>> Third, regarding the implementation of my patch, I have followed the contpte
>>> logic here by adding the 'nr' parameter to
>>> contpte_ptep_test_and_clear_young() for batch processing consecutive
>>> (present) PTEs. Moreover, the semantics of 'nr' remain consistent with the
>>> original contpte functions, and I have not broken the existing contpte
>>> implementation logic.
>>
>> OK that helps a lot thanks.
>>
>> So the nr will be number of PTEs that are consecutive, belong to the same folio,
>> and have the same PTE bit set excluding accessed, writable, dirty, soft-dirty.
>
> I'm not sure test_and_clear_young_ptes() would technically require all of these
> constraints. I think it would be sufficient to just ensure that all the PTEs in
> the batch are pte_present(). And I assume the return value is intended to
> indicate if *any* of the PTEs in the batch were previously young?
Agree.
> The usual order for adding new batched PTE helpers is to define the generic
> function along with its spec in a comment block and provide a default
> implementation that is just a loop around the existing non-batched
> version of the interface. Then you can convert the core-mm to use that API. And
> finally you can opt arm64 into overriding the API with a more efficient
> implementation.
>
> If you were to do it in that order, it tells a better story and is easier to
> follow, I think.
Good point. Let me rearrange the patch set.
>
>>
>> But since we are retrieving accessed bit state (young), isn't it a problem we're
>> ignoring this bit when considering what goes into the batch?
>>
>> I mean now it's going to return true even if some do not have the accessed flag
>> set?
>
> test_and_clear_young_ptes() is exploiting the fact that the kernel tracks young
> (and dirty) per *folio*. So having a young and dirty bit per PTE is providing
> more information than the kernel needs if a large folio is mapped across
> multiple PTEs. Which is why returning the logical OR of the access bits is ok here.
Right. Agree.
>> I am more convinced things are correct in style at least then this way as, as
>> you say, this is the approach taken in set_ptes().
>
> I'm personally ok with the style and I believe the implementation is correct.
> But it would help to reformulate the series in the order I suggest above since
> then we can review against the spec of what the core-mm helper is supposed to do.
Sure. Thanks for reviewing.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
2025-12-18 7:47 ` Baolin Wang
@ 2025-12-18 12:08 ` Ryan Roberts
2025-12-19 0:56 ` Baolin Wang
0 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2025-12-18 12:08 UTC (permalink / raw)
To: Baolin Wang, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, willy, baohua, linux-mm, linux-arm-kernel,
linux-kernel
>>> +#define clear_flush_young_ptes clear_flush_young_ptes
>>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep,
>>> + unsigned int nr)
>>> +{
>>> + if (likely(nr == 1))
>>> + return __ptep_clear_flush_young(vma, addr, ptep);
>>
>> Bug: This is broken if core-mm tries to call this for nr=1 on a pte that is part
>> of a contpte mapping.
>>
>> The similar fastpaths are here to prevent regressing the common small folio case.
>
> Thanks for catching this. I had considered this before, but I still missed it.
>
>> I guess here the best approach is (note no leading underscores):
>>
>> if (likely(nr == 1))
>> return ptep_clear_flush_young(vma, addr, ptep);
>
> However, I prefer to use pte_cont() to check it. Later, I plan to clean up the
> ptep_clear_flush_young().
>
> if (nr == 1 && !pte_cont(__ptep_get(ptep)))
> 	return __ptep_clear_flush_young(vma, addr, ptep);
Sure. That would follow the pattern in clear_young_dirty_ptes(). Please use the
likely() hint as is done everywhere else:
if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
I notice that ptep_test_and_clear_young() and ptep_clear_flush_young() are both
testing against pte_valid_cont(). These could probably be relaxed to pte_cont()
since it is implicit that the pte must be valid?
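i.e. roughly (sketch only):
"
	if (likely(!pte_cont(orig_pte)))
		return __ptep_test_and_clear_young(vma, addr, ptep);

	return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
"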
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag for large folios
2025-12-18 7:15 ` Baolin Wang
@ 2025-12-18 12:20 ` Ryan Roberts
2025-12-19 1:00 ` Baolin Wang
0 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2025-12-18 12:20 UTC (permalink / raw)
To: Baolin Wang, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, willy, baohua, linux-mm, linux-arm-kernel,
linux-kernel
On 18/12/2025 07:15, Baolin Wang wrote:
>
>
> On 2025/12/17 23:43, Ryan Roberts wrote:
>> Sorry I'm a bit late to the party...
>
> Never mind. It's not late and comments are always welcome :)
>
>> On 11/12/2025 08:16, Baolin Wang wrote:
>>> Currently, contpte_ptep_test_and_clear_young() and
>>> contpte_ptep_clear_flush_young()
>>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>>> To support batch PTE operations for other sized large folios in the following
>>> patches, adding a new parameter to specify the number of PTEs.
>>>
>>> While we are at it, rename the functions to maintain consistency with other
>>> contpte_*() functions.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 0944e296dd4a..e03034683156 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct
>>> *mm, unsigned long addr,
>>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep,
>>> unsigned int nr, int full);
>>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> - unsigned long addr, pte_t *ptep);
>>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> - unsigned long addr, pte_t *ptep);
>>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>
>> The "contpte_" functions are intended to be private to the arm64 arch and should
>> be exposed via the generic APIs. But I don't see any generic batched API for
>> this, so you're only actually able to pass CONT_PTES as nr. Perhaps you're
>> planning to define "test_and_clear_young_ptes()" and "clear_flush_young_ptes()"
>> in later patches?
>
> Right. This is a preparation patch, and will be used in patch 2.
>
>>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>> pte_t *ptep, unsigned int nr);
>>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct
>>> vm_area_struct *vma,
>>> if (likely(!pte_valid_cont(orig_pte)))
>>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>>> }
>>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct
>>> vm_area_struct *vma,
>>> if (likely(!pte_valid_cont(orig_pte)))
>>> return __ptep_clear_flush_young(vma, addr, ptep);
>>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>> }
>>> #define wrprotect_ptes wrprotect_ptes
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> index c0557945939c..19b122441be3 100644
>>> --- a/arch/arm64/mm/contpte.c
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>> }
>>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> - unsigned long addr, pte_t *ptep)
>>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep,
>>> + unsigned int nr)
>>> {
>>> /*
>>> * ptep_clear_flush_young() technically requires us to clear the access
>>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct
>>> vm_area_struct *vma,
>>> * having to unfold.
>>> */
>>> + unsigned long start = addr;
>>
>> Personally I wouldn't bother defining start - just reuse addr. You're
>> incrementing start in the below loop, so it's more appropriate to call it addr
>> anyway.
>
> OK.
>
>>> + unsigned long end = start + nr * PAGE_SIZE;
>>> int young = 0;
>>> int i;
>>> - ptep = contpte_align_down(ptep);
>>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>> + if (pte_cont(__ptep_get(ptep))) {
>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>> + ptep = contpte_align_down(ptep);
>>> + }
>>> +
>>> + nr = (end - start) / PAGE_SIZE;
>>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>>
>> Given you're now defining end, perhaps we don't need nr?
>>
>> for (; addr != end; ptep++, addr += PAGE_SIZE)
>> young |= __ptep_test_and_clear_young(vma, addr, ptep);
>
> Yes, good point.
>
>>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>> return young;
>>> }
>>> -EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
>>> +EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
>>> -int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> - unsigned long addr, pte_t *ptep)
>>> +int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep,
>>> + unsigned int nr)
>>> {
>>> int young;
>>> - young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>> + young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
>>> if (young) {
>>> + unsigned long start = addr;
>>> + unsigned long end = start + nr * PAGE_SIZE;
>>> +
>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>> +
>>> + if (pte_cont(__ptep_get(ptep)))
>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>> +
>>
>> We now have this pattern of expanding contpte blocks up and down in 3 places.
>> Perhaps create a helper?
>
> Sounds reasonable. How about the following helper?
>
> static pte_t *contpte_align_addr_ptep(unsigned long *start, unsigned long *end,
> pte_t *ptep, unsigned int nr)
> {
> unsigned long end_addr = *start + nr * PAGE_SIZE;
>
> if (pte_cont(__ptep_get(ptep + nr - 1)))
I think this is safe but calling it out to check; you're not checking that the
pte is valid, so theoretically you could have a swap-entry here with whatever
overlays the contiguous bit set. So then you would incorrectly extend.
But I think it is safe because the expectation is that core-mm has already
checked that the whole range is present?
> *end = ALIGN(end_addr, CONT_PTE_SIZE);
>
> if (pte_cont(__ptep_get(ptep))) {
> *start = ALIGN_DOWN(*start, CONT_PTE_SIZE);
> ptep = contpte_align_down(ptep);
> }
>
> return ptep;
> }
Looks good.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 2/3] mm: rmap: support batched checks of the references for large folios
2025-12-18 12:08 ` Ryan Roberts
@ 2025-12-19 0:56 ` Baolin Wang
0 siblings, 0 replies; 35+ messages in thread
From: Baolin Wang @ 2025-12-19 0:56 UTC (permalink / raw)
To: Ryan Roberts, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, willy, baohua, linux-mm, linux-arm-kernel,
linux-kernel
On 2025/12/18 20:08, Ryan Roberts wrote:
>>>> +#define clear_flush_young_ptes clear_flush_young_ptes
>>>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep,
>>>> + unsigned int nr)
>>>> +{
>>>> + if (likely(nr == 1))
>>>> + return __ptep_clear_flush_young(vma, addr, ptep);
>>>
>>> Bug: This is broken if core-mm tries to call this for nr=1 on a pte that is part
>>> of a contpte mapping.
>>>
>>> The similar fastpaths are here to prevent regressing the common small folio case.
>>
>> Thanks for catching this. I had considered this before, but I still missed it.
>>
>>> I guess here the best approach is (note no leading underscores):
>>>
>>> if (likely(nr == 1))
>>> return ptep_clear_flush_young(vma, addr, ptep);
>>
>> However, I prefer to use pte_cont() to check it. Later, I plan to clean up the
>> ptep_clear_flush_young().
>>
>> if (nr == 1 && !pte_cont(__ptep_get(ptep)))
>> 	return __ptep_clear_flush_young(vma, addr, ptep);
>
> Sure. That would follow the pattern in clear_young_dirty_ptes(). Please use the
> likely() hint as is done everywhere else:
>
> if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
Sure.
> I notice that ptep_test_and_clear_young() and ptep_clear_flush_young() are both
> testing against pte_valid_cont(). These could probably be relaxed to pte_cont()
> since it is implicit that the pte must be valid?
Yes, I think so. I can do a cleanup later in a separate patch.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag for large folios
2025-12-18 12:20 ` Ryan Roberts
@ 2025-12-19 1:00 ` Baolin Wang
0 siblings, 0 replies; 35+ messages in thread
From: Baolin Wang @ 2025-12-19 1:00 UTC (permalink / raw)
To: Ryan Roberts, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, willy, baohua, linux-mm, linux-arm-kernel,
linux-kernel
On 2025/12/18 20:20, Ryan Roberts wrote:
> On 18/12/2025 07:15, Baolin Wang wrote:
>>
>>
>> On 2025/12/17 23:43, Ryan Roberts wrote:
>>> Sorry I'm a bit late to the party...
>>
>> Never mind. It's not late and comments are always welcome :)
>>
>>> On 11/12/2025 08:16, Baolin Wang wrote:
>>>> Currently, contpte_ptep_test_and_clear_young() and
>>>> contpte_ptep_clear_flush_young()
>>>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>>>> To support batch PTE operations for other sized large folios in the following
>>>> patches, adding a new parameter to specify the number of PTEs.
>>>>
>>>> While we are at it, rename the functions to maintain consistency with other
>>>> contpte_*() functions.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>>>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>>>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 0944e296dd4a..e03034683156 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct
>>>> *mm, unsigned long addr,
>>>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>> unsigned long addr, pte_t *ptep,
>>>> unsigned int nr, int full);
>>>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep);
>>>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep);
>>>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>
>>> The "contpte_" functions are intended to be private to the arm64 arch and should
>>> be exposed via the generic APIs. But I don't see any generic batched API for
>>> this, so you're only actually able to pass CONT_PTES as nr. Perhaps you're
>>> planning to define "test_and_clear_young_ptes()" and "clear_flush_young_ptes()"
>>> in later patches?
>>
>> Right. This is a preparation patch, and will be used in patch 2.
>>
>>>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>> pte_t *ptep, unsigned int nr);
>>>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct
>>>> vm_area_struct *vma,
>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>>>> }
>>>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct
>>>> vm_area_struct *vma,
>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>> return __ptep_clear_flush_young(vma, addr, ptep);
>>>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>>> }
>>>> #define wrprotect_ptes wrprotect_ptes
>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>> index c0557945939c..19b122441be3 100644
>>>> --- a/arch/arm64/mm/contpte.c
>>>> +++ b/arch/arm64/mm/contpte.c
>>>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>> }
>>>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep)
>>>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep,
>>>> + unsigned int nr)
>>>> {
>>>> /*
>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> * having to unfold.
>>>> */
>>>> + unsigned long start = addr;
>>>
>>> Personally I wouldn't bother defining start - just reuse addr. You're
>>> incrementing start in the below loop, so it's more appropriate to call it addr
>>> anyway.
>>
>> OK.
>>
>>>> + unsigned long end = start + nr * PAGE_SIZE;
>>>> int young = 0;
>>>> int i;
>>>> - ptep = contpte_align_down(ptep);
>>>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>>> + if (pte_cont(__ptep_get(ptep))) {
>>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>>> + ptep = contpte_align_down(ptep);
>>>> + }
>>>> +
>>>> + nr = (end - start) / PAGE_SIZE;
>>>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>>>
>>> Given you're now defining end, perhaps we don't need nr?
>>>
>>> for (; addr != end; ptep++, addr += PAGE_SIZE)
>>> young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>
>> Yes, good point.
>>
>>>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>>> return young;
>>>> }
>>>> -EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
>>>> +EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
>>>> -int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep)
>>>> +int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep,
>>>> + unsigned int nr)
>>>> {
>>>> int young;
>>>> - young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>> + young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
>>>> if (young) {
>>>> + unsigned long start = addr;
>>>> + unsigned long end = start + nr * PAGE_SIZE;
>>>> +
>>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>>> +
>>>> + if (pte_cont(__ptep_get(ptep)))
>>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>>> +
>>>
>>> We now have this pattern of expanding contpte blocks up and down in 3 places.
>>> Perhaps create a helper?
>>
>> Sounds reasonable. How about the following helper?
>>
>> static pte_t *contpte_align_addr_ptep(unsigned long *start, unsigned long *end,
>> pte_t *ptep, unsigned int nr)
>> {
>> unsigned long end_addr = *start + nr * PAGE_SIZE;
>>
>> if (pte_cont(__ptep_get(ptep + nr - 1)))
>
> I think this is safe, but calling it out to check: you're not checking that the
> pte is valid, so theoretically you could have a swap entry here with whatever
> bit overlays the contiguous bit set, and then you would incorrectly extend the range.
>
> But I think it is safe because the expectation is that core-mm has already
> checked that the whole range is present?
Yes. They must be present PTEs that map consecutive pages of the same
large folio within a single VMA and a single page table. I will add some
comments to make this clear.
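
Something along these lines (exact wording still to be decided):

/*
 * The caller must guarantee that the 'nr' PTEs starting at 'ptep' are all
 * present and map consecutive pages of the same large folio, within a
 * single VMA and a single page table.
 */
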
>> *end = ALIGN(end_addr, CONT_PTE_SIZE);
>>
>> if (pte_cont(__ptep_get(ptep))) {
>> *start = ALIGN_DOWN(*start, CONT_PTE_SIZE);
>> ptep = contpte_align_down(ptep);
>> }
>>
>> return ptep;
>> }
>
> Looks good.
Thanks for reviewing.
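
For completeness, here is a rough sketch of how I expect
contpte_test_and_clear_young_ptes() to end up using the helper
(illustrative only, the actual code in v3 may differ; note the sketch
always writes *end, so callers don't need to pre-initialise it):

static pte_t *contpte_align_addr_ptep(unsigned long *start, unsigned long *end,
				      pte_t *ptep, unsigned int nr)
{
	*end = *start + nr * PAGE_SIZE;

	/* Expand the range to cover a contpte block overlapping either end. */
	if (pte_cont(__ptep_get(ptep + nr - 1)))
		*end = ALIGN(*end, CONT_PTE_SIZE);

	if (pte_cont(__ptep_get(ptep))) {
		*start = ALIGN_DOWN(*start, CONT_PTE_SIZE);
		ptep = contpte_align_down(ptep);
	}

	return ptep;
}

int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
				      unsigned long addr, pte_t *ptep,
				      unsigned int nr)
{
	unsigned long end;
	int young = 0;

	ptep = contpte_align_addr_ptep(&addr, &end, ptep, nr);

	for (; addr != end; ptep++, addr += PAGE_SIZE)
		young |= __ptep_test_and_clear_young(vma, addr, ptep);

	return young;
}

contpte_clear_flush_young_ptes() would call the same helper to compute
the range that needs flushing.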
^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread
Thread overview: 35+ messages
2025-12-11 8:16 [PATCH v2 0/3] support batch checking of references and unmapping for large folios Baolin Wang
2025-12-11 8:16 ` [PATCH v2 1/3] arm64: mm: support batch clearing of the young flag " Baolin Wang
2025-12-15 11:36 ` Lorenzo Stoakes
2025-12-16 3:32 ` Baolin Wang
2025-12-16 11:11 ` Lorenzo Stoakes
2025-12-17 3:53 ` Baolin Wang
2025-12-17 14:50 ` Lorenzo Stoakes
2025-12-17 16:06 ` Ryan Roberts
2025-12-18 7:56 ` Baolin Wang
2025-12-17 15:43 ` Ryan Roberts
2025-12-18 7:15 ` Baolin Wang
2025-12-18 12:20 ` Ryan Roberts
2025-12-19 1:00 ` Baolin Wang
2025-12-11 8:16 ` [PATCH v2 2/3] mm: rmap: support batched checks of the references " Baolin Wang
2025-12-15 12:22 ` Lorenzo Stoakes
2025-12-16 3:47 ` Baolin Wang
2025-12-17 6:23 ` Dev Jain
2025-12-17 6:44 ` Baolin Wang
2025-12-17 6:49 ` Dev Jain
2025-12-17 7:09 ` Baolin Wang
2025-12-17 7:23 ` Dev Jain
2025-12-17 16:39 ` Ryan Roberts
2025-12-18 7:47 ` Baolin Wang
2025-12-18 12:08 ` Ryan Roberts
2025-12-19 0:56 ` Baolin Wang
2025-12-11 8:16 ` [PATCH v2 3/3] mm: rmap: support batched unmapping for file " Baolin Wang
2025-12-11 12:36 ` Barry Song
2025-12-15 12:38 ` Lorenzo Stoakes
2025-12-16 5:48 ` Baolin Wang
2025-12-16 6:13 ` Barry Song
2025-12-16 6:22 ` Baolin Wang
2025-12-16 10:54 ` Lorenzo Stoakes
2025-12-17 3:11 ` Baolin Wang
2025-12-17 14:28 ` Lorenzo Stoakes
2025-12-16 10:53 ` Lorenzo Stoakes