* [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping
@ 2026-03-10 7:30 Dev Jain
2026-03-10 7:30 ` [PATCH 1/9] mm/rmap: make nr_pages signed in try_to_unmap_one Dev Jain
` (10 more replies)
0 siblings, 11 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-10 7:30 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong
Cc: weixugc, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual, Dev Jain
Speed up unmapping of anonymous large folios by clearing the ptes and
setting the swap ptes in one go.
The following benchmark (stolen from Barry at [1]) is used to measure the
time taken to swap out 256M of memory backed by 64K large folios:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>

#define SIZE_MB 256
#define SIZE_BYTES (SIZE_MB * 1024 * 1024)

int main() {
	void *addr = mmap(NULL, SIZE_BYTES, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap failed");
		return 1;
	}

	memset(addr, 0, SIZE_BYTES);

	struct timespec start, end;
	clock_gettime(CLOCK_MONOTONIC, &start);

	if (madvise(addr, SIZE_BYTES, MADV_PAGEOUT) != 0) {
		perror("madvise(MADV_PAGEOUT) failed");
		munmap(addr, SIZE_BYTES);
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &end);

	long duration_ns = (end.tv_sec - start.tv_sec) * 1e9 +
			   (end.tv_nsec - start.tv_nsec);

	printf("madvise(MADV_PAGEOUT) took %ld ns (%.3f ms)\n",
	       duration_ns, duration_ns / 1e6);

	munmap(addr, SIZE_BYTES);
	return 0;
}
On arm64, showing a representative value from the middle of the
distribution:
without patch:
madvise(MADV_PAGEOUT) took 52192959 ns (52.193 ms)
with patch:
madvise(MADV_PAGEOUT) took 26676625 ns (26.677 ms)
[1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/
---
Based on mm-unstable bb420884e9e0. mm-selftests pass.
Dev Jain (9):
mm/rmap: make nr_pages signed in try_to_unmap_one
mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
mm/rmap: refactor lazyfree unmap commit path to
commit_ttu_lazyfree_folio()
mm/memory: Batch set uffd-wp markers during zapping
mm/rmap: batch unmap folios belonging to uffd-wp VMAs
mm/swapfile: Make folio_dup_swap batchable
mm/swapfile: Make folio_put_swap batchable
mm/rmap: introduce folio_try_share_anon_rmap_ptes
mm/rmap: enable batch unmapping of anonymous folios
include/linux/mm_inline.h | 37 +++--
include/linux/page-flags.h | 11 ++
include/linux/rmap.h | 38 ++++-
mm/internal.h | 26 ++++
mm/memory.c | 26 +---
mm/mprotect.c | 17 ---
mm/rmap.c | 274 ++++++++++++++++++++++++-------------
mm/shmem.c | 8 +-
mm/swap.h | 10 +-
mm/swapfile.c | 25 ++--
10 files changed, 298 insertions(+), 174 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 1/9] mm/rmap: make nr_pages signed in try_to_unmap_one
2026-03-10 7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
@ 2026-03-10 7:30 ` Dev Jain
2026-03-10 7:56 ` Lorenzo Stoakes (Oracle)
2026-03-10 7:30 ` [PATCH 2/9] mm/rmap: initialize nr_pages to 1 at loop start " Dev Jain
` (9 subsequent siblings)
10 siblings, 1 reply; 46+ messages in thread
From: Dev Jain @ 2026-03-10 7:30 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong
Cc: weixugc, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual, Dev Jain
Currently, nr_pages is defined as unsigned long. We use nr_pages to
manipulate the mm rss counters for lazyfree folios as follows:

add_mm_counter(mm, MM_ANONPAGES, -nr_pages);

Suppose nr_pages = 3. Negating it wraps around to ULONG_MAX - 2. Then,
since add_mm_counter() takes this value as a long, ULONG_MAX - 2 does not
fit into the positive range of long and gets converted back to -3.
Eventually all of this works out, but to keep things simple, declare
nr_pages as a signed variable.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/rmap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 6398d7eef393f..087c9f5b884fe 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1979,9 +1979,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
struct page *subpage;
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;
- unsigned long nr_pages = 1, end_addr;
+ unsigned long end_addr;
unsigned long pfn;
unsigned long hsz = 0;
+ long nr_pages = 1;
int ptes = 0;
/*
--
2.34.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 2/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
2026-03-10 7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
2026-03-10 7:30 ` [PATCH 1/9] mm/rmap: make nr_pages signed in try_to_unmap_one Dev Jain
@ 2026-03-10 7:30 ` Dev Jain
2026-03-10 8:10 ` Lorenzo Stoakes (Oracle)
2026-03-10 7:30 ` [PATCH 3/9] mm/rmap: refactor lazyfree unmap commit path to commit_ttu_lazyfree_folio() Dev Jain
` (8 subsequent siblings)
10 siblings, 1 reply; 46+ messages in thread
From: Dev Jain @ 2026-03-10 7:30 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong
Cc: weixugc, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual, Dev Jain
Initialize nr_pages to 1 at the start of the loop, similar to what is done
in folio_referenced_one(). Otherwise, the nr_pages computed by a previous
call to folio_unmap_pte_batch() may get reused on an iteration that never
goes through folio_unmap_pte_batch(), corrupting the accounting. I don't
think there is an actual bug right now: one would arise only if, within a
single call to try_to_unmap_one(), we first took the pte_present(pteval)
branch and then took the else branch doing pte_clear() for a
device-exclusive pte. That would mean a lazyfree folio is mapped by some
present entries and some device-exclusive entries. But since a
device-exclusive pte implies a GUP reference on the underlying folio, the
lazyfree unmapping path will notice the extra reference and abort
try_to_unmap_one().
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/rmap.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 087c9f5b884fe..1fa020edd954a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1982,7 +1982,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
unsigned long end_addr;
unsigned long pfn;
unsigned long hsz = 0;
- long nr_pages = 1;
+ long nr_pages;
int ptes = 0;
/*
@@ -2019,6 +2019,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(&range);
while (page_vma_mapped_walk(&pvmw)) {
+ nr_pages = 1;
+
/*
* If the folio is in an mlock()d vma, we must not swap it out.
*/
--
2.34.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 3/9] mm/rmap: refactor lazyfree unmap commit path to commit_ttu_lazyfree_folio()
2026-03-10 7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
2026-03-10 7:30 ` [PATCH 1/9] mm/rmap: make nr_pages signed in try_to_unmap_one Dev Jain
2026-03-10 7:30 ` [PATCH 2/9] mm/rmap: initialize nr_pages to 1 at loop start " Dev Jain
@ 2026-03-10 7:30 ` Dev Jain
2026-03-10 8:19 ` Lorenzo Stoakes (Oracle)
2026-03-10 7:30 ` [PATCH 4/9] mm/memory: Batch set uffd-wp markers during zapping Dev Jain
` (7 subsequent siblings)
10 siblings, 1 reply; 46+ messages in thread
From: Dev Jain @ 2026-03-10 7:30 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong
Cc: weixugc, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual, Dev Jain
Clean up the code by refactoring the post-pte-clearing path of lazyfree
folio unmapping into commit_ttu_lazyfree_folio().

No functional change is intended.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/rmap.c | 93 ++++++++++++++++++++++++++++++++-----------------------
1 file changed, 54 insertions(+), 39 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 1fa020edd954a..a61978141ee3f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1966,6 +1966,57 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
}
+static inline int commit_ttu_lazyfree_folio(struct vm_area_struct *vma,
+ struct folio *folio, unsigned long address, pte_t *ptep,
+ pte_t pteval, long nr_pages)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ int ref_count, map_count;
+
+ /*
+ * Synchronize with gup_pte_range():
+ * - clear PTE; barrier; read refcount
+ * - inc refcount; barrier; read PTE
+ */
+ smp_mb();
+
+ ref_count = folio_ref_count(folio);
+ map_count = folio_mapcount(folio);
+
+ /*
+ * Order reads for page refcount and dirty flag
+ * (see comments in __remove_mapping()).
+ */
+ smp_rmb();
+
+ if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+ /*
+ * redirtied either using the page table or a previously
+ * obtained GUP reference.
+ */
+ set_ptes(mm, address, ptep, pteval, nr_pages);
+ folio_set_swapbacked(folio);
+ return 1;
+ }
+
+ if (ref_count != 1 + map_count) {
+ /*
+ * Additional reference. Could be a GUP reference or any
+ * speculative reference. GUP users must mark the folio
+ * dirty if there was a modification. This folio cannot be
+ * reclaimed right now either way, so act just like nothing
+ * happened.
+ * We'll come back here later and detect if the folio was
+ * dirtied when the additional reference is gone.
+ */
+ set_ptes(mm, address, ptep, pteval, nr_pages);
+ return 1;
+ }
+
+ add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
+ return 0;
+}
+
/*
* @arg: enum ttu_flags will be passed to this argument
*/
@@ -2227,46 +2278,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
/* MADV_FREE page check */
if (!folio_test_swapbacked(folio)) {
- int ref_count, map_count;
-
- /*
- * Synchronize with gup_pte_range():
- * - clear PTE; barrier; read refcount
- * - inc refcount; barrier; read PTE
- */
- smp_mb();
-
- ref_count = folio_ref_count(folio);
- map_count = folio_mapcount(folio);
-
- /*
- * Order reads for page refcount and dirty flag
- * (see comments in __remove_mapping()).
- */
- smp_rmb();
-
- if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
- /*
- * redirtied either using the page table or a previously
- * obtained GUP reference.
- */
- set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
- folio_set_swapbacked(folio);
+ if (commit_ttu_lazyfree_folio(vma, folio, address,
+ pvmw.pte, pteval,
+ nr_pages))
goto walk_abort;
- } else if (ref_count != 1 + map_count) {
- /*
- * Additional reference. Could be a GUP reference or any
- * speculative reference. GUP users must mark the folio
- * dirty if there was a modification. This folio cannot be
- * reclaimed right now either way, so act just like nothing
- * happened.
- * We'll come back here later and detect if the folio was
- * dirtied when the additional reference is gone.
- */
- set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
- goto walk_abort;
- }
- add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
goto discard;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 4/9] mm/memory: Batch set uffd-wp markers during zapping
2026-03-10 7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
` (2 preceding siblings ...)
2026-03-10 7:30 ` [PATCH 3/9] mm/rmap: refactor lazyfree unmap commit path to commit_ttu_lazyfree_folio() Dev Jain
@ 2026-03-10 7:30 ` Dev Jain
2026-03-10 7:30 ` [PATCH 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs Dev Jain
` (6 subsequent siblings)
10 siblings, 0 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-10 7:30 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong
Cc: weixugc, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual, Dev Jain
In preparation for the next patch, enable batch-setting of uffd-wp ptes.
The code paths that pass nr > 1 to zap_install_uffd_wp_if_needed() obtain
that nr from either folio_pte_batch() or swap_pte_batch(), which
guarantees that all ptes in the batch belong to the same VMA (hence are
uniformly anonymous or non-anonymous, wp-armed or not), and are either all
marked with uffd-wp or all unmarked.

Note that we have to use set_pte_at() in a loop instead of set_ptes(),
since the latter cannot handle a present->non-present conversion for
nr_pages > 1.

Convert the documentation of install_uffd_wp_ptes_if_needed() to kerneldoc
format.

No functional change is intended.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
include/linux/mm_inline.h | 37 +++++++++++++++++++++++--------------
mm/memory.c | 20 +-------------------
mm/rmap.c | 2 +-
3 files changed, 25 insertions(+), 34 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index ad50688d89dba..d69b9abbdf2a7 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -560,21 +560,30 @@ static inline pte_marker copy_pte_marker(
return dstm;
}
-/*
- * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
- * replace a none pte. NOTE! This should only be called when *pte is already
+/**
+ * install_uffd_wp_ptes_if_needed - install uffd-wp marker on PTEs that map
+ * consecutive pages of the same large folio.
+ * @vma: The VMA the pages are mapped into.
+ * @addr: Address the first page of this batch is mapped at.
+ * @ptep: Page table pointer for the first entry of this batch.
+ * @pteval: old value of the entry pointed to by ptep.
+ * @nr: Number of entries to clear (batch size).
+ *
+ * If the ptes were wr-protected by uffd-wp in any form, arm special ptes to
+ * replace none ptes. NOTE! This should only be called when *pte is already
* cleared so we will never accidentally replace something valuable. Meanwhile
* none pte also means we are not demoting the pte so tlb flushed is not needed.
* E.g., when pte cleared the caller should have taken care of the tlb flush.
*
- * Must be called with pgtable lock held so that no thread will see the none
- * pte, and if they see it, they'll fault and serialize at the pgtable lock.
+ * Context: The caller holds the page table lock. The PTEs map consecutive
+ * pages that belong to the same folio. The PTEs are all in the same PMD
+ * and the same VMA.
*
- * Returns true if an uffd-wp pte was installed, false otherwise.
+ * Returns true if uffd-wp ptes were installed, false otherwise.
*/
static inline bool
-pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
- pte_t *pte, pte_t pteval)
+install_uffd_wp_ptes_if_needed(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *pte, pte_t pteval, unsigned int nr)
{
bool arm_uffd_pte = false;
@@ -604,13 +613,13 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
if (unlikely(pte_swp_uffd_wp_any(pteval)))
arm_uffd_pte = true;
- if (unlikely(arm_uffd_pte)) {
- set_pte_at(vma->vm_mm, addr, pte,
- make_pte_marker(PTE_MARKER_UFFD_WP));
- return true;
- }
+ if (likely(!arm_uffd_pte))
+ return false;
- return false;
+ for (int i = 0; i < nr; ++i, ++pte, addr += PAGE_SIZE)
+ set_pte_at(vma->vm_mm, addr, pte, make_pte_marker(PTE_MARKER_UFFD_WP));
+
+ return true;
}
static inline bool vma_has_recency(const struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index 38062f8e11656..768646c0b3b6a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1594,29 +1594,11 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
unsigned long addr, pte_t *pte, int nr,
struct zap_details *details, pte_t pteval)
{
- bool was_installed = false;
-
- if (!uffd_supports_wp_marker())
- return false;
-
- /* Zap on anonymous always means dropping everything */
- if (vma_is_anonymous(vma))
- return false;
-
if (zap_drop_markers(details))
return false;
- for (;;) {
- /* the PFN in the PTE is irrelevant. */
- if (pte_install_uffd_wp_if_needed(vma, addr, pte, pteval))
- was_installed = true;
- if (--nr == 0)
- break;
- pte++;
- addr += PAGE_SIZE;
- }
+ return install_uffd_wp_ptes_if_needed(vma, addr, pte, pteval, nr);
- return was_installed;
}
static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
diff --git a/mm/rmap.c b/mm/rmap.c
index a61978141ee3f..a7570cd037344 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2235,7 +2235,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* we may want to replace a none pte with a marker pte if
* it's file-backed, so we don't lose the tracking info.
*/
- pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+ install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, 1);
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
--
2.34.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs
2026-03-10 7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
` (3 preceding siblings ...)
2026-03-10 7:30 ` [PATCH 4/9] mm/memory: Batch set uffd-wp markers during zapping Dev Jain
@ 2026-03-10 7:30 ` Dev Jain
2026-03-10 8:34 ` Lorenzo Stoakes (Oracle)
2026-03-10 7:30 ` [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable Dev Jain
` (5 subsequent siblings)
10 siblings, 1 reply; 46+ messages in thread
From: Dev Jain @ 2026-03-10 7:30 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong
Cc: weixugc, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual, Dev Jain
The ptes in a batch all belong to the same VMA, and are either all marked
with uffd-wp or all unmarked. Therefore we can batch-set uffd-wp markers
through install_uffd_wp_ptes_if_needed(), and enable batched unmapping of
folios belonging to uffd-wp VMAs by dropping that condition from
folio_unmap_pte_batch().

It may happen that we do not batch over the entire folio in one go, in
which case we must skip over the current batch. Add a helper to do that:
page_vma_mapped_walk_jump() advances the relevant fields of pvmw by nr
pages.

I think we could get away with incrementing only pvmw->pte and
pvmw->address: looking at the code in page_vma_mapped.c, pvmw->pfn and
pvmw->nr_pages are only ever used together, as are pvmw->pgoff and
pvmw->nr_pages (in vma_address_end()), so the increment and decrement in
the respective fields cancel out. But let us not rely on the pvmw
implementation and keep this simple.

Export this function to rmap.h to enable future reuse.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
include/linux/rmap.h | 10 ++++++++++
mm/rmap.c | 8 +++-----
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8dc0871e5f001..1b7720c66ac87 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -892,6 +892,16 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
spin_unlock(pvmw->ptl);
}
+static inline void page_vma_mapped_walk_jump(struct page_vma_mapped_walk *pvmw,
+ unsigned int nr)
+{
+ pvmw->pfn += nr;
+ pvmw->nr_pages -= nr;
+ pvmw->pgoff += nr;
+ pvmw->pte += nr;
+ pvmw->address += nr * PAGE_SIZE;
+}
+
/**
* page_vma_mapped_walk_restart - Restart the page table walk.
* @pvmw: Pointer to struct page_vma_mapped_walk.
diff --git a/mm/rmap.c b/mm/rmap.c
index a7570cd037344..dd638429c963e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1953,9 +1953,6 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
if (pte_unused(pte))
return 1;
- if (userfaultfd_wp(vma))
- return 1;
-
/*
* If unmap fails, we need to restore the ptes. To avoid accidentally
* upgrading write permissions for ptes that were not originally
@@ -2235,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* we may want to replace a none pte with a marker pte if
* it's file-backed, so we don't lose the tracking info.
*/
- install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, 1);
+ install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, nr_pages);
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
@@ -2359,8 +2356,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* If we are sure that we batched the entire folio and cleared
* all PTEs, we can just optimize and stop right here.
*/
- if (nr_pages == folio_nr_pages(folio))
+ if (likely(nr_pages == folio_nr_pages(folio)))
goto walk_done;
+ page_vma_mapped_walk_jump(&pvmw, nr_pages - 1);
continue;
walk_abort:
ret = false;
--
2.34.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable
2026-03-10 7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
` (4 preceding siblings ...)
2026-03-10 7:30 ` [PATCH 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs Dev Jain
@ 2026-03-10 7:30 ` Dev Jain
2026-03-10 8:27 ` Kairui Song
` (2 more replies)
2026-03-10 7:30 ` [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable Dev Jain
` (4 subsequent siblings)
10 siblings, 3 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-10 7:30 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong
Cc: weixugc, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual, Dev Jain
Teach folio_dup_swap() to handle a batch of consecutive pages. Note that
folio_dup_swap() can already handle a subset of this: nr_pages == 1 and
nr_pages == folio_nr_pages(folio). Generalize this to any nr_pages.

Currently we have the not-so-nice convention of passing subpage == NULL to
operate on the entire folio, and subpage != NULL to operate on only that
subpage. Remove this indirection: always pass a non-NULL subpage and the
number of pages required.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/rmap.c | 2 +-
mm/shmem.c | 2 +-
mm/swap.h | 5 +++--
mm/swapfile.c | 12 +++++-------
4 files changed, 10 insertions(+), 11 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index dd638429c963e..f6d5b187cf09b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2282,7 +2282,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto discard;
}
- if (folio_dup_swap(folio, subpage) < 0) {
+ if (folio_dup_swap(folio, subpage, 1) < 0) {
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 5e7dcf5bc5d3c..86ee34c9b40b3 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1695,7 +1695,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
spin_unlock(&shmem_swaplist_lock);
}
- folio_dup_swap(folio, NULL);
+ folio_dup_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
BUG_ON(folio_mapped(folio));
diff --git a/mm/swap.h b/mm/swap.h
index a77016f2423b9..d9cb58ebbddd1 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -206,7 +206,7 @@ extern int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp);
* folio_put_swap(): does the opposite thing of folio_dup_swap().
*/
int folio_alloc_swap(struct folio *folio);
-int folio_dup_swap(struct folio *folio, struct page *subpage);
+int folio_dup_swap(struct folio *folio, struct page *subpage, unsigned int nr_pages);
void folio_put_swap(struct folio *folio, struct page *subpage);
/* For internal use */
@@ -390,7 +390,8 @@ static inline int folio_alloc_swap(struct folio *folio)
return -EINVAL;
}
-static inline int folio_dup_swap(struct folio *folio, struct page *page)
+static inline int folio_dup_swap(struct folio *folio, struct page *page,
+ unsigned int nr_pages)
{
return -EINVAL;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 915bc93964dbd..eaf61ae6c3817 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1738,7 +1738,8 @@ int folio_alloc_swap(struct folio *folio)
/**
* folio_dup_swap() - Increase swap count of swap entries of a folio.
* @folio: folio with swap entries bounded.
- * @subpage: if not NULL, only increase the swap count of this subpage.
+ * @subpage: the first subpage whose swap count will be increased; the
+ * batch covers nr_pages entries starting here.
*
* Typically called when the folio is unmapped and have its swap entry to
* take its place: Swap entries allocated to a folio has count == 0 and pinned
@@ -1752,18 +1753,15 @@ int folio_alloc_swap(struct folio *folio)
* swap_put_entries_direct on its swap entry before this helper returns, or
* the swap count may underflow.
*/
-int folio_dup_swap(struct folio *folio, struct page *subpage)
+int folio_dup_swap(struct folio *folio, struct page *subpage,
+ unsigned int nr_pages)
{
swp_entry_t entry = folio->swap;
- unsigned long nr_pages = folio_nr_pages(folio);
VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
- if (subpage) {
- entry.val += folio_page_idx(folio, subpage);
- nr_pages = 1;
- }
+ entry.val += folio_page_idx(folio, subpage);
return swap_dup_entries_cluster(swap_entry_to_info(entry),
swp_offset(entry), nr_pages);
--
2.34.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable
2026-03-10 7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
` (5 preceding siblings ...)
2026-03-10 7:30 ` [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable Dev Jain
@ 2026-03-10 7:30 ` Dev Jain
2026-03-10 8:29 ` Kairui Song
` (2 more replies)
2026-03-10 7:30 ` [PATCH 8/9] mm/rmap: introduce folio_try_share_anon_rmap_ptes Dev Jain
` (3 subsequent siblings)
10 siblings, 3 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-10 7:30 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong
Cc: weixugc, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual, Dev Jain
Teach folio_put_swap() to handle a batch of consecutive pages. Note that
folio_put_swap() can already handle a subset of this: nr_pages == 1 and
nr_pages == folio_nr_pages(folio). Generalize this to any nr_pages.

Currently we have the not-so-nice convention of passing subpage == NULL to
operate on the entire folio, and subpage != NULL to operate on only that
subpage. Remove this indirection: always pass a non-NULL subpage and the
number of pages required.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/memory.c | 6 +++---
mm/rmap.c | 4 ++--
mm/shmem.c | 6 +++---
mm/swap.h | 5 +++--
mm/swapfile.c | 13 +++++--------
5 files changed, 16 insertions(+), 18 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 768646c0b3b6a..8249a9b7083ab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5002,7 +5002,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(folio != swapcache)) {
folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
- folio_put_swap(swapcache, NULL);
+ folio_put_swap(swapcache, folio_page(swapcache, 0), folio_nr_pages(swapcache));
} else if (!folio_test_anon(folio)) {
/*
* We currently only expect !anon folios that are fully
@@ -5011,12 +5011,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
- folio_put_swap(folio, NULL);
+ folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
} else {
VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
rmap_flags);
- folio_put_swap(folio, nr_pages == 1 ? page : NULL);
+ folio_put_swap(folio, page, nr_pages);
}
VM_BUG_ON(!folio_test_anon(folio) ||
diff --git a/mm/rmap.c b/mm/rmap.c
index f6d5b187cf09b..42f6b00cced01 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2293,7 +2293,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* so we'll not check/care.
*/
if (arch_unmap_one(mm, vma, address, pteval) < 0) {
- folio_put_swap(folio, subpage);
+ folio_put_swap(folio, subpage, 1);
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
@@ -2301,7 +2301,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
/* See folio_try_share_anon_rmap(): clear PTE first. */
if (anon_exclusive &&
folio_try_share_anon_rmap_pte(folio, subpage)) {
- folio_put_swap(folio, subpage);
+ folio_put_swap(folio, subpage, 1);
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 86ee34c9b40b3..d9d216ea28ecb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1716,7 +1716,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
/* Swap entry might be erased by racing shmem_free_swap() */
if (!error) {
shmem_recalc_inode(inode, 0, -nr_pages);
- folio_put_swap(folio, NULL);
+ folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
}
/*
@@ -2196,7 +2196,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
nr_pages = folio_nr_pages(folio);
folio_wait_writeback(folio);
- folio_put_swap(folio, NULL);
+ folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
swap_cache_del_folio(folio);
/*
* Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
@@ -2426,7 +2426,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (sgp == SGP_WRITE)
folio_mark_accessed(folio);
- folio_put_swap(folio, NULL);
+ folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
swap_cache_del_folio(folio);
folio_mark_dirty(folio);
put_swap_device(si);
diff --git a/mm/swap.h b/mm/swap.h
index d9cb58ebbddd1..73fd9faa67608 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -207,7 +207,7 @@ extern int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp);
*/
int folio_alloc_swap(struct folio *folio);
int folio_dup_swap(struct folio *folio, struct page *subpage, unsigned int nr_pages);
-void folio_put_swap(struct folio *folio, struct page *subpage);
+void folio_put_swap(struct folio *folio, struct page *subpage, unsigned int nr_pages);
/* For internal use */
extern void __swap_cluster_free_entries(struct swap_info_struct *si,
@@ -396,7 +396,8 @@ static inline int folio_dup_swap(struct folio *folio, struct page *page,
return -EINVAL;
}
-static inline void folio_put_swap(struct folio *folio, struct page *page)
+static inline void folio_put_swap(struct folio *folio, struct page *page,
+ unsigned int nr_pages)
{
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index eaf61ae6c3817..c66aa6d15d479 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1770,25 +1770,22 @@ int folio_dup_swap(struct folio *folio, struct page *subpage,
/**
* folio_put_swap() - Decrease swap count of swap entries of a folio.
* @folio: folio with swap entries bounded, must be in swap cache and locked.
- * @subpage: if not NULL, only decrease the swap count of this subpage.
+ * @subpage: first subpage of the batch; the swap counts of @nr_pages entries starting here are decreased.
*
* This won't free the swap slots even if swap count drops to zero, they are
* still pinned by the swap cache. User may call folio_free_swap to free them.
* Context: Caller must ensure the folio is locked and in the swap cache.
*/
-void folio_put_swap(struct folio *folio, struct page *subpage)
+void folio_put_swap(struct folio *folio, struct page *subpage,
+ unsigned int nr_pages)
{
swp_entry_t entry = folio->swap;
- unsigned long nr_pages = folio_nr_pages(folio);
struct swap_info_struct *si = __swap_entry_to_info(entry);
VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
- if (subpage) {
- entry.val += folio_page_idx(folio, subpage);
- nr_pages = 1;
- }
+ entry.val += folio_page_idx(folio, subpage);
swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false);
}
@@ -2334,7 +2331,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
new_pte = pte_mkuffd_wp(new_pte);
setpte:
set_pte_at(vma->vm_mm, addr, pte, new_pte);
- folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry)));
+ folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry)), 1);
out:
if (pte)
pte_unmap_unlock(pte, ptl);
--
2.34.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 8/9] mm/rmap: introduce folio_try_share_anon_rmap_ptes
2026-03-10 7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
` (6 preceding siblings ...)
2026-03-10 7:30 ` [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable Dev Jain
@ 2026-03-10 7:30 ` Dev Jain
2026-03-10 9:38 ` Lorenzo Stoakes (Oracle)
2026-03-10 7:30 ` [PATCH 9/9] mm/rmap: enable batch unmapping of anonymous folios Dev Jain
` (2 subsequent siblings)
10 siblings, 1 reply; 46+ messages in thread
From: Dev Jain @ 2026-03-10 7:30 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong
Cc: weixugc, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual, Dev Jain
In the quest to enable batched unmapping of anonymous folios, we need to
handle the sharing of exclusive pages. Hence, a batched version of
folio_try_share_anon_rmap_pte() is required.
Currently, the sole purpose of nr_pages in __folio_try_share_anon_rmap() is
to do some rmap sanity checks. Add a helper to clear the PageAnonExclusive
bit on a batch of nr_pages pages. Note that
__folio_try_share_anon_rmap() can receive nr_pages == HPAGE_PMD_NR from the
PMD path, but currently we only clear the bit on the head page. Retain this
behaviour by setting nr_pages = 1 when the caller is
folio_try_share_anon_rmap_pmd().
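As a rough standalone model of the intended batch-clear semantics (a sketch,
not the kernel code; struct fake_page and its bool flag stand in for the real
PG_anon_exclusive machinery):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for struct page; the bool models the PG_anon_exclusive bit. */
struct fake_page {
	bool anon_exclusive;
};

/*
 * Mirrors the loop shape of ClearPagesAnonExclusive(): clear the flag on
 * nr consecutive pages starting at page. nr must be >= 1.
 */
static void clear_pages_anon_exclusive(struct fake_page *page, unsigned int nr)
{
	for (;;) {
		page->anon_exclusive = false;
		if (--nr == 0)
			break;
		++page;
	}
}
```

With this shape, the PMD caller passing nr_pages = 1 simply degenerates to the
existing single clear on the head page.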
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
include/linux/page-flags.h | 11 +++++++++++
include/linux/rmap.h | 28 ++++++++++++++++++++++++++--
mm/rmap.c | 2 +-
3 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0e03d816e8b9d..1d74ed9a28c41 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1178,6 +1178,17 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
__clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
}
+static __always_inline void ClearPagesAnonExclusive(struct page *page,
+ unsigned int nr)
+{
+ for (;;) {
+ ClearPageAnonExclusive(page);
+ if (--nr == 0)
+ break;
+ ++page;
+ }
+}
+
#ifdef CONFIG_MMU
#define __PG_MLOCKED (1UL << PG_mlocked)
#else
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 1b7720c66ac87..7a67776dca3fe 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -712,9 +712,13 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
VM_WARN_ON_FOLIO(!PageAnonExclusive(page), folio);
__folio_rmap_sanity_checks(folio, page, nr_pages, level);
+ /* We only clear anon-exclusive on the head page of a PMD-mapped folio */
+ if (level == PGTABLE_LEVEL_PMD)
+ nr_pages = 1;
+
/* device private folios cannot get pinned via GUP. */
if (unlikely(folio_is_device_private(folio))) {
- ClearPageAnonExclusive(page);
+ ClearPagesAnonExclusive(page, nr_pages);
return 0;
}
@@ -766,7 +770,7 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
if (unlikely(folio_maybe_dma_pinned(folio)))
return -EBUSY;
- ClearPageAnonExclusive(page);
+ ClearPagesAnonExclusive(page, nr_pages);
/*
* This is conceptually a smp_wmb() paired with the smp_rmb() in
@@ -804,6 +808,26 @@ static inline int folio_try_share_anon_rmap_pte(struct folio *folio,
return __folio_try_share_anon_rmap(folio, page, 1, PGTABLE_LEVEL_PTE);
}
+/**
+ * folio_try_share_anon_rmap_ptes - try marking exclusive anonymous pages
+ * mapped by PTEs possibly shared to prepare
+ * for KSM or temporary unmapping
+ * @folio: The folio to share a mapping of
+ * @page: The first mapped exclusive page of the batch
+ * @nr_pages: The number of pages to share (batch size)
+ *
+ * See folio_try_share_anon_rmap_pte for full description.
+ *
+ * Context: The caller needs to hold the page table lock and has to have the
+ * page table entries cleared/invalidated. Those PTEs used to map consecutive
+ * pages of the folio passed here. The PTEs are all in the same PMD and VMA.
+ */
+static inline int folio_try_share_anon_rmap_ptes(struct folio *folio,
+ struct page *page, unsigned int nr)
+{
+ return __folio_try_share_anon_rmap(folio, page, nr, PGTABLE_LEVEL_PTE);
+}
+
/**
* folio_try_share_anon_rmap_pmd - try marking an exclusive anonymous page
* range mapped by a PMD possibly shared to
diff --git a/mm/rmap.c b/mm/rmap.c
index 42f6b00cced01..bba5b571946d8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2300,7 +2300,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
/* See folio_try_share_anon_rmap(): clear PTE first. */
if (anon_exclusive &&
- folio_try_share_anon_rmap_pte(folio, subpage)) {
+ folio_try_share_anon_rmap_ptes(folio, subpage, 1)) {
folio_put_swap(folio, subpage, 1);
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
--
2.34.1
* [PATCH 9/9] mm/rmap: enable batch unmapping of anonymous folios
2026-03-10 7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
` (7 preceding siblings ...)
2026-03-10 7:30 ` [PATCH 8/9] mm/rmap: introduce folio_try_share_anon_rmap_ptes Dev Jain
@ 2026-03-10 7:30 ` Dev Jain
2026-03-10 8:02 ` [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Lorenzo Stoakes (Oracle)
2026-03-10 12:59 ` Lance Yang
10 siblings, 0 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-10 7:30 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong
Cc: weixugc, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual, Dev Jain
Enable batched clearing of ptes, and batched setting of swap ptes, for anon
folio unmapping. Processing all ptes of a large folio in one go lets us
batch the atomics (add_mm_counter() etc.), the barriers (in
__folio_try_share_anon_rmap()), and the repeated calls to
page_vma_mapped_walk(), to name a few. In general, batching executes similar
code together, making the program more memory- and CPU-friendly.
The handling of anon-exclusivity is very similar to commit cac1db8c3aad
("mm: optimize mprotect() by PTE batching"). Since folio_unmap_pte_batch()
won't look at the bits of the underlying pages, we need to process
sub-batches of ptes pointing to pages that are the same w.r.t. exclusivity,
and batch-set only those ptes to swap ptes in one go. Hence, move
page_anon_exclusive_sub_batch() to internal.h and reuse it.
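The sub-batch split can be modelled in isolation (a sketch; the bool array
stands in for the per-page PageAnonExclusive state):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Standalone model of page_anon_exclusive_sub_batch(): starting at
 * start_idx, return the length of the run of pages whose exclusivity
 * matches expected, scanning at most max_len entries.
 */
static int sub_batch_len(int start_idx, int max_len, const bool *excl,
			 bool expected)
{
	int idx;

	for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
		if (excl[idx] != expected)
			break;
	}
	return idx - start_idx;
}
```

So a 5-page batch with exclusivity pattern {1, 1, 0, 0, 1} is processed as
three sub-batches of lengths 2, 2 and 1.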
arch_unmap_one() is only defined for sparc64. I am not comfortable changing
that code to enable batching: there are nuances between retrieving the pfn
from pte_pfn() and from (paddr = pte_val(oldpte) & _PAGE_PADDR_4V)
(and pte_next_pfn() can't even be called from arch_unmap_one() because
that file does not include pgtable.h), especially when I have no way to
test the code. So just disable batching for the sparc64-anon-swapbacked
case for now.
We need to take care of rmap accounting (folio_remove_rmap_ptes()) and
reference accounting (folio_put_refs()) when the anon folio unmap succeeds.
In case we partially batch the large folio and fail, we need to correctly
do the accounting for the pages that were successfully unmapped. So, put
this accounting code in __commit_ttu_anon_swapbacked_folio() itself,
instead of doing some horrible goto jumping at the callsite of
commit_ttu_anon_swapbacked_folio(). Similarly, skip past the batch
immediately after we succeed in unmapping it, and continue to the next
(unlikely) iteration.
Add a comment at relevant places to say that we are on a device-exclusive
entry and not a present entry.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/internal.h | 26 ++++++++
mm/mprotect.c | 17 -----
mm/rmap.c | 170 +++++++++++++++++++++++++++++++++++---------------
3 files changed, 144 insertions(+), 69 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 95b583e7e4f75..c29ecc334a06b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -393,6 +393,32 @@ static inline unsigned int folio_pte_batch_flags(struct folio *folio,
unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
unsigned int max_nr);
+/**
+ * page_anon_exclusive_sub_batch - Determine length of consecutive exclusive
+ * or maybe shared pages
+ * @start_idx: Starting index of the page array to scan from
+ * @max_len: Maximum length to look at
+ * @first_page: First page of the page array
+ * @expected_anon_exclusive: Whether to look for exclusive or !exclusive pages
+ *
+ * Determines the length of the run of consecutive ptes pointing to pages
+ * that are the same w.r.t. the PageAnonExclusive bit.
+ *
+ * Context: The ptes point to consecutive pages of the same large folio. The
+ * ptes belong to the same PMD and VMA.
+ */
+static inline int page_anon_exclusive_sub_batch(int start_idx, int max_len,
+ struct page *first_page, bool expected_anon_exclusive)
+{
+ int idx;
+
+ for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
+ if (expected_anon_exclusive != PageAnonExclusive(first_page + idx))
+ break;
+ }
+ return idx - start_idx;
+}
+
/**
* pte_move_swp_offset - Move the swap entry offset field of a swap pte
* forward or backward by delta
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9681f055b9fca..9403171d648b6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -138,23 +138,6 @@ static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long add
tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
}
-/*
- * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or
- * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
- * that the ptes point to consecutive pages of the same anon large folio.
- */
-static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
- struct page *first_page, bool expected_anon_exclusive)
-{
- int idx;
-
- for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
- if (expected_anon_exclusive != PageAnonExclusive(first_page + idx))
- break;
- }
- return idx - start_idx;
-}
-
/*
* This function is a result of trying our very best to retain the
* "avoid the write-fault handler" optimization. In can_change_pte_writable(),
diff --git a/mm/rmap.c b/mm/rmap.c
index bba5b571946d8..334350caf40b0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1946,11 +1946,11 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
end_addr = pmd_addr_end(addr, vma->vm_end);
max_nr = (end_addr - addr) >> PAGE_SHIFT;
- /* We only support lazyfree or file folios batching for now ... */
- if (folio_test_anon(folio) && folio_test_swapbacked(folio))
+ if (pte_unused(pte))
return 1;
- if (pte_unused(pte))
+ if (__is_defined(__HAVE_ARCH_UNMAP_ONE) && folio_test_anon(folio) &&
+ folio_test_swapbacked(folio))
return 1;
/*
@@ -1963,6 +1963,112 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
}
+static inline void set_swp_ptes(struct mm_struct *mm, unsigned long address,
+ pte_t *ptep, swp_entry_t entry, pte_t pteval, bool anon_exclusive,
+ unsigned int nr_pages)
+{
+ pte_t swp_pte = swp_entry_to_pte(entry);
+
+ if (anon_exclusive)
+ swp_pte = pte_swp_mkexclusive(swp_pte);
+
+ if (likely(pte_present(pteval))) {
+ if (pte_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ } else {
+ /* Device-exclusive entry: nr_pages is 1. */
+ if (pte_swp_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_swp_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ }
+
+ for (int i = 0; i < nr_pages; ++i, ++ptep, address += PAGE_SIZE) {
+ set_pte_at(mm, address, ptep, swp_pte);
+ swp_pte = pte_next_swp_offset(swp_pte);
+ }
+}
+
+static inline int __commit_ttu_anon_swapbacked_folio(struct vm_area_struct *vma,
+ struct folio *folio, struct page *subpage, unsigned long address,
+ pte_t *ptep, pte_t pteval, long nr_pages, bool anon_exclusive)
+{
+ swp_entry_t entry = page_swap_entry(subpage);
+ struct mm_struct *mm = vma->vm_mm;
+
+ if (folio_dup_swap(folio, subpage, nr_pages) < 0) {
+ set_ptes(mm, address, ptep, pteval, nr_pages);
+ return 1;
+ }
+
+ /*
+ * arch_unmap_one() is expected to be a NOP on
+ * architectures where we could have PFN swap PTEs,
+ * so we'll not check/care.
+ */
+ if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+ VM_WARN_ON(nr_pages != 1);
+ folio_put_swap(folio, subpage, nr_pages);
+ set_pte_at(mm, address, ptep, pteval);
+ return 1;
+ }
+
+ /* See folio_try_share_anon_rmap(): clear PTE first. */
+ if (anon_exclusive && folio_try_share_anon_rmap_ptes(folio, subpage, nr_pages)) {
+ folio_put_swap(folio, subpage, nr_pages);
+ set_ptes(mm, address, ptep, pteval, nr_pages);
+ return 1;
+ }
+
+ if (list_empty(&mm->mmlist)) {
+ spin_lock(&mmlist_lock);
+ if (list_empty(&mm->mmlist))
+ list_add(&mm->mmlist, &init_mm.mmlist);
+ spin_unlock(&mmlist_lock);
+ }
+
+ add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
+ add_mm_counter(mm, MM_SWAPENTS, nr_pages);
+ set_swp_ptes(mm, address, ptep, entry, pteval, anon_exclusive, nr_pages);
+ folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_drain_local();
+ folio_put_refs(folio, nr_pages);
+ return 0;
+}
+
+static inline int commit_ttu_anon_swapbacked_folio(struct vm_area_struct *vma,
+ struct folio *folio, struct page *first_page, unsigned long address,
+ pte_t *ptep, pte_t pteval, long nr_pages)
+{
+ bool expected_anon_exclusive;
+ int sub_batch_idx = 0;
+ int len, err;
+
+ for (;;) {
+ expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
+ len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_pages,
+ first_page, expected_anon_exclusive);
+ err = __commit_ttu_anon_swapbacked_folio(vma, folio, first_page + sub_batch_idx,
+ address, ptep, pteval, len, expected_anon_exclusive);
+ if (err)
+ return err;
+
+ nr_pages -= len;
+ if (!nr_pages)
+ break;
+
+ pteval = pte_advance_pfn(pteval, len);
+ address += len * PAGE_SIZE;
+ sub_batch_idx += len;
+ ptep += len;
+ }
+
+ return 0;
+}
+
static inline int commit_ttu_lazyfree_folio(struct vm_area_struct *vma,
struct folio *folio, unsigned long address, pte_t *ptep,
pte_t pteval, long nr_pages)
@@ -2022,7 +2128,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
- bool anon_exclusive, ret = true;
+ bool ret = true;
pte_t pteval;
struct page *subpage;
struct mmu_notifier_range range;
@@ -2148,8 +2254,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
subpage = folio_page(folio, pfn - folio_pfn(folio));
address = pvmw.address;
- anon_exclusive = folio_test_anon(folio) &&
- PageAnonExclusive(subpage);
if (folio_test_hugetlb(folio)) {
bool anon = folio_test_anon(folio);
@@ -2224,6 +2328,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (pte_dirty(pteval))
folio_mark_dirty(folio);
} else {
+ /* Device-exclusive entry */
pte_clear(mm, address, pvmw.pte);
}
@@ -2261,8 +2366,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
dec_mm_counter(mm, mm_counter(folio));
} else if (folio_test_anon(folio)) {
- swp_entry_t entry = page_swap_entry(subpage);
- pte_t swp_pte;
/*
* Store the swap location in the pte.
* See handle_pte_fault() ...
@@ -2282,52 +2385,15 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto discard;
}
- if (folio_dup_swap(folio, subpage, 1) < 0) {
- set_pte_at(mm, address, pvmw.pte, pteval);
+ if (commit_ttu_anon_swapbacked_folio(vma, folio, subpage,
+ address, pvmw.pte,
+ pteval, nr_pages))
goto walk_abort;
- }
- /*
- * arch_unmap_one() is expected to be a NOP on
- * architectures where we could have PFN swap PTEs,
- * so we'll not check/care.
- */
- if (arch_unmap_one(mm, vma, address, pteval) < 0) {
- folio_put_swap(folio, subpage, 1);
- set_pte_at(mm, address, pvmw.pte, pteval);
- goto walk_abort;
- }
-
- /* See folio_try_share_anon_rmap(): clear PTE first. */
- if (anon_exclusive &&
- folio_try_share_anon_rmap_ptes(folio, subpage, 1)) {
- folio_put_swap(folio, subpage, 1);
- set_pte_at(mm, address, pvmw.pte, pteval);
- goto walk_abort;
- }
- if (list_empty(&mm->mmlist)) {
- spin_lock(&mmlist_lock);
- if (list_empty(&mm->mmlist))
- list_add(&mm->mmlist, &init_mm.mmlist);
- spin_unlock(&mmlist_lock);
- }
- dec_mm_counter(mm, MM_ANONPAGES);
- inc_mm_counter(mm, MM_SWAPENTS);
- swp_pte = swp_entry_to_pte(entry);
- if (anon_exclusive)
- swp_pte = pte_swp_mkexclusive(swp_pte);
- if (likely(pte_present(pteval))) {
- if (pte_soft_dirty(pteval))
- swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
- } else {
- if (pte_swp_soft_dirty(pteval))
- swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_swp_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
- }
- set_pte_at(mm, address, pvmw.pte, swp_pte);
+ if (likely(nr_pages == folio_nr_pages(folio)))
+ goto walk_done;
+ page_vma_mapped_walk_jump(&pvmw, nr_pages - 1);
+ continue;
} else {
/*
* This is a locked file-backed folio,
--
2.34.1
* Re: [PATCH 1/9] mm/rmap: make nr_pages signed in try_to_unmap_one
2026-03-10 7:30 ` [PATCH 1/9] mm/rmap: make nr_pages signed in try_to_unmap_one Dev Jain
@ 2026-03-10 7:56 ` Lorenzo Stoakes (Oracle)
2026-03-10 8:06 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 46+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-10 7:56 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 01:00:05PM +0530, Dev Jain wrote:
> Currently, nr_pages is defined as unsigned long. We use nr_pages to
> manipulate mm rss counters for lazyfree folios as follows:
>
> add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
>
> Suppose nr_pages = 3. -nr_pages underflows and becomes ULONG_MAX - 2. Then,
> since add_mm_counter() uses this -nr_pages as a long, ULONG_MAX - 2 does
> not fit into the positive range of long, and is converted to -3. Eventually
> all of this works out, but for keeping things simple, declare nr_pages as
> a signed variable.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/rmap.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 6398d7eef393f..087c9f5b884fe 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1979,9 +1979,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> struct page *subpage;
> struct mmu_notifier_range range;
> enum ttu_flags flags = (enum ttu_flags)(long)arg;
> - unsigned long nr_pages = 1, end_addr;
> + unsigned long end_addr;
> unsigned long pfn;
> unsigned long hsz = 0;
> + long nr_pages = 1;
This is a non-issue that makes the code confusing, so let's not?
The convention throughout the kernel is that nr_pages is unsigned, because
you can't have a negative number of pages.
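For completeness, the conversion chain the commit message describes can be
shown in isolation (a toy sketch; the unsigned-to-signed conversion behaves
this way on the two's-complement targets Linux runs on):

```c
#include <assert.h>
#include <limits.h>

/*
 * Negating an unsigned long wraps modulo 2^BITS, so -3UL == ULONG_MAX - 2.
 * Converting that back to long recovers -3 on two's-complement targets,
 * which is why add_mm_counter(mm, MM_ANONPAGES, -nr_pages) works out
 * either way.
 */
static long negated_delta(unsigned long nr_pages)
{
	return (long)-nr_pages;
}
```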
> int ptes = 0;
>
> /*
> --
> 2.34.1
>
* Re: [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping
2026-03-10 7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
` (8 preceding siblings ...)
2026-03-10 7:30 ` [PATCH 9/9] mm/rmap: enable batch unmapping of anonymous folios Dev Jain
@ 2026-03-10 8:02 ` Lorenzo Stoakes (Oracle)
2026-03-10 9:28 ` Dev Jain
2026-03-10 12:59 ` Lance Yang
10 siblings, 1 reply; 46+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-10 8:02 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 01:00:04PM +0530, Dev Jain wrote:
> Speed up unmapping of anonymous large folios by clearing the ptes, and
> setting swap ptes, in one go.
>
> The following benchmark (stolen from Barry at [1]) is used to measure the
> time taken to swapout 256M worth of memory backed by 64K large folios:
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <time.h>
> #include <unistd.h>
> #include <errno.h>
>
> #define SIZE_MB 256
> #define SIZE_BYTES (SIZE_MB * 1024 * 1024)
>
> int main() {
> void *addr = mmap(NULL, SIZE_BYTES, PROT_READ | PROT_WRITE,
> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> if (addr == MAP_FAILED) {
> perror("mmap failed");
> return 1;
> }
>
> memset(addr, 0, SIZE_BYTES);
>
> struct timespec start, end;
> clock_gettime(CLOCK_MONOTONIC, &start);
>
> if (madvise(addr, SIZE_BYTES, MADV_PAGEOUT) != 0) {
> perror("madvise(MADV_PAGEOUT) failed");
> munmap(addr, SIZE_BYTES);
> return 1;
> }
>
> clock_gettime(CLOCK_MONOTONIC, &end);
>
> long duration_ns = (end.tv_sec - start.tv_sec) * 1e9 +
> (end.tv_nsec - start.tv_nsec);
> printf("madvise(MADV_PAGEOUT) took %ld ns (%.3f ms)\n",
> duration_ns, duration_ns / 1e6);
>
> munmap(addr, SIZE_BYTES);
> return 0;
> }
>
> On arm64, only showing one of the middle values in the distribution:
>
This doesn't seem very statistically valid.
How about you give median, stddev etc.? Variance matters too.
> without patch:
> madvise(MADV_PAGEOUT) took 52192959 ns (52.193 ms)
>
> with patch:
> madvise(MADV_PAGEOUT) took 26676625 ns (26.677 ms)
You have a habit of only giving data on arm64, and not mentioning whether you've
tested on any other arch/setup.
I've commented on this before so I'm a bit disappointed you've done the exact
same thing here again. Especially since you've previously introduced regressions
this way.
Please can you test this on (hardware!) x86-64 _at least_ as well and confirm
you aren't regressing anything for 4K pages?
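For reference, summarising repeated runs needs only a median and a
variance/stddev; a minimal sketch (the sample values used below are
illustrative, not measurements):

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_long(const void *a, const void *b)
{
	long x = *(const long *)a, y = *(const long *)b;

	return (x > y) - (x < y);
}

/* Median of n samples, n odd for simplicity; sorts the array in place. */
static long median_ns(long *samples, size_t n)
{
	qsort(samples, n, sizeof(*samples), cmp_long);
	return samples[n / 2];
}

/* Population variance; the stddev is its square root. */
static double variance_ns(const long *samples, size_t n)
{
	double mean = 0.0, var = 0.0;
	size_t i;

	for (i = 0; i < n; ++i)
		mean += (double)samples[i];
	mean /= (double)n;
	for (i = 0; i < n; ++i)
		var += ((double)samples[i] - mean) * ((double)samples[i] - mean);
	return var / (double)n;
}
```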
>
>
> [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/
>
> ---
> Based on mm-unstable bb420884e9e0. mm-selftests pass.
>
> Dev Jain (9):
> mm/rmap: make nr_pages signed in try_to_unmap_one
> mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
> mm/rmap: refactor lazyfree unmap commit path to
> commit_ttu_lazyfree_folio()
> mm/memory: Batch set uffd-wp markers during zapping
> mm/rmap: batch unmap folios belonging to uffd-wp VMAs
> mm/swapfile: Make folio_dup_swap batchable
> mm/swapfile: Make folio_put_swap batchable
> mm/rmap: introduce folio_try_share_anon_rmap_ptes
> mm/rmap: enable batch unmapping of anonymous folios
>
> include/linux/mm_inline.h | 37 +++--
> include/linux/page-flags.h | 11 ++
> include/linux/rmap.h | 38 ++++-
> mm/internal.h | 26 ++++
> mm/memory.c | 26 +---
> mm/mprotect.c | 17 ---
> mm/rmap.c | 274 ++++++++++++++++++++++++-------------
> mm/shmem.c | 8 +-
> mm/swap.h | 10 +-
> mm/swapfile.c | 25 ++--
> 10 files changed, 298 insertions(+), 174 deletions(-)
>
> --
> 2.34.1
>
* Re: [PATCH 1/9] mm/rmap: make nr_pages signed in try_to_unmap_one
2026-03-10 7:56 ` Lorenzo Stoakes (Oracle)
@ 2026-03-10 8:06 ` David Hildenbrand (Arm)
2026-03-10 8:23 ` Dev Jain
0 siblings, 1 reply; 46+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-10 8:06 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle), Dev Jain
Cc: akpm, axelrasmussen, yuanchu, hughd, chrisl, kasong, weixugc,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe, baohua,
youngjun.park, ziy, kas, willy, yuzhao, linux-mm, linux-kernel,
ryan.roberts, anshuman.khandual
On 3/10/26 08:56, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 10, 2026 at 01:00:05PM +0530, Dev Jain wrote:
>> Currently, nr_pages is defined as unsigned long. We use nr_pages to
>> manipulate mm rss counters for lazyfree folios as follows:
>>
>> add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
>>
>> Suppose nr_pages = 3. -nr_pages underflows and becomes ULONG_MAX - 2. Then,
>> since add_mm_counter() uses this -nr_pages as a long, ULONG_MAX - 2 does
>> not fit into the positive range of long, and is converted to -3. Eventually
>> all of this works out, but for keeping things simple, declare nr_pages as
>> a signed variable.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> mm/rmap.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 6398d7eef393f..087c9f5b884fe 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1979,9 +1979,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> struct page *subpage;
>> struct mmu_notifier_range range;
>> enum ttu_flags flags = (enum ttu_flags)(long)arg;
>> - unsigned long nr_pages = 1, end_addr;
>> + unsigned long end_addr;
>> unsigned long pfn;
>> unsigned long hsz = 0;
>> + long nr_pages = 1;
>
> This is a non-issue that makes the code confusing, so let's not?
>
> The convention throughout the kernel is nr_pages generally is unsigned because
> you can't have negative nr_pages.
Indeed. Documented in:
commit fa17bcd5f65ed702df001579cca8c885fa6bf3e7
Author: Aristeu Rozanski <aris@ruivo.org>
Date: Tue Aug 26 11:37:21 2025 -0400
mm: make folio page count functions return unsigned
As raised by Andrew [1], a folio/compound page never spans a negative
number of pages. Consequently, let's use "unsigned long" instead of
"long" consistently for folio_nr_pages(), folio_large_nr_pages() and
compound_nr().
Using "unsigned long" as return value is fine, because even
"(long)-folio_nr_pages()" will keep on working as expected. Using
"unsigned int" instead would actually break these use cases.
This patch takes the first step changing these to return unsigned long
(and making drm_gem_get_pages() use the new types instead of replacing
min()).
In the future, we might want to make more callers of these functions to
consistently use "unsigned long".
--
Cheers,
David
* Re: [PATCH 2/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
2026-03-10 7:30 ` [PATCH 2/9] mm/rmap: initialize nr_pages to 1 at loop start " Dev Jain
@ 2026-03-10 8:10 ` Lorenzo Stoakes (Oracle)
2026-03-10 8:31 ` Dev Jain
0 siblings, 1 reply; 46+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-10 8:10 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 01:00:06PM +0530, Dev Jain wrote:
> Initialize nr_pages to 1 at the start of the loop, similar to what is being
> done in folio_referenced_one(). It may happen that the nr_pages computed
> from a previous call to folio_unmap_pte_batch gets reused without again
> going through folio_unmap_pte_batch, messing up things. Although, I don't
> think there is any bug right now; a bug would have been there, if in the
> same instance of a call to try_to_unmap_one, we end up in the
> pte_present(pteval) branch, then in the else branch doing pte_clear() for
> device-exclusive ptes. This means that a lazyfree folio has some present
> entries and some device entries mapping it. Since a pte being
> device-exclusive means that a GUP reference on the underlying folio is
> held, the lazyfree unmapping path upon witnessing this will abort
> try_to_unmap_one.
Dude, paragraphs. PARAGRAPHS :) this is one dense set of words.
It's also a very compressed 'stream of consciousness' hard-to-read block here.
I'm not sure it's really worth having this as a separate commit either, it's
pretty trivial.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/rmap.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 087c9f5b884fe..1fa020edd954a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1982,7 +1982,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> unsigned long end_addr;
> unsigned long pfn;
> unsigned long hsz = 0;
> - long nr_pages = 1;
> + long nr_pages;
> int ptes = 0;
>
> /*
> @@ -2019,6 +2019,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> mmu_notifier_invalidate_range_start(&range);
>
> while (page_vma_mapped_walk(&pvmw)) {
> + nr_pages = 1;
> +
This seems valid but I really hate how we default-set it then in some branch set
it to something else.
But I think fixing that would be part of a bigger cleanup...
> /*
> * If the folio is in an mlock()d vma, we must not swap it out.
> */
> --
> 2.34.1
>
* Re: [PATCH 3/9] mm/rmap: refactor lazyfree unmap commit path to commit_ttu_lazyfree_folio()
2026-03-10 7:30 ` [PATCH 3/9] mm/rmap: refactor lazyfree unmap commit path to commit_ttu_lazyfree_folio() Dev Jain
@ 2026-03-10 8:19 ` Lorenzo Stoakes (Oracle)
2026-03-10 8:42 ` Dev Jain
0 siblings, 1 reply; 46+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-10 8:19 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 01:00:07PM +0530, Dev Jain wrote:
> Clean up the code by refactoring the post-pte-clearing path of lazyfree
> folio unmapping, into commit_ttu_lazyfree_folio().
>
> No functional change is intended.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
This is a good idea, and we need more refactoring like this in the rmap code,
but comments/nits below.
> ---
> mm/rmap.c | 93 ++++++++++++++++++++++++++++++++-----------------------
> 1 file changed, 54 insertions(+), 39 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1fa020edd954a..a61978141ee3f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1966,6 +1966,57 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
> }
>
> +static inline int commit_ttu_lazyfree_folio(struct vm_area_struct *vma,
Strange name, maybe lazyfree_range()? Not sure what ttu has to do with
anything...
> + struct folio *folio, unsigned long address, pte_t *ptep,
> + pte_t pteval, long nr_pages)
That long nr_pages is really grating now...
> +{
Come on Dev, it's 2026, why on earth are you returning an integer and not a
bool?
Also it would make sense for this to return false if something breaks, otherwise
true.
> + struct mm_struct *mm = vma->vm_mm;
> + int ref_count, map_count;
> +
> + /*
> + * Synchronize with gup_pte_range():
> + * - clear PTE; barrier; read refcount
> + * - inc refcount; barrier; read PTE
> + */
> + smp_mb();
> +
> + ref_count = folio_ref_count(folio);
> + map_count = folio_mapcount(folio);
> +
> + /*
> + * Order reads for page refcount and dirty flag
> + * (see comments in __remove_mapping()).
> + */
> + smp_rmb();
> +
> + if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> + /*
> + * redirtied either using the page table or a previously
> + * obtained GUP reference.
> + */
> + set_ptes(mm, address, ptep, pteval, nr_pages);
> + folio_set_swapbacked(folio);
> + return 1;
> + }
> +
> + if (ref_count != 1 + map_count) {
> + /*
> + * Additional reference. Could be a GUP reference or any
> + * speculative reference. GUP users must mark the folio
> + * dirty if there was a modification. This folio cannot be
> + * reclaimed right now either way, so act just like nothing
> + * happened.
> + * We'll come back here later and detect if the folio was
> + * dirtied when the additional reference is gone.
> + */
> + set_ptes(mm, address, ptep, pteval, nr_pages);
> + return 1;
> + }
> +
> + add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
> + return 0;
> +}
> +
> /*
> * @arg: enum ttu_flags will be passed to this argument
> */
> @@ -2227,46 +2278,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>
> /* MADV_FREE page check */
> if (!folio_test_swapbacked(folio)) {
> - int ref_count, map_count;
> -
> - /*
> - * Synchronize with gup_pte_range():
> - * - clear PTE; barrier; read refcount
> - * - inc refcount; barrier; read PTE
> - */
> - smp_mb();
> -
> - ref_count = folio_ref_count(folio);
> - map_count = folio_mapcount(folio);
> -
> - /*
> - * Order reads for page refcount and dirty flag
> - * (see comments in __remove_mapping()).
> - */
> - smp_rmb();
> -
> - if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> - /*
> - * redirtied either using the page table or a previously
> - * obtained GUP reference.
> - */
> - set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
> - folio_set_swapbacked(folio);
> + if (commit_ttu_lazyfree_folio(vma, folio, address,
> + pvmw.pte, pteval,
> + nr_pages))
With above corrections this would be:
if (!lazyfree_range(vma, folio, address, pvmw.pte, pteval, nr_pages))
...
> goto walk_abort;
> - } else if (ref_count != 1 + map_count) {
> - /*
> - * Additional reference. Could be a GUP reference or any
> - * speculative reference. GUP users must mark the folio
> - * dirty if there was a modification. This folio cannot be
> - * reclaimed right now either way, so act just like nothing
> - * happened.
> - * We'll come back here later and detect if the folio was
> - * dirtied when the additional reference is gone.
> - */
> - set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
> - goto walk_abort;
> - }
> - add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
> goto discard;
> }
>
> --
> 2.34.1
>
Thanks, Lorenzo
* Re: [PATCH 1/9] mm/rmap: make nr_pages signed in try_to_unmap_one
2026-03-10 8:06 ` David Hildenbrand (Arm)
@ 2026-03-10 8:23 ` Dev Jain
2026-03-10 12:40 ` Matthew Wilcox
0 siblings, 1 reply; 46+ messages in thread
From: Dev Jain @ 2026-03-10 8:23 UTC (permalink / raw)
To: David Hildenbrand (Arm), Lorenzo Stoakes (Oracle)
Cc: akpm, axelrasmussen, yuanchu, hughd, chrisl, kasong, weixugc,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe, baohua,
youngjun.park, ziy, kas, willy, yuzhao, linux-mm, linux-kernel,
ryan.roberts, anshuman.khandual
On 10/03/26 1:36 pm, David Hildenbrand (Arm) wrote:
> On 3/10/26 08:56, Lorenzo Stoakes (Oracle) wrote:
>> On Tue, Mar 10, 2026 at 01:00:05PM +0530, Dev Jain wrote:
>>> Currently, nr_pages is defined as unsigned long. We use nr_pages to
>>> manipulate mm rss counters for lazyfree folios as follows:
>>>
>>> add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
>>>
>>> Suppose nr_pages = 3. -nr_pages underflows and becomes ULONG_MAX - 2. Then,
>>> since add_mm_counter() uses this -nr_pages as a long, ULONG_MAX - 2 does
>>> not fit into the positive range of long, and is converted to -3. Eventually
>>> all of this works out, but for keeping things simple, declare nr_pages as
>>> a signed variable.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>> mm/rmap.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 6398d7eef393f..087c9f5b884fe 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1979,9 +1979,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>> struct page *subpage;
>>> struct mmu_notifier_range range;
>>> enum ttu_flags flags = (enum ttu_flags)(long)arg;
>>> - unsigned long nr_pages = 1, end_addr;
>>> + unsigned long end_addr;
>>> unsigned long pfn;
>>> unsigned long hsz = 0;
>>> + long nr_pages = 1;
>>
>> This is a non-issue that makes the code confusing, so let's not?
>>
>> The convention throughout the kernel is nr_pages generally is unsigned because
>> you can't have negative nr_pages.
>
> Indeed. Documented in:
>
> commit fa17bcd5f65ed702df001579cca8c885fa6bf3e7
> Author: Aristeu Rozanski <aris@ruivo.org>
> Date: Tue Aug 26 11:37:21 2025 -0400
>
> mm: make folio page count functions return unsigned
>
> As raised by Andrew [1], a folio/compound page never spans a negative
> number of pages. Consequently, let's use "unsigned long" instead of
> "long" consistently for folio_nr_pages(), folio_large_nr_pages() and
> compound_nr().
>
> Using "unsigned long" as return value is fine, because even
> "(long)-folio_nr_pages()" will keep on working as expected. Using
> "unsigned int" instead would actually break these use cases.
>
> This patch takes the first step changing these to return unsigned long
> (and making drm_gem_get_pages() use the new types instead of replacing
> min()).
>
> In the future, we might want to make more callers of these functions to
> consistently use "unsigned long".
So when I was playing around with the code, I noticed that passing an
unsigned int nr_pages to add_mm_counter(-nr_pages) messes things up. Then
I noticed we have an unsigned long here, which prevents that. This was
quite non-trivial information for me, especially since, searching around
the codebase, this seems to be the only place where we pass a negated
unsigned long.
But thanks for pointing out this commit. If it is a well-known fact that
(long)-folio_nr_pages() will work correctly, then we can drop this patch.
>
>
>
* Re: [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable
2026-03-10 7:30 ` [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable Dev Jain
@ 2026-03-10 8:27 ` Kairui Song
2026-03-10 8:46 ` Dev Jain
2026-03-10 8:49 ` Lorenzo Stoakes (Oracle)
2026-03-18 0:20 ` kernel test robot
2 siblings, 1 reply; 46+ messages in thread
From: Kairui Song @ 2026-03-10 8:27 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, weixugc, ljs,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe, baohua,
youngjun.park, ziy, kas, willy, yuzhao, linux-mm, linux-kernel,
ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 3:36 PM Dev Jain <dev.jain@arm.com> wrote:
>
> Teach folio_dup_swap to handle a batch of consecutive pages. Note that
> folio_dup_swap already can handle a subset of this: nr_pages == 1 and
> nr_pages == folio_nr_pages(folio). Generalize this to any nr_pages.
Thanks a lot for doing this. I was thinking it's about time we respin
the batch unmapping of anon folios idea. Barry tried that before with
an RFC, and now batching from swap side is easier, so some parts can
be done cleaner.
> Currently we have a not-so-nice logic of passing in subpage == NULL if
> we mean to exercise the logic on the entire folio, and subpage != NULL if
> we want to exercise the logic on only that subpage. Remove this
> indirection, and explicitly pass subpage != NULL, and the number of
> pages required.
I was hoping most callers would just use the whole folio, but after
checking your code, yeah, using an explicit subpage and nr does fit the
other parts better.
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 915bc93964dbd..eaf61ae6c3817 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1738,7 +1738,8 @@ int folio_alloc_swap(struct folio *folio)
> /**
> * folio_dup_swap() - Increase swap count of swap entries of a folio.
> * @folio: folio with swap entries bounded.
> - * @subpage: if not NULL, only increase the swap count of this subpage.
> + * @subpage: Increase the swap count of this subpage till nr number of
> + * pages forward.
The new nr_pages parameter isn't documented in the kernel-doc comment?
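For what it's worth, the updated kernel-doc might look something like the
following (the wording and the @nr parameter name are only a suggestion, based
on the "till nr number of pages forward" phrasing in the hunk above):

```c
/**
 * folio_dup_swap() - Increase swap count of swap entries of a folio.
 * @folio: folio with swap entries bounded.
 * @subpage: first subpage whose swap count should be increased.
 * @nr: number of consecutive subpages, starting at @subpage.
 */
```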
* Re: [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable
2026-03-10 7:30 ` [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable Dev Jain
@ 2026-03-10 8:29 ` Kairui Song
2026-03-10 8:50 ` Dev Jain
2026-03-10 8:55 ` Lorenzo Stoakes (Oracle)
2026-03-18 1:04 ` kernel test robot
2 siblings, 1 reply; 46+ messages in thread
From: Kairui Song @ 2026-03-10 8:29 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, weixugc, ljs,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe, baohua,
youngjun.park, ziy, kas, willy, yuzhao, linux-mm, linux-kernel,
ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 3:47 PM Dev Jain <dev.jain@arm.com> wrote:
>
> Teach folio_put_swap to handle a batch of consecutive pages. Note that
> folio_put_swap already can handle a subset of this: nr_pages == 1 and
> nr_pages == folio_nr_pages(folio). Generalize this to any nr_pages.
>
> Currently we have a not-so-nice logic of passing in subpage == NULL if
> we mean to exercise the logic on the entire folio, and subpage != NULL if
> we want to exercise the logic on only that subpage. Remove this
> indirection, and explicitly pass subpage != NULL, and the number of
> pages required.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/memory.c | 6 +++---
> mm/rmap.c | 4 ++--
> mm/shmem.c | 6 +++---
> mm/swap.h | 5 +++--
> mm/swapfile.c | 13 +++++--------
> 5 files changed, 16 insertions(+), 18 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 768646c0b3b6a..8249a9b7083ab 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5002,7 +5002,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (unlikely(folio != swapcache)) {
> folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> folio_add_lru_vma(folio, vma);
> - folio_put_swap(swapcache, NULL);
> + folio_put_swap(swapcache, folio_page(swapcache, 0), folio_nr_pages(swapcache));
> } else if (!folio_test_anon(folio)) {
> /*
> * We currently only expect !anon folios that are fully
> @@ -5011,12 +5011,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
> VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
> folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
> - folio_put_swap(folio, NULL);
> + folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
> } else {
> VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
> folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
> rmap_flags);
> - folio_put_swap(folio, nr_pages == 1 ? page : NULL);
> + folio_put_swap(folio, page, nr_pages);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f6d5b187cf09b..42f6b00cced01 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2293,7 +2293,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> * so we'll not check/care.
> */
> if (arch_unmap_one(mm, vma, address, pteval) < 0) {
> - folio_put_swap(folio, subpage);
> + folio_put_swap(folio, subpage, 1);
> set_pte_at(mm, address, pvmw.pte, pteval);
> goto walk_abort;
> }
> @@ -2301,7 +2301,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> /* See folio_try_share_anon_rmap(): clear PTE first. */
> if (anon_exclusive &&
> folio_try_share_anon_rmap_pte(folio, subpage)) {
> - folio_put_swap(folio, subpage);
> + folio_put_swap(folio, subpage, 1);
> set_pte_at(mm, address, pvmw.pte, pteval);
> goto walk_abort;
> }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 86ee34c9b40b3..d9d216ea28ecb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1716,7 +1716,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> /* Swap entry might be erased by racing shmem_free_swap() */
> if (!error) {
> shmem_recalc_inode(inode, 0, -nr_pages);
> - folio_put_swap(folio, NULL);
> + folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
I just realized that we already have a nr_pages variable available
here, maybe you can just use that?
Feel free to ignore this if it might touch more code.
> }
>
> /*
> @@ -2196,7 +2196,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>
> nr_pages = folio_nr_pages(folio);
> folio_wait_writeback(folio);
> - folio_put_swap(folio, NULL);
> + folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
> swap_cache_del_folio(folio);
> /*
> * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
> @@ -2426,7 +2426,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> if (sgp == SGP_WRITE)
> folio_mark_accessed(folio);
>
> - folio_put_swap(folio, NULL);
> + folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
Same here, nr_pages seems good enough?
* Re: [PATCH 2/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
2026-03-10 8:10 ` Lorenzo Stoakes (Oracle)
@ 2026-03-10 8:31 ` Dev Jain
2026-03-10 8:39 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 46+ messages in thread
From: Dev Jain @ 2026-03-10 8:31 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On 10/03/26 1:40 pm, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 10, 2026 at 01:00:06PM +0530, Dev Jain wrote:
>> Initialize nr_pages to 1 at the start of the loop, similar to what is being
>> done in folio_referenced_one(). It may happen that the nr_pages computed
>> from a previous call to folio_unmap_pte_batch gets reused without again
>> going through folio_unmap_pte_batch, messing up things. Although, I don't
>> think there is any bug right now; a bug would have been there, if in the
>> same instance of a call to try_to_unmap_one, we end up in the
>> pte_present(pteval) branch, then in the else branch doing pte_clear() for
>> device-exclusive ptes. This means that a lazyfree folio has some present
>> entries and some device entries mapping it. Since a pte being
>> device-exclusive means that a GUP reference on the underlying folio is
>> held, the lazyfree unmapping path upon witnessing this will abort
>> try_to_unmap_one.
>
> Dude, paragraphs. PARAGRAPHS :) this is one dense set of words.
>
> It's also a very compressed 'stream of consciousness' hard-to-read block here.
Sure :) I'll try to break this down.
>
> I'm not sure it's really worth having this as a separate commit either, it's
> pretty trivial.
Hmm... well, as I explain above, it's not trivial for me :) It is difficult
for me to reason about whether nr_pages can be reused without a reset in
a future iteration.
>
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> mm/rmap.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 087c9f5b884fe..1fa020edd954a 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1982,7 +1982,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> unsigned long end_addr;
>> unsigned long pfn;
>> unsigned long hsz = 0;
>> - long nr_pages = 1;
>> + long nr_pages;
>> int ptes = 0;
>>
>> /*
>> @@ -2019,6 +2019,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> mmu_notifier_invalidate_range_start(&range);
>>
>> while (page_vma_mapped_walk(&pvmw)) {
>> + nr_pages = 1;
>> +
>
> This seems valid but I really hate how we default-set it then in some branch set
> it to something else.
>
> But I think fixing that would be part of a bigger cleanup...
>
>> /*
>> * If the folio is in an mlock()d vma, we must not swap it out.
>> */
>> --
>> 2.34.1
>>
* Re: [PATCH 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs
2026-03-10 7:30 ` [PATCH 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs Dev Jain
@ 2026-03-10 8:34 ` Lorenzo Stoakes (Oracle)
2026-03-10 23:32 ` Barry Song
2026-03-11 4:56 ` Dev Jain
0 siblings, 2 replies; 46+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-10 8:34 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 01:00:09PM +0530, Dev Jain wrote:
> The ptes are all the same w.r.t belonging to the same type of VMA, and
> being marked with uffd-wp or all being not marked. Therefore we can batch
> set uffd-wp markers through install_uffd_wp_ptes_if_needed, and enable
> batched unmapping of folios belonging to uffd-wp VMAs by dropping that
> condition from folio_unmap_pte_batch.
>
> It may happen that we don't batch over the entire folio in one go, in which
> case, we must skip over the current batch. Add a helper to do that -
> page_vma_mapped_walk_jump() will increment the relevant fields of pvmw
> by nr pages.
>
> I think that we can get away with just incrementing pvmw->pte
> and pvmw->address, since looking at the code in page_vma_mapped.c,
> pvmw->pfn and pvmw->nr_pages are used in conjunction, and pvmw->pgoff
> and pvmw->nr_pages (in vma_address_end()) are used in conjunction,
> cancelling out the increment and decrement in the respective fields. But
> let us not rely on the pvmw implementation and keep this simple.
This isn't simple...
>
> Export this function to rmap.h to enable future reuse.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> include/linux/rmap.h | 10 ++++++++++
> mm/rmap.c | 8 +++-----
> 2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 8dc0871e5f001..1b7720c66ac87 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -892,6 +892,16 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> spin_unlock(pvmw->ptl);
> }
>
> +static inline void page_vma_mapped_walk_jump(struct page_vma_mapped_walk *pvmw,
> + unsigned int nr)
unsigned long nr_pages... 'nr' is meaningless and you're mixing + matching types
for no reason.
> +{
> + pvmw->pfn += nr;
> + pvmw->nr_pages -= nr;
> + pvmw->pgoff += nr;
> + pvmw->pte += nr;
> + pvmw->address += nr * PAGE_SIZE;
> +}
I absolutely hate this. It's extremely confusing, especially since you're now
going from looking at 1 page to nr_pages - 1; 'jump' doesn't really mean anything
here. You're losing sight of the batch size and exposing a silly detail to the
caller, and I really don't want to 'export' this at this time.
If we must have this, can you please make it static in rmap.c at least for the
time being.
Or perhaps instead, have a batched variant of page_vma_mapped_walk(), like
page_vma_mapped_walk_batch()?
I think that makes a lot more sense...
I mean I kind of hate the pvmw interface in general, this is a hack to handle
batching clamped on to the side of it, let's figure out how to do this sensibly
and do what's needed rather than adding yet more hacks-on-hacks please.
> +
> /**
> * page_vma_mapped_walk_restart - Restart the page table walk.
> * @pvmw: Pointer to struct page_vma_mapped_walk.
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a7570cd037344..dd638429c963e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1953,9 +1953,6 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> if (pte_unused(pte))
> return 1;
>
> - if (userfaultfd_wp(vma))
> - return 1;
> -
> /*
> * If unmap fails, we need to restore the ptes. To avoid accidentally
> * upgrading write permissions for ptes that were not originally
> @@ -2235,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> * we may want to replace a none pte with a marker pte if
> * it's file-backed, so we don't lose the tracking info.
> */
> - install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, 1);
> + install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, nr_pages);
>
> /* Update high watermark before we lower rss */
> update_hiwater_rss(mm);
> @@ -2359,8 +2356,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> * If we are sure that we batched the entire folio and cleared
> * all PTEs, we can just optimize and stop right here.
> */
> - if (nr_pages == folio_nr_pages(folio))
> + if (likely(nr_pages == folio_nr_pages(folio)))
Please don't add random likely()'s based on what you think is likely(). This
kind of thing should only be done based on profiling.
> goto walk_done;
> + page_vma_mapped_walk_jump(&pvmw, nr_pages - 1);
(You're now passing a signed long to an unsigned int...!)
> continue;
> walk_abort:
> ret = false;
> --
> 2.34.1
>
Thanks, Lorenzo
* Re: [PATCH 2/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
2026-03-10 8:31 ` Dev Jain
@ 2026-03-10 8:39 ` Lorenzo Stoakes (Oracle)
2026-03-10 8:43 ` Dev Jain
0 siblings, 1 reply; 46+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-10 8:39 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 02:01:44PM +0530, Dev Jain wrote:
>
>
> On 10/03/26 1:40 pm, Lorenzo Stoakes (Oracle) wrote:
> > On Tue, Mar 10, 2026 at 01:00:06PM +0530, Dev Jain wrote:
> >> Initialize nr_pages to 1 at the start of the loop, similar to what is being
> >> done in folio_referenced_one(). It may happen that the nr_pages computed
> >> from a previous call to folio_unmap_pte_batch gets reused without again
> >> going through folio_unmap_pte_batch, messing up things. Although, I don't
> >> think there is any bug right now; a bug would have been there, if in the
> >> same instance of a call to try_to_unmap_one, we end up in the
> >> pte_present(pteval) branch, then in the else branch doing pte_clear() for
> >> device-exclusive ptes. This means that a lazyfree folio has some present
> >> entries and some device entries mapping it. Since a pte being
> >> device-exclusive means that a GUP reference on the underlying folio is
> >> held, the lazyfree unmapping path upon witnessing this will abort
> >> try_to_unmap_one.
> >
> > Dude, paragraphs. PARAGRAPHS :) this is one dense set of words.
> >
> > It's also a very compressed 'stream of consciousness' hard-to-read block here.
>
> Sure :) I'll try to break this down.
>
> >
> > I'm not sure it's really worth having this as a separate commit either, it's
> > pretty trivial.
>
> Hmm...well, as I explain above, it's not trivial for me :) it is difficult
> for me to reason here whether nr_pages can be reused without a reset in
> a future iteration.
OK you can have it be separate, let's just really clean up the commit message
then please :)
Thanks, Lorenzo
* Re: [PATCH 3/9] mm/rmap: refactor lazyfree unmap commit path to commit_ttu_lazyfree_folio()
2026-03-10 8:19 ` Lorenzo Stoakes (Oracle)
@ 2026-03-10 8:42 ` Dev Jain
2026-03-19 15:53 ` Lorenzo Stoakes (Oracle)
0 siblings, 1 reply; 46+ messages in thread
From: Dev Jain @ 2026-03-10 8:42 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On 10/03/26 1:49 pm, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 10, 2026 at 01:00:07PM +0530, Dev Jain wrote:
>> Clean up the code by refactoring the post-pte-clearing path of lazyfree
>> folio unmapping, into commit_ttu_lazyfree_folio().
>>
>> No functional change is intended.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>
> This is a good idea, and we need more refactoring like this in the rmap code,
> but comments/nits below.
>
>> ---
>> mm/rmap.c | 93 ++++++++++++++++++++++++++++++++-----------------------
>> 1 file changed, 54 insertions(+), 39 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 1fa020edd954a..a61978141ee3f 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1966,6 +1966,57 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>> FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>> }
>>
>> +static inline int commit_ttu_lazyfree_folio(struct vm_area_struct *vma,
>
> Strange name, maybe lazyfree_range()? Not sure what ttu has to do with
ttu means try_to_unmap, just like it is used in TTU_SYNC,
TTU_SPLIT_HUGE_PMD, etc. So personally I really like the name; it reads
"commit the try-to-unmap of a lazyfree folio". The "commit" comes in because
the pte clearing has already happened, so now we are deciding whether to
back off and restore the ptes.
> anything...
>
>> + struct folio *folio, unsigned long address, pte_t *ptep,
>> + pte_t pteval, long nr_pages)
>
> That long nr_pages is really grating now...
Reading the discussion on patch 1, I'll convert this to unsigned long.
>
>> +{
>
> Come on Dev, it's 2026, why on earth are you returning an integer and not a
> bool?
>
> Also it would make sense for this to return false if something breaks, otherwise
> true.
Yes, I was confused about which of the options to choose :). Since the
function does a lot more than just test some condition (which is what
boolean functions usually do), I felt weird returning a bool.
But yeah, alright, I'll convert this to bool.
>
>> + struct mm_struct *mm = vma->vm_mm;
>> + int ref_count, map_count;
>> +
>> + /*
>> + * Synchronize with gup_pte_range():
>> + * - clear PTE; barrier; read refcount
>> + * - inc refcount; barrier; read PTE
>> + */
>> + smp_mb();
>> +
>> + ref_count = folio_ref_count(folio);
>> + map_count = folio_mapcount(folio);
>> +
>> + /*
>> + * Order reads for page refcount and dirty flag
>> + * (see comments in __remove_mapping()).
>> + */
>> + smp_rmb();
>> +
>> + if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
>> + /*
>> + * redirtied either using the page table or a previously
>> + * obtained GUP reference.
>> + */
>> + set_ptes(mm, address, ptep, pteval, nr_pages);
>> + folio_set_swapbacked(folio);
>> + return 1;
>> + }
>> +
>> + if (ref_count != 1 + map_count) {
>> + /*
>> + * Additional reference. Could be a GUP reference or any
>> + * speculative reference. GUP users must mark the folio
>> + * dirty if there was a modification. This folio cannot be
>> + * reclaimed right now either way, so act just like nothing
>> + * happened.
>> + * We'll come back here later and detect if the folio was
>> + * dirtied when the additional reference is gone.
>> + */
>> + set_ptes(mm, address, ptep, pteval, nr_pages);
>> + return 1;
>> + }
>> +
>> + add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
>> + return 0;
>> +}
>> +
>> /*
>> * @arg: enum ttu_flags will be passed to this argument
>> */
>> @@ -2227,46 +2278,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>
>> /* MADV_FREE page check */
>> if (!folio_test_swapbacked(folio)) {
>> - int ref_count, map_count;
>> -
>> - /*
>> - * Synchronize with gup_pte_range():
>> - * - clear PTE; barrier; read refcount
>> - * - inc refcount; barrier; read PTE
>> - */
>> - smp_mb();
>> -
>> - ref_count = folio_ref_count(folio);
>> - map_count = folio_mapcount(folio);
>> -
>> - /*
>> - * Order reads for page refcount and dirty flag
>> - * (see comments in __remove_mapping()).
>> - */
>> - smp_rmb();
>> -
>> - if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
>> - /*
>> - * redirtied either using the page table or a previously
>> - * obtained GUP reference.
>> - */
>> - set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
>> - folio_set_swapbacked(folio);
>> + if (commit_ttu_lazyfree_folio(vma, folio, address,
>> + pvmw.pte, pteval,
>> + nr_pages))
>
> With above corrections this would be:
>
> if (!lazyfree_range(vma, folio, address, pvmw.pte, pteval, nr_pages))
> ...
>
>> goto walk_abort;
>> - } else if (ref_count != 1 + map_count) {
>> - /*
>> - * Additional reference. Could be a GUP reference or any
>> - * speculative reference. GUP users must mark the folio
>> - * dirty if there was a modification. This folio cannot be
>> - * reclaimed right now either way, so act just like nothing
>> - * happened.
>> - * We'll come back here later and detect if the folio was
>> - * dirtied when the additional reference is gone.
>> - */
>> - set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
>> - goto walk_abort;
>> - }
>> - add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
>> goto discard;
>> }
>>
>> --
>> 2.34.1
>>
>
> Thanks, Lorenzo
>
* Re: [PATCH 2/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
2026-03-10 8:39 ` Lorenzo Stoakes (Oracle)
@ 2026-03-10 8:43 ` Dev Jain
0 siblings, 0 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-10 8:43 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On 10/03/26 2:09 pm, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 10, 2026 at 02:01:44PM +0530, Dev Jain wrote:
>>
>>
>> On 10/03/26 1:40 pm, Lorenzo Stoakes (Oracle) wrote:
>>> On Tue, Mar 10, 2026 at 01:00:06PM +0530, Dev Jain wrote:
>>>> Initialize nr_pages to 1 at the start of the loop, similar to what is being
>>>> done in folio_referenced_one(). It may happen that the nr_pages computed
>>>> from a previous call to folio_unmap_pte_batch gets reused without again
>>>> going through folio_unmap_pte_batch, messing up things. Although, I don't
>>>> think there is any bug right now; a bug would have been there, if in the
>>>> same instance of a call to try_to_unmap_one, we end up in the
>>>> pte_present(pteval) branch, then in the else branch doing pte_clear() for
>>>> device-exclusive ptes. This means that a lazyfree folio has some present
>>>> entries and some device entries mapping it. Since a pte being
>>>> device-exclusive means that a GUP reference on the underlying folio is
>>>> held, the lazyfree unmapping path upon witnessing this will abort
>>>> try_to_unmap_one.
>>>
>>> Dude, paragraphs. PARAGRAPHS :) this is one dense set of words.
>>>
>>> It's also a very compressed 'stream of consciousness' hard-to-read block here.
>>
>> Sure :) I'll try to break this down.
>>
>>>
>>> I'm not sure it's really worth having this as a separate commit either, it's
>>> pretty trivial.
>>
>> Hmm...well, as I explain above, it's not trivial for me :) it is difficult
>> for me to reason here whether nr_pages can be reused without a reset in
>> a future iteration.
>
> OK you can have it be separate, let's just really clean up the commit message
> then please :)
Okay, thanks.
>
> Thanks, Lorenzo
* Re: [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable
2026-03-10 8:27 ` Kairui Song
@ 2026-03-10 8:46 ` Dev Jain
0 siblings, 0 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-10 8:46 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, weixugc, ljs,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe, baohua,
youngjun.park, ziy, kas, willy, yuzhao, linux-mm, linux-kernel,
ryan.roberts, anshuman.khandual
On 10/03/26 1:57 pm, Kairui Song wrote:
> On Tue, Mar 10, 2026 at 3:36 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>> Teach folio_dup_swap to handle a batch of consecutive pages. Note that
>> folio_dup_swap already can handle a subset of this: nr_pages == 1 and
>> nr_pages == folio_nr_pages(folio). Generalize this to any nr_pages.
>
> Thanks a lot for doing this. I was thinking it's about time we respin
> the batch unmapping of anon folios idea. Barry tried that before with
> an RFC, and now batching from swap side is easier, so some parts can
> be done cleaner.
>
>> Currently we have a not-so-nice logic of passing in subpage == NULL if
>> we mean to exercise the logic on the entire folio, and subpage != NULL if
>> we want to exercise the logic on only that subpage. Remove this
>> indirection, and explicitly pass subpage != NULL, and the number of
>> pages required.
>
> I was hoping most callers will just use the whole folio, but after
> checking your code, yeah, using explicit subpage and nr does fit the
> other parts better.
>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 915bc93964dbd..eaf61ae6c3817 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1738,7 +1738,8 @@ int folio_alloc_swap(struct folio *folio)
>> /**
>> * folio_dup_swap() - Increase swap count of swap entries of a folio.
>> * @folio: folio with swap entries bounded.
>> - * @subpage: if not NULL, only increase the swap count of this subpage.
>> + * @subpage: Increase the swap count of this subpage till nr number of
>> + * pages forward.
>
> The new nr_pages parameter isn't documented in the kernel-doc comment?
Oops, will add this in v2, thanks.
* Re: [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable
2026-03-10 7:30 ` [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable Dev Jain
2026-03-10 8:27 ` Kairui Song
@ 2026-03-10 8:49 ` Lorenzo Stoakes (Oracle)
2026-03-11 5:42 ` Dev Jain
2026-03-18 0:20 ` kernel test robot
2 siblings, 1 reply; 46+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-10 8:49 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 01:00:10PM +0530, Dev Jain wrote:
> Teach folio_dup_swap to handle a batch of consecutive pages. Note that
> folio_dup_swap already can handle a subset of this: nr_pages == 1 and
> nr_pages == folio_nr_pages(folio). Generalize this to any nr_pages.
>
> Currently we have a not-so-nice logic of passing in subpage == NULL if
> we mean to exercise the logic on the entire folio, and subpage != NULL if
> we want to exercise the logic on only that subpage. Remove this
> indirection, and explicitly pass subpage != NULL, and the number of
> pages required.
You've made the interface more confusing? Now we can update multiple subpages
but specify only one? :)
Let's try to actually refactor this into something sane... see below.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/rmap.c | 2 +-
> mm/shmem.c | 2 +-
> mm/swap.h | 5 +++--
> mm/swapfile.c | 12 +++++-------
> 4 files changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index dd638429c963e..f6d5b187cf09b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2282,7 +2282,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> goto discard;
> }
>
> - if (folio_dup_swap(folio, subpage) < 0) {
> + if (folio_dup_swap(folio, subpage, 1) < 0) {
> set_pte_at(mm, address, pvmw.pte, pteval);
> goto walk_abort;
> }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 5e7dcf5bc5d3c..86ee34c9b40b3 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1695,7 +1695,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> spin_unlock(&shmem_swaplist_lock);
> }
>
> - folio_dup_swap(folio, NULL);
> + folio_dup_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
> shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
>
> BUG_ON(folio_mapped(folio));
> diff --git a/mm/swap.h b/mm/swap.h
> index a77016f2423b9..d9cb58ebbddd1 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -206,7 +206,7 @@ extern int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp);
> * folio_put_swap(): does the opposite thing of folio_dup_swap().
> */
> int folio_alloc_swap(struct folio *folio);
> -int folio_dup_swap(struct folio *folio, struct page *subpage);
> +int folio_dup_swap(struct folio *folio, struct page *subpage, unsigned int nr_pages);
> void folio_put_swap(struct folio *folio, struct page *subpage);
>
> /* For internal use */
> @@ -390,7 +390,8 @@ static inline int folio_alloc_swap(struct folio *folio)
> return -EINVAL;
> }
>
> -static inline int folio_dup_swap(struct folio *folio, struct page *page)
> +static inline int folio_dup_swap(struct folio *folio, struct page *page,
> + unsigned int nr_pages)
> {
> return -EINVAL;
> }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 915bc93964dbd..eaf61ae6c3817 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1738,7 +1738,8 @@ int folio_alloc_swap(struct folio *folio)
> /**
> * folio_dup_swap() - Increase swap count of swap entries of a folio.
> * @folio: folio with swap entries bounded.
> - * @subpage: if not NULL, only increase the swap count of this subpage.
> + * @subpage: Increase the swap count of this subpage till nr number of
> + * pages forward.
(Obviously also Kairui's point about missing entry in kdoc)
This is REALLY confusing sorry. And this interface is just a horror show.
Before we had subpage == only increase the swap count of the subpage.
Now subpage = the first subpage at which we do that? Please, no.
You just need to rework this interface in general, this is a hack.
Something like:
int __folio_dup_swap(struct folio *folio, unsigned int subpage_start_index,
unsigned int nr_subpages)
{
...
}
...
int folio_dup_swap_subpage(struct folio *folio, struct page *subpage)
{
return __folio_dup_swap(folio, folio_page_idx(folio, subpage), 1);
}
int folio_dup_swap(struct folio *folio)
{
return __folio_dup_swap(folio, 0, folio_nr_pages(folio));
}
Or something like that.
We're definitely _not_ keeping the subpage parameter like that and hacking on
batching, PLEASE.
> *
> * Typically called when the folio is unmapped and have its swap entry to
> * take its place: Swap entries allocated to a folio has count == 0 and pinned
> @@ -1752,18 +1753,15 @@ int folio_alloc_swap(struct folio *folio)
> * swap_put_entries_direct on its swap entry before this helper returns, or
> * the swap count may underflow.
> */
> -int folio_dup_swap(struct folio *folio, struct page *subpage)
> +int folio_dup_swap(struct folio *folio, struct page *subpage,
> + unsigned int nr_pages)
> {
> swp_entry_t entry = folio->swap;
> - unsigned long nr_pages = folio_nr_pages(folio);
>
> VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
>
> - if (subpage) {
> - entry.val += folio_page_idx(folio, subpage);
> - nr_pages = 1;
> - }
> + entry.val += folio_page_idx(folio, subpage);
>
> return swap_dup_entries_cluster(swap_entry_to_info(entry),
> swp_offset(entry), nr_pages);
> --
> 2.34.1
>
Thanks, Lorenzo
* Re: [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable
2026-03-10 8:29 ` Kairui Song
@ 2026-03-10 8:50 ` Dev Jain
0 siblings, 0 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-10 8:50 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, weixugc, ljs,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe, baohua,
youngjun.park, ziy, kas, willy, yuzhao, linux-mm, linux-kernel,
ryan.roberts, anshuman.khandual
On 10/03/26 1:59 pm, Kairui Song wrote:
> On Tue, Mar 10, 2026 at 3:47 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>> Teach folio_put_swap to handle a batch of consecutive pages. Note that
>> folio_put_swap already can handle a subset of this: nr_pages == 1 and
>> nr_pages == folio_nr_pages(folio). Generalize this to any nr_pages.
>>
>> Currently we have a not-so-nice logic of passing in subpage == NULL if
>> we mean to exercise the logic on the entire folio, and subpage != NULL if
>> we want to exercise the logic on only that subpage. Remove this
>> indirection, and explicitly pass subpage != NULL, and the number of
>> pages required.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> mm/memory.c | 6 +++---
>> mm/rmap.c | 4 ++--
>> mm/shmem.c | 6 +++---
>> mm/swap.h | 5 +++--
>> mm/swapfile.c | 13 +++++--------
>> 5 files changed, 16 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 768646c0b3b6a..8249a9b7083ab 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5002,7 +5002,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> if (unlikely(folio != swapcache)) {
>> folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>> folio_add_lru_vma(folio, vma);
>> - folio_put_swap(swapcache, NULL);
>> + folio_put_swap(swapcache, folio_page(swapcache, 0), folio_nr_pages(swapcache));
>> } else if (!folio_test_anon(folio)) {
>> /*
>> * We currently only expect !anon folios that are fully
>> @@ -5011,12 +5011,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
>> VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
>> folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
>> - folio_put_swap(folio, NULL);
>> + folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
>> } else {
>> VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
>> folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
>> rmap_flags);
>> - folio_put_swap(folio, nr_pages == 1 ? page : NULL);
>> + folio_put_swap(folio, page, nr_pages);
>> }
>>
>> VM_BUG_ON(!folio_test_anon(folio) ||
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index f6d5b187cf09b..42f6b00cced01 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2293,7 +2293,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> * so we'll not check/care.
>> */
>> if (arch_unmap_one(mm, vma, address, pteval) < 0) {
>> - folio_put_swap(folio, subpage);
>> + folio_put_swap(folio, subpage, 1);
>> set_pte_at(mm, address, pvmw.pte, pteval);
>> goto walk_abort;
>> }
>> @@ -2301,7 +2301,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> /* See folio_try_share_anon_rmap(): clear PTE first. */
>> if (anon_exclusive &&
>> folio_try_share_anon_rmap_pte(folio, subpage)) {
>> - folio_put_swap(folio, subpage);
>> + folio_put_swap(folio, subpage, 1);
>> set_pte_at(mm, address, pvmw.pte, pteval);
>> goto walk_abort;
>> }
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 86ee34c9b40b3..d9d216ea28ecb 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1716,7 +1716,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>> /* Swap entry might be erased by racing shmem_free_swap() */
>> if (!error) {
>> shmem_recalc_inode(inode, 0, -nr_pages);
>> - folio_put_swap(folio, NULL);
>> + folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
>
> I just realized that we already have a nr_pages variable available
> here, maybe you can just use that?
>
> Feel free to ignore this if it might touch more code.
>
>> }
>>
>> /*
>> @@ -2196,7 +2196,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>
>> nr_pages = folio_nr_pages(folio);
>> folio_wait_writeback(folio);
>> - folio_put_swap(folio, NULL);
>> + folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
>> swap_cache_del_folio(folio);
>> /*
>> * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
>> @@ -2426,7 +2426,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>> if (sgp == SGP_WRITE)
>> folio_mark_accessed(folio);
>>
>> - folio_put_swap(folio, NULL);
>> + folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
>
> Same here, nr_pages seems good enough?
Yup, thanks for your observation.
* Re: [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable
2026-03-10 7:30 ` [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable Dev Jain
2026-03-10 8:29 ` Kairui Song
@ 2026-03-10 8:55 ` Lorenzo Stoakes (Oracle)
2026-03-18 1:04 ` kernel test robot
2 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-10 8:55 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 01:00:11PM +0530, Dev Jain wrote:
> Teach folio_put_swap to handle a batch of consecutive pages. Note that
> folio_put_swap already can handle a subset of this: nr_pages == 1 and
> nr_pages == folio_nr_pages(folio). Generalize this to any nr_pages.
>
> Currently we have a not-so-nice logic of passing in subpage == NULL if
> we mean to exercise the logic on the entire folio, and subpage != NULL if
> we want to exercise the logic on only that subpage. Remove this
> indirection, and explicitly pass subpage != NULL, and the number of
> pages required.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> mm/memory.c | 6 +++---
> mm/rmap.c | 4 ++--
> mm/shmem.c | 6 +++---
> mm/swap.h | 5 +++--
> mm/swapfile.c | 13 +++++--------
> 5 files changed, 16 insertions(+), 18 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 768646c0b3b6a..8249a9b7083ab 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5002,7 +5002,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (unlikely(folio != swapcache)) {
> folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> folio_add_lru_vma(folio, vma);
> - folio_put_swap(swapcache, NULL);
> + folio_put_swap(swapcache, folio_page(swapcache, 0), folio_nr_pages(swapcache));
Lord above HELPER. HELPER :) please.
I think in general, let's have the same refactoring in folio_put_swap() as I
suggested for folio_dup_swap().
I'm not sure where this hellish subpage interface came from, I mean there must
be a good reason but it seems kinda horrible.
> } else if (!folio_test_anon(folio)) {
> /*
> * We currently only expect !anon folios that are fully
> @@ -5011,12 +5011,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
> VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
> folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
> - folio_put_swap(folio, NULL);
> + folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
> } else {
> VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
> folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
> rmap_flags);
> - folio_put_swap(folio, nr_pages == 1 ? page : NULL);
> + folio_put_swap(folio, page, nr_pages);
I'm confused as to why some callers use folio_nr_pages() and others nr_pages...
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f6d5b187cf09b..42f6b00cced01 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2293,7 +2293,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> * so we'll not check/care.
> */
> if (arch_unmap_one(mm, vma, address, pteval) < 0) {
> - folio_put_swap(folio, subpage);
> + folio_put_swap(folio, subpage, 1);
Again, as with the folio_dup_swap() refactoring I suggested in previous patch,
something like folio_dup_swap_subpage() would be good here right?
Like folio_put_swap_subpage(folio, subpage)...
> set_pte_at(mm, address, pvmw.pte, pteval);
> goto walk_abort;
> }
> @@ -2301,7 +2301,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> /* See folio_try_share_anon_rmap(): clear PTE first. */
> if (anon_exclusive &&
> folio_try_share_anon_rmap_pte(folio, subpage)) {
> - folio_put_swap(folio, subpage);
> + folio_put_swap(folio, subpage, 1);
> set_pte_at(mm, address, pvmw.pte, pteval);
> goto walk_abort;
> }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 86ee34c9b40b3..d9d216ea28ecb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1716,7 +1716,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> /* Swap entry might be erased by racing shmem_free_swap() */
> if (!error) {
> shmem_recalc_inode(inode, 0, -nr_pages);
> - folio_put_swap(folio, NULL);
> + folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
> }
>
> /*
> @@ -2196,7 +2196,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>
> nr_pages = folio_nr_pages(folio);
> folio_wait_writeback(folio);
> - folio_put_swap(folio, NULL);
> + folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
> swap_cache_del_folio(folio);
> /*
> * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
> @@ -2426,7 +2426,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> if (sgp == SGP_WRITE)
> folio_mark_accessed(folio);
>
> - folio_put_swap(folio, NULL);
> + folio_put_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
Again, for emphasis, please do not repeatedly open code the exact same thing all
over the place, 'don't repeat yourself' is a really good principle in
programming in general. Now if one of these callers gets updated and the others
not we're in a mess.
Abstract this please :)
> swap_cache_del_folio(folio);
> folio_mark_dirty(folio);
> put_swap_device(si);
> diff --git a/mm/swap.h b/mm/swap.h
> index d9cb58ebbddd1..73fd9faa67608 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -207,7 +207,7 @@ extern int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp);
> */
> int folio_alloc_swap(struct folio *folio);
> int folio_dup_swap(struct folio *folio, struct page *subpage, unsigned int nr_pages);
> -void folio_put_swap(struct folio *folio, struct page *subpage);
> +void folio_put_swap(struct folio *folio, struct page *subpage, unsigned int nr_pages);
>
> /* For internal use */
> extern void __swap_cluster_free_entries(struct swap_info_struct *si,
> @@ -396,7 +396,8 @@ static inline int folio_dup_swap(struct folio *folio, struct page *page,
> return -EINVAL;
> }
>
> -static inline void folio_put_swap(struct folio *folio, struct page *page)
> +static inline void folio_put_swap(struct folio *folio, struct page *page,
> + unsigned int nr_pages)
> {
> }
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index eaf61ae6c3817..c66aa6d15d479 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1770,25 +1770,22 @@ int folio_dup_swap(struct folio *folio, struct page *subpage,
> /**
> * folio_put_swap() - Decrease swap count of swap entries of a folio.
> * @folio: folio with swap entries bounded, must be in swap cache and locked.
> - * @subpage: if not NULL, only decrease the swap count of this subpage.
> + * @subpage: decrease the swap count of this subpage till nr_pages.
Again no nr_pages entry.
> *
> * This won't free the swap slots even if swap count drops to zero, they are
> * still pinned by the swap cache. User may call folio_free_swap to free them.
> * Context: Caller must ensure the folio is locked and in the swap cache.
> */
> -void folio_put_swap(struct folio *folio, struct page *subpage)
> +void folio_put_swap(struct folio *folio, struct page *subpage,
> + unsigned int nr_pages)
> {
> swp_entry_t entry = folio->swap;
> - unsigned long nr_pages = folio_nr_pages(folio);
> struct swap_info_struct *si = __swap_entry_to_info(entry);
>
> VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
>
> - if (subpage) {
> - entry.val += folio_page_idx(folio, subpage);
> - nr_pages = 1;
> - }
> + entry.val += folio_page_idx(folio, subpage);
>
> swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false);
And yeah exact same comments re: refactoring as per folio_dup_swap().
> }
> @@ -2334,7 +2331,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> new_pte = pte_mkuffd_wp(new_pte);
> setpte:
> set_pte_at(vma->vm_mm, addr, pte, new_pte);
> - folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry)));
> + folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry)), 1);
> out:
> if (pte)
> pte_unmap_unlock(pte, ptl);
> --
> 2.34.1
>
Thanks, Lorenzo
* Re: [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping
2026-03-10 8:02 ` [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Lorenzo Stoakes (Oracle)
@ 2026-03-10 9:28 ` Dev Jain
0 siblings, 0 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-10 9:28 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On 10/03/26 1:32 pm, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 10, 2026 at 01:00:04PM +0530, Dev Jain wrote:
>> Speed up unmapping of anonymous large folios by clearing the ptes, and
>> setting swap ptes, in one go.
>>
>> The following benchmark (stolen from Barry at [1]) is used to measure the
>> time taken to swapout 256M worth of memory backed by 64K large folios:
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>> #include <string.h>
>> #include <time.h>
>> #include <unistd.h>
>> #include <errno.h>
>>
>> #define SIZE_MB 256
>> #define SIZE_BYTES (SIZE_MB * 1024 * 1024)
>>
>> int main() {
>> void *addr = mmap(NULL, SIZE_BYTES, PROT_READ | PROT_WRITE,
>> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> if (addr == MAP_FAILED) {
>> perror("mmap failed");
>> return 1;
>> }
>>
>> memset(addr, 0, SIZE_BYTES);
>>
>> struct timespec start, end;
>> clock_gettime(CLOCK_MONOTONIC, &start);
>>
>> if (madvise(addr, SIZE_BYTES, MADV_PAGEOUT) != 0) {
>> perror("madvise(MADV_PAGEOUT) failed");
>> munmap(addr, SIZE_BYTES);
>> return 1;
>> }
>>
>> clock_gettime(CLOCK_MONOTONIC, &end);
>>
>> long duration_ns = (end.tv_sec - start.tv_sec) * 1e9 +
>> (end.tv_nsec - start.tv_nsec);
>> printf("madvise(MADV_PAGEOUT) took %ld ns (%.3f ms)\n",
>> duration_ns, duration_ns / 1e6);
>>
>> munmap(addr, SIZE_BYTES);
>> return 0;
>> }
>>
>> On arm64, only showing one of the middle values in the distribution:
>>
>
> This doesn't seem very statistically valid.
>
> How about you give median, stddev etc.? Variance matters too.
Okay.
>
>> without patch:
>> madvise(MADV_PAGEOUT) took 52192959 ns (52.193 ms)
>>
>> with patch:
>> madvise(MADV_PAGEOUT) took 26676625 ns (26.677 ms)
>
> You have a habit of only giving data on arm64, and not mentioning whether you've
> tested on any other arch/setup.
I did do an x86 build, forgot to mention that.
I didn't do the numbers thinking this patchset is quite generic and has got
nothing to do with the arm64 cont bit - but arguably I should have.
>
> I've commented on this before so I'm a bit disappointed you've done the exact
> same thing here again. Especially since you've previously introduced regressions
> this way.
>
> Please can you test this on (hardware!) x86-64 _at least_ as well and confirm
> you aren't regressing anything for 4K pages?
Lemme go and manage that :)
>
>>
>>
>> [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/
>>
>> ---
>> Based on mm-unstable bb420884e9e0. mm-selftests pass.
>>
>> Dev Jain (9):
>> mm/rmap: make nr_pages signed in try_to_unmap_one
>> mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
>> mm/rmap: refactor lazyfree unmap commit path to
>> commit_ttu_lazyfree_folio()
>> mm/memory: Batch set uffd-wp markers during zapping
>> mm/rmap: batch unmap folios belonging to uffd-wp VMAs
>> mm/swapfile: Make folio_dup_swap batchable
>> mm/swapfile: Make folio_put_swap batchable
>> mm/rmap: introduce folio_try_share_anon_rmap_ptes
>> mm/rmap: enable batch unmapping of anonymous folios
>>
>> include/linux/mm_inline.h | 37 +++--
>> include/linux/page-flags.h | 11 ++
>> include/linux/rmap.h | 38 ++++-
>> mm/internal.h | 26 ++++
>> mm/memory.c | 26 +---
>> mm/mprotect.c | 17 ---
>> mm/rmap.c | 274 ++++++++++++++++++++++++-------------
>> mm/shmem.c | 8 +-
>> mm/swap.h | 10 +-
>> mm/swapfile.c | 25 ++--
>> 10 files changed, 298 insertions(+), 174 deletions(-)
>>
>> --
>> 2.34.1
>>
* Re: [PATCH 8/9] mm/rmap: introduce folio_try_share_anon_rmap_ptes
2026-03-10 7:30 ` [PATCH 8/9] mm/rmap: introduce folio_try_share_anon_rmap_ptes Dev Jain
@ 2026-03-10 9:38 ` Lorenzo Stoakes (Oracle)
2026-03-11 8:09 ` Dev Jain
0 siblings, 1 reply; 46+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-10 9:38 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 01:00:12PM +0530, Dev Jain wrote:
> In the quest of enabling batched unmapping of anonymous folios, we need to
> handle the sharing of exclusive pages. Hence, a batched version of
> folio_try_share_anon_rmap_pte is required.
>
> Currently, the sole purpose of nr_pages in __folio_try_share_anon_rmap is
> to do some rmap sanity checks. Add helpers to set and clear the
> PageAnonExclusive bit on a batch of nr_pages. Note that
> __folio_try_share_anon_rmap can receive nr_pages == HPAGE_PMD_NR from the
> PMD path, but currently we only clear the bit on the head page. Retain this
> behaviour by setting nr_pages = 1 in case the caller is
> folio_try_share_anon_rmap_pmd.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> include/linux/page-flags.h | 11 +++++++++++
> include/linux/rmap.h | 28 ++++++++++++++++++++++++++--
> mm/rmap.c | 2 +-
> 3 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 0e03d816e8b9d..1d74ed9a28c41 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -1178,6 +1178,17 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
> __clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
> }
>
> +static __always_inline void ClearPagesAnonExclusive(struct page *page,
> + unsigned int nr)
You're randomly moving between nr and nr_pages, can we just consistently use
nr_pages please.
> +{
> + for (;;) {
> + ClearPageAnonExclusive(page);
> + if (--nr == 0)
You really require nr != 0 here, otherwise you're going to be clearing 4
billion pages :)
> + break;
> + ++page;
> + }
> +}
Can we put this in mm.h or somewhere else please, and can we do away with this
HorribleNamingConvention, this is new, we can 'get away' with making it
something sensible :)
I wonder if we shouldn't also add a folio pointer here, and some
VM_WARN_ON_ONCE()'s. Like:
static inline void folio_clear_page_batch(struct folio *folio,
struct page *first_subpage,
unsigned int nr_pages)
{
struct page *subpage = first_subpage;
VM_WARN_ON_ONCE(!nr_pages);
VM_WARN_ON_ONCE(... check first_subpage in folio ...);
VM_WARN_ON_ONCE(... check first_subpage -> first_subpage + nr_pages in folio ...);
while (nr_pages--)
ClearPageAnonExclusive(subpage++);
}
> +
> #ifdef CONFIG_MMU
> #define __PG_MLOCKED (1UL << PG_mlocked)
> #else
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 1b7720c66ac87..7a67776dca3fe 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -712,9 +712,13 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
> VM_WARN_ON_FOLIO(!PageAnonExclusive(page), folio);
> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>
> + /* We only clear anon-exclusive from head page of PMD folio */
Is this accurate? David? I thought anon exclusive was per-subpage for any large
folio...?
If we're explicitly doing this for some reason here, then why introduce it?
> + if (level == PGTABLE_LEVEL_PMD)
> + nr_pages = 1;
> +
> /* device private folios cannot get pinned via GUP. */
> if (unlikely(folio_is_device_private(folio))) {
> - ClearPageAnonExclusive(page);
> + ClearPagesAnonExclusive(page, nr_pages);
I really kind of hate this 'we are looking at subpage X with variable page in
folio Y, but we don't mention Y' thing. It's super confusing that we have a
pointer to a thing which sometimes we deref and treat as a value we care about
and sometimes treat as an array.
This pattern exists throughout all the batched stuff and I kind of hate it
everywhere.
I guess the batching means that we are looking at a sub-folio range.
If C had a better type system we could somehow have a type that encoded this,
but it doesn't :>)
But I wonder if we shouldn't just go ahead and rename page -> pages and be
consistent about this?
> return 0;
> }
>
> @@ -766,7 +770,7 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
>
> if (unlikely(folio_maybe_dma_pinned(folio)))
> return -EBUSY;
> - ClearPageAnonExclusive(page);
> + ClearPagesAnonExclusive(page, nr_pages);
>
> /*
> * This is conceptually a smp_wmb() paired with the smp_rmb() in
> @@ -804,6 +808,26 @@ static inline int folio_try_share_anon_rmap_pte(struct folio *folio,
> return __folio_try_share_anon_rmap(folio, page, 1, PGTABLE_LEVEL_PTE);
> }
>
> +/**
> + * folio_try_share_anon_rmap_ptes - try marking exclusive anonymous pages
> + * mapped by PTEs possibly shared to prepare
> + * for KSM or temporary unmapping
This description is very confusing. 'Try marking exclusive anonymous pages
[... marking them as what?] mapped by PTEs[, or (]possibly shared[, or )] to
prepare for KSM[under what circumstances? Why mention KSM here?] or temporary
unmapping [why temporary?]
OK I think you mean to say 'marking' them as 'possibly' shared.
But really by 'shared' you mean clearing anon exclusive right? So maybe the
function name and description should reference that instead.
But this needs clarifying. This isn't an exercise in minimum number of words to
describe the function.
Ohhh now I see this is what the comment is in folio_try_share_anon_rmap_pte() :P
Well, I wish we could update the original too ;) but OK this is fine as-is to
match that then.
> + * @folio: The folio to share a mapping of
> + * @page: The first mapped exclusive page of the batch
> + * @nr_pages: The number of pages to share (batch size)
> + *
> + * See folio_try_share_anon_rmap_pte for full description.
> + *
> + * Context: The caller needs to hold the page table lock and has to have the
> + * page table entries cleared/invalidated. Those PTEs used to map consecutive
> + * pages of the folio passed here. The PTEs are all in the same PMD and VMA.
Can we VM_WARN_ON_ONCE() any of this? Not completely a necessity.
> + */
> +static inline int folio_try_share_anon_rmap_ptes(struct folio *folio,
> + struct page *page, unsigned int nr)
> +{
> + return __folio_try_share_anon_rmap(folio, page, nr, PGTABLE_LEVEL_PTE);
> +}
> +
> /**
> * folio_try_share_anon_rmap_pmd - try marking an exclusive anonymous page
> * range mapped by a PMD possibly shared to
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 42f6b00cced01..bba5b571946d8 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2300,7 +2300,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>
> /* See folio_try_share_anon_rmap(): clear PTE first. */
> if (anon_exclusive &&
> - folio_try_share_anon_rmap_pte(folio, subpage)) {
> + folio_try_share_anon_rmap_ptes(folio, subpage, 1)) {
I guess this is because you intend to make this batched later with >1, but I
don't see the point of adding this since folio_try_share_anon_rmap_pte() already
does what you're doing here.
So why not just change this when you actually batch?
Buuuut.... haven't you already changed this whole function to now 'jump'
ahead if batched? So why are we only specifying nr_pages = 1 here?
Honestly I think this function needs to be fully refactored away from the
appalling giant-ball-of-string mess it is now before we try to add in batching
to be honest.
Let's not accumulate more technical debt.
> folio_put_swap(folio, subpage, 1);
> set_pte_at(mm, address, pvmw.pte, pteval);
> goto walk_abort;
> --
> 2.34.1
>
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/9] mm/rmap: make nr_pages signed in try_to_unmap_one
2026-03-10 8:23 ` Dev Jain
@ 2026-03-10 12:40 ` Matthew Wilcox
2026-03-11 4:54 ` Dev Jain
0 siblings, 1 reply; 46+ messages in thread
From: Matthew Wilcox @ 2026-03-10 12:40 UTC (permalink / raw)
To: Dev Jain
Cc: David Hildenbrand (Arm), Lorenzo Stoakes (Oracle), akpm,
axelrasmussen, yuanchu, hughd, chrisl, kasong, weixugc,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe, baohua,
youngjun.park, ziy, kas, yuzhao, linux-mm, linux-kernel,
ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 01:53:21PM +0530, Dev Jain wrote:
> So when I was playing around with the code, I noticed that passing
> unsigned int nr_pages to add_mm_counter(-nr_pages) messes up things. Then
Using int (whether signed or unsigned) to store nr_pages is a bad idea.
Look how many people are asking about supporting PUD-sized folios.
On an ARM 64k PAGE_SIZE machine, that's 2^26 pages which is uncomfortably
close to 2^32. It'll only take one more level to exceed that, so, what,
five to ten more years?
Just use unsigned long everywhere now and save ourselves the grief later.
* Re: [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping
2026-03-10 7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
` (9 preceding siblings ...)
2026-03-10 8:02 ` [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Lorenzo Stoakes (Oracle)
@ 2026-03-10 12:59 ` Lance Yang
2026-03-11 8:11 ` Dev Jain
10 siblings, 1 reply; 46+ messages in thread
From: Lance Yang @ 2026-03-10 12:59 UTC (permalink / raw)
To: dev.jain
Cc: Liam.Howlett, akpm, anshuman.khandual, axelrasmussen, baohua,
baolin.wang, bhe, chrisl, david, harry.yoo, hughd, jannh, kas,
kasong, linux-kernel, linux-mm, ljs, mhocko, nphamcs, pfalcato,
riel, rppt, ryan.roberts, shikemeng, surenb, vbabka, weixugc,
willy, youngjun.park, yuanchu, yuzhao, ziy, Lance Yang
On Tue, Mar 10, 2026 at 01:00:04PM +0530, Dev Jain wrote:
>Speed up unmapping of anonymous large folios by clearing the ptes, and
>setting swap ptes, in one go.
>
>The following benchmark (stolen from Barry at [1]) is used to measure the
>time taken to swapout 256M worth of memory backed by 64K large folios:
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <time.h>
> #include <unistd.h>
> #include <errno.h>
>
> #define SIZE_MB 256
> #define SIZE_BYTES (SIZE_MB * 1024 * 1024)
>
> int main() {
> void *addr = mmap(NULL, SIZE_BYTES, PROT_READ | PROT_WRITE,
> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> if (addr == MAP_FAILED) {
> perror("mmap failed");
> return 1;
> }
>
> memset(addr, 0, SIZE_BYTES);
>
> struct timespec start, end;
> clock_gettime(CLOCK_MONOTONIC, &start);
>
> if (madvise(addr, SIZE_BYTES, MADV_PAGEOUT) != 0) {
> perror("madvise(MADV_PAGEOUT) failed");
> munmap(addr, SIZE_BYTES);
> return 1;
> }
>
> clock_gettime(CLOCK_MONOTONIC, &end);
>
> long duration_ns = (end.tv_sec - start.tv_sec) * 1e9 +
> (end.tv_nsec - start.tv_nsec);
> printf("madvise(MADV_PAGEOUT) took %ld ns (%.3f ms)\n",
> duration_ns, duration_ns / 1e6);
>
> munmap(addr, SIZE_BYTES);
> return 0;
> }
>
>On arm64, only showing one of the middle values in the distribution:
>
>without patch:
>madvise(MADV_PAGEOUT) took 52192959 ns (52.193 ms)
>
>with patch:
>madvise(MADV_PAGEOUT) took 26676625 ns (26.677 ms)
Good numbers! Just tested on x86 KVM with THP=never, no performance
regression observed.
Cheers,
Lance
* Re: [PATCH 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs
2026-03-10 8:34 ` Lorenzo Stoakes (Oracle)
@ 2026-03-10 23:32 ` Barry Song
2026-03-11 4:14 ` Barry Song
2026-03-11 4:56 ` Dev Jain
1 sibling, 1 reply; 46+ messages in thread
From: Barry Song @ 2026-03-10 23:32 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Dev Jain, akpm, axelrasmussen, yuanchu, david, hughd, chrisl,
kasong, weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
youngjun.park, ziy, kas, willy, yuzhao, linux-mm, linux-kernel,
ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 4:34 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Tue, Mar 10, 2026 at 01:00:09PM +0530, Dev Jain wrote:
> > The ptes are all the same w.r.t belonging to the same type of VMA, and
> > being marked with uffd-wp or all being not marked. Therefore we can batch
> > set uffd-wp markers through install_uffd_wp_ptes_if_needed, and enable
> > batched unmapping of folios belonging to uffd-wp VMAs by dropping that
> > condition from folio_unmap_pte_batch.
> >
> > It may happen that we don't batch over the entire folio in one go, in which
> > case, we must skip over the current batch. Add a helper to do that -
> > page_vma_mapped_walk_jump() will increment the relevant fields of pvmw
> > by nr pages.
> >
> > I think that we can get away with just incrementing pvmw->pte
> > and pvmw->address, since looking at the code in page_vma_mapped.c,
> > pvmw->pfn and pvmw->nr_pages are used in conjunction, and pvmw->pgoff
> > and pvmw->nr_pages (in vma_address_end()) are used in conjunction,
> > cancelling out the increment and decrement in the respective fields. But
> > let us not rely on the pvmw implementation and keep this simple.
>
> This isn't simple...
>
> >
> > Export this function to rmap.h to enable future reuse.
> >
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > ---
> > include/linux/rmap.h | 10 ++++++++++
> > mm/rmap.c | 8 +++-----
> > 2 files changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index 8dc0871e5f001..1b7720c66ac87 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -892,6 +892,16 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> > spin_unlock(pvmw->ptl);
> > }
> >
> > +static inline void page_vma_mapped_walk_jump(struct page_vma_mapped_walk *pvmw,
> > + unsigned int nr)
>
> unsigned long nr_pages... 'nr' is meaningless and you're mixing + matching types
> for no reason.
>
> > +{
> > + pvmw->pfn += nr;
> > + pvmw->nr_pages -= nr;
> > + pvmw->pgoff += nr;
> > + pvmw->pte += nr;
> > + pvmw->address += nr * PAGE_SIZE;
> > +}
>
> I absolutely hate this. It's extremely confusing, especially since you're now
> going from looking at 1 page to nr_pages - 1, jump doesn't really mean anything
> here, you're losing sight of the batch size and exposing a silly detail to the
> caller, and I really don't want to 'export' this at this time.
I’m fairly sure I raised the same concern when Dev first suggested this,
but somehow it seems my comment was completely overlooked. :-)
>
> If we must have this, can you please make it static in rmap.c at least for the
> time being.
>
> Or perhaps instead, have a batched variant of page_vma_mapped_walk(), like
> page_vma_mapped_walk_batch()?
Right now, for non-anon pages we face the same issues, but
page_vma_mapped_walk() can skip those PTEs once it finds that
nr - 1 PTEs are none.
next_pte:
do {
pvmw->address += PAGE_SIZE;
if (pvmw->address >= end)
return not_found(pvmw);
/* Did we cross page table boundary? */
if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
if (pvmw->ptl) {
spin_unlock(pvmw->ptl);
pvmw->ptl = NULL;
}
pte_unmap(pvmw->pte);
pvmw->pte = NULL;
pvmw->flags |= PVMW_PGTABLE_CROSSED;
goto restart;
}
pvmw->pte++;
} while (pte_none(ptep_get(pvmw->pte)));
The difference now is that swap entries cannot be skipped.
If we're trying to find `page_vma_mapped_walk_batch()`, I suppose
it could be like this?
bool page_vma_mapped_walk_batch(struct page_vma_mapped_walk *pvmw,
unsigned long nr)
{
...
}
static inline bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
{
return page_vma_mapped_walk_batch(pvmw, 1);
}
Thanks
barry
* Re: [PATCH 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs
2026-03-10 23:32 ` Barry Song
@ 2026-03-11 4:14 ` Barry Song
2026-03-11 4:52 ` Dev Jain
0 siblings, 1 reply; 46+ messages in thread
From: Barry Song @ 2026-03-11 4:14 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: Dev Jain, akpm, axelrasmussen, yuanchu, david, hughd, chrisl,
kasong, weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
youngjun.park, ziy, kas, willy, yuzhao, linux-mm, linux-kernel,
ryan.roberts, anshuman.khandual
On Wed, Mar 11, 2026 at 7:32 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Mar 10, 2026 at 4:34 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Tue, Mar 10, 2026 at 01:00:09PM +0530, Dev Jain wrote:
> > > The ptes are all the same w.r.t belonging to the same type of VMA, and
> > > being marked with uffd-wp or all being not marked. Therefore we can batch
> > > set uffd-wp markers through install_uffd_wp_ptes_if_needed, and enable
> > > batched unmapping of folios belonging to uffd-wp VMAs by dropping that
> > > condition from folio_unmap_pte_batch.
> > >
> > > It may happen that we don't batch over the entire folio in one go, in which
> > > case, we must skip over the current batch. Add a helper to do that -
> > > page_vma_mapped_walk_jump() will increment the relevant fields of pvmw
> > > by nr pages.
> > >
> > > I think that we can get away with just incrementing pvmw->pte
> > > and pvmw->address, since looking at the code in page_vma_mapped.c,
> > > pvmw->pfn and pvmw->nr_pages are used in conjunction, and pvmw->pgoff
> > > and pvmw->nr_pages (in vma_address_end()) are used in conjunction,
> > > cancelling out the increment and decrement in the respective fields. But
> > > let us not rely on the pvmw implementation and keep this simple.
> >
> > This isn't simple...
> >
> > >
> > > Export this function to rmap.h to enable future reuse.
> > >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > > include/linux/rmap.h | 10 ++++++++++
> > > mm/rmap.c | 8 +++-----
> > > 2 files changed, 13 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > > index 8dc0871e5f001..1b7720c66ac87 100644
> > > --- a/include/linux/rmap.h
> > > +++ b/include/linux/rmap.h
> > > @@ -892,6 +892,16 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> > > spin_unlock(pvmw->ptl);
> > > }
> > >
> > > +static inline void page_vma_mapped_walk_jump(struct page_vma_mapped_walk *pvmw,
> > > + unsigned int nr)
> >
> > unsigned long nr_pages... 'nr' is meaningless and you're mixing + matching types
> > for no reason.
> >
> > > +{
> > > + pvmw->pfn += nr;
> > > + pvmw->nr_pages -= nr;
> > > + pvmw->pgoff += nr;
> > > + pvmw->pte += nr;
> > > + pvmw->address += nr * PAGE_SIZE;
> > > +}
> >
> > I absolutely hate this. It's extremely confusing, especially since you're now
> > going from looking at 1 page to nr_pages - 1, jump doesn't really mean anything
> > here, you're losing sight of the batch size and exposing a silly detail to the
> > caller, and I really don't want to 'export' this at this time.
>
> I’m fairly sure I raised the same concern when Dev first suggested this,
> but somehow it seems my comment was completely overlooked. :-)
>
> >
> > If we must have this, can you please make it static in rmap.c at least for the
> > time being.
> >
> > Or perhaps instead, have a batched variant of page_vma_mapped_walk(), like
> > page_vma_mapped_walk_batch()?
>
> Right now, for non-anon pages we face the same issues, but
> page_vma_mapped_walk() can skip those PTEs once it finds that
> nr - 1 PTEs are none.
>
> next_pte:
> do {
> pvmw->address += PAGE_SIZE;
> if (pvmw->address >= end)
> return not_found(pvmw);
> /* Did we cross page table boundary? */
> if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
> if (pvmw->ptl) {
> spin_unlock(pvmw->ptl);
> pvmw->ptl = NULL;
> }
> pte_unmap(pvmw->pte);
> pvmw->pte = NULL;
> pvmw->flags |= PVMW_PGTABLE_CROSSED;
> goto restart;
> }
> pvmw->pte++;
> } while (pte_none(ptep_get(pvmw->pte)));
>
> The difference now is that swap entries cannot be skipped.
>
> If we're trying to find `page_vma_mapped_walk_batch()`, I suppose
> it could be like this?
>
> bool page_vma_mapped_walk_batch(struct page_vma_mapped_walk *pvmw,
> unsigned long nr)
> {
> ...
> }
>
> static inline bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> {
> return page_vma_mapped_walk_batch(pvmw, 1);
> }
Another approach might be to introduce a flag so that
page_vma_mapped_walk() knows we are doing batched unmaps
and can skip nr - 1 swap entries.
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8dc0871e5f00..bf03ae006366 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -856,6 +856,9 @@ struct page *make_device_exclusive(struct
mm_struct *mm, unsigned long addr,
/* Look for migration entries rather than present PTEs */
#define PVMW_MIGRATION (1 << 1)
+/* Batched unmap: skip swap entries. */
+#define PVMW_BATCH_UNMAP (1 << 2)
+
/* Result flags */
/* The page is mapped across page table boundary */
Thanks
Barry
* Re: [PATCH 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs
2026-03-11 4:14 ` Barry Song
@ 2026-03-11 4:52 ` Dev Jain
0 siblings, 0 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-11 4:52 UTC (permalink / raw)
To: Barry Song, Lorenzo Stoakes (Oracle)
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
youngjun.park, ziy, kas, willy, yuzhao, linux-mm, linux-kernel,
ryan.roberts, anshuman.khandual
On 11/03/26 9:44 am, Barry Song wrote:
> On Wed, Mar 11, 2026 at 7:32 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Tue, Mar 10, 2026 at 4:34 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>>>
>>> On Tue, Mar 10, 2026 at 01:00:09PM +0530, Dev Jain wrote:
>>>> The ptes are all the same w.r.t belonging to the same type of VMA, and
>>>> being marked with uffd-wp or all being not marked. Therefore we can batch
>>>> set uffd-wp markers through install_uffd_wp_ptes_if_needed, and enable
>>>> batched unmapping of folios belonging to uffd-wp VMAs by dropping that
>>>> condition from folio_unmap_pte_batch.
>>>>
>>>> It may happen that we don't batch over the entire folio in one go, in which
>>>> case, we must skip over the current batch. Add a helper to do that -
>>>> page_vma_mapped_walk_jump() will increment the relevant fields of pvmw
>>>> by nr pages.
>>>>
>>>> I think that we can get away with just incrementing pvmw->pte
>>>> and pvmw->address, since looking at the code in page_vma_mapped.c,
>>>> pvmw->pfn and pvmw->nr_pages are used in conjunction, and pvmw->pgoff
>>>> and pvmw->nr_pages (in vma_address_end()) are used in conjunction,
>>>> cancelling out the increment and decrement in the respective fields. But
>>>> let us not rely on the pvmw implementation and keep this simple.
>>>
>>> This isn't simple...
>>>
>>>>
>>>> Export this function to rmap.h to enable future reuse.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>> include/linux/rmap.h | 10 ++++++++++
>>>> mm/rmap.c | 8 +++-----
>>>> 2 files changed, 13 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index 8dc0871e5f001..1b7720c66ac87 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -892,6 +892,16 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>>>> spin_unlock(pvmw->ptl);
>>>> }
>>>>
>>>> +static inline void page_vma_mapped_walk_jump(struct page_vma_mapped_walk *pvmw,
>>>> + unsigned int nr)
>>>
>>> unsigned long nr_pages... 'nr' is meaningless and you're mixing + matching types
>>> for no reason.
>>>
>>>> +{
>>>> + pvmw->pfn += nr;
>>>> + pvmw->nr_pages -= nr;
>>>> + pvmw->pgoff += nr;
>>>> + pvmw->pte += nr;
>>>> + pvmw->address += nr * PAGE_SIZE;
>>>> +}
>>>
>>> I absolutely hate this. It's extremely confusing, especially since you're now
>>> going from looking at 1 page to nr_pages - 1, jump doesn't really mean anything
>>> here, you're losing sight of the batch size and exposing a silly detail to the
>>> caller, and I really don't want to 'export' this at this time.
>>
>> I’m fairly sure I raised the same concern when Dev first suggested this,
>> but somehow it seems my comment was completely overlooked. :-)
I do recall; perhaps I was too lazy to look at the pvmw code :) But I should
have just looked at this earlier, it's fairly simple. See below.
>>
>>>
>>> If we must have this, can you please make it static in rmap.c at least for the
>>> time being.
>>>
>>> Or perhaps instead, have a batched variant of page_vma_mapped_walk(), like
>>> page_vma_mapped_walk_batch()?
>>
>> Right now, for non-anon pages we face the same issues, but
>> page_vma_mapped_walk() can skip those PTEs once it finds that
>> nr - 1 PTEs are none.
>>
>> next_pte:
>> do {
>> pvmw->address += PAGE_SIZE;
>> if (pvmw->address >= end)
>> return not_found(pvmw);
>> /* Did we cross page table boundary? */
>> if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>> if (pvmw->ptl) {
>> spin_unlock(pvmw->ptl);
>> pvmw->ptl = NULL;
>> }
>> pte_unmap(pvmw->pte);
>> pvmw->pte = NULL;
>> pvmw->flags |= PVMW_PGTABLE_CROSSED;
>> goto restart;
>> }
>> pvmw->pte++;
>> } while (pte_none(ptep_get(pvmw->pte)));
>>
>> The difference now is that swap entries cannot be skipped.
>>
>> If we're trying to find `page_vma_mapped_walk_batch()`, I suppose
>> it could be like this?
>>
>> bool page_vma_mapped_walk_batch(struct page_vma_mapped_walk *pvmw,
>> unsigned long nr)
>> {
>> ...
>> }
>>
>> static inline bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>> {
>> return page_vma_mapped_walk_batch(pvmw, 1);
>> }
>
> Another approach might be to introduce a flag so that
> page_vma_mapped_walk() knows we are doing batched unmaps
> and can skip nr - 1 swap entries.
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 8dc0871e5f00..bf03ae006366 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -856,6 +856,9 @@ struct page *make_device_exclusive(struct
> mm_struct *mm, unsigned long addr,
> /* Look for migration entries rather than present PTEs */
> #define PVMW_MIGRATION (1 << 1)
>
> +/* Batched unmap: skip swap entries. */
> +#define PVMW_BATCH_UNMAP (1 << 2)
Exactly, I just also came up with the same solution and saw your reply :)
We can just name this PVMW_BATCH_PRESENT, with the comment saying
"Look for present entries", and fix the comment above PVMW_MIGRATION to
drop the "rather than present PTEs" part, because that is wrong: we currently
also look for softleaf entries by default.
> +
> /* Result flags */
>
> /* The page is mapped across page table boundary */
>
>
> Thanks
> Barry
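A hedged sketch of how that renaming might read (PVMW_BATCH_PRESENT is just
the name proposed in this thread, not an existing kernel symbol):

```c
/* Look for migration entries */
#define PVMW_MIGRATION		(1 << 1)

/* Batched walk: stop only at present entries, skipping swap entries */
#define PVMW_BATCH_PRESENT	(1 << 2)
```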
* Re: [PATCH 1/9] mm/rmap: make nr_pages signed in try_to_unmap_one
2026-03-10 12:40 ` Matthew Wilcox
@ 2026-03-11 4:54 ` Dev Jain
0 siblings, 0 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-11 4:54 UTC (permalink / raw)
To: Matthew Wilcox
Cc: David Hildenbrand (Arm), Lorenzo Stoakes (Oracle), akpm,
axelrasmussen, yuanchu, hughd, chrisl, kasong, weixugc,
Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe, baohua,
youngjun.park, ziy, kas, yuzhao, linux-mm, linux-kernel,
ryan.roberts, anshuman.khandual
On 10/03/26 6:10 pm, Matthew Wilcox wrote:
> On Tue, Mar 10, 2026 at 01:53:21PM +0530, Dev Jain wrote:
>> So when I was playing around with the code, I noticed that passing
>> unsigned int nr_pages to add_mm_counter(-nr_pages) messes up things. Then
>
> Using int (whether signed or unsigned) to store nr_pages is a bad idea.
> Look how many people are asking about supporting PUD-sized folios.
> On an ARM 64k PAGE_SIZE machine, that's 2^26 pages which is uncomfortably
> close to 2^32. It'll only take one more level to exceed that, so, what,
> five to ten more years?
>
> Just use unsigned long everywhere now and save ourselves the grief later.
Thanks, let us do this.
* Re: [PATCH 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs
2026-03-10 8:34 ` Lorenzo Stoakes (Oracle)
2026-03-10 23:32 ` Barry Song
@ 2026-03-11 4:56 ` Dev Jain
1 sibling, 0 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-11 4:56 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On 10/03/26 2:04 pm, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 10, 2026 at 01:00:09PM +0530, Dev Jain wrote:
>> The ptes are all the same w.r.t belonging to the same type of VMA, and
>> being marked with uffd-wp or all being not marked. Therefore we can batch
>> set uffd-wp markers through install_uffd_wp_ptes_if_needed, and enable
>> batched unmapping of folios belonging to uffd-wp VMAs by dropping that
>> condition from folio_unmap_pte_batch.
>>
>> It may happen that we don't batch over the entire folio in one go, in which
>> case, we must skip over the current batch. Add a helper to do that -
>> page_vma_mapped_walk_jump() will increment the relevant fields of pvmw
>> by nr pages.
>>
>> I think that we can get away with just incrementing pvmw->pte
>> and pvmw->address, since looking at the code in page_vma_mapped.c,
>> pvmw->pfn and pvmw->nr_pages are used in conjunction, and pvmw->pgoff
>> and pvmw->nr_pages (in vma_address_end()) are used in conjunction,
>> cancelling out the increment and decrement in the respective fields. But
>> let us not rely on the pvmw implementation and keep this simple.
>
> This isn't simple...
>
>>
>> Export this function to rmap.h to enable future reuse.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> include/linux/rmap.h | 10 ++++++++++
>> mm/rmap.c | 8 +++-----
>> 2 files changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 8dc0871e5f001..1b7720c66ac87 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -892,6 +892,16 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>> spin_unlock(pvmw->ptl);
>> }
>>
>> +static inline void page_vma_mapped_walk_jump(struct page_vma_mapped_walk *pvmw,
>> + unsigned int nr)
>
> unsigned long nr_pages... 'nr' is meaningless and you're mixing + matching types
> for no reason.
>
>> +{
>> + pvmw->pfn += nr;
>> + pvmw->nr_pages -= nr;
>> + pvmw->pgoff += nr;
>> + pvmw->pte += nr;
>> + pvmw->address += nr * PAGE_SIZE;
>> +}
>
> I absolutely hate this. It's extremely confusing, especially since you're now
> going from looking at 1 page to nr_pages - 1, jump doesn't really mean anything
> here, you're losing sight of the batch size and exposing a silly detail to the
> caller, and I really don't want to 'export' this at this time.
>
> If we must have this, can you please make it static in rmap.c at least for the
> time being.
>
> Or perhaps instead, have a batched variant of page_vma_mapped_walk(), like
> page_vma_mapped_walk_batch()?
>
> I think that makes a lot more sense...
>
> I mean I kind of hate the pvmw interface in general, this is a hack to handle
> batching clamped on to the side of it, let's figure out how to do this sensibly
> and do what's needed rather than adding yet more hacks-on-hacks please.
>
>> +
>> /**
>> * page_vma_mapped_walk_restart - Restart the page table walk.
>> * @pvmw: Pointer to struct page_vma_mapped_walk.
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index a7570cd037344..dd638429c963e 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1953,9 +1953,6 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>> if (pte_unused(pte))
>> return 1;
>>
>> - if (userfaultfd_wp(vma))
>> - return 1;
>> -
>> /*
>> * If unmap fails, we need to restore the ptes. To avoid accidentally
>> * upgrading write permissions for ptes that were not originally
>> @@ -2235,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> * we may want to replace a none pte with a marker pte if
>> * it's file-backed, so we don't lose the tracking info.
>> */
>> - install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, 1);
>> + install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, nr_pages);
>>
>> /* Update high watermark before we lower rss */
>> update_hiwater_rss(mm);
>> @@ -2359,8 +2356,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> * If we are sure that we batched the entire folio and cleared
>> * all PTEs, we can just optimize and stop right here.
>> */
>> - if (nr_pages == folio_nr_pages(folio))
>> + if (likely(nr_pages == folio_nr_pages(folio)))
>
> Please don't add random likely()'s based on what you think is likely(). This
> kind of thing should only be done based on profiling.
Okay.
>
>> goto walk_done;
>> + page_vma_mapped_walk_jump(&pvmw, nr_pages - 1);
>
> (You're now passing a signed long to an unsigned int...!)
Will fix all instances of nr_pages to unsigned long.
>
>
>> continue;
>> walk_abort:
>> ret = false;
>> --
>> 2.34.1
>>
>
> Thanks, Lorenzo
* Re: [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable
2026-03-10 8:49 ` Lorenzo Stoakes (Oracle)
@ 2026-03-11 5:42 ` Dev Jain
2026-03-19 15:26 ` Lorenzo Stoakes (Oracle)
2026-03-19 16:47 ` Matthew Wilcox
0 siblings, 2 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-11 5:42 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On 10/03/26 2:19 pm, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 10, 2026 at 01:00:10PM +0530, Dev Jain wrote:
>> Teach folio_dup_swap to handle a batch of consecutive pages. Note that
>> folio_dup_swap already can handle a subset of this: nr_pages == 1 and
>> nr_pages == folio_nr_pages(folio). Generalize this to any nr_pages.
>>
>> Currently we have a not-so-nice logic of passing in subpage == NULL if
>> we mean to exercise the logic on the entire folio, and subpage != NULL if
>> we want to exercise the logic on only that subpage. Remove this
>> indirection, and explicitly pass subpage != NULL, and the number of
>> pages required.
>
> You've made the interface more confusing? Now we can update multiple subpages
> but specify only one? :)
>
> Let's try to actually refactor this into something sane... see below.
>
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> mm/rmap.c | 2 +-
>> mm/shmem.c | 2 +-
>> mm/swap.h | 5 +++--
>> mm/swapfile.c | 12 +++++-------
>> 4 files changed, 10 insertions(+), 11 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index dd638429c963e..f6d5b187cf09b 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2282,7 +2282,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> goto discard;
>> }
>>
>> - if (folio_dup_swap(folio, subpage) < 0) {
>> + if (folio_dup_swap(folio, subpage, 1) < 0) {
>> set_pte_at(mm, address, pvmw.pte, pteval);
>> goto walk_abort;
>> }
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 5e7dcf5bc5d3c..86ee34c9b40b3 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1695,7 +1695,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>> spin_unlock(&shmem_swaplist_lock);
>> }
>>
>> - folio_dup_swap(folio, NULL);
>> + folio_dup_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
>> shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
>>
>> BUG_ON(folio_mapped(folio));
>> diff --git a/mm/swap.h b/mm/swap.h
>> index a77016f2423b9..d9cb58ebbddd1 100644
>> --- a/mm/swap.h
>> +++ b/mm/swap.h
>> @@ -206,7 +206,7 @@ extern int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp);
>> * folio_put_swap(): does the opposite thing of folio_dup_swap().
>> */
>> int folio_alloc_swap(struct folio *folio);
>> -int folio_dup_swap(struct folio *folio, struct page *subpage);
>> +int folio_dup_swap(struct folio *folio, struct page *subpage, unsigned int nr_pages);
>> void folio_put_swap(struct folio *folio, struct page *subpage);
>>
>> /* For internal use */
>> @@ -390,7 +390,8 @@ static inline int folio_alloc_swap(struct folio *folio)
>> return -EINVAL;
>> }
>>
>> -static inline int folio_dup_swap(struct folio *folio, struct page *page)
>> +static inline int folio_dup_swap(struct folio *folio, struct page *page,
>> + unsigned int nr_pages)
>> {
>> return -EINVAL;
>> }
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 915bc93964dbd..eaf61ae6c3817 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1738,7 +1738,8 @@ int folio_alloc_swap(struct folio *folio)
>> /**
>> * folio_dup_swap() - Increase swap count of swap entries of a folio.
>> * @folio: folio with swap entries bounded.
>> - * @subpage: if not NULL, only increase the swap count of this subpage.
>> + * @subpage: Increase the swap count of this subpage till nr number of
>> + * pages forward.
>
> (Obviously also Kairui's point about missing entry in kdoc)
>
> This is REALLY confusing sorry. And this interface is just a horror show.
>
> Before we had subpage == only increase the swap count of the subpage.
>
> Now subpage = the first subpage at which we do that? Please, no.
>
> You just need to rework this interface in general, this is a hack.
>
> Something like:
>
> int __folio_dup_swap(struct folio *folio, unsigned int subpage_start_index,
> unsigned int nr_subpages)
> {
> ...
> }
>
> ...
>
> int folio_dup_swap_subpage(struct folio *folio, struct page *subpage)
> {
> return __folio_dup_swap(folio, folio_page_idx(folio, subpage), 1);
> }
>
> int folio_dup_swap(struct folio *folio)
> {
> return __folio_dup_swap(folio, 0, folio_nr_pages(folio));
> }
>
> Or something like that.
I get the essence of the point you are making.
Since most callers of folio_put_swap mean it for the entire folio, perhaps
we can have folio_put_swap for those callers, and the ones which are
not sure can call folio_put_swap_subpages? Same for folio_dup_swap.
And since we would be calling it folio_put_swap_subpages, we could retain
the subpage parameter?
>
> We're definitely _not_ keeping the subpage parameter like that and hacking on
> batching, PLEASE.
>
>> *
>> * Typically called when the folio is unmapped and have its swap entry to
>> * take its place: Swap entries allocated to a folio has count == 0 and pinned
>> @@ -1752,18 +1753,15 @@ int folio_alloc_swap(struct folio *folio)
>> * swap_put_entries_direct on its swap entry before this helper returns, or
>> * the swap count may underflow.
>> */
>> -int folio_dup_swap(struct folio *folio, struct page *subpage)
>> +int folio_dup_swap(struct folio *folio, struct page *subpage,
>> + unsigned int nr_pages)
>> {
>> swp_entry_t entry = folio->swap;
>> - unsigned long nr_pages = folio_nr_pages(folio);
>>
>> VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>> VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
>>
>> - if (subpage) {
>> - entry.val += folio_page_idx(folio, subpage);
>> - nr_pages = 1;
>> - }
>> + entry.val += folio_page_idx(folio, subpage);
>>
>> return swap_dup_entries_cluster(swap_entry_to_info(entry),
>> swp_offset(entry), nr_pages);
>> --
>> 2.34.1
>>
>
> Thanks, Lorenzo
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 8/9] mm/rmap: introduce folio_try_share_anon_rmap_ptes
2026-03-10 9:38 ` Lorenzo Stoakes (Oracle)
@ 2026-03-11 8:09 ` Dev Jain
2026-03-12 8:19 ` Wei Yang
2026-03-19 15:47 ` Lorenzo Stoakes (Oracle)
0 siblings, 2 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-11 8:09 UTC (permalink / raw)
To: Lorenzo Stoakes (Oracle)
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On 10/03/26 3:08 pm, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 10, 2026 at 01:00:12PM +0530, Dev Jain wrote:
>> In the quest of enabling batched unmapping of anonymous folios, we need to
>> handle the sharing of exclusive pages. Hence, a batched version of
>> folio_try_share_anon_rmap_pte is required.
>>
>> Currently, the sole purpose of nr_pages in __folio_try_share_anon_rmap is
>> to do some rmap sanity checks. Add helpers to set and clear the
>> PageAnonExclusive bit on a batch of nr_pages. Note that
>> __folio_try_share_anon_rmap can receive nr_pages == HPAGE_PMD_NR from the
>> PMD path, but currently we only clear the bit on the head page. Retain this
>> behaviour by setting nr_pages = 1 in case the caller is
>> folio_try_share_anon_rmap_pmd.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> include/linux/page-flags.h | 11 +++++++++++
>> include/linux/rmap.h | 28 ++++++++++++++++++++++++++--
>> mm/rmap.c | 2 +-
>> 3 files changed, 38 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 0e03d816e8b9d..1d74ed9a28c41 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -1178,6 +1178,17 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
>> __clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
>> }
>>
>> +static __always_inline void ClearPagesAnonExclusive(struct page *page,
>> + unsigned int nr)
>
> You're randomly moving between nr and nr_pages, can we just consistently use
> nr_pages please.
Okay.
>
>> +{
>> + for (;;) {
>> + ClearPageAnonExclusive(page);
>> + if (--nr == 0)
>
> You really require nr to != 0 here or otherwise you're going to be clearing 4
> billion pages :)
I'm following the pattern in pgtable.h, see set_ptes,
clear_young_dirty_ptes, etc.
>
>> + break;
>> + ++page;
>> + }
>> +}
>
> Can we put this in mm.h or somewhere else please, and can we do away with this
What is the problem with page-flags.h? I am not sure which
function belongs in which header file semantically, so please educate me on
this.
> HorribleNamingConvention, this is new, we can 'get away' with making it
> something sensible :)
I'll name it folio_clear_pages_anon_exclusive.
>
> I wonder if we shouldn't also add a folio pointer here, and some
> VM_WARN_ON_ONCE()'s. Like:
>
> static inline void folio_clear_page_batch(struct folio *folio,
> struct page *first_subpage,
> unsigned int nr_pages)
> {
> struct page *subpage = first_subpage;
>
> VM_WARN_ON_ONCE(!nr_pages);
> VM_WARN_ON_ONCE(... check first_subpage in folio ...);
> VM_WARN_ON_ONCE(... check first_subpage -> first_subpage + nr_pages in folio ...);
I like what you are saying, but __folio_rmap_sanity_checks in the caller
checks exactly this :)
>
> while (nr_pages--)
> ClearPageAnonExclusive(subpage++);
> }
>
>> +
>> #ifdef CONFIG_MMU
>> #define __PG_MLOCKED (1UL << PG_mlocked)
>> #else
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 1b7720c66ac87..7a67776dca3fe 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -712,9 +712,13 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
>> VM_WARN_ON_FOLIO(!PageAnonExclusive(page), folio);
>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>
>> + /* We only clear anon-exclusive from head page of PMD folio */
>
> Is this accurate? David? I thought anon exclusive was per-subpage for any large
> folio...?
The current behaviour is to do only this. I was also surprised by it,
so I dug in and found:
https://lore.kernel.org/all/20220428083441.37290-13-david@redhat.com/
where David says:
"Long story short: once
PTE-mapped, we have to track information about exclusivity per sub-page,
but until then, we can just track it for the compound page in the head
page and not having to update a whole bunch of subpages all of the time
for a simple PMD mapping of a THP."
>
> If we're explicitly doing this for some reason here, then why introduce it?
>
>> + if (level == PGTABLE_LEVEL_PMD)
>> + nr_pages = 1;
>> +
>> /* device private folios cannot get pinned via GUP. */
>> if (unlikely(folio_is_device_private(folio))) {
>> - ClearPageAnonExclusive(page);
>> + ClearPagesAnonExclusive(page, nr_pages);
>
> I really kind of hate this 'we are looking at subpage X with variable page in
> folio Y, but we don't mention Y' thing. It's super confusing that we have a
> pointer to a thing which sometimes we deref and treat as a value we care about
> and sometimes treat as an array.
>
> This pattern exists throughout all the batched stuff and I kind of hate it
> everywhere.
>
> I guess the batching means that we are looking at a sub-folio range.
>
> If C had a better type system we could somehow have a type that encoded this,
> but it doesn't :>)
>
> But I wonder if we shouldn't just go ahead and rename page -> pages and be
> consistent about this?
Agreed. You are correct in saying that this function should receive struct
folio to assert that we are essentially in a page array, and some sanity
checking should happen. But the callers are already doing the checking
in folio_rmap_sanity_checks. Let me think on this.
>
>> return 0;
>> }
>>
>> @@ -766,7 +770,7 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
>>
>> if (unlikely(folio_maybe_dma_pinned(folio)))
>> return -EBUSY;
>> - ClearPageAnonExclusive(page);
>> + ClearPagesAnonExclusive(page, nr_pages);
>>
>> /*
>> * This is conceptually a smp_wmb() paired with the smp_rmb() in
>> @@ -804,6 +808,26 @@ static inline int folio_try_share_anon_rmap_pte(struct folio *folio,
>> return __folio_try_share_anon_rmap(folio, page, 1, PGTABLE_LEVEL_PTE);
>> }
>>
>> +/**
>> + * folio_try_share_anon_rmap_ptes - try marking exclusive anonymous pages
>> + * mapped by PTEs possibly shared to prepare
>> + * for KSM or temporary unmapping
>
> This description is very confusing. 'Try marking exclusive anonymous pages
> [... marking them as what?] mapped by PTEs[, or (]possibly shared[, or )] to
> prepare for KSM[under what circumstances? Why mention KSM here?] or temporary
> unmapping [why temporary?]
>
> OK I think you mean to say 'marking' them as 'possibly' shared.
>
> But really by 'shared' you mean clearing anon exclusive right? So maybe the
> function name and description should reference that instead.
>
> But this needs clarifying. This isn't an exercise in minimum number of words to
> describe the function.
>
> Ohhh now I see this is what the comment is in folio_try_share_anon_rmap_pte() :P
>
> Well, I wish we could update the original too ;) but OK this is fine as-is to
> matc that then.
>
>> + * @folio: The folio to share a mapping of
>> + * @page: The first mapped exclusive page of the batch
>> + * @nr_pages: The number of pages to share (batch size)
>> + *
>> + * See folio_try_share_anon_rmap_pte for full description.
>> + *
>> + * Context: The caller needs to hold the page table lock and has to have the
>> + * page table entries cleared/invalidated. Those PTEs used to map consecutive
>> + * pages of the folio passed here. The PTEs are all in the same PMD and VMA.
>
> Can we VM_WARN_ON_ONCE() any of this? Not completely a necessity.
Again, we have WARN checks in folio_rmap_sanity_checks, and even in
folio_pte_batch. I am afraid of duplication.
>
>> + */
>> +static inline int folio_try_share_anon_rmap_ptes(struct folio *folio,
>> + struct page *page, unsigned int nr)
>> +{
>> + return __folio_try_share_anon_rmap(folio, page, nr, PGTABLE_LEVEL_PTE);
>> +}
>> +
>> /**
>> * folio_try_share_anon_rmap_pmd - try marking an exclusive anonymous page
>> * range mapped by a PMD possibly shared to
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 42f6b00cced01..bba5b571946d8 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2300,7 +2300,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>
>> /* See folio_try_share_anon_rmap(): clear PTE first. */
>> if (anon_exclusive &&
>> - folio_try_share_anon_rmap_pte(folio, subpage)) {
>> + folio_try_share_anon_rmap_ptes(folio, subpage, 1)) {
>
> I guess this is because you intend to make this batched later with >1, but I
> don't see the point of adding this since folio_try_share_anon_rmap_pte() already
> does what you're doing here.
>
> So why not just change this when you actually batch?
The general consensus is to introduce a new function along with its
caller. Although one may argue that the call introduced here is not
a functional change (still passing nr_pages = 1), so I am fine doing
what you suggest.
>
> Buuuut.... haven't you not already changed this whole function to now 'jump'
> ahead if batched, so why are we only specifying nr_pages = 1 here?
Because... please bear with the insanity :) Currently we are in a
ridiculous situation where
nr_pages can be > 1 for file folios and lazyfree folios, *and* it is
required that the VMA is non-uffd.
So, the "jump" thingy I was doing in the previous patch was adding support
for file folios belonging to uffd VMAs (see pte_install_uffd_wp_if_needed;
we need to handle the uffd-wp marker for file folios only). I also should
have mentioned "file folio" in the subject line of that patch; of course
I missed that because reasoning through this code is very difficult.
To answer your question, currently for anon folios nr_pages == 1, so
the jump is a no-op.
When I discovered the uffd-wp bug some weeks back, I was pushing
back against the idea of hacking around it by disabling batching
for uffd VMAs in folio_unmap_pte_batch, in favour of solving it then
and there properly. Now we have too many cases - first we added lazyfree
support, then file-non-uffd support, my patch 5 adds file-uffd support,
and the last patch finally completes this with anon support.
>
> Honestly I think this function needs to be fully refactored away from the
> appalling giant-ball-of-string mess it is now before we try to add in batching
> to be honest.
>
> Let's not accumulate more technical debt.
I agree, I am happy to help in cleaning up this function.
>
>
>> folio_put_swap(folio, subpage, 1);
>> set_pte_at(mm, address, pvmw.pte, pteval);
>> goto walk_abort;
>> --
>> 2.34.1
>>
>
> Thanks, Lorenzo
* Re: [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping
2026-03-10 12:59 ` Lance Yang
@ 2026-03-11 8:11 ` Dev Jain
0 siblings, 0 replies; 46+ messages in thread
From: Dev Jain @ 2026-03-11 8:11 UTC (permalink / raw)
To: Lance Yang
Cc: Liam.Howlett, akpm, anshuman.khandual, axelrasmussen, baohua,
baolin.wang, bhe, chrisl, david, harry.yoo, hughd, jannh, kas,
kasong, linux-kernel, linux-mm, ljs, mhocko, nphamcs, pfalcato,
riel, rppt, ryan.roberts, shikemeng, surenb, vbabka, weixugc,
willy, youngjun.park, yuanchu, yuzhao, ziy
On 10/03/26 6:29 pm, Lance Yang wrote:
>
> On Tue, Mar 10, 2026 at 01:00:04PM +0530, Dev Jain wrote:
>> Speed up unmapping of anonymous large folios by clearing the ptes, and
>> setting swap ptes, in one go.
>>
>> The following benchmark (stolen from Barry at [1]) is used to measure the
>> time taken to swapout 256M worth of memory backed by 64K large folios:
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>> #include <string.h>
>> #include <time.h>
>> #include <unistd.h>
>> #include <errno.h>
>>
>> #define SIZE_MB 256
>> #define SIZE_BYTES (SIZE_MB * 1024 * 1024)
>>
>> int main() {
>> void *addr = mmap(NULL, SIZE_BYTES, PROT_READ | PROT_WRITE,
>> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> if (addr == MAP_FAILED) {
>> perror("mmap failed");
>> return 1;
>> }
>>
>> memset(addr, 0, SIZE_BYTES);
>>
>> struct timespec start, end;
>> clock_gettime(CLOCK_MONOTONIC, &start);
>>
>> if (madvise(addr, SIZE_BYTES, MADV_PAGEOUT) != 0) {
>> perror("madvise(MADV_PAGEOUT) failed");
>> munmap(addr, SIZE_BYTES);
>> return 1;
>> }
>>
>> clock_gettime(CLOCK_MONOTONIC, &end);
>>
>> long duration_ns = (end.tv_sec - start.tv_sec) * 1e9 +
>> (end.tv_nsec - start.tv_nsec);
>> printf("madvise(MADV_PAGEOUT) took %ld ns (%.3f ms)\n",
>> duration_ns, duration_ns / 1e6);
>>
>> munmap(addr, SIZE_BYTES);
>> return 0;
>> }
>>
>> On arm64, only showing one of the middle values in the distribution:
>>
>> without patch:
>> madvise(MADV_PAGEOUT) took 52192959 ns (52.193 ms)
>>
>> with patch:
>> madvise(MADV_PAGEOUT) took 26676625 ns (26.677 ms)
>
> Good numbers! Just tested on x86 KVM with THP=never, no performance
> regression observed.
Thanks Lance!
Still, I'll try to get no-regression and perf-boost numbers on x86
myself and post them in the next version.
>
> Cheers,
> Lance
>
* Re: [PATCH 8/9] mm/rmap: introduce folio_try_share_anon_rmap_ptes
2026-03-11 8:09 ` Dev Jain
@ 2026-03-12 8:19 ` Wei Yang
2026-03-19 15:47 ` Lorenzo Stoakes (Oracle)
1 sibling, 0 replies; 46+ messages in thread
From: Wei Yang @ 2026-03-12 8:19 UTC (permalink / raw)
To: Dev Jain
Cc: Lorenzo Stoakes (Oracle), akpm, axelrasmussen, yuanchu, david,
hughd, chrisl, kasong, weixugc, Liam.Howlett, vbabka, rppt,
surenb, mhocko, riel, harry.yoo, jannh, pfalcato, baolin.wang,
shikemeng, nphamcs, bhe, baohua, youngjun.park, ziy, kas, willy,
yuzhao, linux-mm, linux-kernel, ryan.roberts, anshuman.khandual
On Wed, Mar 11, 2026 at 01:39:25PM +0530, Dev Jain wrote:
>
[...]
>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>> index 1b7720c66ac87..7a67776dca3fe 100644
>>> --- a/include/linux/rmap.h
>>> +++ b/include/linux/rmap.h
>>> @@ -712,9 +712,13 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
>>> VM_WARN_ON_FOLIO(!PageAnonExclusive(page), folio);
>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>>>
>>> + /* We only clear anon-exclusive from head page of PMD folio */
>>
>> Is this accurate? David? I thought anon exclusive was per-subpage for any large
>> folio...?
>
>The current behaviour is to do this only. I was also surprised with this,
>so I had dug in and found out:
>
>https://lore.kernel.org/all/20220428083441.37290-13-david@redhat.com/
>
>where David says:
>
>"Long story short: once
>PTE-mapped, we have to track information about exclusivity per sub-page,
>but until then, we can just track it for the compound page in the head
>page and not having to update a whole bunch of subpages all of the time
>for a simple PMD mapping of a THP."
>
Thanks for digging.
One tiny thing:
Now we have a comment in PageAnonExclusive(), which says:
/*
* HugeTLB stores this information on the head page; THP keeps it per
* page
*/
This may confuse readers? Not your fault, just want to point it out.
--
Wei Yang
Help you, Help me
* Re: [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable
2026-03-10 7:30 ` [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable Dev Jain
2026-03-10 8:27 ` Kairui Song
2026-03-10 8:49 ` Lorenzo Stoakes (Oracle)
@ 2026-03-18 0:20 ` kernel test robot
2 siblings, 0 replies; 46+ messages in thread
From: kernel test robot @ 2026-03-18 0:20 UTC (permalink / raw)
To: Dev Jain, akpm, axelrasmussen, yuanchu, david, hughd, chrisl,
kasong
Cc: oe-kbuild-all, weixugc, ljs, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, pfalcato, baolin.wang, shikemeng,
nphamcs, bhe, baohua, youngjun.park, ziy, kas, willy, yuzhao,
linux-mm, linux-kernel
Hi Dev,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-rmap-make-nr_pages-signed-in-try_to_unmap_one/20260310-153604
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260310073013.4069309-7-dev.jain%40arm.com
patch subject: [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20260318/202603180851.hRAaHR6o-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260318/202603180851.hRAaHR6o-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603180851.hRAaHR6o-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> Warning: mm/swapfile.c:1757 function parameter 'nr_pages' not described in 'folio_dup_swap'
>> Warning: mm/swapfile.c:1757 function parameter 'nr_pages' not described in 'folio_dup_swap'
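(The warning points at the folio_dup_swap kdoc block touched by the patch; adding the missing parameter entry would silence it, along these lines - the wording is only a suggestion, not the author's:)

```c
/**
 * folio_dup_swap() - Increase swap count of swap entries of a folio.
 * @folio: folio with swap entries bounded.
 * @subpage: first subpage whose swap count is increased.
 * @nr_pages: number of consecutive subpages, starting at @subpage, to act on.
 */
```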
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable
2026-03-10 7:30 ` [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable Dev Jain
2026-03-10 8:29 ` Kairui Song
2026-03-10 8:55 ` Lorenzo Stoakes (Oracle)
@ 2026-03-18 1:04 ` kernel test robot
2 siblings, 0 replies; 46+ messages in thread
From: kernel test robot @ 2026-03-18 1:04 UTC (permalink / raw)
To: Dev Jain, akpm, axelrasmussen, yuanchu, david, hughd, chrisl,
kasong
Cc: oe-kbuild-all, weixugc, ljs, Liam.Howlett, vbabka, rppt, surenb,
mhocko, riel, harry.yoo, jannh, pfalcato, baolin.wang, shikemeng,
nphamcs, bhe, baohua, youngjun.park, ziy, kas, willy, yuzhao,
linux-mm, linux-kernel
Hi Dev,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-rmap-make-nr_pages-signed-in-try_to_unmap_one/20260310-153604
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260310073013.4069309-8-dev.jain%40arm.com
patch subject: [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20260318/202603180953.FNgOslIG-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260318/202603180953.FNgOslIG-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603180953.FNgOslIG-lkp@intel.com/
All warnings (new ones prefixed by >>):
Warning: mm/swapfile.c:1757 function parameter 'nr_pages' not described in 'folio_dup_swap'
>> Warning: mm/swapfile.c:1780 function parameter 'nr_pages' not described in 'folio_put_swap'
Warning: mm/swapfile.c:1757 function parameter 'nr_pages' not described in 'folio_dup_swap'
>> Warning: mm/swapfile.c:1780 function parameter 'nr_pages' not described in 'folio_put_swap'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable
2026-03-11 5:42 ` Dev Jain
@ 2026-03-19 15:26 ` Lorenzo Stoakes (Oracle)
2026-03-19 16:47 ` Matthew Wilcox
1 sibling, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 15:26 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On Wed, Mar 11, 2026 at 11:12:22AM +0530, Dev Jain wrote:
>
>
> On 10/03/26 2:19 pm, Lorenzo Stoakes (Oracle) wrote:
> > On Tue, Mar 10, 2026 at 01:00:10PM +0530, Dev Jain wrote:
> >> Teach folio_dup_swap to handle a batch of consecutive pages. Note that
> >> folio_dup_swap already can handle a subset of this: nr_pages == 1 and
> >> nr_pages == folio_nr_pages(folio). Generalize this to any nr_pages.
> >>
> >> Currently we have a not-so-nice logic of passing in subpage == NULL if
> >> we mean to exercise the logic on the entire folio, and subpage != NULL if
> >> we want to exercise the logic on only that subpage. Remove this
> >> indirection, and explicitly pass subpage != NULL, and the number of
> >> pages required.
> >
> > You've made the interface more confusing? Now we can update multiple subpages
> > but specify only one? :)
> >
> > Let's try to actually refactor this into something sane... see below.
> >
> >>
> >> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >> ---
> >> mm/rmap.c | 2 +-
> >> mm/shmem.c | 2 +-
> >> mm/swap.h | 5 +++--
> >> mm/swapfile.c | 12 +++++-------
> >> 4 files changed, 10 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index dd638429c963e..f6d5b187cf09b 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -2282,7 +2282,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >> goto discard;
> >> }
> >>
> >> - if (folio_dup_swap(folio, subpage) < 0) {
> >> + if (folio_dup_swap(folio, subpage, 1) < 0) {
> >> set_pte_at(mm, address, pvmw.pte, pteval);
> >> goto walk_abort;
> >> }
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index 5e7dcf5bc5d3c..86ee34c9b40b3 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -1695,7 +1695,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> >> spin_unlock(&shmem_swaplist_lock);
> >> }
> >>
> >> - folio_dup_swap(folio, NULL);
> >> + folio_dup_swap(folio, folio_page(folio, 0), folio_nr_pages(folio));
> >> shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
> >>
> >> BUG_ON(folio_mapped(folio));
> >> diff --git a/mm/swap.h b/mm/swap.h
> >> index a77016f2423b9..d9cb58ebbddd1 100644
> >> --- a/mm/swap.h
> >> +++ b/mm/swap.h
> >> @@ -206,7 +206,7 @@ extern int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp);
> >> * folio_put_swap(): does the opposite thing of folio_dup_swap().
> >> */
> >> int folio_alloc_swap(struct folio *folio);
> >> -int folio_dup_swap(struct folio *folio, struct page *subpage);
> >> +int folio_dup_swap(struct folio *folio, struct page *subpage, unsigned int nr_pages);
> >> void folio_put_swap(struct folio *folio, struct page *subpage);
> >>
> >> /* For internal use */
> >> @@ -390,7 +390,8 @@ static inline int folio_alloc_swap(struct folio *folio)
> >> return -EINVAL;
> >> }
> >>
> >> -static inline int folio_dup_swap(struct folio *folio, struct page *page)
> >> +static inline int folio_dup_swap(struct folio *folio, struct page *page,
> >> + unsigned int nr_pages)
> >> {
> >> return -EINVAL;
> >> }
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index 915bc93964dbd..eaf61ae6c3817 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -1738,7 +1738,8 @@ int folio_alloc_swap(struct folio *folio)
> >> /**
> >> * folio_dup_swap() - Increase swap count of swap entries of a folio.
> >> * @folio: folio with swap entries bounded.
> >> - * @subpage: if not NULL, only increase the swap count of this subpage.
> >> + * @subpage: Increase the swap count of this subpage till nr number of
> >> + * pages forward.
> >
> > (Obviously also Kairui's point about missing entry in kdoc)
> >
> > This is REALLY confusing sorry. And this interface is just a horror show.
> >
> > Before we had subpage == only increase the swap count of the subpage.
> >
> > Now subpage = the first subpage at which we do that? Please, no.
> >
> > You just need to rework this interface in general, this is a hack.
> >
> > Something like:
> >
> > int __folio_dup_swap(struct folio *folio, unsigned int subpage_start_index,
> > unsigned int nr_subpages)
> > {
> > ...
> > }
> >
> > ...
> >
> > int folio_dup_swap_subpage(struct folio *folio, struct page *subpage)
> > {
> > return __folio_dup_swap(folio, folio_page_idx(folio, subpage), 1);
> > }
> >
> > int folio_dup_swap(struct folio *folio)
> > {
> > return __folio_dup_swap(folio, 0, folio_nr_pages(folio));
> > }
> >
> > Or something like that.
>
>
> I get the essence of the point you are making.
>
> Since most callers of folio_put_swap mean it for entire folio, perhaps
> we can have folio_put_swap for these callers, and the ones which are
> not sure can call folio_put_swap_subpages? Same for folio_dup_swap.
> And since we are calling it folio_put_swap_subpages, we can retain
> the subpage parameter?
Right but I'm talking about dup variants here but the principle is the same.
>
> >
> > We're definitely _not_ keeping the subpage parameter like that and hacking on
> > batching, PLEASE.
> >
> >> *
> >> * Typically called when the folio is unmapped and have its swap entry to
> >> * take its place: Swap entries allocated to a folio has count == 0 and pinned
> >> @@ -1752,18 +1753,15 @@ int folio_alloc_swap(struct folio *folio)
> >> * swap_put_entries_direct on its swap entry before this helper returns, or
> >> * the swap count may underflow.
> >> */
> >> -int folio_dup_swap(struct folio *folio, struct page *subpage)
> >> +int folio_dup_swap(struct folio *folio, struct page *subpage,
> >> + unsigned int nr_pages)
> >> {
> >> swp_entry_t entry = folio->swap;
> >> - unsigned long nr_pages = folio_nr_pages(folio);
> >>
> >> VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> >> VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
> >>
> >> - if (subpage) {
> >> - entry.val += folio_page_idx(folio, subpage);
> >> - nr_pages = 1;
> >> - }
> >> + entry.val += folio_page_idx(folio, subpage);
> >>
> >> return swap_dup_entries_cluster(swap_entry_to_info(entry),
> >> swp_offset(entry), nr_pages);
> >> --
> >> 2.34.1
> >>
> >
> > Thanks, Lorenzo
>
Cheers, Lorenzo
* Re: [PATCH 8/9] mm/rmap: introduce folio_try_share_anon_rmap_ptes
2026-03-11 8:09 ` Dev Jain
2026-03-12 8:19 ` Wei Yang
@ 2026-03-19 15:47 ` Lorenzo Stoakes (Oracle)
1 sibling, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 15:47 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On Wed, Mar 11, 2026 at 01:39:25PM +0530, Dev Jain wrote:
>
>
> On 10/03/26 3:08 pm, Lorenzo Stoakes (Oracle) wrote:
> > On Tue, Mar 10, 2026 at 01:00:12PM +0530, Dev Jain wrote:
> >> In the quest of enabling batched unmapping of anonymous folios, we need to
> >> handle the sharing of exclusive pages. Hence, a batched version of
> >> folio_try_share_anon_rmap_pte is required.
> >>
> >> Currently, the sole purpose of nr_pages in __folio_try_share_anon_rmap is
> >> to do some rmap sanity checks. Add helpers to set and clear the
> >> PageAnonExclusive bit on a batch of nr_pages. Note that
> >> __folio_try_share_anon_rmap can receive nr_pages == HPAGE_PMD_NR from the
> >> PMD path, but currently we only clear the bit on the head page. Retain this
> >> behaviour by setting nr_pages = 1 in case the caller is
> >> folio_try_share_anon_rmap_pmd.
> >>
> >> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >> ---
> >> include/linux/page-flags.h | 11 +++++++++++
> >> include/linux/rmap.h | 28 ++++++++++++++++++++++++++--
> >> mm/rmap.c | 2 +-
> >> 3 files changed, 38 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >> index 0e03d816e8b9d..1d74ed9a28c41 100644
> >> --- a/include/linux/page-flags.h
> >> +++ b/include/linux/page-flags.h
> >> @@ -1178,6 +1178,17 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
> >> __clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
> >> }
> >>
> >> +static __always_inline void ClearPagesAnonExclusive(struct page *page,
> >> + unsigned int nr)
> >
> > You're randomly moving between nr and nr_pages, can we just consistently use
> > nr_pages please.
>
> Okay.
>
> >
> >> +{
> >> + for (;;) {
> >> + ClearPageAnonExclusive(page);
> >> + if (--nr == 0)
> >
> > You really require nr to != 0 here or otherwise you're going to be clearing 4
> > billion pages :)
>
> I'm following the pattern in pgtable.h, see set_ptes,
> clear_young_dirty_ptes, etc.
>
> >
> >> + break;
> >> + ++page;
> >> + }
> >> +}
> >
> > Can we put this in mm.h or somewhere else please, and can we do away with this
>
> What is the problem with page-flags.h? I am not very aware on which
> function to put in which header file semantically, so please educate me on
> this.
It's a mess in there and this doesn't really belong.
>
> > HorribleNamingConvention, this is new, we can 'get away' with making it
> > something sensible :)
>
> I'll name it folio_clear_pages_anon_exclusive.
OK
>
>
> >
> > I wonder if we shouldn't also add a folio pointer here, and some
> > VM_WARN_ON_ONCE()'s. Like:
> >
> > static inline void folio_clear_page_batch(struct folio *folio,
> > struct page *first_subpage,
> > unsigned int nr_pages)
> > {
> > struct page *subpage = first_subpage;
> >
> > VM_WARN_ON_ONCE(!nr_pages);
> > VM_WARN_ON_ONCE(... check first_subpage in folio ...);
> > VM_WARN_ON_ONCE(... check first_subpage -> first_subpage + nr_pages in folio ...);
>
> I like what you are saying, but __folio_rmap_sanity_checks in the caller
> checks exactly this :)
This is a shared function that can't be assumed to always be called from a
context where that has run. VM_WARN_ON_ONCE()'s are free in release kernels so I
don't think it should be such an issue.
>
> >
> > while (nr_pages--)
> > ClearPageAnonExclusive(subpage++);
> > }
> >
> >> +
> >> #ifdef CONFIG_MMU
> >> #define __PG_MLOCKED (1UL << PG_mlocked)
> >> #else
> >> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> >> index 1b7720c66ac87..7a67776dca3fe 100644
> >> --- a/include/linux/rmap.h
> >> +++ b/include/linux/rmap.h
> >> @@ -712,9 +712,13 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
> >> VM_WARN_ON_FOLIO(!PageAnonExclusive(page), folio);
> >> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>
> >> + /* We only clear anon-exclusive from head page of PMD folio */
> >
> > Is this accurate? David? I thought anon exclusive was per-subpage for any large
> > folio...?
>
> The current behaviour is to clear only the head page's bit. I was also
> surprised by this, so I dug in and found:
>
> https://lore.kernel.org/all/20220428083441.37290-13-david@redhat.com/
>
> where David says:
>
> "Long story short: once
> PTE-mapped, we have to track information about exclusivity per sub-page,
> but until then, we can just track it for the compound page in the head
> page and not having to update a whole bunch of subpages all of the time
> for a simple PMD mapping of a THP."
OK
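For the avoidance of doubt, here is a toy userspace model (hypothetical names, not the kernel API) of the scheme David describes: while the THP is PMD-mapped, exclusivity lives only in the head page's flag; on split to PTE mappings it is fanned out per subpage.

```c
#include <stdbool.h>
#include <string.h>

#define NR_PAGES 4

struct toy_page { bool anon_exclusive; };

struct toy_folio {
	struct toy_page pages[NR_PAGES];
	bool pmd_mapped;
};

/* PMD mapping of the whole folio: track exclusivity on the head only. */
static void map_pmd_exclusive(struct toy_folio *f)
{
	memset(f->pages, 0, sizeof(f->pages));
	f->pages[0].anon_exclusive = true;
	f->pmd_mapped = true;
}

/* PMD split -> PTE-mapped: fan the head bit out to every subpage,
 * after which each subpage is tracked independently. */
static void split_to_ptes(struct toy_folio *f)
{
	bool excl = f->pages[0].anon_exclusive;

	for (int i = 0; i < NR_PAGES; i++)
		f->pages[i].anon_exclusive = excl;
	f->pmd_mapped = false;
}
```

This is why the PMD path of __folio_try_share_anon_rmap() only needs to touch one page, hence the nr_pages = 1 override in the patch.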
>
>
> >
> > If we're explicitly doing this for some reason here, then why introduce it?
> >
> >> + if (level == PGTABLE_LEVEL_PMD)
> >> + nr_pages = 1;
> >> +
> >> /* device private folios cannot get pinned via GUP. */
> >> if (unlikely(folio_is_device_private(folio))) {
> >> - ClearPageAnonExclusive(page);
> >> + ClearPagesAnonExclusive(page, nr_pages);
> >
> > I really kind of hate this 'we are looking at subpage X with variable page in
> > folio Y, but we don't mention Y' thing. It's super confusing that we have a
> > pointer to a thing which sometimes we deref and treat as a value we care about
> > and sometimes treat as an array.
> >
> > This pattern exists throughout all the batched stuff and I kind of hate it
> > everywhere.
> >
> > I guess the batching means that we are looking at a sub-folio range.
> >
> > If C had a better type system we could somehow have a type that encoded this,
> > but it doesn't :>)
> >
> > But I wonder if we shouldn't just go ahead and rename page -> pages and be
> > consistent about this?
>
> Agreed. You are correct in saying that this function should receive struct
> folio to assert that we are essentially in a page array, and some sanity
> checking should happen. But the callers are already doing the checking
> in folio_rmap_sanity_checks. Let me think on this.
You treat that like a license to never put an assert anywhere else in mm... In
any case I'm not sure I mentioned an assert here, more so the pattern around
folio vs. subpages.
>
>
> >
> >> return 0;
> >> }
> >>
> >> @@ -766,7 +770,7 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
> >>
> >> if (unlikely(folio_maybe_dma_pinned(folio)))
> >> return -EBUSY;
> >> - ClearPageAnonExclusive(page);
> >> + ClearPagesAnonExclusive(page, nr_pages);
> >>
> >> /*
> >> * This is conceptually a smp_wmb() paired with the smp_rmb() in
> >> @@ -804,6 +808,26 @@ static inline int folio_try_share_anon_rmap_pte(struct folio *folio,
> >> return __folio_try_share_anon_rmap(folio, page, 1, PGTABLE_LEVEL_PTE);
> >> }
> >>
> >> +/**
> >> + * folio_try_share_anon_rmap_ptes - try marking exclusive anonymous pages
> >> + * mapped by PTEs possibly shared to prepare
> >> + * for KSM or temporary unmapping
> >
> > This description is very confusing. 'Try marking exclusive anonymous pages
> > [... marking them as what?] mapped by PTEs[, or (]possibly shared[, or )] to
> > prepare for KSM[under what circumstances? Why mention KSM here?] or temporary
> > unmapping [why temporary?]
> >
> > OK I think you mean to say 'marking' them as 'possibly' shared.
> >
> > But really by 'shared' you mean clearing anon exclusive right? So maybe the
> > function name and description should reference that instead.
> >
> > But this needs clarifying. This isn't an exercise in minimum number of words to
> > describe the function.
> >
> > Ohhh now I see this is what the comment is in folio_try_share_anon_rmap_pte() :P
> >
> > Well, I wish we could update the original too ;) but OK this is fine as-is to
> > match that then.
> >
> >> + * @folio: The folio to share a mapping of
> >> + * @page: The first mapped exclusive page of the batch
> >> + * @nr_pages: The number of pages to share (batch size)
> >> + *
> >> + * See folio_try_share_anon_rmap_pte for full description.
> >> + *
> >> + * Context: The caller needs to hold the page table lock and has to have the
> >> + * page table entries cleared/invalidated. Those PTEs used to map consecutive
> >> + * pages of the folio passed here. The PTEs are all in the same PMD and VMA.
> >
> > Can we VM_WARN_ON_ONCE() any of this? Not completely a necessity.
>
> Again, we have WARN checks in folio_rmap_sanity_checks, and even in
> folio_pte_batch. I am afraid of duplication.
Why on earth are you bothering with 'context' then?
If having run folio_rmap_sanity_checks() means we never have to assert or think
about state or context again then why bother?
I looked in __folio_rmap_sanity_checks() and I see no checks for any of this.
Breaking it down:
Context: The caller needs to hold the page table lock and has to have the
page table entries cleared/invalidated. Those PTEs used to map consecutive
pages of the folio passed here. The PTEs are all in the same PMD and VMA.
- Page table lock <-- NOT checked
- Page table entries cleared/invalidated <-- NOT checked
- PTEs passed in must map all pages in the folio? <-- you aren't passing in PTEs?
- PTEs are all in the same PMD <- You aren't passing in PTEs?
- PTES are all in the same VMA <- You aren't passing in PTEs?
So that's hardly a coherent picture no?
>
> >
> >> + */
> >> +static inline int folio_try_share_anon_rmap_ptes(struct folio *folio,
> >> + struct page *page, unsigned int nr)
> >> +{
> >> + return __folio_try_share_anon_rmap(folio, page, nr, PGTABLE_LEVEL_PTE);
> >> +}
> >> +
> >> /**
> >> * folio_try_share_anon_rmap_pmd - try marking an exclusive anonymous page
> >> * range mapped by a PMD possibly shared to
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 42f6b00cced01..bba5b571946d8 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -2300,7 +2300,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>
> >> /* See folio_try_share_anon_rmap(): clear PTE first. */
> >> if (anon_exclusive &&
> >> - folio_try_share_anon_rmap_pte(folio, subpage)) {
> >> + folio_try_share_anon_rmap_ptes(folio, subpage, 1)) {
> >
> > I guess this is because you intend to make this batched later with >1, but I
> > don't see the point of adding this since folio_try_share_anon_rmap_pte() already
> > does what you're doing here.
> >
> > So why not just change this when you actually batch?
>
> It is generally the consensus to introduce a new function along with its
> caller. Although one may argue that the caller introduced here is not
That's not always the case, sometimes it makes more sense and is cleaner to
introduce it first.
All rules of thumb are open to sensible interpretation.
I'm not sure arbitrarily using the function in a way that makes no sense in the
code is a good approach but this isn't the end of the world.
> a functional change (still passing nr_pages = 1). So I am fine doing
> what you suggest.
>
> >
> > Buuuut.... haven't you already changed this whole function to now 'jump'
> > ahead if batched, so why are we only specifying nr_pages = 1 here?
>
> Because...please bear with the insanity :) currently we are in a ridiculous
> situation where
>
> nr_pages can be > 1 for file folios, and lazyfree folios, *and* it is
> required that the VMA is non-uffd.
>
> So, the "jump" thingy I was doing in the previous patch was adding support
> for file folios, belonging to uffd VMAs (see pte_install_uffd_wp_if_needed,
> we need to handle uffd-wp marker for file folios only, and also I should
> have mentioned "file folio" in the subject line of that patch, of course
> I missed that because reasoning through this code is very difficult)
>
> To answer your question, currently for anon folios nr_pages == 1, so
> the jump is a no-op.
>
> When I had discovered the uffd-wp bug some weeks back, I was pushing
> back against the idea of hacking around it by disabling batching
> for uffd-VMAs in folio_unmap_pte_batch, but solve it then and there
> properly. Now we have too many cases - first we added lazyfree support,
> then file-non-uffd support, my patch 5 adds file-uffd support, and
> last patch finally completes this with anon support.
I am not a fan of any of that, this just speaks to the need to clean this up
before endlessly adding more functionality and piles of complexity and
confusion.
Let's just add some patches to do things sanely please.
>
>
> >
> > Honestly I think this function needs to be fully refactored away from the
> > appalling giant-ball-of-string mess it is now before we try to add in batching
> > to be honest.
> >
> > Let's not accumulate more technical debt.
>
> I agree, I am happy to help in cleaning up this function.
Right, well then this series needs to have some clean up patches in it first.
>
> >
> >
> >> folio_put_swap(folio, subpage, 1);
> >> set_pte_at(mm, address, pvmw.pte, pteval);
> >> goto walk_abort;
> >> --
> >> 2.34.1
> >>
> >
> > Thanks, Lorenzo
>
Thanks, Lorenzo
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/9] mm/rmap: refactor lazyfree unmap commit path to commit_ttu_lazyfree_folio()
2026-03-10 8:42 ` Dev Jain
@ 2026-03-19 15:53 ` Lorenzo Stoakes (Oracle)
0 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-19 15:53 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, axelrasmussen, yuanchu, david, hughd, chrisl, kasong,
weixugc, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
harry.yoo, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
baohua, youngjun.park, ziy, kas, willy, yuzhao, linux-mm,
linux-kernel, ryan.roberts, anshuman.khandual
On Tue, Mar 10, 2026 at 02:12:45PM +0530, Dev Jain wrote:
>
>
> On 10/03/26 1:49 pm, Lorenzo Stoakes (Oracle) wrote:
> > On Tue, Mar 10, 2026 at 01:00:07PM +0530, Dev Jain wrote:
> >> Clean up the code by refactoring the post-pte-clearing path of lazyfree
> >> folio unmapping, into commit_ttu_lazyfree_folio().
> >>
> >> No functional change is intended.
> >>
> >> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >
> > This is a good idea, and we need more refactoring like this in the rmap code,
> > but comments/nits below.
> >
> >> ---
> >> mm/rmap.c | 93 ++++++++++++++++++++++++++++++++-----------------------
> >> 1 file changed, 54 insertions(+), 39 deletions(-)
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 1fa020edd954a..a61978141ee3f 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1966,6 +1966,57 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> >> FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
> >> }
> >>
> >> +static inline int commit_ttu_lazyfree_folio(struct vm_area_struct *vma,
> >
> > Strange name, maybe lazyfree_range()? Not sure what ttu has to do with
>
> ttu means try_to_unmap, just like it is used in TTU_SYNC,
> TTU_SPLIT_HUGE_PMD, etc. So personally I really like the name, it reads
> "commit the try-to-unmap of a lazyfree folio". The "commit" comes because
> the pte clearing has already happened, so now we are deciding if at all
> to back-off and restore the ptes.
I absolutely hate the name, and nobody sane is going to read it like that
sorry :) I think this is a case of being too close to the work.
I also hate the overloading of 'lazyfree' which is actually MADV_FREE,
users don't know what lazyfree is and it's a bit of an overloaded term
(we've had discussions of a new 'lazy free' implementation at conferences
before).
'Commit' is overloaded everywhere; I'm not sure it's even particularly
pertinent here.
Also it's not necessarily a folio is it? It could be a contpte range within
a folio so that's just misleading...
lazyfree_pte_range() or something like that seems better to me.
>
> > anything...
> >
> >> + struct folio *folio, unsigned long address, pte_t *ptep,
> >> + pte_t pteval, long nr_pages)
> >
> > That long nr_pages is really grating now...
>
> Reading the discussion on patch 1, I'll convert this to unsigned long.
Thanks!
>
> >
> >> +{
> >
> > Come on Dev, it's 2026, why on earth are you returning an integer and not a
> > bool?
> >
> > Also it would make sense for this to return false if something breaks, otherwise
> > true.
>
> Yes, I was confused about which of the options to choose :). Since the
> function does a lot more than just test some condition (which is what
> boolean functions usually do), I felt weird returning bool.
> But yeah, alright, I'll convert this to bool.
Thanks!
Cheers, Lorenzo
>
> >
> >> + struct mm_struct *mm = vma->vm_mm;
> >> + int ref_count, map_count;
> >> +
> >> + /*
> >> + * Synchronize with gup_pte_range():
> >> + * - clear PTE; barrier; read refcount
> >> + * - inc refcount; barrier; read PTE
> >> + */
> >> + smp_mb();
> >> +
> >> + ref_count = folio_ref_count(folio);
> >> + map_count = folio_mapcount(folio);
> >> +
> >> + /*
> >> + * Order reads for page refcount and dirty flag
> >> + * (see comments in __remove_mapping()).
> >> + */
> >> + smp_rmb();
> >> +
> >> + if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> >> + /*
> >> + * redirtied either using the page table or a previously
> >> + * obtained GUP reference.
> >> + */
> >> + set_ptes(mm, address, ptep, pteval, nr_pages);
> >> + folio_set_swapbacked(folio);
> >> + return 1;
> >> + }
> >> +
> >> + if (ref_count != 1 + map_count) {
> >> + /*
> >> + * Additional reference. Could be a GUP reference or any
> >> + * speculative reference. GUP users must mark the folio
> >> + * dirty if there was a modification. This folio cannot be
> >> + * reclaimed right now either way, so act just like nothing
> >> + * happened.
> >> + * We'll come back here later and detect if the folio was
> >> + * dirtied when the additional reference is gone.
> >> + */
> >> + set_ptes(mm, address, ptep, pteval, nr_pages);
> >> + return 1;
> >> + }
> >> +
> >> + add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
> >> + return 0;
> >> +}
> >> +
> >> /*
> >> * @arg: enum ttu_flags will be passed to this argument
> >> */
> >> @@ -2227,46 +2278,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>
> >> /* MADV_FREE page check */
> >> if (!folio_test_swapbacked(folio)) {
> >> - int ref_count, map_count;
> >> -
> >> - /*
> >> - * Synchronize with gup_pte_range():
> >> - * - clear PTE; barrier; read refcount
> >> - * - inc refcount; barrier; read PTE
> >> - */
> >> - smp_mb();
> >> -
> >> - ref_count = folio_ref_count(folio);
> >> - map_count = folio_mapcount(folio);
> >> -
> >> - /*
> >> - * Order reads for page refcount and dirty flag
> >> - * (see comments in __remove_mapping()).
> >> - */
> >> - smp_rmb();
> >> -
> >> - if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> >> - /*
> >> - * redirtied either using the page table or a previously
> >> - * obtained GUP reference.
> >> - */
> >> - set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
> >> - folio_set_swapbacked(folio);
> >> + if (commit_ttu_lazyfree_folio(vma, folio, address,
> >> + pvmw.pte, pteval,
> >> + nr_pages))
> >
> > With above corrections this would be:
> >
> > if (!lazyfree_range(vma, folio, address, pvmw.pte, pteval, nr_pages))
> > ...
> >
> >> goto walk_abort;
> >> - } else if (ref_count != 1 + map_count) {
> >> - /*
> >> - * Additional reference. Could be a GUP reference or any
> >> - * speculative reference. GUP users must mark the folio
> >> - * dirty if there was a modification. This folio cannot be
> >> - * reclaimed right now either way, so act just like nothing
> >> - * happened.
> >> - * We'll come back here later and detect if the folio was
> >> - * dirtied when the additional reference is gone.
> >> - */
> >> - set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
> >> - goto walk_abort;
> >> - }
> >> - add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
> >> goto discard;
> >> }
> >>
> >> --
> >> 2.34.1
> >>
> >
> > Thanks, Lorenzo
> >
>
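The three-way decision the quoted commit_ttu_lazyfree_folio() hunk makes can be sketched as a self-contained userspace model (toy struct and names, not the kernel API; the barriers and set_ptes() restore are elided). It returns true when the unmap must be aborted and the PTEs restored, false when the folio can be discarded:

```c
#include <stdbool.h>

struct toy_folio {
	int ref_count;
	int map_count;
	bool dirty;
	bool swapbacked;
};

static bool lazyfree_must_abort(struct toy_folio *f, bool vm_droppable)
{
	/* Redirtied via the page table or a previously obtained GUP
	 * reference: restore the PTEs and mark it swapbacked again. */
	if (f->dirty && !vm_droppable) {
		f->swapbacked = true;
		return true;
	}

	/* An extra reference beyond the mappings plus our own: someone
	 * (e.g. GUP) may still dirty it, so back off for now and retry
	 * once the additional reference is gone. */
	if (f->ref_count != 1 + f->map_count)
		return true;

	/* Clean and unreferenced: safe to discard. */
	return false;
}
```

With the suggested bool conversion, the call site reads naturally: abort the walk when the function returns true, fall through to discard otherwise.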
* Re: [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable
2026-03-11 5:42 ` Dev Jain
2026-03-19 15:26 ` Lorenzo Stoakes (Oracle)
@ 2026-03-19 16:47 ` Matthew Wilcox
1 sibling, 0 replies; 46+ messages in thread
From: Matthew Wilcox @ 2026-03-19 16:47 UTC (permalink / raw)
To: Dev Jain
Cc: Lorenzo Stoakes (Oracle), akpm, axelrasmussen, yuanchu, david,
hughd, chrisl, kasong, weixugc, Liam.Howlett, vbabka, rppt,
surenb, mhocko, riel, harry.yoo, jannh, pfalcato, baolin.wang,
shikemeng, nphamcs, bhe, baohua, youngjun.park, ziy, kas, yuzhao,
linux-mm, linux-kernel, ryan.roberts, anshuman.khandual
On Wed, Mar 11, 2026 at 11:12:22AM +0530, Dev Jain wrote:
> On 10/03/26 2:19 pm, Lorenzo Stoakes (Oracle) wrote:
> > On Tue, Mar 10, 2026 at 01:00:10PM +0530, Dev Jain wrote:
> >> Teach folio_dup_swap to handle a batch of consecutive pages. Note that
> >> folio_dup_swap already can handle a subset of this: nr_pages == 1 and
> >> nr_pages == folio_nr_pages(folio). Generalize this to any nr_pages.
> >>
> >> Currently we have a not-so-nice logic of passing in subpage == NULL if
> >> we mean to exercise the logic on the entire folio, and subpage != NULL if
> >> we want to exercise the logic on only that subpage. Remove this
> >> indirection, and explicitly pass subpage != NULL, and the number of
> >> pages required.
> >
> > You've made the interface more confusing? Now we can update multiple subpages
> > but specify only one? :)
For the next version, please lose the "subpage" name. It's just a page.
folios have pages, not subpages.
Thread overview: 46+ messages
2026-03-10 7:30 [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Dev Jain
2026-03-10 7:30 ` [PATCH 1/9] mm/rmap: make nr_pages signed in try_to_unmap_one Dev Jain
2026-03-10 7:56 ` Lorenzo Stoakes (Oracle)
2026-03-10 8:06 ` David Hildenbrand (Arm)
2026-03-10 8:23 ` Dev Jain
2026-03-10 12:40 ` Matthew Wilcox
2026-03-11 4:54 ` Dev Jain
2026-03-10 7:30 ` [PATCH 2/9] mm/rmap: initialize nr_pages to 1 at loop start " Dev Jain
2026-03-10 8:10 ` Lorenzo Stoakes (Oracle)
2026-03-10 8:31 ` Dev Jain
2026-03-10 8:39 ` Lorenzo Stoakes (Oracle)
2026-03-10 8:43 ` Dev Jain
2026-03-10 7:30 ` [PATCH 3/9] mm/rmap: refactor lazyfree unmap commit path to commit_ttu_lazyfree_folio() Dev Jain
2026-03-10 8:19 ` Lorenzo Stoakes (Oracle)
2026-03-10 8:42 ` Dev Jain
2026-03-19 15:53 ` Lorenzo Stoakes (Oracle)
2026-03-10 7:30 ` [PATCH 4/9] mm/memory: Batch set uffd-wp markers during zapping Dev Jain
2026-03-10 7:30 ` [PATCH 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs Dev Jain
2026-03-10 8:34 ` Lorenzo Stoakes (Oracle)
2026-03-10 23:32 ` Barry Song
2026-03-11 4:14 ` Barry Song
2026-03-11 4:52 ` Dev Jain
2026-03-11 4:56 ` Dev Jain
2026-03-10 7:30 ` [PATCH 6/9] mm/swapfile: Make folio_dup_swap batchable Dev Jain
2026-03-10 8:27 ` Kairui Song
2026-03-10 8:46 ` Dev Jain
2026-03-10 8:49 ` Lorenzo Stoakes (Oracle)
2026-03-11 5:42 ` Dev Jain
2026-03-19 15:26 ` Lorenzo Stoakes (Oracle)
2026-03-19 16:47 ` Matthew Wilcox
2026-03-18 0:20 ` kernel test robot
2026-03-10 7:30 ` [PATCH 7/9] mm/swapfile: Make folio_put_swap batchable Dev Jain
2026-03-10 8:29 ` Kairui Song
2026-03-10 8:50 ` Dev Jain
2026-03-10 8:55 ` Lorenzo Stoakes (Oracle)
2026-03-18 1:04 ` kernel test robot
2026-03-10 7:30 ` [PATCH 8/9] mm/rmap: introduce folio_try_share_anon_rmap_ptes Dev Jain
2026-03-10 9:38 ` Lorenzo Stoakes (Oracle)
2026-03-11 8:09 ` Dev Jain
2026-03-12 8:19 ` Wei Yang
2026-03-19 15:47 ` Lorenzo Stoakes (Oracle)
2026-03-10 7:30 ` [PATCH 9/9] mm/rmap: enable batch unmapping of anonymous folios Dev Jain
2026-03-10 8:02 ` [PATCH 0/9] mm/rmap: Optimize anonymous large folio unmapping Lorenzo Stoakes (Oracle)
2026-03-10 9:28 ` Dev Jain
2026-03-10 12:59 ` Lance Yang
2026-03-11 8:11 ` Dev Jain