[PATCH v2 0/9] Optimize anonymous large folio unmapping

* [PATCH v2 0/9] Optimize anonymous large folio unmapping
@ 2026-04-10 10:31 Dev Jain
  2026-04-10 10:31 ` [PATCH v2 1/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one Dev Jain
                   ` (9 more replies)
  0 siblings, 10 replies; 18+ messages in thread
From: Dev Jain @ 2026-04-10 10:31 UTC (permalink / raw)
  To: akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, Dev Jain

Speed up unmapping of anonymous large folios by clearing the ptes, and
setting swap ptes, in one go.

The following benchmark (stolen from Barry at [1]) is used to measure the
time taken to swap out 256M worth of memory backed by 64K large folios:

 #define _GNU_SOURCE
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/mman.h>
 #include <string.h>
 #include <time.h>
 #include <unistd.h>
 #include <errno.h>

 #define SIZE_MB 256
 #define SIZE_BYTES (SIZE_MB * 1024 * 1024)

 int main() {
     void *addr = mmap(NULL, SIZE_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     if (addr == MAP_FAILED) {
         perror("mmap failed");
         return 1;
     }

     memset(addr, 0, SIZE_BYTES);

     struct timespec start, end;
     clock_gettime(CLOCK_MONOTONIC, &start);

     if (madvise(addr, SIZE_BYTES, MADV_PAGEOUT) != 0) {
         perror("madvise(MADV_PAGEOUT) failed");
         munmap(addr, SIZE_BYTES);
         return 1;
     }

     clock_gettime(CLOCK_MONOTONIC, &end);

     long duration_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
     printf("madvise(MADV_PAGEOUT) took %ld ns (%.3f ms)\n",
            duration_ns, duration_ns / 1e6);

     munmap(addr, SIZE_BYTES);
     return 0;
 }

Performance as measured on a Linux VM on Apple M3 (arm64):

Vanilla - Mean: 37401913 ns, std dev: 12%
Patched - Mean: 17420282 ns, std dev: 11%

No regression observed on 4K folios.

Performance as measured on bare metal x86:

Vanilla - mean: 54986286 ns, std dev: 1.5%
Patched - mean: 51930795 ns, std dev: 3%

Interestingly, no obvious improvement is observed on x86, hinting that the
benefit lies mainly in the reduction of ptep_get() calls and the reduction
of TLB flushes during contpte-unfolding, on arm64.

No regression is observed on 4K folios on x86 either.
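
For reference, the relative change implied by the means above can be
worked out with a few lines of C. This is only an arithmetic sketch over
the four reported means; the helper names (speedup, percent_saved, report)
are hypothetical, and the calculation ignores the reported std dev.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup factor: how many times faster the patched run is. */
static double speedup(double vanilla_ns, double patched_ns)
{
	assert(patched_ns > 0.0);
	return vanilla_ns / patched_ns;
}

/* Percentage of time saved relative to the vanilla run. */
static double percent_saved(double vanilla_ns, double patched_ns)
{
	assert(vanilla_ns > 0.0);
	return 100.0 * (vanilla_ns - patched_ns) / vanilla_ns;
}

/* Print one summary line per platform (hypothetical helper). */
static void report(const char *name, double vanilla_ns, double patched_ns)
{
	printf("%s: %.2fx faster, %.1f%% less time\n", name,
	       speedup(vanilla_ns, patched_ns),
	       percent_saved(vanilla_ns, patched_ns));
}
```

Calling report("arm64", 37401913, 17420282) and
report("x86", 54986286, 51930795) gives roughly 2.15x (about 53% less
time) on arm64 and roughly 1.06x (about 5.6% less time) on x86.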

---
Based on mm-unstable 3fa44141e0bb ("ksm: optimize rmap_walk_ksm by passing
a suitable address range"). mm-selftests pass.

v1->v2:
 - Keep nr_pages as unsigned long
 - Add patch 2
 - Rename some functions, make return type bool for functions returning 0/1
 - Drop page_vma_mapped_walk_jump - this is implicitly handled
 - Drop likely()
 - Add folio_dup/put_swap_pages, do subpage -> page
 - Shorten the kerneldoc to remove unnecessary information - keep it
   aligned with analogous functions
 - Put clear_pages_anon_exclusive to mm.h
 - Some more refactoring in last patch with finish_folio_unmap

Dev Jain (9):
  mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
  mm/rmap: refactor hugetlb pte clearing in try_to_unmap_one
  mm/rmap: refactor some code around lazyfree folio unmapping
  mm/memory: Batch set uffd-wp markers during zapping
  mm/rmap: batch unmap folios belonging to uffd-wp VMAs
  mm/swapfile: Add batched version of folio_dup_swap
  mm/swapfile: Add batched version of folio_put_swap
  mm/rmap: Add batched version of folio_try_share_anon_rmap_pte
  mm/rmap: enable batch unmapping of anonymous folios

 include/linux/mm.h        |  11 ++
 include/linux/mm_inline.h |  32 +--
 include/linux/rmap.h      |  27 ++-
 mm/internal.h             |  26 +++
 mm/memory.c               |  26 +--
 mm/mprotect.c             |  17 --
 mm/rmap.c                 | 404 +++++++++++++++++++++++---------------
 mm/shmem.c                |   8 +-
 mm/swap.h                 |  23 ++-
 mm/swapfile.c             |  42 ++--
 10 files changed, 380 insertions(+), 236 deletions(-)

-- 
2.34.1




* [PATCH v2 1/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
  2026-04-10 10:31 [PATCH v2 0/9] Optimize anonymous large folio unmapping Dev Jain
@ 2026-04-10 10:31 ` Dev Jain
  2026-04-11  1:02   ` Barry Song
  2026-04-10 10:31 ` [PATCH v2 2/9] mm/rmap: refactor hugetlb pte clearing " Dev Jain
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 18+ messages in thread
From: Dev Jain @ 2026-04-10 10:31 UTC (permalink / raw)
  To: akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, Dev Jain

Initialize nr_pages to 1 at the start of each loop iteration, like
folio_referenced_one() does.

Without this, nr_pages computed by a previous folio_unmap_pte_batch() call
can be reused on a later iteration that does not run
folio_unmap_pte_batch() again.

I don’t think this is causing a bug today, but it is fragile.

A real bug would require this sequence within the same try_to_unmap_one()
call:

1. Hit the pte_present(pteval) branch and set nr_pages > 1.
2. Later hit the else branch and do pte_clear() for a device-exclusive PTE,
   and execute the rest of the code with nr_pages > 1.

Executing the above would imply a lazyfree folio is mapped by a mix of
present PTEs and device-exclusive PTEs.

In practice, device-exclusive PTEs imply a GUP pin on the folio, and
lazyfree unmapping aborts try_to_unmap_one() when it detects that
condition. So today this likely does not manifest, but initializing
nr_pages per-iteration is still the correct and safer behavior.
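
The stale-value hazard described above can be modelled in user space. The
sketch below is not kernel code: struct model_pte and pages_processed()
are hypothetical stand-ins, and each array element represents one loop
iteration of the walk rather than a real page table entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for what one walk iteration sees. */
struct model_pte {
	int present;		/* 1: present PTE, 0: non-present entry */
	unsigned long batch;	/* batch size a present entry would report */
};

/*
 * Sum how many pages each iteration claims to process.
 * reset_each_iter == 0 models initializing nr_pages once, before the loop;
 * reset_each_iter == 1 models resetting it at the start of each iteration.
 */
static unsigned long pages_processed(const struct model_pte *ptes, size_t n,
				     int reset_each_iter)
{
	unsigned long nr_pages = 1, total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (reset_each_iter)
			nr_pages = 1;
		if (ptes[i].present) {
			/* models the folio_unmap_pte_batch() result */
			assert(ptes[i].batch >= 1);
			nr_pages = ptes[i].batch;
		}
		/* the non-present path expects nr_pages == 1 here */
		total += nr_pages;
	}
	return total;
}
```

With a present entry reporting a batch of 4 followed by a non-present
entry, the once-only initialization accounts for 8 pages, while the
per-iteration reset accounts for the expected 5.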

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/rmap.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 78b7fb5f367ce..62a8c912fd788 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1991,7 +1991,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	struct page *subpage;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
-	unsigned long nr_pages = 1, end_addr;
+	unsigned long nr_pages;
+	unsigned long end_addr;
 	unsigned long pfn;
 	unsigned long hsz = 0;
 	int ptes = 0;
@@ -2030,6 +2031,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(&range);

 	while (page_vma_mapped_walk(&pvmw)) {
+		nr_pages = 1;
 		/*
 		 * If the folio is in an mlock()d vma, we must not swap it out.
 		 */
-- 
2.34.1


* Re: [PATCH v2 1/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
  2026-04-10 10:31 ` [PATCH v2 1/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one Dev Jain
@ 2026-04-11  1:02   ` Barry Song
  0 siblings, 0 replies; 18+ messages in thread
From: Barry Song @ 2026-04-11  1:02 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, hughd, chrisl, ljs, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, kasong, qi.zheng, shakeel.butt, axelrasmussen,
	yuanchu, weixugc, riel, harry, jannh, pfalcato, baolin.wang,
	shikemeng, nphamcs, bhe, youngjun.park, linux-mm, linux-kernel,
	ryan.roberts, anshuman.khandual

On Fri, Apr 10, 2026 at 6:32 PM Dev Jain <dev.jain@arm.com> wrote:
>
> Initialize nr_pages to 1 at the start of each loop iteration, like
> folio_referenced_one() does.
>
> Without this, nr_pages computed by a previous folio_unmap_pte_batch() call
> can be reused on a later iteration that does not run
> folio_unmap_pte_batch() again.
>
> I don’t think this is causing a bug today, but it is fragile.
>
> A real bug would require this sequence within the same try_to_unmap_one()
> call:
>
> 1. Hit the pte_present(pteval) branch and set nr_pages > 1.
> 2. Later hit the else branch and do pte_clear() for a device-exclusive PTE,
>    and execute the rest of the code with nr_pages > 1.
>
> Executing the above would imply a lazyfree folio is mapped by a mix of
> present PTEs and device-exclusive PTEs.
>
> In practice, device-exclusive PTEs imply a GUP pin on the folio, and
> lazyfree unmapping aborts try_to_unmap_one() when it detects that
> condition. So today this likely does not manifest, but initializing
> nr_pages per-iteration is still the correct and safer behavior.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Acked-by: Barry Song <baohua@kernel.org>



* [PATCH v2 2/9] mm/rmap: refactor hugetlb pte clearing in try_to_unmap_one
  2026-04-10 10:31 [PATCH v2 0/9] Optimize anonymous large folio unmapping Dev Jain
  2026-04-10 10:31 ` [PATCH v2 1/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one Dev Jain
@ 2026-04-10 10:31 ` Dev Jain
  2026-04-11  8:55   ` Barry Song
  2026-04-11 11:45   ` Jie Gan
  2026-04-10 10:31 ` [PATCH v2 3/9] mm/rmap: refactor some code around lazyfree folio unmapping Dev Jain
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 18+ messages in thread
From: Dev Jain @ 2026-04-10 10:31 UTC (permalink / raw)
  To: akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, Dev Jain

Simplify the code by refactoring the folio_test_hugetlb() branch into
a new function.

No functional change is intended.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/rmap.c | 116 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 67 insertions(+), 49 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 62a8c912fd788..a9c43e2f6e695 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1978,6 +1978,67 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
 }
 
+static inline bool unmap_hugetlb_folio(struct vm_area_struct *vma,
+		struct folio *folio, struct page_vma_mapped_walk *pvmw,
+		struct page *page, enum ttu_flags flags, pte_t *pteval,
+		struct mmu_notifier_range *range, bool *walk_done)
+{
+	/*
+	 * The try_to_unmap() is only passed a hugetlb page
+	 * in the case where the hugetlb page is poisoned.
+	 */
+	VM_WARN_ON_PAGE(!PageHWPoison(page), page);
+	/*
+	 * huge_pmd_unshare may unmap an entire PMD page.
+	 * There is no way of knowing exactly which PMDs may
+	 * be cached for this mm, so we must flush them all.
+	 * start/end were already adjusted above to cover this
+	 * range.
+	 */
+	flush_cache_range(vma, range->start, range->end);
+
+	/*
+	 * To call huge_pmd_unshare, i_mmap_rwsem must be
+	 * held in write mode.  Caller needs to explicitly
+	 * do this outside rmap routines.
+	 *
+	 * We also must hold hugetlb vma_lock in write mode.
+	 * Lock order dictates acquiring vma_lock BEFORE
+	 * i_mmap_rwsem.  We can only try lock here and fail
+	 * if unsuccessful.
+	 */
+	if (!folio_test_anon(folio)) {
+		struct mmu_gather tlb;
+
+		VM_WARN_ON(!(flags & TTU_RMAP_LOCKED));
+		if (!hugetlb_vma_trylock_write(vma)) {
+			*walk_done = true;
+			return false;
+		}
+
+		tlb_gather_mmu_vma(&tlb, vma);
+		if (huge_pmd_unshare(&tlb, vma, pvmw->address, pvmw->pte)) {
+			hugetlb_vma_unlock_write(vma);
+			huge_pmd_unshare_flush(&tlb, vma);
+			tlb_finish_mmu(&tlb);
+			/*
+			 * The PMD table was unmapped,
+			 * consequently unmapping the folio.
+			 */
+			*walk_done = true;
+			return true;
+		}
+		hugetlb_vma_unlock_write(vma);
+		tlb_finish_mmu(&tlb);
+	}
+	*pteval = huge_ptep_clear_flush(vma, pvmw->address, pvmw->pte);
+	if (pte_dirty(*pteval))
+		folio_mark_dirty(folio);
+
+	*walk_done = false;
+	return true;
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
@@ -2115,56 +2176,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 PageAnonExclusive(subpage);
 
 		if (folio_test_hugetlb(folio)) {
-			bool anon = folio_test_anon(folio);
-
-			/*
-			 * The try_to_unmap() is only passed a hugetlb page
-			 * in the case where the hugetlb page is poisoned.
-			 */
-			VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
-			/*
-			 * huge_pmd_unshare may unmap an entire PMD page.
-			 * There is no way of knowing exactly which PMDs may
-			 * be cached for this mm, so we must flush them all.
-			 * start/end were already adjusted above to cover this
-			 * range.
-			 */
-			flush_cache_range(vma, range.start, range.end);
+			bool walk_done;
 
-			/*
-			 * To call huge_pmd_unshare, i_mmap_rwsem must be
-			 * held in write mode.  Caller needs to explicitly
-			 * do this outside rmap routines.
-			 *
-			 * We also must hold hugetlb vma_lock in write mode.
-			 * Lock order dictates acquiring vma_lock BEFORE
-			 * i_mmap_rwsem.  We can only try lock here and fail
-			 * if unsuccessful.
-			 */
-			if (!anon) {
-				struct mmu_gather tlb;
-
-				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
-				if (!hugetlb_vma_trylock_write(vma))
-					goto walk_abort;
-
-				tlb_gather_mmu_vma(&tlb, vma);
-				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
-					hugetlb_vma_unlock_write(vma);
-					huge_pmd_unshare_flush(&tlb, vma);
-					tlb_finish_mmu(&tlb);
-					/*
-					 * The PMD table was unmapped,
-					 * consequently unmapping the folio.
-					 */
-					goto walk_done;
-				}
-				hugetlb_vma_unlock_write(vma);
-				tlb_finish_mmu(&tlb);
-			}
-			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
-			if (pte_dirty(pteval))
-				folio_mark_dirty(folio);
+			ret = unmap_hugetlb_folio(vma, folio, &pvmw, subpage,
+						  flags, &pteval, &range,
+						  &walk_done);
+			if (walk_done)
+				goto walk_done;
 		} else if (likely(pte_present(pteval))) {
 			nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
 			end_addr = address + nr_pages * PAGE_SIZE;
-- 
2.34.1




* Re: [PATCH v2 2/9] mm/rmap: refactor hugetlb pte clearing in try_to_unmap_one
  2026-04-10 10:31 ` [PATCH v2 2/9] mm/rmap: refactor hugetlb pte clearing " Dev Jain
@ 2026-04-11  8:55   ` Barry Song
  2026-04-11 16:05     ` Dev Jain
  2026-04-11 11:45   ` Jie Gan
  1 sibling, 1 reply; 18+ messages in thread
From: Barry Song @ 2026-04-11  8:55 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, hughd, chrisl, ljs, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, kasong, qi.zheng, shakeel.butt, axelrasmussen,
	yuanchu, weixugc, riel, harry, jannh, pfalcato, baolin.wang,
	shikemeng, nphamcs, bhe, youngjun.park, linux-mm, linux-kernel,
	ryan.roberts, anshuman.khandual

On Fri, Apr 10, 2026 at 6:32 PM Dev Jain <dev.jain@arm.com> wrote:
>
> Simplify the code by refactoring the folio_test_hugetlb() branch into
> a new function.
>
> No functional change is intended.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/rmap.c | 116 +++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 67 insertions(+), 49 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 62a8c912fd788..a9c43e2f6e695 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1978,6 +1978,67 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>                                      FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>  }
>
> +static inline bool unmap_hugetlb_folio(struct vm_area_struct *vma,
> +               struct folio *folio, struct page_vma_mapped_walk *pvmw,
> +               struct page *page, enum ttu_flags flags, pte_t *pteval,
> +               struct mmu_notifier_range *range, bool *walk_done)
> +{

Can we add a comment before the function explaining what
the return value means?

> +       /*
> +        * The try_to_unmap() is only passed a hugetlb page
> +        * in the case where the hugetlb page is poisoned.
> +        */
> +       VM_WARN_ON_PAGE(!PageHWPoison(page), page);
> +       /*
> +        * huge_pmd_unshare may unmap an entire PMD page.
> +        * There is no way of knowing exactly which PMDs may
> +        * be cached for this mm, so we must flush them all.
> +        * start/end were already adjusted above to cover this
> +        * range.
> +        */
> +       flush_cache_range(vma, range->start, range->end);
> +
> +       /*
> +        * To call huge_pmd_unshare, i_mmap_rwsem must be
> +        * held in write mode.  Caller needs to explicitly
> +        * do this outside rmap routines.
> +        *
> +        * We also must hold hugetlb vma_lock in write mode.
> +        * Lock order dictates acquiring vma_lock BEFORE
> +        * i_mmap_rwsem.  We can only try lock here and fail
> +        * if unsuccessful.
> +        */
> +       if (!folio_test_anon(folio)) {
> +               struct mmu_gather tlb;
> +
> +               VM_WARN_ON(!(flags & TTU_RMAP_LOCKED));
> +               if (!hugetlb_vma_trylock_write(vma)) {
> +                       *walk_done = true;
> +                       return false;
> +               }
> +

Sometimes I feel walk_done is misleading, since walk_done with
ret = false actually means walk_abort.

So another option is to make this function return a tristate:
WALK_DONE, WALK_ABORT, WALK_CONT. Then we could drop the
walk_done argument entirely.

Thanks
Barry



* Re: [PATCH v2 2/9] mm/rmap: refactor hugetlb pte clearing in try_to_unmap_one
  2026-04-11  8:55   ` Barry Song
@ 2026-04-11 16:05     ` Dev Jain
  2026-04-11 16:24       ` Dev Jain
  0 siblings, 1 reply; 18+ messages in thread
From: Dev Jain @ 2026-04-11 16:05 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, hughd, chrisl, ljs, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, kasong, qi.zheng, shakeel.butt, axelrasmussen,
	yuanchu, weixugc, riel, harry, jannh, pfalcato, baolin.wang,
	shikemeng, nphamcs, bhe, youngjun.park, linux-mm, linux-kernel,
	ryan.roberts, anshuman.khandual



On 11/04/26 2:25 pm, Barry Song wrote:
> On Fri, Apr 10, 2026 at 6:32 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>> Simplify the code by refactoring the folio_test_hugetlb() branch into
>> a new function.
>>
>> No functional change is intended.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>  mm/rmap.c | 116 +++++++++++++++++++++++++++++++-----------------------
>>  1 file changed, 67 insertions(+), 49 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 62a8c912fd788..a9c43e2f6e695 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1978,6 +1978,67 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>                                      FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>>  }
>>
>> +static inline bool unmap_hugetlb_folio(struct vm_area_struct *vma,
>> +               struct folio *folio, struct page_vma_mapped_walk *pvmw,
>> +               struct page *page, enum ttu_flags flags, pte_t *pteval,
>> +               struct mmu_notifier_range *range, bool *walk_done)
>> +{
> 
> Can we add a comment before the function explaining what
> the return value means?

Yes I can add that.


> 
>> +       /*
>> +        * The try_to_unmap() is only passed a hugetlb page
>> +        * in the case where the hugetlb page is poisoned.
>> +        */
>> +       VM_WARN_ON_PAGE(!PageHWPoison(page), page);
>> +       /*
>> +        * huge_pmd_unshare may unmap an entire PMD page.
>> +        * There is no way of knowing exactly which PMDs may
>> +        * be cached for this mm, so we must flush them all.
>> +        * start/end were already adjusted above to cover this
>> +        * range.
>> +        */
>> +       flush_cache_range(vma, range->start, range->end);
>> +
>> +       /*
>> +        * To call huge_pmd_unshare, i_mmap_rwsem must be
>> +        * held in write mode.  Caller needs to explicitly
>> +        * do this outside rmap routines.
>> +        *
>> +        * We also must hold hugetlb vma_lock in write mode.
>> +        * Lock order dictates acquiring vma_lock BEFORE
>> +        * i_mmap_rwsem.  We can only try lock here and fail
>> +        * if unsuccessful.
>> +        */
>> +       if (!folio_test_anon(folio)) {
>> +               struct mmu_gather tlb;
>> +
>> +               VM_WARN_ON(!(flags & TTU_RMAP_LOCKED));
>> +               if (!hugetlb_vma_trylock_write(vma)) {
>> +                       *walk_done = true;
>> +                       return false;
>> +               }
>> +
> 
> Sometimes I feel walk_done is misleading, since walk_done with
> ret = false actually means walk_abort.

I'll rename this to exit_walk, so it doesn't collide with
the label names.

> 
> So another option is to make this function return a tristate:
> WALK_DONE, WALK_ABORT, WALK_CONT. Then we could drop the
> walk_done argument entirely.

I thought a lot about how to refactor try_to_unmap_one() as
a whole, and couldn't come up with a good solution.

There are these conditions:

1. ret = false => page_vma_mapped_walk_done(), break
2. ret not decided, "continue"
3. ret = true
   a) exit the while loop naturally
   b) exit prematurely -> page_vma_mapped_walk_done(), break

I had thought about the refactoring method to have an enum for
all conditions. So we can refactor bits of code, return an
enum, but we will still retain ugliness like

if (ret == WALK_DONE)
	goto walk_done;
if (ret == WALK_ABORT)
	goto walk_abort;
if (ret == WALK_CONTINUE)
	continue;

This seemed more like forced refactoring to me; IMHO it doesn't
reduce the complexity of the function at all.

I don't have a clever solution to get rid of all the label
jumping, so I refactored what I could.
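
For illustration, the tristate idea discussed above could look roughly
like the user-space sketch below. Everything here is hypothetical:
enum walk_result, handle_step() and walk() are made-up names, and the
integer "steps" only stand in for whatever a refactored helper would
decide. It shows that the caller still needs one dispatch per outcome.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tristate, after the WALK_* names suggested in review. */
enum walk_result { WALK_CONTINUE, WALK_DONE, WALK_ABORT };

/* Hypothetical helper: negative aborts, zero finishes early, else go on. */
static enum walk_result handle_step(int step)
{
	if (step < 0)
		return WALK_ABORT;
	if (step == 0)
		return WALK_DONE;
	return WALK_CONTINUE;
}

/* Returns true unless a step aborted; *visited counts iterations run. */
static bool walk(const int *steps, int n, int *visited)
{
	bool ret = true;
	int i;

	assert(steps != NULL && visited != NULL);
	*visited = 0;
	for (i = 0; i < n; i++) {
		(*visited)++;
		switch (handle_step(steps[i])) {
		case WALK_CONTINUE:
			continue;	/* condition 2: next iteration */
		case WALK_ABORT:
			ret = false;	/* condition 1: fail and stop */
			goto out;
		case WALK_DONE:
			goto out;	/* condition 3b: succeed, stop early */
		}
	}
out:
	return ret;
}
```

Even in this toy form the loop body keeps three explicit exits, which
matches the concern that the enum relocates the branching rather than
removing it.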

> 
> Thanks
> Barry




* Re: [PATCH v2 2/9] mm/rmap: refactor hugetlb pte clearing in try_to_unmap_one
  2026-04-11 16:05     ` Dev Jain
@ 2026-04-11 16:24       ` Dev Jain
  0 siblings, 0 replies; 18+ messages in thread
From: Dev Jain @ 2026-04-11 16:24 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, hughd, chrisl, ljs, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, kasong, qi.zheng, shakeel.butt, axelrasmussen,
	yuanchu, weixugc, riel, harry, jannh, pfalcato, baolin.wang,
	shikemeng, nphamcs, bhe, youngjun.park, linux-mm, linux-kernel,
	ryan.roberts, anshuman.khandual



On 11/04/26 9:35 pm, Dev Jain wrote:
> 
> 
> On 11/04/26 2:25 pm, Barry Song wrote:
>> On Fri, Apr 10, 2026 at 6:32 PM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>> Simplify the code by refactoring the folio_test_hugetlb() branch into
>>> a new function.
>>>
>>> No functional change is intended.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>  mm/rmap.c | 116 +++++++++++++++++++++++++++++++-----------------------
>>>  1 file changed, 67 insertions(+), 49 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 62a8c912fd788..a9c43e2f6e695 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1978,6 +1978,67 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>                                      FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>>>  }
>>>
>>> +static inline bool unmap_hugetlb_folio(struct vm_area_struct *vma,
>>> +               struct folio *folio, struct page_vma_mapped_walk *pvmw,
>>> +               struct page *page, enum ttu_flags flags, pte_t *pteval,
>>> +               struct mmu_notifier_range *range, bool *walk_done)
>>> +{
>>
>> Can we add a comment before the function explaining what
>> the return value means?
> 
> Yes I can add that.
> 
> 
>>
>>> +       /*
>>> +        * The try_to_unmap() is only passed a hugetlb page
>>> +        * in the case where the hugetlb page is poisoned.
>>> +        */
>>> +       VM_WARN_ON_PAGE(!PageHWPoison(page), page);
>>> +       /*
>>> +        * huge_pmd_unshare may unmap an entire PMD page.
>>> +        * There is no way of knowing exactly which PMDs may
>>> +        * be cached for this mm, so we must flush them all.
>>> +        * start/end were already adjusted above to cover this
>>> +        * range.
>>> +        */
>>> +       flush_cache_range(vma, range->start, range->end);
>>> +
>>> +       /*
>>> +        * To call huge_pmd_unshare, i_mmap_rwsem must be
>>> +        * held in write mode.  Caller needs to explicitly
>>> +        * do this outside rmap routines.
>>> +        *
>>> +        * We also must hold hugetlb vma_lock in write mode.
>>> +        * Lock order dictates acquiring vma_lock BEFORE
>>> +        * i_mmap_rwsem.  We can only try lock here and fail
>>> +        * if unsuccessful.
>>> +        */
>>> +       if (!folio_test_anon(folio)) {
>>> +               struct mmu_gather tlb;
>>> +
>>> +               VM_WARN_ON(!(flags & TTU_RMAP_LOCKED));
>>> +               if (!hugetlb_vma_trylock_write(vma)) {
>>> +                       *walk_done = true;
>>> +                       return false;
>>> +               }
>>> +
>>
>> Sometimes I feel walk_done is misleading, since walk_done with
>> ret = false actually means walk_abort.
> 
> I'll rename this to exit_walk, so it doesn't collide with
> the label names.

But then

if (exit_walk)
	goto walk_done;

also looks weird.

I think I should do

if (exit_walk) {
	page_vma_mapped_walk_done(&pvmw);
	break;
}

The mess here is that we have a label walk_abort which
is literally ret = false + walk_done.

Perhaps we can remove one of the labels, keep a single label
exit_walk, and inline the "set ret = false/true and goto exit_walk"
at the users of the discarded label. I hesitated to do this because
both labels are currently used in a good number of places.

> 
>>
>> So another option is to make this function return a tristate:
>> WALK_DONE, WALK_ABORT, WALK_CONT. Then we could drop the
>> walk_done argument entirely.
> 
> I thought a lot about how to refactor try_to_unmap_one() as
> a whole, and couldn't come up with a good solution.
> 
> There are these conditions:
> 
> 1. ret = false => page_vma_mapped_walk_done(), break
> 2. ret not decided, "continue"
> 3. ret = true
>    a) exit the while loop naturally
>    b) exit prematurely -> page_vma_mapped_walk_done(), break
> 
> I had thought about the refactoring method to have an enum for
> all conditions. So we can refactor bits of code, return an
> enum, but we will still retain ugliness like
> 
> if (ret == WALK_DONE)
> 	goto walk_done;
> if (ret == WALK_ABORT)
> 	goto walk_abort;
> if (ret == WALK_CONTINUE)
> 	continue;
> 
> This seemed more like forced refactoring to me; IMHO it doesn't
> reduce the complexity of the function at all.
> 
> I don't have a clever solution to get rid of all the label
> jumping, so I refactored what I could.
> 
>>
>> Thanks
>> Barry
> 
> 




* Re: [PATCH v2 2/9] mm/rmap: refactor hugetlb pte clearing in try_to_unmap_one
  2026-04-10 10:31 ` [PATCH v2 2/9] mm/rmap: refactor hugetlb pte clearing " Dev Jain
  2026-04-11  8:55   ` Barry Song
@ 2026-04-11 11:45   ` Jie Gan
  2026-04-11 16:08     ` Dev Jain
  1 sibling, 1 reply; 18+ messages in thread
From: Jie Gan @ 2026-04-11 11:45 UTC (permalink / raw)
  To: Dev Jain, akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual



On 4/10/2026 6:31 PM, Dev Jain wrote:
> Simplify the code by refactoring the folio_test_hugetlb() branch into
> a new function.
> 
> No functional change is intended.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>   mm/rmap.c | 116 +++++++++++++++++++++++++++++++-----------------------
>   1 file changed, 67 insertions(+), 49 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 62a8c912fd788..a9c43e2f6e695 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1978,6 +1978,67 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>   				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>   }
>   
> +static inline bool unmap_hugetlb_folio(struct vm_area_struct *vma,
> +		struct folio *folio, struct page_vma_mapped_walk *pvmw,
> +		struct page *page, enum ttu_flags flags, pte_t *pteval,
> +		struct mmu_notifier_range *range, bool *walk_done)
> +{
> +	/*
> +	 * The try_to_unmap() is only passed a hugetlb page
> +	 * in the case where the hugetlb page is poisoned.
> +	 */
> +	VM_WARN_ON_PAGE(!PageHWPoison(page), page);

The commit message says "No functional change is intended.", but
VM_BUG_ON_PAGE has been changed to VM_WARN_ON_PAGE here?

> +	/*
> +	 * huge_pmd_unshare may unmap an entire PMD page.
> +	 * There is no way of knowing exactly which PMDs may
> +	 * be cached for this mm, so we must flush them all.
> +	 * start/end were already adjusted above to cover this
> +	 * range.
> +	 */
> +	flush_cache_range(vma, range->start, range->end);
> +
> +	/*
> +	 * To call huge_pmd_unshare, i_mmap_rwsem must be
> +	 * held in write mode.  Caller needs to explicitly
> +	 * do this outside rmap routines.
> +	 *
> +	 * We also must hold hugetlb vma_lock in write mode.
> +	 * Lock order dictates acquiring vma_lock BEFORE
> +	 * i_mmap_rwsem.  We can only try lock here and fail
> +	 * if unsuccessful.
> +	 */
> +	if (!folio_test_anon(folio)) {
> +		struct mmu_gather tlb;
> +
> +		VM_WARN_ON(!(flags & TTU_RMAP_LOCKED));

ditto

Thanks,
Jie

> +		if (!hugetlb_vma_trylock_write(vma)) {
> +			*walk_done = true;
> +			return false;
> +		}
> +
> +		tlb_gather_mmu_vma(&tlb, vma);
> +		if (huge_pmd_unshare(&tlb, vma, pvmw->address, pvmw->pte)) {
> +			hugetlb_vma_unlock_write(vma);
> +			huge_pmd_unshare_flush(&tlb, vma);
> +			tlb_finish_mmu(&tlb);
> +			/*
> +			 * The PMD table was unmapped,
> +			 * consequently unmapping the folio.
> +			 */
> +			*walk_done = true;
> +			return true;
> +		}
> +		hugetlb_vma_unlock_write(vma);
> +		tlb_finish_mmu(&tlb);
> +	}
> +	*pteval = huge_ptep_clear_flush(vma, pvmw->address, pvmw->pte);
> +	if (pte_dirty(*pteval))
> +		folio_mark_dirty(folio);
> +
> +	*walk_done = false;
> +	return true;
> +}
> +
>   /*
>    * @arg: enum ttu_flags will be passed to this argument
>    */
> @@ -2115,56 +2176,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   				 PageAnonExclusive(subpage);
>   
>   		if (folio_test_hugetlb(folio)) {
> -			bool anon = folio_test_anon(folio);
> -
> -			/*
> -			 * The try_to_unmap() is only passed a hugetlb page
> -			 * in the case where the hugetlb page is poisoned.
> -			 */
> -			VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
> -			/*
> -			 * huge_pmd_unshare may unmap an entire PMD page.
> -			 * There is no way of knowing exactly which PMDs may
> -			 * be cached for this mm, so we must flush them all.
> -			 * start/end were already adjusted above to cover this
> -			 * range.
> -			 */
> -			flush_cache_range(vma, range.start, range.end);
> +			bool walk_done;
>   
> -			/*
> -			 * To call huge_pmd_unshare, i_mmap_rwsem must be
> -			 * held in write mode.  Caller needs to explicitly
> -			 * do this outside rmap routines.
> -			 *
> -			 * We also must hold hugetlb vma_lock in write mode.
> -			 * Lock order dictates acquiring vma_lock BEFORE
> -			 * i_mmap_rwsem.  We can only try lock here and fail
> -			 * if unsuccessful.
> -			 */
> -			if (!anon) {
> -				struct mmu_gather tlb;
> -
> -				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> -				if (!hugetlb_vma_trylock_write(vma))
> -					goto walk_abort;
> -
> -				tlb_gather_mmu_vma(&tlb, vma);
> -				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
> -					hugetlb_vma_unlock_write(vma);
> -					huge_pmd_unshare_flush(&tlb, vma);
> -					tlb_finish_mmu(&tlb);
> -					/*
> -					 * The PMD table was unmapped,
> -					 * consequently unmapping the folio.
> -					 */
> -					goto walk_done;
> -				}
> -				hugetlb_vma_unlock_write(vma);
> -				tlb_finish_mmu(&tlb);
> -			}
> -			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> -			if (pte_dirty(pteval))
> -				folio_mark_dirty(folio);
> +			ret = unmap_hugetlb_folio(vma, folio, &pvmw, subpage,
> +						  flags, &pteval, &range,
> +						  &walk_done);
> +			if (walk_done)
> +				goto walk_done;
>   		} else if (likely(pte_present(pteval))) {
>   			nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
>   			end_addr = address + nr_pages * PAGE_SIZE;




* Re: [PATCH v2 2/9] mm/rmap: refactor hugetlb pte clearing in try_to_unmap_one
  2026-04-11 11:45   ` Jie Gan
@ 2026-04-11 16:08     ` Dev Jain
  0 siblings, 0 replies; 18+ messages in thread
From: Dev Jain @ 2026-04-11 16:08 UTC (permalink / raw)
  To: Jie Gan, akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual



On 11/04/26 5:15 pm, Jie Gan wrote:
> 
> 
> On 4/10/2026 6:31 PM, Dev Jain wrote:
>> Simplify the code by refactoring the folio_test_hugetlb() branch into
>> a new function.
>>
>> No functional change is intended.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/rmap.c | 116 +++++++++++++++++++++++++++++++-----------------------
>>   1 file changed, 67 insertions(+), 49 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 62a8c912fd788..a9c43e2f6e695 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1978,6 +1978,67 @@ static inline unsigned int
>> folio_unmap_pte_batch(struct folio *folio,
>>                        FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>>   }
>>   +static inline bool unmap_hugetlb_folio(struct vm_area_struct *vma,
>> +        struct folio *folio, struct page_vma_mapped_walk *pvmw,
>> +        struct page *page, enum ttu_flags flags, pte_t *pteval,
>> +        struct mmu_notifier_range *range, bool *walk_done)
>> +{
>> +    /*
>> +     * The try_to_unmap() is only passed a hugetlb page
>> +     * in the case where the hugetlb page is poisoned.
>> +     */
>> +    VM_WARN_ON_PAGE(!PageHWPoison(page), page);
> 
> As you mentioned "No functional change is intended." in commit message, but
> you have changed VM_BUG_ON_PAGE to VM_WARN_ON_PAGE here?

Forgot to mention this in the description : )

... "While at it, as BUG_ONs are discouraged, convert them
to WARN_ON."


> 
>> +    /*
>> +     * huge_pmd_unshare may unmap an entire PMD page.
>> +     * There is no way of knowing exactly which PMDs may
>> +     * be cached for this mm, so we must flush them all.
>> +     * start/end were already adjusted above to cover this
>> +     * range.
>> +     */
>> +    flush_cache_range(vma, range->start, range->end);
>> +
>> +    /*
>> +     * To call huge_pmd_unshare, i_mmap_rwsem must be
>> +     * held in write mode.  Caller needs to explicitly
>> +     * do this outside rmap routines.
>> +     *
>> +     * We also must hold hugetlb vma_lock in write mode.
>> +     * Lock order dictates acquiring vma_lock BEFORE
>> +     * i_mmap_rwsem.  We can only try lock here and fail
>> +     * if unsuccessful.
>> +     */
>> +    if (!folio_test_anon(folio)) {
>> +        struct mmu_gather tlb;
>> +
>> +        VM_WARN_ON(!(flags & TTU_RMAP_LOCKED));
> 
> ditto
> 
> Thanks,
> Jie
> 
>> +        if (!hugetlb_vma_trylock_write(vma)) {
>> +            *walk_done = true;
>> +            return false;
>> +        }
>> +
>> +        tlb_gather_mmu_vma(&tlb, vma);
>> +        if (huge_pmd_unshare(&tlb, vma, pvmw->address, pvmw->pte)) {
>> +            hugetlb_vma_unlock_write(vma);
>> +            huge_pmd_unshare_flush(&tlb, vma);
>> +            tlb_finish_mmu(&tlb);
>> +            /*
>> +             * The PMD table was unmapped,
>> +             * consequently unmapping the folio.
>> +             */
>> +            *walk_done = true;
>> +            return true;
>> +        }
>> +        hugetlb_vma_unlock_write(vma);
>> +        tlb_finish_mmu(&tlb);
>> +    }
>> +    *pteval = huge_ptep_clear_flush(vma, pvmw->address, pvmw->pte);
>> +    if (pte_dirty(*pteval))
>> +        folio_mark_dirty(folio);
>> +
>> +    *walk_done = false;
>> +    return true;
>> +}
>> +
>>   /*
>>    * @arg: enum ttu_flags will be passed to this argument
>>    */
>> @@ -2115,56 +2176,13 @@ static bool try_to_unmap_one(struct folio *folio,
>> struct vm_area_struct *vma,
>>                    PageAnonExclusive(subpage);
>>             if (folio_test_hugetlb(folio)) {
>> -            bool anon = folio_test_anon(folio);
>> -
>> -            /*
>> -             * The try_to_unmap() is only passed a hugetlb page
>> -             * in the case where the hugetlb page is poisoned.
>> -             */
>> -            VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
>> -            /*
>> -             * huge_pmd_unshare may unmap an entire PMD page.
>> -             * There is no way of knowing exactly which PMDs may
>> -             * be cached for this mm, so we must flush them all.
>> -             * start/end were already adjusted above to cover this
>> -             * range.
>> -             */
>> -            flush_cache_range(vma, range.start, range.end);
>> +            bool walk_done;
>>   -            /*
>> -             * To call huge_pmd_unshare, i_mmap_rwsem must be
>> -             * held in write mode.  Caller needs to explicitly
>> -             * do this outside rmap routines.
>> -             *
>> -             * We also must hold hugetlb vma_lock in write mode.
>> -             * Lock order dictates acquiring vma_lock BEFORE
>> -             * i_mmap_rwsem.  We can only try lock here and fail
>> -             * if unsuccessful.
>> -             */
>> -            if (!anon) {
>> -                struct mmu_gather tlb;
>> -
>> -                VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>> -                if (!hugetlb_vma_trylock_write(vma))
>> -                    goto walk_abort;
>> -
>> -                tlb_gather_mmu_vma(&tlb, vma);
>> -                if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
>> -                    hugetlb_vma_unlock_write(vma);
>> -                    huge_pmd_unshare_flush(&tlb, vma);
>> -                    tlb_finish_mmu(&tlb);
>> -                    /*
>> -                     * The PMD table was unmapped,
>> -                     * consequently unmapping the folio.
>> -                     */
>> -                    goto walk_done;
>> -                }
>> -                hugetlb_vma_unlock_write(vma);
>> -                tlb_finish_mmu(&tlb);
>> -            }
>> -            pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
>> -            if (pte_dirty(pteval))
>> -                folio_mark_dirty(folio);
>> +            ret = unmap_hugetlb_folio(vma, folio, &pvmw, subpage,
>> +                          flags, &pteval, &range,
>> +                          &walk_done);
>> +            if (walk_done)
>> +                goto walk_done;
>>           } else if (likely(pte_present(pteval))) {
>>               nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
>>               end_addr = address + nr_pages * PAGE_SIZE;
> 
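For readers following the refactor, the helper reports its outcome through
two channels: the return value (which the caller stores in ret) and the
*walk_done out-parameter (which decides whether the caller jumps to the
walk_done label). The following is a minimal userspace sketch of that
contract only, read off the quoted diff. The enum and the model_ function
are hypothetical names for illustration; no locking or page table work is
modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three situations the helper distinguishes. */
enum hugetlb_case {
	CASE_TRYLOCK_FAILED,	/* hugetlb_vma_trylock_write() failed */
	CASE_PMD_UNSHARED,	/* huge_pmd_unshare() unmapped the PMD table */
	CASE_PTE_CLEARED,	/* ordinary path: the huge pte was cleared */
};

/*
 * Model of the (return value, *walk_done) pairs in the quoted diff.
 * This is an illustration, not kernel code.
 */
static bool model_unmap_hugetlb_folio(enum hugetlb_case c, bool *walk_done)
{
	assert(walk_done != NULL);

	switch (c) {
	case CASE_TRYLOCK_FAILED:
		/* Previously "goto walk_abort": stop the walk, report failure. */
		*walk_done = true;
		return false;
	case CASE_PMD_UNSHARED:
		/* Previously "goto walk_done": stop the walk, report success. */
		*walk_done = true;
		return true;
	case CASE_PTE_CLEARED:
	default:
		/* Keep going: the caller continues with the cleared pteval. */
		*walk_done = false;
		return true;
	}
}
```

In this reading, the caller only needs the single "if (walk_done) goto
walk_done;" check, because the abort case is encoded as walk_done == true
together with a false return value.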




* [PATCH v2 3/9] mm/rmap: refactor some code around lazyfree folio unmapping
  2026-04-10 10:31 [PATCH v2 0/9] Optimize anonymous large folio unmapping Dev Jain
  2026-04-10 10:31 ` [PATCH v2 1/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one Dev Jain
  2026-04-10 10:31 ` [PATCH v2 2/9] mm/rmap: refactor hugetlb pte clearing " Dev Jain
@ 2026-04-10 10:31 ` Dev Jain
  2026-04-10 10:31 ` [PATCH v2 4/9] mm/memory: Batch set uffd-wp markers during zapping Dev Jain
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Dev Jain @ 2026-04-10 10:31 UTC (permalink / raw)
  To: akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, Dev Jain

For lazyfree folio unmapping, after clearing the ptes we must abort the
operation if the folio got dirtied or it has unexpected references.

Refactor this logic into a function which will return whether we need
to abort or not.

If we abort, we restore the ptes and bail out of try_to_unmap_one.
Otherwise adjust the rss stats of the mm and jump to a label.

Also rename that label from "discard" to "finish_unmap"; the former
is appropriate in the lazyfree context, but the code following the label
is executed for other successful unmap code paths too, so 'discard' does
not sound correct for them.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/rmap.c | 95 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 55 insertions(+), 40 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index a9c43e2f6e695..fa5d6599dedf0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1978,6 +1978,56 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
 }
 
+static inline bool can_unmap_lazyfree_folio_range(struct vm_area_struct *vma,
+		struct folio *folio, unsigned long address, pte_t *ptep,
+		pte_t pteval, unsigned long nr_pages)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int ref_count, map_count;
+
+	/*
+	 * Synchronize with gup_pte_range():
+	 * - clear PTE; barrier; read refcount
+	 * - inc refcount; barrier; read PTE
+	 */
+	smp_mb();
+
+	ref_count = folio_ref_count(folio);
+	map_count = folio_mapcount(folio);
+
+	/*
+	 * Order reads for page refcount and dirty flag
+	 * (see comments in __remove_mapping()).
+	 */
+	smp_rmb();
+
+	if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+		/*
+		 * redirtied either using the page table or a previously
+		 * obtained GUP reference.
+		 */
+		set_ptes(mm, address, ptep, pteval, nr_pages);
+		folio_set_swapbacked(folio);
+		return false;
+	}
+
+	if (ref_count != 1 + map_count) {
+		/*
+		 * Additional reference. Could be a GUP reference or any
+		 * speculative reference. GUP users must mark the folio
+		 * dirty if there was a modification. This folio cannot be
+		 * reclaimed right now either way, so act just like nothing
+		 * happened.
+		 * We'll come back here later and detect if the folio was
+		 * dirtied when the additional reference is gone.
+		 */
+		set_ptes(mm, address, ptep, pteval, nr_pages);
+		return false;
+	}
+
+	return true;
+}
+
 static inline bool unmap_hugetlb_folio(struct vm_area_struct *vma,
 		struct folio *folio, struct page_vma_mapped_walk *pvmw,
 		struct page *page, enum ttu_flags flags, pte_t *pteval,
@@ -2256,47 +2306,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
 			/* MADV_FREE page check */
 			if (!folio_test_swapbacked(folio)) {
-				int ref_count, map_count;
-
-				/*
-				 * Synchronize with gup_pte_range():
-				 * - clear PTE; barrier; read refcount
-				 * - inc refcount; barrier; read PTE
-				 */
-				smp_mb();
-
-				ref_count = folio_ref_count(folio);
-				map_count = folio_mapcount(folio);
-
-				/*
-				 * Order reads for page refcount and dirty flag
-				 * (see comments in __remove_mapping()).
-				 */
-				smp_rmb();
-
-				if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
-					/*
-					 * redirtied either using the page table or a previously
-					 * obtained GUP reference.
-					 */
-					set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
-					folio_set_swapbacked(folio);
+				if (!can_unmap_lazyfree_folio_range(vma, folio, address,
+				    pvmw.pte, pteval, nr_pages))
 					goto walk_abort;
-				} else if (ref_count != 1 + map_count) {
-					/*
-					 * Additional reference. Could be a GUP reference or any
-					 * speculative reference. GUP users must mark the folio
-					 * dirty if there was a modification. This folio cannot be
-					 * reclaimed right now either way, so act just like nothing
-					 * happened.
-					 * We'll come back here later and detect if the folio was
-					 * dirtied when the additional reference is gone.
-					 */
-					set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
-					goto walk_abort;
-				}
+
 				add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
-				goto discard;
+				goto finish_unmap;
 			}
 
 			if (folio_dup_swap(folio, subpage) < 0) {
@@ -2359,7 +2374,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 */
 			add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
 		}
-discard:
+finish_unmap:
 		if (unlikely(folio_test_hugetlb(folio))) {
 			hugetlb_remove_rmap(folio);
 		} else {
-- 
2.34.1
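
The decision made by can_unmap_lazyfree_folio_range() in the patch above can
be modelled in userspace as below. This is a sketch under stated
assumptions: struct lazyfree_folio_model and the model_ function are
hypothetical, the memory barriers (smp_mb()/smp_rmb()) are deliberately left
out, and "restoring the ptes" is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the folio state the check looks at. */
struct lazyfree_folio_model {
	int ref_count;		/* models folio_ref_count() */
	int map_count;		/* models folio_mapcount() */
	bool dirty;		/* models folio_test_dirty() */
	bool swapbacked;	/* models the swapbacked flag */
};

/*
 * Returns true if the lazyfree folio may be discarded. On false,
 * *ptes_restored is set to model the set_ptes() call that puts the
 * original entries back before the caller aborts the walk.
 */
static bool model_can_unmap_lazyfree(struct lazyfree_folio_model *folio,
				     bool vma_droppable, bool *ptes_restored)
{
	assert(folio != NULL && ptes_restored != NULL);
	*ptes_restored = false;

	if (folio->dirty && !vma_droppable) {
		/* Redirtied: restore the ptes and mark swapbacked again. */
		*ptes_restored = true;
		folio->swapbacked = true;
		return false;
	}

	if (folio->ref_count != 1 + folio->map_count) {
		/* Unexpected extra reference: restore the ptes, retry later. */
		*ptes_restored = true;
		return false;
	}

	return true;
}
```

Note that the two failure cases differ only in whether the folio is marked
swapbacked again; both restore the ptes before returning false.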




* [PATCH v2 4/9] mm/memory: Batch set uffd-wp markers during zapping
  2026-04-10 10:31 [PATCH v2 0/9] Optimize anonymous large folio unmapping Dev Jain
                   ` (2 preceding siblings ...)
  2026-04-10 10:31 ` [PATCH v2 3/9] mm/rmap: refactor some code around lazyfree folio unmapping Dev Jain
@ 2026-04-10 10:31 ` Dev Jain
  2026-04-14  5:46   ` Dev Jain
  2026-04-10 10:32 ` [PATCH v2 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs Dev Jain
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 18+ messages in thread
From: Dev Jain @ 2026-04-10 10:31 UTC (permalink / raw)
  To: akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, Dev Jain

In preparation for the next patch, enable batch setting of uffd-wp ptes.

The code paths passing nr > 1 to zap_install_uffd_wp_if_needed() produce
that nr through either folio_pte_batch or swap_pte_batch, guaranteeing that
all ptes are the same w.r.t belonging to the same type of VMA (anonymous
or non-anonymous, wp-armed or non-wp-armed), and all being marked with
uffd-wp or all being not marked.

Note that we will have to use set_pte_at() in a loop instead of set_ptes()
since the latter cannot handle present->non-present conversion for
nr_pages > 1.

Convert documentation of install_uffd_wp_ptes_if_needed to kerneldoc
format.

No functional change is intended.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/mm_inline.h | 32 ++++++++++++++++++++------------
 mm/memory.c               | 20 +-------------------
 mm/rmap.c                 |  2 +-
 3 files changed, 22 insertions(+), 32 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index a171070e15f05..20c34d14ad539 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -566,9 +566,17 @@ static inline pte_marker copy_pte_marker(
 	return dstm;
 }
 
-/*
- * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
- * replace a none pte.  NOTE!  This should only be called when *pte is already
+/**
+ * install_uffd_wp_ptes_if_needed - install uffd-wp marker on PTEs that map
+ *				    consecutive pages of the same large folio.
+ * @vma: The VMA the pages are mapped into.
+ * @addr: Address the first page of this batch is mapped at.
+ * @ptep: Page table pointer for the first entry of this batch.
+ * @pteval: old value of the entry pointed to by ptep.
+ * @nr_ptes: Number of entries to clear (batch size).
+ *
+ * If the ptes were wr-protected by uffd-wp in any form, arm special ptes to
+ * replace none ptes.  NOTE!  This should only be called when *pte is already
  * cleared so we will never accidentally replace something valuable.  Meanwhile
  * none pte also means we are not demoting the pte so tlb flushed is not needed.
  * E.g., when pte cleared the caller should have taken care of the tlb flush.
@@ -576,11 +584,11 @@ static inline pte_marker copy_pte_marker(
  * Must be called with pgtable lock held so that no thread will see the none
  * pte, and if they see it, they'll fault and serialize at the pgtable lock.
  *
- * Returns true if an uffd-wp pte was installed, false otherwise.
+ * Returns true if uffd-wp ptes were installed, false otherwise.
  */
 static inline bool
-pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
-			      pte_t *pte, pte_t pteval)
+install_uffd_wp_ptes_if_needed(struct vm_area_struct *vma, unsigned long addr,
+			      pte_t *pte, pte_t pteval, unsigned long nr_ptes)
 {
 	bool arm_uffd_pte = false;
 
@@ -610,13 +618,13 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
 	if (unlikely(pte_swp_uffd_wp_any(pteval)))
 		arm_uffd_pte = true;
 
-	if (unlikely(arm_uffd_pte)) {
-		set_pte_at(vma->vm_mm, addr, pte,
-			   make_pte_marker(PTE_MARKER_UFFD_WP));
-		return true;
-	}
+	if (likely(!arm_uffd_pte))
+		return false;
 
-	return false;
+	for (int i = 0; i < nr_ptes; ++i, ++pte, addr += PAGE_SIZE)
+		set_pte_at(vma->vm_mm, addr, pte, make_pte_marker(PTE_MARKER_UFFD_WP));
+
+	return true;
 }
 
 static inline bool vma_has_recency(const struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index ea65685711311..eef144fa293d4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1594,29 +1594,11 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
 			      unsigned long addr, pte_t *pte, int nr,
 			      struct zap_details *details, pte_t pteval)
 {
-	bool was_installed = false;
-
-	if (!uffd_supports_wp_marker())
-		return false;
-
-	/* Zap on anonymous always means dropping everything */
-	if (vma_is_anonymous(vma))
-		return false;
-
 	if (zap_drop_markers(details))
 		return false;
 
-	for (;;) {
-		/* the PFN in the PTE is irrelevant. */
-		if (pte_install_uffd_wp_if_needed(vma, addr, pte, pteval))
-			was_installed = true;
-		if (--nr == 0)
-			break;
-		pte++;
-		addr += PAGE_SIZE;
-	}
+	return install_uffd_wp_ptes_if_needed(vma, addr, pte, pteval, nr);
 
-	return was_installed;
 }
 
 static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
diff --git a/mm/rmap.c b/mm/rmap.c
index fa5d6599dedf0..20e1fb81c33fc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2263,7 +2263,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		 * we may want to replace a none pte with a marker pte if
 		 * it's file-backed, so we don't lose the tracking info.
 		 */
-		pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+		install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, 1);
 
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
-- 
2.34.1
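
As an illustration of the batching in the patch above, here is a small
userspace model: the "should we arm?" decision is made once for the whole
batch, and the marker is then written entry by entry. The MODEL_ constants
and the model_ function are hypothetical; page table entries are reduced to
plain integers, and the sketch uses a size_t loop index so that the index
and the count have the same signedness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PTE_NONE   0u	/* models a cleared (none) pte */
#define MODEL_PTE_MARKER 1u	/* models a PTE_MARKER_UFFD_WP marker */

/*
 * Install a marker on nr_ptes consecutive entries if the batch was
 * uffd-wp protected. Returns true if markers were installed.
 */
static bool model_install_markers(unsigned int *pte, bool arm, size_t nr_ptes)
{
	assert(pte != NULL);

	if (!arm)
		return false;

	for (size_t i = 0; i < nr_ptes; i++) {
		/* The entries are expected to be cleared already. */
		assert(pte[i] == MODEL_PTE_NONE);
		pte[i] = MODEL_PTE_MARKER;
	}

	return true;
}
```

Because the batch is produced by folio_pte_batch or swap_pte_batch, a single
decision per batch is sufficient according to the commit message; the model
takes that decision as the arm argument.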




* Re: [PATCH v2 4/9] mm/memory: Batch set uffd-wp markers during zapping
  2026-04-10 10:31 ` [PATCH v2 4/9] mm/memory: Batch set uffd-wp markers during zapping Dev Jain
@ 2026-04-14  5:46   ` Dev Jain
  0 siblings, 0 replies; 18+ messages in thread
From: Dev Jain @ 2026-04-14  5:46 UTC (permalink / raw)
  To: akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual



On 10/04/26 4:01 pm, Dev Jain wrote:
> In preparation for the next patch, enable batch setting of uffd-wp ptes.
> 
> The code paths passing nr > 1 to zap_install_uffd_wp_if_needed() produce
> that nr through either folio_pte_batch or swap_pte_batch, guaranteeing that
> all ptes are the same w.r.t belonging to the same type of VMA (anonymous
> or non-anonymous, wp-armed or non-wp-armed), and all being marked with
> uffd-wp or all being not marked.
> 
> Note that we will have to use set_pte_at() in a loop instead of set_ptes()
> since the latter cannot handle present->non-present conversion for
> nr_pages > 1.
> 
> Convert documentation of install_uffd_wp_ptes_if_needed to kerneldoc
> format.
> 
> No functional change is intended.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/mm_inline.h | 32 ++++++++++++++++++++------------
>  mm/memory.c               | 20 +-------------------
>  mm/rmap.c                 |  2 +-
>  3 files changed, 22 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index a171070e15f05..20c34d14ad539 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -566,9 +566,17 @@ static inline pte_marker copy_pte_marker(
>  	return dstm;
>  }
>  
> -/*
> - * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
> - * replace a none pte.  NOTE!  This should only be called when *pte is already
> +/**
> + * install_uffd_wp_ptes_if_needed - install uffd-wp marker on PTEs that map
> + *				    consecutive pages of the same large folio.
> + * @vma: The VMA the pages are mapped into.
> + * @addr: Address the first page of this batch is mapped at.
> + * @ptep: Page table pointer for the first entry of this batch.
> + * @pteval: old value of the entry pointed to by ptep.
> + * @nr_ptes: Number of entries to clear (batch size).
> + *
> + * If the ptes were wr-protected by uffd-wp in any form, arm special ptes to
> + * replace none ptes.  NOTE!  This should only be called when *pte is already
>   * cleared so we will never accidentally replace something valuable.  Meanwhile
>   * none pte also means we are not demoting the pte so tlb flushed is not needed.
>   * E.g., when pte cleared the caller should have taken care of the tlb flush.
> @@ -576,11 +584,11 @@ static inline pte_marker copy_pte_marker(
>   * Must be called with pgtable lock held so that no thread will see the none
>   * pte, and if they see it, they'll fault and serialize at the pgtable lock.
>   *
> - * Returns true if an uffd-wp pte was installed, false otherwise.
> + * Returns true if uffd-wp ptes were installed, false otherwise.
>   */
>  static inline bool
> -pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> -			      pte_t *pte, pte_t pteval)
> +install_uffd_wp_ptes_if_needed(struct vm_area_struct *vma, unsigned long addr,
> +			      pte_t *pte, pte_t pteval, unsigned long nr_ptes)

From Sashiko - kernel doc mismatch. I'll rename pte to ptep.

>  {
>  	bool arm_uffd_pte = false;
>  
> @@ -610,13 +618,13 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
>  	if (unlikely(pte_swp_uffd_wp_any(pteval)))
>  		arm_uffd_pte = true;
>  
> -	if (unlikely(arm_uffd_pte)) {
> -		set_pte_at(vma->vm_mm, addr, pte,
> -			   make_pte_marker(PTE_MARKER_UFFD_WP));
> -		return true;
> -	}
> +	if (likely(!arm_uffd_pte))
> +		return false;
>  
> -	return false;
> +	for (int i = 0; i < nr_ptes; ++i, ++pte, addr += PAGE_SIZE)
> +		set_pte_at(vma->vm_mm, addr, pte, make_pte_marker(PTE_MARKER_UFFD_WP));
> +
> +	return true;
>  }
>  
>  static inline bool vma_has_recency(const struct vm_area_struct *vma)
> diff --git a/mm/memory.c b/mm/memory.c
> index ea65685711311..eef144fa293d4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1594,29 +1594,11 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>  			      unsigned long addr, pte_t *pte, int nr,
>  			      struct zap_details *details, pte_t pteval)
>  {
> -	bool was_installed = false;
> -
> -	if (!uffd_supports_wp_marker())
> -		return false;
> -
> -	/* Zap on anonymous always means dropping everything */
> -	if (vma_is_anonymous(vma))
> -		return false;
> -
>  	if (zap_drop_markers(details))
>  		return false;
>  
> -	for (;;) {
> -		/* the PFN in the PTE is irrelevant. */
> -		if (pte_install_uffd_wp_if_needed(vma, addr, pte, pteval))
> -			was_installed = true;
> -		if (--nr == 0)
> -			break;
> -		pte++;
> -		addr += PAGE_SIZE;
> -	}
> +	return install_uffd_wp_ptes_if_needed(vma, addr, pte, pteval, nr);
>  
> -	return was_installed;
>  }
>  
>  static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fa5d6599dedf0..20e1fb81c33fc 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2263,7 +2263,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  		 * we may want to replace a none pte with a marker pte if
>  		 * it's file-backed, so we don't lose the tracking info.
>  		 */
> -		pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> +		install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, 1);
>  
>  		/* Update high watermark before we lower rss */
>  		update_hiwater_rss(mm);




* [PATCH v2 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs
  2026-04-10 10:31 [PATCH v2 0/9] Optimize anonymous large folio unmapping Dev Jain
                   ` (3 preceding siblings ...)
  2026-04-10 10:31 ` [PATCH v2 4/9] mm/memory: Batch set uffd-wp markers during zapping Dev Jain
@ 2026-04-10 10:32 ` Dev Jain
  2026-04-10 10:32 ` [PATCH v2 6/9] mm/swapfile: Add batched version of folio_dup_swap Dev Jain
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Dev Jain @ 2026-04-10 10:32 UTC (permalink / raw)
  To: akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, Dev Jain

Commit a67fe41e214f ("mm: rmap: support batched unmapping for file large folios")
extended batched unmapping to file folios. That also required making
the uffd-wp marker installation support batching, but that was left
out for the time being, and correctness was maintained by stopping
batching in case the VMA the folio belongs to is marked uffd-wp.

Now that we have a batched version called install_uffd_wp_ptes_if_needed,
simply call that. folio_unmap_pte_batch() ensures that the original state
of the ptes is either all uffd or all non-uffd, so we maintain
correctness.

If the uffd-wp bit is set, we have the following transitions of ptes
after unmapping:

1) anon folio: present -> uffd-wp swap
2) file folio: present -> uffd-wp marker

We must ensure that these ptes are not reprocessed by the while loop -
if the batch length is less than the number of pages in the folio, then
we must skip over this batch.

The page_vma_mapped_walk API ensures this - check_pte() will return true
only if any of [pvmw->pfn, pvmw->pfn + nr_pages) is mapped by the pte.
There is no pfn underlying either a uffd-wp swap pte or a uffd-wp marker
pte, so check_pte returns false and we keep skipping until we hit a
present entry, which is where we want to batch from next.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/rmap.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 20e1fb81c33fc..7a150edd96819 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1965,9 +1965,6 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 	if (pte_unused(pte))
 		return 1;

-	if (userfaultfd_wp(vma))
-		return 1;
-
 	/*
 	 * If unmap fails, we need to restore the ptes. To avoid accidentally
 	 * upgrading write permissions for ptes that were not originally
@@ -2263,7 +2260,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		 * we may want to replace a none pte with a marker pte if
 		 * it's file-backed, so we don't lose the tracking info.
 		 */
-		install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, 1);
+		install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, nr_pages);

 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
-- 
2.34.1
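
The skipping argument in the commit message above can be pictured with a
simplified userspace model. Assumptions: struct model_pte and both model_
functions are hypothetical, and the check is reduced to "present and pfn
inside the folio's range"; the real check_pte() handles more entry types
than this sketch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pte: non-present entries (swap, marker) carry no pfn. */
struct model_pte {
	bool present;
	unsigned long pfn;
};

/* True only if the entry maps a pfn in [folio_pfn, folio_pfn + nr_pages). */
static bool model_check_pte(const struct model_pte *pte,
			    unsigned long folio_pfn, unsigned long nr_pages)
{
	if (!pte->present)
		return false;
	return pte->pfn >= folio_pfn && pte->pfn - folio_pfn < nr_pages;
}

/*
 * Return the index of the first entry at or after start that maps the
 * folio, or nr_ptes if there is none. Entries already converted to a
 * uffd-wp swap pte or marker are skipped because they are not present.
 */
static size_t model_next_mapped(const struct model_pte *ptes, size_t nr_ptes,
				size_t start, unsigned long folio_pfn,
				unsigned long nr_pages)
{
	size_t i = start;

	assert(ptes != NULL);
	while (i < nr_ptes && !model_check_pte(&ptes[i], folio_pfn, nr_pages))
		i++;
	return i;
}
```

In this model, a batch that was just unmapped is never revisited: its
entries are no longer present, so the walk resumes at the next present
entry that still maps the folio.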


* [PATCH v2 6/9] mm/swapfile: Add batched version of folio_dup_swap
  2026-04-10 10:31 [PATCH v2 0/9] Optimize anonymous large folio unmapping Dev Jain
                   ` (4 preceding siblings ...)
  2026-04-10 10:32 ` [PATCH v2 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs Dev Jain
@ 2026-04-10 10:32 ` Dev Jain
  2026-04-10 10:32 ` [PATCH v2 7/9] mm/swapfile: Add batched version of folio_put_swap Dev Jain
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Dev Jain @ 2026-04-10 10:32 UTC (permalink / raw)
  To: akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, Dev Jain

Add folio_dup_swap_pages to handle a batch of consecutive pages. Note
that folio_dup_swap can already handle a subset of this: nr_pages == 1 and
nr_pages == folio_nr_pages(folio). Generalize this to any nr_pages.

Currently, callers pass subpage == NULL to operate on the entire folio
and subpage != NULL to operate on only that subpage, which is an awkward
interface. Remove this indirection: the caller invokes
folio_dup_swap_pages() if it wants to operate on a range of pages in the
folio (i.e. nr_pages may be anything from 1 to folio_nr_pages()), and
invokes folio_dup_swap() if it wants to operate on the entire folio.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/rmap.c     |  2 +-
 mm/shmem.c    |  2 +-
 mm/swap.h     | 12 ++++++++++--
 mm/swapfile.c | 20 ++++++++++++--------
 4 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 7a150edd96819..6412103fcd6cb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2311,7 +2311,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				goto finish_unmap;
 			}
 
-			if (folio_dup_swap(folio, subpage) < 0) {
+			if (folio_dup_swap_pages(folio, subpage, 1) < 0) {
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
diff --git a/mm/shmem.c b/mm/shmem.c
index 5aa43657886c3..3f9523c97b9ed 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1695,7 +1695,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 			spin_unlock(&shmem_swaplist_lock);
 		}
 
-		folio_dup_swap(folio, NULL);
+		folio_dup_swap(folio);
 		shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
 
 		BUG_ON(folio_mapped(folio));
diff --git a/mm/swap.h b/mm/swap.h
index a77016f2423b9..3c25f914e908b 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -206,7 +206,9 @@ extern int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp);
  * folio_put_swap(): does the opposite thing of folio_dup_swap().
  */
 int folio_alloc_swap(struct folio *folio);
-int folio_dup_swap(struct folio *folio, struct page *subpage);
+int folio_dup_swap(struct folio *folio);
+int folio_dup_swap_pages(struct folio *folio, struct page *page,
+			 unsigned long nr_pages);
 void folio_put_swap(struct folio *folio, struct page *subpage);
 
 /* For internal use */
@@ -390,7 +392,13 @@ static inline int folio_alloc_swap(struct folio *folio)
 	return -EINVAL;
 }
 
-static inline int folio_dup_swap(struct folio *folio, struct page *page)
+static inline int folio_dup_swap(struct folio *folio)
+{
+	return -EINVAL;
+}
+
+static inline int folio_dup_swap_pages(struct folio *folio, struct page *page,
+		unsigned long nr_pages)
 {
 	return -EINVAL;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ff315b752afd3..22be05a0bb200 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1740,9 +1740,10 @@ int folio_alloc_swap(struct folio *folio)
 }
 
 /**
- * folio_dup_swap() - Increase swap count of swap entries of a folio.
+ * folio_dup_swap_pages() - Increase swap count of swap entries of a folio.
  * @folio: folio with swap entries bounded.
- * @subpage: if not NULL, only increase the swap count of this subpage.
+ * @page: the first page in the folio to increase the swap count for.
+ * @nr_pages: the number of pages in the folio to increase the swap count for.
  *
  * Typically called when the folio is unmapped and have its swap entry to
  * take its place: Swap entries allocated to a folio has count == 0 and pinned
@@ -1756,23 +1757,26 @@ int folio_alloc_swap(struct folio *folio)
  * swap_put_entries_direct on its swap entry before this helper returns, or
  * the swap count may underflow.
  */
-int folio_dup_swap(struct folio *folio, struct page *subpage)
+int folio_dup_swap_pages(struct folio *folio, struct page *page,
+		unsigned long nr_pages)
 {
 	swp_entry_t entry = folio->swap;
-	unsigned long nr_pages = folio_nr_pages(folio);
 
 	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
 
-	if (subpage) {
-		entry.val += folio_page_idx(folio, subpage);
-		nr_pages = 1;
-	}
+	entry.val += folio_page_idx(folio, page);
 
 	return swap_dup_entries_cluster(swap_entry_to_info(entry),
 					swp_offset(entry), nr_pages);
 }
 
+int folio_dup_swap(struct folio *folio)
+{
+	return folio_dup_swap_pages(folio, folio_page(folio, 0),
+				    folio_nr_pages(folio));
+}
+
 /**
  * folio_put_swap() - Decrease swap count of swap entries of a folio.
  * @folio: folio with swap entries bounded, must be in swap cache and locked.
-- 
2.34.1




* [PATCH v2 7/9] mm/swapfile: Add batched version of folio_put_swap
  2026-04-10 10:31 [PATCH v2 0/9] Optimize anonymous large folio unmapping Dev Jain
                   ` (5 preceding siblings ...)
  2026-04-10 10:32 ` [PATCH v2 6/9] mm/swapfile: Add batched version of folio_dup_swap Dev Jain
@ 2026-04-10 10:32 ` Dev Jain
  2026-04-10 10:32 ` [PATCH v2 8/9] mm/rmap: Add batched version of folio_try_share_anon_rmap_pte Dev Jain
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Dev Jain @ 2026-04-10 10:32 UTC (permalink / raw)
  To: akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, Dev Jain

Add folio_put_swap_pages() to handle a batch of consecutive pages. Note
that folio_put_swap() can already handle two special cases of this:
nr_pages == 1 and nr_pages == folio_nr_pages(folio). Generalize this to
any nr_pages.

Currently, callers pass subpage == NULL to operate on the entire folio,
and subpage != NULL to operate on only that subpage, which is not a nice
interface. Remove this indirection: the caller invokes
folio_put_swap_pages() if it wants to operate on a range of pages in the
folio (i.e. nr_pages may be anything from 1 to folio_nr_pages()), and
invokes folio_put_swap() if it wants to operate on the entire folio.
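
As an illustration only, the interface split described above can be
modelled in userspace. This is a minimal sketch, not the kernel code:
struct model_folio, model_put_swap_pages() and model_put_swap() are
hypothetical names, and a plain per-page counter stands in for the swap
count that the real helpers adjust through the swap cluster code.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_FOLIO_PAGES 16

/* Hypothetical stand-in for a large folio: one swap count per page. */
struct model_folio {
	size_t nr_pages;
	int swap_count[MODEL_FOLIO_PAGES];
};

/* Range variant: drop one reference on pages [first, first + nr_pages). */
static void model_put_swap_pages(struct model_folio *folio, size_t first,
				 size_t nr_pages)
{
	assert(nr_pages >= 1 && first + nr_pages <= folio->nr_pages);
	for (size_t i = first; i < first + nr_pages; i++)
		folio->swap_count[i]--;
}

/* Whole-folio variant: a thin wrapper, so no NULL sentinel is needed. */
static void model_put_swap(struct model_folio *folio)
{
	model_put_swap_pages(folio, 0, folio->nr_pages);
}
```

With this shape the caller states its intent through the function it
picks, rather than through a NULL or non-NULL page argument.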

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/memory.c   |  6 +++---
 mm/rmap.c     |  4 ++--
 mm/shmem.c    |  6 +++---
 mm/swap.h     | 11 +++++++++--
 mm/swapfile.c | 22 +++++++++++++---------
 5 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eef144fa293d4..d6da01867baf9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5088,7 +5088,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(folio != swapcache)) {
 		folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
-		folio_put_swap(swapcache, NULL);
+		folio_put_swap(swapcache);
 	} else if (!folio_test_anon(folio)) {
 		/*
 		 * We currently only expect !anon folios that are fully
@@ -5097,12 +5097,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
 		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
-		folio_put_swap(folio, NULL);
+		folio_put_swap(folio);
 	} else {
 		VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
 		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
 					 rmap_flags);
-		folio_put_swap(folio, nr_pages == 1 ? page : NULL);
+		folio_put_swap_pages(folio, page, nr_pages);
 	}
 
 	VM_BUG_ON(!folio_test_anon(folio) ||
diff --git a/mm/rmap.c b/mm/rmap.c
index 6412103fcd6cb..9b20ef7f211e1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2322,7 +2322,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * so we'll not check/care.
 			 */
 			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-				folio_put_swap(folio, subpage);
+				folio_put_swap_pages(folio, subpage, 1);
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
@@ -2330,7 +2330,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			/* See folio_try_share_anon_rmap(): clear PTE first. */
 			if (anon_exclusive &&
 			    folio_try_share_anon_rmap_pte(folio, subpage)) {
-				folio_put_swap(folio, subpage);
+				folio_put_swap_pages(folio, subpage, 1);
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
diff --git a/mm/shmem.c b/mm/shmem.c
index 3f9523c97b9ed..f49bf07e806a7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1716,7 +1716,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 		/* Swap entry might be erased by racing shmem_free_swap() */
 		if (!error) {
 			shmem_recalc_inode(inode, 0, -nr_pages);
-			folio_put_swap(folio, NULL);
+			folio_put_swap(folio);
 		}
 
 		/*
@@ -2196,7 +2196,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 
 	nr_pages = folio_nr_pages(folio);
 	folio_wait_writeback(folio);
-	folio_put_swap(folio, NULL);
+	folio_put_swap(folio);
 	swap_cache_del_folio(folio);
 	/*
 	 * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
@@ -2426,7 +2426,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (sgp == SGP_WRITE)
 		folio_mark_accessed(folio);
 
-	folio_put_swap(folio, NULL);
+	folio_put_swap(folio);
 	swap_cache_del_folio(folio);
 	folio_mark_dirty(folio);
 	put_swap_device(si);
diff --git a/mm/swap.h b/mm/swap.h
index 3c25f914e908b..343547469927a 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -209,7 +209,9 @@ int folio_alloc_swap(struct folio *folio);
 int folio_dup_swap(struct folio *folio);
 int folio_dup_swap_pages(struct folio *folio, struct page *page,
 			 unsigned long nr_pages);
-void folio_put_swap(struct folio *folio, struct page *subpage);
+void folio_put_swap(struct folio *folio);
+void folio_put_swap_pages(struct folio *folio, struct page *page,
+			  unsigned long nr_pages);
 
 /* For internal use */
 extern void __swap_cluster_free_entries(struct swap_info_struct *si,
@@ -403,7 +405,12 @@ static inline int folio_dup_swap_pages(struct folio *folio, struct page *page,
 	return -EINVAL;
 }
 
-static inline void folio_put_swap(struct folio *folio, struct page *page)
+static inline void folio_put_swap(struct folio *folio)
+{
+}
+
+static inline void folio_put_swap_pages(struct folio *folio, struct page *page,
+				  unsigned long nr_pages)
 {
 }
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 22be05a0bb200..d8fae3925e171 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1778,31 +1778,34 @@ int folio_dup_swap(struct folio *folio)
 }
 
 /**
- * folio_put_swap() - Decrease swap count of swap entries of a folio.
+ * folio_put_swap_pages() - Decrease swap count of swap entries of a folio.
  * @folio: folio with swap entries bounded, must be in swap cache and locked.
- * @subpage: if not NULL, only decrease the swap count of this subpage.
+ * @page: the first page in the folio to decrease the swap count for.
+ * @nr_pages: the number of pages in the folio to decrease the swap count for.
  *
  * This won't free the swap slots even if swap count drops to zero, they are
  * still pinned by the swap cache. User may call folio_free_swap to free them.
  * Context: Caller must ensure the folio is locked and in the swap cache.
  */
-void folio_put_swap(struct folio *folio, struct page *subpage)
+void folio_put_swap_pages(struct folio *folio, struct page *page,
+			  unsigned long nr_pages)
 {
 	swp_entry_t entry = folio->swap;
-	unsigned long nr_pages = folio_nr_pages(folio);
 	struct swap_info_struct *si = __swap_entry_to_info(entry);
 
 	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
 
-	if (subpage) {
-		entry.val += folio_page_idx(folio, subpage);
-		nr_pages = 1;
-	}
+	entry.val += folio_page_idx(folio, page);
 
 	swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false);
 }
 
+void folio_put_swap(struct folio *folio)
+{
+	folio_put_swap_pages(folio, folio_page(folio, 0), folio_nr_pages(folio));
+}
+
 /*
  * When we get a swap entry, if there aren't some other ways to
  * prevent swapoff, such as the folio in swap cache is locked, RCU
@@ -2443,7 +2446,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		new_pte = pte_mkuffd_wp(new_pte);
 setpte:
 	set_pte_at(vma->vm_mm, addr, pte, new_pte);
-	folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry)));
+	folio_put_swap_pages(swapcache,
+			     folio_file_page(swapcache, swp_offset(entry)), 1);
 out:
 	if (pte)
 		pte_unmap_unlock(pte, ptl);
-- 
2.34.1




* [PATCH v2 8/9] mm/rmap: Add batched version of folio_try_share_anon_rmap_pte
  2026-04-10 10:31 [PATCH v2 0/9] Optimize anonymous large folio unmapping Dev Jain
                   ` (6 preceding siblings ...)
  2026-04-10 10:32 ` [PATCH v2 7/9] mm/swapfile: Add batched version of folio_put_swap Dev Jain
@ 2026-04-10 10:32 ` Dev Jain
  2026-04-10 10:32 ` [PATCH v2 9/9] mm/rmap: enable batch unmapping of anonymous folios Dev Jain
  2026-04-10 13:53 ` [PATCH v2 0/9] Optimize anonymous large folio unmapping Lorenzo Stoakes
  9 siblings, 0 replies; 18+ messages in thread
From: Dev Jain @ 2026-04-10 10:32 UTC (permalink / raw)
  To: akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, Dev Jain

To enable batched unmapping of anonymous folios, we need to handle the
sharing of exclusive pages. Hence, a batched version of
folio_try_share_anon_rmap_pte() is required.

Currently, the sole purpose of nr_pages in __folio_try_share_anon_rmap()
is to do some rmap sanity checks. Add a helper to clear the
PageAnonExclusive bit on a batch of nr_pages. Note that
__folio_try_share_anon_rmap() can receive nr_pages == HPAGE_PMD_NR from
the PMD path, but currently we only clear the bit on the head page. Retain
this behaviour by setting nr_pages = 1 in case the caller is
folio_try_share_anon_rmap_pmd().

While at it, convert nr_pages to unsigned long to future-proof against
overflow in case P4D-huge mappings etc. get supported down the road.
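
For illustration, the batched clearing and the PMD special case can be
modelled in userspace as below. This is a sketch under stated
assumptions: a bool array stands in for the per-page PageAnonExclusive
bit, the names model_clear_exclusive(), model_try_share() and
enum model_level are hypothetical, and the pin checks and memory barriers
of the real function are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_level { MODEL_LEVEL_PTE, MODEL_LEVEL_PMD };

/* Same loop shape as the helper in the patch: nr_pages must be >= 1. */
static void model_clear_exclusive(bool *exclusive, size_t nr_pages)
{
	assert(nr_pages >= 1);
	for (;;) {
		*exclusive = false;
		if (--nr_pages == 0)
			break;
		++exclusive;
	}
}

/* Clear the bit on the batch; the PMD path only touches the head page. */
static void model_try_share(bool *exclusive, size_t nr_pages,
			    enum model_level level)
{
	if (level == MODEL_LEVEL_PMD)
		nr_pages = 1;
	model_clear_exclusive(exclusive, nr_pages);
}
```

A PTE-level call with nr_pages == 3 clears three consecutive bits, while
a PMD-level call clears only the first one regardless of nr_pages.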

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/mm.h   | 11 +++++++++++
 include/linux/rmap.h | 27 ++++++++++++++++++++-------
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 633bbf9a184a6..2d20954da652a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -243,6 +243,17 @@ static inline unsigned long folio_page_idx(const struct folio *folio,
 	return page - &folio->page;
 }
 
+static __always_inline void folio_clear_pages_anon_exclusive(struct page *page,
+		unsigned long nr_pages)
+{
+	for (;;) {
+		ClearPageAnonExclusive(page);
+		if (--nr_pages == 0)
+			break;
+		++page;
+	}
+}
+
 static inline struct folio *lru_to_folio(struct list_head *head)
 {
 	return list_entry((head)->prev, struct folio, lru);
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8dc0871e5f001..f3b3ee3955afc 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -706,15 +706,19 @@ static inline int folio_try_dup_anon_rmap_pmd(struct folio *folio,
 }
 
 static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
-		struct page *page, int nr_pages, enum pgtable_level level)
+		struct page *page, unsigned long nr_pages, enum pgtable_level level)
 {
 	VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
 	VM_WARN_ON_FOLIO(!PageAnonExclusive(page), folio);
 	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
 
+	/* We only clear anon-exclusive from head page of PMD folio */
+	if (level == PGTABLE_LEVEL_PMD)
+		nr_pages = 1;
+
 	/* device private folios cannot get pinned via GUP. */
 	if (unlikely(folio_is_device_private(folio))) {
-		ClearPageAnonExclusive(page);
+		folio_clear_pages_anon_exclusive(page, nr_pages);
 		return 0;
 	}
 
@@ -766,7 +770,7 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
 
 	if (unlikely(folio_maybe_dma_pinned(folio)))
 		return -EBUSY;
-	ClearPageAnonExclusive(page);
+	folio_clear_pages_anon_exclusive(page, nr_pages);
 
 	/*
 	 * This is conceptually a smp_wmb() paired with the smp_rmb() in
@@ -778,11 +782,12 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
 }
 
 /**
- * folio_try_share_anon_rmap_pte - try marking an exclusive anonymous page
- *				   mapped by a PTE possibly shared to prepare
+ * folio_try_share_anon_rmap_ptes - try marking exclusive anonymous pages
+ *				   mapped by PTEs possibly shared to prepare
  *				   for KSM or temporary unmapping
  * @folio:	The folio to share a mapping of
- * @page:	The mapped exclusive page
+ * @page:	The first mapped exclusive page of the batch in the folio
+ * @nr_pages:	The number of pages to share in the folio (batch size)
  *
  * The caller needs to hold the page table lock and has to have the page table
  * entries cleared/invalidated.
@@ -797,11 +802,19 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
  *
  * Returns 0 if marking the mapped page possibly shared succeeded. Returns
  * -EBUSY otherwise.
+ *
+ * The caller needs to hold the page table lock.
  */
+static inline int folio_try_share_anon_rmap_ptes(struct folio *folio,
+		struct page *page, unsigned long nr_pages)
+{
+	return __folio_try_share_anon_rmap(folio, page, nr_pages, PGTABLE_LEVEL_PTE);
+}
+
 static inline int folio_try_share_anon_rmap_pte(struct folio *folio,
 		struct page *page)
 {
-	return __folio_try_share_anon_rmap(folio, page, 1, PGTABLE_LEVEL_PTE);
+	return folio_try_share_anon_rmap_ptes(folio, page, 1);
 }
 
 /**
-- 
2.34.1




* [PATCH v2 9/9] mm/rmap: enable batch unmapping of anonymous folios
  2026-04-10 10:31 [PATCH v2 0/9] Optimize anonymous large folio unmapping Dev Jain
                   ` (7 preceding siblings ...)
  2026-04-10 10:32 ` [PATCH v2 8/9] mm/rmap: Add batched version of folio_try_share_anon_rmap_pte Dev Jain
@ 2026-04-10 10:32 ` Dev Jain
  2026-04-10 13:53 ` [PATCH v2 0/9] Optimize anonymous large folio unmapping Lorenzo Stoakes
  9 siblings, 0 replies; 18+ messages in thread
From: Dev Jain @ 2026-04-10 10:32 UTC (permalink / raw)
  To: akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, Dev Jain

Enable batch clearing of ptes, and batch setting of swap ptes, for anon
folio unmapping.

Processing all ptes of a large folio in one go lets us batch atomics
(add_mm_counter() etc.), barriers (in __folio_try_share_anon_rmap()) and
repeated calls to page_vma_mapped_walk(), to name a few. In general,
batching lets us execute similar code together, making the execution of
the program more memory and CPU friendly.

On arm64-contpte, batching also helps us avoid redundant ptep_get() calls
and TLB flushes while breaking the contpte mapping.

The handling of anon-exclusivity is very similar to commit cac1db8c3aad
("mm: optimize mprotect() by PTE batching"). Since folio_unmap_pte_batch()
won't look at the bits of the underlying page, we need to process
sub-batches of ptes pointing to pages which agree w.r.t. exclusivity, and
batch set only those ptes to swap ptes in one go. Hence move
page_anon_exclusive_sub_batch() to internal.h and reuse it.
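
To make the sub-batching concrete, here is a small userspace sketch. It
assumes a bool array in place of the per-page PageAnonExclusive bit;
model_sub_batch() copies the loop of page_anon_exclusive_sub_batch() from
this patch, while model_split_runs() is a hypothetical driver that only
records the run lengths instead of unmapping anything.

```c
#include <assert.h>
#include <stdbool.h>

/* Length of the run starting at start_idx whose pages match 'expected'. */
static int model_sub_batch(int start_idx, int max_len,
			   const bool *exclusive, bool expected)
{
	int idx;

	assert(max_len >= 1);
	for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
		if (expected != exclusive[idx])
			break;
	}
	return idx - start_idx;
}

/* Split nr_pages into runs of equal exclusivity; returns the run count. */
static int model_split_runs(const bool *exclusive, int nr_pages, int *lens)
{
	int start = 0, nr_runs = 0;

	assert(nr_pages >= 1);
	while (nr_pages > 0) {
		int len = model_sub_batch(start, nr_pages, exclusive,
					  exclusive[start]);

		lens[nr_runs++] = len;
		start += len;
		nr_pages -= len;
	}
	return nr_runs;
}
```

For a batch whose pages are exclusive, exclusive, shared, shared, shared,
exclusive, this yields three sub-batches of lengths 2, 3 and 1, and each
of them can be converted to swap ptes with a single exclusivity setting.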

arch_unmap_one() is only defined for sparc64; I am not comfortable with
the nuances of retrieving the pfn from pte_pfn() versus from
(paddr = pte_val(oldpte) & _PAGE_PADDR_4V).

Also, pte_next_pfn() can't even be called from arch_unmap_one() because
that file does not include pgtable.h. So just disable batching for the
"sparc64-anon-swapbacked" case for now.

We need to take care of rmap accounting (folio_remove_rmap_ptes) and
reference accounting (folio_put_refs) when anon folio unmap succeeds.
In case we partially batch the large folio and fail, we need to correctly
do the accounting for pages which were successfully unmapped. So, put
this accounting code in __unmap_anon_folio_range() itself, instead of
doing some horrible goto jumping at the callsite of
unmap_anon_folio_range().
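
The effect of doing the accounting per sub-batch can be sketched in
userspace as follows. This is only a model: struct model_counters and the
model_unmap_*() functions are hypothetical, the counters stand in for
MM_ANONPAGES, MM_SWAPENTS and the folio refcount, and restoring the ptes
of the failed sub-batch is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

struct model_counters {
	long anon_pages;	/* stands in for MM_ANONPAGES */
	long swap_ents;		/* stands in for MM_SWAPENTS */
	int folio_refs;		/* stands in for the folio refcount */
};

/* Unmap one sub-batch; accounting happens only if the step succeeds. */
static bool model_unmap_sub_batch(struct model_counters *c, int len,
				  bool fail)
{
	assert(len >= 1);
	if (fail)
		return false;
	c->anon_pages -= len;
	c->swap_ents += len;
	c->folio_refs -= len;
	return true;
}

/* Walk the sub-batches in order and stop at the first failure. */
static bool model_unmap_range(struct model_counters *c, const int *lens,
			      int nr_runs, int fail_run)
{
	for (int i = 0; i < nr_runs; i++) {
		if (!model_unmap_sub_batch(c, lens[i], i == fail_run))
			return false;
	}
	return true;
}
```

If the third of three sub-batches fails, the first two remain fully
accounted for, and the caller only has to abort the walk.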

Add a comment at relevant places to say that we are on a device-exclusive
entry and not a present entry.

If the batch length is less than the number of pages in the folio, then
the walk must skip over the ptes of this batch, which are now swap ptes.

The page_vma_mapped_walk API ensures this: check_pte() will return true
only if any of [pvmw->pfn, pvmw->pfn + nr_pages) is mapped by the pte.
There is no pfn underlying a swap pte, so check_pte() returns false and we
keep skipping until we hit a present pte, which is where we want to start
unmapping from next.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/internal.h |  26 +++++++
 mm/mprotect.c |  17 -----
 mm/rmap.c     | 188 ++++++++++++++++++++++++++++++++++----------------
 3 files changed, 153 insertions(+), 78 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c693646e5b3f0..f7033c9626767 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -393,6 +393,32 @@ static inline unsigned int folio_pte_batch_flags(struct folio *folio,
 unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
 		unsigned int max_nr);
 
+/**
+ * page_anon_exclusive_sub_batch - Determine length of consecutive exclusive
+ * or maybe shared pages
+ * @start_idx: Starting index of the page array to scan from
+ * @max_len: Maximum length to look at
+ * @first_page: First page of the page array
+ * @expected_anon_exclusive: Whether to look for exclusive or !exclusive pages
+ *
+ * Determines length of consecutive ptes, pointing to pages being the same
+ * w.r.t the PageAnonExclusive bit.
+ *
+ * Context: The ptes point to consecutive pages of the same large folio. The
+ * ptes belong to the same PMD and VMA.
+ */
+static __always_inline int page_anon_exclusive_sub_batch(int start_idx, int max_len,
+		struct page *first_page, bool expected_anon_exclusive)
+{
+	int idx;
+
+	for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
+		if (expected_anon_exclusive != PageAnonExclusive(first_page + idx))
+			break;
+	}
+	return idx - start_idx;
+}
+
 /**
  * pte_move_swp_offset - Move the swap entry offset field of a swap pte
  *	 forward or backward by delta
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9cbf932b028cf..949fd7022b5cf 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -138,23 +138,6 @@ static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma,
 		tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
 }
 
-/*
- * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or
- * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
- * that the ptes point to consecutive pages of the same anon large folio.
- */
-static __always_inline int page_anon_exclusive_sub_batch(int start_idx, int max_len,
-		struct page *first_page, bool expected_anon_exclusive)
-{
-	int idx;
-
-	for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
-		if (expected_anon_exclusive != PageAnonExclusive(first_page + idx))
-			break;
-	}
-	return idx - start_idx;
-}
-
 /*
  * This function is a result of trying our very best to retain the
  * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
diff --git a/mm/rmap.c b/mm/rmap.c
index 9b20ef7f211e1..ca071641965bd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1958,11 +1958,11 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 	end_addr = pmd_addr_end(addr, vma->vm_end);
 	max_nr = (end_addr - addr) >> PAGE_SHIFT;
 
-	/* We only support lazyfree or file folios batching for now ... */
-	if (folio_test_anon(folio) && folio_test_swapbacked(folio))
+	if (pte_unused(pte))
 		return 1;
 
-	if (pte_unused(pte))
+	if (__is_defined(__HAVE_ARCH_UNMAP_ONE) && folio_test_anon(folio) &&
+	    folio_test_swapbacked(folio))
 		return 1;
 
 	/*
@@ -1975,6 +1975,122 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
 }
 
+static inline void set_swp_ptes(struct mm_struct *mm, unsigned long address,
+		pte_t *ptep, swp_entry_t entry, pte_t pteval, bool anon_exclusive,
+		unsigned long nr_pages)
+{
+	pte_t swp_pte = swp_entry_to_pte(entry);
+
+	if (anon_exclusive)
+		swp_pte = pte_swp_mkexclusive(swp_pte);
+
+	if (likely(pte_present(pteval))) {
+		if (pte_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		if (pte_uffd_wp(pteval))
+			swp_pte = pte_swp_mkuffd_wp(swp_pte);
+	} else {
+		/* Device-exclusive entry */
+		VM_WARN_ON(nr_pages != 1);
+		if (pte_swp_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		if (pte_swp_uffd_wp(pteval))
+			swp_pte = pte_swp_mkuffd_wp(swp_pte);
+	}
+
+	for (int i = 0; i < nr_pages; ++i, ++ptep, address += PAGE_SIZE) {
+		set_pte_at(mm, address, ptep, swp_pte);
+		swp_pte = pte_next_swp_offset(swp_pte);
+	}
+}
+
+static inline void finish_folio_unmap(struct vm_area_struct *vma,
+		struct folio *folio, struct page *subpage, unsigned long nr_pages)
+{
+	if (unlikely(folio_test_hugetlb(folio)))
+		hugetlb_remove_rmap(folio);
+	else
+		folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
+	if (vma->vm_flags & VM_LOCKED)
+		mlock_drain_local();
+	folio_put_refs(folio, nr_pages);
+}
+
+static inline bool __unmap_anon_folio_range(struct vm_area_struct *vma, struct folio *folio,
+		struct page *subpage, unsigned long address, pte_t *ptep,
+		pte_t pteval, unsigned long nr_pages, bool anon_exclusive)
+{
+	swp_entry_t entry = page_swap_entry(subpage);
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (folio_dup_swap_pages(folio, subpage, nr_pages) < 0) {
+		set_ptes(mm, address, ptep, pteval, nr_pages);
+		return false;
+	}
+
+	/*
+	 * arch_unmap_one() is expected to be a NOP on
+	 * architectures where we could have PFN swap PTEs,
+	 * so we'll not check/care.
+	 */
+	if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+		VM_WARN_ON(nr_pages != 1);
+		folio_put_swap_pages(folio, subpage, nr_pages);
+		set_pte_at(mm, address, ptep, pteval);
+		return false;
+	}
+
+	/* See folio_try_share_anon_rmap(): clear PTE first. */
+	if (anon_exclusive && folio_try_share_anon_rmap_ptes(folio, subpage, nr_pages)) {
+		folio_put_swap_pages(folio, subpage, nr_pages);
+		set_ptes(mm, address, ptep, pteval, nr_pages);
+		return false;
+	}
+
+	if (list_empty(&mm->mmlist)) {
+		spin_lock(&mmlist_lock);
+		if (list_empty(&mm->mmlist))
+			list_add(&mm->mmlist, &init_mm.mmlist);
+		spin_unlock(&mmlist_lock);
+	}
+
+	add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
+	add_mm_counter(mm, MM_SWAPENTS, nr_pages);
+	set_swp_ptes(mm, address, ptep, entry, pteval, anon_exclusive, nr_pages);
+	finish_folio_unmap(vma, folio, subpage, nr_pages);
+	return true;
+}
+
+static inline bool unmap_anon_folio_range(struct vm_area_struct *vma, struct folio *folio,
+		struct page *first_page, unsigned long address, pte_t *ptep,
+		pte_t pteval, unsigned long nr_pages)
+{
+	bool expected_anon_exclusive;
+	int sub_batch_idx = 0;
+	int len, ret;
+
+	for (;;) {
+		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
+		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_pages,
+						    first_page, expected_anon_exclusive);
+		ret = __unmap_anon_folio_range(vma, folio, first_page + sub_batch_idx,
+					       address, ptep, pteval, len, expected_anon_exclusive);
+		if (!ret)
+			return ret;
+
+		nr_pages -= len;
+		if (!nr_pages)
+			break;
+
+		pteval = pte_advance_pfn(pteval, len);
+		address += len * PAGE_SIZE;
+		sub_batch_idx += len;
+		ptep += len;
+	}
+
+	return true;
+}
+
 static inline bool can_unmap_lazyfree_folio_range(struct vm_area_struct *vma,
 		struct folio *folio, unsigned long address, pte_t *ptep,
 		pte_t pteval, unsigned long nr_pages)
@@ -2094,7 +2210,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
-	bool anon_exclusive, ret = true;
+	bool ret = true;
 	pte_t pteval;
 	struct page *subpage;
 	struct mmu_notifier_range range;
@@ -2219,8 +2335,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
 		subpage = folio_page(folio, pfn - folio_pfn(folio));
 		address = pvmw.address;
-		anon_exclusive = folio_test_anon(folio) &&
-				 PageAnonExclusive(subpage);
 
 		if (folio_test_hugetlb(folio)) {
 			bool walk_done;
@@ -2252,6 +2366,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			if (pte_dirty(pteval))
 				folio_mark_dirty(folio);
 		} else {
+			/* Device-exclusive entry */
 			pte_clear(mm, address, pvmw.pte);
 		}
 
@@ -2289,8 +2404,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 */
 			dec_mm_counter(mm, mm_counter(folio));
 		} else if (folio_test_anon(folio)) {
-			swp_entry_t entry = page_swap_entry(subpage);
-			pte_t swp_pte;
 			/*
 			 * Store the swap location in the pte.
 			 * See handle_pte_fault() ...
@@ -2306,57 +2419,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				if (!can_unmap_lazyfree_folio_range(vma, folio, address,
 				    pvmw.pte, pteval, nr_pages))
 					goto walk_abort;
-
 				add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
 				goto finish_unmap;
 			}
 
-			if (folio_dup_swap_pages(folio, subpage, 1) < 0) {
-				set_pte_at(mm, address, pvmw.pte, pteval);
+			if (!unmap_anon_folio_range(vma, folio, subpage, address,
+						    pvmw.pte, pteval, nr_pages))
 				goto walk_abort;
-			}
 
-			/*
-			 * arch_unmap_one() is expected to be a NOP on
-			 * architectures where we could have PFN swap PTEs,
-			 * so we'll not check/care.
-			 */
-			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-				folio_put_swap_pages(folio, subpage, 1);
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				goto walk_abort;
-			}
-
-			/* See folio_try_share_anon_rmap(): clear PTE first. */
-			if (anon_exclusive &&
-			    folio_try_share_anon_rmap_pte(folio, subpage)) {
-				folio_put_swap_pages(folio, subpage, 1);
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				goto walk_abort;
-			}
-			if (list_empty(&mm->mmlist)) {
-				spin_lock(&mmlist_lock);
-				if (list_empty(&mm->mmlist))
-					list_add(&mm->mmlist, &init_mm.mmlist);
-				spin_unlock(&mmlist_lock);
-			}
-			dec_mm_counter(mm, MM_ANONPAGES);
-			inc_mm_counter(mm, MM_SWAPENTS);
-			swp_pte = swp_entry_to_pte(entry);
-			if (anon_exclusive)
-				swp_pte = pte_swp_mkexclusive(swp_pte);
-			if (likely(pte_present(pteval))) {
-				if (pte_soft_dirty(pteval))
-					swp_pte = pte_swp_mksoft_dirty(swp_pte);
-				if (pte_uffd_wp(pteval))
-					swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			} else {
-				if (pte_swp_soft_dirty(pteval))
-					swp_pte = pte_swp_mksoft_dirty(swp_pte);
-				if (pte_swp_uffd_wp(pteval))
-					swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			}
-			set_pte_at(mm, address, pvmw.pte, swp_pte);
+			if (nr_pages == folio_nr_pages(folio))
+				goto walk_done;
+			continue;
 		} else {
 			/*
 			 * This is a locked file-backed folio,
@@ -2372,14 +2445,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
 		}
 finish_unmap:
-		if (unlikely(folio_test_hugetlb(folio))) {
-			hugetlb_remove_rmap(folio);
-		} else {
-			folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
-		}
-		if (vma->vm_flags & VM_LOCKED)
-			mlock_drain_local();
-		folio_put_refs(folio, nr_pages);
+		finish_folio_unmap(vma, folio, subpage, nr_pages);
 
 		/*
 		 * If we are sure that we batched the entire folio and cleared
-- 
2.34.1




* Re: [PATCH v2 0/9] Optimize anonymous large folio unmapping
  2026-04-10 10:31 [PATCH v2 0/9] Optimize anonymous large folio unmapping Dev Jain
                   ` (8 preceding siblings ...)
  2026-04-10 10:32 ` [PATCH v2 9/9] mm/rmap: enable batch unmapping of anonymous folios Dev Jain
@ 2026-04-10 13:53 ` Lorenzo Stoakes
  9 siblings, 0 replies; 18+ messages in thread
From: Lorenzo Stoakes @ 2026-04-10 13:53 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, hughd, chrisl, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, kasong, qi.zheng, shakeel.butt, baohua, axelrasmussen,
	yuanchu, weixugc, riel, harry, jannh, pfalcato, baolin.wang,
	shikemeng, nphamcs, bhe, youngjun.park, linux-mm, linux-kernel,
	ryan.roberts, anshuman.khandual

Obviously this one is for v7.2, we're in a quiet period now so this may or may
not get review attention before v7.1-rc1, at which point please resend the
series rebased appropriately :)

Thanks, Lorenzo



end of thread, other threads:[~2026-04-14  5:47 UTC | newest]

Thread overview: 18+ messages
2026-04-10 10:31 [PATCH v2 0/9] Optimize anonymous large folio unmapping Dev Jain
2026-04-10 10:31 ` [PATCH v2 1/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one Dev Jain
2026-04-11  1:02   ` Barry Song
2026-04-10 10:31 ` [PATCH v2 2/9] mm/rmap: refactor hugetlb pte clearing " Dev Jain
2026-04-11  8:55   ` Barry Song
2026-04-11 16:05     ` Dev Jain
2026-04-11 16:24       ` Dev Jain
2026-04-11 11:45   ` Jie Gan
2026-04-11 16:08     ` Dev Jain
2026-04-10 10:31 ` [PATCH v2 3/9] mm/rmap: refactor some code around lazyfree folio unmapping Dev Jain
2026-04-10 10:31 ` [PATCH v2 4/9] mm/memory: Batch set uffd-wp markers during zapping Dev Jain
2026-04-14  5:46   ` Dev Jain
2026-04-10 10:32 ` [PATCH v2 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs Dev Jain
2026-04-10 10:32 ` [PATCH v2 6/9] mm/swapfile: Add batched version of folio_dup_swap Dev Jain
2026-04-10 10:32 ` [PATCH v2 7/9] mm/swapfile: Add batched version of folio_put_swap Dev Jain
2026-04-10 10:32 ` [PATCH v2 8/9] mm/rmap: Add batched version of folio_try_share_anon_rmap_pte Dev Jain
2026-04-10 10:32 ` [PATCH v2 9/9] mm/rmap: enable batch unmapping of anonymous folios Dev Jain
2026-04-10 13:53 ` [PATCH v2 0/9] Optimize anonymous large folio unmapping Lorenzo Stoakes
