* [PATCH v2 0/9] Optimize anonymous large folio unmapping
@ 2026-04-10 10:31 Dev Jain
From: Dev Jain @ 2026-04-10 10:31 UTC
  To: akpm, david, hughd, chrisl
  Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, kasong, qi.zheng,
	shakeel.butt, baohua, axelrasmussen, yuanchu, weixugc, riel,
	harry, jannh, pfalcato, baolin.wang, shikemeng, nphamcs, bhe,
	youngjun.park, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, Dev Jain

Speed up unmapping of anonymous large folios by clearing the ptes, and
setting swap ptes, in one go.
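
As a rough illustration of the idea only (a userspace toy model, not kernel
code), the sketch below contrasts converting entries to swap entries one at a
time with converting a whole range in a single call. The type toy_pte_t, the
three functions, and the call counter are hypothetical names invented for
this example; the series itself operates on real page tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a page table entry; not the kernel's pte_t. */
typedef uint64_t toy_pte_t;

/* Counts calls, standing in for fixed per-call overhead. */
static unsigned long toy_nr_calls;

/* Overwrite nr consecutive entries with consecutive toy swap entries. */
static void toy_set_swap_ptes(toy_pte_t *ptep, uint64_t first_swap, size_t nr)
{
    toy_nr_calls++;
    for (size_t i = 0; i < nr; i++)
        ptep[i] = first_swap + i;
}

/* Per-entry path: one call per PTE. Returns the number of calls made. */
static unsigned long toy_unmap_per_pte(toy_pte_t *ptep, uint64_t first_swap,
                                       size_t nr)
{
    unsigned long before = toy_nr_calls;

    for (size_t i = 0; i < nr; i++)
        toy_set_swap_ptes(&ptep[i], first_swap + i, 1);
    return toy_nr_calls - before;
}

/* Batched path: one call for the whole range. Returns the number of calls. */
static unsigned long toy_unmap_batched(toy_pte_t *ptep, uint64_t first_swap,
                                       size_t nr)
{
    unsigned long before = toy_nr_calls;

    toy_set_swap_ptes(ptep, first_swap, nr);
    return toy_nr_calls - before;
}
```

Both paths leave the same entries behind; only the number of calls differs
(16 versus 1 for a 16-entry range in this model).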

The following benchmark (stolen from Barry at [1]) is used to measure the
time taken to swap out 256M worth of memory backed by 64K large folios:

 #define _GNU_SOURCE
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/mman.h>
 #include <string.h>
 #include <time.h>
 #include <unistd.h>
 #include <errno.h>

 #define SIZE_MB 256
 #define SIZE_BYTES (SIZE_MB * 1024 * 1024)

 int main(void) {
     void *addr = mmap(NULL, SIZE_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     if (addr == MAP_FAILED) {
         perror("mmap failed");
         return 1;
     }

     /* Touch every page so the range is populated before timing starts. */
     memset(addr, 0, SIZE_BYTES);

     struct timespec start, end;
     clock_gettime(CLOCK_MONOTONIC, &start);

     if (madvise(addr, SIZE_BYTES, MADV_PAGEOUT) != 0) {
         perror("madvise(MADV_PAGEOUT) failed");
         munmap(addr, SIZE_BYTES);
         return 1;
     }

     clock_gettime(CLOCK_MONOTONIC, &end);

     /* Elapsed time in nanoseconds, using integer arithmetic throughout. */
     long duration_ns = (end.tv_sec - start.tv_sec) * 1000000000L +
                        (end.tv_nsec - start.tv_nsec);
     printf("madvise(MADV_PAGEOUT) took %ld ns (%.3f ms)\n",
            duration_ns, duration_ns / 1e6);

     munmap(addr, SIZE_BYTES);
     return 0;
 }
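
The results below are given as a mean and a relative standard deviation. The
number of runs and the estimator used are not stated here, so the following
is only a sketch of one way to summarize repeated timings, assuming the
sample (n - 1) standard deviation. The helper names are hypothetical, and a
small Newton iteration replaces sqrt() so that the sketch builds without -lm.

```c
#include <stddef.h>

/* Square root by Newton's method; returns 0 for non-positive input. */
static double sqrt_newton(double x)
{
    if (x <= 0.0)
        return 0.0;

    double r = x > 1.0 ? x : 1.0;

    for (int i = 0; i < 100; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Arithmetic mean of n timing samples in nanoseconds (n must be >= 1). */
static double mean_ns(const long *samples, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += (double)samples[i];
    return sum / (double)n;
}

/* Sample standard deviation as a percentage of the mean (n must be >= 2). */
static double stddev_pct(const long *samples, size_t n)
{
    double mean = mean_ns(samples, n);
    double sq = 0.0;

    for (size_t i = 0; i < n; i++) {
        double d = (double)samples[i] - mean;
        sq += d * d;
    }
    return 100.0 * sqrt_newton(sq / (double)(n - 1)) / mean;
}
```

Feeding the duration_ns values from several runs of the benchmark into
mean_ns() and stddev_pct() gives figures in the same form as the ones
reported.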

Performance as measured on a Linux VM on Apple M3 (arm64):

Vanilla - Mean: 37401913 ns, std dev: 12%
Patched - Mean: 17420282 ns, std dev: 11%

No regression observed on 4K folios.

Performance as measured on bare metal x86:

Vanilla - Mean: 54986286 ns, std dev: 1.5%
Patched - Mean: 51930795 ns, std dev: 3%
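
For reference, the means above can be turned into a speedup ratio and a
percentage reduction. This is plain arithmetic on the reported numbers; the
helper names are made up for the example.

```c
#include <stdio.h>

/* Ratio of vanilla to patched time; above 1.0 means patched is faster. */
static double speedup(double vanilla_ns, double patched_ns)
{
    return vanilla_ns / patched_ns;
}

/* Time saved by the patched kernel, as a percentage of the vanilla time. */
static double reduction_pct(double vanilla_ns, double patched_ns)
{
    return 100.0 * (vanilla_ns - patched_ns) / vanilla_ns;
}

/* Print a one-line summary for one platform. */
static void report(const char *platform, double vanilla_ns, double patched_ns)
{
    printf("%s: %.2fx speedup, %.1f%% less time\n", platform,
           speedup(vanilla_ns, patched_ns),
           reduction_pct(vanilla_ns, patched_ns));
}
```

Calling report("arm64", 37401913, 17420282) and
report("x86", 54986286, 51930795) gives roughly 2.15x (53.4% less time) on
arm64 and roughly 1.06x (5.6% less time) on x86.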

Interestingly, no obvious improvement is observed on x86, which suggests
that the benefit on arm64 comes mainly from fewer ptep_get() calls and fewer
TLB flushes during contpte unfolding.

No regression is observed on 4K folios on x86 either.
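
To give a sense of scale, the sketch below counts how many PTEs the benchmark
touches. It assumes 4K base pages and that every 64K folio is fully mapped;
neither is stated explicitly above. Under those assumptions each folio is
mapped by 16 PTEs, which on arm64 with a 4K granule is also the size of one
contpte block. The macro and function names are illustrative only.

```c
#include <stddef.h>

#define SZ_4K   (4UL * 1024)
#define SZ_64K  (64UL * 1024)
#define SZ_256M (256UL * 1024 * 1024)

/* Number of folios of folio_size needed to back region_size bytes. */
static unsigned long nr_folios(unsigned long region_size,
                               unsigned long folio_size)
{
    return region_size / folio_size;
}

/* Number of base-page PTEs that map one folio. */
static unsigned long ptes_per_folio(unsigned long folio_size,
                                    unsigned long page_size)
{
    return folio_size / page_size;
}

/* Total PTEs covering the region. */
static unsigned long nr_ptes(unsigned long region_size,
                             unsigned long page_size)
{
    return region_size / page_size;
}
```

With the benchmark's 256M region this gives 4096 folios of 16 PTEs each, so
65536 PTEs in total. Handling them one at a time means 65536 steps, while one
batch per folio means 4096.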

---
Based on mm-unstable 3fa44141e0bb ("ksm: optimize rmap_walk_ksm by passing
a suitable address range"). mm-selftests pass.

v1->v2:
 - Keep nr_pages as unsigned long
 - Add patch 2
 - Rename some functions, make return type bool for functions returning 0/1
 - Drop page_vma_mapped_walk_jump - this is implicitly handled
 - Drop likely()
 - Add folio_dup/put_swap_pages, do subpage -> page
 - Shorten the kerneldoc to remove unnecessary information - keep it
   aligned with analogous functions
 - Put clear_pages_anon_exclusive in mm.h
 - Some more refactoring in last patch with finish_folio_unmap

Dev Jain (9):
  mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
  mm/rmap: refactor hugetlb pte clearing in try_to_unmap_one
  mm/rmap: refactor some code around lazyfree folio unmapping
  mm/memory: Batch set uffd-wp markers during zapping
  mm/rmap: batch unmap folios belonging to uffd-wp VMAs
  mm/swapfile: Add batched version of folio_dup_swap
  mm/swapfile: Add batched version of folio_put_swap
  mm/rmap: Add batched version of folio_try_share_anon_rmap_pte
  mm/rmap: enable batch unmapping of anonymous folios

 include/linux/mm.h        |  11 ++
 include/linux/mm_inline.h |  32 +--
 include/linux/rmap.h      |  27 ++-
 mm/internal.h             |  26 +++
 mm/memory.c               |  26 +--
 mm/mprotect.c             |  17 --
 mm/rmap.c                 | 404 +++++++++++++++++++++++---------------
 mm/shmem.c                |   8 +-
 mm/swap.h                 |  23 ++-
 mm/swapfile.c             |  42 ++--
 10 files changed, 380 insertions(+), 236 deletions(-)

-- 
2.34.1




Thread overview: 18+ messages
2026-04-10 10:31 [PATCH v2 0/9] Optimize anonymous large folio unmapping Dev Jain
2026-04-10 10:31 ` [PATCH v2 1/9] mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one Dev Jain
2026-04-11  1:02   ` Barry Song
2026-04-10 10:31 ` [PATCH v2 2/9] mm/rmap: refactor hugetlb pte clearing in try_to_unmap_one Dev Jain
2026-04-11  8:55   ` Barry Song
2026-04-11 16:05     ` Dev Jain
2026-04-11 16:24       ` Dev Jain
2026-04-11 11:45   ` Jie Gan
2026-04-11 16:08     ` Dev Jain
2026-04-10 10:31 ` [PATCH v2 3/9] mm/rmap: refactor some code around lazyfree folio unmapping Dev Jain
2026-04-10 10:31 ` [PATCH v2 4/9] mm/memory: Batch set uffd-wp markers during zapping Dev Jain
2026-04-14  5:46   ` Dev Jain
2026-04-10 10:32 ` [PATCH v2 5/9] mm/rmap: batch unmap folios belonging to uffd-wp VMAs Dev Jain
2026-04-10 10:32 ` [PATCH v2 6/9] mm/swapfile: Add batched version of folio_dup_swap Dev Jain
2026-04-10 10:32 ` [PATCH v2 7/9] mm/swapfile: Add batched version of folio_put_swap Dev Jain
2026-04-10 10:32 ` [PATCH v2 8/9] mm/rmap: Add batched version of folio_try_share_anon_rmap_pte Dev Jain
2026-04-10 10:32 ` [PATCH v2 9/9] mm/rmap: enable batch unmapping of anonymous folios Dev Jain
2026-04-10 13:53 ` [PATCH v2 0/9] Optimize anonymous large folio unmapping Lorenzo Stoakes
