From: Usama Arif <usama.arif@linux.dev>
To: Zi Yan <ziy@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
	ljs@kernel.org, bhe@redhat.com, willy@infradead.org,
	youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com,
	shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org,
	baohua@kernel.org, dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, npache@redhat.com,
	Liam.Howlett@oracle.com, ryan.roberts@arm.com,
	Vlastimil Babka <vbabka@kernel.org>,
	lance.yang@linux.dev, linux-kernel@vger.kernel.org,
	nphamcs@gmail.com, shikemeng@huaweicloud.com,
	kernel-team@meta.com, Ying Huang <ying.huang@linux.alibaba.com>,
	Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs
Date: Mon, 27 Apr 2026 21:12:24 +0100
Message-ID: <30bd8e73-d718-4a44-ac46-ce4579edfb10@linux.dev>
In-Reply-To: <D491228B-A6D0-4B55-BC56-8709C107CB30@nvidia.com>



On 27/04/2026 19:26, Zi Yan wrote:
> +Ying, who did the original THP swap work[1].
> 
> [1] https://lkml.org/lkml/2016/8/9/588
> 

Thanks Zi!

Sorry Ying for not CCing you! checkpatch on the whole series produced
a really long list and I wasn't sure if people would start thinking of
it as spam. I added the reviewers and maintainers of swap and THP, plus
a few folks who commented on the previous related work from which this
kicked off. I should have just CC'ed everyone.

> On 27 Apr 2026, at 6:01, Usama Arif wrote:
> 
>> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
>> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
>> unmap.
>>
>> This series introduces a PMD-level swap entry. The huge mapping is
>> preserved across the swap round-trip, and do_huge_pmd_swap_page()
>> resolves the entire 2 MB region in a single fault on swap-in;
>> no khugepaged involvement is needed. swap_map metadata is identical
>> either way (512 single-slot counts), so the PTE split buys nothing
>> on the swap side; it is purely a page-table representation change.
>>
>> This work was brought about after Hugh reported that one of the
>> major blockers for having lazy page table deposit is the lack of
>> PMD swap entries [1]. However, this series has benefits of its
>> own:
>> - The huge mapping is restored on swap-in.  Today even when the
>>   folio is still in swap cache as a single 2 MB folio, the swap-in
>>   path installs 512 PTE mappings -- the PMD mapping is gone, the
>>   freshly-materialised PTE table sticks around, and only
>>   khugepaged can later collapse the range back into a THP.
>>   do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
>>   one fault, no khugepaged involvement.
>> - Memory saved per swapped-out THP *once lazy page table deposit is
>>   merged* [2]. With lazy page table deposit [2], splitting a PMD into
>>   512 PTE swap entries forces allocation of a 4 KB PTE table page.
>>   The new path leaves the pgtable hierarchy at PMD level and avoids
>>   that allocation entirely.
>>   This will save memory when swapping, which is likely when there is
>>   memory pressure and exactly when allocations are most likely to
>>   fail.
>> - Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
>>   visit one PMD entry instead of 512 PTEs, reducing traversal
>>   time and lock-hold windows.
>>
>> The swap entry value is identical to 512 PTE swap entries (same
>> type, same starting offset), so swap_map refcounting is unchanged.
>> Only the page-table representation differs; the swap slot allocator,
>> swap I/O, and swap cache are untouched.  The new path falls back to
>> the existing PTE-split path whenever a PMD-order resource is
>> unavailable: zswap enabled, non-contiguous swap allocation
>> (THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
>> or fork, racing folio split, or rmap-driven split on a swapcache
>> folio.  Walkers that previously assumed every non-present PMD encodes
>> a PFN (migration / device_private) are taught to recognise PMD swap
>> entries.
>>
>> Patch breakdown:
>>
>> The series is ordered to preserve git bisectability: every consumer
>> of a PMD swap entry (split, fork, swapoff, walkers, UFFDIO_MOVE,
>> swap-in fault) lands before the producer.  The swap-out path that
>> actually installs PMD swap entries is the very last functional patch
>> (12), so no intermediate commit can leave the kernel handling a
>> PMD swap entry it does not yet understand.
>>
>> The first 4 patches are preparatory. Some of them (like the
>> softleaf_to_pmd() change in patch 1) are not strictly needed, but
>> are included to hopefully improve code quality and so that the PMD
>> swap entry changes look well integrated with the rest of mm.
>>
>> Prep patches:
>>   1. mm: add softleaf_to_pmd() and convert existing callers
>>      PMD counterpart to softleaf_to_pte(); needed to construct a
>>      PMD from a swap entry in later patches.
>>   2. mm: extract ensure_on_mmlist() helper
>>      Hoists the "register mm with swapoff" double-checked-locking
>>      pattern out of try_to_unmap_one() / copy_nonpresent_pte() so
>>      the PMD swap-out and PMD fork paths can reuse it without a
>>      third open-coded copy.
>>   3. fs/proc: use softleaf_has_pfn() in pagemap PMD walker
>>      pagemap_pmd_range_thp() today calls softleaf_to_page()
>>      unconditionally; a PMD swap entry has no PFN and would crash
>>      it.
>>   4. mm/huge_memory: move softleaf_to_folio() inside migration branch
>>      change_non_present_huge_pmd() today calls softleaf_to_folio()
>>      before branching on entry type, so a PMD swap entry would
>>      produce a bogus folio pointer that the migration-only code
>>      below would then dereference.
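
For readers who have not looked at patch 2 yet, the double-checked-locking
pattern it hoists can be sketched in user space roughly as below. This is
only a toy model: toy_mm, toy_ensure_on_mmlist() and the atomic_flag
spinlock are illustrative stand-ins (assumptions, not the code in the
patch); the real helper works on mm->mmlist under mmlist_lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for mm_struct: one flag instead of a list_head. */
struct toy_mm {
	atomic_bool on_mmlist;	/* models !list_empty(&mm->mmlist) */
};

/* Toy stand-ins for mmlist_lock and the global mmlist. */
static atomic_flag toy_mmlist_lock = ATOMIC_FLAG_INIT;
static int toy_mmlist_len;

static void toy_mm_init(struct toy_mm *mm)
{
	atomic_init(&mm->on_mmlist, false);
}

static void toy_ensure_on_mmlist(struct toy_mm *mm)
{
	/* First check, without the lock: the common case is "already on". */
	if (atomic_load_explicit(&mm->on_mmlist, memory_order_acquire))
		return;

	/* Spin until the lock is ours. */
	while (atomic_flag_test_and_set_explicit(&toy_mmlist_lock,
						 memory_order_acquire))
		;

	/* Second check, under the lock: another thread may have won. */
	if (!atomic_load_explicit(&mm->on_mmlist, memory_order_relaxed)) {
		toy_mmlist_len++;
		atomic_store_explicit(&mm->on_mmlist, true,
				      memory_order_release);
	}

	atomic_flag_clear_explicit(&toy_mmlist_lock, memory_order_release);
}
```

Calling toy_ensure_on_mmlist() twice on the same toy_mm registers it only
once, which is the property the swap-out and fork paths rely on.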
>>
>> Core patches:
>>   5. PMD swap entry detection (pmd_is_swap_entry,
>>      softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive
>>      helpers (x86/arm64/s390/riscv/loongarch).
>>   6. __split_huge_pmd_locked() learns to split a PMD swap entry
>>      into 512 PTE swap entries, used as the fallback when a
>>      PMD-order resource is unavailable.
> 
> I was wondering how to handle insufficient memory during swap-in.
> Here it is. I have not read the code, but the split should be
> straightforward, since we already have a contiguous swap space at
> swap-out time and the split is just to enable PTE-level swap in, right?
> 

Yes, that is correct. Patch 6 was actually one of the easier patches.
If the kernel can't allocate 2M, if the memcg charge fails, or for a few
other reasons, we fall back to the split.
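
To make the "same type, consecutive offsets" point concrete, here is a
rough user-space model of the split. It is a sketch only: toy_swp_entry and
toy_split_pmd_swap_entry() are made-up names, and the real
__split_huge_pmd_locked() change also has to deal with the page table and
the per-entry software bits, which are not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PTES_PER_PMD 512	/* 2 MB / 4 KB, as in the cover letter */

/* Toy stand-in for swp_entry_t: a swap type plus a slot offset. */
struct toy_swp_entry {
	unsigned int type;
	uint64_t offset;
};

/*
 * Expand one PMD-level swap entry into 512 PTE-level swap entries.
 * The slots are contiguous, so PTE i simply refers to start offset + i
 * on the same swap device.
 */
static void toy_split_pmd_swap_entry(struct toy_swp_entry pmd_entry,
				     struct toy_swp_entry *ptes)
{
	assert(ptes != NULL);

	for (size_t i = 0; i < TOY_PTES_PER_PMD; i++) {
		ptes[i].type = pmd_entry.type;
		ptes[i].offset = pmd_entry.offset + i;
	}
}
```

Nothing on the swap side changes in this model; only the representation of
the same 512 slots does.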


>>   7. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry
>>      in one folio_dup_swap() call, with GFP_KERNEL retry mirroring
>>      copy_pte_range().
>>   8. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls
>>      the PMD; falls back to PTE-split + unuse_pte_range() on error.
>>   9. Walker updates: zap_huge_pmd, change_huge_pmd,
>>      change_non_present_huge_pmd, move_soft_dirty_pmd,
>>      clear_soft_dirty_pmd, make_uffd_wp_pmd, smaps_pmd_entry,
>>      queue_folios_pmd (mempolicy), check_pmd_state (khugepaged),
>>      and the madvise_cold_or_pageout_pte_range / madvise_free_huge_pmd
>>      VM_BUG_ON extensions.
>>  10. UFFDIO_MOVE: move_pages_huge_pmd() learns to move a PMD swap
>>      entry whole via a new move_swap_pmd() helper modeled on
>>      move_swap_pte().
>>  11. Swap-in: do_huge_pmd_swap_page() resolves a PMD swap fault in
>>      one shot.  Handles racing splits, SWP_STABLE_WRITES read-only
>>      mapping, immediate COW for write faults; falls back to PTE-split
>>      on any PMD-order resource shortfall.
>>  12. Swap-out: shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for
>>      PMD-mappable swapcache folios (when zswap is disabled), and
>>      try_to_unmap_one() installs one PMD swap entry via
>>      set_pmd_swap_entry() instead of splitting.
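
The swap-out decision described for patch 12 can be modelled as a small
flag computation. This is a hedged sketch, not the shrink_folio_list()
code: toy_folio_state, toy_unmap_flags() and TOY_TTU_SPLIT_HUGE_PMD are
illustrative names, and the real check may involve more conditions than
the two named in the cover letter.

```c
#include <stdbool.h>

#define TOY_TTU_SPLIT_HUGE_PMD 0x1u	/* stand-in for TTU_SPLIT_HUGE_PMD */

/* Toy view of the folio properties the decision depends on. */
struct toy_folio_state {
	bool pmd_mappable;
	bool in_swapcache;
};

/*
 * Return the unmap flags for a folio about to be swapped out.
 * A PMD-mappable folio is split unless it is in the swap cache and
 * zswap is disabled, in which case the PMD mapping is kept so that a
 * single PMD swap entry can be installed.
 */
static unsigned int toy_unmap_flags(struct toy_folio_state f,
				    bool zswap_enabled)
{
	unsigned int flags = 0;

	if (f.pmd_mappable) {
		bool keep_pmd = f.in_swapcache && !zswap_enabled;

		if (!keep_pmd)
			flags |= TOY_TTU_SPLIT_HUGE_PMD;
	}
	return flags;
}
```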
>>
>> Testing:
>>  13. selftests/mm: 12 tests covering swap-out/in, fork, fork+COW,
>>      repeated cycles, write fault, munmap, mprotect, mremap, pagemap,
>>      MADV_FREE, UFFDIO_MOVE, swapoff.
>>
>> Making PMD swap entries work with zswap is another project on its own and
>> should be in a separate follow up series.
>>
>> The patches are on top of mm-unstable from 23 April
>> (2bcc13c29c711381d815c1ba5d5b25737400c71a).
>>
>> [1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/
>> [2] https://lore.kernel.org/all/20260327021403.214713-1-usama.arif@linux.dev/
>>
>> Usama Arif (13):
>>   mm: add softleaf_to_pmd() and convert existing callers
>>   mm: extract ensure_on_mmlist() helper
>>   fs/proc: use softleaf_has_pfn() in pagemap PMD walker
>>   mm/huge_memory: move softleaf_to_folio() inside migration branch
>>   mm: add PMD swap entry detection support
>>   mm: add PMD swap entry splitting support
>>   mm: handle PMD swap entries in fork path
>>   mm: swap in PMD swap entries as whole THPs during swapoff
>>   mm: handle PMD swap entries in non-present PMD walkers
>>   mm: handle PMD swap entries in UFFDIO_MOVE
>>   mm: handle PMD swap entry faults on swap-in
>>   mm: install PMD swap entries on swap-out
>>   selftests/mm: add PMD swap entry tests
>>
>>  arch/arm64/include/asm/pgtable.h      |   4 +
>>  arch/loongarch/include/asm/pgtable.h  |  17 +
>>  arch/riscv/include/asm/pgtable.h      |  15 +
>>  arch/s390/include/asm/pgtable.h       |  15 +
>>  arch/x86/include/asm/pgtable.h        |  15 +
>>  fs/proc/task_mmu.c                    |  47 +-
>>  include/linux/huge_mm.h               |  11 +
>>  include/linux/leafops.h               |  44 +-
>>  include/linux/swap.h                  |   4 +-
>>  include/linux/vm_event_item.h         |   1 +
>>  mm/hmm.c                              |   3 +-
>>  mm/huge_memory.c                      | 540 +++++++++++++++++++++--
>>  mm/internal.h                         |  49 +++
>>  mm/khugepaged.c                       |   6 +
>>  mm/madvise.c                          |   5 +-
>>  mm/memory.c                           |  51 +--
>>  mm/mempolicy.c                        |   2 +
>>  mm/rmap.c                             |  27 +-
>>  mm/swap.h                             |   7 +
>>  mm/swap_state.c                       |  35 ++
>>  mm/swapfile.c                         | 144 +++++-
>>  mm/vmscan.c                           |  14 +-
>>  mm/vmstat.c                           |   1 +
>>  tools/testing/selftests/mm/Makefile   |   1 +
>>  tools/testing/selftests/mm/pmd_swap.c | 607 ++++++++++++++++++++++++++
>>  25 files changed, 1554 insertions(+), 111 deletions(-)
>>  create mode 100644 tools/testing/selftests/mm/pmd_swap.c
>>
>> -- 
>> 2.52.0
> 
> 
> Best Regards,
> Yan, Zi



Thread overview: 17+ messages
2026-04-27 10:01 [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 10:01 ` [PATCH 01/13] mm: add softleaf_to_pmd() and convert existing callers Usama Arif
2026-04-27 10:01 ` [PATCH 02/13] mm: extract ensure_on_mmlist() helper Usama Arif
2026-04-27 10:01 ` [PATCH 03/13] fs/proc: use softleaf_has_pfn() in pagemap PMD walker Usama Arif
2026-04-27 10:01 ` [PATCH 04/13] mm/huge_memory: move softleaf_to_folio() inside migration branch Usama Arif
2026-04-27 10:01 ` [PATCH 05/13] mm: add PMD swap entry detection support Usama Arif
2026-04-27 10:01 ` [PATCH 06/13] mm: add PMD swap entry splitting support Usama Arif
2026-04-27 10:01 ` [PATCH 07/13] mm: handle PMD swap entries in fork path Usama Arif
2026-04-27 10:01 ` [PATCH 08/13] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
2026-04-27 10:01 ` [PATCH 09/13] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
2026-04-27 10:01 ` [PATCH 10/13] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
2026-04-27 10:02 ` [PATCH 11/13] mm: handle PMD swap entry faults on swap-in Usama Arif
2026-04-27 10:02 ` [PATCH 12/13] mm: install PMD swap entries on swap-out Usama Arif
2026-04-27 10:02 ` [PATCH 13/13] selftests/mm: add PMD swap entry tests Usama Arif
2026-04-27 13:38 ` [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-04-27 18:26 ` Zi Yan
2026-04-27 20:12   ` Usama Arif [this message]
