From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>,
	david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
	ljs@kernel.org, ziy@nvidia.com, linux-mm@kvack.org
Cc: ying.huang@linux.alibaba.com, Baoquan He <baoquan.he@linux.dev>,
	willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org,
	riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr,
	kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, npache@redhat.com,
	Liam R. Howlett <liam@infradead.org>,
	ryan.roberts@arm.com, Vlastimil Babka <vbabka@kernel.org>,
	lance.yang@linux.dev, linux-kernel@vger.kernel.org,
	nphamcs@gmail.com, shikemeng@huaweicloud.com,
	kernel-team@meta.com, Usama Arif <usama.arif@linux.dev>
Subject: [PATCH v3 07/11] mm: handle PMD swap entries in MADV_WILLNEED
Date: Fri,  3 Jul 2026 10:38:24 -0700
Message-ID: <20260703173903.3789516-8-usama.arif@linux.dev>
In-Reply-To: <20260703173903.3789516-1-usama.arif@linux.dev>

swapin_walk_pmd_entry() only walks PTEs and skips non-present PMDs, so
MADV_WILLNEED is currently a no-op on a range mapped by a PMD swap entry.

Handle PMD swap entries under pmd_trans_huge_lock(). If the covered
swap-cache range already has a PMD-sized folio, there is nothing left
to prefetch. If the range has split cache state, or any covered slot
currently has a zswap entry, split the PMD swap entry and ask the
walker to retry so the PTE path can handle the individual slots.
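
The checks made under the PMD lock can be read as a small decision
table. The following is a userspace sketch only, not kernel code: the
enum names, the "empty" cache state and willneed_decide() are
hypothetical stand-ins for swap_pmd_cache_lookup() and
zswap_range_has_entry() as they are used in the diff below.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the swap-cache state for one PMD-sized range. */
enum pmd_cache_state {
	PMD_CACHE_EMPTY,	/* assumed: no swap-cache folio in the range */
	PMD_CACHE_HUGE,		/* one PMD-sized folio covers the range */
	PMD_CACHE_SPLIT,	/* some slots are cached as smaller folios */
};

/* Hypothetical names for the three outcomes described above. */
enum willneed_action {
	WILLNEED_DONE,		/* already cached at PMD order */
	WILLNEED_SPLIT_RETRY,	/* split the PMD entry, retry via PTEs */
	WILLNEED_SWAPIN_PMD,	/* read the range in at PMD order */
};

static enum willneed_action willneed_decide(enum pmd_cache_state state,
					    bool zswap_has_entry)
{
	assert(state <= PMD_CACHE_SPLIT);

	/* Checked first, so a cached huge folio wins over zswap state. */
	if (state == PMD_CACHE_HUGE)
		return WILLNEED_DONE;
	if (state == PMD_CACHE_SPLIT || zswap_has_entry)
		return WILLNEED_SPLIT_RETRY;
	return WILLNEED_SWAPIN_PMD;
}
```

In this model only an empty cache with no zswap entry in the range
reaches the PMD-order swapin step.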

Otherwise pin the swap device and read the folio in at PMD order via
swapin_sync(BIT(HPAGE_PMD_ORDER)). This keeps the subsequent fault on
the do_huge_pmd_swap_page() path and avoids order-0 readahead
needlessly splitting the PMD swap entry. If, after the PMD lock is
dropped, PMD-order swapin loses a race to per-slot swap-cache
population, split and retry through the PTE path instead.
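
The fallback after swapin_sync() returns can be modelled the same way.
This is a userspace sketch under stated assumptions: struct
swapin_result and willneed_needs_split() are hypothetical and only
mirror the folio checks in the diff below. As an example, with 4 KiB
base pages and 2 MiB PMDs the PMD order is 9, so pmd_nr would be 512.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flattened view of what the swapin call handed back. */
struct swapin_result {
	int err;		/* 0, or a negative errno on failure */
	bool have_folio;	/* false models a NULL folio */
	unsigned long nr_pages;	/* size of the returned folio in pages */
	bool locked;		/* models folio_test_locked() */
	bool uptodate;		/* models folio_test_uptodate() */
};

/* Return true when the PMD swap entry should be split and retried. */
static bool willneed_needs_split(const struct swapin_result *res,
				 unsigned long pmd_nr, bool zswap_has_entry)
{
	assert(pmd_nr > 0);

	/* Only -EBUSY is treated as a lost race; other errors give up. */
	if (res->err)
		return res->err == -EBUSY;
	if (!res->have_folio)
		return false;
	/* A smaller folio means per-slot cache state appeared meanwhile. */
	if (res->nr_pages != pmd_nr)
		return true;
	/* No read in flight and no data, and zswap now holds a slot. */
	return !res->locked && !res->uptodate && zswap_has_entry;
}
```

A PMD-sized folio that is still locked is treated as a read in flight
and is left alone, even if a zswap entry shows up in the range.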

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/madvise.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 0d6aa0608f70..78a08039e173 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -32,6 +32,7 @@
 #include <linux/leafops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
+#include <linux/zswap.h>
 
 #include <asm/tlb.h>
 
@@ -193,6 +194,79 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	spinlock_t *ptl;
 	unsigned long addr;
 
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		pmd_t pmdval = *pmd;
+
+		if (pmd_is_swap_entry(pmdval)) {
+			softleaf_t entry = softleaf_from_pmd(pmdval);
+			struct vm_fault vmf = {
+				.vma = vma,
+				.address = start,
+				.real_address = start,
+				.pmd = pmd,
+			};
+			struct swap_info_struct *si;
+			struct folio *folio;
+			enum swap_pmd_cache cache_state;
+			bool split = false;
+
+			cache_state = swap_pmd_cache_lookup(entry, &folio);
+			if (cache_state == SWAP_PMD_CACHE_HUGE) {
+				folio_put(folio);
+				spin_unlock(ptl);
+				goto ret;
+			}
+			if (cache_state == SWAP_PMD_CACHE_SPLIT ||
+			    zswap_range_has_entry(entry, HPAGE_PMD_NR)) {
+				spin_unlock(ptl);
+				__split_huge_pmd(vma, pmd, start, false);
+				walk->action = ACTION_AGAIN;
+				goto ret;
+			}
+
+			/*
+			 * Pin the swap device under the PMD lock so the
+			 * PMD-swap-entry observation keeps the entry valid for
+			 * swapin_sync().
+			 */
+			si = get_swap_device(entry);
+			spin_unlock(ptl);
+			if (!si)
+				goto ret;
+
+			folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
+					    BIT(HPAGE_PMD_ORDER), &vmf,
+					    NULL, 0);
+			/*
+			 * The empty-cache observation was made under the PMD
+			 * lock, but swap cache can change after dropping it. If
+			 * PMD-order swapin lost a race to per-slot cache state,
+			 * retry through the PTE path.
+			 */
+			if (IS_ERR(folio)) {
+				if (PTR_ERR(folio) == -EBUSY)
+					split = true;
+			} else if (folio) {
+				if (folio_nr_pages(folio) != HPAGE_PMD_NR)
+					split = true;
+				else if (!folio_test_locked(folio) &&
+					 !folio_test_uptodate(folio) &&
+					 zswap_range_has_entry(entry,
+							       HPAGE_PMD_NR))
+					split = true;
+				folio_put(folio);
+			}
+			put_swap_device(si);
+			if (split) {
+				__split_huge_pmd(vma, pmd, start, false);
+				walk->action = ACTION_AGAIN;
+			}
+			goto ret;
+		}
+		spin_unlock(ptl);
+	}
+
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
 		pte_t pte;
 		softleaf_t entry;
@@ -221,6 +295,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	if (ptep)
 		pte_unmap_unlock(ptep, ptl);
 	swap_read_unplug(splug);
+ret:
 	cond_resched();
 
 	return 0;
-- 
2.53.0-Meta


Thread overview: 14+ messages
2026-07-03 17:38 [PATCH v3 00/11] mm: PMD-level swap entries for anonymous THPs Usama Arif
2026-07-03 17:38 ` [PATCH v3 01/11] mm: add PMD swap entry detection support Usama Arif
2026-07-03 17:38 ` [PATCH v3 02/11] mm: add PMD swap entry splitting support Usama Arif
2026-07-03 17:38 ` [PATCH v3 03/11] mm: handle PMD swap entries in fork path Usama Arif
2026-07-03 17:38 ` [PATCH v3 04/11] mm: zswap: add range lookup for large-folio swapin Usama Arif
2026-07-03 17:38 ` [PATCH v3 05/11] mm: swap in PMD swap entries as whole THPs during swapoff Usama Arif
2026-07-03 17:38 ` [PATCH v3 06/11] mm: handle PMD swap entries in non-present PMD walkers Usama Arif
2026-07-03 17:38 ` Usama Arif [this message]
2026-07-03 17:38 ` [PATCH v3 08/11] mm: handle PMD swap entries in UFFDIO_MOVE Usama Arif
2026-07-03 17:38 ` [PATCH v3 09/11] mm: handle PMD swap entry faults on swap-in Usama Arif
2026-07-03 17:38 ` [PATCH v3 10/11] mm: install PMD swap entries on swap-out Usama Arif
2026-07-03 17:38 ` [PATCH v3 11/11] selftests/mm: add PMD swap entry tests Usama Arif
2026-07-04  6:27   ` kernel test robot
2026-07-04  8:30   ` kernel test robot
