From: Lance Yang <lance.yang@linux.dev>
To: usama.arif@linux.dev
Cc: akpm@linux-foundation.org, david@kernel.org, chrisl@kernel.org,
	kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com,
	ying.huang@linux.alibaba.com, baoquan.he@linux.dev,
	willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org,
	riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr,
	kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, npache@redhat.com,
	liam@infradead.org, ryan.roberts@arm.com, vbabka@kernel.org,
	lance.yang@linux.dev, linux-kernel@vger.kernel.org,
	nphamcs@gmail.com, shikemeng@huaweicloud.com,
	kernel-team@meta.com, linux-mm@kvack.org
Subject: Re: [v2 15/16] mm: install PMD swap entries on swap-out
Date: Fri, 12 Jun 2026 22:21:24 +0800
Message-ID: <20260612142124.73367-1-lance.yang@linux.dev>
In-Reply-To: <20260602142537.198755-16-usama.arif@linux.dev>

+Cc linux-mm

On Tue, Jun 02, 2026 at 07:24:23AM -0700, Usama Arif wrote:
[...]
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index e8a90911bf88..0f376fbf9bb3 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -64,6 +64,7 @@
> 
> #include <linux/swapops.h>
> #include <linux/sched/sysctl.h>
>+#include <linux/zswap.h>
> 
> #include "internal.h"
> #include "swap.h"
>@@ -1332,7 +1333,18 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> 			enum ttu_flags flags = TTU_BATCH_FLUSH;
> 			bool was_swapbacked = folio_test_swapbacked(folio);
> 
>-			if (folio_test_pmd_mappable(folio))
>+			/*
>+			 * With THP_SWAP, PMD-mappable folios already in the
>+			 * swap cache can be unmapped with a PMD-level swap
>+			 * entry, avoiding the cost of splitting the PMD.
>+			 * Skip this when zswap has been enabled because
>+			 * zswap stores pages individually and cannot
>+			 * reconstruct a large folio on swap-in.
>+			 */
>+			if (folio_test_pmd_mappable(folio) &&
>+			    !(IS_ENABLED(CONFIG_THP_SWAP) &&
>+			      folio_test_swapcache(folio) &&
>+			      zswap_never_enabled()))

There may be a race here ...

1) zswap_never_enabled() passes, 2) try_to_unmap() installs the PMD swap
entry, and 3) zswap can still be enabled before the later pageout() ->
swap_writeout() -> zswap_store().

zswap_store() loops over each page of the folio:

	for (index = 0; index < nr_pages; ++index) {
		struct page *page = folio_page(folio, index);

		if (!zswap_store_page(page, objcg, pool))
			goto put_pool;
	}

So there is still a single PMD swap entry, while zswap now holds 512
entries, one for each page of the folio ...
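
To make the ordering concrete, here is a rough userspace sketch of the
interleaving. It is not kernel code: every model_* name is made up for
illustration, and the writeout step is assumed to store every page
successfully. It only shows that a check done before the unmap says
nothing about the state at writeout time.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-folio page count quoted above; just a constant in this model. */
#define MODEL_NR_PAGES 512

/* Hypothetical state, standing in for the real kernel structures. */
struct model_state {
	bool zswap_ever_enabled;	/* inverse of zswap_never_enabled() */
	bool pmd_swap_entry;		/* one PMD-level swap entry installed */
	int zswap_entries;		/* per-page entries held by zswap */
};

/* Steps 1 and 2: the check passes, so the unmap keeps the PMD whole. */
static void model_unmap(struct model_state *s)
{
	if (!s->zswap_ever_enabled)
		s->pmd_swap_entry = true;
}

/* Step 3: zswap gets enabled after the check has been made. */
static void model_enable_zswap(struct model_state *s)
{
	s->zswap_ever_enabled = true;
}

/* Later writeout: zswap, if enabled by now, stores page by page. */
static void model_writeout(struct model_state *s)
{
	if (!s->zswap_ever_enabled)
		return;	/* would go to the backing swap device instead */

	for (int index = 0; index < MODEL_NR_PAGES; ++index)
		s->zswap_entries++;
}

/* True if we end with one PMD entry backed by per-page zswap entries. */
static bool model_mismatch(bool enable_between)
{
	struct model_state s = { 0 };

	model_unmap(&s);
	if (enable_between)
		model_enable_zswap(&s);
	model_writeout(&s);

	return s.pmd_swap_entry && s.zswap_entries == MODEL_NR_PAGES;
}

/* The mismatch only shows up when zswap is enabled inside the window. */
static void model_selfcheck(void)
{
	assert(model_mismatch(true));
	assert(!model_mismatch(false));
}
```

The model is single-threaded on purpose: the problem is the ordering,
not any particular locking detail.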

If the swapcache is reclaimed later, a PMD fault will try PMD-order
swapin again:

do_huge_pmd_swap_page()
	swap_cache_get_folio()
	swapin_sync(..., BIT(HPAGE_PMD_ORDER))
		swap_read_folio()
			zswap_load()

zswap_load() rejects large folios with -EINVAL and leaves the folio not
uptodate:

	/*
	 * Large folios should not be swapped in while zswap is being used, as
	 * they are not properly handled. Zswap does not properly load large
	 * folios, and a large folio may only be partially in zswap.
	 */
	if (WARN_ON_ONCE(folio_test_large(folio))) {
		folio_unlock(folio);
		return -EINVAL;
	}

swap_read_folio() jumps to finish and does not try a normal swap read:

	if (zswap_load(folio) != -ENOENT)
		goto finish;

And the awkward part is that no error really gets propagated ...
swap_read_folio() is void, and swapin_sync() just hands the same folio
back to do_huge_pmd_swap_page().

At that point the folio is still !uptodate, so the fault would just end
up:

	if (unlikely(!folio_test_uptodate(folio))) {
		ret = VM_FAULT_SIGBUS;
		goto out_page;
	}
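
The whole chain can be modelled in a few lines of userspace C. Again,
this is only a sketch: the model_* names are hypothetical, the lookup in
zswap is assumed to succeed for small folios, and I am assuming that
zswap_load() checks zswap_never_enabled() before the large folio check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two folio properties that matter here. */
struct model_folio {
	bool large;
	bool uptodate;
};

enum model_fault { MODEL_FAULT_OK, MODEL_FAULT_SIGBUS };

/* Mirrors the zswap_load() return values discussed above. */
static int model_zswap_load(struct model_folio *folio, bool zswap_ever_enabled)
{
	if (!zswap_ever_enabled)
		return -ENOENT;		/* caller falls back to a normal read */
	if (folio->large)
		return -EINVAL;		/* rejected, folio stays !uptodate */

	folio->uptodate = true;		/* small folio: assume the entry is found */
	return 0;
}

/* Returns nothing, like swap_read_folio(): the -EINVAL is dropped here. */
static void model_swap_read_folio(struct model_folio *folio,
				  bool zswap_ever_enabled)
{
	if (model_zswap_load(folio, zswap_ever_enabled) != -ENOENT)
		return;			/* "goto finish": no normal swap read */

	folio->uptodate = true;		/* normal swap read, assumed to succeed */
}

/* PMD-order swap-in: the only signal left is the uptodate flag. */
static enum model_fault model_pmd_swap_fault(bool zswap_ever_enabled)
{
	struct model_folio folio = { .large = true, .uptodate = false };

	model_swap_read_folio(&folio, zswap_ever_enabled);

	return folio.uptodate ? MODEL_FAULT_OK : MODEL_FAULT_SIGBUS;
}

/* Enabling zswap after the unmap turns a working fault into SIGBUS. */
static void model_fault_selfcheck(void)
{
	assert(model_pmd_swap_fault(false) == MODEL_FAULT_OK);
	assert(model_pmd_swap_fault(true) == MODEL_FAULT_SIGBUS);
}
```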

This looks like a race, but is it actually possible?

Cheers, Lance

> 				flags |= TTU_SPLIT_HUGE_PMD;
> 			/*
> 			 * Without TTU_SYNC, try_to_unmap will only begin to
>diff --git a/mm/vmstat.c b/mm/vmstat.c
>index f534972f517d..9b4963a7eb04 100644
>--- a/mm/vmstat.c
>+++ b/mm/vmstat.c
>@@ -1421,6 +1421,7 @@ const char * const vmstat_text[] = {
> 	[I(THP_ZERO_PAGE_ALLOC_FAILED)]		= "thp_zero_page_alloc_failed",
> 	[I(THP_SWPOUT)]				= "thp_swpout",
> 	[I(THP_SWPOUT_FALLBACK)]		= "thp_swpout_fallback",
>+	[I(THP_SWPOUT_PMD)]			= "thp_swpout_pmd",
> #endif
> #ifdef CONFIG_BALLOON
> 	[I(BALLOON_INFLATE)]			= "balloon_inflate",
>-- 
>2.52.0
>
>

