From: Lance Yang <lance.yang@linux.dev>
To: usama.arif@linux.dev
Cc: akpm@linux-foundation.org, david@kernel.org, chrisl@kernel.org,
	kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com,
	ying.huang@linux.alibaba.com, baoquan.he@linux.dev,
	willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org,
	riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr,
	kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, npache@redhat.com,
	liam@infradead.org, ryan.roberts@arm.com, vbabka@kernel.org,
	lance.yang@linux.dev, linux-kernel@vger.kernel.org,
	nphamcs@gmail.com, shikemeng@huaweicloud.com,
	kernel-team@meta.com, linux-mm@kvack.org
Subject: Re: [v2 15/16] mm: install PMD swap entries on swap-out
Date: Fri, 12 Jun 2026 22:21:24 +0800
Message-ID: <20260612142124.73367-1-lance.yang@linux.dev>
In-Reply-To: <20260602142537.198755-16-usama.arif@linux.dev>

+Cc linux-mm

On Tue, Jun 02, 2026 at 07:24:23AM -0700, Usama Arif wrote:
[...]
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index e8a90911bf88..0f376fbf9bb3 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -64,6 +64,7 @@
> 
> #include <linux/swapops.h>
> #include <linux/sched/sysctl.h>
>+#include <linux/zswap.h>
> 
> #include "internal.h"
> #include "swap.h"
>@@ -1332,7 +1333,18 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> 			enum ttu_flags flags = TTU_BATCH_FLUSH;
> 			bool was_swapbacked = folio_test_swapbacked(folio);
> 
>-			if (folio_test_pmd_mappable(folio))
>+			/*
>+			 * With THP_SWAP, PMD-mappable folios already in the
>+			 * swap cache can be unmapped with a PMD-level swap
>+			 * entry, avoiding the cost of splitting the PMD.
>+			 * Skip this when zswap has been enabled because
>+			 * zswap stores pages individually and cannot
>+			 * reconstruct a large folio on swap-in.
>+			 */
>+			if (folio_test_pmd_mappable(folio) &&
>+			    !(IS_ENABLED(CONFIG_THP_SWAP) &&
>+			      folio_test_swapcache(folio) &&
>+			      zswap_never_enabled()))

There may be a race here ...

1) zswap_never_enabled() passes, 2) try_to_unmap() installs the PMD swap
entry, and 3) zswap can still be enabled before the later pageout() ->
swap_writeout() -> zswap_store().
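
To make the window concrete, here is a tiny user-space model of the
check-then-act ordering above. It is only a sketch: every name in it
(struct model, model_unmap(), model_writeout(), model_run()) is
hypothetical and not a kernel API, and the 512 assumes 4K pages with a
2M PMD.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model {
	bool zswap_ever_enabled;	/* stands in for !zswap_never_enabled() */
	bool pmd_swap_entry;		/* set once a PMD-level swap entry exists */
	int zswap_page_entries;		/* per-page entries held by zswap */
};

enum { MODEL_NR_PAGES = 512 };		/* assumes 4K pages and a 2M PMD */

/* Steps 1 and 2: sample the flag, then unmap based on the sampled value. */
static void model_unmap(struct model *m)
{
	if (!m->zswap_ever_enabled)
		m->pmd_swap_entry = true;
}

/* Step 3: writeback looks at the flag again, after the unmap decision. */
static void model_writeout(struct model *m)
{
	assert(m->zswap_page_entries == 0);
	if (m->zswap_ever_enabled)
		m->zswap_page_entries = MODEL_NR_PAGES;
}

/* Returns true when the run ends with a PMD entry backed by zswap pages. */
static bool model_run(bool enable_between_steps)
{
	struct model m = { 0 };

	model_unmap(&m);
	if (enable_between_steps)
		m.zswap_ever_enabled = true;	/* zswap enabled in the window */
	model_writeout(&m);

	return m.pmd_swap_entry && m.zswap_page_entries > 0;
}
```

In this model, model_run(true) ends in the mixed state and
model_run(false) does not. It says nothing about how wide the real
window is, only that nothing in the sequence closes it.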

zswap_store() loops over each page of the folio:

	for (index = 0; index < nr_pages; ++index) {
		struct page *page = folio_page(folio, index);

		if (!zswap_store_page(page, objcg, pool))
			goto put_pool;
	}

So there is still a single PMD swap entry in the page table, while zswap
holds 512 entries (assuming 4K pages and a 2M PMD), one for each page of
the folio ...
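
For reference, the 512 is just the PMD size divided by the base page
size. A minimal sketch of that arithmetic, assuming x86-64 style shifts
(PMD_SHIFT 21, PAGE_SHIFT 12); the helper names are made up for
illustration:

```c
#include <assert.h>

/* Hypothetical helper: order of a PMD-sized folio for the given shifts. */
static unsigned int pmd_order(unsigned int pmd_shift, unsigned int page_shift)
{
	assert(pmd_shift >= page_shift);
	return pmd_shift - page_shift;
}

/* Hypothetical helper: number of base pages covered by one PMD entry. */
static unsigned long pages_per_pmd(unsigned int pmd_shift,
				   unsigned int page_shift)
{
	return 1UL << pmd_order(pmd_shift, page_shift);
}
```

With shifts of 21 and 12 this gives order 9, so one PMD swap entry
stands in for 512 per-page zswap entries.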

If the swapcache is reclaimed later, a PMD fault will try PMD-order
swapin again:

do_huge_pmd_swap_page()
	swap_cache_get_folio()
	swapin_sync(..., BIT(HPAGE_PMD_ORDER))
		swap_read_folio()
			zswap_load()

zswap_load() rejects large folios with -EINVAL and leaves the folio not
uptodate:

	/*
	 * Large folios should not be swapped in while zswap is being used, as
	 * they are not properly handled. Zswap does not properly load large
	 * folios, and a large folio may only be partially in zswap.
	 */
	if (WARN_ON_ONCE(folio_test_large(folio))) {
		folio_unlock(folio);
		return -EINVAL;
	}

swap_read_folio() jumps to finish and does not try a normal swap read:

	if (zswap_load(folio) != -ENOENT)
		goto finish;

And the awkward part is that no error really gets propagated ...
swap_read_folio() is void, and swapin_sync() just hands the same folio
back to do_huge_pmd_swap_page().

At that point the folio is still !uptodate, so the fault would just end
up:

	if (unlikely(!folio_test_uptodate(folio))) {
		ret = VM_FAULT_SIGBUS;
		goto out_page;
	}
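
The whole chain can be modelled in user space to show where the error
disappears. This is a sketch of the control flow in the snippets above,
not the kernel code itself: struct model_folio, enum model_fault and the
model_*() functions are hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the control flow mirrors the snippets. */
struct model_folio {
	bool large;
	bool uptodate;
};

enum model_fault { MODEL_FAULT_OK, MODEL_FAULT_SIGBUS };

/* Like the quoted zswap_load(): refuse large folios, leave them !uptodate. */
static int model_zswap_load(struct model_folio *folio)
{
	assert(folio != NULL);
	if (folio->large)
		return -EINVAL;
	folio->uptodate = true;		/* small folio found in zswap */
	return 0;
}

/* Void like swap_read_folio(): anything but -ENOENT skips the disk read. */
static void model_swap_read_folio(struct model_folio *folio)
{
	if (model_zswap_load(folio) != -ENOENT)
		return;			/* "goto finish": -EINVAL is lost here */
	folio->uptodate = true;		/* normal swap read, never reached here */
}

/* The fault path can only infer failure from the uptodate flag. */
static enum model_fault model_pmd_swap_fault(bool large)
{
	struct model_folio folio = { .large = large, .uptodate = false };

	model_swap_read_folio(&folio);
	return folio.uptodate ? MODEL_FAULT_OK : MODEL_FAULT_SIGBUS;
}
```

Here model_pmd_swap_fault(true) ends in MODEL_FAULT_SIGBUS, because the
only signal that survives the void read helper is the missing uptodate
flag.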

This looks like a race, but is it actually possible?

Cheers, Lance

> 				flags |= TTU_SPLIT_HUGE_PMD;
> 			/*
> 			 * Without TTU_SYNC, try_to_unmap will only begin to
>diff --git a/mm/vmstat.c b/mm/vmstat.c
>index f534972f517d..9b4963a7eb04 100644
>--- a/mm/vmstat.c
>+++ b/mm/vmstat.c
>@@ -1421,6 +1421,7 @@ const char * const vmstat_text[] = {
> 	[I(THP_ZERO_PAGE_ALLOC_FAILED)]		= "thp_zero_page_alloc_failed",
> 	[I(THP_SWPOUT)]				= "thp_swpout",
> 	[I(THP_SWPOUT_FALLBACK)]		= "thp_swpout_fallback",
>+	[I(THP_SWPOUT_PMD)]			= "thp_swpout_pmd",
> #endif
> #ifdef CONFIG_BALLOON
> 	[I(BALLOON_INFLATE)]			= "balloon_inflate",
>-- 
>2.52.0
>
>

