* [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin
@ 2026-05-08 20:18 fujunjie
2026-05-08 20:20 ` [RFC PATCH 1/5] mm: zswap: decompress into a folio subpage fujunjie
` (6 more replies)
0 siblings, 7 replies; 17+ messages in thread
From: fujunjie @ 2026-05-08 20:18 UTC (permalink / raw)
To: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner, Nhat Pham,
Yosry Ahmed
Cc: linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
Chengming Zhou, Baoquan He, Lorenzo Stoakes
Hi,
This RFC explores anonymous large folio swapin when a contiguous swap
range is backed consistently by zswap.
Large folio swapout to zswap is already supported by storing each base
page in the folio as a separate zswap entry. The anonymous synchronous
swapin path has remained order-0 once zswap has ever been enabled:
zswap_load() rejected large folios, and alloc_swap_folio() avoided large
folio allocation to protect against mixed backend ranges.
This RFC keeps the scope intentionally conservative. It does not try to
read one large folio from mixed zswap and disk backends, and it does not
change shmem swapin. Shmem still has its existing zswap fallback and is
left for later discussion. For anonymous swapin, the backend rule is made
explicit:
- a range fully absent from zswap can keep using the disk backend
- a range fully present in zswap can be decompressed into a large folio
- a mixed zswap/non-zswap range falls back to order-0 swapin
The series adds a zswap range query helper, teaches zswap_load() to
decompress all-zswap large folios one base page at a time, accounts mTHP
swpin for zswap-loaded large folios, retries synchronous large-folio
insertion races with order-0 swapin, and removes the anonymous
zswap-never-enabled restriction once mixed ranges are filtered.
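For reference, the backend rule above boils down to a uniformity check on
the faulting range. A minimal sketch, using the zswap_entry_batch() helper
added in patch 2 (the wrapper name here is made up for illustration and is
not the exact hunk from patch 5):

	/*
	 * A large folio swapin candidate range must be uniform: either
	 * fully in zswap or fully absent from it. Anything short of
	 * nr_pages from zswap_entry_batch() means a mixed range, and the
	 * fault falls back to order-0 swapin.
	 */
	static bool swapin_range_is_uniform(swp_entry_t entry, int nr_pages)
	{
		return zswap_entry_batch(entry, nr_pages, NULL) == nr_pages;
	}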
I tested the series with a full bzImage build using CONFIG_ZSWAP=y,
CONFIG_ZRAM=y, CONFIG_MEMCG=y and CONFIG_THP_SWAP=y.
The QEMU/KVM runs covered both the fully-zswap path and the mixed-backend
fallback path. In the all-zswap run, a 512MiB anonymous mapping was faulted
as 8192 64KiB groups, reclaimed into zswap, and faulted back. Reclaim
reported mthp64_zswpout=8192 and zswpout=131072. Refault then reported
mthp64_swpin=8192 and zswpin=131072, and pagemap/kpageflags showed 8192
order-4 THP groups in the mapping.
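Roughly, the all-zswap test flow looked like the sketch below (simplified;
MADV_PAGEOUT as the reclaim trigger and the 64KiB mTHP size being enabled
via sysfs are assumptions of this sketch, not a claim about the exact test
program):

	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		const unsigned long sz = 512UL << 20, step = 64UL << 10;
		char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		for (unsigned long off = 0; off < sz; off += step)
			memset(p + off, 0xaa, step);	/* fault 64KiB groups */

		madvise(p, sz, MADV_PAGEOUT);		/* reclaim into zswap */

		for (unsigned long off = 0; off < sz; off += step)
			*(volatile char *)(p + off);	/* refault the range */

		return 0;
	}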
In the mixed-backend run, the workload used a 64MiB anonymous mapping
split into 1024 64KiB groups. After shrinker debugfs wrote back exactly
one zswap base-page entry, refault left 1023 order-4 THP groups and one
order-0 mixed group. The kernel stats matched that shape:
mthp64_swpin=1023, zswpin=16383 and zswpwb=1.
CONFIG_SHRINKER_DEBUG is only a test aid for making that one zswap
writeback deterministic; it is not required by the implementation.
Nhat Pham's active Virtual Swap Space series is adjacent work. It moves
swap cache and zswap entry state into a virtual swap descriptor, and lists
mixed backing THP swapin as a future use case. This RFC is independent and
works with the current swap/zswap infrastructure, but may need rebasing if
VSS lands first.
Feedback would be especially helpful on:
1. whether it makes sense to support all-zswap large folio swapin first,
while keeping mixed zswap/disk ranges on the order-0 fallback path
2. whether a follow-up for mixed zswap/disk large folio swapin would be
useful after this RFC
Thanks.
---
fujunjie (5):
mm: zswap: decompress into a folio subpage
mm: zswap: add a zswap entry batch helper
mm: zswap: load fully stored large folios
mm: swap: fall back to order-0 after large swapin races
mm: swap: allow zswap-backed large folio swapin
Documentation/admin-guide/mm/transhuge.rst | 4 +-
include/linux/zswap.h | 9 ++
mm/memory.c | 67 ++++++++-----
mm/swap_state.c | 23 +++--
mm/zswap.c | 111 ++++++++++++++++-----
5 files changed, 154 insertions(+), 60 deletions(-)
base-commit: 917719c412c48687d4a176965d1fa35320ec457c
--
2.34.1
^ permalink raw reply [flat|nested] 17+ messages in thread
* [RFC PATCH 1/5] mm: zswap: decompress into a folio subpage
2026-05-08 20:18 [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin fujunjie
@ 2026-05-08 20:20 ` fujunjie
2026-05-08 20:20 ` [RFC PATCH 2/5] mm: zswap: add a zswap entry batch helper fujunjie
` (5 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: fujunjie @ 2026-05-08 20:20 UTC (permalink / raw)
To: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner, Nhat Pham,
Yosry Ahmed
Cc: linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
Chengming Zhou, Baoquan He, Lorenzo Stoakes
zswap_decompress() always writes to offset 0 of the target folio. That
is sufficient while zswap only loads order-0 folios, but large folio
swapin needs to fill each base page from its own zswap entry.
Pass the base-page index to zswap_decompress() and use it for the kmap
and scatterlist output offsets. Existing callers pass index 0, so this
is a preparatory change with no intended behavior change.
Signed-off-by: fujunjie <fujunjie1@qq.com>
---
mm/zswap.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 4b5149173b0e..afe38dfc5a29 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -921,12 +921,14 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
return comp_ret == 0 && alloc_ret == 0;
}
-static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
+static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio,
+ unsigned long index)
{
struct zswap_pool *pool = entry->pool;
struct scatterlist input[2]; /* zsmalloc returns an SG list 1-2 entries */
struct scatterlist output;
struct crypto_acomp_ctx *acomp_ctx;
+ size_t offset = index * PAGE_SIZE;
int ret = 0, dlen;
acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
@@ -939,14 +941,14 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
WARN_ON_ONCE(input->length != PAGE_SIZE);
- dst = kmap_local_folio(folio, 0);
+ dst = kmap_local_folio(folio, offset);
memcpy_from_sglist(dst, input, 0, PAGE_SIZE);
dlen = PAGE_SIZE;
kunmap_local(dst);
flush_dcache_folio(folio);
} else {
sg_init_table(&output, 1);
- sg_set_folio(&output, folio, PAGE_SIZE, 0);
+ sg_set_folio(&output, folio, PAGE_SIZE, offset);
acomp_request_set_params(acomp_ctx->req, input, &output,
entry->length, PAGE_SIZE);
ret = crypto_acomp_decompress(acomp_ctx->req);
@@ -1034,7 +1036,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
goto out;
}
- if (!zswap_decompress(entry, folio)) {
+ if (!zswap_decompress(entry, folio, 0)) {
ret = -EIO;
goto out;
}
@@ -1611,7 +1613,7 @@ int zswap_load(struct folio *folio)
if (!entry)
return -ENOENT;
- if (!zswap_decompress(entry, folio)) {
+ if (!zswap_decompress(entry, folio, 0)) {
folio_unlock(folio);
return -EIO;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [RFC PATCH 2/5] mm: zswap: add a zswap entry batch helper
2026-05-08 20:18 [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin fujunjie
2026-05-08 20:20 ` [RFC PATCH 1/5] mm: zswap: decompress into a folio subpage fujunjie
@ 2026-05-08 20:20 ` fujunjie
2026-05-08 20:20 ` [RFC PATCH 3/5] mm: zswap: load fully stored large folios fujunjie
` (4 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: fujunjie @ 2026-05-08 20:20 UTC (permalink / raw)
To: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner, Nhat Pham,
Yosry Ahmed
Cc: linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
Chengming Zhou, Baoquan He, Lorenzo Stoakes
Large folio swapin has to know whether a contiguous swap range is backed
consistently by zswap. A range that is partly in zswap and partly on
disk cannot be read by the existing whole-folio swap_read_folio()
backend selection.
Add zswap_entry_batch(), mirroring the existing zeromap batch query: it
returns how many entries starting at a swap entry share the same zswap
presence, and optionally reports the first entry's state. The
CONFIG_ZSWAP=n stub reports that the whole range is not in zswap.
Signed-off-by: fujunjie <fujunjie1@qq.com>
---
include/linux/zswap.h | 9 +++++++++
mm/zswap.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+)
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..b9d71f027200 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -27,6 +27,7 @@ struct zswap_lruvec_state {
unsigned long zswap_total_pages(void);
bool zswap_store(struct folio *folio);
int zswap_load(struct folio *folio);
+int zswap_entry_batch(swp_entry_t swp, int max_nr, bool *is_zswap);
void zswap_invalidate(swp_entry_t swp);
int zswap_swapon(int type, unsigned long nr_pages);
void zswap_swapoff(int type);
@@ -49,6 +50,14 @@ static inline int zswap_load(struct folio *folio)
return -ENOENT;
}
+static inline int zswap_entry_batch(swp_entry_t swp, int max_nr,
+ bool *is_zswap)
+{
+ if (is_zswap)
+ *is_zswap = false;
+ return max_nr;
+}
+
static inline void zswap_invalidate(swp_entry_t swp) {}
static inline int zswap_swapon(int type, unsigned long nr_pages)
{
diff --git a/mm/zswap.c b/mm/zswap.c
index afe38dfc5a29..27c14b8edd15 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -234,6 +234,42 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>> ZSWAP_ADDRESS_SPACE_SHIFT];
}
+/*
+ * Return the number of contiguous swap entries that share the same zswap
+ * presence as @swp. If @is_zswap is not NULL, return @swp's zswap status.
+ *
+ * Context: callers must keep the swap type alive. The result is a snapshot
+ * of zswap xarray presence; callers must tolerate races by rechecking under
+ * the lock that matters for their operation or by falling back safely.
+ */
+int zswap_entry_batch(swp_entry_t swp, int max_nr, bool *is_zswap)
+{
+ pgoff_t offset = swp_offset(swp);
+ bool first;
+ int i;
+
+ if (zswap_never_enabled()) {
+ if (is_zswap)
+ *is_zswap = false;
+ return max_nr;
+ }
+
+ first = !!xa_load(swap_zswap_tree(swp), offset);
+ if (is_zswap)
+ *is_zswap = first;
+
+ for (i = 1; i < max_nr; i++) {
+ swp_entry_t entry = swp_entry(swp_type(swp), offset + i);
+ bool present;
+
+ present = !!xa_load(swap_zswap_tree(entry), offset + i);
+ if (present != first)
+ return i;
+ }
+
+ return max_nr;
+}
+
#define zswap_pool_debug(msg, p) \
pr_debug("%s pool %s\n", msg, (p)->tfm_name)
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [RFC PATCH 3/5] mm: zswap: load fully stored large folios
2026-05-08 20:18 [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin fujunjie
2026-05-08 20:20 ` [RFC PATCH 1/5] mm: zswap: decompress into a folio subpage fujunjie
2026-05-08 20:20 ` [RFC PATCH 2/5] mm: zswap: add a zswap entry batch helper fujunjie
@ 2026-05-08 20:20 ` fujunjie
2026-05-11 22:38 ` Yosry Ahmed
2026-05-08 20:20 ` [RFC PATCH 4/5] mm: swap: fall back to order-0 after large swapin races fujunjie
` (3 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: fujunjie @ 2026-05-08 20:20 UTC (permalink / raw)
To: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner, Nhat Pham,
Yosry Ahmed
Cc: linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
Chengming Zhou, Baoquan He, Lorenzo Stoakes
zswap_store() already stores every base page of a large folio as a
separate zswap entry and tears the whole folio back down on store
failure. The load side still rejects any large folio, which forces the
swapin path to avoid mTHP swapin once zswap has ever been enabled.
Use zswap_entry_batch() to distinguish three cases: the whole range is
absent from zswap and should fall through to the disk backend, the whole
range is present and can be decompressed one base page at a time, or the
range is mixed and must be treated as an invalid large-folio backend
selection.
After all entries decompress successfully, mark the folio uptodate and
dirty, account the mTHP swpin stat once for the folio, account one ZSWPIN
event per base page, and invalidate each zswap entry because the
swapcache folio becomes authoritative.
Signed-off-by: fujunjie <fujunjie1@qq.com>
---
Documentation/admin-guide/mm/transhuge.rst | 4 +-
mm/zswap.c | 65 ++++++++++++++--------
2 files changed, 45 insertions(+), 24 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 5fbc3d89bb07..05456906aff6 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -644,8 +644,8 @@ zswpout
piece without splitting.
swpin
- is incremented every time a huge page is swapped in from a non-zswap
- swap device in one piece.
+ is incremented every time a huge page is swapped in from swap I/O or
+ zswap in one piece.
swpin_fallback
is incremented if swapin fails to allocate or charge a huge page
diff --git a/mm/zswap.c b/mm/zswap.c
index 27c14b8edd15..863ca1e896ed 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -28,6 +28,7 @@
#include <crypto/acompress.h>
#include <crypto/scatterwalk.h>
#include <linux/zswap.h>
+#include <linux/huge_mm.h>
#include <linux/mm_types.h>
#include <linux/page-flags.h>
#include <linux/swapops.h>
@@ -1614,20 +1615,23 @@ bool zswap_store(struct folio *folio)
* NOT marked up-to-date, so that an IO error is emitted (e.g. do_swap_page()
* will SIGBUS).
*
- * -EINVAL: if the swapped out content was in zswap, but the page belongs
- * to a large folio, which is not supported by zswap. The folio is unlocked,
- * but NOT marked up-to-date, so that an IO error is emitted (e.g.
- * do_swap_page() will SIGBUS).
+ * -EINVAL: if the folio spans a mix of zswap and non-zswap entries. The
+ * folio is unlocked, but NOT marked up-to-date, so that an IO error is
+ * emitted (e.g. do_swap_page() will SIGBUS). Large folio swapin should
+ * reject such ranges before calling zswap_load().
*
- * -ENOENT: if the swapped out content was not in zswap. The folio remains
+ * -ENOENT: if the swapped out content was not in zswap. For a large folio,
+ * this means the whole folio range was not in zswap. The folio remains
* locked on return.
*/
int zswap_load(struct folio *folio)
{
swp_entry_t swp = folio->swap;
pgoff_t offset = swp_offset(swp);
- struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;
+ int nr_pages = folio_nr_pages(folio);
+ bool is_zswap;
+ int index;
VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1635,30 +1639,36 @@ int zswap_load(struct folio *folio)
if (zswap_never_enabled())
return -ENOENT;
- /*
- * Large folios should not be swapped in while zswap is being used, as
- * they are not properly handled. Zswap does not properly load large
- * folios, and a large folio may only be partially in zswap.
- */
- if (WARN_ON_ONCE(folio_test_large(folio))) {
+ if (zswap_entry_batch(swp, nr_pages, &is_zswap) != nr_pages) {
+ WARN_ON_ONCE(folio_test_large(folio));
folio_unlock(folio);
return -EINVAL;
}
- entry = xa_load(tree, offset);
- if (!entry)
+ if (!is_zswap)
return -ENOENT;
- if (!zswap_decompress(entry, folio, 0)) {
- folio_unlock(folio);
- return -EIO;
+ for (index = 0; index < nr_pages; index++) {
+ swp_entry_t entry_swp = swp_entry(swp_type(swp),
+ offset + index);
+ struct xarray *tree = swap_zswap_tree(entry_swp);
+
+ entry = xa_load(tree, offset + index);
+ if (WARN_ON_ONCE(!entry)) {
+ folio_unlock(folio);
+ return -EINVAL;
+ }
+
+ if (!zswap_decompress(entry, folio, index)) {
+ folio_unlock(folio);
+ return -EIO;
+ }
}
folio_mark_uptodate(folio);
- count_vm_event(ZSWPIN);
- if (entry->objcg)
- count_objcg_events(entry->objcg, ZSWPIN, 1);
+ count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN);
+ count_vm_events(ZSWPIN, nr_pages);
/*
* We are reading into the swapcache, invalidate zswap entry.
@@ -1668,8 +1678,19 @@ int zswap_load(struct folio *folio)
* compression work.
*/
folio_mark_dirty(folio);
- xa_erase(tree, offset);
- zswap_entry_free(entry);
+
+ for (index = 0; index < nr_pages; index++) {
+ swp_entry_t entry_swp = swp_entry(swp_type(swp),
+ offset + index);
+ struct xarray *tree = swap_zswap_tree(entry_swp);
+
+ entry = xa_erase(tree, offset + index);
+ if (WARN_ON_ONCE(!entry))
+ continue;
+ if (entry->objcg)
+ count_objcg_events(entry->objcg, ZSWPIN, 1);
+ zswap_entry_free(entry);
+ }
folio_unlock(folio);
return 0;
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [RFC PATCH 4/5] mm: swap: fall back to order-0 after large swapin races
2026-05-08 20:18 [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin fujunjie
` (2 preceding siblings ...)
2026-05-08 20:20 ` [RFC PATCH 3/5] mm: zswap: load fully stored large folios fujunjie
@ 2026-05-08 20:20 ` fujunjie
2026-05-11 13:03 ` David Hildenbrand (Arm)
2026-05-08 20:20 ` [RFC PATCH 5/5] mm: swap: allow zswap-backed large folio swapin fujunjie
` (2 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: fujunjie @ 2026-05-08 20:20 UTC (permalink / raw)
To: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner, Nhat Pham,
Yosry Ahmed
Cc: linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
Chengming Zhou, Baoquan He, Lorenzo Stoakes
swapin_folio() documents that a large folio insertion race returns NULL
so the caller can fall back to order-0 swapin. do_swap_page() currently
turns that NULL into VM_FAULT_OOM if the PTE is unchanged, which is
harsher than necessary and gets in the way of rejecting large folio
ranges for backend reasons.
Move the synchronous swapin sequence into a helper and retry with an
order-0 folio when a large folio cannot be inserted into the swap cache.
Count the event as an mTHP swapin fallback before dropping the failed
large allocation.
Signed-off-by: fujunjie <fujunjie1@qq.com>
---
mm/memory.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 39 insertions(+), 11 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index ea6568571131..84e3b77b8293 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4757,6 +4757,44 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+static struct folio *swapin_synchronous_folio(swp_entry_t entry,
+ struct vm_fault *vmf)
+{
+ struct folio *swapcache, *folio;
+ bool large;
+ int order;
+
+ folio = alloc_swap_folio(vmf);
+ if (!folio)
+ return NULL;
+
+ large = folio_test_large(folio);
+ order = folio_order(folio);
+
+ /*
+ * folio is charged, so swapin can only fail due to raced swapin and
+ * return NULL.
+ */
+ swapcache = swapin_folio(entry, folio);
+ if (swapcache == folio)
+ return folio;
+
+ if (!swapcache && large)
+ count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
+ folio_put(folio);
+ if (swapcache || !large)
+ return swapcache;
+
+ folio = __alloc_swap_folio(vmf);
+ if (!folio)
+ return NULL;
+
+ swapcache = swapin_folio(entry, folio);
+ if (swapcache != folio)
+ folio_put(folio);
+ return swapcache;
+}
+
/* Sanity check that a folio is fully exclusive */
static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
unsigned int nr_pages)
@@ -4860,17 +4898,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
swap_update_readahead(folio, vma, vmf->address);
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
- folio = alloc_swap_folio(vmf);
- if (folio) {
- /*
- * folio is charged, so swapin can only fail due
- * to raced swapin and return NULL.
- */
- swapcache = swapin_folio(entry, folio);
- if (swapcache != folio)
- folio_put(folio);
- folio = swapcache;
- }
+ folio = swapin_synchronous_folio(entry, vmf);
} else {
folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [RFC PATCH 5/5] mm: swap: allow zswap-backed large folio swapin
2026-05-08 20:18 [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin fujunjie
` (3 preceding siblings ...)
2026-05-08 20:20 ` [RFC PATCH 4/5] mm: swap: fall back to order-0 after large swapin races fujunjie
@ 2026-05-08 20:20 ` fujunjie
2026-05-11 22:13 ` [RFC PATCH 0/5] mm: support zswap-backed anonymous " Yosry Ahmed
2026-05-12 4:20 ` Alexandre Ghiti
6 siblings, 0 replies; 17+ messages in thread
From: fujunjie @ 2026-05-08 20:20 UTC (permalink / raw)
To: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner, Nhat Pham,
Yosry Ahmed
Cc: linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
Chengming Zhou, Baoquan He, Lorenzo Stoakes
alloc_swap_folio() has been falling back to order-0 in the anonymous
synchronous swapin path whenever zswap was ever enabled, because a large
folio range could contain a mixture of zswap and non-zswap entries and
zswap_load() could not handle large folios.
zswap_load() can now load a range that is fully present in zswap, and
zswap_entry_batch() can identify mixed zswap ranges. Use that check
alongside the existing zeromap and swapcache checks when selecting a large
folio for anonymous swapin, and recheck before inserting a large folio into
the swap cache while holding the swap cluster lock.
With mixed zswap ranges rejected and the insertion-race fallback in place,
remove the blanket zswap_never_enabled() fallback from the anonymous swapin
path so all-zswap and all-disk anonymous ranges can use mTHP swapin. Shmem
keeps its existing zswap fallback and is outside this RFC.
Signed-off-by: fujunjie <fujunjie1@qq.com>
---
mm/memory.c | 21 ++++++---------------
mm/swap_state.c | 23 +++++++++++++++--------
2 files changed, 21 insertions(+), 23 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 84e3b77b8293..0be249108de1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -78,6 +78,7 @@
#include <linux/sched/sysctl.h>
#include <linux/pgalloc.h>
#include <linux/uaccess.h>
+#include <linux/zswap.h>
#include <trace/events/kmem.h>
@@ -4635,13 +4636,11 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
return false;
- /*
- * swap_read_folio() can't handle the case a large folio is hybridly
- * from different backends. And they are likely corner cases. Similar
- * things might be added once zswap support large folios.
- */
+ /* swap_read_folio() can't handle hybrid backend large folios. */
if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
return false;
+ if (unlikely(zswap_entry_batch(entry, nr_pages, NULL) != nr_pages))
+ return false;
if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
return false;
@@ -4690,14 +4689,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
if (unlikely(userfaultfd_armed(vma)))
goto fallback;
- /*
- * A large swapped out folio could be partially or fully in zswap. We
- * lack handling for such cases, so fallback to swapping in order-0
- * folio.
- */
- if (!zswap_never_enabled())
- goto fallback;
-
entry = softleaf_from_pte(vmf->orig_pte);
/*
* Get a list of all the (large) orders below PMD_ORDER that are enabled
@@ -4772,8 +4763,8 @@ static struct folio *swapin_synchronous_folio(swp_entry_t entry,
order = folio_order(folio);
/*
- * folio is charged, so swapin can only fail due to raced swapin and
- * return NULL.
+ * folio is charged, so NULL means the large folio could not be
+ * inserted and needs order-0 fallback.
*/
swapcache = swapin_folio(entry, folio);
if (swapcache == folio)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1415a5c54a43..4e58fad5e5f0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -22,6 +22,7 @@
#include <linux/vmalloc.h>
#include <linux/huge_mm.h>
#include <linux/shmem_fs.h>
+#include <linux/zswap.h>
#include "internal.h"
#include "swap_table.h"
#include "swap.h"
@@ -207,6 +208,11 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
if (swp_tb_is_shadow(old_tb))
shadow = swp_tb_to_shadow(old_tb);
} while (++ci_off < ci_end);
+ if (unlikely(folio_test_large(folio) &&
+ zswap_entry_batch(entry, nr_pages, NULL) != nr_pages)) {
+ err = -EAGAIN;
+ goto failed;
+ }
__swap_cache_add_folio(ci, folio, entry);
swap_cluster_unlock(ci);
if (shadowp)
@@ -460,7 +466,8 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
*
* Context: Caller must protect the swap device with reference count or locks.
* Return: Returns the folio being added on success. Returns the existing folio
- * if @entry is already cached. Returns NULL if raced with swapin or swapoff.
+ * if @entry is already cached. Returns NULL if raced with swapin or swapoff,
+ * or if a large folio fails a backend recheck before insertion.
*/
static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
struct folio *folio,
@@ -483,10 +490,10 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
/*
* Large order allocation needs special handling on
- * race: if a smaller folio exists in cache, swapin needs
- * to fallback to order 0, and doing a swap cache lookup
- * might return a folio that is irrelevant to the faulting
- * entry because @entry is aligned down. Just return NULL.
+ * race or backend recheck failure: swapin needs to fall back
+ * to order 0, and doing a swap cache lookup might return a
+ * folio that is irrelevant to the faulting entry because
+ * @entry is aligned down. Just return NULL.
*/
if (ret != -EEXIST || folio_test_large(folio))
goto failed;
@@ -567,9 +574,9 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
* with the folio size.
*
* Return: returns pointer to @folio on success. If folio is a large folio
- * and this raced with another swapin, NULL will be returned to allow fallback
- * to order 0. Else, if another folio was already added to the swap cache,
- * return that swap cache folio instead.
+ * and it raced with another swapin or failed a backend recheck, NULL will be
+ * returned to allow fallback to order 0. Else, if another folio was already
+ * added to the swap cache, return that swap cache folio instead.
*/
struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
{
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 4/5] mm: swap: fall back to order-0 after large swapin races
2026-05-08 20:20 ` [RFC PATCH 4/5] mm: swap: fall back to order-0 after large swapin races fujunjie
@ 2026-05-11 13:03 ` David Hildenbrand (Arm)
2026-05-11 14:59 ` Kairui Song
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-11 13:03 UTC (permalink / raw)
To: fujunjie, Andrew Morton, Chris Li, Kairui Song, Johannes Weiner,
Nhat Pham, Yosry Ahmed
Cc: linux-mm, linux-kernel, linux-doc, Jonathan Corbet, Ryan Roberts,
Barry Song, Baolin Wang, Chengming Zhou, Baoquan He,
Lorenzo Stoakes
On 5/8/26 22:20, fujunjie wrote:
> swapin_folio() documents that a large folio insertion race returns NULL
> so the caller can fall back to order-0 swapin. do_swap_page() currently
> turns that NULL into VM_FAULT_OOM if the PTE is unchanged, which is
> harsher than necessary and gets in the way of rejecting large folio
> ranges for backend reasons.
>
> Move the synchronous swapin sequence into a helper and retry with an
> order-0 folio when a large folio cannot be inserted into the swap cache.
> Count the event as an mTHP swapin fallback before dropping the failed
> large allocation.
>
> Signed-off-by: fujunjie <fujunjie1@qq.com>
> ---
> mm/memory.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 39 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index ea6568571131..84e3b77b8293 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4757,6 +4757,44 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> }
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> +static struct folio *swapin_synchronous_folio(swp_entry_t entry,
> + struct vm_fault *vmf)
> +{
> + struct folio *swapcache, *folio;
> + bool large;
> + int order;
> +
> + folio = alloc_swap_folio(vmf);
> + if (!folio)
> + return NULL;
> +
> + large = folio_test_large(folio);
> + order = folio_order(folio);
> +
> + /*
> + * folio is charged, so swapin can only fail due to raced swapin and
> + * return NULL.
> + */
> + swapcache = swapin_folio(entry, folio);
> + if (swapcache == folio)
> + return folio;
> +
> + if (!swapcache && large)
> + count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
> + folio_put(folio);
> + if (swapcache || !large)
> + return swapcache;
> +
> + folio = __alloc_swap_folio(vmf);
> + if (!folio)
> + return NULL;
> +
> + swapcache = swapin_folio(entry, folio);
> + if (swapcache != folio)
> + folio_put(folio);
> + return swapcache;
> +}
> +
> /* Sanity check that a folio is fully exclusive */
> static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
> unsigned int nr_pages)
> @@ -4860,17 +4898,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> swap_update_readahead(folio, vma, vmf->address);
> if (!folio) {
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> - folio = alloc_swap_folio(vmf);
> - if (folio) {
> - /*
> - * folio is charged, so swapin can only fail due
> - * to raced swapin and return NULL.
> - */
> - swapcache = swapin_folio(entry, folio);
> - if (swapcache != folio)
> - folio_put(folio);
> - folio = swapcache;
> - }
> + folio = swapin_synchronous_folio(entry, vmf);
> } else {
> folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
> }
There are some upcoming changes with:
https://lore.kernel.org/r/20260421-swap-table-p4-v3-5-2f23759a76bc@tencent.com
All of that logic you have in swapin_synchronous_folio() should ideally not
go into memory.c, but into some swap-specific code.
But
https://lore.kernel.org/r/20260421-swap-table-p4-v3-0-2f23759a76bc@tencent.com
already changes a lot of that.
--
Cheers,
David
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 4/5] mm: swap: fall back to order-0 after large swapin races
2026-05-11 13:03 ` David Hildenbrand (Arm)
@ 2026-05-11 14:59 ` Kairui Song
2026-05-12 7:57 ` Fujunjie
0 siblings, 1 reply; 17+ messages in thread
From: Kairui Song @ 2026-05-11 14:59 UTC (permalink / raw)
To: fujunjie, David Hildenbrand (Arm)
Cc: Andrew Morton, Chris Li, Johannes Weiner, Nhat Pham, Yosry Ahmed,
linux-mm, linux-kernel, linux-doc, Jonathan Corbet, Ryan Roberts,
Barry Song, Baolin Wang, Chengming Zhou, Baoquan He,
Lorenzo Stoakes
On Mon, May 11, 2026 at 9:14 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 5/8/26 22:20, fujunjie wrote:
> > swapin_folio() documents that a large folio insertion race returns NULL
> > so the caller can fall back to order-0 swapin. do_swap_page() currently
> > turns that NULL into VM_FAULT_OOM if the PTE is unchanged, which is
> > harsher than necessary and gets in the way of rejecting large folio
> > ranges for backend reasons.
> >
> > Move the synchronous swapin sequence into a helper and retry with an
> > order-0 folio when a large folio cannot be inserted into the swap cache.
> > Count the event as an mTHP swapin fallback before dropping the failed
> > large allocation.
> >
> > Signed-off-by: fujunjie <fujunjie1@qq.com>
> > ---
> > mm/memory.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
> > 1 file changed, 39 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index ea6568571131..84e3b77b8293 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4757,6 +4757,44 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > }
> > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > +static struct folio *swapin_synchronous_folio(swp_entry_t entry,
> > + struct vm_fault *vmf)
> > +{
> > + struct folio *swapcache, *folio;
> > + bool large;
> > + int order;
> > +
> > + folio = alloc_swap_folio(vmf);
> > + if (!folio)
> > + return NULL;
> > +
> > + large = folio_test_large(folio);
> > + order = folio_order(folio);
> > +
> > + /*
> > + * folio is charged, so swapin can only fail due to raced swapin and
> > + * return NULL.
> > + */
> > + swapcache = swapin_folio(entry, folio);
> > + if (swapcache == folio)
> > + return folio;
> > +
> > + if (!swapcache && large)
> > + count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
> > + folio_put(folio);
> > + if (swapcache || !large)
> > + return swapcache;
> > +
> > + folio = __alloc_swap_folio(vmf);
> > + if (!folio)
> > + return NULL;
> > +
> > + swapcache = swapin_folio(entry, folio);
> > + if (swapcache != folio)
> > + folio_put(folio);
> > + return swapcache;
> > +}
> > +
> > /* Sanity check that a folio is fully exclusive */
> > static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
> > unsigned int nr_pages)
> > @@ -4860,17 +4898,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > swap_update_readahead(folio, vma, vmf->address);
> > if (!folio) {
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > - folio = alloc_swap_folio(vmf);
> > - if (folio) {
> > - /*
> > - * folio is charged, so swapin can only fail due
> > - * to raced swapin and return NULL.
> > - */
> > - swapcache = swapin_folio(entry, folio);
> > - if (swapcache != folio)
> > - folio_put(folio);
> > - folio = swapcache;
> > - }
> > + folio = swapin_synchronous_folio(entry, vmf);
> > } else {
> > folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
> > }
>
> There are some upcoming changes with:
>
> https://lore.kernel.org/r/20260421-swap-table-p4-v3-5-2f23759a76bc@tencent.com
>
>
> All of that logic you have in swapin_synchronous_folio() should ideally not
> go into memory.c, but into some swap-specific code.
>
> But
>
> https://lore.kernel.org/r/20260421-swap-table-p4-v3-0-2f23759a76bc@tencent.com
Thanks for mentioning this!
I think Junjie's change fits better after that change indeed. And I
checked the code, it should fit easily too.
It's already strange enough that THP swapin is bundled with
synchronous swapin; we'd better not make it more divergent here or add
more bits into memory.c.
And this commit will limit it to anon, not shmem, which is another
strange detail. Or we'll have to repeat everything and copy this code
to shmem.c...
Once all swap-ins use basically the same path as in that series, all
swap-ins will be able to have similar THP and zswap THP support too.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin
2026-05-08 20:18 [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin fujunjie
` (4 preceding siblings ...)
2026-05-08 20:20 ` [RFC PATCH 5/5] mm: swap: allow zswap-backed large folio swapin fujunjie
@ 2026-05-11 22:13 ` Yosry Ahmed
2026-05-12 6:14 ` David Hildenbrand (Arm)
2026-05-12 8:02 ` Fujunjie
2026-05-12 4:20 ` Alexandre Ghiti
6 siblings, 2 replies; 17+ messages in thread
From: Yosry Ahmed @ 2026-05-11 22:13 UTC (permalink / raw)
To: fujunjie
Cc: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner, Nhat Pham,
linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
Chengming Zhou, Baoquan He, Lorenzo Stoakes
On Fri, May 08, 2026 at 08:18:29PM +0000, fujunjie wrote:
> Hi,
>
> This RFC explores anonymous large folio swapin when a contiguous swap
> range is backed consistently by zswap.
>
> Large folio swapout to zswap is already supported by storing each base
> page in the folio as a separate zswap entry. The anonymous synchronous
> swapin path has remained order-0 once zswap has ever been enabled:
> zswap_load() rejected large folios, and alloc_swap_folio() avoided large
> folio allocation to protect against mixed backend ranges.
>
> This RFC keeps the scope intentionally conservative. It does not try to
> read one large folio from mixed zswap and disk backends, and it does not
> change shmem swapin. Shmem still has its existing zswap fallback and is
> left for later discussion. For anonymous swapin, the backend rule is made
> explicit:
>
> - a range fully absent from zswap can keep using the disk backend
> - a range fully present in zswap can be decompressed into a large folio
> - a mixed zswap/non-zswap range falls back to order-0 swapin
>
> The series adds a zswap range query helper, teaches zswap_load() to
> decompress all-zswap large folios one base page at a time, accounts mTHP
> swpin for zswap-loaded large folios, retries synchronous large-folio
> insertion races with order-0 swapin, and removes the anonymous
> zswap-never-enabled restriction once mixed ranges are filtered.
>
> I tested the series with a full bzImage build using CONFIG_ZSWAP=y,
> CONFIG_ZRAM=y, CONFIG_MEMCG=y and CONFIG_THP_SWAP=y.
>
> The QEMU/KVM runs covered both the fully-zswap path and the mixed-backend
> fallback path. In the all-zswap run, a 512MiB anonymous mapping was faulted
> as 8192 64KiB groups, reclaimed into zswap, and faulted back. Reclaim
> reported mthp64_zswpout=8192 and zswpout=131072. Refault then reported
> mthp64_swpin=8192 and zswpin=131072, and pagemap/kpageflags showed 8192
> order-4 THP groups in the mapping.
>
> In the mixed-backend run, the workload used a 64MiB anonymous mapping
> split into 1024 64KiB groups. After shrinker debugfs wrote back exactly
> one zswap base-page entry, refault left 1023 order-4 THP groups and one
> order-0 mixed group. The kernel stats matched that shape:
> mthp64_swpin=1023, zswpin=16383 and zswpwb=1.
>
> CONFIG_SHRINKER_DEBUG is only a test aid for making that one zswap
> writeback deterministic; it is not required by the implementation.
>
> Nhat Pham's active Virtual Swap Space series is adjacent work. It moves
> swap cache and zswap entry state into a virtual swap descriptor, and lists
> mixed backing THP swapin as a future use case. This RFC is independent and
> works with the current swap/zswap infrastructure, but may need rebasing if
> VSS lands first.
>
> Feedback would be especially helpful on:
>
> 1. whether it makes sense to support all-zswap large folio swapin first,
> while keeping mixed zswap/disk ranges on the order-0 fallback path
I think so, yes, but based on my read of the code this RFC only affects
synchronous swapin, which is more-or-less zram+zswap. This is an
uncommon setup outside of testing.
> 2. whether a follow-up for mixed zswap/disk large folio swapin would be
> useful after this RFC
That's a heavier lift, and I think we should consider it in the
longer term, once the virtual swap work settles down. This is
conceptually not a zswap thing: you can have parts of a folio on disk,
in zswap, in the zeromap, etc. So it needs to be handled at a higher
layer (virtual swap, for example).
>
> Thanks.
>
> ---
>
> fujunjie (5):
> mm: zswap: decompress into a folio subpage
> mm: zswap: add a zswap entry batch helper
> mm: zswap: load fully stored large folios
> mm: swap: fall back to order-0 after large swapin races
> mm: swap: allow zswap-backed large folio swapin
>
> Documentation/admin-guide/mm/transhuge.rst | 4 +-
> include/linux/zswap.h | 9 ++
> mm/memory.c | 67 ++++++++-----
> mm/swap_state.c | 23 +++--
> mm/zswap.c | 111 ++++++++++++++++-----
> 5 files changed, 154 insertions(+), 60 deletions(-)
>
>
> base-commit: 917719c412c48687d4a176965d1fa35320ec457c
> --
> 2.34.1
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 3/5] mm: zswap: load fully stored large folios
2026-05-08 20:20 ` [RFC PATCH 3/5] mm: zswap: load fully stored large folios fujunjie
@ 2026-05-11 22:38 ` Yosry Ahmed
2026-05-12 8:05 ` Fujunjie
0 siblings, 1 reply; 17+ messages in thread
From: Yosry Ahmed @ 2026-05-11 22:38 UTC (permalink / raw)
To: fujunjie
Cc: Andrew Morton, Johannes Weiner, Chris Li, Kairui Song, Nhat Pham,
linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
Chengming Zhou, Baoquan He, Lorenzo Stoakes, Michal Hocko,
Roman Gushchin, Shakeel Butt
On Fri, May 08, 2026 at 08:20:31PM +0000, fujunjie wrote:
> zswap_store() already stores every base page of a large folio as a
> separate zswap entry and tears the whole folio back down on store
> failure. The load side still rejects any large folio, which forces the
> swapin path to avoid mTHP swapin once zswap has ever been enabled.
>
> Use zswap_entry_batch() to distinguish three cases: the whole range is
> absent from zswap and should fall through to the disk backend, the whole
> range is present and can be decompressed one base page at a time, or the
> range is mixed and must be treated as an invalid large-folio backend
> selection.
>
> After all entries decompress successfully, mark the folio uptodate and
> dirty, account the mTHP swpin stat once for the folio, account one ZSWPIN
> event per base page, and invalidate each zswap entry because the
> swapcache folio becomes authoritative.
>
> Signed-off-by: fujunjie <fujunjie1@qq.com>
> ---
> Documentation/admin-guide/mm/transhuge.rst | 4 +-
> mm/zswap.c | 65 ++++++++++++++--------
> 2 files changed, 45 insertions(+), 24 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 5fbc3d89bb07..05456906aff6 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -644,8 +644,8 @@ zswpout
> piece without splitting.
>
> swpin
> - is incremented every time a huge page is swapped in from a non-zswap
> - swap device in one piece.
> + is incremented every time a huge page is swapped in from swap I/O or
> + zswap in one piece.
>
> swpin_fallback
> is incremented if swapin fails to allocate or charge a huge page
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 27c14b8edd15..863ca1e896ed 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -28,6 +28,7 @@
> #include <crypto/acompress.h>
> #include <crypto/scatterwalk.h>
> #include <linux/zswap.h>
> +#include <linux/huge_mm.h>
> #include <linux/mm_types.h>
> #include <linux/page-flags.h>
> #include <linux/swapops.h>
> @@ -1614,20 +1615,23 @@ bool zswap_store(struct folio *folio)
> * NOT marked up-to-date, so that an IO error is emitted (e.g. do_swap_page()
> * will SIGBUS).
> *
> - * -EINVAL: if the swapped out content was in zswap, but the page belongs
> - * to a large folio, which is not supported by zswap. The folio is unlocked,
> - * but NOT marked up-to-date, so that an IO error is emitted (e.g.
> - * do_swap_page() will SIGBUS).
> + * -EINVAL: if the folio spans a mix of zswap and non-zswap entries. The
> + * folio is unlocked, but NOT marked up-to-date, so that an IO error is
> + * emitted (e.g. do_swap_page() will SIGBUS). Large folio swapin should
> + * reject such ranges before calling zswap_load().
> *
> - * -ENOENT: if the swapped out content was not in zswap. The folio remains
> + * -ENOENT: if the swapped out content was not in zswap. For a large folio,
> + * this means the whole folio range was not in zswap. The folio remains
> * locked on return.
> */
> int zswap_load(struct folio *folio)
> {
> swp_entry_t swp = folio->swap;
> pgoff_t offset = swp_offset(swp);
> - struct xarray *tree = swap_zswap_tree(swp);
> struct zswap_entry *entry;
> + int nr_pages = folio_nr_pages(folio);
> + bool is_zswap;
> + int index;
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1635,30 +1639,36 @@ int zswap_load(struct folio *folio)
> if (zswap_never_enabled())
> return -ENOENT;
>
> - /*
> - * Large folios should not be swapped in while zswap is being used, as
> - * they are not properly handled. Zswap does not properly load large
> - * folios, and a large folio may only be partially in zswap.
> - */
> - if (WARN_ON_ONCE(folio_test_large(folio))) {
> + if (zswap_entry_batch(swp, nr_pages, &is_zswap) != nr_pages) {
> + WARN_ON_ONCE(folio_test_large(folio));
IIUC the condition can only be true for large folios, so I think just
WARN_ON_ONCE() in the if statement itself?
Taking a step back, this is doing xa_load() lookups that are then
repeated below. Maybe we should drop this check here and integrate the
check into the loop below (either all entries exist or none do)?
We might end up with a case where we decompress some parts of the folio
then return a failure, but this possibility already exists (e.g.
decompression failures).
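Something like the below, perhaps (rough, untested sketch; it also trusts
the caller to have filtered mixed ranges, so a mixed range whose first
entry is absent would still be mis-detected as all-absent):

	for (index = 0; index < nr_pages; index++) {
		swp_entry_t entry_swp = swp_entry(swp_type(swp),
						  offset + index);
		struct xarray *tree = swap_zswap_tree(entry_swp);

		entry = xa_load(tree, offset + index);
		if (!entry) {
			/* Whole range absent: let the disk backend handle it. */
			if (index == 0)
				return -ENOENT;
			/* Mixed range: should have been rejected before swapin. */
			WARN_ON_ONCE(1);
			folio_unlock(folio);
			return -EINVAL;
		}

		if (!zswap_decompress(entry, folio, index)) {
			folio_unlock(folio);
			return -EIO;
		}
	}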
> folio_unlock(folio);
> return -EINVAL;
> }
>
> - entry = xa_load(tree, offset);
> - if (!entry)
> + if (!is_zswap)
> return -ENOENT;
>
> - if (!zswap_decompress(entry, folio, 0)) {
> - folio_unlock(folio);
> - return -EIO;
> + for (index = 0; index < nr_pages; index++) {
> + swp_entry_t entry_swp = swp_entry(swp_type(swp),
> + offset + index);
> + struct xarray *tree = swap_zswap_tree(entry_swp);
> +
> + entry = xa_load(tree, offset + index);
> + if (WARN_ON_ONCE(!entry)) {
> + folio_unlock(folio);
> + return -EINVAL;
> + }
> +
> + if (!zswap_decompress(entry, folio, index)) {
> + folio_unlock(folio);
> + return -EIO;
> + }
> }
>
> folio_mark_uptodate(folio);
>
> - count_vm_event(ZSWPIN);
> - if (entry->objcg)
> - count_objcg_events(entry->objcg, ZSWPIN, 1);
> + count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN);
Not a problem with this patch, but I wonder why the different swapin
implementations are making this call instead of putting it at a
higher level (e.g. in alloc_swap_folio()).
> + count_vm_events(ZSWPIN, nr_pages);
>
> /*
> * We are reading into the swapcache, invalidate zswap entry.
> @@ -1668,8 +1678,19 @@ int zswap_load(struct folio *folio)
> * compression work.
> */
> folio_mark_dirty(folio);
> - xa_erase(tree, offset);
> - zswap_entry_free(entry);
> +
> + for (index = 0; index < nr_pages; index++) {
> + swp_entry_t entry_swp = swp_entry(swp_type(swp),
> + offset + index);
> + struct xarray *tree = swap_zswap_tree(entry_swp);
> +
> + entry = xa_erase(tree, offset + index);
> + if (WARN_ON_ONCE(!entry))
> + continue;
> + if (entry->objcg)
> + count_objcg_events(entry->objcg, ZSWPIN, 1);
Hmm I was wondering how the stat update should be handled here. It is
possible that a folio was swapped out from one memcg and is being
swapped in by a process in a different memcg. For order-0 folios, the
folio gets charged to the swapout memcg.
However, looking at alloc_swap_folio() ->
mem_cgroup_swapin_charge_folio(), it seems like we charge mTHP folios to
the swapout memcg of the first swap entry. This seems a bit arbitrary.
Focusing on the code here, it seems like we'll count the swapin against the
swapout memcg, which is good in terms of keeping ZSWPIN and ZSWPOUT
balanced. However, it seems like it's possible that the folio is being
charged to a different memcg here than the swapout memcg, so we may end
up counting ZSWPIN in one memcg but charging the folio in another.
I am not sure if this is practically a problem, but something doesn't
sound right. Johannes (and other memcg folks), were these the intended
charging semantics for mTHP swapin? Is it reasonable to count ZSWPIN in
one memcg and charge the folio in another?
> + zswap_entry_free(entry);
> + }
>
> folio_unlock(folio);
> return 0;
> --
> 2.34.1
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin
2026-05-08 20:18 [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin fujunjie
` (5 preceding siblings ...)
2026-05-11 22:13 ` [RFC PATCH 0/5] mm: support zswap-backed anonymous " Yosry Ahmed
@ 2026-05-12 4:20 ` Alexandre Ghiti
2026-05-12 7:46 ` Fujunjie
6 siblings, 1 reply; 17+ messages in thread
From: Alexandre Ghiti @ 2026-05-12 4:20 UTC (permalink / raw)
To: fujunjie
Cc: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner, Nhat Pham,
Yosry Ahmed, linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
Chengming Zhou, Baoquan He, Lorenzo Stoakes
Hi,
On Fri, May 8, 2026 at 10:18 PM fujunjie <fujunjie1@qq.com> wrote:
> >
> Hi,
>
> This RFC explores anonymous large folio swapin when a contiguous swap
> range is backed consistently by zswap.
>
> Large folio swapout to zswap is already supported by storing each base
> page in the folio as a separate zswap entry. The anonymous synchronous
> swapin path has remained order-0 once zswap has ever been enabled:
> zswap_load() rejected large folios, and alloc_swap_folio() avoided large
> folio allocation to protect against mixed backend ranges.
>
> This RFC keeps the scope intentionally conservative. It does not try to
> read one large folio from mixed zswap and disk backends, and it does not
> change shmem swapin. Shmem still has its existing zswap fallback and is
> left for later discussion. For anonymous swapin, the backend rule is made
> explicit:
>
> - a range fully absent from zswap can keep using the disk backend
> - a range fully present in zswap can be decompressed into a large folio
> - a mixed zswap/non-zswap range falls back to order-0 swapin
>
> The series adds a zswap range query helper, teaches zswap_load() to
> decompress all-zswap large folios one base page at a time, accounts mTHP
> swpin for zswap-loaded large folios, retries synchronous large-folio
> insertion races with order-0 swapin, and removes the anonymous
> zswap-never-enabled restriction once mixed ranges are filtered.
>
> I tested the series with a full bzImage build using CONFIG_ZSWAP=y,
> CONFIG_ZRAM=y, CONFIG_MEMCG=y and CONFIG_THP_SWAP=y.
>
> The QEMU/KVM runs covered both the fully-zswap path and the mixed-backend
> fallback path. In the all-zswap run, a 512MiB anonymous mapping was faulted
> as 8192 64KiB groups, reclaimed into zswap, and faulted back. Reclaim
> reported mthp64_zswpout=8192 and zswpout=131072. Refault then reported
> mthp64_swpin=8192 and zswpin=131072, and pagemap/kpageflags showed 8192
> order-4 THP groups in the mapping.
>
> In the mixed-backend run, the workload used a 64MiB anonymous mapping
> split into 1024 64KiB groups. After shrinker debugfs wrote back exactly
> one zswap base-page entry, refault left 1023 order-4 THP groups and one
> order-0 mixed group. The kernel stats matched that shape:
> mthp64_swpin=1023, zswpin=16383 and zswpwb=1.
>
> CONFIG_SHRINKER_DEBUG is only a test aid for making that one zswap
> writeback deterministic; it is not required by the implementation.
>
> Nhat Pham's active Virtual Swap Space series is adjacent work. It moves
> swap cache and zswap entry state into a virtual swap descriptor, and lists
> mixed backing THP swapin as a future use case. This RFC is independent and
> works with the current swap/zswap infrastructure, but may need rebasing if
> VSS lands first.
>
> Feedback would be especially helpful on:
>
> 1. whether it makes sense to support all-zswap large folio swapin first,
> while keeping mixed zswap/disk ranges on the order-0 fallback path
> 2. whether a follow-up for mixed zswap/disk large folio swapin would be
> useful after this RFC
>
> Thanks.
>
> ---
>
> fujunjie (5):
> mm: zswap: decompress into a folio subpage
> mm: zswap: add a zswap entry batch helper
> mm: zswap: load fully stored large folios
> mm: swap: fall back to order-0 after large swapin races
> mm: swap: allow zswap-backed large folio swapin
>
> Documentation/admin-guide/mm/transhuge.rst | 4 +-
> include/linux/zswap.h | 9 ++
> mm/memory.c | 67 ++++++++-----
> mm/swap_state.c | 23 +++--
> mm/zswap.c | 111 ++++++++++++++++-----
> 5 files changed, 154 insertions(+), 60 deletions(-)
>
>
> base-commit: 917719c412c48687d4a176965d1fa35320ec457c
> --
> 2.34.1
>
>
>
So I have been working on the exact same thing for some weeks now. My work
is based on Usama's series [1].
The problem with large folio swapin is that it can create swap thrashing:
to swap in a large folio, swap out may be necessary, as reported in [2].
I implemented quite a few throttling algorithms on top to try to avoid this
issue and so far, I have had mixed/inconsistent results.
How did you test this series? Did you encounter thrashing? Do you have
performance numbers?
Happy to talk more about this, thanks for your series!
Alex
[1]
https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
[2]
https://lore.kernel.org/all/SJ0PR11MB5678A864244B09FDE4D914EEC9402@SJ0PR11MB5678.namprd11.prod.outlook.com/
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin
2026-05-11 22:13 ` [RFC PATCH 0/5] mm: support zswap-backed anonymous " Yosry Ahmed
@ 2026-05-12 6:14 ` David Hildenbrand (Arm)
2026-05-12 19:19 ` Yosry Ahmed
2026-05-12 8:02 ` Fujunjie
1 sibling, 1 reply; 17+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-12 6:14 UTC (permalink / raw)
To: Yosry Ahmed, fujunjie
Cc: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner, Nhat Pham,
linux-mm, linux-kernel, linux-doc, Jonathan Corbet, Ryan Roberts,
Barry Song, Baolin Wang, Chengming Zhou, Baoquan He,
Lorenzo Stoakes
On 5/12/26 00:13, Yosry Ahmed wrote:
> On Fri, May 08, 2026 at 08:18:29PM +0000, fujunjie wrote:
>> Hi,
>>
>> This RFC explores anonymous large folio swapin when a contiguous swap
>> range is backed consistently by zswap.
>>
>> Large folio swapout to zswap is already supported by storing each base
>> page in the folio as a separate zswap entry. The anonymous synchronous
>> swapin path has remained order-0 once zswap has ever been enabled:
>> zswap_load() rejected large folios, and alloc_swap_folio() avoided large
>> folio allocation to protect against mixed backend ranges.
>>
>> This RFC keeps the scope intentionally conservative. It does not try to
>> read one large folio from mixed zswap and disk backends, and it does not
>> change shmem swapin. Shmem still has its existing zswap fallback and is
>> left for later discussion. For anonymous swapin, the backend rule is made
>> explicit:
>>
>> - a range fully absent from zswap can keep using the disk backend
>> - a range fully present in zswap can be decompressed into a large folio
>> - a mixed zswap/non-zswap range falls back to order-0 swapin
>>
>> The series adds a zswap range query helper, teaches zswap_load() to
>> decompress all-zswap large folios one base page at a time, accounts mTHP
>> swpin for zswap-loaded large folios, retries synchronous large-folio
>> insertion races with order-0 swapin, and removes the anonymous
>> zswap-never-enabled restriction once mixed ranges are filtered.
>>
>> I tested the series with a full bzImage build using CONFIG_ZSWAP=y,
>> CONFIG_ZRAM=y, CONFIG_MEMCG=y and CONFIG_THP_SWAP=y.
>>
>> The QEMU/KVM runs covered both the fully-zswap path and the mixed-backend
>> fallback path. In the all-zswap run, a 512MiB anonymous mapping was faulted
>> as 8192 64KiB groups, reclaimed into zswap, and faulted back. Reclaim
>> reported mthp64_zswpout=8192 and zswpout=131072. Refault then reported
>> mthp64_swpin=8192 and zswpin=131072, and pagemap/kpageflags showed 8192
>> order-4 THP groups in the mapping.
>>
>> In the mixed-backend run, the workload used a 64MiB anonymous mapping
>> split into 1024 64KiB groups. After shrinker debugfs wrote back exactly
>> one zswap base-page entry, refault left 1023 order-4 THP groups and one
>> order-0 mixed group. The kernel stats matched that shape:
>> mthp64_swpin=1023, zswpin=16383 and zswpwb=1.
>>
>> CONFIG_SHRINKER_DEBUG is only a test aid for making that one zswap
>> writeback deterministic; it is not required by the implementation.
>>
>> Nhat Pham's active Virtual Swap Space series is adjacent work. It moves
>> swap cache and zswap entry state into a virtual swap descriptor, and lists
>> mixed backing THP swapin as a future use case. This RFC is independent and
>> works with the current swap/zswap infrastructure, but may need rebasing if
>> VSS lands first.
>>
>> Feedback would be especially helpful on:
>>
>> 1. whether it makes sense to support all-zswap large folio swapin first,
>> while keeping mixed zswap/disk ranges on the order-0 fallback path
>
> I think so, yes, but based on my read of the code this RFC only affects
> synchronous swapin, which is more-or-less zram+zswap. This is an
> uncommon setup outside of testing.
BLK_FEAT_SYNCHRONOUS is also set for pmem and brd devices I think, but that's
also pretty uncommon I assume. Well, maybe if your hypervisor provides you with
an emulated NVDIMM to use as swap backend ... maybe.
I thought there were other ways to get BLK_FEAT_SYNCHRONOUS set, but I don't see
other usage.
So seeing it for zswap is pretty rare I assume.
--
Cheers,
David
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin
2026-05-12 4:20 ` Alexandre Ghiti
@ 2026-05-12 7:46 ` Fujunjie
0 siblings, 0 replies; 17+ messages in thread
From: Fujunjie @ 2026-05-12 7:46 UTC (permalink / raw)
To: Alexandre Ghiti
Cc: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner, Nhat Pham,
Yosry Ahmed, linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
Chengming Zhou, Baoquan He, Lorenzo Stoakes
On 5/12/2026 12:20 PM, Alexandre Ghiti wrote:
> So I have been working on the exact same thing for some weeks now. My work is based on Usama's series [1].
>
> The problem with large folio swapin is that it can create swap thrashing: to swap in a large folio, swap out may be necessary, as reported in [2].
>
> I implemented quite a few throttling algorithms on top to try to avoid this issue and so far, I have had mixed/inconsistent results.
>
> How did you test this series? Did you encounter thrashing? Do you have performance numbers?
>
> Happy to talk more about this, thanks for your series!
>
> Alex
>
> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
> [2] https://lore.kernel.org/all/SJ0PR11MB5678A864244B09FDE4D914EEC9402@SJ0PR11MB5678.namprd11.prod.outlook.com/
Thanks Alexandre.
My RFC has only had correctness testing so far. I tested the all-zswap path
and the fallback cases under QEMU, but I don't have bare-metal
performance numbers yet.
Since you are already actively working on this, I don't want to duplicate
that effort. I will pause this RFC for now and wait for your series.
After your series is posted, I will take another look and see if there is
anything that still needs follow-up work.
Thanks for letting me know.
* Re: [RFC PATCH 4/5] mm: swap: fall back to order-0 after large swapin races
2026-05-11 14:59 ` Kairui Song
@ 2026-05-12 7:57 ` Fujunjie
0 siblings, 0 replies; 17+ messages in thread
From: Fujunjie @ 2026-05-12 7:57 UTC (permalink / raw)
To: Kairui Song, David Hildenbrand (Arm)
Cc: Andrew Morton, Chris Li, Johannes Weiner, Nhat Pham, Yosry Ahmed,
linux-mm, linux-kernel, linux-doc, Jonathan Corbet, Ryan Roberts,
Barry Song, Baolin Wang, Chengming Zhou, Baoquan He,
Lorenzo Stoakes
On 5/11/2026 10:59 PM, Kairui Song wrote:
> On Mon, May 11, 2026 at 9:14 PM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
>>
>> On 5/8/26 22:20, fujunjie wrote:
>>> swapin_folio() documents that a large folio insertion race returns NULL
>>> so the caller can fall back to order-0 swapin. do_swap_page() currently
>>> turns that NULL into VM_FAULT_OOM if the PTE is unchanged, which is
>>> harsher than necessary and gets in the way of rejecting large folio
>>> ranges for backend reasons.
>>>
>>> Move the synchronous swapin sequence into a helper and retry with an
>>> order-0 folio when a large folio cannot be inserted into the swap cache.
>>> Count the event as an mTHP swapin fallback before dropping the failed
>>> large allocation.
>>>
>>> Signed-off-by: fujunjie <fujunjie1@qq.com>
>>> ---
>>> mm/memory.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
>>> 1 file changed, 39 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index ea6568571131..84e3b77b8293 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4757,6 +4757,44 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>> }
>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>
>>> +static struct folio *swapin_synchronous_folio(swp_entry_t entry,
>>> + struct vm_fault *vmf)
>>> +{
>>> + struct folio *swapcache, *folio;
>>> + bool large;
>>> + int order;
>>> +
>>> + folio = alloc_swap_folio(vmf);
>>> + if (!folio)
>>> + return NULL;
>>> +
>>> + large = folio_test_large(folio);
>>> + order = folio_order(folio);
>>> +
>>> + /*
>>> + * folio is charged, so swapin can only fail due to raced swapin and
>>> + * return NULL.
>>> + */
>>> + swapcache = swapin_folio(entry, folio);
>>> + if (swapcache == folio)
>>> + return folio;
>>> +
>>> + if (!swapcache && large)
>>> + count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
>>> + folio_put(folio);
>>> + if (swapcache || !large)
>>> + return swapcache;
>>> +
>>> + folio = __alloc_swap_folio(vmf);
>>> + if (!folio)
>>> + return NULL;
>>> +
>>> + swapcache = swapin_folio(entry, folio);
>>> + if (swapcache != folio)
>>> + folio_put(folio);
>>> + return swapcache;
>>> +}
>>> +
>>> /* Sanity check that a folio is fully exclusive */
>>> static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
>>> unsigned int nr_pages)
>>> @@ -4860,17 +4898,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> swap_update_readahead(folio, vma, vmf->address);
>>> if (!folio) {
>>> if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
>>> - folio = alloc_swap_folio(vmf);
>>> - if (folio) {
>>> - /*
>>> - * folio is charged, so swapin can only fail due
>>> - * to raced swapin and return NULL.
>>> - */
>>> - swapcache = swapin_folio(entry, folio);
>>> - if (swapcache != folio)
>>> - folio_put(folio);
>>> - folio = swapcache;
>>> - }
>>> + folio = swapin_synchronous_folio(entry, vmf);
>>> } else {
>>> folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
>>> }
>>
>> There are some upcoming changes with:
>>
>> https://lore.kernel.org/r/20260421-swap-table-p4-v3-5-2f23759a76bc@tencent.com
>>
>>
>> All of that logic you have in swapin_synchronous_folio() should ideally not
>> go into memory.c, but into some swap specific code.
>>
>> But
>>
>> https://lore.kernel.org/r/20260421-swap-table-p4-v3-0-2f23759a76bc@tencent.com
>
> Thanks for mentioning this!
>
> I think Junjie's change fits better after that change indeed. And I
> checked the code, it should fit easily too.
>
> It's already strange enough that THP swapin is bundled with
> synchronous swapin; we'd better not make it more divergent here and add
> more bits into memory.c.
>
> And this commit will limit it to anon, no shmem, which is another
> strange detail. Or we'll have to repeat everything and copy this code
> to shmem.c...
>
> Once all swap-ins use basically the same path as in that series, all
> swap-ins will be able to have similar THP and zswap THP support too.
Thanks David and Kairui.
That makes sense. The helper in memory.c was mainly added to demonstrate
the fallback needed by this RFC, but I agree that growing more large-folio
swapin logic in the anon synchronous swapin path is not the right
direction.
I will not carry this patch forward in its current form. If I continue
with this work, I will rebase it on top of Kairui's swap-table or unified
swapin work and keep the allocation and fallback handling in the swap-specific
common path.
I also noticed Alexandre is already working on the large-folio swapin
side, so I will follow that series to avoid duplicating work.
Thanks for the pointers.
* Re: [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin
2026-05-11 22:13 ` [RFC PATCH 0/5] mm: support zswap-backed anonymous " Yosry Ahmed
2026-05-12 6:14 ` David Hildenbrand (Arm)
@ 2026-05-12 8:02 ` Fujunjie
1 sibling, 0 replies; 17+ messages in thread
From: Fujunjie @ 2026-05-12 8:02 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner, Nhat Pham,
linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
Chengming Zhou, Baoquan He, Lorenzo Stoakes
On 5/12/2026 6:13 AM, Yosry Ahmed wrote:
>> Feedback would be especially helpful on:
>>
>> 1. whether it makes sense to support all-zswap large folio swapin first,
>> while keeping mixed zswap/disk ranges on the order-0 fallback path
>
> I think so, yes, but based on my read of the code this RFC only affects
> synchronous swapin, which is more-or-less zram+zswap. This is an
> uncommon setup outside of testing.
>
>> 2. whether a follow-up for mixed zswap/disk large folio swapin would be
>> useful after this RFC
>
> That's a heavier lift and I think we should consider this in the
> longer term, once the virtual swap work settles down. This is
> conceptually not a zswap thing; you can have parts of a folio on disk,
> in zswap, in the zeromap, etc. So it needs to be handled at a higher
> layer (virtual swap for example).
>
Thanks Yosry.
That makes sense. I agree that the mixed zswap/disk/zeromap case is not
really zswap-specific and should be handled at a higher layer, likely
after the virtual swap work settles.
Given the feedback on the swapin path structure and Alexandre's ongoing
work in this area, I will pause this RFC in its current form and follow
those series first.
Thanks
* Re: [RFC PATCH 3/5] mm: zswap: load fully stored large folios
2026-05-11 22:38 ` Yosry Ahmed
@ 2026-05-12 8:05 ` Fujunjie
0 siblings, 0 replies; 17+ messages in thread
From: Fujunjie @ 2026-05-12 8:05 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Andrew Morton, Johannes Weiner, Chris Li, Kairui Song, Nhat Pham,
linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
David Hildenbrand, Ryan Roberts, Barry Song, Baolin Wang,
Chengming Zhou, Baoquan He, Lorenzo Stoakes, Michal Hocko,
Roman Gushchin, Shakeel Butt
On 5/12/2026 6:38 AM, Yosry Ahmed wrote:
> On Fri, May 08, 2026 at 08:20:31PM +0000, fujunjie wrote:
>> zswap_store() already stores every base page of a large folio as a
>> separate zswap entry and tears the whole folio back down on store
>> failure. The load side still rejects any large folio, which forces the
>> swapin path to avoid mTHP swapin once zswap has ever been enabled.
>>
>> Use zswap_entry_batch() to distinguish three cases: the whole range is
>> absent from zswap and should fall through to the disk backend, the whole
>> range is present and can be decompressed one base page at a time, or the
>> range is mixed and must be treated as an invalid large-folio backend
>> selection.
>>
>> After all entries decompress successfully, mark the folio uptodate and
>> dirty, account the mTHP swpin stat once for the folio, account one ZSWPIN
>> event per base page, and invalidate each zswap entry because the
>> swapcache folio becomes authoritative.
>>
>> Signed-off-by: fujunjie <fujunjie1@qq.com>
>> ---
>> Documentation/admin-guide/mm/transhuge.rst | 4 +-
>> mm/zswap.c | 65 ++++++++++++++--------
>> 2 files changed, 45 insertions(+), 24 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>> index 5fbc3d89bb07..05456906aff6 100644
>> --- a/Documentation/admin-guide/mm/transhuge.rst
>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>> @@ -644,8 +644,8 @@ zswpout
>> piece without splitting.
>>
>> swpin
>> - is incremented every time a huge page is swapped in from a non-zswap
>> - swap device in one piece.
>> + is incremented every time a huge page is swapped in from swap I/O or
>> + zswap in one piece.
>>
>> swpin_fallback
>> is incremented if swapin fails to allocate or charge a huge page
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 27c14b8edd15..863ca1e896ed 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -28,6 +28,7 @@
>> #include <crypto/acompress.h>
>> #include <crypto/scatterwalk.h>
>> #include <linux/zswap.h>
>> +#include <linux/huge_mm.h>
>> #include <linux/mm_types.h>
>> #include <linux/page-flags.h>
>> #include <linux/swapops.h>
>> @@ -1614,20 +1615,23 @@ bool zswap_store(struct folio *folio)
>> * NOT marked up-to-date, so that an IO error is emitted (e.g. do_swap_page()
>> * will SIGBUS).
>> *
>> - * -EINVAL: if the swapped out content was in zswap, but the page belongs
>> - * to a large folio, which is not supported by zswap. The folio is unlocked,
>> - * but NOT marked up-to-date, so that an IO error is emitted (e.g.
>> - * do_swap_page() will SIGBUS).
>> + * -EINVAL: if the folio spans a mix of zswap and non-zswap entries. The
>> + * folio is unlocked, but NOT marked up-to-date, so that an IO error is
>> + * emitted (e.g. do_swap_page() will SIGBUS). Large folio swapin should
>> + * reject such ranges before calling zswap_load().
>> *
>> - * -ENOENT: if the swapped out content was not in zswap. The folio remains
>> + * -ENOENT: if the swapped out content was not in zswap. For a large folio,
>> + * this means the whole folio range was not in zswap. The folio remains
>> * locked on return.
>> */
>> int zswap_load(struct folio *folio)
>> {
>> swp_entry_t swp = folio->swap;
>> pgoff_t offset = swp_offset(swp);
>> - struct xarray *tree = swap_zswap_tree(swp);
>> struct zswap_entry *entry;
>> + int nr_pages = folio_nr_pages(folio);
>> + bool is_zswap;
>> + int index;
>>
>> VM_WARN_ON_ONCE(!folio_test_locked(folio));
>> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>> @@ -1635,30 +1639,36 @@ int zswap_load(struct folio *folio)
>> if (zswap_never_enabled())
>> return -ENOENT;
>>
>> - /*
>> - * Large folios should not be swapped in while zswap is being used, as
>> - * they are not properly handled. Zswap does not properly load large
>> - * folios, and a large folio may only be partially in zswap.
>> - */
>> - if (WARN_ON_ONCE(folio_test_large(folio))) {
>> + if (zswap_entry_batch(swp, nr_pages, &is_zswap) != nr_pages) {
>> + WARN_ON_ONCE(folio_test_large(folio));
>
> IIUC the condition can only be true for large folios, so I think just
> WARN_ON_ONCE() in the if statement itself?
>
> Taking a step back, this is doing xa_load() lookups that are repeated
> again below. Maybe we should drop this check here and integrate the
> check into the loop below (either all entries exist or don't)?
>
> We might end up with a case where we decompress some parts of the folio
> then return a failure, but this possibility already exists (e.g.
> decompression failures).
>
>> folio_unlock(folio);
>> return -EINVAL;
>> }
>>
>> - entry = xa_load(tree, offset);
>> - if (!entry)
>> + if (!is_zswap)
>> return -ENOENT;
>>
>> - if (!zswap_decompress(entry, folio, 0)) {
>> - folio_unlock(folio);
>> - return -EIO;
>> + for (index = 0; index < nr_pages; index++) {
>> + swp_entry_t entry_swp = swp_entry(swp_type(swp),
>> + offset + index);
>> + struct xarray *tree = swap_zswap_tree(entry_swp);
>> +
>> + entry = xa_load(tree, offset + index);
>> + if (WARN_ON_ONCE(!entry)) {
>> + folio_unlock(folio);
>> + return -EINVAL;
>> + }
>> +
>> + if (!zswap_decompress(entry, folio, index)) {
>> + folio_unlock(folio);
>> + return -EIO;
>> + }
>> }
>>
>> folio_mark_uptodate(folio);
>>
>> - count_vm_event(ZSWPIN);
>> - if (entry->objcg)
>> - count_objcg_events(entry->objcg, ZSWPIN, 1);
>> + count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN);
>
> Not a problem with this patch, but I wonder why different swapin
> implementations are making this call instead of putting it at a
> higher level (e.g. alloc_swap_folio())?
>
>> + count_vm_events(ZSWPIN, nr_pages);
>>
>> /*
>> * We are reading into the swapcache, invalidate zswap entry.
>> @@ -1668,8 +1678,19 @@ int zswap_load(struct folio *folio)
>> * compression work.
>> */
>> folio_mark_dirty(folio);
>> - xa_erase(tree, offset);
>> - zswap_entry_free(entry);
>> +
>> + for (index = 0; index < nr_pages; index++) {
>> + swp_entry_t entry_swp = swp_entry(swp_type(swp),
>> + offset + index);
>> + struct xarray *tree = swap_zswap_tree(entry_swp);
>> +
>> + entry = xa_erase(tree, offset + index);
>> + if (WARN_ON_ONCE(!entry))
>> + continue;
>> + if (entry->objcg)
>> + count_objcg_events(entry->objcg, ZSWPIN, 1);
>
> Hmm I was wondering how the stat update should be handled here. It is
> possible that a folio was swapped out from one memcg and is being
> swapped in by a process in a different memcg. For order-0 folios, the
> folio gets charged to the swapout memcg.
>
> However, looking at alloc_swap_folio() ->
> mem_cgroup_swapin_charge_folio(), it seems like we charge mTHP folios to
> the swapout memcg of the first swap entry. This seems a bit arbitrary.
>
> Focusing on the code here, it seems like we'll count the swapin against the
> swapout memcg, which is good in terms of keeping ZSWPIN and ZSWPOUT
> balanced. However, it seems like it's possible that the folio is being
> charged to a different memcg here than the swapout memcg, so we may end
> up counting ZSWPIN in one memcg but charging the folio in another.
>
> I am not sure if this is practically a problem, but something doesn't
> sound right. Johannes (and other memcg folks), were these the intended
> charging semantics for mTHP swapin? Is it reasonable to count ZSWPIN in
> one memcg and charge the folio in another?
>
>> + zswap_entry_free(entry);
>> + }
>>
>> folio_unlock(folio);
>> return 0;
>> --
>> 2.34.1
>>
>>
Agreed on the duplicated xa_load() lookups; that should be folded into
the load loop if this is picked up again.
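For reference, a rough (untested) sketch of what that folded loop could
look like, reusing the same helpers this patch already uses:

	for (index = 0; index < nr_pages; index++) {
		swp_entry_t entry_swp = swp_entry(swp_type(swp),
						  offset + index);
		struct xarray *tree = swap_zswap_tree(entry_swp);

		entry = xa_load(tree, offset + index);
		if (!entry) {
			/* first subpage not in zswap: defer to the backend */
			if (!index)
				return -ENOENT;
			/* mixed range: callers should have filtered this */
			folio_unlock(folio);
			return -EINVAL;
		}

		if (!zswap_decompress(entry, folio, index)) {
			folio_unlock(folio);
			return -EIO;
		}
	}

That drops the separate zswap_entry_batch() pass, at the cost of possibly
having decompressed a few subpages before a mixed range is detected, which
matches the existing decompression-failure behaviour you mentioned.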
The MTHP_STAT_SWPIN and memcg accounting points also make sense. I will
avoid defining those semantics in this RFC and revisit them if there is a
future version based on the common swapin path.
Thanks.
* Re: [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin
2026-05-12 6:14 ` David Hildenbrand (Arm)
@ 2026-05-12 19:19 ` Yosry Ahmed
0 siblings, 0 replies; 17+ messages in thread
From: Yosry Ahmed @ 2026-05-12 19:19 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: fujunjie, Andrew Morton, Chris Li, Kairui Song, Johannes Weiner,
Nhat Pham, linux-mm, linux-kernel, linux-doc, Jonathan Corbet,
Ryan Roberts, Barry Song, Baolin Wang, Chengming Zhou, Baoquan He,
Lorenzo Stoakes
> >> Feedback would be especially helpful on:
> >>
> >> 1. whether it makes sense to support all-zswap large folio swapin first,
> >> while keeping mixed zswap/disk ranges on the order-0 fallback path
> >
> > I think so, yes, but based on my read of the code this RFC only affects
> > synchronous swapin, which is more-or-less zram+zswap. This is an
> > uncommon setup outside of testing.
>
> BLK_FEAT_SYNCHRONOUS is also set for pmem and brd devices I think, but that's
> also pretty uncommon I assume. Well, maybe if your hypervisor provides you with
> an emulated NVDIMM to use as swap backend ... maybe.
Yeah, I said "more-or-less" to capture pmem/brd/etc :P
> I thought there were other ways to get BLK_FEAT_SYNCHRONOUS set, but I don't see
> other usage.
>
> So seeing it for zswap is pretty rare I assume.
Yeah that's my understanding as well.
end of thread [~2026-05-12 19:19 UTC | newest]
Thread overview: 17+ messages
2026-05-08 20:18 [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin fujunjie
2026-05-08 20:20 ` [RFC PATCH 1/5] mm: zswap: decompress into a folio subpage fujunjie
2026-05-08 20:20 ` [RFC PATCH 2/5] mm: zswap: add a zswap entry batch helper fujunjie
2026-05-08 20:20 ` [RFC PATCH 3/5] mm: zswap: load fully stored large folios fujunjie
2026-05-11 22:38 ` Yosry Ahmed
2026-05-12 8:05 ` Fujunjie
2026-05-08 20:20 ` [RFC PATCH 4/5] mm: swap: fall back to order-0 after large swapin races fujunjie
2026-05-11 13:03 ` David Hildenbrand (Arm)
2026-05-11 14:59 ` Kairui Song
2026-05-12 7:57 ` Fujunjie
2026-05-08 20:20 ` [RFC PATCH 5/5] mm: swap: allow zswap-backed large folio swapin fujunjie
2026-05-11 22:13 ` [RFC PATCH 0/5] mm: support zswap-backed anonymous " Yosry Ahmed
2026-05-12 6:14 ` David Hildenbrand (Arm)
2026-05-12 19:19 ` Yosry Ahmed
2026-05-12 8:02 ` Fujunjie
2026-05-12 4:20 ` Alexandre Ghiti
2026-05-12 7:46 ` Fujunjie