linux-mm.kvack.org archive mirror
* [PATCH v5 0/8] mm/shmem, swap: bugfix and improvement of mTHP swap in
@ 2025-07-10  3:36 Kairui Song
  2025-07-10  3:36 ` [PATCH v5 1/8] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Kairui Song @ 2025-07-10  3:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
	Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The current mTHP swapin path has several problems. It may potentially
hang, may cause redundant faults due to false positive swap cache lookups,
and it involves at least 4 Xarray tree walks (get order, get order again,
confirm swap, insert folio). For !CONFIG_TRANSPARENT_HUGEPAGE builds, it
also performs some mTHP related checks.

This series fixes all of the mentioned issues, and the code should be more
robust and better prepared for the swap table series. The tree walks are
now reduced to two (get order & confirm, insert folio), the
!CONFIG_TRANSPARENT_HUGEPAGE build overhead is minimized, and a sanity
check is added.

Performance is slightly better after this series. Sequential swapin of 24G
of data from ZRAM, using transparent_hugepage_tmpfs=always (24 samples each):

Before:         Avg: 10.67s,  stddev: 0.04
After patch 1:  Avg: 10.49s,  stddev: 0.04
After patch 2:  Avg: 10.42s,  stddev: 0.05
After patch 3:  Avg: 10.45s,  stddev: 0.05
After patch 4:  Avg: 10.49s,  stddev: 0.04
After patch 5:  Avg: 9.67s,   stddev: 0.03
After patch 6:  Avg: 9.67s,   stddev: 0.04
After patch 7:  Avg: 9.68s,   stddev: 0.05
After patch 8:  Avg: 9.66s,   stddev: 0.04

Several patches improve performance by a little each, which adds up to
roughly 10% in total.

A kernel build test showed a very slight improvement: make -j48 with
defconfig in a 768M memcg, also using ZRAM as swap, and
transparent_hugepage_tmpfs=always (6 test runs):

Before:         avg: 3353.66s, stddev: 33.73
After patch 1:  avg: 3354.19s, stddev: 42.54
After patch 2:  avg: 3364.16s, stddev: 52.74
After patch 3:  avg: 3355.73s, stddev: 36.17
After patch 4:  avg: 3352.78s, stddev: 39.80
After patch 5:  avg: 3355.19s, stddev: 50.78
After patch 6:  avg: 3333.63s, stddev: 32.50
After patch 7:  avg: 3297.70s, stddev: 38.93
After patch 8:  avg: 3302.35s, stddev: 50.61

---

V4: https://lore.kernel.org/linux-mm/20250704181748.63181-1-ryncsn@gmail.com/
Updates:
- Merge patch 5 into patch 8 to fix an intermediate code error. [ Baolin
  Wang ]
- Instead of passing two swap entries, calculate the new order 0 entry
  in shmem_swap_alloc_folio, to improve code readability. [ Baolin Wang ]
- Rebase on top of mm-new.

V3: https://lore.kernel.org/linux-mm/20250627062020.534-1-ryncsn@gmail.com/
Updates:
- Split and reorganize a few intermediate patches [ Baolin Wang ]
- Fix a duplicated fault issue that may occur in one intermediate patch
  [ Baolin Wang ]
- Improve variable naming [ Baolin Wang ]
- Fix a wrong error value problem: return the proper error value when
  direct swapin fails.
- No major code change from V3.

V2: https://lore.kernel.org/linux-mm/20250619175538.15799-1-ryncsn@gmail.com/
Updates:
- Split the clean up patch into 3 individual patches [ Baolin Wang ]
- Fix a code error in the first patch [ Baolin Wang ]
- I found some other remaining issues that can be fixed after the clean
  up, so these are included too: fix the major fault counter, and clean
  up the goto labels.

V1: https://lore.kernel.org/linux-mm/20250617183503.10527-1-ryncsn@gmail.com/
Updates:
- Improve function and variable names, as well as commit messages [ Kemeng Shi,
  Dev Jain ]
- Correct Fixes: tag [ Andrew Morton ]
- Collect Reviewed-by.

Two of the patches in this series come from the swap table series [1]. It
is worth noting that the performance gain of this series is independent of
the swap table series; we'll see another, bigger performance gain and a
reduction of memory usage after the swap table series.

I found these issues while trying to split the shmem changes out of the
swap table series for easier reviewing, and found several more issues
while doing stress tests for performance comparison. Barry also mentioned
that CONFIG_TRANSPARENT_HUGEPAGE may have redundant checks [2] and I
managed to clean them up properly too.

No issues were found after a few days of stress testing.

Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [1]
Link: https://lore.kernel.org/linux-mm/CAMgjq7AsKFz7UN+seR5atznE_RBTDC9qjDmwN5saMe+KL3b1mQ@mail.gmail.com/ [2]


Kairui Song (8):
  mm/shmem, swap: improve cached mTHP handling and fix potential hung
  mm/shmem, swap: avoid redundant Xarray lookup during swapin
  mm/shmem, swap: tidy up THP swapin checks
  mm/shmem, swap: tidy up swap entry splitting
  mm/shmem, swap: never use swap cache and readahead for
    SWP_SYNCHRONOUS_IO
  mm/shmem, swap: simplify swapin path and result handling
  mm/shmem, swap: rework swap entry and index calculation for large
    swapin
  mm/shmem, swap: fix major fault counting

 mm/shmem.c | 267 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 143 insertions(+), 124 deletions(-)

-- 
2.50.0




* [PATCH v5 1/8] mm/shmem, swap: improve cached mTHP handling and fix potential hung
  2025-07-10  3:36 [PATCH v5 0/8] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
@ 2025-07-10  3:36 ` Kairui Song
  2025-07-24 17:02   ` Kairui Song
  2025-07-10  3:37 ` [PATCH v5 2/8] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2025-07-10  3:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
	Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
	linux-kernel, Kairui Song, stable

From: Kairui Song <kasong@tencent.com>

The current swap-in code assumes that, when a swap entry in the shmem
mapping is order 0, its cached folios (if present) must be order 0 too,
which turns out to be not always correct.

The problem is that shmem_split_large_entry is called before verifying
that the folio will eventually be swapped in; one possible race is:

    CPU1                          CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
  folio = swap_cache_get_folio
  /* folio = NULL */
  order = xa_get_order
  /* order > 0 */
  folio = shmem_swap_alloc_folio
  /* mTHP alloc failure, folio = NULL */
  <... Interrupted ...>
                                 shmem_swapin_folio
                                 /* S1 is swapped in */
                                 shmem_writeout
                                 /* S1 is swapped out, folio cached */
  shmem_split_large_entry(..., S1)
  /* S1 is split, but the folio covering it has order > 0 now */

Now any following swapin of S1 will hang: `xa_get_order` returns 0, while
folio lookup returns a folio with order > 0.  The
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check then
always triggers, so swap-in keeps returning -EEXIST and the fault retries
forever.

This looks fragile.  So fix it up by allowing a larger folio to be seen in
the swap cache, and by checking that the whole shmem mapping range covered
by the swapin has the right swap value upon inserting the folio.  Also
drop the redundant tree walks before the insertion.

This will actually improve performance, as it avoids two redundant Xarray
tree walks in the hot path.  The only side effect is that in the failure
path, shmem may redundantly reallocate a few folios, causing temporary
slight memory pressure.

Worth noting, it may seem that the order and value check before inserting
would help reduce the lock contention, but that is not true.  The swap
cache layer ensures a raced swapin will either see a swap cache folio or
fail to do a swapin (we have the SWAP_HAS_CACHE bit even if the swap cache
is bypassed), so holding the folio lock and checking the folio flag is
already good enough for avoiding the lock contention.  The chance that a
folio passes the swap entry value check while the shmem mapping slot has
changed should be very low.
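
For illustration, the invariant that the reworked shmem_add_to_page_cache()
enforces can be modelled in plain C as below (a minimal userspace sketch,
not kernel code; the slot array and helper names are made up for the
example): every entry found in the target range must be the expected swap
entry for its offset, each entry advances the iterator by its order, and
the walk must end exactly at the end of the range, so a mismatch or a hole
is rejected like -EEXIST.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the mapping slots covering the insertion range */
struct slot { unsigned long swap_val; int order; };

static bool range_has_expected_entries(const struct slot *slots, int nslots,
				       unsigned long expected, unsigned long nr)
{
	unsigned long iter = expected;

	for (int i = 0; i < nslots; i++) {
		/* Wrong entry in the range: bail out, like xas_set_err(-EEXIST) */
		if (slots[i].swap_val != iter)
			return false;
		/* Each entry covers 1 << order slots, like xas_get_order() */
		iter += 1UL << slots[i].order;
	}
	/* The whole range must be covered, with no holes */
	return iter - nr == expected;
}

int main(void)
{
	/* A 16-page range backed by one order-2 entry plus twelve order-0 entries */
	struct slot slots[13] = { { 100, 2 } };

	for (int i = 1; i < 13; i++)
		slots[i] = (struct slot){ 100 + 4 + (i - 1), 0 };
	printf("%d\n", range_has_expected_entries(slots, 13, 100, 16)); /* prints 1 */
	return 0;
}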

Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
---
 mm/shmem.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 334b7b4a61a0..e3c9a1365ff4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
 				   pgoff_t index, void *expected, gfp_t gfp)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
-	long nr = folio_nr_pages(folio);
+	unsigned long nr = folio_nr_pages(folio);
+	swp_entry_t iter, swap;
+	void *entry;
 
 	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
 
 	gfp &= GFP_RECLAIM_MASK;
 	folio_throttle_swaprate(folio, gfp);
+	swap = iter = radix_to_swp_entry(expected);
 
 	do {
 		xas_lock_irq(&xas);
-		if (expected != xas_find_conflict(&xas)) {
-			xas_set_err(&xas, -EEXIST);
-			goto unlock;
+		xas_for_each_conflict(&xas, entry) {
+			/*
+			 * The range must either be empty, or filled with
+			 * expected swap entries. Shmem swap entries are never
+			 * partially freed without split of both entry and
+			 * folio, so there shouldn't be any holes.
+			 */
+			if (!expected || entry != swp_to_radix_entry(iter)) {
+				xas_set_err(&xas, -EEXIST);
+				goto unlock;
+			}
+			iter.val += 1 << xas_get_order(&xas);
 		}
-		if (expected && xas_find_conflict(&xas)) {
+		if (expected && iter.val - nr != swap.val) {
 			xas_set_err(&xas, -EEXIST);
 			goto unlock;
 		}
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			error = -ENOMEM;
 			goto failed;
 		}
-	} else if (order != folio_order(folio)) {
+	} else if (order > folio_order(folio)) {
 		/*
 		 * Swap readahead may swap in order 0 folios into swapcache
 		 * asynchronously, while the shmem mapping can still stores
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
 			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
 		}
+	} else if (order < folio_order(folio)) {
+		swap.val = round_down(swap.val, 1 << folio_order(folio));
 	}
 
 alloced:
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
-	    folio->swap.val != swap.val ||
-	    !shmem_confirm_swap(mapping, index, swap) ||
-	    xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+	    folio->swap.val != swap.val) {
 		error = -EEXIST;
 		goto unlock;
 	}
-- 
2.50.0




* [PATCH v5 2/8] mm/shmem, swap: avoid redundant Xarray lookup during swapin
  2025-07-10  3:36 [PATCH v5 0/8] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
  2025-07-10  3:36 ` [PATCH v5 1/8] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
@ 2025-07-10  3:37 ` Kairui Song
  2025-07-10  3:37 ` [PATCH v5 3/8] mm/shmem, swap: tidy up THP swapin checks Kairui Song
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Kairui Song @ 2025-07-10  3:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
	Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
	linux-kernel, Kairui Song, Dev Jain

From: Kairui Song <kasong@tencent.com>

Currently shmem calls xa_get_order to get the swap radix entry order,
requiring a full tree walk.  This can be easily combined with the swap
entry value checking (shmem_confirm_swap) to avoid the duplicated lookup,
which should improve the performance.

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 32 ++++++++++++++++++++++++--------
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index e3c9a1365ff4..85ecc6709b5f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -505,15 +505,27 @@ static int shmem_replace_entry(struct address_space *mapping,
 
 /*
  * Sometimes, before we decide whether to proceed or to fail, we must check
- * that an entry was not already brought back from swap by a racing thread.
+ * that an entry was not already brought back or split by a racing thread.
  *
  * Checking folio is not enough: by the time a swapcache folio is locked, it
  * might be reused, and again be swapcache, using the same swap as before.
+ * Returns the swap entry's order if it still presents, else returns -1.
  */
-static bool shmem_confirm_swap(struct address_space *mapping,
-			       pgoff_t index, swp_entry_t swap)
+static int shmem_confirm_swap(struct address_space *mapping, pgoff_t index,
+			      swp_entry_t swap)
 {
-	return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
+	XA_STATE(xas, &mapping->i_pages, index);
+	int ret = -1;
+	void *entry;
+
+	rcu_read_lock();
+	do {
+		entry = xas_load(&xas);
+		if (entry == swp_to_radix_entry(swap))
+			ret = xas_get_order(&xas);
+	} while (xas_retry(&xas, entry));
+	rcu_read_unlock();
+	return ret;
 }
 
 /*
@@ -2256,16 +2268,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		return -EIO;
 
 	si = get_swap_device(swap);
-	if (!si) {
-		if (!shmem_confirm_swap(mapping, index, swap))
+	order = shmem_confirm_swap(mapping, index, swap);
+	if (unlikely(!si)) {
+		if (order < 0)
 			return -EEXIST;
 		else
 			return -EINVAL;
 	}
+	if (unlikely(order < 0)) {
+		put_swap_device(si);
+		return -EEXIST;
+	}
 
 	/* Look it up and read it in.. */
 	folio = swap_cache_get_folio(swap, NULL, 0);
-	order = xa_get_order(&mapping->i_pages, index);
 	if (!folio) {
 		int nr_pages = 1 << order;
 		bool fallback_order0 = false;
@@ -2415,7 +2431,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	*foliop = folio;
 	return 0;
 failed:
-	if (!shmem_confirm_swap(mapping, index, swap))
+	if (shmem_confirm_swap(mapping, index, swap) < 0)
 		error = -EEXIST;
 	if (error == -EIO)
 		shmem_set_folio_swapin_error(inode, index, folio, swap,
-- 
2.50.0




* [PATCH v5 3/8] mm/shmem, swap: tidy up THP swapin checks
  2025-07-10  3:36 [PATCH v5 0/8] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
  2025-07-10  3:36 ` [PATCH v5 1/8] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
  2025-07-10  3:37 ` [PATCH v5 2/8] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
@ 2025-07-10  3:37 ` Kairui Song
  2025-07-10  3:37 ` [PATCH v5 4/8] mm/shmem, swap: tidy up swap entry splitting Kairui Song
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Kairui Song @ 2025-07-10  3:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
	Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Move all THP swapin related checks under CONFIG_TRANSPARENT_HUGEPAGE,
so they will be trimmed off by the compiler if not needed.

Also add a WARN if shmem sees an order > 0 entry when
CONFIG_TRANSPARENT_HUGEPAGE is disabled; that should never happen unless
things have gone very wrong.

There should be no observable functional change except the newly added WARN.
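
As a rough illustration of the IS_ENABLED() pattern used here (a standalone
C sketch; THP_ENABLED and check_swapin_order() are made-up stand-ins, not
the shmem code): when the config stand-in is 0, the whole mTHP branch is
provably dead and the compiler can trim it, while an unexpected order > 0
request is reported and rejected.

#include <stdio.h>

#define THP_ENABLED 0	/* stand-in for IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) */

/* Returns 0 if the requested order is acceptable, -1 otherwise */
static int check_swapin_order(int order)
{
	if (!THP_ENABLED) {
		/* Orders > 0 should be impossible here; report loudly if seen */
		if (order) {
			fprintf(stderr, "unexpected order %d without THP\n", order);
			return -1;
		}
	} else if (order) {
		/* mTHP-only checks live here; with THP_ENABLED == 0 the
		 * compiler can prove this branch dead and drop it. */
	}
	return 0;
}

int main(void)
{
	printf("%d %d\n", check_swapin_order(0), check_swapin_order(4)); /* 0 -1 */
	return 0;
}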

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 39 ++++++++++++++++++---------------------
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 85ecc6709b5f..d8c872ab3570 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1980,26 +1980,38 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
 		swp_entry_t entry, int order, gfp_t gfp)
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	int nr_pages = 1 << order;
 	struct folio *new;
 	void *shadow;
-	int nr_pages;
 
 	/*
 	 * We have arrived here because our zones are constrained, so don't
 	 * limit chance of success with further cpuset and node constraints.
 	 */
 	gfp &= ~GFP_CONSTRAINT_MASK;
-	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && order > 0) {
-		gfp_t huge_gfp = vma_thp_gfp_mask(vma);
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		if (WARN_ON_ONCE(order))
+			return ERR_PTR(-EINVAL);
+	} else if (order) {
+		/*
+		 * If uffd is active for the vma, we need per-page fault
+		 * fidelity to maintain the uffd semantics, then fallback
+		 * to swapin order-0 folio, as well as for zswap case.
+		 * Any existing sub folio in the swap cache also blocks
+		 * mTHP swapin.
+		 */
+		if ((vma && unlikely(userfaultfd_armed(vma))) ||
+		     !zswap_never_enabled() ||
+		     non_swapcache_batch(entry, nr_pages) != nr_pages)
+			return ERR_PTR(-EINVAL);
 
-		gfp = limit_gfp_mask(huge_gfp, gfp);
+		gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
 	}
 
 	new = shmem_alloc_folio(gfp, order, info, index);
 	if (!new)
 		return ERR_PTR(-ENOMEM);
 
-	nr_pages = folio_nr_pages(new);
 	if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
 					   gfp, entry)) {
 		folio_put(new);
@@ -2283,9 +2295,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	/* Look it up and read it in.. */
 	folio = swap_cache_get_folio(swap, NULL, 0);
 	if (!folio) {
-		int nr_pages = 1 << order;
-		bool fallback_order0 = false;
-
 		/* Or update major stats only when swapin succeeds?? */
 		if (fault_type) {
 			*fault_type |= VM_FAULT_MAJOR;
@@ -2293,20 +2302,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			count_memcg_event_mm(fault_mm, PGMAJFAULT);
 		}
 
-		/*
-		 * If uffd is active for the vma, we need per-page fault
-		 * fidelity to maintain the uffd semantics, then fallback
-		 * to swapin order-0 folio, as well as for zswap case.
-		 * Any existing sub folio in the swap cache also blocks
-		 * mTHP swapin.
-		 */
-		if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
-				  !zswap_never_enabled() ||
-				  non_swapcache_batch(swap, nr_pages) != nr_pages))
-			fallback_order0 = true;
-
 		/* Skip swapcache for synchronous device. */
-		if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
 			folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
 			if (!IS_ERR(folio)) {
 				skip_swapcache = true;
-- 
2.50.0




* [PATCH v5 4/8] mm/shmem, swap: tidy up swap entry splitting
  2025-07-10  3:36 [PATCH v5 0/8] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
                   ` (2 preceding siblings ...)
  2025-07-10  3:37 ` [PATCH v5 3/8] mm/shmem, swap: tidy up THP swapin checks Kairui Song
@ 2025-07-10  3:37 ` Kairui Song
  2025-07-16  7:09   ` Baoquan He
  2025-07-10  3:37 ` [PATCH v5 5/8] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO Kairui Song
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2025-07-10  3:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
	Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Instead of keeping different paths that split the entry before the swapin
starts, move the entry splitting to after the swapin has put the folio in
the swap cache (or set the SWAP_HAS_CACHE bit). This way we only need one
place and one unified way to split the large entry. Whenever swapin brings
in a folio smaller than the shmem swap entry, split the entry and
recalculate the entry and index for verification.

This removes duplicated code and function calls, reduces LOC, and makes
the split less racy, as it's guarded by the swap cache now. So it will
have a lower chance of repeated faults due to a raced split. The compiler
is also able to optimize the code further:

bloat-o-meter results with GCC 14:

With DEBUG_SECTION_MISMATCH (-fno-inline-functions-called-once):
./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-82 (-82)
Function                                     old     new   delta
shmem_swapin_folio                          2361    2279     -82
Total: Before=33151, After=33069, chg -0.25%

With !DEBUG_SECTION_MISMATCH:
./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
add/remove: 0/1 grow/shrink: 1/0 up/down: 949/-750 (199)
Function                                     old     new   delta
shmem_swapin_folio                          2878    3827    +949
shmem_split_large_entry.isra                 750       -    -750
Total: Before=33086, After=33285, chg +0.60%

Since shmem_split_large_entry is only called in one place now, the
compiler will either generate more compact code or inline it for better
performance.

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 56 ++++++++++++++++++++++--------------------------------
 1 file changed, 23 insertions(+), 33 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index d8c872ab3570..97db1097f7de 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2266,14 +2266,16 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	struct address_space *mapping = inode->i_mapping;
 	struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	swp_entry_t swap, index_entry;
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
 	bool skip_swapcache = false;
-	swp_entry_t swap;
 	int error, nr_pages, order, split_order;
+	pgoff_t offset;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
-	swap = radix_to_swp_entry(*foliop);
+	index_entry = radix_to_swp_entry(*foliop);
+	swap = index_entry;
 	*foliop = NULL;
 
 	if (is_poisoned_swp_entry(swap))
@@ -2321,46 +2323,35 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		}
 
 		/*
-		 * Now swap device can only swap in order 0 folio, then we
-		 * should split the large swap entry stored in the pagecache
-		 * if necessary.
-		 */
-		split_order = shmem_split_large_entry(inode, index, swap, gfp);
-		if (split_order < 0) {
-			error = split_order;
-			goto failed;
-		}
-
-		/*
-		 * If the large swap entry has already been split, it is
+		 * Now swap device can only swap in order 0 folio, it is
 		 * necessary to recalculate the new swap entry based on
-		 * the old order alignment.
+		 * the offset, as the swapin index might be unalgined.
 		 */
-		if (split_order > 0) {
-			pgoff_t offset = index - round_down(index, 1 << split_order);
-
+		if (order) {
+			offset = index - round_down(index, 1 << order);
 			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
 		}
 
-		/* Here we actually start the io */
 		folio = shmem_swapin_cluster(swap, gfp, info, index);
 		if (!folio) {
 			error = -ENOMEM;
 			goto failed;
 		}
-	} else if (order > folio_order(folio)) {
+	}
+alloced:
+	if (order > folio_order(folio)) {
 		/*
-		 * Swap readahead may swap in order 0 folios into swapcache
+		 * Swapin may get smaller folios due to various reasons:
+		 * It may fallback to order 0 due to memory pressure or race,
+		 * swap readahead may swap in order 0 folios into swapcache
 		 * asynchronously, while the shmem mapping can still stores
 		 * large swap entries. In such cases, we should split the
 		 * large swap entry to prevent possible data corruption.
 		 */
-		split_order = shmem_split_large_entry(inode, index, swap, gfp);
+		split_order = shmem_split_large_entry(inode, index, index_entry, gfp);
 		if (split_order < 0) {
-			folio_put(folio);
-			folio = NULL;
 			error = split_order;
-			goto failed;
+			goto failed_nolock;
 		}
 
 		/*
@@ -2369,15 +2360,13 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		 * the old order alignment.
 		 */
 		if (split_order > 0) {
-			pgoff_t offset = index - round_down(index, 1 << split_order);
-
-			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+			offset = index - round_down(index, 1 << split_order);
+			swap = swp_entry(swp_type(swap), swp_offset(index_entry) + offset);
 		}
 	} else if (order < folio_order(folio)) {
 		swap.val = round_down(swap.val, 1 << folio_order(folio));
 	}
 
-alloced:
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
@@ -2434,12 +2423,13 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		shmem_set_folio_swapin_error(inode, index, folio, swap,
 					     skip_swapcache);
 unlock:
-	if (skip_swapcache)
-		swapcache_clear(si, swap, folio_nr_pages(folio));
-	if (folio) {
+	if (folio)
 		folio_unlock(folio);
+failed_nolock:
+	if (skip_swapcache)
+		swapcache_clear(si, folio->swap, folio_nr_pages(folio));
+	if (folio)
 		folio_put(folio);
-	}
 	put_swap_device(si);
 
 	return error;
-- 
2.50.0




* [PATCH v5 5/8] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO
  2025-07-10  3:36 [PATCH v5 0/8] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
                   ` (3 preceding siblings ...)
  2025-07-10  3:37 ` [PATCH v5 4/8] mm/shmem, swap: tidy up swap entry splitting Kairui Song
@ 2025-07-10  3:37 ` Kairui Song
  2025-07-11  6:10   ` Baolin Wang
  2025-07-13 10:53   ` Barry Song
  2025-07-10  3:37 ` [PATCH v5 6/8] mm/shmem, swap: simplify swapin path and result handling Kairui Song
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 22+ messages in thread
From: Kairui Song @ 2025-07-10  3:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
	Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

For SWP_SYNCHRONOUS_IO devices, if a cache-bypassing THP swapin fails due
to reasons like memory pressure, a partially conflicting swap cache or
ZSWAP being enabled, shmem will fall back to cached order 0 swapin.

Right now the swap cache still has a non-trivial overhead, and readahead
is not helpful for SWP_SYNCHRONOUS_IO devices, so we should always skip
the readahead and swap cache even if the swapin falls back to order 0.

So handle the fallback logic without falling back to the cached read.
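
The shape of the new fallback can be sketched as follows (a simplified
userspace model; try_alloc() and swapin_alloc() are made up and stand in
for the allocation, memcg charge and swapcache_prepare() steps): any
failure of the large allocation retries once at order 0, re-targeting the
swap entry of the exact faulting index instead of taking the cached
readahead path.

#include <stdio.h>

/* Stand-in for shmem_alloc_folio() + memcg charge + swapcache_prepare() */
static void *try_alloc(int order)
{
	return order ? NULL : (void *)0x1;	/* pretend only order 0 succeeds */
}

/* Try a large swapin first; on failure, fall back to order 0 for the
 * faulting sub-entry instead of going through the cached readahead path. */
static void *swapin_alloc(unsigned long *entry, unsigned long index, int order)
{
	void *new;
retry:
	new = try_alloc(order);
	if (new || !order)	/* success, or order 0 already failed: give up */
		return new;
	/* Point the entry at the sub-page the fault actually needs */
	*entry += index - (index & ~((1UL << order) - 1));
	order = 0;
	goto retry;
}

int main(void)
{
	unsigned long entry = 1024;	/* first entry of a 16-page swap range */
	void *folio = swapin_alloc(&entry, 21, 4);

	/* entry becomes 1024 + (21 - 16) = 1029, the order 0 entry for index 21 */
	printf("folio=%p entry=%lu\n", folio, entry);
	return 0;
}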

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c | 41 ++++++++++++++++++++++++++++-------------
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 97db1097f7de..847e6f128485 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1982,6 +1982,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	int nr_pages = 1 << order;
 	struct folio *new;
+	gfp_t alloc_gfp;
 	void *shadow;
 
 	/*
@@ -1989,6 +1990,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
 	 * limit chance of success with further cpuset and node constraints.
 	 */
 	gfp &= ~GFP_CONSTRAINT_MASK;
+	alloc_gfp = gfp;
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 		if (WARN_ON_ONCE(order))
 			return ERR_PTR(-EINVAL);
@@ -2003,19 +2005,22 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
 		if ((vma && unlikely(userfaultfd_armed(vma))) ||
 		     !zswap_never_enabled() ||
 		     non_swapcache_batch(entry, nr_pages) != nr_pages)
-			return ERR_PTR(-EINVAL);
+			goto fallback;
 
-		gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
+		alloc_gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
+	}
+retry:
+	new = shmem_alloc_folio(alloc_gfp, order, info, index);
+	if (!new) {
+		new = ERR_PTR(-ENOMEM);
+		goto fallback;
 	}
-
-	new = shmem_alloc_folio(gfp, order, info, index);
-	if (!new)
-		return ERR_PTR(-ENOMEM);
 
 	if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
-					   gfp, entry)) {
+					   alloc_gfp, entry)) {
 		folio_put(new);
-		return ERR_PTR(-ENOMEM);
+		new = ERR_PTR(-ENOMEM);
+		goto fallback;
 	}
 
 	/*
@@ -2030,7 +2035,9 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
 	 */
 	if (swapcache_prepare(entry, nr_pages)) {
 		folio_put(new);
-		return ERR_PTR(-EEXIST);
+		new = ERR_PTR(-EEXIST);
+		/* Try smaller folio to avoid cache conflict */
+		goto fallback;
 	}
 
 	__folio_set_locked(new);
@@ -2044,6 +2051,15 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
 	folio_add_lru(new);
 	swap_read_folio(new, NULL);
 	return new;
+fallback:
+	/* Order 0 swapin failed, nothing to fallback to, abort */
+	if (!order)
+		return new;
+	entry.val += index - round_down(index, nr_pages);
+	alloc_gfp = gfp;
+	nr_pages = 1;
+	order = 0;
+	goto retry;
 }
 
 /*
@@ -2313,13 +2329,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			}
 
 			/*
-			 * Fallback to swapin order-0 folio unless the swap entry
-			 * already exists.
+			 * Direct swapin handled order 0 fallback already,
+			 * if it failed, abort.
 			 */
 			error = PTR_ERR(folio);
 			folio = NULL;
-			if (error == -EEXIST)
-				goto failed;
+			goto failed;
 		}
 
 		/*
-- 
2.50.0




* [PATCH v5 6/8] mm/shmem, swap: simplify swapin path and result handling
  2025-07-10  3:36 [PATCH v5 0/8] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
                   ` (4 preceding siblings ...)
  2025-07-10  3:37 ` [PATCH v5 5/8] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-07-10  3:37 ` Kairui Song
  2025-07-11  6:23   ` Baolin Wang
  2025-07-10  3:37 ` [PATCH v5 7/8] mm/shmem, swap: rework swap entry and index calculation for large swapin Kairui Song
  2025-07-10  3:37 ` [PATCH v5 8/8] mm/shmem, swap: fix major fault counting Kairui Song
  7 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2025-07-10  3:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
	Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Slightly tidy up the different swapin paths and error handling for
SWP_SYNCHRONOUS_IO and non-SWP_SYNCHRONOUS_IO devices. Now swapin will
always use either shmem_swap_alloc_folio or shmem_swapin_cluster, then
check the result.

Simplify the control flow and avoid a redundant goto label.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c | 45 +++++++++++++++++++--------------------------
 1 file changed, 19 insertions(+), 26 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 847e6f128485..80f5b8c73eb8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2320,40 +2320,33 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			count_memcg_event_mm(fault_mm, PGMAJFAULT);
 		}
 
-		/* Skip swapcache for synchronous device. */
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+			/* Direct mTHP swapin skipping swap cache & readhaed */
 			folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
-			if (!IS_ERR(folio)) {
-				skip_swapcache = true;
-				goto alloced;
+			if (IS_ERR(folio)) {
+				error = PTR_ERR(folio);
+				folio = NULL;
+				goto failed;
 			}
-
+			skip_swapcache = true;
+		} else {
 			/*
-			 * Direct swapin handled order 0 fallback already,
-			 * if it failed, abort.
+			 * Cached swapin only supports order 0 folio, it is
+			 * necessary to recalculate the new swap entry based on
+			 * the offset, as the swapin index might be unalgined.
 			 */
-			error = PTR_ERR(folio);
-			folio = NULL;
-			goto failed;
-		}
-
-		/*
-		 * Now swap device can only swap in order 0 folio, it is
-		 * necessary to recalculate the new swap entry based on
-		 * the offset, as the swapin index might be unalgined.
-		 */
-		if (order) {
-			offset = index - round_down(index, 1 << order);
-			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
-		}
+			if (order) {
+				offset = index - round_down(index, 1 << order);
+				swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+			}
 
-		folio = shmem_swapin_cluster(swap, gfp, info, index);
-		if (!folio) {
-			error = -ENOMEM;
-			goto failed;
+			folio = shmem_swapin_cluster(swap, gfp, info, index);
+			if (!folio) {
+				error = -ENOMEM;
+				goto failed;
+			}
 		}
 	}
-alloced:
 	if (order > folio_order(folio)) {
 		/*
 		 * Swapin may get smaller folios due to various reasons:
-- 
2.50.0




* [PATCH v5 7/8] mm/shmem, swap: rework swap entry and index calculation for large swapin
  2025-07-10  3:36 [PATCH v5 0/8] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
                   ` (5 preceding siblings ...)
  2025-07-10  3:37 ` [PATCH v5 6/8] mm/shmem, swap: simplify swapin path and result handling Kairui Song
@ 2025-07-10  3:37 ` Kairui Song
  2025-07-11  6:36   ` Baolin Wang
  2025-07-10  3:37 ` [PATCH v5 8/8] mm/shmem, swap: fix major fault counting Kairui Song
  7 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2025-07-10  3:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
	Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Instead of calculating the swap entry differently in different swapin
paths, calculate it early, before the swap cache lookup, and use that
for the lookup and later swapin. And after swapin has brought in a folio,
simply round it down by the size of the folio.

This is simple and effective enough to verify the swap value. A folio's
swap entry is always aligned by its size. Any kind of parallel split or
race is acceptable because the final shmem_add_to_page_cache ensures
that all entries covered by the folio are correct, and thus there
will be no data corruption.

This also prevents false positive cache lookups. If a shmem read
request's index points to the middle of a large swap entry, previously,
shmem would try the swap cache lookup using the large swap entry's
starting value (which is the first sub swap entry of this large entry).
This would lead to false positive lookup results if only the first few
swap entries are cached but the actual requested swap entry pointed to by
the index is uncached. This is not a rare event, as swap readahead always
tries to cache order 0 folios when possible.

And this shouldn't cause any increase in repeated faults. Instead, no
matter how the shmem mapping is split in parallel, as long as the
mapping still contains the right entries, the swapin will succeed.

The final object size and stack usage are also reduced due to
simplified code:

./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-233 (-233)
Function                                     old     new   delta
shmem_swapin_folio                          4040    3807    -233
Total: Before=33152, After=32919, chg -0.70%

Stack usage (Before vs After):
mm/shmem.c:2277:12:shmem_swapin_folio   264     static
mm/shmem.c:2277:12:shmem_swapin_folio   256     static

And while at it, round down the index too if the swap entry is rounded
down. The index is used either for folio reallocation or for confirming
the mapping content. In either case, it should be aligned with the swap
folio.
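
The arithmetic being unified here can be shown with a small standalone
example (plain C with made-up numbers, not the kernel code): first derive
the sub-entry for the faulting index from the order stored in the mapping,
then, once a folio has been obtained, round both the swap value and the
index down by the folio size.

#include <stdio.h>

static unsigned long round_down_ul(unsigned long x, unsigned long a)
{
	return x & ~(a - 1);	/* a is a power of two */
}

int main(void)
{
	unsigned long index = 21;	/* faulting page index */
	unsigned long entry = 1024;	/* start of the order-4 (16-page) swap entry */
	int order = 4;			/* order stored in the shmem mapping */

	/* Before the swap cache lookup: the sub-entry for this exact index */
	unsigned long offset = index - round_down_ul(index, 1UL << order);
	unsigned long swap = entry + offset;		/* 1024 + 5 = 1029 */

	/* After swapin: suppose the folio we got is order 2 (4 pages) */
	unsigned long nr_pages = 4;
	swap = round_down_ul(swap, nr_pages);		/* 1028 */
	index = round_down_ul(index, nr_pages);		/* 20 */

	printf("swap=%lu index=%lu\n", swap, index);
	return 0;
}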

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c | 66 ++++++++++++++++++++++++++----------------------------
 1 file changed, 32 insertions(+), 34 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 80f5b8c73eb8..9c50607ac455 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2265,7 +2265,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 	if (xas_error(&xas))
 		return xas_error(&xas);
 
-	return entry_order;
+	return 0;
 }
 
 /*
@@ -2286,7 +2286,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
 	bool skip_swapcache = false;
-	int error, nr_pages, order, split_order;
+	int error, nr_pages, order;
 	pgoff_t offset;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
@@ -2294,11 +2294,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	swap = index_entry;
 	*foliop = NULL;
 
-	if (is_poisoned_swp_entry(swap))
+	if (is_poisoned_swp_entry(index_entry))
 		return -EIO;
 
-	si = get_swap_device(swap);
-	order = shmem_confirm_swap(mapping, index, swap);
+	si = get_swap_device(index_entry);
+	order = shmem_confirm_swap(mapping, index, index_entry);
 	if (unlikely(!si)) {
 		if (order < 0)
 			return -EEXIST;
@@ -2310,6 +2310,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		return -EEXIST;
 	}
 
+	/* index may point to the middle of a large entry, get the sub entry */
+	if (order) {
+		offset = index - round_down(index, 1 << order);
+		swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+	}
+
 	/* Look it up and read it in.. */
 	folio = swap_cache_get_folio(swap, NULL, 0);
 	if (!folio) {
@@ -2322,7 +2328,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
 			/* Direct mTHP swapin skipping swap cache & readhaed */
-			folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
+			folio = shmem_swap_alloc_folio(inode, vma, index,
+						       index_entry, order, gfp);
 			if (IS_ERR(folio)) {
 				error = PTR_ERR(folio);
 				folio = NULL;
@@ -2330,16 +2337,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			}
 			skip_swapcache = true;
 		} else {
-			/*
-			 * Cached swapin only supports order 0 folio, it is
-			 * necessary to recalculate the new swap entry based on
-			 * the offset, as the swapin index might be unalgined.
-			 */
-			if (order) {
-				offset = index - round_down(index, 1 << order);
-				swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
-			}
-
+			/* Cached swapin only supports order 0 folio */
 			folio = shmem_swapin_cluster(swap, gfp, info, index);
 			if (!folio) {
 				error = -ENOMEM;
@@ -2356,23 +2354,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		 * large swap entries. In such cases, we should split the
 		 * large swap entry to prevent possible data corruption.
 		 */
-		split_order = shmem_split_large_entry(inode, index, index_entry, gfp);
-		if (split_order < 0) {
-			error = split_order;
+		error = shmem_split_large_entry(inode, index, index_entry, gfp);
+		if (error)
 			goto failed_nolock;
-		}
+	}
 
-		/*
-		 * If the large swap entry has already been split, it is
-		 * necessary to recalculate the new swap entry based on
-		 * the old order alignment.
-		 */
-		if (split_order > 0) {
-			offset = index - round_down(index, 1 << split_order);
-			swap = swp_entry(swp_type(swap), swp_offset(index_entry) + offset);
-		}
-	} else if (order < folio_order(folio)) {
-		swap.val = round_down(swap.val, 1 << folio_order(folio));
+	/*
+	 * If the folio is large, round down swap and index by folio size.
+	 * No matter what race occurs, the swap layer ensures we either get
+	 * a valid folio that has its swap entry aligned by size, or a
+	 * temporarily invalid one which we'll abort very soon and retry.
+	 *
+	 * shmem_add_to_page_cache ensures the whole range contains expected
+	 * entries and prevents any corruption, so any race split is fine
+	 * too, it will succeed as long as the entries are still there.
+	 */
+	nr_pages = folio_nr_pages(folio);
+	if (nr_pages > 1) {
+		swap.val = round_down(swap.val, nr_pages);
+		index = round_down(index, nr_pages);
 	}
 
 	/* We have to do this with folio locked to prevent races */
@@ -2387,7 +2387,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		goto failed;
 	}
 	folio_wait_writeback(folio);
-	nr_pages = folio_nr_pages(folio);
 
 	/*
 	 * Some architectures may have to restore extra metadata to the
@@ -2401,8 +2400,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			goto failed;
 	}
 
-	error = shmem_add_to_page_cache(folio, mapping,
-					round_down(index, nr_pages),
+	error = shmem_add_to_page_cache(folio, mapping, index,
 					swp_to_radix_entry(swap), gfp);
 	if (error)
 		goto failed;
-- 
2.50.0




* [PATCH v5 8/8] mm/shmem, swap: fix major fault counting
  2025-07-10  3:36 [PATCH v5 0/8] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
                   ` (6 preceding siblings ...)
  2025-07-10  3:37 ` [PATCH v5 7/8] mm/shmem, swap: rework swap entry and index calculation for large swapin Kairui Song
@ 2025-07-10  3:37 ` Kairui Song
  7 siblings, 0 replies; 22+ messages in thread
From: Kairui Song @ 2025-07-10  3:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
	Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

If the swapin failed, don't update the major fault count. There is a
long-existing comment suggesting we do it this way; now, with the previous
cleanups, we can finally fix it.

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 9c50607ac455..f97c4e9f821d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2319,13 +2319,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	/* Look it up and read it in.. */
 	folio = swap_cache_get_folio(swap, NULL, 0);
 	if (!folio) {
-		/* Or update major stats only when swapin succeeds?? */
-		if (fault_type) {
-			*fault_type |= VM_FAULT_MAJOR;
-			count_vm_event(PGMAJFAULT);
-			count_memcg_event_mm(fault_mm, PGMAJFAULT);
-		}
-
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
 			/* Direct mTHP swapin skipping swap cache & readhaed */
 			folio = shmem_swap_alloc_folio(inode, vma, index,
@@ -2344,6 +2337,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 				goto failed;
 			}
 		}
+		if (fault_type) {
+			*fault_type |= VM_FAULT_MAJOR;
+			count_vm_event(PGMAJFAULT);
+			count_memcg_event_mm(fault_mm, PGMAJFAULT);
+		}
 	}
 	if (order > folio_order(folio)) {
 		/*
-- 
2.50.0




* Re: [PATCH v5 5/8] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO
  2025-07-10  3:37 ` [PATCH v5 5/8] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-07-11  6:10   ` Baolin Wang
  2025-07-13 10:53   ` Barry Song
  1 sibling, 0 replies; 22+ messages in thread
From: Baolin Wang @ 2025-07-11  6:10 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
	Nhat Pham, Baoquan He, Barry Song, linux-kernel



On 2025/7/10 11:37, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> For SWP_SYNCHRONOUS_IO devices, if a cache bypassing THP swapin failed
> due to reasons like memory pressure, partially conflicting swap cache
> or ZSWAP enabled, shmem will fallback to cached order 0 swapin.
> 
> Right now the swap cache still has a non-trivial overhead, and readahead
> is not helpful for SWP_SYNCHRONOUS_IO devices, so we should always skip
> the readahead and swap cache even if the swapin falls back to order 0.
> 
> So handle the fallback logic without falling back to the cached read.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

LGTM. Thanks.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>



* Re: [PATCH v5 6/8] mm/shmem, swap: simplify swapin path and result handling
  2025-07-10  3:37 ` [PATCH v5 6/8] mm/shmem, swap: simplify swapin path and result handling Kairui Song
@ 2025-07-11  6:23   ` Baolin Wang
  2025-07-11  6:28     ` Kairui Song
  0 siblings, 1 reply; 22+ messages in thread
From: Baolin Wang @ 2025-07-11  6:23 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
	Nhat Pham, Baoquan He, Barry Song, linux-kernel



On 2025/7/10 11:37, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Slightly tidy up the different handling of swap in and error handling
> for SWP_SYNCHRONOUS_IO and non-SWP_SYNCHRONOUS_IO devices. Now swapin
> will always use either shmem_swap_alloc_folio or shmem_swapin_cluster,
> then check the result.
> 
> Simplify the control flow and avoid a redundant goto label.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

LGTM, with a nit as follows.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

> ---
>   mm/shmem.c | 45 +++++++++++++++++++--------------------------
>   1 file changed, 19 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 847e6f128485..80f5b8c73eb8 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2320,40 +2320,33 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>   			count_memcg_event_mm(fault_mm, PGMAJFAULT);
>   		}
>   
> -		/* Skip swapcache for synchronous device. */
>   		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> +			/* Direct mTHP swapin skipping swap cache & readhaed */
>   			folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);

Nit: the 'mTHP' word can be confusing, since we will skip swapcache for 
order 0 too. Please drop it.



* Re: [PATCH v5 6/8] mm/shmem, swap: simplify swapin path and result handling
  2025-07-11  6:23   ` Baolin Wang
@ 2025-07-11  6:28     ` Kairui Song
  2025-07-15 22:09       ` Hugh Dickins
  0 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2025-07-11  6:28 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
	Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel

On Fri, Jul 11, 2025 at 2:23 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/7/10 11:37, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Slightly tidy up the different handling of swap in and error handling
> > for SWP_SYNCHRONOUS_IO and non-SWP_SYNCHRONOUS_IO devices. Now swapin
> > will always use either shmem_swap_alloc_folio or shmem_swapin_cluster,
> > then check the result.
> >
> > Simplify the control flow and avoid a redundant goto label.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> LGTM, with a nit as follows.
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> > ---
> >   mm/shmem.c | 45 +++++++++++++++++++--------------------------
> >   1 file changed, 19 insertions(+), 26 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 847e6f128485..80f5b8c73eb8 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2320,40 +2320,33 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >                       count_memcg_event_mm(fault_mm, PGMAJFAULT);
> >               }
> >
> > -             /* Skip swapcache for synchronous device. */
> >               if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > +                     /* Direct mTHP swapin skipping swap cache & readhaed */
> >                       folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
>
> Nit: the 'mTHP' word can be confusing, since we will skip swapcache for
> order 0 too. Please drop it.
>

Yes, thanks for the review.



* Re: [PATCH v5 7/8] mm/shmem, swap: rework swap entry and index calculation for large swapin
  2025-07-10  3:37 ` [PATCH v5 7/8] mm/shmem, swap: rework swap entry and index calculation for large swapin Kairui Song
@ 2025-07-11  6:36   ` Baolin Wang
  2025-07-14  2:39     ` Baolin Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Baolin Wang @ 2025-07-11  6:36 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
	Nhat Pham, Baoquan He, Barry Song, linux-kernel



On 2025/7/10 11:37, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Instead of calculating the swap entry differently in different swapin
> paths, calculate it early before the swap cache lookup and use that
> for the lookup and later swapin. And after swapin have brought a folio,
> simply round it down against the size of the folio.
> 
> This is simple and effective enough to verify the swap value. A folio's
> swap entry is always aligned by its size. Any kind of parallel split or
> race is acceptable because the final shmem_add_to_page_cache ensures
> that all entries covered by the folio are correct, and thus there
> will be no data corruption.
> 
> This also prevents false positive cache lookup. If a shmem read
> request's index points to the middle of a large swap entry,
> previously, shmem will try the swap cache lookup using the large swap
> entry's starting value (which is the first sub swap entry of this
> large entry). This will lead to false positive lookup results if only
> the first few swap entries are cached but the actual requested swap
> entry pointed by the index is uncached. This is not a rare event,
> as swap readahead always tries to cache order 0 folios when possible.
> 
> And this shouldn't cause any increased repeated faults. Instead, no
> matter how the shmem mapping is split in parallel, as long as the
> mapping still contains the right entries, the swapin will succeed.
> 
> The final object size and stack usage are also reduced due to
> simplified code:
> 
> ./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-233 (-233)
> Function                                     old     new   delta
> shmem_swapin_folio                          4040    3807    -233
> Total: Before=33152, After=32919, chg -0.70%
> 
> Stack usage (Before vs After):
> mm/shmem.c:2277:12:shmem_swapin_folio   264     static
> mm/shmem.c:2277:12:shmem_swapin_folio   256     static
> 
> And while at it, round down the index too if swap entry is round down.
> The index is used either for folio reallocation or confirming the
> mapping content. In either case, it should be aligned with the swap
> folio.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

Overall, I don't see any obvious issues, but please give me some time to 
run some tests and then get back to you. Thanks.



* Re: [PATCH v5 5/8] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO
  2025-07-10  3:37 ` [PATCH v5 5/8] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO Kairui Song
  2025-07-11  6:10   ` Baolin Wang
@ 2025-07-13 10:53   ` Barry Song
  1 sibling, 0 replies; 22+ messages in thread
From: Barry Song @ 2025-07-13 10:53 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang,
	Matthew Wilcox, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He,
	linux-kernel

On Thu, Jul 10, 2025 at 11:37 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> For SWP_SYNCHRONOUS_IO devices, if a cache bypassing THP swapin failed
> due to reasons like memory pressure, partially conflicting swap cache
> or ZSWAP enabled, shmem will fallback to cached order 0 swapin.
>
> Right now the swap cache still has a non-trivial overhead, and readahead
> is not helpful for SWP_SYNCHRONOUS_IO devices, so we should always skip
> the readahead and swap cache even if the swapin falls back to order 0.
>
> So handle the fallback logic without falling back to the cached read.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/shmem.c | 41 ++++++++++++++++++++++++++++-------------
>  1 file changed, 28 insertions(+), 13 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 97db1097f7de..847e6f128485 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1982,6 +1982,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
>         struct shmem_inode_info *info = SHMEM_I(inode);
>         int nr_pages = 1 << order;
>         struct folio *new;
> +       gfp_t alloc_gfp;
>         void *shadow;
>
>         /*
> @@ -1989,6 +1990,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
>          * limit chance of success with further cpuset and node constraints.
>          */
>         gfp &= ~GFP_CONSTRAINT_MASK;
> +       alloc_gfp = gfp;
>         if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>                 if (WARN_ON_ONCE(order))
>                         return ERR_PTR(-EINVAL);
> @@ -2003,19 +2005,22 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
>                 if ((vma && unlikely(userfaultfd_armed(vma))) ||
>                      !zswap_never_enabled() ||
>                      non_swapcache_batch(entry, nr_pages) != nr_pages)
> -                       return ERR_PTR(-EINVAL);
> +                       goto fallback;
>
> -               gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
> +               alloc_gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
> +       }
> +retry:
> +       new = shmem_alloc_folio(alloc_gfp, order, info, index);
> +       if (!new) {
> +               new = ERR_PTR(-ENOMEM);
> +               goto fallback;
>         }
> -
> -       new = shmem_alloc_folio(gfp, order, info, index);
> -       if (!new)
> -               return ERR_PTR(-ENOMEM);
>
>         if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
> -                                          gfp, entry)) {
> +                                          alloc_gfp, entry)) {
>                 folio_put(new);
> -               return ERR_PTR(-ENOMEM);
> +               new = ERR_PTR(-ENOMEM);
> +               goto fallback;
>         }
>
>         /*
> @@ -2030,7 +2035,9 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
>          */
>         if (swapcache_prepare(entry, nr_pages)) {
>                 folio_put(new);
> -               return ERR_PTR(-EEXIST);
> +               new = ERR_PTR(-EEXIST);
> +               /* Try smaller folio to avoid cache conflict */
> +               goto fallback;
>         }
>
>         __folio_set_locked(new);
> @@ -2044,6 +2051,15 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
>         folio_add_lru(new);
>         swap_read_folio(new, NULL);
>         return new;
> +fallback:
> +       /* Order 0 swapin failed, nothing to fallback to, abort */
> +       if (!order)
> +               return new;


Feels a bit odd to me. Would it be possible to handle this earlier,
like:

    if (!order)
        return ERR_PTR(-ENOMEM);
    goto fallback;

or:

    if (order)
        goto fallback;
    return ERR_PTR(-ENOMEM);

Not strongly opinionated here—totally up to you.

Thanks
Barry



* Re: [PATCH v5 7/8] mm/shmem, swap: rework swap entry and index calculation for large swapin
  2025-07-11  6:36   ` Baolin Wang
@ 2025-07-14  2:39     ` Baolin Wang
  0 siblings, 0 replies; 22+ messages in thread
From: Baolin Wang @ 2025-07-14  2:39 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
	Nhat Pham, Baoquan He, Barry Song, linux-kernel



On 2025/7/11 14:36, Baolin Wang wrote:
> 
> 
> On 2025/7/10 11:37, Kairui Song wrote:
>> From: Kairui Song <kasong@tencent.com>
>>
>> Instead of calculating the swap entry differently in different swapin
>> paths, calculate it early before the swap cache lookup and use that
>> for the lookup and later swapin. And after swapin have brought a folio,
>> simply round it down against the size of the folio.
>>
>> This is simple and effective enough to verify the swap value. A folio's
>> swap entry is always aligned by its size. Any kind of parallel split or
>> race is acceptable because the final shmem_add_to_page_cache ensures
>> that all entries covered by the folio are correct, and thus there
>> will be no data corruption.
>>
>> This also prevents false positive cache lookup. If a shmem read
>> request's index points to the middle of a large swap entry,
>> previously, shmem will try the swap cache lookup using the large swap
>> entry's starting value (which is the first sub swap entry of this
>> large entry). This will lead to false positive lookup results if only
>> the first few swap entries are cached but the actual requested swap
>> entry pointed by the index is uncached. This is not a rare event,
>> as swap readahead always tries to cache order 0 folios when possible.
>>
>> And this shouldn't cause any increased repeated faults. Instead, no
>> matter how the shmem mapping is split in parallel, as long as the
>> mapping still contains the right entries, the swapin will succeed.
>>
>> The final object size and stack usage are also reduced due to
>> simplified code:
>>
>> ./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
>> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-233 (-233)
>> Function                                     old     new   delta
>> shmem_swapin_folio                          4040    3807    -233
>> Total: Before=33152, After=32919, chg -0.70%
>>
>> Stack usage (Before vs After):
>> mm/shmem.c:2277:12:shmem_swapin_folio   264     static
>> mm/shmem.c:2277:12:shmem_swapin_folio   256     static
>>
>> And while at it, round down the index too if the swap entry is rounded down.
>> The index is used either for folio reallocation or confirming the
>> mapping content. In either case, it should be aligned with the swap
>> folio.
>>
>> Signed-off-by: Kairui Song <kasong@tencent.com>
> 
> Overall, I don't see any obvious issues, but please give me some time to 
> run some tests and then get back to you. Thanks.

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 6/8] mm/shmem, swap: simplify swapin path and result handling
  2025-07-11  6:28     ` Kairui Song
@ 2025-07-15 22:09       ` Hugh Dickins
  2025-07-16  7:14         ` Kairui Song
  0 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2025-07-15 22:09 UTC (permalink / raw)
  To: Kairui Song, Andrew Morton
  Cc: Baolin Wang, linux-mm, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
	Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel


On Fri, 11 Jul 2025, Kairui Song wrote:
> On Fri, Jul 11, 2025 at 2:23 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> >
> >
> > On 2025/7/10 11:37, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Slightly tidy up the different handling of swap in and error handling
> > > for SWP_SYNCHRONOUS_IO and non-SWP_SYNCHRONOUS_IO devices. Now swapin
> > > will always use either shmem_swap_alloc_folio or shmem_swapin_cluster,
> > > then check the result.
> > >
> > > Simplify the control flow and avoid a redundant goto label.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> >
> > LGTM, with a nit as follows.
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >
> > > ---
> > >   mm/shmem.c | 45 +++++++++++++++++++--------------------------
> > >   1 file changed, 19 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index 847e6f128485..80f5b8c73eb8 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -2320,40 +2320,33 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > >                       count_memcg_event_mm(fault_mm, PGMAJFAULT);
> > >               }
> > >
> > > -             /* Skip swapcache for synchronous device. */
> > >               if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > > +                     /* Direct mTHP swapin skipping swap cache & readhaed */
> > >                       folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
> >
> > Nit: the 'mTHP' word can be confusing, since we will skip swapcache for
> > order 0 too. Please drop it.
> >
> 
> Yes, thanks for the review.

And a few words after that 'mTHP ', I keep wincing at 'readhaed':
Andrew, you already did a fix to remove the 'mTHP ', I hope we can
also persuade you to change 'readhaed' to 'readahead' there - thanks!

Kairui, I'm a little uneasy about the way this series does arithmetic
on swap.val, in the knowledge that swp_offset(entry) involves no shift.

Perhaps I haven't noticed, but I think this is the first place to
make that assumption; and a few years ago it was not true at all -
swp_type() was down the bottom.  Usually we would do it all with
swp_entry(swp_type(x), arithmetic_on(swp_offset(x))).
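
That is, for rounding a swap entry down to a folio boundary of nr
pages, the two styles would look roughly like (illustration only):

	/* relying on swp_offset() occupying the low bits, no shift */
	swap.val = round_down(swap.val, nr);

	/* the layout-agnostic form */
	swap = swp_entry(swp_type(swap), round_down(swp_offset(swap), nr));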

But I guess, let's just agree that it's easier to read and get right
the way you have it, and make no change: if I try to "correct" you,
or demand that you change it, we shall probably just bring in bugs.

I'm particularly glad that you now avoid SWP_SYNCHRONOUS_IO readahead:
that stupidity had very much annoyed me, once I realized it.

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 4/8] mm/shmem, swap: tidy up swap entry splitting
  2025-07-10  3:37 ` [PATCH v5 4/8] mm/shmem, swap: tidy up swap entry splitting Kairui Song
@ 2025-07-16  7:09   ` Baoquan He
  0 siblings, 0 replies; 22+ messages in thread
From: Baoquan He @ 2025-07-16  7:09 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang,
	Matthew Wilcox, Kemeng Shi, Chris Li, Nhat Pham, Barry Song,
	linux-kernel

On 07/10/25 at 11:37am, Kairui Song wrote:
......snip...
> @@ -2321,46 +2323,35 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  		}
>  
>  		/*
> -		 * Now swap device can only swap in order 0 folio, then we
> -		 * should split the large swap entry stored in the pagecache
> -		 * if necessary.
> -		 */
> -		split_order = shmem_split_large_entry(inode, index, swap, gfp);
> -		if (split_order < 0) {
> -			error = split_order;
> -			goto failed;
> -		}
> -
> -		/*
> -		 * If the large swap entry has already been split, it is
> +		 * Now swap device can only swap in order 0 folio, it is
>  		 * necessary to recalculate the new swap entry based on
> -		 * the old order alignment.
> +		 * the offset, as the swapin index might be unaligned.
>  		 */
> -		if (split_order > 0) {
> -			pgoff_t offset = index - round_down(index, 1 << split_order);
> -
> +		if (order) {
> +			offset = index - round_down(index, 1 << order);
>  			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>  		}
>  
> -		/* Here we actually start the io */
>  		folio = shmem_swapin_cluster(swap, gfp, info, index);
>  		if (!folio) {
>  			error = -ENOMEM;
>  			goto failed;
>  		}
> -	} else if (order > folio_order(folio)) {
> +	}
> +alloced:

Here, only the synchronous device handling will jump to the 'alloced'
label, and its folio is already allocated with the requested order.
Maybe we should move the label down below this if/else checking and
handling?

Anyway, this is an intermediate patch and the code will be changed
later, so no strong opinion.

> +	if (order > folio_order(folio)) {
>  		/*
> -		 * Swap readahead may swap in order 0 folios into swapcache
> +		 * Swapin may get smaller folios due to various reasons:
> +		 * It may fallback to order 0 due to memory pressure or race,
> +		 * swap readahead may swap in order 0 folios into swapcache
>  		 * asynchronously, while the shmem mapping can still stores
>  		 * large swap entries. In such cases, we should split the
>  		 * large swap entry to prevent possible data corruption.
>  		 */
> -		split_order = shmem_split_large_entry(inode, index, swap, gfp);
> +		split_order = shmem_split_large_entry(inode, index, index_entry, gfp);
>  		if (split_order < 0) {
> -			folio_put(folio);
> -			folio = NULL;
>  			error = split_order;
> -			goto failed;
> +			goto failed_nolock;
>  		}
>  
>  		/*
...snip...



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 6/8] mm/shmem, swap: simplify swapin path and result handling
  2025-07-15 22:09       ` Hugh Dickins
@ 2025-07-16  7:14         ` Kairui Song
  0 siblings, 0 replies; 22+ messages in thread
From: Kairui Song @ 2025-07-16  7:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Baolin Wang, linux-mm, Matthew Wilcox, Kemeng Shi,
	Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel

On Wed, Jul 16, 2025 at 6:10 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Fri, 11 Jul 2025, Kairui Song wrote:
> > On Fri, Jul 11, 2025 at 2:23 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> > >
> > >
> > >
> > > On 2025/7/10 11:37, Kairui Song wrote:
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > Slightly tidy up the different handling of swap in and error handling
> > > > for SWP_SYNCHRONOUS_IO and non-SWP_SYNCHRONOUS_IO devices. Now swapin
> > > > will always use either shmem_swap_alloc_folio or shmem_swapin_cluster,
> > > > then check the result.
> > > >
> > > > Simplify the control flow and avoid a redundant goto label.
> > > >
> > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > >
> > > LGTM, with a nit as follows.
> > > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > >
> > > > ---
> > > >   mm/shmem.c | 45 +++++++++++++++++++--------------------------
> > > >   1 file changed, 19 insertions(+), 26 deletions(-)
> > > >
> > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > index 847e6f128485..80f5b8c73eb8 100644
> > > > --- a/mm/shmem.c
> > > > +++ b/mm/shmem.c
> > > > @@ -2320,40 +2320,33 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > > >                       count_memcg_event_mm(fault_mm, PGMAJFAULT);
> > > >               }
> > > >
> > > > -             /* Skip swapcache for synchronous device. */
> > > >               if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > > > +                     /* Direct mTHP swapin skipping swap cache & readhaed */
> > > >                       folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
> > >
> > > Nit: the 'mTHP' word can be confusing, since we will skip swapcache for
> > > order 0 too. Please drop it.
> > >
> >
> > Yes, thanks for the review.
>
> And a few words after that 'mTHP ', I keep wincing at 'readhaed':
> Andrew, you already did a fix to remove the 'mTHP ', I hope we can
> also persuade you to change 'readhaed' to 'readahead' there - thanks!
>
> Kairui, I'm a little uneasy about the way this series does arithmetic
> on swap.val, in the knowledge that swp_offset(entry) involves no shift.
>
> Perhaps I haven't noticed, but I think this is the first place to
> make that assumption; and a few years ago it was not true at all -
> swp_type() was down the bottom.  Usually we would do it all with
> swp_entry(swp_type(x), arithmetic_on(swp_offset(x))).
>
> But I guess, let's just agree that it's easier to read and get right
> the way you have it, and make no change: if I try to "correct" you,
> or demand that you change it, we shall probably just bring in bugs.

Thanks!

I think we can introduce some helpers later for things like rounding
a swap entry; we already have other similar helpers for swap entries.

The same arithmetic already appeared in memory.c some time ago, and I
remember seeing people do this several times. The current swap value
layout is fine with it, and a helper will make it easier to track.
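
Something like the below, perhaps (hypothetical name, just a sketch):

	/* Align a swap entry down to the start of an nr_pages folio */
	static inline swp_entry_t swp_entry_align_down(swp_entry_t entry,
						       unsigned int nr_pages)
	{
		return swp_entry(swp_type(entry),
				 round_down(swp_offset(entry), nr_pages));
	}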


> I'm particularly glad that you now avoid SWP_SYNCHRONOUS_IO readahead:
> that stupidity had very much annoyed me, once I realized it.
>
> Thanks,
> Hugh


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/8] mm/shmem, swap: improve cached mTHP handling and fix potential hung
  2025-07-10  3:36 ` [PATCH v5 1/8] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
@ 2025-07-24 17:02   ` Kairui Song
  2025-07-24 18:16     ` Kairui Song
  0 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2025-07-24 17:02 UTC (permalink / raw)
  To: Baolin Wang, Kemeng Shi, Andrew Morton, linux-mm
  Cc: Hugh Dickins, Matthew Wilcox, Chris Li, Nhat Pham, Baoquan He,
	Barry Song, linux-kernel, stable

On Thu, Jul 10, 2025 at 11:37 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The current swap-in code assumes that, when a swap entry in shmem mapping
> is order 0, its cached folios (if present) must be order 0 too, which
> turns out not always correct.
>
> The problem is shmem_split_large_entry is called before verifying the
> folio will eventually be swapped in, one possible race is:
>
>     CPU1                          CPU2
> shmem_swapin_folio
> /* swap in of order > 0 swap entry S1 */
>   folio = swap_cache_get_folio
>   /* folio = NULL */
>   order = xa_get_order
>   /* order > 0 */
>   folio = shmem_swap_alloc_folio
>   /* mTHP alloc failure, folio = NULL */
>   <... Interrupted ...>
>                                  shmem_swapin_folio
>                                  /* S1 is swapped in */
>                                  shmem_writeout
>                                  /* S1 is swapped out, folio cached */
>   shmem_split_large_entry(..., S1)
>   /* S1 is split, but the folio covering it has order > 0 now */
>
> Now any following swapin of S1 will hang: `xa_get_order` returns 0, and
> folio lookup will return a folio with order > 0.  The
> `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will always
> return false causing swap-in to return -EEXIST.
>
> And this looks fragile.  So fix this up by allowing seeing a larger folio
> in swap cache, and check the whole shmem mapping range covered by the
> swapin have the right swap value upon inserting the folio.  And drop the
> redundant tree walks before the insertion.
>
> This will actually improve performance, as it avoids two redundant Xarray
> tree walks in the hot path, and the only side effect is that in the
> failure path, shmem may redundantly reallocate a few folios causing
> temporary slight memory pressure.
>
> And worth noting, it may seems the order and value check before inserting
> might help reducing the lock contention, which is not true.  The swap
> cache layer ensures raced swapin will either see a swap cache folio or
> failed to do a swapin (we have SWAP_HAS_CACHE bit even if swap cache is
> bypassed), so holding the folio lock and checking the folio flag is
> already good enough for avoiding the lock contention.  The chance that a
> folio passes the swap entry value check but the shmem mapping slot has
> changed should be very low.
>
> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: <stable@vger.kernel.org>
> ---
>  mm/shmem.c | 30 +++++++++++++++++++++---------
>  1 file changed, 21 insertions(+), 9 deletions(-)

Hi All,

Just found some issue here with this patch...

>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 334b7b4a61a0..e3c9a1365ff4 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
>                                    pgoff_t index, void *expected, gfp_t gfp)
>  {
>         XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
> -       long nr = folio_nr_pages(folio);
> +       unsigned long nr = folio_nr_pages(folio);
> +       swp_entry_t iter, swap;
> +       void *entry;
>
>         VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
>         VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> @@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
>
>         gfp &= GFP_RECLAIM_MASK;
>         folio_throttle_swaprate(folio, gfp);
> +       swap = iter = radix_to_swp_entry(expected);
>
>         do {
>                 xas_lock_irq(&xas);

I missed a xas_reset here, also better reset iter value too.

> -               if (expected != xas_find_conflict(&xas)) {
> -                       xas_set_err(&xas, -EEXIST);
> -                       goto unlock;
> +               xas_for_each_conflict(&xas, entry) {
> +                       /*
> +                        * The range must either be empty, or filled with
> +                        * expected swap entries. Shmem swap entries are never
> +                        * partially freed without split of both entry and
> +                        * folio, so there shouldn't be any holes.
> +                        */
> +                       if (!expected || entry != swp_to_radix_entry(iter)) {
> +                               xas_set_err(&xas, -EEXIST);
> +                               goto unlock;
> +                       }
> +                       iter.val += 1 << xas_get_order(&xas);
>                 }
> -               if (expected && xas_find_conflict(&xas)) {
> +               if (expected && iter.val - nr != swap.val) {
>                         xas_set_err(&xas, -EEXIST);
>                         goto unlock;
>                 }
> @@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>                         error = -ENOMEM;
>                         goto failed;
>                 }
> -       } else if (order != folio_order(folio)) {
> +       } else if (order > folio_order(folio)) {
>                 /*
>                  * Swap readahead may swap in order 0 folios into swapcache
>                  * asynchronously, while the shmem mapping can still stores
> @@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>
>                         swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>                 }
> +       } else if (order < folio_order(folio)) {
> +               swap.val = round_down(swap.val, 1 << folio_order(folio));
>         }
>
>  alloced:
>         /* We have to do this with folio locked to prevent races */
>         folio_lock(folio);
>         if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> -           folio->swap.val != swap.val ||
> -           !shmem_confirm_swap(mapping, index, swap) ||
> -           xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {

And this part is incorrect. This `shmem_confirm_swap(mapping, index,
swap) ` can't be simply omitted. Some functions below before the
shmem_add_to_page_cache shouldn't be called on folios that might have
already been mapped by others. This shmem_confirm_swap ensures that
won't happen.

It may seem like a small change, but it leads to some minor conflicts
in one or two following commits, the benchmark result will change too.
So I'll have to send a V6 I think.

We can remove this `shmem_confirm_swap`, but not in this series I
think, maybe after this. Need to re-arrange some functions, with some
clean ups for shmem_add_to_page_cache and others.

> +           folio->swap.val != swap.val) {
>                 error = -EEXIST;
>                 goto unlock;
>         }
> --
> 2.50.0
>

In summary, I'll squash this patch into it and do a rebase of later commits:

diff --git a/mm/shmem.c b/mm/shmem.c
index e3c9a1365ff4..4ca0b665b79e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -898,9 +898,11 @@ static int shmem_add_to_page_cache(struct folio *folio,

        gfp &= GFP_RECLAIM_MASK;
        folio_throttle_swaprate(folio, gfp);
-       swap = iter = radix_to_swp_entry(expected);
+       swap = radix_to_swp_entry(expected);

        do {
+               iter = swap;
+               xas_reset(&xas);
                xas_lock_irq(&xas);
                xas_for_each_conflict(&xas, entry) {
                        /*
@@ -2365,9 +2367,16 @@ static int shmem_swapin_folio(struct inode
*inode, pgoff_t index,
        }

 alloced:
-       /* We have to do this with folio locked to prevent races */
+       /*
+        * We have to do this with folio locked to prevent races.
+        * The shmem_confirm_swap below only checks if the first swap
+        * entry matches the folio, that's enough to ensure the folio
> +        * is not used outside of shmem, as shmem swap entries
+        * and swap cache folios are never partially freed.
+        */
        folio_lock(folio);
        if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
+           !shmem_confirm_swap(mapping, index, swap) ||
            folio->swap.val != swap.val) {
                error = -EEXIST;
                goto unlock;

And I'll do some clean up afterward to get rid of this
shmem_confirm_swap. What do you think?


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/8] mm/shmem, swap: improve cached mTHP handling and fix potential hung
  2025-07-24 17:02   ` Kairui Song
@ 2025-07-24 18:16     ` Kairui Song
  2025-07-25  3:52       ` Baolin Wang
  0 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2025-07-24 18:16 UTC (permalink / raw)
  To: Baolin Wang, Kemeng Shi, Andrew Morton, linux-mm
  Cc: Hugh Dickins, Matthew Wilcox, Chris Li, Nhat Pham, Baoquan He,
	Barry Song, linux-kernel, stable

On Fri, Jul 25, 2025 at 1:02 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Jul 10, 2025 at 11:37 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > The current swap-in code assumes that, when a swap entry in shmem mapping
> > is order 0, its cached folios (if present) must be order 0 too, which
> > turns out not always correct.
> >
> > The problem is shmem_split_large_entry is called before verifying the
> > folio will eventually be swapped in, one possible race is:
> >
> >     CPU1                          CPU2
> > shmem_swapin_folio
> > /* swap in of order > 0 swap entry S1 */
> >   folio = swap_cache_get_folio
> >   /* folio = NULL */
> >   order = xa_get_order
> >   /* order > 0 */
> >   folio = shmem_swap_alloc_folio
> >   /* mTHP alloc failure, folio = NULL */
> >   <... Interrupted ...>
> >                                  shmem_swapin_folio
> >                                  /* S1 is swapped in */
> >                                  shmem_writeout
> >                                  /* S1 is swapped out, folio cached */
> >   shmem_split_large_entry(..., S1)
> >   /* S1 is split, but the folio covering it has order > 0 now */
> >
> > Now any following swapin of S1 will hang: `xa_get_order` returns 0, and
> > folio lookup will return a folio with order > 0.  The
> > `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will always
> > return false causing swap-in to return -EEXIST.
> >
> > And this looks fragile.  So fix this up by allowing seeing a larger folio
> > in swap cache, and check the whole shmem mapping range covered by the
> > swapin have the right swap value upon inserting the folio.  And drop the
> > redundant tree walks before the insertion.
> >
> > This will actually improve performance, as it avoids two redundant Xarray
> > tree walks in the hot path, and the only side effect is that in the
> > failure path, shmem may redundantly reallocate a few folios causing
> > temporary slight memory pressure.
> >
> > And worth noting, it may seems the order and value check before inserting
> > might help reducing the lock contention, which is not true.  The swap
> > cache layer ensures raced swapin will either see a swap cache folio or
> > failed to do a swapin (we have SWAP_HAS_CACHE bit even if swap cache is
> > bypassed), so holding the folio lock and checking the folio flag is
> > already good enough for avoiding the lock contention.  The chance that a
> > folio passes the swap entry value check but the shmem mapping slot has
> > changed should be very low.
> >
> > Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  mm/shmem.c | 30 +++++++++++++++++++++---------
> >  1 file changed, 21 insertions(+), 9 deletions(-)
>
> Hi All,
>
> Just found some issue here with this patch...
>
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 334b7b4a61a0..e3c9a1365ff4 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
> >                                    pgoff_t index, void *expected, gfp_t gfp)
> >  {
> >         XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
> > -       long nr = folio_nr_pages(folio);
> > +       unsigned long nr = folio_nr_pages(folio);
> > +       swp_entry_t iter, swap;
> > +       void *entry;
> >
> >         VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
> >         VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> > @@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
> >
> >         gfp &= GFP_RECLAIM_MASK;
> >         folio_throttle_swaprate(folio, gfp);
> > +       swap = iter = radix_to_swp_entry(expected);
> >
> >         do {
> >                 xas_lock_irq(&xas);
>
> I missed a xas_reset here, also better reset iter value too.
>
> > -               if (expected != xas_find_conflict(&xas)) {
> > -                       xas_set_err(&xas, -EEXIST);
> > -                       goto unlock;
> > +               xas_for_each_conflict(&xas, entry) {
> > +                       /*
> > +                        * The range must either be empty, or filled with
> > +                        * expected swap entries. Shmem swap entries are never
> > +                        * partially freed without split of both entry and
> > +                        * folio, so there shouldn't be any holes.
> > +                        */
> > +                       if (!expected || entry != swp_to_radix_entry(iter)) {
> > +                               xas_set_err(&xas, -EEXIST);
> > +                               goto unlock;
> > +                       }
> > +                       iter.val += 1 << xas_get_order(&xas);
> >                 }
> > -               if (expected && xas_find_conflict(&xas)) {
> > +               if (expected && iter.val - nr != swap.val) {
> >                         xas_set_err(&xas, -EEXIST);
> >                         goto unlock;
> >                 }
> > @@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >                         error = -ENOMEM;
> >                         goto failed;
> >                 }
> > -       } else if (order != folio_order(folio)) {
> > +       } else if (order > folio_order(folio)) {
> >                 /*
> >                  * Swap readahead may swap in order 0 folios into swapcache
> >                  * asynchronously, while the shmem mapping can still stores
> > @@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >
> >                         swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> >                 }
> > +       } else if (order < folio_order(folio)) {
> > +               swap.val = round_down(swap.val, 1 << folio_order(folio));
> >         }
> >
> >  alloced:
> >         /* We have to do this with folio locked to prevent races */
> >         folio_lock(folio);
> >         if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> > -           folio->swap.val != swap.val ||
> > -           !shmem_confirm_swap(mapping, index, swap) ||
> > -           xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
>
> And this part is incorrect. This `shmem_confirm_swap(mapping, index,
> swap) ` can't be simply omitted. Some functions below before the
> shmem_add_to_page_cache shouldn't be called on folios that might have
> already been mapped by others. This shmem_confirm_swap ensures that
> won't happen.
>
> It may seem like a small change, but it leads to some minor conflicts
> in one or two following commits, the benchmark result will change too.
> So I'll have to send a V6 I think.
>
> We can remove this `shmem_confirm_swap`, but not in this series I
> think, maybe after this. Need to re-arrange some functions, with some
> clean ups for shmem_add_to_page_cache and others.
>
> > +           folio->swap.val != swap.val) {
> >                 error = -EEXIST;
> >                 goto unlock;
> >         }
> > --
> > 2.50.0
> >
>
> In summary, I'll squash this patch into it and do a rebase of later commits:
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e3c9a1365ff4..4ca0b665b79e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -898,9 +898,11 @@ static int shmem_add_to_page_cache(struct folio *folio,
>
>         gfp &= GFP_RECLAIM_MASK;
>         folio_throttle_swaprate(folio, gfp);
> -       swap = iter = radix_to_swp_entry(expected);
> +       swap = radix_to_swp_entry(expected);
>
>         do {
> +               iter = swap;
> +               xas_reset(&xas);

Correction: this xas_reset is not needed, the iter = swap is needed.

>                 xas_lock_irq(&xas);
>                 xas_for_each_conflict(&xas, entry) {
>                         /*
> @@ -2365,9 +2367,16 @@ static int shmem_swapin_folio(struct inode
> *inode, pgoff_t index,
>         }
>
>  alloced:

And it needs `nr_pages = folio_nr_pages(folio); index =
round_down(index, nr_pages);` here...

> -       /* We have to do this with folio locked to prevent races */
> +       /*
> +        * We have to do this with folio locked to prevent races.
> +        * The shmem_confirm_swap below only checks if the first swap
> +        * entry matches the folio, that's enough to ensure the folio
> >> +        * is not used outside of shmem, as shmem swap entries
> +        * and swap cache folios are never partially freed.
> +        */
>         folio_lock(folio);
>         if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> +           !shmem_confirm_swap(mapping, index, swap) ||
>             folio->swap.val != swap.val) {
>                 error = -EEXIST;
>                 goto unlock;
>
> And I'll do some clean up afterward to get rid of this
> shmem_confirm_swap. What do you think?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/8] mm/shmem, swap: improve cached mTHP handling and fix potential hung
  2025-07-24 18:16     ` Kairui Song
@ 2025-07-25  3:52       ` Baolin Wang
  2025-07-25  4:54         ` Kairui Song
  0 siblings, 1 reply; 22+ messages in thread
From: Baolin Wang @ 2025-07-25  3:52 UTC (permalink / raw)
  To: Kairui Song, Kemeng Shi, Andrew Morton, linux-mm
  Cc: Hugh Dickins, Matthew Wilcox, Chris Li, Nhat Pham, Baoquan He,
	Barry Song, linux-kernel, stable



On 2025/7/25 02:16, Kairui Song wrote:
> On Fri, Jul 25, 2025 at 1:02 AM Kairui Song <ryncsn@gmail.com> wrote:
>>
>> On Thu, Jul 10, 2025 at 11:37 AM Kairui Song <ryncsn@gmail.com> wrote:
>>>
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> The current swap-in code assumes that, when a swap entry in shmem mapping
>>> is order 0, its cached folios (if present) must be order 0 too, which
>>> turns out not always correct.
>>>
>>> The problem is shmem_split_large_entry is called before verifying the
>>> folio will eventually be swapped in, one possible race is:
>>>
>>>      CPU1                          CPU2
>>> shmem_swapin_folio
>>> /* swap in of order > 0 swap entry S1 */
>>>    folio = swap_cache_get_folio
>>>    /* folio = NULL */
>>>    order = xa_get_order
>>>    /* order > 0 */
>>>    folio = shmem_swap_alloc_folio
>>>    /* mTHP alloc failure, folio = NULL */
>>>    <... Interrupted ...>
>>>                                   shmem_swapin_folio
>>>                                   /* S1 is swapped in */
>>>                                   shmem_writeout
>>>                                   /* S1 is swapped out, folio cached */
>>>    shmem_split_large_entry(..., S1)
>>>    /* S1 is split, but the folio covering it has order > 0 now */
>>>
>>> Now any following swapin of S1 will hang: `xa_get_order` returns 0, and
>>> folio lookup will return a folio with order > 0.  The
>>> `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will always
>>> return false causing swap-in to return -EEXIST.
>>>
>>> And this looks fragile.  So fix this up by allowing seeing a larger folio
>>> in swap cache, and check the whole shmem mapping range covered by the
>>> swapin have the right swap value upon inserting the folio.  And drop the
>>> redundant tree walks before the insertion.
>>>
>>> This will actually improve performance, as it avoids two redundant Xarray
>>> tree walks in the hot path, and the only side effect is that in the
>>> failure path, shmem may redundantly reallocate a few folios causing
>>> temporary slight memory pressure.
>>>
>>> And worth noting, it may seems the order and value check before inserting
>>> might help reducing the lock contention, which is not true.  The swap
>>> cache layer ensures raced swapin will either see a swap cache folio or
>>> failed to do a swapin (we have SWAP_HAS_CACHE bit even if swap cache is
>>> bypassed), so holding the folio lock and checking the folio flag is
>>> already good enough for avoiding the lock contention.  The chance that a
>>> folio passes the swap entry value check but the shmem mapping slot has
>>> changed should be very low.
>>>
>>> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
>>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: <stable@vger.kernel.org>
>>> ---
>>>   mm/shmem.c | 30 +++++++++++++++++++++---------
>>>   1 file changed, 21 insertions(+), 9 deletions(-)
>>
>> Hi All,
>>
>> Just found some issue here with this patch...
>>
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 334b7b4a61a0..e3c9a1365ff4 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
>>>                                     pgoff_t index, void *expected, gfp_t gfp)
>>>   {
>>>          XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
>>> -       long nr = folio_nr_pages(folio);
>>> +       unsigned long nr = folio_nr_pages(folio);
>>> +       swp_entry_t iter, swap;
>>> +       void *entry;
>>>
>>>          VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
>>>          VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>> @@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
>>>
>>>          gfp &= GFP_RECLAIM_MASK;
>>>          folio_throttle_swaprate(folio, gfp);
>>> +       swap = iter = radix_to_swp_entry(expected);
>>>
>>>          do {
>>>                  xas_lock_irq(&xas);
>>
>> I missed a xas_reset here, also better reset iter value too.
>>
>>> -               if (expected != xas_find_conflict(&xas)) {
>>> -                       xas_set_err(&xas, -EEXIST);
>>> -                       goto unlock;
>>> +               xas_for_each_conflict(&xas, entry) {
>>> +                       /*
>>> +                        * The range must either be empty, or filled with
>>> +                        * expected swap entries. Shmem swap entries are never
>>> +                        * partially freed without split of both entry and
>>> +                        * folio, so there shouldn't be any holes.
>>> +                        */
>>> +                       if (!expected || entry != swp_to_radix_entry(iter)) {
>>> +                               xas_set_err(&xas, -EEXIST);
>>> +                               goto unlock;
>>> +                       }
>>> +                       iter.val += 1 << xas_get_order(&xas);
>>>                  }
>>> -               if (expected && xas_find_conflict(&xas)) {
>>> +               if (expected && iter.val - nr != swap.val) {
>>>                          xas_set_err(&xas, -EEXIST);
>>>                          goto unlock;
>>>                  }
>>> @@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>                          error = -ENOMEM;
>>>                          goto failed;
>>>                  }
>>> -       } else if (order != folio_order(folio)) {
>>> +       } else if (order > folio_order(folio)) {
>>>                  /*
>>>                   * Swap readahead may swap in order 0 folios into swapcache
>>>                   * asynchronously, while the shmem mapping can still stores
>>> @@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>
>>>                          swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>                  }
>>> +       } else if (order < folio_order(folio)) {
>>> +               swap.val = round_down(swap.val, 1 << folio_order(folio));
>>>          }
>>>
>>>   alloced:
>>>          /* We have to do this with folio locked to prevent races */
>>>          folio_lock(folio);
>>>          if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
>>> -           folio->swap.val != swap.val ||
>>> -           !shmem_confirm_swap(mapping, index, swap) ||
>>> -           xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
>>
>> And this part is incorrect. This `shmem_confirm_swap(mapping, index,
>> swap) ` can't be simply omitted. Some functions below before the
>> shmem_add_to_page_cache shouldn't be called on folios that might have
>> already been mapped by others. This shmem_confirm_swap ensures that
>> won't happen.

OK, thanks for the reminder. But could you elaborate a bit? Which
functions should not be called, and what problems might be caused?

>> It may seem like a small change, but it leads to some minor conflicts
>> in one or two following commits, the benchmark result will change too.
>> So I'll have to send a V6 I think.
>>
>> We can remove this `shmem_confirm_swap`, but not in this series I
>> think, maybe after this. Need to re-arrange some functions, with some
>> clean ups for shmem_add_to_page_cache and others.
>>
>>> +           folio->swap.val != swap.val) {
>>>                  error = -EEXIST;
>>>                  goto unlock;
>>>          }
>>> --
>>> 2.50.0
>>>
>>
>> In summary, I'll squash this patch into it and do a rebase of later commits:
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index e3c9a1365ff4..4ca0b665b79e 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -898,9 +898,11 @@ static int shmem_add_to_page_cache(struct folio *folio,
>>
>>          gfp &= GFP_RECLAIM_MASK;
>>          folio_throttle_swaprate(folio, gfp);
>> -       swap = iter = radix_to_swp_entry(expected);
>> +       swap = radix_to_swp_entry(expected);
>>
>>          do {
>> +               iter = swap;
>> +               xas_reset(&xas);
> 
> Correction: this xas_reset is not needed, the iter = swap is needed.

Indeed, my tests do not cover the scenario where xas_nomem() returns true.

>>                  xas_lock_irq(&xas);
>>                  xas_for_each_conflict(&xas, entry) {
>>                          /*
>> @@ -2365,9 +2367,16 @@ static int shmem_swapin_folio(struct inode
>> *inode, pgoff_t index,
>>          }
>>
>>   alloced:
> 
> And it needs `nr_pages = folio_nr_pages(folio); index =
> round_down(index, nr_pages);` here...

IIUC, the index alignment should move into the 'order < 
folio_order(folio)' branch?
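
i.e. roughly like this (sketch):

	} else if (order < folio_order(folio)) {
		swap.val = round_down(swap.val, 1 << folio_order(folio));
		index = round_down(index, folio_nr_pages(folio));
	}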

>> -       /* We have to do this with folio locked to prevent races */
>> +       /*
>> +        * We have to do this with folio locked to prevent races.
>> +        * The shmem_confirm_swap below only checks if the first swap
>> +        * entry matches the folio, that's enough to ensure the folio
>> +        * is not used outside of shmem, as shmem swap entries
>> +        * and swap cache folios are never partially freed.
>> +        */
>>          folio_lock(folio);
>>          if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
>> +           !shmem_confirm_swap(mapping, index, swap) ||
>>              folio->swap.val != swap.val) {
>>                  error = -EEXIST;
>>                  goto unlock;
>>
>> And I'll do some clean up afterward to get rid of this
>> shmem_confirm_swap. What do you think?



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 1/8] mm/shmem, swap: improve cached mTHP handling and fix potential hung
  2025-07-25  3:52       ` Baolin Wang
@ 2025-07-25  4:54         ` Kairui Song
  0 siblings, 0 replies; 22+ messages in thread
From: Kairui Song @ 2025-07-25  4:54 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Kemeng Shi, Andrew Morton, linux-mm, Hugh Dickins, Matthew Wilcox,
	Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel, stable

On Fri, Jul 25, 2025 at 11:52 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/7/25 02:16, Kairui Song wrote:
> > On Fri, Jul 25, 2025 at 1:02 AM Kairui Song <ryncsn@gmail.com> wrote:
> >>
> >> On Thu, Jul 10, 2025 at 11:37 AM Kairui Song <ryncsn@gmail.com> wrote:
> >>>
> >>> From: Kairui Song <kasong@tencent.com>
> >>>
> >>> The current swap-in code assumes that, when a swap entry in shmem mapping
> >>> is order 0, its cached folios (if present) must be order 0 too, which
> >>> turns out not always correct.
> >>>
> >>> The problem is shmem_split_large_entry is called before verifying the
> >>> folio will eventually be swapped in, one possible race is:
> >>>
> >>>      CPU1                          CPU2
> >>> shmem_swapin_folio
> >>> /* swap in of order > 0 swap entry S1 */
> >>>    folio = swap_cache_get_folio
> >>>    /* folio = NULL */
> >>>    order = xa_get_order
> >>>    /* order > 0 */
> >>>    folio = shmem_swap_alloc_folio
> >>>    /* mTHP alloc failure, folio = NULL */
> >>>    <... Interrupted ...>
> >>>                                   shmem_swapin_folio
> >>>                                   /* S1 is swapped in */
> >>>                                   shmem_writeout
> >>>                                   /* S1 is swapped out, folio cached */
> >>>    shmem_split_large_entry(..., S1)
> >>>    /* S1 is split, but the folio covering it has order > 0 now */
> >>>
> >>> Now any following swapin of S1 will hang: `xa_get_order` returns 0, and
> >>> folio lookup will return a folio with order > 0.  The
> >>> `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will always
> >>> return false causing swap-in to return -EEXIST.
> >>>
> >>> And this looks fragile.  So fix this up by allowing seeing a larger folio
> >>> in swap cache, and check the whole shmem mapping range covered by the
> >>> swapin have the right swap value upon inserting the folio.  And drop the
> >>> redundant tree walks before the insertion.
> >>>
> >>> This will actually improve performance, as it avoids two redundant Xarray
> >>> tree walks in the hot path, and the only side effect is that in the
> >>> failure path, shmem may redundantly reallocate a few folios causing
> >>> temporary slight memory pressure.
> >>>
> >>> And worth noting, it may seems the order and value check before inserting
> >>> might help reducing the lock contention, which is not true.  The swap
> >>> cache layer ensures raced swapin will either see a swap cache folio or
> >>> failed to do a swapin (we have SWAP_HAS_CACHE bit even if swap cache is
> >>> bypassed), so holding the folio lock and checking the folio flag is
> >>> already good enough for avoiding the lock contention.  The chance that a
> >>> folio passes the swap entry value check but the shmem mapping slot has
> >>> changed should be very low.
> >>>
> >>> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
> >>> Signed-off-by: Kairui Song <kasong@tencent.com>
> >>> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
> >>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>> Cc: <stable@vger.kernel.org>
> >>> ---
> >>>   mm/shmem.c | 30 +++++++++++++++++++++---------
> >>>   1 file changed, 21 insertions(+), 9 deletions(-)
> >>
> >> Hi All,
> >>
> >> Just found some issue here with this patch...
> >>
> >>>
> >>> diff --git a/mm/shmem.c b/mm/shmem.c
> >>> index 334b7b4a61a0..e3c9a1365ff4 100644
> >>> --- a/mm/shmem.c
> >>> +++ b/mm/shmem.c
> >>> @@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
> >>>                                     pgoff_t index, void *expected, gfp_t gfp)
> >>>   {
> >>>          XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
> >>> -       long nr = folio_nr_pages(folio);
> >>> +       unsigned long nr = folio_nr_pages(folio);
> >>> +       swp_entry_t iter, swap;
> >>> +       void *entry;
> >>>
> >>>          VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
> >>>          VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> >>> @@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
> >>>
> >>>          gfp &= GFP_RECLAIM_MASK;
> >>>          folio_throttle_swaprate(folio, gfp);
> >>> +       swap = iter = radix_to_swp_entry(expected);
> >>>
> >>>          do {
> >>>                  xas_lock_irq(&xas);
> >>
> >> I missed a xas_reset here, also better reset iter value too.
> >>
> >>> -               if (expected != xas_find_conflict(&xas)) {
> >>> -                       xas_set_err(&xas, -EEXIST);
> >>> -                       goto unlock;
> >>> +               xas_for_each_conflict(&xas, entry) {
> >>> +                       /*
> >>> +                        * The range must either be empty, or filled with
> >>> +                        * expected swap entries. Shmem swap entries are never
> >>> +                        * partially freed without split of both entry and
> >>> +                        * folio, so there shouldn't be any holes.
> >>> +                        */
> >>> +                       if (!expected || entry != swp_to_radix_entry(iter)) {
> >>> +                               xas_set_err(&xas, -EEXIST);
> >>> +                               goto unlock;
> >>> +                       }
> >>> +                       iter.val += 1 << xas_get_order(&xas);
> >>>                  }
> >>> -               if (expected && xas_find_conflict(&xas)) {
> >>> +               if (expected && iter.val - nr != swap.val) {
> >>>                          xas_set_err(&xas, -EEXIST);
> >>>                          goto unlock;
> >>>                  }
> >>> @@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>>                          error = -ENOMEM;
> >>>                          goto failed;
> >>>                  }
> >>> -       } else if (order != folio_order(folio)) {
> >>> +       } else if (order > folio_order(folio)) {
> >>>                  /*
> >>>                   * Swap readahead may swap in order 0 folios into swapcache
> >>>                   * asynchronously, while the shmem mapping can still stores
> >>> @@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>>
> >>>                          swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> >>>                  }
> >>> +       } else if (order < folio_order(folio)) {
> >>> +               swap.val = round_down(swap.val, 1 << folio_order(folio));
> >>>          }
> >>>
> >>>   alloced:
> >>>          /* We have to do this with folio locked to prevent races */
> >>>          folio_lock(folio);
> >>>          if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> >>> -           folio->swap.val != swap.val ||
> >>> -           !shmem_confirm_swap(mapping, index, swap) ||
> >>> -           xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
> >>
> >> And this part is incorrect. This `shmem_confirm_swap(mapping, index,
> >> swap) ` can't be simply omitted. Some functions below before the
> >> shmem_add_to_page_cache shouldn't be called on folios that might have
> >> already been mapped by others. This shmem_confirm_swap ensures that
> >> won't happen.
>
> OK, thanks for the reminder. But could you elaborate a bit? Which
> functions should not be called, and what problems might be caused?

Yes, first shmem_add_to_page_cache itself will reset the
folio->mapping and index before verifying the mapping.

So even if the folio is still a valid swap cache folio and the
folio->swap.val matches swap.val, a parallel swapin could have swapped
in and freed this folio from swap, and the folio might now be part of
anon memory:

CPU1                      CPU2
/* Start swap in of swap entry S1 */
shmem_swapin_folio
/* Interrupted */
                          /* Raced swap in of swap entry S1 */
                          shmem_swapin_folio
                          /* Swapin done, S1 is freed */

                          /* Anon swapout of folio A using S1 */
                          pageout(folio) != PAGE_SUCCESS
                          /* Now anon folio A is in swap cache */
folio = swap_cache_get_folio
/* Got folio A */
if (!folio_test_swapcache(folio)
    folio->swap.val != swap.val))
       error = -EEXIST;
/* Check passed, folio A is using S1 as swap entry */
shmem_add_to_page_cache
  folio->mapping = mapping
  /* BUG: folio->mapping is an anon mapping, info lost */

And I managed to trigger this issue; it results in at least an RSS
counter error like this:

[  1944.374356] BUG: Bad rss-counter state mm:ffff0000c1539640
type:MM_ANONPAGES val:1
[  1944.374384] BUG: Bad rss-counter state mm:ffff0000c1539640
type:MM_SHMEMPAGES val:-1

Clearly it will trigger even more issues.

As for other helpers like arch_swap_restore and shmem_replace_folio,
they seem to be OK, but if the folio is no longer part of shmem they
had better stay off of it too. So as a safety measure I think we'd
better add the shmem_confirm_swap back. And only checking the first
swap entry is good enough.


>
> >> It may seem like a small change, but it leads to some minor conflicts
> >> in one or two following commits, the benchmark result will change too.
> >> So I'll have to send a V6 I think.
> >>
> >> We can remove this `shmem_confirm_swap`, but not in this series I
> >> think, maybe after this. Need to re-arrange some functions, with some
> >> clean ups for shmem_add_to_page_cache and others.
> >>
> >>> +           folio->swap.val != swap.val) {
> >>>                  error = -EEXIST;
> >>>                  goto unlock;
> >>>          }
> >>> --
> >>> 2.50.0
> >>>
> >>
> >> In summary, I'll squash this patch into it and do a rebase of later commits:
> >>
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index e3c9a1365ff4..4ca0b665b79e 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -898,9 +898,11 @@ static int shmem_add_to_page_cache(struct folio *folio,
> >>
> >>          gfp &= GFP_RECLAIM_MASK;
> >>          folio_throttle_swaprate(folio, gfp);
> >> -       swap = iter = radix_to_swp_entry(expected);
> >> +       swap = radix_to_swp_entry(expected);
> >>
> >>          do {
> >> +               iter = swap;
> >> +               xas_reset(&xas);
> >
> > Correction: this xas_reset is not needed, the iter = swap is needed.
>
> Indeed, my tests do not cover the scenario where xas_nomem() returns true.
>
> >>                  xas_lock_irq(&xas);
> >>                  xas_for_each_conflict(&xas, entry) {
> >>                          /*
> >> @@ -2365,9 +2367,16 @@ static int shmem_swapin_folio(struct inode
> >> *inode, pgoff_t index,
> >>          }
> >>
> >>   alloced:
> >
> > And it needs `nr_pages = folio_nr_pages(folio); index =
> > round_down(index, nr_pages);` here...
>
> IIUC, the index alignment should move into the 'order <
> folio_order(folio)' branch?

Ok, I'll move it here. It should be fine either way.

>
> >> -       /* We have to do this with folio locked to prevent races */
> >> +       /*
> >> +        * We have to do this with folio locked to prevent races.
> >> +        * The shmem_confirm_swap below only checks if the first swap
> >> +        * entry matches the folio, that's enough to ensure the folio
> >> +        * is not used outside of shmem, as shmem swap entries
> >> +        * and swap cache folios are never partially freed.
> >> +        */
> >>          folio_lock(folio);
> >>          if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> >> +           !shmem_confirm_swap(mapping, index, swap) ||
> >>              folio->swap.val != swap.val) {
> >>                  error = -EEXIST;
> >>                  goto unlock;
> >>
> >> And I'll do some clean up afterward to get rid of this
> >> shmem_confirm_swap. What do you think?
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-07-25  4:55 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-10  3:36 [PATCH v5 0/8] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
2025-07-10  3:36 ` [PATCH v5 1/8] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
2025-07-24 17:02   ` Kairui Song
2025-07-24 18:16     ` Kairui Song
2025-07-25  3:52       ` Baolin Wang
2025-07-25  4:54         ` Kairui Song
2025-07-10  3:37 ` [PATCH v5 2/8] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
2025-07-10  3:37 ` [PATCH v5 3/8] mm/shmem, swap: tidy up THP swapin checks Kairui Song
2025-07-10  3:37 ` [PATCH v5 4/8] mm/shmem, swap: tidy up swap entry splitting Kairui Song
2025-07-16  7:09   ` Baoquan He
2025-07-10  3:37 ` [PATCH v5 5/8] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO Kairui Song
2025-07-11  6:10   ` Baolin Wang
2025-07-13 10:53   ` Barry Song
2025-07-10  3:37 ` [PATCH v5 6/8] mm/shmem, swap: simplify swapin path and result handling Kairui Song
2025-07-11  6:23   ` Baolin Wang
2025-07-11  6:28     ` Kairui Song
2025-07-15 22:09       ` Hugh Dickins
2025-07-16  7:14         ` Kairui Song
2025-07-10  3:37 ` [PATCH v5 7/8] mm/shmem, swap: rework swap entry and index calculation for large swapin Kairui Song
2025-07-11  6:36   ` Baolin Wang
2025-07-14  2:39     ` Baolin Wang
2025-07-10  3:37 ` [PATCH v5 8/8] mm/shmem, swap: fix major fault counting Kairui Song
