* [PATCH v4 1/9] mm/shmem, swap: improve cached mTHP handling and fix potential hang
2025-07-04 18:17 [PATCH v4 0/9] mm/shmem, swap: bugfix and improvement of mTHP swap-in Kairui Song
@ 2025-07-04 18:17 ` Kairui Song
2025-07-04 18:17 ` [PATCH v4 2/9] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
` (7 subsequent siblings)
8 siblings, 0 replies; 17+ messages in thread
From: Kairui Song @ 2025-07-04 18:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song, stable
From: Kairui Song <kasong@tencent.com>
The current swap-in code assumes that, when a swap entry in the shmem
mapping is order 0, its cached folios (if present) must be order 0 too,
which turns out to be not always correct.
The problem is that shmem_split_large_entry is called before verifying
that the folio will eventually be swapped in. One possible race is:
CPU1                                 CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
folio = swap_cache_get_folio
/* folio = NULL */
order = xa_get_order
/* order > 0 */
folio = shmem_swap_alloc_folio
/* mTHP alloc failure, folio = NULL */
<... Interrupted ...>
                                     shmem_swapin_folio
                                     /* S1 is swapped in */
                                     shmem_writeout
                                     /* S1 is swapped out, folio cached */
shmem_split_large_entry(..., S1)
/* S1 is split, but the folio covering it has order > 0 now */
Now any following swapin of S1 will hang: `xa_get_order` returns 0,
while the folio lookup returns a folio with order > 0. So the
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check
will always be true, and swap-in keeps returning -EEXIST while the
fault path keeps retrying.
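For illustration, a minimal sketch of the resulting retry loop
(simplified pseudo-code, not the exact kernel flow; it assumes the fault
path retries the whole swapin whenever it sees -EEXIST):

repeat:
	order = xa_get_order(&mapping->i_pages, index); /* 0: S1 was already split */
	folio = swap_cache_get_folio(swap, NULL, 0);    /* cached folio, order > 0 */
	...
	if (order != folio_order(folio))                /* always a mismatch now */
		return -EEXIST;                         /* caller jumps back to repeat, forever */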
And this looks fragile. So fix this up by allowing a larger folio to be
seen in the swap cache, and checking that the whole shmem mapping range
covered by the swapin has the right swap value upon inserting the folio
(see the sketch below). Also drop the redundant tree walks before the
insertion.
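As a rough illustration of the new insertion-time check, here is a small
userspace model (an assumption-level sketch only; struct slot and
range_matches() are made up for this example and are not kernel code).
It mimics how the conflict walk advances the expected value by
1 << order per slot and requires the walk to end exactly at swap.val + nr:

#include <stdbool.h>
#include <stdio.h>

struct slot { unsigned long val; int order; };	/* a swap entry found in the mapping */

static bool range_matches(unsigned long swap_val, unsigned long nr,
			  const struct slot *slots, int nslots)
{
	unsigned long iter = swap_val;

	for (int i = 0; i < nslots; i++) {
		if (slots[i].val != iter)
			return false;		/* unexpected entry: bail out (-EEXIST) */
		iter += 1UL << slots[i].order;	/* like iter.val += 1 << xas_get_order(&xas) */
	}
	return iter - nr == swap_val;		/* whole range covered, no holes */
}

int main(void)
{
	/* an order-2 swapin (nr = 4) over two order-1 swap entries */
	struct slot ok[] = { { 0x4000, 1 }, { 0x4002, 1 } };
	/* same range, but the second slot no longer holds the expected entry */
	struct slot bad[] = { { 0x4000, 1 }, { 0x4005, 1 } };

	printf("%d %d\n", range_matches(0x4000, 4, ok, 2),
			  range_matches(0x4000, 4, bad, 2));	/* prints: 1 0 */
	return 0;
}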
This will actually improve performance, as it avoids two redundant
Xarray tree walks in the hot path. The only side effect is that in the
failure path, shmem may redundantly reallocate a few folios, causing
slight temporary memory pressure.
Worth noting, it may seem that the order and value check before
inserting would help reduce lock contention, but that is not true. The
swap cache layer ensures a raced swapin will either see a swap cache
folio or fail to do the swapin (we have the SWAP_HAS_CACHE bit even if
the swap cache is bypassed), so holding the folio lock and checking the
folio flag is already good enough for avoiding the lock contention. The
chance that a folio passes the swap entry value check while the shmem
mapping slot has changed should be very low.
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
---
mm/shmem.c | 30 +++++++++++++++++++++---------
1 file changed, 21 insertions(+), 9 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 334b7b4a61a0..e3c9a1365ff4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
pgoff_t index, void *expected, gfp_t gfp)
{
XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
- long nr = folio_nr_pages(folio);
+ unsigned long nr = folio_nr_pages(folio);
+ swp_entry_t iter, swap;
+ void *entry;
VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
gfp &= GFP_RECLAIM_MASK;
folio_throttle_swaprate(folio, gfp);
+ swap = iter = radix_to_swp_entry(expected);
do {
xas_lock_irq(&xas);
- if (expected != xas_find_conflict(&xas)) {
- xas_set_err(&xas, -EEXIST);
- goto unlock;
+ xas_for_each_conflict(&xas, entry) {
+ /*
+ * The range must either be empty, or filled with
+ * expected swap entries. Shmem swap entries are never
+ * partially freed without split of both entry and
+ * folio, so there shouldn't be any holes.
+ */
+ if (!expected || entry != swp_to_radix_entry(iter)) {
+ xas_set_err(&xas, -EEXIST);
+ goto unlock;
+ }
+ iter.val += 1 << xas_get_order(&xas);
}
- if (expected && xas_find_conflict(&xas)) {
+ if (expected && iter.val - nr != swap.val) {
xas_set_err(&xas, -EEXIST);
goto unlock;
}
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
error = -ENOMEM;
goto failed;
}
- } else if (order != folio_order(folio)) {
+ } else if (order > folio_order(folio)) {
/*
* Swap readahead may swap in order 0 folios into swapcache
* asynchronously, while the shmem mapping can still stores
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
}
+ } else if (order < folio_order(folio)) {
+ swap.val = round_down(swap.val, 1 << folio_order(folio));
}
alloced:
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
- folio->swap.val != swap.val ||
- !shmem_confirm_swap(mapping, index, swap) ||
- xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+ folio->swap.val != swap.val) {
error = -EEXIST;
goto unlock;
}
--
2.50.0
* [PATCH v4 2/9] mm/shmem, swap: avoid redundant Xarray lookup during swapin
2025-07-04 18:17 [PATCH v4 0/9] mm/shmem, swap: bugfix and improvement of mTHP swap-in Kairui Song
2025-07-04 18:17 ` [PATCH v4 1/9] mm/shmem, swap: improve cached mTHP handling and fix potential hang Kairui Song
@ 2025-07-04 18:17 ` Kairui Song
2025-07-04 18:17 ` [PATCH v4 3/9] mm/shmem, swap: tidy up THP swapin checks Kairui Song
` (6 subsequent siblings)
8 siblings, 0 replies; 17+ messages in thread
From: Kairui Song @ 2025-07-04 18:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song, Dev Jain
From: Kairui Song <kasong@tencent.com>
Currently shmem calls xa_get_order to get the swap radix entry order,
requiring a full tree walk. This can easily be combined with the swap
entry value check (shmem_confirm_swap) to avoid the duplicated lookup,
which should improve performance.
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index e3c9a1365ff4..033dc7a3435d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -505,15 +505,27 @@ static int shmem_replace_entry(struct address_space *mapping,
/*
* Sometimes, before we decide whether to proceed or to fail, we must check
- * that an entry was not already brought back from swap by a racing thread.
+ * that an entry was not already brought back or split by a racing thread.
*
* Checking folio is not enough: by the time a swapcache folio is locked, it
* might be reused, and again be swapcache, using the same swap as before.
+ * Returns the swap entry's order if it is still present, else returns -1.
*/
-static bool shmem_confirm_swap(struct address_space *mapping,
- pgoff_t index, swp_entry_t swap)
+static int shmem_confirm_swap(struct address_space *mapping, pgoff_t index,
+ swp_entry_t swap)
{
- return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
+ XA_STATE(xas, &mapping->i_pages, index);
+ int ret = -1;
+ void *entry;
+
+ rcu_read_lock();
+ do {
+ entry = xas_load(&xas);
+ if (entry == swp_to_radix_entry(swap))
+ ret = xas_get_order(&xas);
+ } while (xas_retry(&xas, entry));
+ rcu_read_unlock();
+ return ret;
}
/*
@@ -2256,16 +2268,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
return -EIO;
si = get_swap_device(swap);
- if (!si) {
- if (!shmem_confirm_swap(mapping, index, swap))
+ order = shmem_confirm_swap(mapping, index, swap);
+ if (unlikely(!si)) {
+ if (order < 0)
return -EEXIST;
else
return -EINVAL;
}
+ if (unlikely(order < 0)) {
+ put_swap_device(si);
+ return -EEXIST;
+ }
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap, NULL, 0);
- order = xa_get_order(&mapping->i_pages, index);
if (!folio) {
int nr_pages = 1 << order;
bool fallback_order0 = false;
@@ -2415,7 +2431,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
*foliop = folio;
return 0;
failed:
- if (!shmem_confirm_swap(mapping, index, swap))
+ if (shmem_confirm_swap(mapping, index, swap) < 0)
error = -EEXIST;
if (error == -EIO)
shmem_set_folio_swapin_error(inode, index, folio, swap,
@@ -2428,7 +2444,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio_put(folio);
}
put_swap_device(si);
-
return error;
}
--
2.50.0
* [PATCH v4 3/9] mm/shmem, swap: tidy up THP swapin checks
2025-07-04 18:17 [PATCH v4 0/9] mm/shmem, swap: bugfix and improvement of mTHP swap-in Kairui Song
2025-07-04 18:17 ` [PATCH v4 1/9] mm/shmem, swap: improve cached mTHP handling and fix potential hang Kairui Song
2025-07-04 18:17 ` [PATCH v4 2/9] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
@ 2025-07-04 18:17 ` Kairui Song
2025-07-04 18:17 ` [PATCH v4 4/9] mm/shmem, swap: tidy up swap entry splitting Kairui Song
` (5 subsequent siblings)
8 siblings, 0 replies; 17+ messages in thread
From: Kairui Song @ 2025-07-04 18:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Move all THP swapin related checks under CONFIG_TRANSPARENT_HUGEPAGE,
so they will be trimmed off by the compiler if not needed.
And add a WARN if shmem sees an order > 0 entry when
CONFIG_TRANSPARENT_HUGEPAGE is disabled; that should never happen unless
something went very wrong.
There should be no observable feature change except the newly added WARN.
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 39 ++++++++++++++++++---------------------
1 file changed, 18 insertions(+), 21 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 033dc7a3435d..e43becfa04b3 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1980,26 +1980,38 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
swp_entry_t entry, int order, gfp_t gfp)
{
struct shmem_inode_info *info = SHMEM_I(inode);
+ int nr_pages = 1 << order;
struct folio *new;
void *shadow;
- int nr_pages;
/*
* We have arrived here because our zones are constrained, so don't
* limit chance of success with further cpuset and node constraints.
*/
gfp &= ~GFP_CONSTRAINT_MASK;
- if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && order > 0) {
- gfp_t huge_gfp = vma_thp_gfp_mask(vma);
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+ if (WARN_ON_ONCE(order))
+ return ERR_PTR(-EINVAL);
+ } else if (order) {
+ /*
+ * If uffd is active for the vma, we need per-page fault
+ * fidelity to maintain the uffd semantics, then fallback
+ * to swapin order-0 folio, as well as for zswap case.
+ * Any existing sub folio in the swap cache also blocks
+ * mTHP swapin.
+ */
+ if ((vma && unlikely(userfaultfd_armed(vma))) ||
+ !zswap_never_enabled() ||
+ non_swapcache_batch(entry, nr_pages) != nr_pages)
+ return ERR_PTR(-EINVAL);
- gfp = limit_gfp_mask(huge_gfp, gfp);
+ gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
}
new = shmem_alloc_folio(gfp, order, info, index);
if (!new)
return ERR_PTR(-ENOMEM);
- nr_pages = folio_nr_pages(new);
if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
gfp, entry)) {
folio_put(new);
@@ -2283,9 +2295,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap, NULL, 0);
if (!folio) {
- int nr_pages = 1 << order;
- bool fallback_order0 = false;
-
/* Or update major stats only when swapin succeeds?? */
if (fault_type) {
*fault_type |= VM_FAULT_MAJOR;
@@ -2293,20 +2302,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
- /*
- * If uffd is active for the vma, we need per-page fault
- * fidelity to maintain the uffd semantics, then fallback
- * to swapin order-0 folio, as well as for zswap case.
- * Any existing sub folio in the swap cache also blocks
- * mTHP swapin.
- */
- if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
- !zswap_never_enabled() ||
- non_swapcache_batch(swap, nr_pages) != nr_pages))
- fallback_order0 = true;
-
/* Skip swapcache for synchronous device. */
- if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
if (!IS_ERR(folio)) {
skip_swapcache = true;
--
2.50.0
* [PATCH v4 4/9] mm/shmem, swap: tidy up swap entry splitting
2025-07-04 18:17 [PATCH v4 0/9] mm/shmem, swap: bugfix and improvement of mTHP swap-in Kairui Song
` (2 preceding siblings ...)
2025-07-04 18:17 ` [PATCH v4 3/9] mm/shmem, swap: tidy up THP swapin checks Kairui Song
@ 2025-07-04 18:17 ` Kairui Song
2025-07-06 3:35 ` Baolin Wang
2025-07-04 18:17 ` [PATCH v4 5/9] mm/shmem, swap: avoid false positive swap cache lookup Kairui Song
` (4 subsequent siblings)
8 siblings, 1 reply; 17+ messages in thread
From: Kairui Song @ 2025-07-04 18:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Instead of keeping different paths for splitting the entry before the
swapin starts, move the entry splitting to after the swapin has put
the folio in the swap cache (or set the SWAP_HAS_CACHE bit). This way
we only need one place and one unified way to split the large entry.
Whenever swapin brings in a folio smaller than the shmem swap entry,
split the entry and recalculate the entry and index for verification.
This removes duplicated code and function calls, reduces LOC,
and the split is less racy as it is now guarded by the swap cache, so it
has a lower chance of repeated faults due to a raced split.
The compiler is also able to optimize the code further:
bloat-o-meter results with GCC 14:
With DEBUG_SECTION_MISMATCH (-fno-inline-functions-called-once):
./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-82 (-82)
Function old new delta
shmem_swapin_folio 2361 2279 -82
Total: Before=33151, After=33069, chg -0.25%
With !DEBUG_SECTION_MISMATCH:
./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
add/remove: 0/1 grow/shrink: 1/0 up/down: 949/-750 (199)
Function old new delta
shmem_swapin_folio 2878 3827 +949
shmem_split_large_entry.isra 750 - -750
Total: Before=33086, After=33285, chg +0.60%
Since shmem_split_large_entry is only called in one place now, the
compiler will either generate more compact code or inline it for
better performance.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 53 +++++++++++++++++++++--------------------------------
1 file changed, 21 insertions(+), 32 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index e43becfa04b3..217264315842 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2266,14 +2266,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct address_space *mapping = inode->i_mapping;
struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
struct shmem_inode_info *info = SHMEM_I(inode);
+ swp_entry_t swap, index_entry;
struct swap_info_struct *si;
struct folio *folio = NULL;
bool skip_swapcache = false;
- swp_entry_t swap;
int error, nr_pages, order, split_order;
+ pgoff_t offset;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
- swap = radix_to_swp_entry(*foliop);
+ swap = index_entry = radix_to_swp_entry(*foliop);
*foliop = NULL;
if (is_poisoned_swp_entry(swap))
@@ -2321,46 +2322,35 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
}
/*
- * Now swap device can only swap in order 0 folio, then we
- * should split the large swap entry stored in the pagecache
- * if necessary.
- */
- split_order = shmem_split_large_entry(inode, index, swap, gfp);
- if (split_order < 0) {
- error = split_order;
- goto failed;
- }
-
- /*
- * If the large swap entry has already been split, it is
+ * Now swap device can only swap in order 0 folio, it is
* necessary to recalculate the new swap entry based on
- * the old order alignment.
+ * the offset, as the swapin index might be unaligned.
*/
- if (split_order > 0) {
- pgoff_t offset = index - round_down(index, 1 << split_order);
-
+ if (order) {
+ offset = index - round_down(index, 1 << order);
swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
}
- /* Here we actually start the io */
folio = shmem_swapin_cluster(swap, gfp, info, index);
if (!folio) {
error = -ENOMEM;
goto failed;
}
- } else if (order > folio_order(folio)) {
+ }
+alloced:
+ if (order > folio_order(folio)) {
/*
- * Swap readahead may swap in order 0 folios into swapcache
+ * Swapin may get smaller folios due to various reasons:
+ * It may fallback to order 0 due to memory pressure or race,
+ * swap readahead may swap in order 0 folios into swapcache
* asynchronously, while the shmem mapping can still stores
* large swap entries. In such cases, we should split the
* large swap entry to prevent possible data corruption.
*/
- split_order = shmem_split_large_entry(inode, index, swap, gfp);
+ split_order = shmem_split_large_entry(inode, index, index_entry, gfp);
if (split_order < 0) {
- folio_put(folio);
- folio = NULL;
error = split_order;
- goto failed;
+ goto failed_nolock;
}
/*
@@ -2369,15 +2359,13 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
* the old order alignment.
*/
if (split_order > 0) {
- pgoff_t offset = index - round_down(index, 1 << split_order);
-
+ offset = index - round_down(index, 1 << split_order);
swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
}
} else if (order < folio_order(folio)) {
swap.val = round_down(swap.val, 1 << folio_order(folio));
}
-alloced:
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
@@ -2434,12 +2422,13 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
shmem_set_folio_swapin_error(inode, index, folio, swap,
skip_swapcache);
unlock:
- if (skip_swapcache)
- swapcache_clear(si, swap, folio_nr_pages(folio));
- if (folio) {
+ if (folio)
folio_unlock(folio);
+failed_nolock:
+ if (skip_swapcache)
+ swapcache_clear(si, folio->swap, folio_nr_pages(folio));
+ if (folio)
folio_put(folio);
- }
put_swap_device(si);
return error;
}
--
2.50.0
* Re: [PATCH v4 4/9] mm/shmem, swap: tidy up swap entry splitting
2025-07-04 18:17 ` [PATCH v4 4/9] mm/shmem, swap: tidy up swap entry splitting Kairui Song
@ 2025-07-06 3:35 ` Baolin Wang
2025-07-06 11:50 ` Kairui Song
0 siblings, 1 reply; 17+ messages in thread
From: Baolin Wang @ 2025-07-06 3:35 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/7/5 02:17, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Instead of keeping different paths of splitting the entry before the
> swap in start, move the entry splitting after the swapin has put
> the folio in swap cache (or set the SWAP_HAS_CACHE bit). This way
> we only need one place and one unified way to split the large entry.
> Whenever swapin brought in a folio smaller than the shmem swap entry,
> split the entry and recalculate the entry and index for verification.
>
> This removes duplicated codes and function calls, reduces LOC,
> and the split is less racy as it's guarded by swap cache now. So it
> will have a lower chance of repeated faults due to raced split.
> The compiler is also able to optimize the coder further:
>
> bloat-o-meter results with GCC 14:
>
> With DEBUG_SECTION_MISMATCH (-fno-inline-functions-called-once):
> ./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-82 (-82)
> Function old new delta
> shmem_swapin_folio 2361 2279 -82
> Total: Before=33151, After=33069, chg -0.25%
>
> With !DEBUG_SECTION_MISMATCH:
> ./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
> add/remove: 0/1 grow/shrink: 1/0 up/down: 949/-750 (199)
> Function old new delta
> shmem_swapin_folio 2878 3827 +949
> shmem_split_large_entry.isra 750 - -750
> Total: Before=33086, After=33285, chg +0.60%
>
> Since shmem_split_large_entry is only called in one place now. The
> compiler will either generate more compact code, or inlined it for
> better performance.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 53 +++++++++++++++++++++--------------------------------
> 1 file changed, 21 insertions(+), 32 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e43becfa04b3..217264315842 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2266,14 +2266,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> struct address_space *mapping = inode->i_mapping;
> struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> struct shmem_inode_info *info = SHMEM_I(inode);
> + swp_entry_t swap, index_entry;
> struct swap_info_struct *si;
> struct folio *folio = NULL;
> bool skip_swapcache = false;
> - swp_entry_t swap;
> int error, nr_pages, order, split_order;
> + pgoff_t offset;
>
> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> - swap = radix_to_swp_entry(*foliop);
> + swap = index_entry = radix_to_swp_entry(*foliop);
> *foliop = NULL;
>
> if (is_poisoned_swp_entry(swap))
> @@ -2321,46 +2322,35 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> }
>
> /*
> - * Now swap device can only swap in order 0 folio, then we
> - * should split the large swap entry stored in the pagecache
> - * if necessary.
> - */
> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> - if (split_order < 0) {
> - error = split_order;
> - goto failed;
> - }
> -
> - /*
> - * If the large swap entry has already been split, it is
> + * Now swap device can only swap in order 0 folio, it is
> * necessary to recalculate the new swap entry based on
> - * the old order alignment.
> + * the offset, as the swapin index might be unalgined.
> */
> - if (split_order > 0) {
> - pgoff_t offset = index - round_down(index, 1 << split_order);
> -
> + if (order) {
> + offset = index - round_down(index, 1 << order);
> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> }
>
> - /* Here we actually start the io */
> folio = shmem_swapin_cluster(swap, gfp, info, index);
> if (!folio) {
> error = -ENOMEM;
> goto failed;
> }
> - } else if (order > folio_order(folio)) {
> + }
> +alloced:
> + if (order > folio_order(folio)) {
> /*
> - * Swap readahead may swap in order 0 folios into swapcache
> + * Swapin may get smaller folios due to various reasons:
> + * It may fallback to order 0 due to memory pressure or race,
> + * swap readahead may swap in order 0 folios into swapcache
> * asynchronously, while the shmem mapping can still stores
> * large swap entries. In such cases, we should split the
> * large swap entry to prevent possible data corruption.
> */
> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> + split_order = shmem_split_large_entry(inode, index, index_entry, gfp);
> if (split_order < 0) {
> - folio_put(folio);
> - folio = NULL;
> error = split_order;
> - goto failed;
> + goto failed_nolock;
> }
>
> /*
> @@ -2369,15 +2359,13 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> * the old order alignment.
> */
> if (split_order > 0) {
> - pgoff_t offset = index - round_down(index, 1 << split_order);
> -
> + offset = index - round_down(index, 1 << split_order);
> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
Obviously, you should use the original swap value 'index_entry' to
calculate the new swap value.
With the following fix, you can add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
diff --git a/mm/shmem.c b/mm/shmem.c
index d530df550f7f..1e8422ac863e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2361,7 +2361,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
*/
if (split_order > 0) {
offset = index - round_down(index, 1 << split_order);
- swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+ swap = swp_entry(swp_type(swap), swp_offset(index_swap) + offset);
}
} else if (order < folio_order(folio)) {
swap.val = round_down(swap.val, 1 << folio_order(folio));
* Re: [PATCH v4 4/9] mm/shmem, swap: tidy up swap entry splitting
2025-07-06 3:35 ` Baolin Wang
@ 2025-07-06 11:50 ` Kairui Song
0 siblings, 0 replies; 17+ messages in thread
From: Kairui Song @ 2025-07-06 11:50 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
On Sun, Jul 6, 2025 at 11:38 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/7/5 02:17, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Instead of keeping different paths of splitting the entry before the
> > swap in start, move the entry splitting after the swapin has put
> > the folio in swap cache (or set the SWAP_HAS_CACHE bit). This way
> > we only need one place and one unified way to split the large entry.
> > Whenever swapin brought in a folio smaller than the shmem swap entry,
> > split the entry and recalculate the entry and index for verification.
> >
> > This removes duplicated codes and function calls, reduces LOC,
> > and the split is less racy as it's guarded by swap cache now. So it
> > will have a lower chance of repeated faults due to raced split.
> > The compiler is also able to optimize the coder further:
> >
> > bloat-o-meter results with GCC 14:
> >
> > With DEBUG_SECTION_MISMATCH (-fno-inline-functions-called-once):
> > ./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
> > add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-82 (-82)
> > Function old new delta
> > shmem_swapin_folio 2361 2279 -82
> > Total: Before=33151, After=33069, chg -0.25%
> >
> > With !DEBUG_SECTION_MISMATCH:
> > ./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
> > add/remove: 0/1 grow/shrink: 1/0 up/down: 949/-750 (199)
> > Function old new delta
> > shmem_swapin_folio 2878 3827 +949
> > shmem_split_large_entry.isra 750 - -750
> > Total: Before=33086, After=33285, chg +0.60%
> >
> > Since shmem_split_large_entry is only called in one place now. The
> > compiler will either generate more compact code, or inlined it for
> > better performance.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/shmem.c | 53 +++++++++++++++++++++--------------------------------
> > 1 file changed, 21 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index e43becfa04b3..217264315842 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2266,14 +2266,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > struct address_space *mapping = inode->i_mapping;
> > struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> > struct shmem_inode_info *info = SHMEM_I(inode);
> > + swp_entry_t swap, index_entry;
> > struct swap_info_struct *si;
> > struct folio *folio = NULL;
> > bool skip_swapcache = false;
> > - swp_entry_t swap;
> > int error, nr_pages, order, split_order;
> > + pgoff_t offset;
> >
> > VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> > - swap = radix_to_swp_entry(*foliop);
> > + swap = index_entry = radix_to_swp_entry(*foliop);
> > *foliop = NULL;
> >
> > if (is_poisoned_swp_entry(swap))
> > @@ -2321,46 +2322,35 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > }
> >
> > /*
> > - * Now swap device can only swap in order 0 folio, then we
> > - * should split the large swap entry stored in the pagecache
> > - * if necessary.
> > - */
> > - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> > - if (split_order < 0) {
> > - error = split_order;
> > - goto failed;
> > - }
> > -
> > - /*
> > - * If the large swap entry has already been split, it is
> > + * Now swap device can only swap in order 0 folio, it is
> > * necessary to recalculate the new swap entry based on
> > - * the old order alignment.
> > + * the offset, as the swapin index might be unalgined.
> > */
> > - if (split_order > 0) {
> > - pgoff_t offset = index - round_down(index, 1 << split_order);
> > -
> > + if (order) {
> > + offset = index - round_down(index, 1 << order);
> > swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> > }
> >
> > - /* Here we actually start the io */
> > folio = shmem_swapin_cluster(swap, gfp, info, index);
> > if (!folio) {
> > error = -ENOMEM;
> > goto failed;
> > }
> > - } else if (order > folio_order(folio)) {
> > + }
> > +alloced:
> > + if (order > folio_order(folio)) {
> > /*
> > - * Swap readahead may swap in order 0 folios into swapcache
> > + * Swapin may get smaller folios due to various reasons:
> > + * It may fallback to order 0 due to memory pressure or race,
> > + * swap readahead may swap in order 0 folios into swapcache
> > * asynchronously, while the shmem mapping can still stores
> > * large swap entries. In such cases, we should split the
> > * large swap entry to prevent possible data corruption.
> > */
> > - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> > + split_order = shmem_split_large_entry(inode, index, index_entry, gfp);
> > if (split_order < 0) {
> > - folio_put(folio);
> > - folio = NULL;
> > error = split_order;
> > - goto failed;
> > + goto failed_nolock;
> > }
> >
> > /*
> > @@ -2369,15 +2359,13 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > * the old order alignment.
> > */
> > if (split_order > 0) {
> > - pgoff_t offset = index - round_down(index, 1 << split_order);
> > -
> > + offset = index - round_down(index, 1 << split_order);
> > swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>
> Obviously, you should use the original swap value 'index_entry' to
> calculate the new swap value.
Thanks, good catch.
>
> With the following fix, you can add:
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d530df550f7f..1e8422ac863e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2361,7 +2361,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> */
> if (split_order > 0) {
> offset = index - round_down(index, 1 << split_order);
> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> + swap = swp_entry(swp_type(swap), swp_offset(index_swap) + offset);
> }
> } else if (order < folio_order(folio)) {
> swap.val = round_down(swap.val, 1 << folio_order(folio));
>
>
* [PATCH v4 5/9] mm/shmem, swap: avoid false positive swap cache lookup
2025-07-04 18:17 [PATCH v4 0/9] mm/shmem, swap: bugfix and improvement of mTHP swap-in Kairui Song
` (3 preceding siblings ...)
2025-07-04 18:17 ` [PATCH v4 4/9] mm/shmem, swap: tidy up swap entry splitting Kairui Song
@ 2025-07-04 18:17 ` Kairui Song
2025-07-07 7:53 ` Baolin Wang
2025-07-04 18:17 ` [PATCH v4 6/9] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO Kairui Song
` (3 subsequent siblings)
8 siblings, 1 reply; 17+ messages in thread
From: Kairui Song @ 2025-07-04 18:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
If a shmem read request's index points to the middle of a large swap
entry, shmem swapin will try the swap cache lookup using the large
swap entry's starting value (which is the first sub swap entry of this
large entry). This will lead to false positive lookup results if only
the first few swap entries are cached but the actual requested swap
entry pointed to by the index is uncached. This is not a rare event, as
swap readahead always tries to cache order 0 folios when possible.
Currently, when this occurs, shmem will split the large entry, abort due
to a mismatching folio swap value, then retry the swapin from the
beginning, which wastes CPU and adds wrong info to the readahead
statistics.
This can be optimized easily by doing the lookup using the right
swap entry value, as sketched below.
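For illustration, a tiny userspace sketch of the sub-entry calculation
this patch adds (the values and the round_down macro here are just a
model of the arithmetic, not kernel code):

#include <stdio.h>

#define round_down(x, y) ((x) & ~((y) - 1))	/* y must be a power of two */

int main(void)
{
	unsigned long base_val = 0x4000;	/* large entry's starting swap value */
	unsigned long index = 5;		/* faulting page index */
	int order = 4;				/* the entry covers 16 pages */

	unsigned long offset = index - round_down(index, 1UL << order);
	unsigned long swap_val = base_val + offset;

	printf("lookup swap.val = %#lx\n", swap_val);	/* 0x4005, not 0x4000 */
	return 0;
}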
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 217264315842..2ab214e2771c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2274,14 +2274,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
pgoff_t offset;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
- swap = index_entry = radix_to_swp_entry(*foliop);
+ index_entry = radix_to_swp_entry(*foliop);
+ swap = index_entry;
*foliop = NULL;
- if (is_poisoned_swp_entry(swap))
+ if (is_poisoned_swp_entry(index_entry))
return -EIO;
- si = get_swap_device(swap);
- order = shmem_confirm_swap(mapping, index, swap);
+ si = get_swap_device(index_entry);
+ order = shmem_confirm_swap(mapping, index, index_entry);
if (unlikely(!si)) {
if (order < 0)
return -EEXIST;
@@ -2293,6 +2294,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
return -EEXIST;
}
+ /* index may point to the middle of a large entry, get the sub entry */
+ if (order) {
+ offset = index - round_down(index, 1 << order);
+ swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+ }
+
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap, NULL, 0);
if (!folio) {
@@ -2305,8 +2312,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
/* Skip swapcache for synchronous device. */
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
- folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
+ folio = shmem_swap_alloc_folio(inode, vma, index,
+ index_entry, order, gfp);
if (!IS_ERR(folio)) {
+ swap = index_entry;
skip_swapcache = true;
goto alloced;
}
@@ -2320,17 +2329,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (error == -EEXIST)
goto failed;
}
-
- /*
- * Now swap device can only swap in order 0 folio, it is
- * necessary to recalculate the new swap entry based on
- * the offset, as the swapin index might be unaligned.
- */
- if (order) {
- offset = index - round_down(index, 1 << order);
- swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
- }
-
+ /* Cached swapin with readahead, only supports order 0 */
folio = shmem_swapin_cluster(swap, gfp, info, index);
if (!folio) {
error = -ENOMEM;
--
2.50.0
* Re: [PATCH v4 5/9] mm/shmem, swap: avoid false positive swap cache lookup
2025-07-04 18:17 ` [PATCH v4 5/9] mm/shmem, swap: avoid false positive swap cache lookup Kairui Song
@ 2025-07-07 7:53 ` Baolin Wang
2025-07-07 8:04 ` Kairui Song
0 siblings, 1 reply; 17+ messages in thread
From: Baolin Wang @ 2025-07-07 7:53 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, linux-kernel
Hi Kairui,
On 2025/7/5 02:17, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> If a shmem read request's index points to the middle of a large swap
> entry, shmem swap in will try the swap cache lookup using the large
> swap entry's starting value (which is the first sub swap entry of this
> large entry). This will lead to false positive lookup results, if only
> the first few swap entries are cached but the actual requested swap
> entry pointed by index is uncached. This is not a rare event as swap
> readahead always try to cache order 0 folios when possible.
>
> Currently, shmem will do a large entry split when it occurs, aborts
> due to a mismatching folio swap value, then retry the swapin from
> the beginning, which is a waste of CPU and adds wrong info to
> the readahead statistics.
>
> This can be optimized easily by doing the lookup using the right
> swap entry value.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 31 +++++++++++++++----------------
> 1 file changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 217264315842..2ab214e2771c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2274,14 +2274,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> pgoff_t offset;
>
> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> - swap = index_entry = radix_to_swp_entry(*foliop);
> + index_entry = radix_to_swp_entry(*foliop);
> + swap = index_entry;
> *foliop = NULL;
>
> - if (is_poisoned_swp_entry(swap))
> + if (is_poisoned_swp_entry(index_entry))
> return -EIO;
>
> - si = get_swap_device(swap);
> - order = shmem_confirm_swap(mapping, index, swap);
> + si = get_swap_device(index_entry);
> + order = shmem_confirm_swap(mapping, index, index_entry);
> if (unlikely(!si)) {
> if (order < 0)
> return -EEXIST;
> @@ -2293,6 +2294,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> return -EEXIST;
> }
>
> + /* index may point to the middle of a large entry, get the sub entry */
> + if (order) {
> + offset = index - round_down(index, 1 << order);
> + swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> + }
> +
> /* Look it up and read it in.. */
> folio = swap_cache_get_folio(swap, NULL, 0);
Please drop this patch, which will cause a swapin fault dead loop.
Assume an order-4 shmem folio has been swapped out, and the swap cache
holds this order-4 folio (assuming index == 0, swap.val == 0x4000).
During swapin, if the index is 1, the recalculation of the swap
value here will result in 'swap.val == 0x4001'. This will cause the
subsequent 'folio->swap.val != swap.val' check to always fail,
continuously triggering a dead loop of swapin faults and ultimately
causing the CPU to hang.
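For illustration only, a small userspace sketch of the numbers above
(the macro and variables are a model of the logic, not kernel code):

#include <stdio.h>

#define round_down(x, y) ((x) & ~((y) - 1))

int main(void)
{
	unsigned long folio_swap_val = 0x4000;	/* head swap value of the cached order-4 folio */
	unsigned long entry_val = 0x4000;	/* large swap entry in the shmem mapping */
	unsigned long index = 1;
	int order = 4;				/* entry order == folio order here */

	unsigned long swap_val = entry_val + (index - round_down(index, 1UL << order));
	printf("%#lx vs %#lx -> %s\n", swap_val, folio_swap_val,
	       swap_val == folio_swap_val ? "match" : "retry forever");	/* 0x4001 vs 0x4000 */

	/* rounding down by the folio order instead would restore the match */
	swap_val = round_down(swap_val, 1UL << order);
	printf("%#lx vs %#lx -> %s\n", swap_val, folio_swap_val,
	       swap_val == folio_swap_val ? "match" : "retry forever");	/* 0x4000 vs 0x4000 */
	return 0;
}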
* Re: [PATCH v4 5/9] mm/shmem, swap: avoid false positive swap cache lookup
2025-07-07 7:53 ` Baolin Wang
@ 2025-07-07 8:04 ` Kairui Song
2025-07-08 6:00 ` Baolin Wang
0 siblings, 1 reply; 17+ messages in thread
From: Kairui Song @ 2025-07-07 8:04 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
On Mon, Jul 7, 2025 at 3:53 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Hi Kairui,
>
> On 2025/7/5 02:17, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > If a shmem read request's index points to the middle of a large swap
> > entry, shmem swap in will try the swap cache lookup using the large
> > swap entry's starting value (which is the first sub swap entry of this
> > large entry). This will lead to false positive lookup results, if only
> > the first few swap entries are cached but the actual requested swap
> > entry pointed by index is uncached. This is not a rare event as swap
> > readahead always try to cache order 0 folios when possible.
> >
> > Currently, shmem will do a large entry split when it occurs, aborts
> > due to a mismatching folio swap value, then retry the swapin from
> > the beginning, which is a waste of CPU and adds wrong info to
> > the readahead statistics.
> >
> > This can be optimized easily by doing the lookup using the right
> > swap entry value.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/shmem.c | 31 +++++++++++++++----------------
> > 1 file changed, 15 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 217264315842..2ab214e2771c 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2274,14 +2274,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > pgoff_t offset;
> >
> > VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> > - swap = index_entry = radix_to_swp_entry(*foliop);
> > + index_entry = radix_to_swp_entry(*foliop);
> > + swap = index_entry;
> > *foliop = NULL;
> >
> > - if (is_poisoned_swp_entry(swap))
> > + if (is_poisoned_swp_entry(index_entry))
> > return -EIO;
> >
> > - si = get_swap_device(swap);
> > - order = shmem_confirm_swap(mapping, index, swap);
> > + si = get_swap_device(index_entry);
> > + order = shmem_confirm_swap(mapping, index, index_entry);
> > if (unlikely(!si)) {
> > if (order < 0)
> > return -EEXIST;
> > @@ -2293,6 +2294,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > return -EEXIST;
> > }
> >
> > + /* index may point to the middle of a large entry, get the sub entry */
> > + if (order) {
> > + offset = index - round_down(index, 1 << order);
> > + swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> > + }
> > +
> > /* Look it up and read it in.. */
> > folio = swap_cache_get_folio(swap, NULL, 0);
>
> Please drop this patch, which will cause a swapin fault dead loop.
>
> Assume an order-4 shmem folio has been swapped out, and the swap cache
> holds this order-4 folio (assuming index == 0, swap.val == 0x4000).
>
> During swapin, if the index is 1, and the recalculation of the swap
> value here will result in 'swap.val == 0x4001'. This will cause the
> subsequent 'folio->swap.val != swap.val' check to fail, continuously
> triggering a dead-loop swapin fault, ultimately causing the CPU to hang.
>
Oh, thanks for catching that.
Clearly I wasn't thinking carefully enough about this. The problem will
be gone if we calculate the `swap.val` based on folio_order and not
split_order, which is currently done in patch 8.
Previously there were only 4 patches, so I never expected this
problem... I can try to reorganize the patch order. I was hoping
they could be merged as one patch; some designs are supposed to work
together, so splitting the patch may cause intermediate problems like
this.
Perhaps you can help have a look at the later patches, to see if we can
just merge them into one? e.g. merge or move patch 8 into this one. Or
maybe I need to move this patch later.
The performance / object size / stack usage improvements are
shown in the commit message.
* Re: [PATCH v4 5/9] mm/shmem, swap: avoid false positive swap cache lookup
2025-07-07 8:04 ` Kairui Song
@ 2025-07-08 6:00 ` Baolin Wang
0 siblings, 0 replies; 17+ messages in thread
From: Baolin Wang @ 2025-07-08 6:00 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/7/7 16:04, Kairui Song wrote:
> On Mon, Jul 7, 2025 at 3:53 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> Hi Kairui,
>>
>> On 2025/7/5 02:17, Kairui Song wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> If a shmem read request's index points to the middle of a large swap
>>> entry, shmem swap in will try the swap cache lookup using the large
>>> swap entry's starting value (which is the first sub swap entry of this
>>> large entry). This will lead to false positive lookup results, if only
>>> the first few swap entries are cached but the actual requested swap
>>> entry pointed by index is uncached. This is not a rare event as swap
>>> readahead always try to cache order 0 folios when possible.
>>>
>>> Currently, shmem will do a large entry split when it occurs, aborts
>>> due to a mismatching folio swap value, then retry the swapin from
>>> the beginning, which is a waste of CPU and adds wrong info to
>>> the readahead statistics.
>>>
>>> This can be optimized easily by doing the lookup using the right
>>> swap entry value.
>>>
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>> mm/shmem.c | 31 +++++++++++++++----------------
>>> 1 file changed, 15 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 217264315842..2ab214e2771c 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -2274,14 +2274,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>> pgoff_t offset;
>>>
>>> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
>>> - swap = index_entry = radix_to_swp_entry(*foliop);
>>> + index_entry = radix_to_swp_entry(*foliop);
>>> + swap = index_entry;
>>> *foliop = NULL;
>>>
>>> - if (is_poisoned_swp_entry(swap))
>>> + if (is_poisoned_swp_entry(index_entry))
>>> return -EIO;
>>>
>>> - si = get_swap_device(swap);
>>> - order = shmem_confirm_swap(mapping, index, swap);
>>> + si = get_swap_device(index_entry);
>>> + order = shmem_confirm_swap(mapping, index, index_entry);
>>> if (unlikely(!si)) {
>>> if (order < 0)
>>> return -EEXIST;
>>> @@ -2293,6 +2294,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>> return -EEXIST;
>>> }
>>>
>>> + /* index may point to the middle of a large entry, get the sub entry */
>>> + if (order) {
>>> + offset = index - round_down(index, 1 << order);
>>> + swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>> + }
>>> +
>>> /* Look it up and read it in.. */
>>> folio = swap_cache_get_folio(swap, NULL, 0);
>>
>> Please drop this patch, which will cause a swapin fault dead loop.
>>
>> Assume an order-4 shmem folio has been swapped out, and the swap cache
>> holds this order-4 folio (assuming index == 0, swap.val == 0x4000).
>>
>> During swapin, if the index is 1, and the recalculation of the swap
>> value here will result in 'swap.val == 0x4001'. This will cause the
>> subsequent 'folio->swap.val != swap.val' check to fail, continuously
>> triggering a dead-loop swapin fault, ultimately causing the CPU to hang.
>>
>
> Oh, thanks for catching that.
>
> Clearly I wasn't thinking carefully enough on this. The problem will
> be gone if we calculate the `swap.val` based on folio_order and not
> split_order, which is currently done in patch 8.
OK. I saw patch 8. After patch 8, the logic seems correct.
> Previously there were only 4 patches so I never expected this
> problem... I can try to organize the patch order again. I was hoping
> they could be merged as one patch, some designs are supposed to work
> together so splitting the patch may cause intermediate problems like
> this.
Again, please do not combine different changes into one huge patch,
which is _really_ hard to review and discuss. Please split your patches
properly and ensure each patch has been tested.
> Perhaps you can help have a look at later patches, if we can just
> merge them into one? eg. merge or move patch 8 into this. Or maybe I
> need to move this patch later.
It seems that patch 5 depends on the cleanup in patch 8. If there's no
better way to split them, I suggest merging patch 5 into patch 8.
> The performance / object size / stack usage improvements are
> shown in the commit message.
* [PATCH v4 6/9] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO
2025-07-04 18:17 [PATCH v4 0/9] mm/shmem, swap: bugfix and improvement of mTHP swap-in Kairui Song
` (4 preceding siblings ...)
2025-07-04 18:17 ` [PATCH v4 5/9] mm/shmem, swap: avoid false positive swap cache lookup Kairui Song
@ 2025-07-04 18:17 ` Kairui Song
2025-07-07 8:05 ` Baolin Wang
2025-07-04 18:17 ` [PATCH v4 7/9] mm/shmem, swap: simplify swapin path and result handling Kairui Song
` (2 subsequent siblings)
8 siblings, 1 reply; 17+ messages in thread
From: Kairui Song @ 2025-07-04 18:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Currently, if a THP swapin fails due to reasons like a partially
conflicting swap cache or ZSWAP being enabled, it will fall back to
cached swapin.
Right now the swap cache still has a non-trivial overhead, and readahead
is not helpful for SWP_SYNCHRONOUS_IO devices, so we should always skip
the readahead and swap cache even if the swapin falls back to order 0.
So handle the fallback logic without falling back to the cached read.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 55 +++++++++++++++++++++++++++++++++++-------------------
1 file changed, 36 insertions(+), 19 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 2ab214e2771c..1fe9a3eb92b1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1975,13 +1975,16 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
return ERR_PTR(error);
}
-static struct folio *shmem_swap_alloc_folio(struct inode *inode,
+static struct folio *shmem_swapin_direct(struct inode *inode,
struct vm_area_struct *vma, pgoff_t index,
- swp_entry_t entry, int order, gfp_t gfp)
+ swp_entry_t swap, swp_entry_t index_entry,
+ int order, gfp_t gfp)
{
struct shmem_inode_info *info = SHMEM_I(inode);
+ swp_entry_t entry = index_entry;
int nr_pages = 1 << order;
struct folio *new;
+ gfp_t alloc_gfp;
void *shadow;
/*
@@ -1989,6 +1992,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
* limit chance of success with further cpuset and node constraints.
*/
gfp &= ~GFP_CONSTRAINT_MASK;
+ alloc_gfp = gfp;
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
if (WARN_ON_ONCE(order))
return ERR_PTR(-EINVAL);
@@ -2003,19 +2007,22 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
if ((vma && unlikely(userfaultfd_armed(vma))) ||
!zswap_never_enabled() ||
non_swapcache_batch(entry, nr_pages) != nr_pages)
- return ERR_PTR(-EINVAL);
+ goto fallback;
- gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
+ alloc_gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
+ }
+retry:
+ new = shmem_alloc_folio(alloc_gfp, order, info, index);
+ if (!new) {
+ new = ERR_PTR(-ENOMEM);
+ goto fallback;
}
-
- new = shmem_alloc_folio(gfp, order, info, index);
- if (!new)
- return ERR_PTR(-ENOMEM);
if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
- gfp, entry)) {
+ alloc_gfp, entry)) {
folio_put(new);
- return ERR_PTR(-ENOMEM);
+ new = ERR_PTR(-ENOMEM);
+ goto fallback;
}
/*
@@ -2030,7 +2037,9 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
*/
if (swapcache_prepare(entry, nr_pages)) {
folio_put(new);
- return ERR_PTR(-EEXIST);
+ new = ERR_PTR(-EEXIST);
+ /* Try smaller folio to avoid cache conflict */
+ goto fallback;
}
__folio_set_locked(new);
@@ -2044,6 +2053,15 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
folio_add_lru(new);
swap_read_folio(new, NULL);
return new;
+fallback:
+ /* Order 0 swapin failed, nothing to fallback to, abort */
+ if (!order)
+ return new;
+ order = 0;
+ nr_pages = 1;
+ alloc_gfp = gfp;
+ entry = swap;
+ goto retry;
}
/*
@@ -2309,25 +2327,24 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
-
/* Skip swapcache for synchronous device. */
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
- folio = shmem_swap_alloc_folio(inode, vma, index,
- index_entry, order, gfp);
+ folio = shmem_swapin_direct(inode, vma, index, swap,
+ index_entry, order, gfp);
if (!IS_ERR(folio)) {
- swap = index_entry;
+ if (folio_test_large(folio))
+ swap = index_entry;
skip_swapcache = true;
goto alloced;
}
/*
- * Fallback to swapin order-0 folio unless the swap entry
- * already exists.
+ * Direct swapin handled order 0 fallback already,
+ * if it failed, abort.
*/
error = PTR_ERR(folio);
folio = NULL;
- if (error == -EEXIST)
- goto failed;
+ goto failed;
}
/* Cached swapin with readahead, only supports order 0 */
folio = shmem_swapin_cluster(swap, gfp, info, index);
--
2.50.0
* Re: [PATCH v4 6/9] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO
2025-07-04 18:17 ` [PATCH v4 6/9] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-07-07 8:05 ` Baolin Wang
0 siblings, 0 replies; 17+ messages in thread
From: Baolin Wang @ 2025-07-07 8:05 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/7/5 02:17, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Currently if a THP swapin failed due to reasons like partially
> conflicting swap cache or ZSWAP enabled, it will fallback to
> cached swapin.
>
> Right now the swap cache still has a non-trivial overhead, and readahead
> is not helpful for SWP_SYNCHRONOUS_IO devices, so we should always skip
> the readahead and swap cache even if the swapin falls back to order 0.
>
> So handle the fallback logic without falling back to the cached read.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 55 +++++++++++++++++++++++++++++++++++-------------------
> 1 file changed, 36 insertions(+), 19 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 2ab214e2771c..1fe9a3eb92b1 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1975,13 +1975,16 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> return ERR_PTR(error);
> }
>
> -static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> +static struct folio *shmem_swapin_direct(struct inode *inode,
> struct vm_area_struct *vma, pgoff_t index,
> - swp_entry_t entry, int order, gfp_t gfp)
> + swp_entry_t swap, swp_entry_t index_entry,
IMO, 'swap' and 'index_entry' are confusing, and it's easy to lose track
of their roles. I suggest only passing the original swap value. If it
falls back to order 0, the swap value can be recalculated, which is more
readable, as well as maintaining the independence of the function.
> + int order, gfp_t gfp)
> {
> struct shmem_inode_info *info = SHMEM_I(inode);
> + swp_entry_t entry = index_entry;
> int nr_pages = 1 << order;
> struct folio *new;
> + gfp_t alloc_gfp;
> void *shadow;
>
> /*
> @@ -1989,6 +1992,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> * limit chance of success with further cpuset and node constraints.
> */
> gfp &= ~GFP_CONSTRAINT_MASK;
> + alloc_gfp = gfp;
> if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> if (WARN_ON_ONCE(order))
> return ERR_PTR(-EINVAL);
> @@ -2003,19 +2007,22 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> if ((vma && unlikely(userfaultfd_armed(vma))) ||
> !zswap_never_enabled() ||
> non_swapcache_batch(entry, nr_pages) != nr_pages)
> - return ERR_PTR(-EINVAL);
> + goto fallback;
>
> - gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
> + alloc_gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
> + }
> +retry:
> + new = shmem_alloc_folio(alloc_gfp, order, info, index);
> + if (!new) {
> + new = ERR_PTR(-ENOMEM);
> + goto fallback;
> }
> -
> - new = shmem_alloc_folio(gfp, order, info, index);
> - if (!new)
> - return ERR_PTR(-ENOMEM);
>
> if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
> - gfp, entry)) {
> + alloc_gfp, entry)) {
> folio_put(new);
> - return ERR_PTR(-ENOMEM);
> + new = ERR_PTR(-ENOMEM);
> + goto fallback;
> }
>
> /*
> @@ -2030,7 +2037,9 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> */
> if (swapcache_prepare(entry, nr_pages)) {
> folio_put(new);
> - return ERR_PTR(-EEXIST);
> + new = ERR_PTR(-EEXIST);
> + /* Try smaller folio to avoid cache conflict */
> + goto fallback;
> }
>
> __folio_set_locked(new);
> @@ -2044,6 +2053,15 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> folio_add_lru(new);
> swap_read_folio(new, NULL);
> return new;
> +fallback:
> + /* Order 0 swapin failed, nothing to fallback to, abort */
> + if (!order)
> + return new;
> + order = 0;
> + nr_pages = 1;
> + alloc_gfp = gfp;
> + entry = swap;
> + goto retry;
> }
>
> /*
> @@ -2309,25 +2327,24 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
> -
Nit: do not add unnecessary change.
> /* Skip swapcache for synchronous device. */
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> - folio = shmem_swap_alloc_folio(inode, vma, index,
> - index_entry, order, gfp);
> + folio = shmem_swapin_direct(inode, vma, index, swap,
> + index_entry, order, gfp);
> if (!IS_ERR(folio)) {
> - swap = index_entry;
> + if (folio_test_large(folio))
> + swap = index_entry;
> skip_swapcache = true;
> goto alloced;
> }
>
> /*
> - * Fallback to swapin order-0 folio unless the swap entry
> - * already exists.
> + * Direct swapin handled order 0 fallback already,
> + * if it failed, abort.
> */
> error = PTR_ERR(folio);
> folio = NULL;
> - if (error == -EEXIST)
> - goto failed;
> + goto failed;
> }
> /* Cached swapin with readahead, only supports order 0 */
> folio = shmem_swapin_cluster(swap, gfp, info, index);
* [PATCH v4 7/9] mm/shmem, swap: simplify swapin path and result handling
2025-07-04 18:17 [PATCH v4 0/9] mm/shmem, swap: bugfix and improvement of mTHP swap-in Kairui Song
` (5 preceding siblings ...)
2025-07-04 18:17 ` [PATCH v4 6/9] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-07-04 18:17 ` Kairui Song
2025-07-07 8:14 ` Baolin Wang
2025-07-04 18:17 ` [PATCH v4 8/9] mm/shmem, swap: simplify swap entry and index calculation of large swapin Kairui Song
2025-07-04 18:17 ` [PATCH v4 9/9] mm/shmem, swap: fix major fault counting Kairui Song
8 siblings, 1 reply; 17+ messages in thread
From: Kairui Song @ 2025-07-04 18:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Slightly tidy up the different swapin and error handling paths for
SWP_SYNCHRONOUS_IO and non-SWP_SYNCHRONOUS_IO devices. Now swapin will
always use either shmem_swapin_direct or shmem_swapin_cluster, then
check the result.
Simplify the control flow and avoid a redundant goto label.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 31 +++++++++++++------------------
1 file changed, 13 insertions(+), 18 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 1fe9a3eb92b1..782162c0c4e0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2327,33 +2327,28 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
- /* Skip swapcache for synchronous device. */
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> + /* Direct mTHP swapin skipping swap cache & readahead */
folio = shmem_swapin_direct(inode, vma, index, swap,
index_entry, order, gfp);
- if (!IS_ERR(folio)) {
+ if (IS_ERR(folio)) {
+ error = PTR_ERR(folio);
+ folio = NULL;
+ goto failed;
+ } else {
if (folio_test_large(folio))
swap = index_entry;
skip_swapcache = true;
- goto alloced;
}
-
- /*
- * Direct swapin handled order 0 fallback already,
- * if it failed, abort.
- */
- error = PTR_ERR(folio);
- folio = NULL;
- goto failed;
- }
- /* Cached swapin with readahead, only supports order 0 */
- folio = shmem_swapin_cluster(swap, gfp, info, index);
- if (!folio) {
- error = -ENOMEM;
- goto failed;
+ } else {
> + /* Cached swapin with readahead, only supports order 0 */
+ folio = shmem_swapin_cluster(swap, gfp, info, index);
+ if (!folio) {
+ error = -ENOMEM;
+ goto failed;
+ }
}
}
-alloced:
if (order > folio_order(folio)) {
/*
* Swapin may get smaller folios due to various reasons:
--
2.50.0
* Re: [PATCH v4 7/9] mm/shmem, swap: simplify swapin path and result handling
2025-07-04 18:17 ` [PATCH v4 7/9] mm/shmem, swap: simplify swapin path and result handling Kairui Song
@ 2025-07-07 8:14 ` Baolin Wang
0 siblings, 0 replies; 17+ messages in thread
From: Baolin Wang @ 2025-07-07 8:14 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/7/5 02:17, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Slightly tidy up the different swapin and error handling paths for
> SWP_SYNCHRONOUS_IO and non-SWP_SYNCHRONOUS_IO devices. Now swapin will
> always use either shmem_swapin_direct or shmem_swapin_cluster, then
> check the result.
>
> Simplify the control flow and avoid a redundant goto label.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 31 +++++++++++++------------------
> 1 file changed, 13 insertions(+), 18 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 1fe9a3eb92b1..782162c0c4e0 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2327,33 +2327,28 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
> - /* Skip swapcache for synchronous device. */
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > + /* Direct mTHP swapin skipping swap cache & readahead */
> folio = shmem_swapin_direct(inode, vma, index, swap,
> index_entry, order, gfp);
> - if (!IS_ERR(folio)) {
> + if (IS_ERR(folio)) {
> + error = PTR_ERR(folio);
> + folio = NULL;
> + goto failed;
> + } else {
The 'else' can be avoided, otherwise LGTM.
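Something like this, perhaps (untested, taken from this patch and only
reshuffled to drop the else after the goto):

        if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
                /* Direct mTHP swapin skipping swap cache & readahead */
                folio = shmem_swapin_direct(inode, vma, index, swap,
                                            index_entry, order, gfp);
                if (IS_ERR(folio)) {
                        error = PTR_ERR(folio);
                        folio = NULL;
                        goto failed;
                }
                if (folio_test_large(folio))
                        swap = index_entry;
                skip_swapcache = true;
        } else {
                /* Cached swapin with readahead, only supports order 0 */
                folio = shmem_swapin_cluster(swap, gfp, info, index);
                if (!folio) {
                        error = -ENOMEM;
                        goto failed;
                }
        }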
> if (folio_test_large(folio))
> swap = index_entry;
> skip_swapcache = true;
> - goto alloced;
> }
> -
> - /*
> - * Direct swapin handled order 0 fallback already,
> - * if it failed, abort.
> - */
> - error = PTR_ERR(folio);
> - folio = NULL;
> - goto failed;
> - }
> - /* Cached swapin with readahead, only supports order 0 */
> - folio = shmem_swapin_cluster(swap, gfp, info, index);
> - if (!folio) {
> - error = -ENOMEM;
> - goto failed;
> + } else {
> > + /* Cached swapin with readahead, only supports order 0 */
> + folio = shmem_swapin_cluster(swap, gfp, info, index);
> + if (!folio) {
> + error = -ENOMEM;
> + goto failed;
> + }
> }
> }
> -alloced:
> if (order > folio_order(folio)) {
> /*
> * Swapin may get smaller folios due to various reasons:
* [PATCH v4 8/9] mm/shmem, swap: simplify swap entry and index calculation of large swapin
2025-07-04 18:17 [PATCH v4 0/9] mm/shmem, swap: bugfix and improvement of mTHP swap-in Kairui Song
` (6 preceding siblings ...)
2025-07-04 18:17 ` [PATCH v4 7/9] mm/shmem, swap: simplify swapin path and result handling Kairui Song
@ 2025-07-04 18:17 ` Kairui Song
2025-07-04 18:17 ` [PATCH v4 9/9] mm/shmem, swap: fix major fault counting Kairui Song
8 siblings, 0 replies; 17+ messages in thread
From: Kairui Song @ 2025-07-04 18:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Large shmem swapin has already calculated the right swap value to use
before the swap cache lookup, so simply rounding it down by the size of
the folio brought in by the swapin is enough to get the swap value to be
verified. A folio's swap entry is always aligned to its size.
Any kind of parallel split or race is fine, because the final
shmem_add_to_page_cache always ensures the entries covered by the folio
are all correct, so there won't be any data corruption.
This shouldn't cause any increase in repeated faults either: no matter
how the shmem mapping is split in parallel, as long as the mapping still
contains the right entries, the swapin will succeed.
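For example (hypothetical values): a swapin at index 35 that brings in a
16-page (order-4) folio rounds both index (35 -> 32) and swap.val down to
the same 16-aligned base; since the folio's swap entry is aligned to its
size, that is exactly the value shmem_add_to_page_cache will verify
across the whole covered range.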
This reduced both the final object size and stack usage:
./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
add/remove: 0/0 grow/shrink: 1/1 up/down: 5/-214 (-209)
Function old new delta
shmem_read_mapping_page_gfp 143 148 +5
shmem_swapin_folio 4020 3806 -214
Total: Before=33478, After=33269, chg -0.62%
Stack usage (Before vs After):
shmem.c:2279:12:shmem_swapin_folio 280 static
shmem.c:2279:12:shmem_swapin_folio 264 static
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 43 +++++++++++++++++++++----------------------
1 file changed, 21 insertions(+), 22 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 782162c0c4e0..646b1db9501c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2267,7 +2267,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
if (xas_error(&xas))
return xas_error(&xas);
- return entry_order;
+ return 0;
}
/*
@@ -2288,7 +2288,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct swap_info_struct *si;
struct folio *folio = NULL;
bool skip_swapcache = false;
- int error, nr_pages, order, split_order;
+ int error, nr_pages, order;
pgoff_t offset;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
@@ -2336,8 +2336,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio = NULL;
goto failed;
} else {
- if (folio_test_large(folio))
- swap = index_entry;
skip_swapcache = true;
}
} else {
@@ -2349,6 +2347,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
}
}
}
+
if (order > folio_order(folio)) {
/*
* Swapin may get smaller folios due to various reasons:
@@ -2358,23 +2357,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
* large swap entries. In such cases, we should split the
* large swap entry to prevent possible data corruption.
*/
- split_order = shmem_split_large_entry(inode, index, index_entry, gfp);
- if (split_order < 0) {
- error = split_order;
+ error = shmem_split_large_entry(inode, index, index_entry, gfp);
+ if (error)
goto failed_nolock;
- }
+ }
- /*
- * If the large swap entry has already been split, it is
- * necessary to recalculate the new swap entry based on
- * the old order alignment.
- */
- if (split_order > 0) {
- offset = index - round_down(index, 1 << split_order);
- swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
- }
- } else if (order < folio_order(folio)) {
- swap.val = round_down(swap.val, 1 << folio_order(folio));
+ /*
+ * If the folio is large, round down swap and index by folio size.
+ * No matter what race occurs, the swap layer ensures we either get
+ * a valid folio that has its swap entry aligned by size, or a
+ * temporarily invalid one which we'll abort very soon and retry.
+ *
+ * shmem_add_to_page_cache ensures the whole range contains expected
+ * entries and prevents any corruption, so any race split is fine
+ * too, it will succeed as long as the entries are still there.
+ */
+ nr_pages = folio_nr_pages(folio);
+ if (nr_pages > 1) {
+ swap.val = round_down(swap.val, nr_pages);
+ index = round_down(index, nr_pages);
}
/* We have to do this with folio locked to prevent races */
@@ -2389,7 +2390,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
folio_wait_writeback(folio);
- nr_pages = folio_nr_pages(folio);
/*
* Some architectures may have to restore extra metadata to the
@@ -2403,8 +2403,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
- error = shmem_add_to_page_cache(folio, mapping,
- round_down(index, nr_pages),
+ error = shmem_add_to_page_cache(folio, mapping, index,
swp_to_radix_entry(swap), gfp);
if (error)
goto failed;
--
2.50.0
* [PATCH v4 9/9] mm/shmem, swap: fix major fault counting
2025-07-04 18:17 [PATCH v4 0/9] mm/shmem, swap: bugfix and improvement of mTHP swap-in Kairui Song
` (7 preceding siblings ...)
2025-07-04 18:17 ` [PATCH v4 8/9] mm/shmem, swap: simplify swap entry and index calculation of large swapin Kairui Song
@ 2025-07-04 18:17 ` Kairui Song
8 siblings, 0 replies; 17+ messages in thread
From: Kairui Song @ 2025-07-04 18:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
If the swapin failed, don't update the major fault count. There is a
long-standing comment suggesting it should be done this way; now, with
the previous cleanups, we can finally fix it.
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 646b1db9501c..b03b5bf2df38 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2321,12 +2321,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap, NULL, 0);
if (!folio) {
- /* Or update major stats only when swapin succeeds?? */
- if (fault_type) {
- *fault_type |= VM_FAULT_MAJOR;
- count_vm_event(PGMAJFAULT);
- count_memcg_event_mm(fault_mm, PGMAJFAULT);
- }
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
/* Direct mTHP swapin skipping swap cache & readahead */
folio = shmem_swapin_direct(inode, vma, index, swap,
@@ -2346,6 +2340,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
}
+ if (fault_type) {
+ *fault_type |= VM_FAULT_MAJOR;
+ count_vm_event(PGMAJFAULT);
+ count_memcg_event_mm(fault_mm, PGMAJFAULT);
+ }
}
if (order > folio_order(folio)) {
--
2.50.0