* [PATCH 0/4] mm/shmem, swap: bugfix and improvement of mTHP swap in
@ 2025-06-17 18:34 Kairui Song
2025-06-17 18:35 ` [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
` (3 more replies)
0 siblings, 4 replies; 22+ messages in thread
From: Kairui Song @ 2025-06-17 18:34 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
The current mTHP swapin path has several problems. It may potentially
hang, may cause redundant faults due to false positive swap cache
lookups, and it involves at least 4 Xarray tree walks (get order, get
order again, confirm swap, insert folio). For !CONFIG_TRANSPARENT_HUGEPAGE
builds, it still performs some mTHP related checks.
This series fixes all of the mentioned issues, and the code should be
more robust and better prepared for the swap table series. The tree
walks are now reduced to two (get order & confirm, insert folio), with
more sanity checks and comments added. The !CONFIG_TRANSPARENT_HUGEPAGE
build overhead is also minimized, and now comes with a sanity check.
Performance is slightly better after this series: sequential swap in of
24G of data from ZRAM, using transparent_hugepage_tmpfs=always (36 samples each):
Before: avg: 11.23s, stddev: 0.06
After patch 1: avg: 10.92s, stddev: 0.05
After patch 2: avg: 10.93s, stddev: 0.15
After patch 3: avg: 10.07s, stddev: 0.09
After patch 4: avg: 10.09s, stddev: 0.08
Each patch improves the performance a little, adding up to about 10%
faster in total.
A kernel build test showed a very slight improvement as well, testing
with make -j24 with defconfig in a 256M memcg, also using ZRAM as swap,
and transparent_hugepage_tmpfs=always (6 samples each):
Before: system time avg: 3945.25s
After patch 1: system time avg: 3903.21s
After patch 2: system time avg: 3914.76s
After patch 3: system time avg: 3907.41s
After patch 4: system time avg: 3876.24s
The improvement is only slightly above the noise level given the number
of samples.
---
Two of the patches in this series come from the swap table series [1].
Worth noting, the performance gain of this series is independent of the
swap table series; we'll see another, bigger performance gain and a
reduction in memory usage after the swap table series.
I found these issues while trying to split the shmem changes out of the
swap table series for easier review, and found several more issues
while doing stress tests for performance comparison. Barry also mentioned
that CONFIG_TRANSPARENT_HUGEPAGE may have redundant checks [2], and I
managed to clean them up properly too.
No issues were found after a few days of stress testing.
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [1]
Link: https://lore.kernel.org/linux-mm/CAMgjq7AsKFz7UN+seR5atznE_RBTDC9qjDmwN5saMe+KL3b1mQ@mail.gmail.com/ [2]
Kairui Song (4):
mm/shmem, swap: improve cached mTHP handling and fix potential hung
mm/shmem, swap: avoid redundant Xarray lookup during swapin
mm/shmem, swap: improve mthp swapin process
mm/shmem, swap: avoid false positive swap cache lookup
mm/shmem.c | 247 +++++++++++++++++++++++++++--------------------------
1 file changed, 126 insertions(+), 121 deletions(-)
--
2.50.0
* [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hung
2025-06-17 18:34 [PATCH 0/4] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
@ 2025-06-17 18:35 ` Kairui Song
2025-06-17 22:58 ` Andrew Morton
2025-06-18 2:08 ` Kemeng Shi
2025-06-17 18:35 ` [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
` (2 subsequent siblings)
3 siblings, 2 replies; 22+ messages in thread
From: Kairui Song @ 2025-06-17 18:35 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song, stable
From: Kairui Song <kasong@tencent.com>
The current swap-in code assumes that, when a swap entry in the shmem
mapping is order 0, its cached folios (if present) must be order 0
too, which turns out to not always be correct.
The problem is that shmem_split_large_entry is called before verifying
that the folio will eventually be swapped in. One possible race:
CPU1 CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
folio = swap_cache_get_folio
/* folio = NULL */
order = xa_get_order
/* order > 0 */
folio = shmem_swap_alloc_folio
/* mTHP alloc failure, folio = NULL */
<... Interrupted ...>
shmem_swapin_folio
/* S1 is swapped in */
shmem_writeout
/* S1 is swapped out, folio cached */
shmem_split_large_entry(..., S1)
/* S1 is split, but the folio covering it has order > 0 now */
Now any following swapin of S1 will hang: `xa_get_order` returns 0,
while folio lookup returns a folio with order > 0, so the
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check
is always true, causing swap-in to return -EEXIST and retry forever.
This is fragile. So fix it by allowing a larger folio to be seen in the
swap cache, and by checking that the whole shmem mapping range covered
by the swapin has the right swap values upon inserting the folio. Also
drop the now-redundant tree walks before the insertion.
This actually improves performance, as it avoids two redundant Xarray
tree walks in the hot path. The only side effect is that in the failure
path, shmem may redundantly reallocate a few folios, causing temporary
slight memory pressure.
Worth noting, it may seem that the order and value check before
inserting would help reduce the lock contention, which is not true.
The swap cache layer ensures that a raced swapin will either see a swap
cache folio or fail to do a swapin (we have the SWAP_HAS_CACHE bit even
if the swap cache is bypassed), so holding the folio lock and checking
the folio flag is already good enough for avoiding the lock contention.
The chance that a folio passes the swap entry value check while the
shmem mapping slot has changed should be very low.
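To make the new insertion-time check easier to follow, below is a small
standalone userspace sketch of the idea (illustration only, not kernel
code; the struct and helper names are made up): the conflicting entries
must form one contiguous run of swap values that exactly covers the
folio, otherwise the insertion bails out with -EEXIST.

#include <stdio.h>
#include <stdbool.h>

/*
 * Userspace model of the range check added to shmem_add_to_page_cache():
 * the slots covered by the new folio must hold consecutive swap values,
 * starting at the expected value and adding up to exactly nr pages.
 * Each "slot" stands in for one conflicting xarray entry, reporting the
 * swap value it stores and its order (like xas_get_order() would).
 */
struct slot {
	unsigned long swap_val;
	int order;
};

static bool range_covers_swap(const struct slot *slots, int nslots,
			      unsigned long expected, unsigned long nr)
{
	unsigned long iter = expected;
	int i;

	for (i = 0; i < nslots; i++) {
		if (slots[i].swap_val != iter)
			return false;		/* hole or foreign entry: -EEXIST */
		iter += 1UL << slots[i].order;
	}
	return iter - nr == expected;		/* whole folio range accounted for */
}

int main(void)
{
	/* an order-2 entry, an order-1 entry and two order-0 entries */
	struct slot good[] = { { 0x100, 2 }, { 0x104, 1 }, { 0x106, 0 }, { 0x107, 0 } };
	/* same layout, but the last slot was already reused by someone else */
	struct slot bad[]  = { { 0x100, 2 }, { 0x104, 1 }, { 0x106, 0 }, { 0x200, 0 } };

	printf("good range: %d\n", range_covers_swap(good, 4, 0x100, 8));
	printf("bad range:  %d\n", range_covers_swap(bad, 4, 0x100, 8));
	return 0;
}

The first call prints 1 (an order-3 folio at swap value 0x100 is fully
covered), the second prints 0 (one slot no longer matches), mirroring
the -EEXIST bailout in the patch.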
Cc: stable@vger.kernel.org
Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin")
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 30 +++++++++++++++++++++---------
1 file changed, 21 insertions(+), 9 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index eda35be2a8d9..4e7ef343a29b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
pgoff_t index, void *expected, gfp_t gfp)
{
XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
- long nr = folio_nr_pages(folio);
+ unsigned long nr = folio_nr_pages(folio);
+ swp_entry_t iter, swap;
+ void *entry;
VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
gfp &= GFP_RECLAIM_MASK;
folio_throttle_swaprate(folio, gfp);
+ swap = iter = radix_to_swp_entry(expected);
do {
xas_lock_irq(&xas);
- if (expected != xas_find_conflict(&xas)) {
- xas_set_err(&xas, -EEXIST);
- goto unlock;
+ xas_for_each_conflict(&xas, entry) {
+ /*
+ * The range must either be empty, or filled with
+ * expected swap entries. Shmem swap entries are never
+ * partially freed without split of both entry and
+ * folio, so there shouldn't be any holes.
+ */
+ if (!expected || entry != swp_to_radix_entry(iter)) {
+ xas_set_err(&xas, -EEXIST);
+ goto unlock;
+ }
+ iter.val += 1 << xas_get_order(&xas);
}
- if (expected && xas_find_conflict(&xas)) {
+ if (expected && iter.val - nr != swap.val) {
xas_set_err(&xas, -EEXIST);
goto unlock;
}
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
error = -ENOMEM;
goto failed;
}
- } else if (order != folio_order(folio)) {
+ } else if (order > folio_order(folio)) {
/*
* Swap readahead may swap in order 0 folios into swapcache
* asynchronously, while the shmem mapping can still stores
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
}
+ } else if (order < folio_order(folio)) {
+ swap.val = round_down(swp_type(swap), folio_order(folio));
}
alloced:
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
- folio->swap.val != swap.val ||
- !shmem_confirm_swap(mapping, index, swap) ||
- xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+ folio->swap.val != swap.val) {
error = -EEXIST;
goto unlock;
}
--
2.50.0
* [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin
2025-06-17 18:34 [PATCH 0/4] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
2025-06-17 18:35 ` [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
@ 2025-06-17 18:35 ` Kairui Song
2025-06-18 2:48 ` Kemeng Shi
2025-06-18 7:16 ` Dev Jain
2025-06-17 18:35 ` [PATCH 3/4] mm/shmem, swap: improve mthp swapin process Kairui Song
2025-06-17 18:35 ` [PATCH 4/4] mm/shmem, swap: avoid false positive swap cache lookup Kairui Song
3 siblings, 2 replies; 22+ messages in thread
From: Kairui Song @ 2025-06-17 18:35 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Currently shmem calls xa_get_order to get the swap radix entry order,
requiring a full tree walk. This can easily be combined with the swap
entry value check (shmem_confirm_swap) to avoid the duplicated lookup,
which should improve the performance.
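As a rough illustration of the new helper's contract and of how the
caller turns it into error codes (a standalone userspace model with
made-up names, not the actual kernel code):

#include <stdio.h>
#include <errno.h>
#include <stdbool.h>

/*
 * One lookup both confirms the slot still holds the expected swap
 * entry and reports that entry's order, so the caller no longer needs
 * a separate xa_get_order() walk. A struct slot stands in for the
 * shmem mapping slot here.
 */
struct slot {
	unsigned long swap_val;	/* 0: slot no longer holds this swap entry */
	int order;
};

/* Returns the entry's order if it is still present, else -1. */
static int swap_check_entry(const struct slot *slot, unsigned long expected)
{
	if (slot->swap_val == expected)
		return slot->order;
	return -1;
}

/* Simplified model of the error handling in shmem_swapin_folio(). */
static int swapin_errno(const struct slot *slot, unsigned long expected,
			bool have_swap_device)
{
	int order = swap_check_entry(slot, expected);

	if (!have_swap_device)
		return order < 0 ? -EEXIST : -EINVAL;
	if (order < 0)
		return -EEXIST;
	return 0;	/* proceed with the swapin at this order */
}

int main(void)
{
	struct slot entry = { .swap_val = 0x100, .order = 2 };
	struct slot gone  = { .swap_val = 0,     .order = 0 };

	printf("%d %d %d\n",
	       swapin_errno(&entry, 0x100, true),	/* 0: continue */
	       swapin_errno(&gone,  0x100, true),	/* -EEXIST: raced swapin */
	       swapin_errno(&entry, 0x100, false));	/* -EINVAL: stale device */
	return 0;
}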
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 4e7ef343a29b..0ad49e57f736 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -505,15 +505,27 @@ static int shmem_replace_entry(struct address_space *mapping,
/*
* Sometimes, before we decide whether to proceed or to fail, we must check
- * that an entry was not already brought back from swap by a racing thread.
+ * that an entry was not already brought back or split by a racing thread.
*
* Checking folio is not enough: by the time a swapcache folio is locked, it
* might be reused, and again be swapcache, using the same swap as before.
+ * Returns the swap entry's order if it still presents, else returns -1.
*/
-static bool shmem_confirm_swap(struct address_space *mapping,
- pgoff_t index, swp_entry_t swap)
+static int shmem_swap_check_entry(struct address_space *mapping, pgoff_t index,
+ swp_entry_t swap)
{
- return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
+ XA_STATE(xas, &mapping->i_pages, index);
+ int ret = -1;
+ void *entry;
+
+ rcu_read_lock();
+ do {
+ entry = xas_load(&xas);
+ if (entry == swp_to_radix_entry(swap))
+ ret = xas_get_order(&xas);
+ } while (xas_retry(&xas, entry));
+ rcu_read_unlock();
+ return ret;
}
/*
@@ -2256,16 +2268,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
return -EIO;
si = get_swap_device(swap);
- if (!si) {
- if (!shmem_confirm_swap(mapping, index, swap))
+ order = shmem_swap_check_entry(mapping, index, swap);
+ if (unlikely(!si)) {
+ if (order < 0)
return -EEXIST;
else
return -EINVAL;
}
+ if (unlikely(order < 0)) {
+ put_swap_device(si);
+ return -EEXIST;
+ }
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap, NULL, 0);
- order = xa_get_order(&mapping->i_pages, index);
if (!folio) {
int nr_pages = 1 << order;
bool fallback_order0 = false;
@@ -2415,7 +2431,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
*foliop = folio;
return 0;
failed:
- if (!shmem_confirm_swap(mapping, index, swap))
+ if (shmem_swap_check_entry(mapping, index, swap) < 0)
error = -EEXIST;
if (error == -EIO)
shmem_set_folio_swapin_error(inode, index, folio, swap,
@@ -2428,7 +2444,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio_put(folio);
}
put_swap_device(si);
-
return error;
}
--
2.50.0
* [PATCH 3/4] mm/shmem, swap: improve mthp swapin process
2025-06-17 18:34 [PATCH 0/4] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
2025-06-17 18:35 ` [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
2025-06-17 18:35 ` [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
@ 2025-06-17 18:35 ` Kairui Song
2025-06-18 6:27 ` Kemeng Shi
2025-06-18 8:26 ` Kemeng Shi
2025-06-17 18:35 ` [PATCH 4/4] mm/shmem, swap: avoid false positive swap cache lookup Kairui Song
3 siblings, 2 replies; 22+ messages in thread
From: Kairui Song @ 2025-06-17 18:35 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Tidy up the mTHP swapin workflow. There should be no functional change,
but this consolidates the mTHP related checks in one place, so they are
now all wrapped by CONFIG_TRANSPARENT_HUGEPAGE and will be trimmed off
by the compiler if not needed.
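For reference, a minimal standalone sketch of why wrapping everything in
a CONFIG_TRANSPARENT_HUGEPAGE check lets the compiler trim the mTHP
branches (the IS_ENABLED_DEMO macro and CONFIG_THP_DEMO symbol below are
simplified stand-ins, used only so the example builds on its own):

#include <stdio.h>

/*
 * Simplified stand-in for the kernel's IS_ENABLED(): just a plain 0/1
 * constant. The effect is the same: the branch condition is a
 * compile-time constant, so the disabled path is dropped by the
 * compiler while still being type-checked.
 */
#ifndef CONFIG_THP_DEMO
#define CONFIG_THP_DEMO 0
#endif
#define IS_ENABLED_DEMO(x) (x)

static int swapin_order(int requested_order)
{
	if (!IS_ENABLED_DEMO(CONFIG_THP_DEMO)) {
		/* !THP build: only order 0 swapin, the path below is dead code */
		return 0;
	}
	/* THP build: mTHP handling stays compiled in */
	return requested_order;
}

int main(void)
{
	printf("swapin order used: %d\n", swapin_order(4));
	return 0;
}

Building with -DCONFIG_THP_DEMO=1 flips the branch; either way one side
is eliminated at compile time, which is what consolidating the checks in
this patch relies on.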
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
1 file changed, 78 insertions(+), 97 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 0ad49e57f736..46dea2fa1b43 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1975,31 +1975,51 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
return ERR_PTR(error);
}
-static struct folio *shmem_swap_alloc_folio(struct inode *inode,
+static struct folio *shmem_swapin_direct(struct inode *inode,
struct vm_area_struct *vma, pgoff_t index,
- swp_entry_t entry, int order, gfp_t gfp)
+ swp_entry_t entry, int *order, gfp_t gfp)
{
struct shmem_inode_info *info = SHMEM_I(inode);
+ int nr_pages = 1 << *order;
struct folio *new;
+ pgoff_t offset;
void *shadow;
- int nr_pages;
/*
* We have arrived here because our zones are constrained, so don't
* limit chance of success with further cpuset and node constraints.
*/
gfp &= ~GFP_CONSTRAINT_MASK;
- if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && order > 0) {
- gfp_t huge_gfp = vma_thp_gfp_mask(vma);
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+ if (WARN_ON_ONCE(*order))
+ return ERR_PTR(-EINVAL);
+ } else if (*order) {
+ /*
+ * If uffd is active for the vma, we need per-page fault
+ * fidelity to maintain the uffd semantics, then fallback
+ * to swapin order-0 folio, as well as for zswap case.
+ * Any existing sub folio in the swap cache also blocks
+ * mTHP swapin.
+ */
+ if ((vma && userfaultfd_armed(vma)) ||
+ !zswap_never_enabled() ||
+ non_swapcache_batch(entry, nr_pages) != nr_pages) {
+ offset = index - round_down(index, nr_pages);
+ entry = swp_entry(swp_type(entry),
+ swp_offset(entry) + offset);
+ *order = 0;
+ nr_pages = 1;
+ } else {
+ gfp_t huge_gfp = vma_thp_gfp_mask(vma);
- gfp = limit_gfp_mask(huge_gfp, gfp);
+ gfp = limit_gfp_mask(huge_gfp, gfp);
+ }
}
- new = shmem_alloc_folio(gfp, order, info, index);
+ new = shmem_alloc_folio(gfp, *order, info, index);
if (!new)
return ERR_PTR(-ENOMEM);
- nr_pages = folio_nr_pages(new);
if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
gfp, entry)) {
folio_put(new);
@@ -2165,8 +2185,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
swap_free_nr(swap, nr_pages);
}
-static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
- swp_entry_t swap, gfp_t gfp)
+/*
+ * Split an existing large swap entry. @index should point to one sub mapping
+ * slot within the entry @swap, this sub slot will be split into order 0.
+ */
+static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
+ swp_entry_t swap, gfp_t gfp)
{
struct address_space *mapping = inode->i_mapping;
XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
@@ -2226,7 +2250,6 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
cur_order = split_order;
split_order = xas_try_split_min_order(split_order);
}
-
unlock:
xas_unlock_irq(&xas);
@@ -2237,7 +2260,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
if (xas_error(&xas))
return xas_error(&xas);
- return entry_order;
+ return 0;
}
/*
@@ -2254,11 +2277,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct address_space *mapping = inode->i_mapping;
struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
struct shmem_inode_info *info = SHMEM_I(inode);
+ int error, nr_pages, order, swap_order;
struct swap_info_struct *si;
struct folio *folio = NULL;
bool skip_swapcache = false;
swp_entry_t swap;
- int error, nr_pages, order, split_order;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
swap = radix_to_swp_entry(*foliop);
@@ -2283,110 +2306,66 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap, NULL, 0);
if (!folio) {
- int nr_pages = 1 << order;
- bool fallback_order0 = false;
-
/* Or update major stats only when swapin succeeds?? */
if (fault_type) {
*fault_type |= VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
-
- /*
- * If uffd is active for the vma, we need per-page fault
- * fidelity to maintain the uffd semantics, then fallback
- * to swapin order-0 folio, as well as for zswap case.
- * Any existing sub folio in the swap cache also blocks
- * mTHP swapin.
- */
- if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
- !zswap_never_enabled() ||
- non_swapcache_batch(swap, nr_pages) != nr_pages))
- fallback_order0 = true;
-
- /* Skip swapcache for synchronous device. */
- if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
- folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
+ /* Try direct mTHP swapin bypassing swap cache and readahead */
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+ swap_order = order;
+ folio = shmem_swapin_direct(inode, vma, index,
+ swap, &swap_order, gfp);
if (!IS_ERR(folio)) {
skip_swapcache = true;
goto alloced;
}
-
- /*
- * Fallback to swapin order-0 folio unless the swap entry
- * already exists.
- */
+ /* Fallback if order > 0 swapin failed with -ENOMEM */
error = PTR_ERR(folio);
folio = NULL;
- if (error == -EEXIST)
+ if (error != -ENOMEM || !swap_order)
goto failed;
}
-
/*
- * Now swap device can only swap in order 0 folio, then we
- * should split the large swap entry stored in the pagecache
- * if necessary.
+ * Try order 0 swapin using swap cache and readahead, it still
+ * may return order > 0 folio due to raced swap cache.
*/
- split_order = shmem_split_large_entry(inode, index, swap, gfp);
- if (split_order < 0) {
- error = split_order;
- goto failed;
- }
-
- /*
- * If the large swap entry has already been split, it is
- * necessary to recalculate the new swap entry based on
- * the old order alignment.
- */
- if (split_order > 0) {
- pgoff_t offset = index - round_down(index, 1 << split_order);
-
- swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
- }
-
- /* Here we actually start the io */
folio = shmem_swapin_cluster(swap, gfp, info, index);
if (!folio) {
error = -ENOMEM;
goto failed;
}
- } else if (order > folio_order(folio)) {
- /*
- * Swap readahead may swap in order 0 folios into swapcache
- * asynchronously, while the shmem mapping can still stores
- * large swap entries. In such cases, we should split the
- * large swap entry to prevent possible data corruption.
- */
- split_order = shmem_split_large_entry(inode, index, swap, gfp);
- if (split_order < 0) {
- folio_put(folio);
- folio = NULL;
- error = split_order;
- goto failed;
- }
-
- /*
- * If the large swap entry has already been split, it is
- * necessary to recalculate the new swap entry based on
- * the old order alignment.
- */
- if (split_order > 0) {
- pgoff_t offset = index - round_down(index, 1 << split_order);
-
- swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
- }
- } else if (order < folio_order(folio)) {
- swap.val = round_down(swp_type(swap), folio_order(folio));
}
-
alloced:
+ /*
+ * We need to split an existing large entry if swapin brought in a
+ * smaller folio due to various of reasons.
+ *
+ * And worth noting there is a special case: if there is a smaller
+ * cached folio that covers @swap, but not @index (it only covers
+ * first few sub entries of the large entry, but @index points to
+ * later parts), the swap cache lookup will still see this folio,
+ * And we need to split the large entry here. Later checks will fail,
+ * as it can't satisfy the swap requirement, and we will retry
+ * the swapin from beginning.
+ */
+ swap_order = folio_order(folio);
+ if (order > swap_order) {
+ error = shmem_split_swap_entry(inode, index, swap, gfp);
+ if (error)
+ goto failed_nolock;
+ }
+
+ index = round_down(index, 1 << swap_order);
+ swap.val = round_down(swap.val, 1 << swap_order);
+
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
folio->swap.val != swap.val) {
error = -EEXIST;
- goto unlock;
+ goto failed_unlock;
}
if (!folio_test_uptodate(folio)) {
error = -EIO;
@@ -2407,8 +2386,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
- error = shmem_add_to_page_cache(folio, mapping,
- round_down(index, nr_pages),
+ error = shmem_add_to_page_cache(folio, mapping, index,
swp_to_radix_entry(swap), gfp);
if (error)
goto failed;
@@ -2419,8 +2397,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio_mark_accessed(folio);
if (skip_swapcache) {
+ swapcache_clear(si, folio->swap, folio_nr_pages(folio));
folio->swap.val = 0;
- swapcache_clear(si, swap, nr_pages);
} else {
delete_from_swap_cache(folio);
}
@@ -2436,13 +2414,16 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (error == -EIO)
shmem_set_folio_swapin_error(inode, index, folio, swap,
skip_swapcache);
-unlock:
- if (skip_swapcache)
- swapcache_clear(si, swap, folio_nr_pages(folio));
- if (folio) {
+failed_unlock:
+ if (folio)
folio_unlock(folio);
- folio_put(folio);
+failed_nolock:
+ if (skip_swapcache) {
+ swapcache_clear(si, folio->swap, folio_nr_pages(folio));
+ folio->swap.val = 0;
}
+ if (folio)
+ folio_put(folio);
put_swap_device(si);
return error;
}
--
2.50.0
* [PATCH 4/4] mm/shmem, swap: avoid false positive swap cache lookup
2025-06-17 18:34 [PATCH 0/4] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
` (2 preceding siblings ...)
2025-06-17 18:35 ` [PATCH 3/4] mm/shmem, swap: improve mthp swapin process Kairui Song
@ 2025-06-17 18:35 ` Kairui Song
2025-06-19 1:28 ` Kemeng Shi
3 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2025-06-17 18:35 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
If the shmem read request's index points to the middle of a large swap
entry, shmem swapin does the swap cache lookup using the large swap
entry's starting value (the first sub swap entry of this large entry).
This leads to a false positive lookup result if only the first few swap
entries are cached, while the requested swap entry pointed to by the
index is uncached.
Currently shmem will do a large entry split and then retry the swapin
from the beginning, which is a waste of CPU and fragile. Handle this
correctly instead.
Also add some sanity checks to help understand the code and to ensure
things won't go wrong.
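For clarity, a standalone sketch of the index/offset arithmetic this
patch switches to (the numbers are made up for illustration and this is
not kernel code):

#include <stdio.h>

/*
 * A large swap entry of order 4 (16 pages) starts at page index 32 and
 * swap value 0x100. A fault at index 42 must look up swap value 0x10a;
 * using the entry's starting value (0x100) can give a false positive
 * when only the first few sub entries are still in the swap cache.
 */
static unsigned long round_down_ul(unsigned long x, unsigned long align)
{
	return x & ~(align - 1);
}

int main(void)
{
	int order = 4;				/* large swap entry order */
	unsigned long index = 42;		/* faulting page index */
	unsigned long entry_start = 0x100;	/* first sub swap entry value */

	unsigned long offset = index - round_down_ul(index, 1UL << order);
	unsigned long swap_val = entry_start + offset;

	printf("old lookup used 0x%lx, correct sub entry is 0x%lx (offset %lu)\n",
	       entry_start, swap_val, offset);
	return 0;
}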
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 61 ++++++++++++++++++++++++++----------------------------
1 file changed, 29 insertions(+), 32 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 46dea2fa1b43..0bc30dafad90 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1977,12 +1977,12 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
static struct folio *shmem_swapin_direct(struct inode *inode,
struct vm_area_struct *vma, pgoff_t index,
- swp_entry_t entry, int *order, gfp_t gfp)
+ swp_entry_t swap_entry, swp_entry_t swap,
+ int *order, gfp_t gfp)
{
struct shmem_inode_info *info = SHMEM_I(inode);
int nr_pages = 1 << *order;
struct folio *new;
- pgoff_t offset;
void *shadow;
/*
@@ -2003,13 +2003,11 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
*/
if ((vma && userfaultfd_armed(vma)) ||
!zswap_never_enabled() ||
- non_swapcache_batch(entry, nr_pages) != nr_pages) {
- offset = index - round_down(index, nr_pages);
- entry = swp_entry(swp_type(entry),
- swp_offset(entry) + offset);
+ non_swapcache_batch(swap_entry, nr_pages) != nr_pages) {
*order = 0;
nr_pages = 1;
} else {
+ swap.val = swap_entry.val;
gfp_t huge_gfp = vma_thp_gfp_mask(vma);
gfp = limit_gfp_mask(huge_gfp, gfp);
@@ -2021,7 +2019,7 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
return ERR_PTR(-ENOMEM);
if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
- gfp, entry)) {
+ gfp, swap)) {
folio_put(new);
return ERR_PTR(-ENOMEM);
}
@@ -2036,17 +2034,17 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
* In this case, shmem_add_to_page_cache() will help identify the
* concurrent swapin and return -EEXIST.
*/
- if (swapcache_prepare(entry, nr_pages)) {
+ if (swapcache_prepare(swap, nr_pages)) {
folio_put(new);
return ERR_PTR(-EEXIST);
}
__folio_set_locked(new);
__folio_set_swapbacked(new);
- new->swap = entry;
+ new->swap = swap;
- memcg1_swapin(entry, nr_pages);
- shadow = get_shadow_from_swap_cache(entry);
+ memcg1_swapin(swap, nr_pages);
+ shadow = get_shadow_from_swap_cache(swap);
if (shadow)
workingset_refault(new, shadow);
folio_add_lru(new);
@@ -2278,20 +2276,21 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
struct shmem_inode_info *info = SHMEM_I(inode);
int error, nr_pages, order, swap_order;
+ swp_entry_t swap, swap_entry;
struct swap_info_struct *si;
struct folio *folio = NULL;
bool skip_swapcache = false;
- swp_entry_t swap;
+ pgoff_t offset;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
- swap = radix_to_swp_entry(*foliop);
+ swap_entry = radix_to_swp_entry(*foliop);
*foliop = NULL;
- if (is_poisoned_swp_entry(swap))
+ if (is_poisoned_swp_entry(swap_entry))
return -EIO;
- si = get_swap_device(swap);
- order = shmem_swap_check_entry(mapping, index, swap);
+ si = get_swap_device(swap_entry);
+ order = shmem_swap_check_entry(mapping, index, swap_entry);
if (unlikely(!si)) {
if (order < 0)
return -EEXIST;
@@ -2303,7 +2302,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
return -EEXIST;
}
- /* Look it up and read it in.. */
+ /* @index may points to the middle of a large entry, get the real swap value first */
+ offset = index - round_down(index, 1 << order);
+ swap.val = swap_entry.val + offset;
folio = swap_cache_get_folio(swap, NULL, 0);
if (!folio) {
/* Or update major stats only when swapin succeeds?? */
@@ -2315,7 +2316,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
/* Try direct mTHP swapin bypassing swap cache and readahead */
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
swap_order = order;
- folio = shmem_swapin_direct(inode, vma, index,
+ folio = shmem_swapin_direct(inode, vma, index, swap_entry,
swap, &swap_order, gfp);
if (!IS_ERR(folio)) {
skip_swapcache = true;
@@ -2338,28 +2339,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
}
}
alloced:
+ swap_order = folio_order(folio);
+ nr_pages = folio_nr_pages(folio);
+
+ /* The swap-in should cover both @swap and @index */
+ swap.val = round_down(swap.val, nr_pages);
+ VM_WARN_ON_ONCE(swap.val > swap_entry.val + offset);
+ VM_WARN_ON_ONCE(swap.val + nr_pages <= swap_entry.val + offset);
+
/*
* We need to split an existing large entry if swapin brought in a
* smaller folio due to various of reasons.
- *
- * And worth noting there is a special case: if there is a smaller
- * cached folio that covers @swap, but not @index (it only covers
- * first few sub entries of the large entry, but @index points to
- * later parts), the swap cache lookup will still see this folio,
- * And we need to split the large entry here. Later checks will fail,
- * as it can't satisfy the swap requirement, and we will retry
- * the swapin from beginning.
*/
- swap_order = folio_order(folio);
+ index = round_down(index, nr_pages);
if (order > swap_order) {
- error = shmem_split_swap_entry(inode, index, swap, gfp);
+ error = shmem_split_swap_entry(inode, index, swap_entry, gfp);
if (error)
goto failed_nolock;
}
- index = round_down(index, 1 << swap_order);
- swap.val = round_down(swap.val, 1 << swap_order);
-
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
@@ -2372,7 +2370,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
folio_wait_writeback(folio);
- nr_pages = folio_nr_pages(folio);
/*
* Some architectures may have to restore extra metadata to the
--
2.50.0
* Re: [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hung
2025-06-17 18:35 ` [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
@ 2025-06-17 22:58 ` Andrew Morton
2025-06-18 2:11 ` Kairui Song
2025-06-18 2:08 ` Kemeng Shi
1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2025-06-17 22:58 UTC (permalink / raw)
To: Kairui Song
Cc: Kairui Song, linux-mm, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, stable
On Wed, 18 Jun 2025 02:35:00 +0800 Kairui Song <ryncsn@gmail.com> wrote:
> From: Kairui Song <kasong@tencent.com>
>
> The current swap-in code assumes that, when a swap entry in shmem
> mapping is order 0, its cached folios (if present) must be order 0
> too, which turns out not always correct.
>
> The problem is shmem_split_large_entry is called before verifying the
> folio will eventually be swapped in, one possible race is:
>
> CPU1 CPU2
> shmem_swapin_folio
> /* swap in of order > 0 swap entry S1 */
> folio = swap_cache_get_folio
> /* folio = NULL */
> order = xa_get_order
> /* order > 0 */
> folio = shmem_swap_alloc_folio
> /* mTHP alloc failure, folio = NULL */
> <... Interrupted ...>
> shmem_swapin_folio
> /* S1 is swapped in */
> shmem_writeout
> /* S1 is swapped out, folio cached */
> shmem_split_large_entry(..., S1)
> /* S1 is split, but the folio covering it has order > 0 now */
>
> Now any following swapin of S1 will hang: `xa_get_order` returns 0,
> and folio lookup will return a folio with order > 0. The
> `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will
> always return false causing swap-in to return -EEXIST.
>
> And this looks fragile. So fix this up by allowing seeing a larger folio
> in swap cache, and check the whole shmem mapping range covered by the
> swapin have the right swap value upon inserting the folio. And drop
> the redundant tree walks before the insertion.
>
> This will actually improve the performance, as it avoided two redundant
> Xarray tree walks in the hot path, and the only side effect is that in
> the failure path, shmem may redundantly reallocate a few folios
> causing temporary slight memory pressure.
>
> And worth noting, it may seems the order and value check before
> inserting might help reducing the lock contention, which is not true.
> The swap cache layer ensures raced swapin will either see a swap cache
> folio or failed to do a swapin (we have SWAP_HAS_CACHE bit even if
> swap cache is bypassed), so holding the folio lock and checking the
> folio flag is already good enough for avoiding the lock contention.
> The chance that a folio passes the swap entry value check but the
> shmem mapping slot has changed should be very low.
>
> Cc: stable@vger.kernel.org
> Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin")
> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
The Fixes: tells -stable maintainers (and others) which kernel versions
need the fix. So having two Fixes: against different kernel versions is
very confusing! Are we recommending that kernels which contain
809bc86517cc but not 058313515d5a be patched?
* Re: [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hung
2025-06-17 18:35 ` [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
2025-06-17 22:58 ` Andrew Morton
@ 2025-06-18 2:08 ` Kemeng Shi
1 sibling, 0 replies; 22+ messages in thread
From: Kemeng Shi @ 2025-06-18 2:08 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel, stable
on 6/18/2025 2:35 AM, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> The current swap-in code assumes that, when a swap entry in shmem
> mapping is order 0, its cached folios (if present) must be order 0
> too, which turns out not always correct.
>
> The problem is shmem_split_large_entry is called before verifying the
> folio will eventually be swapped in, one possible race is:
>
> CPU1 CPU2
> shmem_swapin_folio
> /* swap in of order > 0 swap entry S1 */
> folio = swap_cache_get_folio
> /* folio = NULL */
> order = xa_get_order
> /* order > 0 */
> folio = shmem_swap_alloc_folio
> /* mTHP alloc failure, folio = NULL */
> <... Interrupted ...>
> shmem_swapin_folio
> /* S1 is swapped in */
> shmem_writeout
> /* S1 is swapped out, folio cached */
> shmem_split_large_entry(..., S1)
> /* S1 is split, but the folio covering it has order > 0 now */
>
> Now any following swapin of S1 will hang: `xa_get_order` returns 0,
> and folio lookup will return a folio with order > 0. The
> `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will
> always return false causing swap-in to return -EEXIST.
>
> And this looks fragile. So fix this up by allowing seeing a larger folio
> in swap cache, and check the whole shmem mapping range covered by the
> swapin have the right swap value upon inserting the folio. And drop
> the redundant tree walks before the insertion.
>
> This will actually improve the performance, as it avoided two redundant
> Xarray tree walks in the hot path, and the only side effect is that in
> the failure path, shmem may redundantly reallocate a few folios
> causing temporary slight memory pressure.
>
> And worth noting, it may seems the order and value check before
> inserting might help reducing the lock contention, which is not true.
> The swap cache layer ensures raced swapin will either see a swap cache
> folio or failed to do a swapin (we have SWAP_HAS_CACHE bit even if
> swap cache is bypassed), so holding the folio lock and checking the
> folio flag is already good enough for avoiding the lock contention.
> The chance that a folio passes the swap entry value check but the
> shmem mapping slot has changed should be very low.
>
> Cc: stable@vger.kernel.org
> Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin")
> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 30 +++++++++++++++++++++---------
> 1 file changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index eda35be2a8d9..4e7ef343a29b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
> pgoff_t index, void *expected, gfp_t gfp)
> {
> XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
> - long nr = folio_nr_pages(folio);
> + unsigned long nr = folio_nr_pages(folio);
> + swp_entry_t iter, swap;
> + void *entry;
>
> VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> @@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
>
> gfp &= GFP_RECLAIM_MASK;
> folio_throttle_swaprate(folio, gfp);
> + swap = iter = radix_to_swp_entry(expected);
>
> do {
> xas_lock_irq(&xas);
> - if (expected != xas_find_conflict(&xas)) {
> - xas_set_err(&xas, -EEXIST);
> - goto unlock;
> + xas_for_each_conflict(&xas, entry) {
> + /*
> + * The range must either be empty, or filled with
> + * expected swap entries. Shmem swap entries are never
> + * partially freed without split of both entry and
> + * folio, so there shouldn't be any holes.
> + */
> + if (!expected || entry != swp_to_radix_entry(iter)) {
> + xas_set_err(&xas, -EEXIST);
> + goto unlock;
> + }
> + iter.val += 1 << xas_get_order(&xas);
> }
> - if (expected && xas_find_conflict(&xas)) {
> + if (expected && iter.val - nr != swap.val) {
> xas_set_err(&xas, -EEXIST);
> goto unlock;
> }
> @@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> error = -ENOMEM;
> goto failed;
> }
> - } else if (order != folio_order(folio)) {
> + } else if (order > folio_order(folio)) {
> /*
> * Swap readahead may swap in order 0 folios into swapcache
> * asynchronously, while the shmem mapping can still stores
> @@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>
> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> }
> + } else if (order < folio_order(folio)) {
> + swap.val = round_down(swp_type(swap), folio_order(folio));
> }
>
> alloced:
> /* We have to do this with folio locked to prevent races */
> folio_lock(folio);
> if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> - folio->swap.val != swap.val ||
> - !shmem_confirm_swap(mapping, index, swap) ||
> - xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
> + folio->swap.val != swap.val) {
> error = -EEXIST;
> goto unlock;
> }
>
Nice catch!
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
* Re: [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hung
2025-06-17 22:58 ` Andrew Morton
@ 2025-06-18 2:11 ` Kairui Song
0 siblings, 0 replies; 22+ messages in thread
From: Kairui Song @ 2025-06-18 2:11 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel, stable
On Wed, Jun 18, 2025 at 6:58 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 18 Jun 2025 02:35:00 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > The current swap-in code assumes that, when a swap entry in shmem
> > mapping is order 0, its cached folios (if present) must be order 0
> > too, which turns out not always correct.
> >
> > The problem is shmem_split_large_entry is called before verifying the
> > folio will eventually be swapped in, one possible race is:
> >
> > CPU1 CPU2
> > shmem_swapin_folio
> > /* swap in of order > 0 swap entry S1 */
> > folio = swap_cache_get_folio
> > /* folio = NULL */
> > order = xa_get_order
> > /* order > 0 */
> > folio = shmem_swap_alloc_folio
> > /* mTHP alloc failure, folio = NULL */
> > <... Interrupted ...>
> > shmem_swapin_folio
> > /* S1 is swapped in */
> > shmem_writeout
> > /* S1 is swapped out, folio cached */
> > shmem_split_large_entry(..., S1)
> > /* S1 is split, but the folio covering it has order > 0 now */
> >
> > Now any following swapin of S1 will hang: `xa_get_order` returns 0,
> > and folio lookup will return a folio with order > 0. The
> > `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will
> > always return false causing swap-in to return -EEXIST.
> >
> > And this looks fragile. So fix this up by allowing seeing a larger folio
> > in swap cache, and check the whole shmem mapping range covered by the
> > swapin have the right swap value upon inserting the folio. And drop
> > the redundant tree walks before the insertion.
> >
> > This will actually improve the performance, as it avoided two redundant
> > Xarray tree walks in the hot path, and the only side effect is that in
> > the failure path, shmem may redundantly reallocate a few folios
> > causing temporary slight memory pressure.
> >
> > And worth noting, it may seems the order and value check before
> > inserting might help reducing the lock contention, which is not true.
> > The swap cache layer ensures raced swapin will either see a swap cache
> > folio or failed to do a swapin (we have SWAP_HAS_CACHE bit even if
> > swap cache is bypassed), so holding the folio lock and checking the
> > folio flag is already good enough for avoiding the lock contention.
> > The chance that a folio passes the swap entry value check but the
> > shmem mapping slot has changed should be very low.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin")
> > Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
>
> The Fixes: tells -stable maintainers (and others) which kernel versions
> need the fix. So having two Fixes: against different kernel versions is
> very confusing! Are we recommending that kernels which contain
> 809bc86517cc but not 058313515d5a be patched?
809bc86517cc introduced mTHP support for shmem but is buggy, and
058313515d5a tried to fix that but is also buggy; I thought listing
both could help people backport this.
I think keeping either one is OK, so I'll keep 809bc86517cc then; any
branch having 809bc86517cc should already have 058313515d5a backported.
* Re: [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin
2025-06-17 18:35 ` [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
@ 2025-06-18 2:48 ` Kemeng Shi
2025-06-18 3:07 ` Kairui Song
2025-06-18 7:16 ` Dev Jain
1 sibling, 1 reply; 22+ messages in thread
From: Kemeng Shi @ 2025-06-18 2:48 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
on 6/18/2025 2:35 AM, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Currently shmem calls xa_get_order to get the swap radix entry order,
> requiring a full tree walk. This can be easily combined with the swap
> entry value checking (shmem_confirm_swap) to avoid the duplicated
> lookup, which should improve the performance.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 33 ++++++++++++++++++++++++---------
> 1 file changed, 24 insertions(+), 9 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4e7ef343a29b..0ad49e57f736 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -505,15 +505,27 @@ static int shmem_replace_entry(struct address_space *mapping,
>
> /*
> * Sometimes, before we decide whether to proceed or to fail, we must check
> - * that an entry was not already brought back from swap by a racing thread.
> + * that an entry was not already brought back or split by a racing thread.
> *
> * Checking folio is not enough: by the time a swapcache folio is locked, it
> * might be reused, and again be swapcache, using the same swap as before.
> + * Returns the swap entry's order if it still presents, else returns -1.
> */
> -static bool shmem_confirm_swap(struct address_space *mapping,
> - pgoff_t index, swp_entry_t swap)
> +static int shmem_swap_check_entry(struct address_space *mapping, pgoff_t index,
> + swp_entry_t swap)
> {
> - return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
> + XA_STATE(xas, &mapping->i_pages, index);
> + int ret = -1;
> + void *entry;
> +
> + rcu_read_lock();
> + do {
> + entry = xas_load(&xas);
> + if (entry == swp_to_radix_entry(swap))
> + ret = xas_get_order(&xas);
> + } while (xas_retry(&xas, entry));
> + rcu_read_unlock();
> + return ret;
> }
>
> /*
> @@ -2256,16 +2268,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> return -EIO;
>
> si = get_swap_device(swap);
> - if (!si) {
> - if (!shmem_confirm_swap(mapping, index, swap))
> + order = shmem_swap_check_entry(mapping, index, swap);
> + if (unlikely(!si)) {
> + if (order < 0)
> return -EEXIST;
> else
> return -EINVAL;
> }
> + if (unlikely(order < 0)) {
> + put_swap_device(si);
> + return -EEXIST;
> + }
Can we re-arrange the code block as follows:
order = shmem_swap_check_entry(mapping, index, swap);
if (unlikely(order < 0))
return -EEXIST;
si = get_swap_device(swap);
if (!si) {
return -EINVAL;
...
>
> /* Look it up and read it in.. */
> folio = swap_cache_get_folio(swap, NULL, 0);
> - order = xa_get_order(&mapping->i_pages, index);
> if (!folio) {
> int nr_pages = 1 << order;
> bool fallback_order0 = false;
> @@ -2415,7 +2431,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> *foliop = folio;
> return 0;
> failed:
> - if (!shmem_confirm_swap(mapping, index, swap))
> + if (shmem_swap_check_entry(mapping, index, swap) < 0)
> error = -EEXIST;
> if (error == -EIO)
> shmem_set_folio_swapin_error(inode, index, folio, swap,
> @@ -2428,7 +2444,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> folio_put(folio);
> }
> put_swap_device(si);
> -
> return error;
> }
>
>
* Re: [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin
2025-06-18 2:48 ` Kemeng Shi
@ 2025-06-18 3:07 ` Kairui Song
2025-06-19 1:30 ` Kemeng Shi
0 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2025-06-18 3:07 UTC (permalink / raw)
To: Kemeng Shi
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang,
Matthew Wilcox, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel
On Wed, Jun 18, 2025 at 10:49 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> on 6/18/2025 2:35 AM, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently shmem calls xa_get_order to get the swap radix entry order,
> > requiring a full tree walk. This can be easily combined with the swap
> > entry value checking (shmem_confirm_swap) to avoid the duplicated
> > lookup, which should improve the performance.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/shmem.c | 33 ++++++++++++++++++++++++---------
> > 1 file changed, 24 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 4e7ef343a29b..0ad49e57f736 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -505,15 +505,27 @@ static int shmem_replace_entry(struct address_space *mapping,
> >
> > /*
> > * Sometimes, before we decide whether to proceed or to fail, we must check
> > - * that an entry was not already brought back from swap by a racing thread.
> > + * that an entry was not already brought back or split by a racing thread.
> > *
> > * Checking folio is not enough: by the time a swapcache folio is locked, it
> > * might be reused, and again be swapcache, using the same swap as before.
> > + * Returns the swap entry's order if it still presents, else returns -1.
> > */
> > -static bool shmem_confirm_swap(struct address_space *mapping,
> > - pgoff_t index, swp_entry_t swap)
> > +static int shmem_swap_check_entry(struct address_space *mapping, pgoff_t index,
> > + swp_entry_t swap)
> > {
> > - return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
> > + XA_STATE(xas, &mapping->i_pages, index);
> > + int ret = -1;
> > + void *entry;
> > +
> > + rcu_read_lock();
> > + do {
> > + entry = xas_load(&xas);
> > + if (entry == swp_to_radix_entry(swap))
> > + ret = xas_get_order(&xas);
> > + } while (xas_retry(&xas, entry));
> > + rcu_read_unlock();
> > + return ret;
> > }
> >
> > /*
> > @@ -2256,16 +2268,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > return -EIO;
> >
> > si = get_swap_device(swap);
> > - if (!si) {
> > - if (!shmem_confirm_swap(mapping, index, swap))
> > + order = shmem_swap_check_entry(mapping, index, swap);
> > + if (unlikely(!si)) {
> > + if (order < 0)
> > return -EEXIST;
> > else
> > return -EINVAL;
> > }
> > + if (unlikely(order < 0)) {
> > + put_swap_device(si);
> > + return -EEXIST;
> > + }
> Can we re-arrange the code block as following:
> order = shmem_swap_check_entry(mapping, index, swap);
> if (unlikely(order < 0))
> return -EEXIST;
>
> si = get_swap_device(swap);
> if (!si) {
> return -EINVAL;
> ...
Hi, thanks for the suggestion.
This may lead to a slightly higher chance of getting -EINVAL when it
should return -EEXIST, leading to user space errors.
For example, if this CPU gets interrupted after `order =
shmem_swap_check_entry(mapping, index, swap);` and another CPU
swapoffs the device, we then get `si = NULL` here, but the entry has
already been swapped in, so it should return -EEXIST, not -EINVAL.
The chance is really low so it's kind of trivial; we could do a `goto
failed` if we got (!si) here, but that would make the logic under
`failed:` more complex. So I'd prefer not to change the original
behaviour, which looks more correct.
* Re: [PATCH 3/4] mm/shmem, swap: improve mthp swapin process
2025-06-17 18:35 ` [PATCH 3/4] mm/shmem, swap: improve mthp swapin process Kairui Song
@ 2025-06-18 6:27 ` Kemeng Shi
2025-06-18 6:50 ` Kairui Song
2025-06-18 8:26 ` Kemeng Shi
1 sibling, 1 reply; 22+ messages in thread
From: Kemeng Shi @ 2025-06-18 6:27 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
on 6/18/2025 2:35 AM, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Tidy up the mTHP swapin workflow. There should be no feature change, but
> consolidates the mTHP related check to one place so they are now all
> wrapped by CONFIG_TRANSPARENT_HUGEPAGE, and will be trimmed off by
> compiler if not needed.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
> 1 file changed, 78 insertions(+), 97 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0ad49e57f736..46dea2fa1b43 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
...
> @@ -2283,110 +2306,66 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> /* Look it up and read it in.. */
> folio = swap_cache_get_folio(swap, NULL, 0);
> if (!folio) {
> - int nr_pages = 1 << order;
> - bool fallback_order0 = false;
> -
> /* Or update major stats only when swapin succeeds?? */
> if (fault_type) {
> *fault_type |= VM_FAULT_MAJOR;
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
> -
> - /*
> - * If uffd is active for the vma, we need per-page fault
> - * fidelity to maintain the uffd semantics, then fallback
> - * to swapin order-0 folio, as well as for zswap case.
> - * Any existing sub folio in the swap cache also blocks
> - * mTHP swapin.
> - */
> - if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
> - !zswap_never_enabled() ||
> - non_swapcache_batch(swap, nr_pages) != nr_pages))
> - fallback_order0 = true;
> -
> - /* Skip swapcache for synchronous device. */
> - if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> - folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
> + /* Try direct mTHP swapin bypassing swap cache and readahead */
> + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> + swap_order = order;
> + folio = shmem_swapin_direct(inode, vma, index,
> + swap, &swap_order, gfp);
> if (!IS_ERR(folio)) {
> skip_swapcache = true;
> goto alloced;
> }
> -
> - /*
> - * Fallback to swapin order-0 folio unless the swap entry
> - * already exists.
> - */
> + /* Fallback if order > 0 swapin failed with -ENOMEM */
> error = PTR_ERR(folio);
> folio = NULL;
> - if (error == -EEXIST)
> + if (error != -ENOMEM || !swap_order)
> goto failed;
> }
> -
> /*
> - * Now swap device can only swap in order 0 folio, then we
> - * should split the large swap entry stored in the pagecache
> - * if necessary.
> + * Try order 0 swapin using swap cache and readahead, it still
> + * may return order > 0 folio due to raced swap cache.
> */
> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> - if (split_order < 0) {
> - error = split_order;
> - goto failed;
> - }
> -
> - /*
> - * If the large swap entry has already been split, it is
> - * necessary to recalculate the new swap entry based on
> - * the old order alignment.
> - */
> - if (split_order > 0) {
> - pgoff_t offset = index - round_down(index, 1 << split_order);
> -
> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> - }
> -
For the order 0 fallback, we always called shmem_swapin_cluster() before,
but we will call shmem_swap_alloc_folio() now. It seems fine to me; just
pointing this out for others to recheck.
> - /* Here we actually start the io */
> folio = shmem_swapin_cluster(swap, gfp, info, index);
> if (!folio) {
> error = -ENOMEM;
> goto failed;
> }
> - } else if (order > folio_order(folio)) {
> - /*
> - * Swap readahead may swap in order 0 folios into swapcache
> - * asynchronously, while the shmem mapping can still stores
> - * large swap entries. In such cases, we should split the
> - * large swap entry to prevent possible data corruption.
> - */
> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> - if (split_order < 0) {
> - folio_put(folio);
> - folio = NULL;
> - error = split_order;
> - goto failed;
> - }
> -
> - /*
> - * If the large swap entry has already been split, it is
> - * necessary to recalculate the new swap entry based on
> - * the old order alignment.
> - */
> - if (split_order > 0) {
> - pgoff_t offset = index - round_down(index, 1 << split_order);
> -
> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> - }
> - } else if (order < folio_order(folio)) {
> - swap.val = round_down(swp_type(swap), folio_order(folio));
> }
> -
> alloced:
> + /*
> + * We need to split an existing large entry if swapin brought in a
> + * smaller folio due to various of reasons.
> + *
> + * And worth noting there is a special case: if there is a smaller
> + * cached folio that covers @swap, but not @index (it only covers
> + * first few sub entries of the large entry, but @index points to
> + * later parts), the swap cache lookup will still see this folio,
> + * And we need to split the large entry here. Later checks will fail,
> + * as it can't satisfy the swap requirement, and we will retry
> + * the swapin from beginning.
> + */
> + swap_order = folio_order(folio);
> + if (order > swap_order) {
> + error = shmem_split_swap_entry(inode, index, swap, gfp);
> + if (error)
> + goto failed_nolock;
> + }
> +
> + index = round_down(index, 1 << swap_order);
> + swap.val = round_down(swap.val, 1 << swap_order);
> +
If the swap entry order is reduced but index and swap value stay unchanged,
shmem_split_swap_entry() will split the reduced large swap entry
successfully, but index and swap.val will be incorrect because of the
wrong swap_order. We can catch the unexpected order and swap entry in
shmem_add_to_page_cache() and retry the swapin, but this will
introduce extra cost.
If we return the order of the entry which was split from
shmem_split_swap_entry() and update index and swap.val with the returned
order, we can avoid the extra cost for the mentioned racy case.
* Re: [PATCH 3/4] mm/shmem, swap: improve mthp swapin process
2025-06-18 6:27 ` Kemeng Shi
@ 2025-06-18 6:50 ` Kairui Song
2025-06-18 8:08 ` Kemeng Shi
0 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2025-06-18 6:50 UTC (permalink / raw)
To: Kemeng Shi
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang,
Matthew Wilcox, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel
On Wed, Jun 18, 2025 at 2:27 PM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> on 6/18/2025 2:35 AM, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Tidy up the mTHP swapin workflow. There should be no feature change, but
> > consolidates the mTHP related check to one place so they are now all
> > wrapped by CONFIG_TRANSPARENT_HUGEPAGE, and will be trimmed off by
> > compiler if not needed.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
> > 1 file changed, 78 insertions(+), 97 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 0ad49e57f736..46dea2fa1b43 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
>
> ...
>
> > @@ -2283,110 +2306,66 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > /* Look it up and read it in.. */
> > folio = swap_cache_get_folio(swap, NULL, 0);
> > if (!folio) {
> > - int nr_pages = 1 << order;
> > - bool fallback_order0 = false;
> > -
> > /* Or update major stats only when swapin succeeds?? */
> > if (fault_type) {
> > *fault_type |= VM_FAULT_MAJOR;
> > count_vm_event(PGMAJFAULT);
> > count_memcg_event_mm(fault_mm, PGMAJFAULT);
> > }
> > -
> > - /*
> > - * If uffd is active for the vma, we need per-page fault
> > - * fidelity to maintain the uffd semantics, then fallback
> > - * to swapin order-0 folio, as well as for zswap case.
> > - * Any existing sub folio in the swap cache also blocks
> > - * mTHP swapin.
> > - */
> > - if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
> > - !zswap_never_enabled() ||
> > - non_swapcache_batch(swap, nr_pages) != nr_pages))
> > - fallback_order0 = true;
> > -
> > - /* Skip swapcache for synchronous device. */
> > - if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > - folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
> > + /* Try direct mTHP swapin bypassing swap cache and readahead */
> > + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > + swap_order = order;
> > + folio = shmem_swapin_direct(inode, vma, index,
> > + swap, &swap_order, gfp);
> > if (!IS_ERR(folio)) {
> > skip_swapcache = true;
> > goto alloced;
> > }
> > -
> > - /*
> > - * Fallback to swapin order-0 folio unless the swap entry
> > - * already exists.
> > - */
> > + /* Fallback if order > 0 swapin failed with -ENOMEM */
> > error = PTR_ERR(folio);
> > folio = NULL;
> > - if (error == -EEXIST)
> > + if (error != -ENOMEM || !swap_order)
> > goto failed;
> > }
> > -
> > /*
> > - * Now swap device can only swap in order 0 folio, then we
> > - * should split the large swap entry stored in the pagecache
> > - * if necessary.
> > + * Try order 0 swapin using swap cache and readahead, it still
> > + * may return order > 0 folio due to raced swap cache.
> > */
> > - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> > - if (split_order < 0) {
> > - error = split_order;
> > - goto failed;
> > - }
> > -
> > - /*
> > - * If the large swap entry has already been split, it is
> > - * necessary to recalculate the new swap entry based on
> > - * the old order alignment.
> > - */
> > - if (split_order > 0) {
> > - pgoff_t offset = index - round_down(index, 1 << split_order);
> > -
> > - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> > - }
> > -
> For fallback order 0, we always call shmem_swapin_cluster() before but we will call
> shmem_swap_alloc_folio() now. It seems fine to me; just pointing this out for others
> to recheck.
It's an improvement; I forgot to mention that in the commit message.
Readahead is a burden for SWP_SYNCHRONOUS_IO devices, so calling
shmem_swap_alloc_folio() is better. I'll update the commit message.
> > - /* Here we actually start the io */
> > folio = shmem_swapin_cluster(swap, gfp, info, index);
> > if (!folio) {
> > error = -ENOMEM;
> > goto failed;
> > }
> > - } else if (order > folio_order(folio)) {
> > - /*
> > - * Swap readahead may swap in order 0 folios into swapcache
> > - * asynchronously, while the shmem mapping can still stores
> > - * large swap entries. In such cases, we should split the
> > - * large swap entry to prevent possible data corruption.
> > - */
> > - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> > - if (split_order < 0) {
> > - folio_put(folio);
> > - folio = NULL;
> > - error = split_order;
> > - goto failed;
> > - }
> > -
> > - /*
> > - * If the large swap entry has already been split, it is
> > - * necessary to recalculate the new swap entry based on
> > - * the old order alignment.
> > - */
> > - if (split_order > 0) {
> > - pgoff_t offset = index - round_down(index, 1 << split_order);
> > -
> > - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> > - }
> > - } else if (order < folio_order(folio)) {
> > - swap.val = round_down(swp_type(swap), folio_order(folio));
> > }
> > -
> > alloced:
> > + /*
> > + * We need to split an existing large entry if swapin brought in a
> > + * smaller folio due to various of reasons.
> > + *
> > + * And worth noting there is a special case: if there is a smaller
> > + * cached folio that covers @swap, but not @index (it only covers
> > + * first few sub entries of the large entry, but @index points to
> > + * later parts), the swap cache lookup will still see this folio,
> > + * And we need to split the large entry here. Later checks will fail,
> > + * as it can't satisfy the swap requirement, and we will retry
> > + * the swapin from beginning.
> > + */
> > + swap_order = folio_order(folio);
> > + if (order > swap_order) {
> > + error = shmem_split_swap_entry(inode, index, swap, gfp);
> > + if (error)
> > + goto failed_nolock;
> > + }
> > +
> > + index = round_down(index, 1 << swap_order);
> > + swap.val = round_down(swap.val, 1 << swap_order);
> > +
>
> If the swap entry's order is reduced but index and swap.val are left unchanged,
> shmem_split_swap_entry() will still split the reduced large swap entry
> successfully, but index and swap.val will be incorrect because they were
> computed with the wrong swap_order. We can catch the unexpected order and
> swap entry in shmem_add_to_page_cache() and retry the swapin, but this
> introduces extra cost.
>
> If shmem_split_swap_entry() returned the order of the entry it actually
> split, and we updated index and swap.val with that returned order, we could
> avoid the extra cost for the mentioned racy case.
The swap_order here is simply the folio's order, so doing this
round_down yields the swap entry and index range that this folio will
cover (the later folio->swap.val != swap.val check ensures the values
are valid here).
Remember that, as the previous patch mentioned, we may see the shmem
mapping's entry get split while still seeing a large folio here. With
the current design, shmem_add_to_page_cache() can still succeed as it
should, but if we round_down with the order returned by
shmem_split_swap_entry(), it will fail.
So I think it's better to keep it this way; besides, the next patch
makes use of this for sanity checks and for reducing false positives
in swap cache lookups.
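
As a minimal illustration of the arithmetic above (a hypothetical userspace
sketch with made-up numbers, not kernel code; round_down here is a local
stand-in for the kernel macro):

#include <stdio.h>

/* Stand-in for the kernel's round_down() for power-of-two sizes. */
#define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

int main(void)
{
	/* Made-up numbers: an order-4 (16-slot) large swap entry whose first
	 * sub-entry sits at swap offset 0x100, a fault at index 19 within
	 * that block, and a swapin that only brought back an order-2 folio. */
	unsigned long index = 19;
	unsigned long swap_val = 0x100 + (index - round_down(index, 1UL << 4));
	unsigned int swap_order = 2;	/* folio_order(folio) */

	/* Rounding down by the folio's own order gives exactly the range the
	 * folio covers: index 16..19 and swap 0x100..0x103. */
	printf("index = %lu, swap.val = 0x%lx\n",
	       round_down(index, 1UL << swap_order),
	       round_down(swap_val, 1UL << swap_order));
	return 0;
}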
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin
2025-06-17 18:35 ` [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
2025-06-18 2:48 ` Kemeng Shi
@ 2025-06-18 7:16 ` Dev Jain
2025-06-18 7:22 ` Kairui Song
1 sibling, 1 reply; 22+ messages in thread
From: Dev Jain @ 2025-06-18 7:16 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Ryan Roberts
On 18/06/25 12:05 am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Currently shmem calls xa_get_order to get the swap radix entry order,
> requiring a full tree walk. This can be easily combined with the swap
> entry value checking (shmem_confirm_swap) to avoid the duplicated
> lookup, which should improve the performance.
Nice spot!
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 33 ++++++++++++++++++++++++---------
> 1 file changed, 24 insertions(+), 9 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4e7ef343a29b..0ad49e57f736 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -505,15 +505,27 @@ static int shmem_replace_entry(struct address_space *mapping,
>
> /*
> * Sometimes, before we decide whether to proceed or to fail, we must check
> - * that an entry was not already brought back from swap by a racing thread.
> + * that an entry was not already brought back or split by a racing thread.
> *
> * Checking folio is not enough: by the time a swapcache folio is locked, it
> * might be reused, and again be swapcache, using the same swap as before.
> + * Returns the swap entry's order if it still presents, else returns -1.
> */
> -static bool shmem_confirm_swap(struct address_space *mapping,
> - pgoff_t index, swp_entry_t swap)
> +static int shmem_swap_check_entry(struct address_space *mapping, pgoff_t index,
> + swp_entry_t swap)
I think the function name shmem_confirm_swap is already good enough? Anyhow the
changed name should at least be shmem_check_entry_is_swap.
> {
> - return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
> + XA_STATE(xas, &mapping->i_pages, index);
> + int ret = -1;
> + void *entry;
> +
> + rcu_read_lock();
> + do {
> + entry = xas_load(&xas);
> + if (entry == swp_to_radix_entry(swap))
> + ret = xas_get_order(&xas);
> + } while (xas_retry(&xas, entry));
> + rcu_read_unlock();
> + return ret;
> }
>
> /*
> @@ -2256,16 +2268,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> return -EIO;
>
> si = get_swap_device(swap);
> - if (!si) {
> - if (!shmem_confirm_swap(mapping, index, swap))
> + order = shmem_swap_check_entry(mapping, index, swap);
> + if (unlikely(!si)) {
> + if (order < 0)
> return -EEXIST;
> else
> return -EINVAL;
> }
> + if (unlikely(order < 0)) {
> + put_swap_device(si);
> + return -EEXIST;
> + }
>
> /* Look it up and read it in.. */
> folio = swap_cache_get_folio(swap, NULL, 0);
> - order = xa_get_order(&mapping->i_pages, index);
> if (!folio) {
> int nr_pages = 1 << order;
> bool fallback_order0 = false;
> @@ -2415,7 +2431,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> *foliop = folio;
> return 0;
> failed:
> - if (!shmem_confirm_swap(mapping, index, swap))
> + if (shmem_swap_check_entry(mapping, index, swap) < 0)
> error = -EEXIST;
> if (error == -EIO)
> shmem_set_folio_swapin_error(inode, index, folio, swap,
> @@ -2428,7 +2444,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> folio_put(folio);
> }
> put_swap_device(si);
> -
> return error;
> }
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin
2025-06-18 7:16 ` Dev Jain
@ 2025-06-18 7:22 ` Kairui Song
2025-06-18 7:29 ` Dev Jain
0 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2025-06-18 7:22 UTC (permalink / raw)
To: Dev Jain
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang,
Matthew Wilcox, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He,
Barry Song, linux-kernel, Ryan Roberts
On Wed, Jun 18, 2025 at 3:17 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 18/06/25 12:05 am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently shmem calls xa_get_order to get the swap radix entry order,
> > requiring a full tree walk. This can be easily combined with the swap
> > entry value checking (shmem_confirm_swap) to avoid the duplicated
> > lookup, which should improve the performance.
>
> Nice spot!
>
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/shmem.c | 33 ++++++++++++++++++++++++---------
> > 1 file changed, 24 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 4e7ef343a29b..0ad49e57f736 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -505,15 +505,27 @@ static int shmem_replace_entry(struct address_space *mapping,
> >
> > /*
> > * Sometimes, before we decide whether to proceed or to fail, we must check
> > - * that an entry was not already brought back from swap by a racing thread.
> > + * that an entry was not already brought back or split by a racing thread.
> > *
> > * Checking folio is not enough: by the time a swapcache folio is locked, it
> > * might be reused, and again be swapcache, using the same swap as before.
> > + * Returns the swap entry's order if it still presents, else returns -1.
> > */
> > -static bool shmem_confirm_swap(struct address_space *mapping,
> > - pgoff_t index, swp_entry_t swap)
> > +static int shmem_swap_check_entry(struct address_space *mapping, pgoff_t index,
> > + swp_entry_t swap)
>
> I think the function name shmem_confirm_swap is already good enough? Anyhow the
> changed name should at least be shmem_check_entry_is_swap.
>
Good, I can keep the function name unchanged or follow your
suggestion. I thought a `confirm` function returning a non-binary return
value may look strange. I'm terrible at naming things :P
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin
2025-06-18 7:22 ` Kairui Song
@ 2025-06-18 7:29 ` Dev Jain
0 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-06-18 7:29 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang,
Matthew Wilcox, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He,
Barry Song, linux-kernel, Ryan Roberts
On 18/06/25 12:52 pm, Kairui Song wrote:
> On Wed, Jun 18, 2025 at 3:17 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>> On 18/06/25 12:05 am, Kairui Song wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> Currently shmem calls xa_get_order to get the swap radix entry order,
>>> requiring a full tree walk. This can be easily combined with the swap
>>> entry value checking (shmem_confirm_swap) to avoid the duplicated
>>> lookup, which should improve the performance.
>> Nice spot!
>>
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>> mm/shmem.c | 33 ++++++++++++++++++++++++---------
>>> 1 file changed, 24 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 4e7ef343a29b..0ad49e57f736 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -505,15 +505,27 @@ static int shmem_replace_entry(struct address_space *mapping,
>>>
>>> /*
>>> * Sometimes, before we decide whether to proceed or to fail, we must check
>>> - * that an entry was not already brought back from swap by a racing thread.
>>> + * that an entry was not already brought back or split by a racing thread.
>>> *
>>> * Checking folio is not enough: by the time a swapcache folio is locked, it
>>> * might be reused, and again be swapcache, using the same swap as before.
>>> + * Returns the swap entry's order if it still presents, else returns -1.
>>> */
>>> -static bool shmem_confirm_swap(struct address_space *mapping,
>>> - pgoff_t index, swp_entry_t swap)
>>> +static int shmem_swap_check_entry(struct address_space *mapping, pgoff_t index,
>>> + swp_entry_t swap)
>> I think the function name shmem_confirm_swap is already good enough? Anyhow the
>> changed name should at least be shmem_check_entry_is_swap.
>>
> Good, I can keep the function name unchanged or follow your
> suggestion. I thought a `confirm` function returning a non-binary return
True. I will vote for keeping the name unchanged; you have already documented
the return value so it should be fine. Could you just put a blank line between
"Returns the swap entry's order..." and the previous line to make it clearer?
> value may look strange. I'm terrible at naming things :P
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/4] mm/shmem, swap: improve mthp swapin process
2025-06-18 6:50 ` Kairui Song
@ 2025-06-18 8:08 ` Kemeng Shi
0 siblings, 0 replies; 22+ messages in thread
From: Kemeng Shi @ 2025-06-18 8:08 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang,
Matthew Wilcox, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel
on 6/18/2025 2:50 PM, Kairui Song wrote:
> On Wed, Jun 18, 2025 at 2:27 PM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>> on 6/18/2025 2:35 AM, Kairui Song wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> Tidy up the mTHP swapin workflow. There should be no feature change, but
>>> consolidates the mTHP related check to one place so they are now all
>>> wrapped by CONFIG_TRANSPARENT_HUGEPAGE, and will be trimmed off by
>>> compiler if not needed.
>>>
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>> mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
>>> 1 file changed, 78 insertions(+), 97 deletions(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 0ad49e57f736..46dea2fa1b43 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>
>> ...
>>
>>> @@ -2283,110 +2306,66 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>> /* Look it up and read it in.. */
>>> folio = swap_cache_get_folio(swap, NULL, 0);
>>> if (!folio) {
>>> - int nr_pages = 1 << order;
>>> - bool fallback_order0 = false;
>>> -
>>> /* Or update major stats only when swapin succeeds?? */
>>> if (fault_type) {
>>> *fault_type |= VM_FAULT_MAJOR;
>>> count_vm_event(PGMAJFAULT);
>>> count_memcg_event_mm(fault_mm, PGMAJFAULT);
>>> }
>>> -
>>> - /*
>>> - * If uffd is active for the vma, we need per-page fault
>>> - * fidelity to maintain the uffd semantics, then fallback
>>> - * to swapin order-0 folio, as well as for zswap case.
>>> - * Any existing sub folio in the swap cache also blocks
>>> - * mTHP swapin.
>>> - */
>>> - if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
>>> - !zswap_never_enabled() ||
>>> - non_swapcache_batch(swap, nr_pages) != nr_pages))
>>> - fallback_order0 = true;
>>> -
>>> - /* Skip swapcache for synchronous device. */
>>> - if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
>>> - folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
>>> + /* Try direct mTHP swapin bypassing swap cache and readahead */
>>> + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
>>> + swap_order = order;
>>> + folio = shmem_swapin_direct(inode, vma, index,
>>> + swap, &swap_order, gfp);
>>> if (!IS_ERR(folio)) {
>>> skip_swapcache = true;
>>> goto alloced;
>>> }
>>> -
>>> - /*
>>> - * Fallback to swapin order-0 folio unless the swap entry
>>> - * already exists.
>>> - */
>>> + /* Fallback if order > 0 swapin failed with -ENOMEM */
>>> error = PTR_ERR(folio);
>>> folio = NULL;
>>> - if (error == -EEXIST)
>>> + if (error != -ENOMEM || !swap_order)
>>> goto failed;
>>> }
>>> -
>>> /*
>>> - * Now swap device can only swap in order 0 folio, then we
>>> - * should split the large swap entry stored in the pagecache
>>> - * if necessary.
>>> + * Try order 0 swapin using swap cache and readahead, it still
>>> + * may return order > 0 folio due to raced swap cache.
>>> */
>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>> - if (split_order < 0) {
>>> - error = split_order;
>>> - goto failed;
>>> - }
>>> -
>>> - /*
>>> - * If the large swap entry has already been split, it is
>>> - * necessary to recalculate the new swap entry based on
>>> - * the old order alignment.
>>> - */
>>> - if (split_order > 0) {
>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
>>> -
>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>> - }
>>> -
>> For fallback order 0, we always call shmem_swapin_cluster() before but we will call
>> shmem_swap_alloc_folio() now. It seems fine to me; just pointing this out for others
>> to recheck.
>
> It's an improvement; I forgot to mention that in the commit message.
> Readahead is a burden for SWP_SYNCHRONOUS_IO devices, so calling
> shmem_swap_alloc_folio() is better. I'll update the commit message.
>
>>> - /* Here we actually start the io */
>>> folio = shmem_swapin_cluster(swap, gfp, info, index);
>>> if (!folio) {
>>> error = -ENOMEM;
>>> goto failed;
>>> }
>>> - } else if (order > folio_order(folio)) {
>>> - /*
>>> - * Swap readahead may swap in order 0 folios into swapcache
>>> - * asynchronously, while the shmem mapping can still stores
>>> - * large swap entries. In such cases, we should split the
>>> - * large swap entry to prevent possible data corruption.
>>> - */
>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>> - if (split_order < 0) {
>>> - folio_put(folio);
>>> - folio = NULL;
>>> - error = split_order;
>>> - goto failed;
>>> - }
>>> -
>>> - /*
>>> - * If the large swap entry has already been split, it is
>>> - * necessary to recalculate the new swap entry based on
>>> - * the old order alignment.
>>> - */
>>> - if (split_order > 0) {
>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
>>> -
>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>> - }
>>> - } else if (order < folio_order(folio)) {
>>> - swap.val = round_down(swp_type(swap), folio_order(folio));
>>> }
>>> -
>>> alloced:
>>> + /*
>>> + * We need to split an existing large entry if swapin brought in a
>>> + * smaller folio due to various of reasons.
>>> + *
>>> + * And worth noting there is a special case: if there is a smaller
>>> + * cached folio that covers @swap, but not @index (it only covers
>>> + * first few sub entries of the large entry, but @index points to
>>> + * later parts), the swap cache lookup will still see this folio,
>>> + * And we need to split the large entry here. Later checks will fail,
>>> + * as it can't satisfy the swap requirement, and we will retry
>>> + * the swapin from beginning.
>>> + */
>>> + swap_order = folio_order(folio);
>>> + if (order > swap_order) {
>>> + error = shmem_split_swap_entry(inode, index, swap, gfp);
>>> + if (error)
>>> + goto failed_nolock;
>>> + }
>>> +
>>> + index = round_down(index, 1 << swap_order);
>>> + swap.val = round_down(swap.val, 1 << swap_order);
>>> +
>>
>> If the swap entry's order is reduced but index and swap.val are left unchanged,
>> shmem_split_swap_entry() will still split the reduced large swap entry
>> successfully, but index and swap.val will be incorrect because they were
>> computed with the wrong swap_order. We can catch the unexpected order and
>> swap entry in shmem_add_to_page_cache() and retry the swapin, but this
>> introduces extra cost.
>>
>> If shmem_split_swap_entry() returned the order of the entry it actually
>> split, and we updated index and swap.val with that returned order, we could
>> avoid the extra cost for the mentioned racy case.
>
> The swap_order here is simply the folio's order, so doing this
> round_down yields the swap entry and index range that this folio will
> cover (the later folio->swap.val != swap.val check ensures the values
> are valid here).
>
> Remember that, as the previous patch mentioned, we may see the shmem
> mapping's entry get split while still seeing a large folio here. With
> the current design, shmem_add_to_page_cache() can still succeed as it
> should, but if we round_down with the order returned by
> shmem_split_swap_entry(), it will fail.
My bad... Thanks for the explanation. The calculation looks fine to me.
>
> So I think it's better to keep it this way; besides, the next patch
> makes use of this for sanity checks and for reducing false positives
> in swap cache lookups.
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/4] mm/shmem, swap: improve mthp swapin process
2025-06-17 18:35 ` [PATCH 3/4] mm/shmem, swap: improve mthp swapin process Kairui Song
2025-06-18 6:27 ` Kemeng Shi
@ 2025-06-18 8:26 ` Kemeng Shi
2025-06-18 8:46 ` Kairui Song
1 sibling, 1 reply; 22+ messages in thread
From: Kemeng Shi @ 2025-06-18 8:26 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
on 6/18/2025 2:35 AM, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Tidy up the mTHP swapin workflow. There should be no feature change, but
> consolidates the mTHP related check to one place so they are now all
> wrapped by CONFIG_TRANSPARENT_HUGEPAGE, and will be trimmed off by
> compiler if not needed.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
> 1 file changed, 78 insertions(+), 97 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
...
Hello, here is another potential issue if shmem swapin can race with a folio
split.
> alloced:
> + /*
> + * We need to split an existing large entry if swapin brought in a
> + * smaller folio due to various of reasons.
> + *
> + * And worth noting there is a special case: if there is a smaller
> + * cached folio that covers @swap, but not @index (it only covers
> + * first few sub entries of the large entry, but @index points to
> + * later parts), the swap cache lookup will still see this folio,
> + * And we need to split the large entry here. Later checks will fail,
> + * as it can't satisfy the swap requirement, and we will retry
> + * the swapin from beginning.
> + */
> + swap_order = folio_order(folio);
> + if (order > swap_order) {
> + error = shmem_split_swap_entry(inode, index, swap, gfp);
> + if (error)
> + goto failed_nolock;
> + }
> +
> + index = round_down(index, 1 << swap_order);
> + swap.val = round_down(swap.val, 1 << swap_order);
> +
/* suppose the folio is split here */
> /* We have to do this with folio locked to prevent races */
> folio_lock(folio);
> if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> folio->swap.val != swap.val) {
> error = -EEXIST;
> - goto unlock;
> + goto failed_unlock;
> }
> if (!folio_test_uptodate(folio)) {
> error = -EIO;
> @@ -2407,8 +2386,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> goto failed;
> }
>
> - error = shmem_add_to_page_cache(folio, mapping,
> - round_down(index, nr_pages),
> + error = shmem_add_to_page_cache(folio, mapping, index,
> swp_to_radix_entry(swap), gfp);
The actual swapin order is then less than swap_order, and the swapped-in
folio may not cover the index from the caller.
So we should move the index and swap.val calculation to after the folio is
locked.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/4] mm/shmem, swap: improve mthp swapin process
2025-06-18 8:26 ` Kemeng Shi
@ 2025-06-18 8:46 ` Kairui Song
2025-06-19 1:32 ` Kemeng Shi
0 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2025-06-18 8:46 UTC (permalink / raw)
To: Kemeng Shi
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang,
Matthew Wilcox, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel
On Wed, Jun 18, 2025 at 4:26 PM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
> on 6/18/2025 2:35 AM, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Tidy up the mTHP swapin workflow. There should be no feature change, but
> > consolidates the mTHP related check to one place so they are now all
> > wrapped by CONFIG_TRANSPARENT_HUGEPAGE, and will be trimmed off by
> > compiler if not needed.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
> > 1 file changed, 78 insertions(+), 97 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
>
> ...
>
> Hello, here is another potential issue if shmem swapin can race with a folio
> split.
>
> > alloced:
> > + /*
> > + * We need to split an existing large entry if swapin brought in a
> > + * smaller folio due to various of reasons.
> > + *
> > + * And worth noting there is a special case: if there is a smaller
> > + * cached folio that covers @swap, but not @index (it only covers
> > + * first few sub entries of the large entry, but @index points to
> > + * later parts), the swap cache lookup will still see this folio,
> > + * And we need to split the large entry here. Later checks will fail,
> > + * as it can't satisfy the swap requirement, and we will retry
> > + * the swapin from beginning.
> > + */
> > + swap_order = folio_order(folio);
> > + if (order > swap_order) {
> > + error = shmem_split_swap_entry(inode, index, swap, gfp);
> > + if (error)
> > + goto failed_nolock;
> > + }
> > +
> > + index = round_down(index, 1 << swap_order);
> > + swap.val = round_down(swap.val, 1 << swap_order);
> > +
> /* suppose the folio is split here */
> > /* We have to do this with folio locked to prevent races */
> > folio_lock(folio);
> > if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> > folio->swap.val != swap.val) {
> > error = -EEXIST;
> > - goto unlock;
> > + goto failed_unlock;
> > }
> > if (!folio_test_uptodate(folio)) {
> > error = -EIO;
> > @@ -2407,8 +2386,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > goto failed;
> > }
> >
> > - error = shmem_add_to_page_cache(folio, mapping,
> > - round_down(index, nr_pages),
> > + error = shmem_add_to_page_cache(folio, mapping, index,
> > swp_to_radix_entry(swap), gfp);
>
> The actual swapin order is then less than swap_order, and the swapped-in
> folio may not cover the index from the caller.
>
> So we should move the index and swap.val calculation to after the folio is
> locked.
Hi, Thanks very much for checking the code carefully!
If I'm not wrong here, holding a reference is enough to stabilize the folio
order.
See split_huge_page_to_list_to_order, "Any unexpected folio references
... -EAGAIN" and can_split_folio.
We can add a `swap_order == folio_order(folio)` check after folio lock
though, as a (sanity) check, just in case.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/4] mm/shmem, swap: avoid false positive swap cache lookup
2025-06-17 18:35 ` [PATCH 4/4] mm/shmem, swap: avoid false positive swap cache lookup Kairui Song
@ 2025-06-19 1:28 ` Kemeng Shi
2025-06-19 17:37 ` Kairui Song
0 siblings, 1 reply; 22+ messages in thread
From: Kemeng Shi @ 2025-06-19 1:28 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
on 6/18/2025 2:35 AM, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> If the shmem read request's index points to the middle of a large swap
> entry, shmem swap in does the swap cache lookup use the large swap
> entry's starting value (the first sub swap entry of this large entry).
> This will lead to false positive lookup result if only the first few
> swap entries are cached, but the requested swap entry pointed by index
> is uncached.
>
> Currently shmem will do a large entry split then retry the swapin from
> beginning, which is a waste of CPU and fragile. Handle this correctly.
>
> Also add some sanity checks to help understand the code and ensure
> things won't go wrong.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 61 ++++++++++++++++++++++++++----------------------------
> 1 file changed, 29 insertions(+), 32 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 46dea2fa1b43..0bc30dafad90 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1977,12 +1977,12 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>
> static struct folio *shmem_swapin_direct(struct inode *inode,
> struct vm_area_struct *vma, pgoff_t index,
> - swp_entry_t entry, int *order, gfp_t gfp)
> + swp_entry_t swap_entry, swp_entry_t swap,
> + int *order, gfp_t gfp)
> {
> struct shmem_inode_info *info = SHMEM_I(inode);
> int nr_pages = 1 << *order;
> struct folio *new;
> - pgoff_t offset;
> void *shadow;
>
> /*
> @@ -2003,13 +2003,11 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
> */
> if ((vma && userfaultfd_armed(vma)) ||
> !zswap_never_enabled() ||
> - non_swapcache_batch(entry, nr_pages) != nr_pages) {
> - offset = index - round_down(index, nr_pages);
> - entry = swp_entry(swp_type(entry),
> - swp_offset(entry) + offset);
> + non_swapcache_batch(swap_entry, nr_pages) != nr_pages) {
> *order = 0;
> nr_pages = 1;
> } else {
> + swap.val = swap_entry.val;
> gfp_t huge_gfp = vma_thp_gfp_mask(vma);
>
> gfp = limit_gfp_mask(huge_gfp, gfp);
> @@ -2021,7 +2019,7 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
> return ERR_PTR(-ENOMEM);
>
> if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
> - gfp, entry)) {
> + gfp, swap)) {
> folio_put(new);
> return ERR_PTR(-ENOMEM);
> }
> @@ -2036,17 +2034,17 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
> * In this case, shmem_add_to_page_cache() will help identify the
> * concurrent swapin and return -EEXIST.
> */
> - if (swapcache_prepare(entry, nr_pages)) {
> + if (swapcache_prepare(swap, nr_pages)) {
> folio_put(new);
> return ERR_PTR(-EEXIST);
> }
>
> __folio_set_locked(new);
> __folio_set_swapbacked(new);
> - new->swap = entry;
> + new->swap = swap;
>
> - memcg1_swapin(entry, nr_pages);
> - shadow = get_shadow_from_swap_cache(entry);
> + memcg1_swapin(swap, nr_pages);
> + shadow = get_shadow_from_swap_cache(swap);
> if (shadow)
> workingset_refault(new, shadow);
> folio_add_lru(new);
> @@ -2278,20 +2276,21 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> struct shmem_inode_info *info = SHMEM_I(inode);
> int error, nr_pages, order, swap_order;
> + swp_entry_t swap, swap_entry;
> struct swap_info_struct *si;
> struct folio *folio = NULL;
> bool skip_swapcache = false;
> - swp_entry_t swap;
> + pgoff_t offset;
>
> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> - swap = radix_to_swp_entry(*foliop);
> + swap_entry = radix_to_swp_entry(*foliop);
> *foliop = NULL;
>
> - if (is_poisoned_swp_entry(swap))
> + if (is_poisoned_swp_entry(swap_entry))
> return -EIO;
>
> - si = get_swap_device(swap);
> - order = shmem_swap_check_entry(mapping, index, swap);
> + si = get_swap_device(swap_entry);
> + order = shmem_swap_check_entry(mapping, index, swap_entry);
> if (unlikely(!si)) {
> if (order < 0)
> return -EEXIST;
> @@ -2303,7 +2302,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> return -EEXIST;
> }
>
> - /* Look it up and read it in.. */
> + /* @index may points to the middle of a large entry, get the real swap value first */
> + offset = index - round_down(index, 1 << order);
> + swap.val = swap_entry.val + offset;
> folio = swap_cache_get_folio(swap, NULL, 0);
> if (!folio) {
> /* Or update major stats only when swapin succeeds?? */
> @@ -2315,7 +2316,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> /* Try direct mTHP swapin bypassing swap cache and readahead */
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> swap_order = order;
> - folio = shmem_swapin_direct(inode, vma, index,
> + folio = shmem_swapin_direct(inode, vma, index, swap_entry,
> swap, &swap_order, gfp);
> if (!IS_ERR(folio)) {
> skip_swapcache = true;
> @@ -2338,28 +2339,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> }
> }
> alloced:
> + swap_order = folio_order(folio);
> + nr_pages = folio_nr_pages(folio);
> +
> + /* The swap-in should cover both @swap and @index */
> + swap.val = round_down(swap.val, nr_pages);
> + VM_WARN_ON_ONCE(swap.val > swap_entry.val + offset);
> > + VM_WARN_ON_ONCE(swap.val + nr_pages <= swap_entry.val + offset);
> > +
> /*
> * We need to split an existing large entry if swapin brought in a
> * smaller folio due to various of reasons.
> - *
> - * And worth noting there is a special case: if there is a smaller
> - * cached folio that covers @swap, but not @index (it only covers
> - * first few sub entries of the large entry, but @index points to
> - * later parts), the swap cache lookup will still see this folio,
> - * And we need to split the large entry here. Later checks will fail,
> - * as it can't satisfy the swap requirement, and we will retry
> - * the swapin from beginning.
> */
> - swap_order = folio_order(folio);
> + index = round_down(index, nr_pages);
> if (order > swap_order) {
> - error = shmem_split_swap_entry(inode, index, swap, gfp);
> + error = shmem_split_swap_entry(inode, index, swap_entry, gfp);
> if (error)
> goto failed_nolock;
> }
>
> - index = round_down(index, 1 << swap_order);
> - swap.val = round_down(swap.val, 1 << swap_order);
> -
> /* We have to do this with folio locked to prevent races */
> folio_lock(folio);
> if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> @@ -2372,7 +2370,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> goto failed;
> }
> folio_wait_writeback(folio);
> - nr_pages = folio_nr_pages(folio);
>
> /*
> * Some architectures may have to restore extra metadata to the
>
The patch looks good to me, just some small suggestions.
I think the names "swap" and "swap_entry" are not good enough. Maybe something
like "index_entry" and "align_entry" would be cleaner.
Besides, since we pass "swap" and "order" already, we can calculate swap_entry
easily and the code will be easier to understand.
Not a big deal anyway, so:
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
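
A rough sketch of the recalculation suggested here (hypothetical userspace
code with a made-up value; round_down is a local stand-in for the kernel
macro, and the type bits encoded in a real swp_entry_t are ignored):

#include <stdio.h>

/* Stand-in for the kernel's round_down() for power-of-two sizes. */
#define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

int main(void)
{
	/* Given the index-adjusted entry ("swap") and the large entry's
	 * order, the entry's start ("swap_entry", or "index_entry" with the
	 * naming suggested above) is just a round_down away. */
	unsigned long swap_val = 0x103;	/* made-up: swap_entry.val + offset */
	unsigned int order = 4;		/* order of the large swap entry */

	printf("index_entry.val = 0x%lx\n",
	       round_down(swap_val, 1UL << order));	/* prints 0x100 */
	return 0;
}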
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin
2025-06-18 3:07 ` Kairui Song
@ 2025-06-19 1:30 ` Kemeng Shi
0 siblings, 0 replies; 22+ messages in thread
From: Kemeng Shi @ 2025-06-19 1:30 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang,
Matthew Wilcox, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel
on 6/18/2025 11:07 AM, Kairui Song wrote:
> On Wed, Jun 18, 2025 at 10:49 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>> on 6/18/2025 2:35 AM, Kairui Song wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> Currently shmem calls xa_get_order to get the swap radix entry order,
>>> requiring a full tree walk. This can be easily combined with the swap
>>> entry value checking (shmem_confirm_swap) to avoid the duplicated
>>> lookup, which should improve the performance.
>>>
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>> mm/shmem.c | 33 ++++++++++++++++++++++++---------
>>> 1 file changed, 24 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 4e7ef343a29b..0ad49e57f736 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -505,15 +505,27 @@ static int shmem_replace_entry(struct address_space *mapping,
>>>
>>> /*
>>> * Sometimes, before we decide whether to proceed or to fail, we must check
>>> - * that an entry was not already brought back from swap by a racing thread.
>>> + * that an entry was not already brought back or split by a racing thread.
>>> *
>>> * Checking folio is not enough: by the time a swapcache folio is locked, it
>>> * might be reused, and again be swapcache, using the same swap as before.
>>> + * Returns the swap entry's order if it still presents, else returns -1.
>>> */
>>> -static bool shmem_confirm_swap(struct address_space *mapping,
>>> - pgoff_t index, swp_entry_t swap)
>>> +static int shmem_swap_check_entry(struct address_space *mapping, pgoff_t index,
>>> + swp_entry_t swap)
>>> {
>>> - return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
>>> + XA_STATE(xas, &mapping->i_pages, index);
>>> + int ret = -1;
>>> + void *entry;
>>> +
>>> + rcu_read_lock();
>>> + do {
>>> + entry = xas_load(&xas);
>>> + if (entry == swp_to_radix_entry(swap))
>>> + ret = xas_get_order(&xas);
>>> + } while (xas_retry(&xas, entry));
>>> + rcu_read_unlock();
>>> + return ret;
>>> }
>>>
>>> /*
>>> @@ -2256,16 +2268,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>> return -EIO;
>>>
>>> si = get_swap_device(swap);
>>> - if (!si) {
>>> - if (!shmem_confirm_swap(mapping, index, swap))
>>> + order = shmem_swap_check_entry(mapping, index, swap);
>>> + if (unlikely(!si)) {
>>> + if (order < 0)
>>> return -EEXIST;
>>> else
>>> return -EINVAL;
>>> }
>>> + if (unlikely(order < 0)) {
>>> + put_swap_device(si);
>>> + return -EEXIST;
>>> + }
>> Can we re-arrange the code block as following:
>> order = shmem_swap_check_entry(mapping, index, swap);
>> if (unlikely(order < 0))
>> return -EEXIST;
>>
>> si = get_swap_device(swap);
>> if (!si) {
>> return -EINVAL;
>> ...
>
> Hi, thanks for the suggestion.
>
> This may lead to a trivially higher chance of getting -EINVAL when it
> should return -EEXIST, leading to user space errors.
>
> For example, if this CPU gets interrupted after `order =
> shmem_swap_check_entry(mapping, index, swap);`, another CPU could
> swapoff the device. Next, we get `si = NULL` here, but the entry has
> already been swapped in, so it should return -EEXIST, not -EINVAL.
>
> The chance is really low so it's kind of trivial. We could do a `goto
> failed` if we get (!si) here, but it would make the logic under `failed:`
> more complex. So I'd prefer not to change the original behaviour,
> which looks more correct.
>
Right, thanks for the explanation.
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/4] mm/shmem, swap: improve mthp swapin process
2025-06-18 8:46 ` Kairui Song
@ 2025-06-19 1:32 ` Kemeng Shi
0 siblings, 0 replies; 22+ messages in thread
From: Kemeng Shi @ 2025-06-19 1:32 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang,
Matthew Wilcox, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel
on 6/18/2025 4:46 PM, Kairui Song wrote:
> On Wed, Jun 18, 2025 at 4:26 PM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>> on 6/18/2025 2:35 AM, Kairui Song wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> Tidy up the mTHP swapin workflow. There should be no feature change, but
>>> consolidates the mTHP related check to one place so they are now all
>>> wrapped by CONFIG_TRANSPARENT_HUGEPAGE, and will be trimmed off by
>>> compiler if not needed.
>>>
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>> mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
>>> 1 file changed, 78 insertions(+), 97 deletions(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>
>> ...
>>
>> Hello, here is another potential issue if shmem swapin can race with a folio
>> split.
>>
>>> alloced:
>>> + /*
>>> + * We need to split an existing large entry if swapin brought in a
>>> + * smaller folio due to various of reasons.
>>> + *
>>> + * And worth noting there is a special case: if there is a smaller
>>> + * cached folio that covers @swap, but not @index (it only covers
>>> + * first few sub entries of the large entry, but @index points to
>>> + * later parts), the swap cache lookup will still see this folio,
>>> + * And we need to split the large entry here. Later checks will fail,
>>> + * as it can't satisfy the swap requirement, and we will retry
>>> + * the swapin from beginning.
>>> + */
>>> + swap_order = folio_order(folio);
>>> + if (order > swap_order) {
>>> + error = shmem_split_swap_entry(inode, index, swap, gfp);
>>> + if (error)
>>> + goto failed_nolock;
>>> + }
>>> +
>>> + index = round_down(index, 1 << swap_order);
>>> + swap.val = round_down(swap.val, 1 << swap_order);
>>> +
>> /* suppose the folio is split here */
>>> /* We have to do this with folio locked to prevent races */
>>> folio_lock(folio);
>>> if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
>>> folio->swap.val != swap.val) {
>>> error = -EEXIST;
>>> - goto unlock;
>>> + goto failed_unlock;
>>> }
>>> if (!folio_test_uptodate(folio)) {
>>> error = -EIO;
>>> @@ -2407,8 +2386,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>> goto failed;
>>> }
>>>
>>> - error = shmem_add_to_page_cache(folio, mapping,
>>> - round_down(index, nr_pages),
>>> + error = shmem_add_to_page_cache(folio, mapping, index,
>>> swp_to_radix_entry(swap), gfp);
>>
>> The actual swapin order is then less than swap_order, and the swapped-in
>> folio may not cover the index from the caller.
>>
>> So we should move the index and swap.val calculation to after the folio is
>> locked.
>
> Hi, Thanks very much for checking the code carefully!
>
> If I'm not wrong here, holding a reference is enough to stabilize the folio
> order.
> See split_huge_page_to_list_to_order, "Any unexpected folio references
> ... -EAGAIN" and can_split_folio.
Thanks for the feedback; the change looks good to me then.
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
>
> We can add a `swap_order == folio_order(folio)` check after folio lock
> though, as a (sanity) check, just in case.
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/4] mm/shmem, swap: avoid false positive swap cache lookup
2025-06-19 1:28 ` Kemeng Shi
@ 2025-06-19 17:37 ` Kairui Song
0 siblings, 0 replies; 22+ messages in thread
From: Kairui Song @ 2025-06-19 17:37 UTC (permalink / raw)
To: Kemeng Shi
Cc: linux-mm, Andrew Morton, Hugh Dickins, Baolin Wang,
Matthew Wilcox, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel
On Thu, Jun 19, 2025 at 9:28 AM Kemeng Shi <shikemeng@huaweicloud.com> wrote:
>
>
>
> on 6/18/2025 2:35 AM, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > If the shmem read request's index points to the middle of a large swap
> > entry, shmem swap in does the swap cache lookup use the large swap
> > entry's starting value (the first sub swap entry of this large entry).
> > This will lead to false positive lookup result if only the first few
> > swap entries are cached, but the requested swap entry pointed by index
> > is uncached.
> >
> > Currently shmem will do a large entry split then retry the swapin from
> > beginning, which is a waste of CPU and fragile. Handle this correctly.
> >
> > Also add some sanity checks to help understand the code and ensure
> > things won't go wrong.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/shmem.c | 61 ++++++++++++++++++++++++++----------------------------
> > 1 file changed, 29 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 46dea2fa1b43..0bc30dafad90 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1977,12 +1977,12 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> >
> > static struct folio *shmem_swapin_direct(struct inode *inode,
> > struct vm_area_struct *vma, pgoff_t index,
> > - swp_entry_t entry, int *order, gfp_t gfp)
> > + swp_entry_t swap_entry, swp_entry_t swap,
> > + int *order, gfp_t gfp)
> > {
> > struct shmem_inode_info *info = SHMEM_I(inode);
> > int nr_pages = 1 << *order;
> > struct folio *new;
> > - pgoff_t offset;
> > void *shadow;
> >
> > /*
> > @@ -2003,13 +2003,11 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
> > */
> > if ((vma && userfaultfd_armed(vma)) ||
> > !zswap_never_enabled() ||
> > - non_swapcache_batch(entry, nr_pages) != nr_pages) {
> > - offset = index - round_down(index, nr_pages);
> > - entry = swp_entry(swp_type(entry),
> > - swp_offset(entry) + offset);
> > + non_swapcache_batch(swap_entry, nr_pages) != nr_pages) {
> > *order = 0;
> > nr_pages = 1;
> > } else {
> > + swap.val = swap_entry.val;
> > gfp_t huge_gfp = vma_thp_gfp_mask(vma);
> >
> > gfp = limit_gfp_mask(huge_gfp, gfp);
> > @@ -2021,7 +2019,7 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
> > return ERR_PTR(-ENOMEM);
> >
> > if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
> > - gfp, entry)) {
> > + gfp, swap)) {
> > folio_put(new);
> > return ERR_PTR(-ENOMEM);
> > }
> > @@ -2036,17 +2034,17 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
> > * In this case, shmem_add_to_page_cache() will help identify the
> > * concurrent swapin and return -EEXIST.
> > */
> > - if (swapcache_prepare(entry, nr_pages)) {
> > + if (swapcache_prepare(swap, nr_pages)) {
> > folio_put(new);
> > return ERR_PTR(-EEXIST);
> > }
> >
> > __folio_set_locked(new);
> > __folio_set_swapbacked(new);
> > - new->swap = entry;
> > + new->swap = swap;
> >
> > - memcg1_swapin(entry, nr_pages);
> > - shadow = get_shadow_from_swap_cache(entry);
> > + memcg1_swapin(swap, nr_pages);
> > + shadow = get_shadow_from_swap_cache(swap);
> > if (shadow)
> > workingset_refault(new, shadow);
> > folio_add_lru(new);
> > @@ -2278,20 +2276,21 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> > struct shmem_inode_info *info = SHMEM_I(inode);
> > int error, nr_pages, order, swap_order;
> > + swp_entry_t swap, swap_entry;
> > struct swap_info_struct *si;
> > struct folio *folio = NULL;
> > bool skip_swapcache = false;
> > - swp_entry_t swap;
> > + pgoff_t offset;
> >
> > VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> > - swap = radix_to_swp_entry(*foliop);
> > + swap_entry = radix_to_swp_entry(*foliop);
> > *foliop = NULL;
> >
> > - if (is_poisoned_swp_entry(swap))
> > + if (is_poisoned_swp_entry(swap_entry))
> > return -EIO;
> >
> > - si = get_swap_device(swap);
> > - order = shmem_swap_check_entry(mapping, index, swap);
> > + si = get_swap_device(swap_entry);
> > + order = shmem_swap_check_entry(mapping, index, swap_entry);
> > if (unlikely(!si)) {
> > if (order < 0)
> > return -EEXIST;
> > @@ -2303,7 +2302,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > return -EEXIST;
> > }
> >
> > - /* Look it up and read it in.. */
> > + /* @index may points to the middle of a large entry, get the real swap value first */
> > + offset = index - round_down(index, 1 << order);
> > + swap.val = swap_entry.val + offset;
> > folio = swap_cache_get_folio(swap, NULL, 0);
> > if (!folio) {
> > /* Or update major stats only when swapin succeeds?? */
> > @@ -2315,7 +2316,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > /* Try direct mTHP swapin bypassing swap cache and readahead */
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > swap_order = order;
> > - folio = shmem_swapin_direct(inode, vma, index,
> > + folio = shmem_swapin_direct(inode, vma, index, swap_entry,
> > swap, &swap_order, gfp);
> > if (!IS_ERR(folio)) {
> > skip_swapcache = true;
> > @@ -2338,28 +2339,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > }
> > }
> > alloced:
> > + swap_order = folio_order(folio);
> > + nr_pages = folio_nr_pages(folio);
> > +
> > + /* The swap-in should cover both @swap and @index */
> > + swap.val = round_down(swap.val, nr_pages);
> > + VM_WARN_ON_ONCE(swap.val > swap_entry.val + offset);
> > > + VM_WARN_ON_ONCE(swap.val + nr_pages <= swap_entry.val + offset);
> > > +
> > /*
> > * We need to split an existing large entry if swapin brought in a
> > * smaller folio due to various of reasons.
> > - *
> > - * And worth noting there is a special case: if there is a smaller
> > - * cached folio that covers @swap, but not @index (it only covers
> > - * first few sub entries of the large entry, but @index points to
> > - * later parts), the swap cache lookup will still see this folio,
> > - * And we need to split the large entry here. Later checks will fail,
> > - * as it can't satisfy the swap requirement, and we will retry
> > - * the swapin from beginning.
> > */
> > - swap_order = folio_order(folio);
> > + index = round_down(index, nr_pages);
> > if (order > swap_order) {
> > - error = shmem_split_swap_entry(inode, index, swap, gfp);
> > + error = shmem_split_swap_entry(inode, index, swap_entry, gfp);
> > if (error)
> > goto failed_nolock;
> > }
> >
> > - index = round_down(index, 1 << swap_order);
> > - swap.val = round_down(swap.val, 1 << swap_order);
> > -
> > /* We have to do this with folio locked to prevent races */
> > folio_lock(folio);
> > if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> > @@ -2372,7 +2370,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > goto failed;
> > }
> > folio_wait_writeback(folio);
> > - nr_pages = folio_nr_pages(folio);
> >
> > /*
> > * Some architectures may have to restore extra metadata to the
> >
> The patch looks good to me, just some small suggestions.
> I think the names "swap" and "swap_entry" are not good enough. Maybe something
> like "index_entry" and "align_entry" would be cleaner.
Thanks, very good suggestion; I prefer index_entry then.
> Besides we pass "swap" and "order" already, we can calculate swap_entry easily
> and the code will be more easy to understand.
True. I'm not sure if the compiler is smart enough to avoid a
round_down here; the inlined function can be optimized better when the
parameters are passed in directly.
> Not a big deal anyway, so:
> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
>
Thanks again!
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2025-06-19 17:38 UTC | newest]
Thread overview: 22+ messages
2025-06-17 18:34 [PATCH 0/4] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
2025-06-17 18:35 ` [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
2025-06-17 22:58 ` Andrew Morton
2025-06-18 2:11 ` Kairui Song
2025-06-18 2:08 ` Kemeng Shi
2025-06-17 18:35 ` [PATCH 2/4] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
2025-06-18 2:48 ` Kemeng Shi
2025-06-18 3:07 ` Kairui Song
2025-06-19 1:30 ` Kemeng Shi
2025-06-18 7:16 ` Dev Jain
2025-06-18 7:22 ` Kairui Song
2025-06-18 7:29 ` Dev Jain
2025-06-17 18:35 ` [PATCH 3/4] mm/shmem, swap: improve mthp swapin process Kairui Song
2025-06-18 6:27 ` Kemeng Shi
2025-06-18 6:50 ` Kairui Song
2025-06-18 8:08 ` Kemeng Shi
2025-06-18 8:26 ` Kemeng Shi
2025-06-18 8:46 ` Kairui Song
2025-06-19 1:32 ` Kemeng Shi
2025-06-17 18:35 ` [PATCH 4/4] mm/shmem, swap: avoid false positive swap cache lookup Kairui Song
2025-06-19 1:28 ` Kemeng Shi
2025-06-19 17:37 ` Kairui Song