* [PATCH v3 0/7] mm/shmem, swap: bugfix and improvement of mTHP swap in
@ 2025-06-27 6:20 Kairui Song
2025-06-27 6:20 ` [PATCH v3 1/7] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
` (6 more replies)
0 siblings, 7 replies; 23+ messages in thread
From: Kairui Song @ 2025-06-27 6:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
The current mTHP swapin path has some problems. It may potentially hang,
may cause redundant faults due to a false positive swap cache lookup, and
it involves at least 4 Xarray tree walks (get order, get order again,
confirm swap, insert folio). For !CONFIG_TRANSPARENT_HUGEPAGE builds, it
still performs some mTHP related checks.
This series fixes all of the mentioned issues, and the code should be more
robust and better prepared for the swap table series. Tree walks are now
reduced to two (get order & confirm, insert folio), and more sanity checks
and comments have been added. The !CONFIG_TRANSPARENT_HUGEPAGE build
overhead is also minimized, and it now comes with a sanity check.
Performance is slightly better after this series. Sequential swap in of
24G of data from ZRAM, using transparent_hugepage_tmpfs=always (24 samples each):
Before: 11.17, Standard Deviation: 0.02
After patch 1: 10.89, Standard Deviation: 0.05
After patch 2: 10.84, Standard Deviation: 0.03
After patch 3: 10.91, Standard Deviation: 0.03
After patch 4: 10.86, Standard Deviation: 0.03
After patch 5: 10.07, Standard Deviation: 0.04
After patch 7: 10.09, Standard Deviation: 0.03
Each patch improves performance a little, adding up to roughly 10% faster
in total.
A kernel build test showed a very slight improvement, testing with make -j24
and defconfig in a 256M memcg, also using ZRAM as swap and
transparent_hugepage_tmpfs=always (6 test runs):
Before: system time avg: 3911.80s
After: system time avg: 3863.76s
---
V2: https://lore.kernel.org/linux-mm/20250619175538.15799-1-ryncsn@gmail.com/
Updates:
- Split the clean up patch into 3 individual patches [ Baolin Wang ]
- Fix a code error in the first patch [ Baolin Wang ]
- I found some other remaining issues that can be fixed easily after the
  cleanups, so these are included too: fix the major fault counter, and
  clean up the goto labels.
V1: https://lore.kernel.org/linux-mm/20250617183503.10527-1-ryncsn@gmail.com/
Updates:
- Improve function and variable names, as well as commit messages [ Kemeng Shi,
  Dev Jain ]
- Correct Fixes: tag [ Andrew Morton ]
- Collect Reviewed-by.
Two of the patches in this series come from the swap table series [1].
Worth noting, the performance gain of this series is independent of the
swap table series; we'll see another, bigger performance gain and a
reduction in memory usage after the swap table series.
I found these issues while trying to split the shmem changes out of the
swap table series for easier reviewing, and found several more while doing
stress tests for performance comparison. Barry also mentioned that the
CONFIG_TRANSPARENT_HUGEPAGE checks may be redundant [2], and I managed to
clean them up properly too.
No issues were found after a few days of stress testing.
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [1]
Link: https://lore.kernel.org/linux-mm/CAMgjq7AsKFz7UN+seR5atznE_RBTDC9qjDmwN5saMe+KL3b1mQ@mail.gmail.com/ [2]
Kairui Song (7):
mm/shmem, swap: improve cached mTHP handling and fix potential hung
mm/shmem, swap: avoid redundant Xarray lookup during swapin
mm/shmem, swap: tidy up THP swapin checks
mm/shmem, swap: clean up swap entry splitting
mm/shmem, swap: never use swap cache and readahead for
SWP_SYNCHRONOUS_IO
mm/shmem, swap: fix major fault counting
mm/shmem, swap: avoid false positive swap cache lookup
mm/shmem.c | 283 ++++++++++++++++++++++++++++-------------------------
1 file changed, 149 insertions(+), 134 deletions(-)
--
2.50.0
* [PATCH v3 1/7] mm/shmem, swap: improve cached mTHP handling and fix potential hung
2025-06-27 6:20 [PATCH v3 0/7] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
@ 2025-06-27 6:20 ` Kairui Song
2025-06-30 3:44 ` Baolin Wang
2025-06-27 6:20 ` [PATCH v3 2/7] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
` (5 subsequent siblings)
6 siblings, 1 reply; 23+ messages in thread
From: Kairui Song @ 2025-06-27 6:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song, stable
From: Kairui Song <kasong@tencent.com>
The current swap-in code assumes that, when a swap entry in the shmem
mapping is order 0, its cached folio (if present) must be order 0 too,
which turns out to not always be correct.
The problem is that shmem_split_large_entry is called before verifying that
the folio will eventually be swapped in. One possible race is:
CPU1                                  CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
folio = swap_cache_get_folio
/* folio = NULL */
order = xa_get_order
/* order > 0 */
folio = shmem_swap_alloc_folio
/* mTHP alloc failure, folio = NULL */
<... Interrupted ...>
                                      shmem_swapin_folio
                                      /* S1 is swapped in */
                                      shmem_writeout
                                      /* S1 is swapped out, folio cached */
shmem_split_large_entry(..., S1)
/* S1 is split, but the folio covering it has order > 0 now */
Now any following swapin of S1 will hang: `xa_get_order` returns 0, while
the folio lookup returns a folio with order > 0, so the
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check always
trips, causing swap-in to keep failing with -EEXIST.
This also looks fragile. So fix it up by allowing a larger folio to be
seen in the swap cache, and checking that the whole shmem mapping range
covered by the swapin has the right swap value upon inserting the folio.
Also drop the redundant tree walks before the insertion.
This actually improves performance, as it avoids two redundant Xarray tree
walks in the hot path. The only side effect is that, in the failure path,
shmem may redundantly reallocate a few folios, causing temporary slight
memory pressure.
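To make the new insertion-time check concrete, below is a minimal
userspace sketch (not kernel code; the entry orders and values are made-up
example data) of the invariant the xas_for_each_conflict() loop enforces:
every conflicting entry in the inserted range must hold the expected swap
value for its slot, advancing by 1 << order per entry, and the walk must
end exactly nr slots past the base value.

#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned long swap = 0x4000;	/* expected base swap value */
	unsigned long nr = 16;		/* inserting an order-4 folio */
	unsigned long iter = swap;

	/* Pretend the range holds one order-2 entry followed by order-0 ones. */
	int orders[] = { 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
	for (unsigned int i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
		/* the kernel checks the entry at this slot equals iter here */
		iter += 1UL << orders[i];
	}

	/* 4 + 12 = 16 slots covered: the range is fully and exactly filled. */
	assert(iter - nr == swap);
	printf("range check passed: covered %lu slots\n", iter - swap);
	return 0;
}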
Worth noting, it may seem that the order and value check before inserting
helps reduce lock contention, but that is not true. The swap cache layer
ensures a raced swapin will either see a swap cache folio or fail to swap
in (we have the SWAP_HAS_CACHE bit even if the swap cache is bypassed), so
holding the folio lock and checking the folio flag is already good enough
to avoid lock contention. The chance that a folio passes the swap entry
value check while the shmem mapping slot has changed should be very low.
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: <stable@vger.kernel.org>
---
mm/shmem.c | 30 +++++++++++++++++++++---------
1 file changed, 21 insertions(+), 9 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 334b7b4a61a0..e3c9a1365ff4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
pgoff_t index, void *expected, gfp_t gfp)
{
XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
- long nr = folio_nr_pages(folio);
+ unsigned long nr = folio_nr_pages(folio);
+ swp_entry_t iter, swap;
+ void *entry;
VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
gfp &= GFP_RECLAIM_MASK;
folio_throttle_swaprate(folio, gfp);
+ swap = iter = radix_to_swp_entry(expected);
do {
xas_lock_irq(&xas);
- if (expected != xas_find_conflict(&xas)) {
- xas_set_err(&xas, -EEXIST);
- goto unlock;
+ xas_for_each_conflict(&xas, entry) {
+ /*
+ * The range must either be empty, or filled with
+ * expected swap entries. Shmem swap entries are never
+ * partially freed without split of both entry and
+ * folio, so there shouldn't be any holes.
+ */
+ if (!expected || entry != swp_to_radix_entry(iter)) {
+ xas_set_err(&xas, -EEXIST);
+ goto unlock;
+ }
+ iter.val += 1 << xas_get_order(&xas);
}
- if (expected && xas_find_conflict(&xas)) {
+ if (expected && iter.val - nr != swap.val) {
xas_set_err(&xas, -EEXIST);
goto unlock;
}
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
error = -ENOMEM;
goto failed;
}
- } else if (order != folio_order(folio)) {
+ } else if (order > folio_order(folio)) {
/*
* Swap readahead may swap in order 0 folios into swapcache
* asynchronously, while the shmem mapping can still stores
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
}
+ } else if (order < folio_order(folio)) {
+ swap.val = round_down(swap.val, 1 << folio_order(folio));
}
alloced:
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
- folio->swap.val != swap.val ||
- !shmem_confirm_swap(mapping, index, swap) ||
- xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+ folio->swap.val != swap.val) {
error = -EEXIST;
goto unlock;
}
--
2.50.0
* [PATCH v3 2/7] mm/shmem, swap: avoid redundant Xarray lookup during swapin
2025-06-27 6:20 [PATCH v3 0/7] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
2025-06-27 6:20 ` [PATCH v3 1/7] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
@ 2025-06-27 6:20 ` Kairui Song
2025-06-27 6:20 ` [PATCH v3 3/7] mm/shmem, swap: tidy up THP swapin checks Kairui Song
` (4 subsequent siblings)
6 siblings, 0 replies; 23+ messages in thread
From: Kairui Song @ 2025-06-27 6:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song, Dev Jain
From: Kairui Song <kasong@tencent.com>
Currently shmem calls xa_get_order to get the swap radix entry order,
requiring a full tree walk. This can easily be combined with the swap entry
value check (shmem_confirm_swap) to avoid the duplicated lookup, which
should improve performance.
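As an illustration of the intended contract, here is a hedged userspace
sketch (not the kernel implementation; the toy array below merely stands
in for the XArray walk): a single lookup both confirms the stored value
matches the expected swap entry and reports that entry's order, returning
-1 on a mismatch.

#include <stdio.h>

struct toy_slot {
	unsigned long val;	/* stored swap value (base of the entry) */
	int order;		/* order of the stored entry */
};

static int toy_confirm_swap(const struct toy_slot *slot, unsigned long expected)
{
	if (slot->val != expected)
		return -1;	/* raced: the entry was replaced or split */
	return slot->order;	/* one walk yields both confirmation and order */
}

int main(void)
{
	struct toy_slot slot = { .val = 0x4000, .order = 4 };

	printf("match: %d\n", toy_confirm_swap(&slot, 0x4000));	/* 4 */
	printf("raced: %d\n", toy_confirm_swap(&slot, 0x5000));	/* -1 */
	return 0;
}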
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/shmem.c | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index e3c9a1365ff4..033dc7a3435d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -505,15 +505,27 @@ static int shmem_replace_entry(struct address_space *mapping,
/*
* Sometimes, before we decide whether to proceed or to fail, we must check
- * that an entry was not already brought back from swap by a racing thread.
+ * that an entry was not already brought back or split by a racing thread.
*
* Checking folio is not enough: by the time a swapcache folio is locked, it
* might be reused, and again be swapcache, using the same swap as before.
+ * Returns the swap entry's order if it still presents, else returns -1.
*/
-static bool shmem_confirm_swap(struct address_space *mapping,
- pgoff_t index, swp_entry_t swap)
+static int shmem_confirm_swap(struct address_space *mapping, pgoff_t index,
+ swp_entry_t swap)
{
- return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
+ XA_STATE(xas, &mapping->i_pages, index);
+ int ret = -1;
+ void *entry;
+
+ rcu_read_lock();
+ do {
+ entry = xas_load(&xas);
+ if (entry == swp_to_radix_entry(swap))
+ ret = xas_get_order(&xas);
+ } while (xas_retry(&xas, entry));
+ rcu_read_unlock();
+ return ret;
}
/*
@@ -2256,16 +2268,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
return -EIO;
si = get_swap_device(swap);
- if (!si) {
- if (!shmem_confirm_swap(mapping, index, swap))
+ order = shmem_confirm_swap(mapping, index, swap);
+ if (unlikely(!si)) {
+ if (order < 0)
return -EEXIST;
else
return -EINVAL;
}
+ if (unlikely(order < 0)) {
+ put_swap_device(si);
+ return -EEXIST;
+ }
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap, NULL, 0);
- order = xa_get_order(&mapping->i_pages, index);
if (!folio) {
int nr_pages = 1 << order;
bool fallback_order0 = false;
@@ -2415,7 +2431,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
*foliop = folio;
return 0;
failed:
- if (!shmem_confirm_swap(mapping, index, swap))
+ if (shmem_confirm_swap(mapping, index, swap) < 0)
error = -EEXIST;
if (error == -EIO)
shmem_set_folio_swapin_error(inode, index, folio, swap,
@@ -2428,7 +2444,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio_put(folio);
}
put_swap_device(si);
-
return error;
}
--
2.50.0
* [PATCH v3 3/7] mm/shmem, swap: tidy up THP swapin checks
2025-06-27 6:20 [PATCH v3 0/7] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
2025-06-27 6:20 ` [PATCH v3 1/7] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
2025-06-27 6:20 ` [PATCH v3 2/7] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
@ 2025-06-27 6:20 ` Kairui Song
2025-06-30 4:47 ` Baolin Wang
2025-06-27 6:20 ` [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting Kairui Song
` (3 subsequent siblings)
6 siblings, 1 reply; 23+ messages in thread
From: Kairui Song @ 2025-06-27 6:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Move all THP swapin related checks under CONFIG_TRANSPARENT_HUGEPAGE,
so they will be trimmed off by the compiler if not needed.
Also add a WARN if shmem sees an order > 0 entry when
CONFIG_TRANSPARENT_HUGEPAGE is disabled; that should never happen unless
things went very wrong.
There should be no observable feature change except the newly added WARN.
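For illustration, a hedged userspace sketch of the compile-time trimming
pattern this relies on (IS_ENABLED_EXAMPLE and CONFIG_THP_EXAMPLE are
made-up stand-ins, not the kernel's macros): with the config constant
folded to 0, the compiler can drop the THP-only branch entirely, leaving
only the cheap order check and WARN path.

#include <stdio.h>

#define CONFIG_THP_EXAMPLE 0		/* pretend THP is compiled out */
#define IS_ENABLED_EXAMPLE(cfg) (cfg)	/* stand-in for IS_ENABLED() */

static int example_swapin_alloc(int order)
{
	if (!IS_ENABLED_EXAMPLE(CONFIG_THP_EXAMPLE)) {
		if (order)		/* WARN_ON_ONCE(order) in the real code */
			return -22;	/* -EINVAL: should never see a large entry */
	} else if (order) {
		/*
		 * THP-only checks (uffd, zswap, swap cache batching) would
		 * live here; this branch is discarded when the config
		 * constant is 0.
		 */
	}
	return 0;			/* proceed with the allocation */
}

int main(void)
{
	printf("order 0 -> %d, order 4 -> %d\n",
	       example_swapin_alloc(0), example_swapin_alloc(4));
	return 0;
}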
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 42 ++++++++++++++++++++----------------------
1 file changed, 20 insertions(+), 22 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 033dc7a3435d..f85a985167c5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1980,26 +1980,39 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
swp_entry_t entry, int order, gfp_t gfp)
{
struct shmem_inode_info *info = SHMEM_I(inode);
+ int nr_pages = 1 << order;
struct folio *new;
void *shadow;
- int nr_pages;
/*
* We have arrived here because our zones are constrained, so don't
* limit chance of success with further cpuset and node constraints.
*/
gfp &= ~GFP_CONSTRAINT_MASK;
- if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && order > 0) {
- gfp_t huge_gfp = vma_thp_gfp_mask(vma);
-
- gfp = limit_gfp_mask(huge_gfp, gfp);
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+ if (WARN_ON_ONCE(order))
+ return ERR_PTR(-EINVAL);
+ } else if (order) {
+ /*
+ * If uffd is active for the vma, we need per-page fault
+ * fidelity to maintain the uffd semantics, then fallback
+ * to swapin order-0 folio, as well as for zswap case.
+ * Any existing sub folio in the swap cache also blocks
+ * mTHP swapin.
+ */
+ if ((vma && unlikely(userfaultfd_armed(vma))) ||
+ !zswap_never_enabled() ||
+ non_swapcache_batch(entry, nr_pages) != nr_pages) {
+ return ERR_PTR(-EINVAL);
+ } else {
+ gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
+ }
}
new = shmem_alloc_folio(gfp, order, info, index);
if (!new)
return ERR_PTR(-ENOMEM);
- nr_pages = folio_nr_pages(new);
if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
gfp, entry)) {
folio_put(new);
@@ -2283,9 +2296,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap, NULL, 0);
if (!folio) {
- int nr_pages = 1 << order;
- bool fallback_order0 = false;
-
/* Or update major stats only when swapin succeeds?? */
if (fault_type) {
*fault_type |= VM_FAULT_MAJOR;
@@ -2293,20 +2303,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
- /*
- * If uffd is active for the vma, we need per-page fault
- * fidelity to maintain the uffd semantics, then fallback
- * to swapin order-0 folio, as well as for zswap case.
- * Any existing sub folio in the swap cache also blocks
- * mTHP swapin.
- */
- if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
- !zswap_never_enabled() ||
- non_swapcache_batch(swap, nr_pages) != nr_pages))
- fallback_order0 = true;
-
/* Skip swapcache for synchronous device. */
- if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
if (!IS_ERR(folio)) {
skip_swapcache = true;
--
2.50.0
* [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
2025-06-27 6:20 [PATCH v3 0/7] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
` (2 preceding siblings ...)
2025-06-27 6:20 ` [PATCH v3 3/7] mm/shmem, swap: tidy up THP swapin checks Kairui Song
@ 2025-06-27 6:20 ` Kairui Song
2025-06-30 6:34 ` Baolin Wang
2025-06-27 6:20 ` [PATCH v3 5/7] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO Kairui Song
` (2 subsequent siblings)
6 siblings, 1 reply; 23+ messages in thread
From: Kairui Song @ 2025-06-27 6:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Instead of keeping different paths for splitting the entry and
recalculating the swap entry and index, do it in one place.
Whenever swapin brings in a folio smaller than the entry, split the
entry. Always recalculate the entry and index, since swapin may read
in a folio that's larger than the entry order. This removes duplicated
code and function calls, and makes the code more robust.
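A quick arithmetic sketch of the recalculation being centralized
(userspace code with arbitrary example values; round_down_pow2() is a
stand-in for the kernel's round_down() on power-of-two sizes): both the
index and the swap value are aligned down to the size of the folio that
was actually brought in, so they refer to the folio's first slot.

#include <stdio.h>

#define round_down_pow2(x, n)	((x) & ~((unsigned long)(n) - 1))

int main(void)
{
	unsigned long index = 35;		/* faulting page cache index */
	unsigned long swap_val = 0x4023;	/* swap value stored at that index */
	unsigned int swap_order = 4;		/* order of the swapped-in folio */
	unsigned long nr = 1UL << swap_order;

	index = round_down_pow2(index, nr);		/* 35 -> 32 */
	swap_val = round_down_pow2(swap_val, nr);	/* 0x4023 -> 0x4020 */

	printf("folio covers index %lu.., swap value 0x%lx..\n", index, swap_val);
	return 0;
}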
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 103 +++++++++++++++++++++--------------------------------
1 file changed, 41 insertions(+), 62 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index f85a985167c5..5be9c905396e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2178,8 +2178,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
swap_free_nr(swap, nr_pages);
}
-static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
- swp_entry_t swap, gfp_t gfp)
+/*
+ * Split an existing large swap entry. @index should point to one sub mapping
+ * slot within the entry @swap, this sub slot will be split into order 0.
+ */
+static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
+ swp_entry_t swap, gfp_t gfp)
{
struct address_space *mapping = inode->i_mapping;
XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
@@ -2250,7 +2254,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
if (xas_error(&xas))
return xas_error(&xas);
- return entry_order;
+ return 0;
}
/*
@@ -2267,11 +2271,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct address_space *mapping = inode->i_mapping;
struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
struct shmem_inode_info *info = SHMEM_I(inode);
+ int error, nr_pages, order, swap_order;
struct swap_info_struct *si;
struct folio *folio = NULL;
bool skip_swapcache = false;
swp_entry_t swap;
- int error, nr_pages, order, split_order;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
swap = radix_to_swp_entry(*foliop);
@@ -2321,70 +2325,43 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
- /*
- * Now swap device can only swap in order 0 folio, then we
- * should split the large swap entry stored in the pagecache
- * if necessary.
- */
- split_order = shmem_split_large_entry(inode, index, swap, gfp);
- if (split_order < 0) {
- error = split_order;
- goto failed;
- }
-
- /*
- * If the large swap entry has already been split, it is
- * necessary to recalculate the new swap entry based on
- * the old order alignment.
- */
- if (split_order > 0) {
- pgoff_t offset = index - round_down(index, 1 << split_order);
-
- swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
- }
-
/* Here we actually start the io */
folio = shmem_swapin_cluster(swap, gfp, info, index);
if (!folio) {
error = -ENOMEM;
goto failed;
}
- } else if (order > folio_order(folio)) {
- /*
- * Swap readahead may swap in order 0 folios into swapcache
- * asynchronously, while the shmem mapping can still stores
- * large swap entries. In such cases, we should split the
- * large swap entry to prevent possible data corruption.
- */
- split_order = shmem_split_large_entry(inode, index, swap, gfp);
- if (split_order < 0) {
- folio_put(folio);
- folio = NULL;
- error = split_order;
- goto failed;
- }
-
- /*
- * If the large swap entry has already been split, it is
- * necessary to recalculate the new swap entry based on
- * the old order alignment.
- */
- if (split_order > 0) {
- pgoff_t offset = index - round_down(index, 1 << split_order);
-
- swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
- }
- } else if (order < folio_order(folio)) {
- swap.val = round_down(swap.val, 1 << folio_order(folio));
}
alloced:
+ /*
+ * We need to split an existing large entry if swapin brought in a
+ * smaller folio due to various of reasons.
+ *
+ * And worth noting there is a special case: if there is a smaller
+ * cached folio that covers @swap, but not @index (it only covers
+ * first few sub entries of the large entry, but @index points to
+ * later parts), the swap cache lookup will still see this folio,
+ * And we need to split the large entry here. Later checks will fail,
+ * as it can't satisfy the swap requirement, and we will retry
+ * the swapin from beginning.
+ */
+ swap_order = folio_order(folio);
+ if (order > swap_order) {
+ error = shmem_split_swap_entry(inode, index, swap, gfp);
+ if (error)
+ goto failed_nolock;
+ }
+
+ index = round_down(index, 1 << swap_order);
+ swap.val = round_down(swap.val, 1 << swap_order);
+
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
folio->swap.val != swap.val) {
error = -EEXIST;
- goto unlock;
+ goto failed_unlock;
}
if (!folio_test_uptodate(folio)) {
error = -EIO;
@@ -2405,8 +2382,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
- error = shmem_add_to_page_cache(folio, mapping,
- round_down(index, nr_pages),
+ error = shmem_add_to_page_cache(folio, mapping, index,
swp_to_radix_entry(swap), gfp);
if (error)
goto failed;
@@ -2417,8 +2393,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio_mark_accessed(folio);
if (skip_swapcache) {
+ swapcache_clear(si, folio->swap, folio_nr_pages(folio));
folio->swap.val = 0;
- swapcache_clear(si, swap, nr_pages);
} else {
delete_from_swap_cache(folio);
}
@@ -2434,13 +2410,16 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (error == -EIO)
shmem_set_folio_swapin_error(inode, index, folio, swap,
skip_swapcache);
-unlock:
- if (skip_swapcache)
- swapcache_clear(si, swap, folio_nr_pages(folio));
- if (folio) {
+failed_unlock:
+ if (folio)
folio_unlock(folio);
- folio_put(folio);
+failed_nolock:
+ if (skip_swapcache) {
+ swapcache_clear(si, folio->swap, folio_nr_pages(folio));
+ folio->swap.val = 0;
}
+ if (folio)
+ folio_put(folio);
put_swap_device(si);
return error;
}
--
2.50.0
* [PATCH v3 5/7] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO
2025-06-27 6:20 [PATCH v3 0/7] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
` (3 preceding siblings ...)
2025-06-27 6:20 ` [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting Kairui Song
@ 2025-06-27 6:20 ` Kairui Song
2025-06-30 7:24 ` Baolin Wang
2025-06-27 6:20 ` [PATCH v3 6/7] mm/shmem, swap: fix major fault counting Kairui Song
2025-06-27 6:20 ` [PATCH v3 7/7] mm/shmem, swap: avoid false positive swap cache lookup Kairui Song
6 siblings, 1 reply; 23+ messages in thread
From: Kairui Song @ 2025-06-27 6:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Currently, if THP swapin fails for reasons like a partially conflicting
swap cache or ZSWAP being enabled, it falls back to cached swapin.
Right now the swap cache has a non-trivial overhead, and readahead is
not helpful for SWP_SYNCHRONOUS_IO devices, so we should always skip
the readahead and swap cache even if the swapin falls back to order 0.
So handle the fallback logic without falling back to the cached read.
As a side effect, also slightly tweak the behavior if the WARN_ON is
triggered (shmem mapping is corrupted or buggy code): just return
-EINVAL. This should be OK, as things are already very wrong beyond
recovery at that point.
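Below is a minimal userspace sketch of the fallback flow described above
(not the kernel code; alloc_folio_stub() is a made-up stand-in for the
folio allocation and charge steps): if the high-order attempt fails, the
same direct path retries at order 0 instead of bouncing back to the
cached/readahead swapin.

#include <stdio.h>
#include <stdlib.h>

static void *alloc_folio_stub(int order)
{
	/* Pretend large allocations fail, e.g. under memory pressure. */
	return order ? NULL : malloc(4096);
}

static void *swapin_direct_sketch(int order)
{
	void *folio;
retry:
	folio = alloc_folio_stub(order);
	if (!folio) {
		if (!order)	/* order 0 failed: nothing left to try */
			return NULL;
		order = 0;	/* high order failed: retry at order 0 */
		goto retry;
	}
	return folio;
}

int main(void)
{
	void *folio = swapin_direct_sketch(4);

	printf("swapin %s\n", folio ? "succeeded after fallback" : "failed");
	free(folio);
	return 0;
}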
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 68 ++++++++++++++++++++++++++++++------------------------
1 file changed, 38 insertions(+), 30 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 5be9c905396e..5f2641fd1be7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1975,13 +1975,15 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
return ERR_PTR(error);
}
-static struct folio *shmem_swap_alloc_folio(struct inode *inode,
+static struct folio *shmem_swapin_direct(struct inode *inode,
struct vm_area_struct *vma, pgoff_t index,
swp_entry_t entry, int order, gfp_t gfp)
{
struct shmem_inode_info *info = SHMEM_I(inode);
int nr_pages = 1 << order;
struct folio *new;
+ pgoff_t offset;
+ gfp_t swap_gfp;
void *shadow;
/*
@@ -1989,6 +1991,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
* limit chance of success with further cpuset and node constraints.
*/
gfp &= ~GFP_CONSTRAINT_MASK;
+ swap_gfp = gfp;
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
if (WARN_ON_ONCE(order))
return ERR_PTR(-EINVAL);
@@ -2003,20 +2006,23 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
if ((vma && unlikely(userfaultfd_armed(vma))) ||
!zswap_never_enabled() ||
non_swapcache_batch(entry, nr_pages) != nr_pages) {
- return ERR_PTR(-EINVAL);
+ goto fallback;
} else {
- gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
+ swap_gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
}
}
-
- new = shmem_alloc_folio(gfp, order, info, index);
- if (!new)
- return ERR_PTR(-ENOMEM);
+retry:
+ new = shmem_alloc_folio(swap_gfp, order, info, index);
+ if (!new) {
+ new = ERR_PTR(-ENOMEM);
+ goto fallback;
+ }
if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
- gfp, entry)) {
+ swap_gfp, entry)) {
folio_put(new);
- return ERR_PTR(-ENOMEM);
+ new = ERR_PTR(-ENOMEM);
+ goto fallback;
}
/*
@@ -2045,6 +2051,17 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
folio_add_lru(new);
swap_read_folio(new, NULL);
return new;
+fallback:
+ /* Order 0 swapin failed, nothing to fallback to, abort */
+ if (!order)
+ return new;
+ /* High order swapin failed, fallback to order 0 and retry */
+ order = 0;
+ nr_pages = 1;
+ swap_gfp = gfp;
+ offset = index - round_down(index, nr_pages);
+ entry = swp_entry(swp_type(entry), swp_offset(entry) + offset);
+ goto retry;
}
/*
@@ -2243,7 +2260,6 @@ static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
cur_order = split_order;
split_order = xas_try_split_min_order(split_order);
}
-
unlock:
xas_unlock_irq(&xas);
@@ -2306,34 +2322,26 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
-
- /* Skip swapcache for synchronous device. */
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
- folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
- if (!IS_ERR(folio)) {
+ /* Direct mTHP swapin without swap cache or readahead */
+ folio = shmem_swapin_direct(inode, vma, index,
+ swap, order, gfp);
+ if (IS_ERR(folio)) {
+ error = PTR_ERR(folio);
+ folio = NULL;
+ } else {
skip_swapcache = true;
- goto alloced;
}
-
+ } else {
/*
- * Fallback to swapin order-0 folio unless the swap entry
- * already exists.
+ * Order 0 swapin using swap cache and readahead, it
+ * may return order > 0 folio due to raced swap cache
*/
- error = PTR_ERR(folio);
- folio = NULL;
- if (error == -EEXIST)
- goto failed;
+ folio = shmem_swapin_cluster(swap, gfp, info, index);
}
-
- /* Here we actually start the io */
- folio = shmem_swapin_cluster(swap, gfp, info, index);
- if (!folio) {
- error = -ENOMEM;
+ if (!folio)
goto failed;
- }
}
-
-alloced:
/*
* We need to split an existing large entry if swapin brought in a
* smaller folio due to various of reasons.
--
2.50.0
* [PATCH v3 6/7] mm/shmem, swap: fix major fault counting
2025-06-27 6:20 [PATCH v3 0/7] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
` (4 preceding siblings ...)
2025-06-27 6:20 ` [PATCH v3 5/7] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-06-27 6:20 ` Kairui Song
2025-06-30 7:05 ` Baolin Wang
2025-06-27 6:20 ` [PATCH v3 7/7] mm/shmem, swap: avoid false positive swap cache lookup Kairui Song
6 siblings, 1 reply; 23+ messages in thread
From: Kairui Song @ 2025-06-27 6:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
If the swapin failed, don't update the major fault count. There is a
long-existing comment suggesting it be done this way; now, with the
previous cleanups, we can finally fix it.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 5f2641fd1be7..ea9a105ded5d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2316,12 +2316,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap, NULL, 0);
if (!folio) {
- /* Or update major stats only when swapin succeeds?? */
- if (fault_type) {
- *fault_type |= VM_FAULT_MAJOR;
- count_vm_event(PGMAJFAULT);
- count_memcg_event_mm(fault_mm, PGMAJFAULT);
- }
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
/* Direct mTHP swapin without swap cache or readahead */
folio = shmem_swapin_direct(inode, vma, index,
@@ -2341,6 +2335,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
}
if (!folio)
goto failed;
+ if (fault_type) {
+ *fault_type |= VM_FAULT_MAJOR;
+ count_vm_event(PGMAJFAULT);
+ count_memcg_event_mm(fault_mm, PGMAJFAULT);
+ }
}
/*
* We need to split an existing large entry if swapin brought in a
--
2.50.0
* [PATCH v3 7/7] mm/shmem, swap: avoid false positive swap cache lookup
2025-06-27 6:20 [PATCH v3 0/7] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
` (5 preceding siblings ...)
2025-06-27 6:20 ` [PATCH v3 6/7] mm/shmem, swap: fix major fault counting Kairui Song
@ 2025-06-27 6:20 ` Kairui Song
2025-06-30 7:21 ` Baolin Wang
6 siblings, 1 reply; 23+ messages in thread
From: Kairui Song @ 2025-06-27 6:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox,
Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
If the shmem read request's index points to the middle of a large swap
entry, shmem swapin does the swap cache lookup using the large swap entry's
starting value (the first sub swap entry of this large entry). This leads
to a false positive lookup result if only the first few swap entries are
cached, but the swap entry pointed to by the index is uncached.
Currently shmem will split the large entry and then retry the swapin from
the beginning, which is a waste of CPU and fragile. Handle this correctly
instead.
Also add some sanity checks to help understand the code and ensure things
won't go wrong.
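For clarity, a small userspace sketch of the lookup-key arithmetic (not
kernel code; the numbers are arbitrary): when @index lands in the middle
of a large entry, the value that must be looked up is the entry's base
plus the offset of @index inside it, so a swap cache hit on the base
value alone can be a false positive.

#include <stdio.h>

int main(void)
{
	unsigned long index = 11;		/* faulting index */
	unsigned int order = 4;			/* large entry covers 16 slots */
	unsigned long index_entry = 0x4000;	/* base value of the large entry */

	unsigned long base_index = index & ~((1UL << order) - 1);	/* 0 */
	unsigned long offset = index - base_index;			/* 11 */
	unsigned long swap = index_entry + offset;			/* 0x400b */

	/*
	 * A cached order-0 folio for 0x4000 satisfies a lookup with the base
	 * value, but says nothing about 0x400b, the entry @index refers to.
	 */
	printf("lookup should use 0x%lx, not 0x%lx\n", swap, index_entry);
	return 0;
}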
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 60 +++++++++++++++++++++++++++++-------------------------
1 file changed, 32 insertions(+), 28 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index ea9a105ded5d..9341c51c3d10 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1977,14 +1977,19 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
static struct folio *shmem_swapin_direct(struct inode *inode,
struct vm_area_struct *vma, pgoff_t index,
- swp_entry_t entry, int order, gfp_t gfp)
+ swp_entry_t index_entry, swp_entry_t swap,
+ int order, gfp_t gfp)
{
struct shmem_inode_info *info = SHMEM_I(inode);
- int nr_pages = 1 << order;
struct folio *new;
- pgoff_t offset;
+ swp_entry_t entry;
gfp_t swap_gfp;
void *shadow;
+ int nr_pages;
+
+ /* Prefer aligned THP swapin */
+ entry.val = index_entry.val;
+ nr_pages = 1 << order;
/*
* We have arrived here because our zones are constrained, so don't
@@ -2011,6 +2016,7 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
swap_gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
}
}
+
retry:
new = shmem_alloc_folio(swap_gfp, order, info, index);
if (!new) {
@@ -2056,11 +2062,10 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
if (!order)
return new;
/* High order swapin failed, fallback to order 0 and retry */
- order = 0;
- nr_pages = 1;
+ entry.val = swap.val;
swap_gfp = gfp;
- offset = index - round_down(index, nr_pages);
- entry = swp_entry(swp_type(entry), swp_offset(entry) + offset);
+ nr_pages = 1;
+ order = 0;
goto retry;
}
@@ -2288,20 +2293,21 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
struct shmem_inode_info *info = SHMEM_I(inode);
int error, nr_pages, order, swap_order;
+ swp_entry_t swap, index_entry;
struct swap_info_struct *si;
struct folio *folio = NULL;
bool skip_swapcache = false;
- swp_entry_t swap;
+ pgoff_t offset;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
- swap = radix_to_swp_entry(*foliop);
+ index_entry = radix_to_swp_entry(*foliop);
*foliop = NULL;
- if (is_poisoned_swp_entry(swap))
+ if (is_poisoned_swp_entry(index_entry))
return -EIO;
- si = get_swap_device(swap);
- order = shmem_confirm_swap(mapping, index, swap);
+ si = get_swap_device(index_entry);
+ order = shmem_confirm_swap(mapping, index, index_entry);
if (unlikely(!si)) {
if (order < 0)
return -EEXIST;
@@ -2313,13 +2319,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
return -EEXIST;
}
- /* Look it up and read it in.. */
+ /* @index may points to the middle of a large entry, get the real swap value first */
+ offset = index - round_down(index, 1 << order);
+ swap.val = index_entry.val + offset;
folio = swap_cache_get_folio(swap, NULL, 0);
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
/* Direct mTHP swapin without swap cache or readahead */
folio = shmem_swapin_direct(inode, vma, index,
- swap, order, gfp);
+ index_entry, swap, order, gfp);
if (IS_ERR(folio)) {
error = PTR_ERR(folio);
folio = NULL;
@@ -2341,28 +2349,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
}
+
+ swap_order = folio_order(folio);
+ nr_pages = folio_nr_pages(folio);
+ /* The swap-in should cover both @swap and @index */
+ swap.val = round_down(swap.val, nr_pages);
+ VM_WARN_ON_ONCE(swap.val > index_entry.val + offset);
+ VM_WARN_ON_ONCE(swap.val + nr_pages <= index_entry.val + offset);
+
/*
* We need to split an existing large entry if swapin brought in a
* smaller folio due to various of reasons.
- *
- * And worth noting there is a special case: if there is a smaller
- * cached folio that covers @swap, but not @index (it only covers
- * first few sub entries of the large entry, but @index points to
- * later parts), the swap cache lookup will still see this folio,
- * And we need to split the large entry here. Later checks will fail,
- * as it can't satisfy the swap requirement, and we will retry
- * the swapin from beginning.
*/
- swap_order = folio_order(folio);
+ index = round_down(index, nr_pages);
if (order > swap_order) {
- error = shmem_split_swap_entry(inode, index, swap, gfp);
+ error = shmem_split_swap_entry(inode, index, index_entry, gfp);
if (error)
goto failed_nolock;
}
- index = round_down(index, 1 << swap_order);
- swap.val = round_down(swap.val, 1 << swap_order);
-
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
@@ -2375,7 +2380,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
folio_wait_writeback(folio);
- nr_pages = folio_nr_pages(folio);
/*
* Some architectures may have to restore extra metadata to the
--
2.50.0
* Re: [PATCH v3 1/7] mm/shmem, swap: improve cached mTHP handling and fix potential hung
2025-06-27 6:20 ` [PATCH v3 1/7] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
@ 2025-06-30 3:44 ` Baolin Wang
0 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2025-06-30 3:44 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, linux-kernel, stable
On 2025/6/27 14:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> The current swap-in code assumes that, when a swap entry in shmem mapping
> is order 0, its cached folios (if present) must be order 0 too, which
> turns out not always correct.
>
> The problem is shmem_split_large_entry is called before verifying the
> folio will eventually be swapped in, one possible race is:
>
> CPU1 CPU2
> shmem_swapin_folio
> /* swap in of order > 0 swap entry S1 */
> folio = swap_cache_get_folio
> /* folio = NULL */
> order = xa_get_order
> /* order > 0 */
> folio = shmem_swap_alloc_folio
> /* mTHP alloc failure, folio = NULL */
> <... Interrupted ...>
> shmem_swapin_folio
> /* S1 is swapped in */
> shmem_writeout
> /* S1 is swapped out, folio cached */
> shmem_split_large_entry(..., S1)
> /* S1 is split, but the folio covering it has order > 0 now */
>
> Now any following swapin of S1 will hang: `xa_get_order` returns 0, and
> folio lookup will return a folio with order > 0. The
> `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will always
> return false causing swap-in to return -EEXIST.
>
> And this looks fragile. So fix this up by allowing seeing a larger folio
> in swap cache, and check the whole shmem mapping range covered by the
> swapin have the right swap value upon inserting the folio. And drop the
> redundant tree walks before the insertion.
>
> This will actually improve performance, as it avoids two redundant Xarray
> tree walks in the hot path, and the only side effect is that in the
> failure path, shmem may redundantly reallocate a few folios causing
> temporary slight memory pressure.
>
> And worth noting, it may seems the order and value check before inserting
> might help reducing the lock contention, which is not true. The swap
> cache layer ensures raced swapin will either see a swap cache folio or
> failed to do a swapin (we have SWAP_HAS_CACHE bit even if swap cache is
> bypassed), so holding the folio lock and checking the folio flag is
> already good enough for avoiding the lock contention. The chance that a
> folio passes the swap entry value check but the shmem mapping slot has
> changed should be very low.
>
> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
> Cc: <stable@vger.kernel.org>
Thanks for fixing the issue.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> mm/shmem.c | 30 +++++++++++++++++++++---------
> 1 file changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 334b7b4a61a0..e3c9a1365ff4 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
> pgoff_t index, void *expected, gfp_t gfp)
> {
> XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
> - long nr = folio_nr_pages(folio);
> + unsigned long nr = folio_nr_pages(folio);
> + swp_entry_t iter, swap;
> + void *entry;
>
> VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> @@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
>
> gfp &= GFP_RECLAIM_MASK;
> folio_throttle_swaprate(folio, gfp);
> + swap = iter = radix_to_swp_entry(expected);
>
> do {
> xas_lock_irq(&xas);
> - if (expected != xas_find_conflict(&xas)) {
> - xas_set_err(&xas, -EEXIST);
> - goto unlock;
> + xas_for_each_conflict(&xas, entry) {
> + /*
> + * The range must either be empty, or filled with
> + * expected swap entries. Shmem swap entries are never
> + * partially freed without split of both entry and
> + * folio, so there shouldn't be any holes.
> + */
> + if (!expected || entry != swp_to_radix_entry(iter)) {
> + xas_set_err(&xas, -EEXIST);
> + goto unlock;
> + }
> + iter.val += 1 << xas_get_order(&xas);
> }
> - if (expected && xas_find_conflict(&xas)) {
> + if (expected && iter.val - nr != swap.val) {
> xas_set_err(&xas, -EEXIST);
> goto unlock;
> }
> @@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> error = -ENOMEM;
> goto failed;
> }
> - } else if (order != folio_order(folio)) {
> + } else if (order > folio_order(folio)) {
> /*
> * Swap readahead may swap in order 0 folios into swapcache
> * asynchronously, while the shmem mapping can still stores
> @@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>
> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> }
> + } else if (order < folio_order(folio)) {
> + swap.val = round_down(swap.val, 1 << folio_order(folio));
> }
>
> alloced:
> /* We have to do this with folio locked to prevent races */
> folio_lock(folio);
> if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> - folio->swap.val != swap.val ||
> - !shmem_confirm_swap(mapping, index, swap) ||
> - xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
> + folio->swap.val != swap.val) {
> error = -EEXIST;
> goto unlock;
> }
* Re: [PATCH v3 3/7] mm/shmem, swap: tidy up THP swapin checks
2025-06-27 6:20 ` [PATCH v3 3/7] mm/shmem, swap: tidy up THP swapin checks Kairui Song
@ 2025-06-30 4:47 ` Baolin Wang
0 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2025-06-30 4:47 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/6/27 14:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Move all THP swapin related checks under CONFIG_TRANSPARENT_HUGEPAGE,
> so they will be trimmed off by the compiler if not needed.
>
> And add a WARN if shmem sees a order > 0 entry when
> CONFIG_TRANSPARENT_HUGEPAGE is disabled, that should never happen unless
> things went very wrong.
>
> There should be no observable feature change except the new added WARN.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
LGTM. Thanks.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> mm/shmem.c | 42 ++++++++++++++++++++----------------------
> 1 file changed, 20 insertions(+), 22 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 033dc7a3435d..f85a985167c5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1980,26 +1980,39 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> swp_entry_t entry, int order, gfp_t gfp)
> {
> struct shmem_inode_info *info = SHMEM_I(inode);
> + int nr_pages = 1 << order;
> struct folio *new;
> void *shadow;
> - int nr_pages;
>
> /*
> * We have arrived here because our zones are constrained, so don't
> * limit chance of success with further cpuset and node constraints.
> */
> gfp &= ~GFP_CONSTRAINT_MASK;
> - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && order > 0) {
> - gfp_t huge_gfp = vma_thp_gfp_mask(vma);
> -
> - gfp = limit_gfp_mask(huge_gfp, gfp);
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> + if (WARN_ON_ONCE(order))
> + return ERR_PTR(-EINVAL);
> + } else if (order) {
> + /*
> + * If uffd is active for the vma, we need per-page fault
> + * fidelity to maintain the uffd semantics, then fallback
> + * to swapin order-0 folio, as well as for zswap case.
> + * Any existing sub folio in the swap cache also blocks
> + * mTHP swapin.
> + */
> + if ((vma && unlikely(userfaultfd_armed(vma))) ||
> + !zswap_never_enabled() ||
> + non_swapcache_batch(entry, nr_pages) != nr_pages) {
> + return ERR_PTR(-EINVAL);
> + } else {
> + gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
> + }
> }
>
> new = shmem_alloc_folio(gfp, order, info, index);
> if (!new)
> return ERR_PTR(-ENOMEM);
>
> - nr_pages = folio_nr_pages(new);
> if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
> gfp, entry)) {
> folio_put(new);
> @@ -2283,9 +2296,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> /* Look it up and read it in.. */
> folio = swap_cache_get_folio(swap, NULL, 0);
> if (!folio) {
> - int nr_pages = 1 << order;
> - bool fallback_order0 = false;
> -
> /* Or update major stats only when swapin succeeds?? */
> if (fault_type) {
> *fault_type |= VM_FAULT_MAJOR;
> @@ -2293,20 +2303,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
>
> - /*
> - * If uffd is active for the vma, we need per-page fault
> - * fidelity to maintain the uffd semantics, then fallback
> - * to swapin order-0 folio, as well as for zswap case.
> - * Any existing sub folio in the swap cache also blocks
> - * mTHP swapin.
> - */
> - if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
> - !zswap_never_enabled() ||
> - non_swapcache_batch(swap, nr_pages) != nr_pages))
> - fallback_order0 = true;
> -
> /* Skip swapcache for synchronous device. */
> - if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
> if (!IS_ERR(folio)) {
> skip_swapcache = true;
* Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
2025-06-27 6:20 ` [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting Kairui Song
@ 2025-06-30 6:34 ` Baolin Wang
2025-06-30 9:16 ` Kairui Song
0 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2025-06-30 6:34 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/6/27 14:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Instead of keeping different paths of splitting the entry and
> recalculating the swap entry and index, do it in one place.
>
> Whenever swapin brought in a folio smaller than the entry, split the
> entry. And always recalculate the entry and index, in case it might
> read in a folio that's larger than the entry order. This removes
> duplicated code and function calls, and makes the code more robust.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 103 +++++++++++++++++++++--------------------------------
> 1 file changed, 41 insertions(+), 62 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f85a985167c5..5be9c905396e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2178,8 +2178,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
> swap_free_nr(swap, nr_pages);
> }
>
> -static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> - swp_entry_t swap, gfp_t gfp)
> +/*
> + * Split an existing large swap entry. @index should point to one sub mapping
> + * slot within the entry @swap, this sub slot will be split into order 0.
> + */
> +static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
> + swp_entry_t swap, gfp_t gfp)
> {
> struct address_space *mapping = inode->i_mapping;
> XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
> @@ -2250,7 +2254,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> if (xas_error(&xas))
> return xas_error(&xas);
>
> - return entry_order;
> + return 0;
> }
>
> /*
> @@ -2267,11 +2271,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> struct address_space *mapping = inode->i_mapping;
> struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> struct shmem_inode_info *info = SHMEM_I(inode);
> + int error, nr_pages, order, swap_order;
> struct swap_info_struct *si;
> struct folio *folio = NULL;
> bool skip_swapcache = false;
> swp_entry_t swap;
> - int error, nr_pages, order, split_order;
>
> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> swap = radix_to_swp_entry(*foliop);
> @@ -2321,70 +2325,43 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> goto failed;
> }
>
> - /*
> - * Now swap device can only swap in order 0 folio, then we
> - * should split the large swap entry stored in the pagecache
> - * if necessary.
> - */
> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> - if (split_order < 0) {
> - error = split_order;
> - goto failed;
> - }
> -
> - /*
> - * If the large swap entry has already been split, it is
> - * necessary to recalculate the new swap entry based on
> - * the old order alignment.
> - */
> - if (split_order > 0) {
> - pgoff_t offset = index - round_down(index, 1 << split_order);
> -
> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> - }
> -
> /* Here we actually start the io */
> folio = shmem_swapin_cluster(swap, gfp, info, index);
> if (!folio) {
> error = -ENOMEM;
> goto failed;
> }
> - } else if (order > folio_order(folio)) {
> - /*
> - * Swap readahead may swap in order 0 folios into swapcache
> - * asynchronously, while the shmem mapping can still stores
> - * large swap entries. In such cases, we should split the
> - * large swap entry to prevent possible data corruption.
> - */
> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> - if (split_order < 0) {
> - folio_put(folio);
> - folio = NULL;
> - error = split_order;
> - goto failed;
> - }
> -
> - /*
> - * If the large swap entry has already been split, it is
> - * necessary to recalculate the new swap entry based on
> - * the old order alignment.
> - */
> - if (split_order > 0) {
> - pgoff_t offset = index - round_down(index, 1 << split_order);
> -
> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> - }
> - } else if (order < folio_order(folio)) {
> - swap.val = round_down(swap.val, 1 << folio_order(folio));
> }
>
> alloced:
> + /*
> + * We need to split an existing large entry if swapin brought in a
> + * smaller folio due to various of reasons.
> + *
> + * And worth noting there is a special case: if there is a smaller
> + * cached folio that covers @swap, but not @index (it only covers
> + * first few sub entries of the large entry, but @index points to
> + * later parts), the swap cache lookup will still see this folio,
> + * And we need to split the large entry here. Later checks will fail,
> + * as it can't satisfy the swap requirement, and we will retry
> + * the swapin from beginning.
> + */
> + swap_order = folio_order(folio);
Nit: 'swap_order' is confusing; can you just use folio_order() or a
better name?
> + if (order > swap_order) {
> + error = shmem_split_swap_entry(inode, index, swap, gfp);
> + if (error)
> + goto failed_nolock;
> + }
> +
> + index = round_down(index, 1 << swap_order);
> + swap.val = round_down(swap.val, 1 << swap_order);
The round_down() of the index and swap value here may cause
shmem_add_to_page_cache() to fail to insert the new folio, because the
swap value stored at that index in the shmem mapping does not match,
leading to another swapin page fault for correction.
For example, shmem stores a large swap entry of order 4 in the range of
index 0-64. When a swapin fault occurs at index = 3 with swap.val =
0x4000, if a split happens and this round_down() logic is applied, then
index = 3 and swap.val = 0x4000. However, the actual swap.val stored in
the shmem mapping should be 0x4003. This would cause another swapin fault.
I still prefer my original alignment method; do you find it causes any
issues?
"
if (split_order > 0) {
pgoff_t offset = index - round_down(index, 1 << split_order);
swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
}
"
> +
> /* We have to do this with folio locked to prevent races */
> folio_lock(folio);
> if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> folio->swap.val != swap.val) {
> error = -EEXIST;
> - goto unlock;
> + goto failed_unlock;
> }
> if (!folio_test_uptodate(folio)) {
> error = -EIO;
> @@ -2405,8 +2382,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> goto failed;
> }
>
> - error = shmem_add_to_page_cache(folio, mapping,
> - round_down(index, nr_pages),
> + error = shmem_add_to_page_cache(folio, mapping, index,
> swp_to_radix_entry(swap), gfp);
> if (error)
> goto failed;
> @@ -2417,8 +2393,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> folio_mark_accessed(folio);
>
> if (skip_swapcache) {
> + swapcache_clear(si, folio->swap, folio_nr_pages(folio));
> folio->swap.val = 0;
> - swapcache_clear(si, swap, nr_pages);
> } else {
> delete_from_swap_cache(folio);
> }
> @@ -2434,13 +2410,16 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> if (error == -EIO)
> shmem_set_folio_swapin_error(inode, index, folio, swap,
> skip_swapcache);
> -unlock:
> - if (skip_swapcache)
> - swapcache_clear(si, swap, folio_nr_pages(folio));
> - if (folio) {
> +failed_unlock:
> + if (folio)
> folio_unlock(folio);
> - folio_put(folio);
> +failed_nolock:
> + if (skip_swapcache) {
> + swapcache_clear(si, folio->swap, folio_nr_pages(folio));
> + folio->swap.val = 0;
> }
> + if (folio)
> + folio_put(folio);
> put_swap_device(si);
> return error;
> }
* Re: [PATCH v3 6/7] mm/shmem, swap: fix major fault counting
2025-06-27 6:20 ` [PATCH v3 6/7] mm/shmem, swap: fix major fault counting Kairui Song
@ 2025-06-30 7:05 ` Baolin Wang
0 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2025-06-30 7:05 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/6/27 14:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> If the swapin failed, don't update the major fault count. There is a
> long existing comment for doing it this way, now with previous cleanups,
> we can finally fix it.
Sounds reasonable to me. Additionally, swapin failure is a rare event,
so I think this patch will have little impact on user statistics.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 11 +++++------
> 1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 5f2641fd1be7..ea9a105ded5d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2316,12 +2316,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> /* Look it up and read it in.. */
> folio = swap_cache_get_folio(swap, NULL, 0);
> if (!folio) {
> - /* Or update major stats only when swapin succeeds?? */
> - if (fault_type) {
> - *fault_type |= VM_FAULT_MAJOR;
> - count_vm_event(PGMAJFAULT);
> - count_memcg_event_mm(fault_mm, PGMAJFAULT);
> - }
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> /* Direct mTHP swapin without swap cache or readahead */
> folio = shmem_swapin_direct(inode, vma, index,
> @@ -2341,6 +2335,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> }
> if (!folio)
> goto failed;
> + if (fault_type) {
> + *fault_type |= VM_FAULT_MAJOR;
> + count_vm_event(PGMAJFAULT);
> + count_memcg_event_mm(fault_mm, PGMAJFAULT);
> + }
> }
> /*
> * We need to split an existing large entry if swapin brought in a
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 7/7] mm/shmem, swap: avoid false positive swap cache lookup
2025-06-27 6:20 ` [PATCH v3 7/7] mm/shmem, swap: avoid false positive swap cache lookup Kairui Song
@ 2025-06-30 7:21 ` Baolin Wang
2025-06-30 9:17 ` Kairui Song
0 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2025-06-30 7:21 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/6/27 14:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> If the shmem read request's index points to the middle of a large swap
> entry, shmem swap in does the swap cache lookup use the large swap entry's
> starting value (the first sub swap entry of this large entry).
Right.
This will
> lead to false positive lookup result if only the first few swap entries
> are cached, but the requested swap entry pointed by index is uncached.
>
> Currently shmem will do a large entry split then retry the swapin from
> beginning, which is a waste of CPU and fragile. Handle this correctly.
Sorry, I did not get you here. Only when the 'order' (obtained from
shmem_confirm_swap()) is not equal to folio_order() will it trigger the
large swap entry split. Could you describe the issue in more detail?
I also found a false positive swap-in in your patch 4; it seems they are
related?
> Also add some sanity checks to help understand the code and ensure things
> won't go wrong.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 60 +++++++++++++++++++++++++++++-------------------------
> 1 file changed, 32 insertions(+), 28 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ea9a105ded5d..9341c51c3d10 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1977,14 +1977,19 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>
> static struct folio *shmem_swapin_direct(struct inode *inode,
> struct vm_area_struct *vma, pgoff_t index,
> - swp_entry_t entry, int order, gfp_t gfp)
> + swp_entry_t index_entry, swp_entry_t swap,
> + int order, gfp_t gfp)
> {
> struct shmem_inode_info *info = SHMEM_I(inode);
> - int nr_pages = 1 << order;
> struct folio *new;
> - pgoff_t offset;
> + swp_entry_t entry;
> gfp_t swap_gfp;
> void *shadow;
> + int nr_pages;
> +
> + /* Prefer aligned THP swapin */
> + entry.val = index_entry.val;
> + nr_pages = 1 << order;
>
> /*
> * We have arrived here because our zones are constrained, so don't
> @@ -2011,6 +2016,7 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
> swap_gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
> }
> }
> +
> retry:
> new = shmem_alloc_folio(swap_gfp, order, info, index);
> if (!new) {
> @@ -2056,11 +2062,10 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
> if (!order)
> return new;
> /* High order swapin failed, fallback to order 0 and retry */
> - order = 0;
> - nr_pages = 1;
> + entry.val = swap.val;
> swap_gfp = gfp;
> - offset = index - round_down(index, nr_pages);
> - entry = swp_entry(swp_type(entry), swp_offset(entry) + offset);
> + nr_pages = 1;
> + order = 0;
> goto retry;
> }
>
> @@ -2288,20 +2293,21 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> struct shmem_inode_info *info = SHMEM_I(inode);
> int error, nr_pages, order, swap_order;
> + swp_entry_t swap, index_entry;
> struct swap_info_struct *si;
> struct folio *folio = NULL;
> bool skip_swapcache = false;
> - swp_entry_t swap;
> + pgoff_t offset;
>
> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> - swap = radix_to_swp_entry(*foliop);
> + index_entry = radix_to_swp_entry(*foliop);
> *foliop = NULL;
>
> - if (is_poisoned_swp_entry(swap))
> + if (is_poisoned_swp_entry(index_entry))
> return -EIO;
>
> - si = get_swap_device(swap);
> - order = shmem_confirm_swap(mapping, index, swap);
> + si = get_swap_device(index_entry);
> + order = shmem_confirm_swap(mapping, index, index_entry);
> if (unlikely(!si)) {
> if (order < 0)
> return -EEXIST;
> @@ -2313,13 +2319,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> return -EEXIST;
> }
>
> - /* Look it up and read it in.. */
> + /* @index may points to the middle of a large entry, get the real swap value first */
> + offset = index - round_down(index, 1 << order);
> + swap.val = index_entry.val + offset;
> folio = swap_cache_get_folio(swap, NULL, 0);
> if (!folio) {
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> /* Direct mTHP swapin without swap cache or readahead */
> folio = shmem_swapin_direct(inode, vma, index,
> - swap, order, gfp);
> + index_entry, swap, order, gfp);
> if (IS_ERR(folio)) {
> error = PTR_ERR(folio);
> folio = NULL;
> @@ -2341,28 +2349,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
> }
> +
> + swap_order = folio_order(folio);
> + nr_pages = folio_nr_pages(folio);
> + /* The swap-in should cover both @swap and @index */
> + swap.val = round_down(swap.val, nr_pages);
> + VM_WARN_ON_ONCE(swap.val > index_entry.val + offset);
> + VM_WARN_ON_ONCE(swap.val + nr_pages <= index_entry.val + offset);
> +
> /*
> * We need to split an existing large entry if swapin brought in a
> * smaller folio due to various of reasons.
> - *
> - * And worth noting there is a special case: if there is a smaller
> - * cached folio that covers @swap, but not @index (it only covers
> - * first few sub entries of the large entry, but @index points to
> - * later parts), the swap cache lookup will still see this folio,
> - * And we need to split the large entry here. Later checks will fail,
> - * as it can't satisfy the swap requirement, and we will retry
> - * the swapin from beginning.
> */
> - swap_order = folio_order(folio);
> + index = round_down(index, nr_pages);
> if (order > swap_order) {
> - error = shmem_split_swap_entry(inode, index, swap, gfp);
> + error = shmem_split_swap_entry(inode, index, index_entry, gfp);
> if (error)
> goto failed_nolock;
> }
>
> - index = round_down(index, 1 << swap_order);
> - swap.val = round_down(swap.val, 1 << swap_order);
> -
> /* We have to do this with folio locked to prevent races */
> folio_lock(folio);
> if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> @@ -2375,7 +2380,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> goto failed;
> }
> folio_wait_writeback(folio);
> - nr_pages = folio_nr_pages(folio);
>
> /*
> * Some architectures may have to restore extra metadata to the
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 5/7] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO
2025-06-27 6:20 ` [PATCH v3 5/7] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-06-30 7:24 ` Baolin Wang
0 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2025-06-30 7:24 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li,
Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/6/27 14:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Currently if THP swapin failed due to reasons like partially conflicting
> swap cache or ZSWAP enabled, it will fallback to cached swapin.
>
> Right now the swap cache has a non-trivial overhead, and readahead is
> not helpful for SWP_SYNCHRONOUS_IO devices, so we should always skip
> the readahead and swap cache even if the swapin falls back to order 0.
>
> So handle the fallback logic without falling back to the cached read.
>
> Also slightly tweak the behavior if the WARN_ON is triggered (shmem
> mapping is corrupted or buggy code) as a side effect, just return
> with -EINVAL. This should be OK as things are already very wrong
> beyond recovery at that point.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 68 ++++++++++++++++++++++++++++++------------------------
> 1 file changed, 38 insertions(+), 30 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 5be9c905396e..5f2641fd1be7 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1975,13 +1975,15 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> return ERR_PTR(error);
> }
>
> -static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> +static struct folio *shmem_swapin_direct(struct inode *inode,
> struct vm_area_struct *vma, pgoff_t index,
> swp_entry_t entry, int order, gfp_t gfp)
> {
> struct shmem_inode_info *info = SHMEM_I(inode);
> int nr_pages = 1 << order;
> struct folio *new;
> + pgoff_t offset;
> + gfp_t swap_gfp;
Nit: The term 'swap' always reminds me of swap allocation:) But here
it's actually about allocating a folio. Would 'alloc_gfp' be a better
name? Otherwise looks good to me.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
2025-06-30 6:34 ` Baolin Wang
@ 2025-06-30 9:16 ` Kairui Song
2025-06-30 9:53 ` Baolin Wang
0 siblings, 1 reply; 23+ messages in thread
From: Kairui Song @ 2025-06-30 9:16 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
On Mon, Jun 30, 2025 at 2:34 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
> On 2025/6/27 14:20, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Instead of keeping different paths of splitting the entry and
> > recalculating the swap entry and index, do it in one place.
> >
> > Whenever swapin brought in a folio smaller than the entry, split the
> > entry. And always recalculate the entry and index, in case it might
> > read in a folio that's larger than the entry order. This removes
> > duplicated code and function calls, and makes the code more robust.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/shmem.c | 103 +++++++++++++++++++++--------------------------------
> > 1 file changed, 41 insertions(+), 62 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index f85a985167c5..5be9c905396e 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2178,8 +2178,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
> > swap_free_nr(swap, nr_pages);
> > }
> >
> > -static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> > - swp_entry_t swap, gfp_t gfp)
> > +/*
> > + * Split an existing large swap entry. @index should point to one sub mapping
> > + * slot within the entry @swap, this sub slot will be split into order 0.
> > + */
> > +static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
> > + swp_entry_t swap, gfp_t gfp)
> > {
> > struct address_space *mapping = inode->i_mapping;
> > XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
> > @@ -2250,7 +2254,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> > if (xas_error(&xas))
> > return xas_error(&xas);
> >
> > - return entry_order;
> > + return 0;
> > }
> >
> > /*
> > @@ -2267,11 +2271,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > struct address_space *mapping = inode->i_mapping;
> > struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> > struct shmem_inode_info *info = SHMEM_I(inode);
> > + int error, nr_pages, order, swap_order;
> > struct swap_info_struct *si;
> > struct folio *folio = NULL;
> > bool skip_swapcache = false;
> > swp_entry_t swap;
> > - int error, nr_pages, order, split_order;
> >
> > VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> > swap = radix_to_swp_entry(*foliop);
> > @@ -2321,70 +2325,43 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > goto failed;
> > }
> >
> > - /*
> > - * Now swap device can only swap in order 0 folio, then we
> > - * should split the large swap entry stored in the pagecache
> > - * if necessary.
> > - */
> > - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> > - if (split_order < 0) {
> > - error = split_order;
> > - goto failed;
> > - }
> > -
> > - /*
> > - * If the large swap entry has already been split, it is
> > - * necessary to recalculate the new swap entry based on
> > - * the old order alignment.
> > - */
> > - if (split_order > 0) {
> > - pgoff_t offset = index - round_down(index, 1 << split_order);
> > -
> > - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> > - }
> > -
> > /* Here we actually start the io */
> > folio = shmem_swapin_cluster(swap, gfp, info, index);
> > if (!folio) {
> > error = -ENOMEM;
> > goto failed;
> > }
> > - } else if (order > folio_order(folio)) {
> > - /*
> > - * Swap readahead may swap in order 0 folios into swapcache
> > - * asynchronously, while the shmem mapping can still stores
> > - * large swap entries. In such cases, we should split the
> > - * large swap entry to prevent possible data corruption.
> > - */
> > - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> > - if (split_order < 0) {
> > - folio_put(folio);
> > - folio = NULL;
> > - error = split_order;
> > - goto failed;
> > - }
> > -
> > - /*
> > - * If the large swap entry has already been split, it is
> > - * necessary to recalculate the new swap entry based on
> > - * the old order alignment.
> > - */
> > - if (split_order > 0) {
> > - pgoff_t offset = index - round_down(index, 1 << split_order);
> > -
> > - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> > - }
> > - } else if (order < folio_order(folio)) {
> > - swap.val = round_down(swap.val, 1 << folio_order(folio));
> > }
> >
> > alloced:
> > + /*
> > + * We need to split an existing large entry if swapin brought in a
> > + * smaller folio due to various of reasons.
> > + *
> > + * And worth noting there is a special case: if there is a smaller
> > + * cached folio that covers @swap, but not @index (it only covers
> > + * first few sub entries of the large entry, but @index points to
> > + * later parts), the swap cache lookup will still see this folio,
> > + * And we need to split the large entry here. Later checks will fail,
> > + * as it can't satisfy the swap requirement, and we will retry
> > + * the swapin from beginning.
> > + */
> > + swap_order = folio_order(folio);
>
> Nit: 'swap_order' is confusing, and can you just use folio_order() or a
> btter name?
Good idea.
>
> > + if (order > swap_order) {
> > + error = shmem_split_swap_entry(inode, index, swap, gfp);
> > + if (error)
> > + goto failed_nolock;
> > + }
> > +
> > + index = round_down(index, 1 << swap_order);
> > + swap.val = round_down(swap.val, 1 << swap_order);
>
> The round_down() of index and swap value here may cause
> shmem_add_to_page_cache() to fail to insert a new folio, because the
> swap value stored at that index in the shmem mapping does not match,
> leading to another swapin page fault for correction.
>
> For example, shmem stores a large swap entry of order 4 in the range of
> index 0-64. When a swapin fault occurs at index = 3, with swap.val =
> 0x4000, if a split happens and this round_down() logic is applied, then
> index = 3, swap.val = 0x4000. However, the actual swap.val should be
> 0x4003 stored in the shmem mapping. This would cause another swapin fault.
Oops, I missed a swap value fixup in the !SWP_SYNCHRONOUS_IO path
above, it should re-calculate the swap value there. It's fixed in the
final patch but left unhandled here. I'll update this part.
>
> I still prefer my original alignment method, and do you find this will
> cause any issues?
>
> "
> if (split_order > 0) {
> pgoff_t offset = index - round_down(index, 1 << split_order);
>
> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> }
> "
It only fits the cached swapin and uncached swapin paths, not the cache
hit case. A cache hit may see a larger folio, so no split happens, but
the round_down is still needed.
And there is another racy case: before this patch, the split may happen
first, but shmem_swapin_cluster could still bring in a large folio due
to a race in the swap cache layer.
And I'm not sure split_order is always reliable here; for example, a
concurrent split may make it return an inaccurate value.
So I wanted to simplify it: by rounding down to folio_order(folio) we
simply get the index and swap that will be covered by this specific
folio. If the covered range still has the corresponding swap entries
(checked and ensured by shmem_add_to_page_cache, which holds the
xa_lock), then the folio can be inserted back safely and successfully.
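To illustrate what that alignment computes, here is a minimal userspace
sketch with hypothetical numbers (a 4-page folio came back for a fault
at index 7 whose backing sub entry is 0x4007; round_down below is a
local stand-in for the kernel helper so the snippet builds on its own):

#include <stdio.h>

/* Local stand-in for the kernel's round_down(); align must be a power of two. */
#define round_down(x, align) ((x) & ~((unsigned long)(align) - 1))

int main(void)
{
	unsigned long folio_order = 2;      /* swapin ended up with a 4-page folio */
	unsigned long nr_pages = 1UL << folio_order;
	unsigned long index = 7;            /* faulting index                      */
	unsigned long swap = 0x4007;        /* sub entry backing @index            */

	/* Derive the (index, swap) pair this folio must cover. */
	index = round_down(index, nr_pages);    /* 4      */
	swap = round_down(swap, nr_pages);      /* 0x4004 */

	printf("insert folio at index %lu for swap.val 0x%lx\n", index, swap);
	/*
	 * shmem_add_to_page_cache() then re-checks, under the xa_lock, that
	 * indexes 4..7 still hold exactly these swap entries; if not, the
	 * insert fails and the fault is retried.
	 */
	return 0;
}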
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 7/7] mm/shmem, swap: avoid false positive swap cache lookup
2025-06-30 7:21 ` Baolin Wang
@ 2025-06-30 9:17 ` Kairui Song
0 siblings, 0 replies; 23+ messages in thread
From: Kairui Song @ 2025-06-30 9:17 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
On Mon, Jun 30, 2025 at 3:22 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
> On 2025/6/27 14:20, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > If the shmem read request's index points to the middle of a large swap
> > entry, shmem swap in does the swap cache lookup use the large swap entry's
> > starting value (the first sub swap entry of this large entry).
>
> Right.
>
> This will
> > lead to false positive lookup result if only the first few swap entries
> > are cached, but the requested swap entry pointed by index is uncached.
> >
> > Currently shmem will do a large entry split then retry the swapin from
> > beginning, which is a waste of CPU and fragile. Handle this correctly.
>
> Sorry, I did not get you here. Only when the 'order' (get from
> shmem_confirm_swap()) is not equal to folio_order(), will it trigger the
> large swap entry split. Could you describe the issue in more detail?
Right. For example, we have an `order = 4` swap entry covering `index 0
- 16`, with `swap.val = 0x4000`.
A swapin request starts with `index = 3`. swap_cache_get_folio will be
called with `swap.val = 0x4000`. It may return an order 0 folio with
`swap.val = 0x4000` (readahead brings in order 0 folios easily). That
folio doesn't satisfy the swapin request, so a split is issued, the
swapin fails, and the fault is triggered again.
After this patch, swap_cache_get_folio will return either NULL or the
right folio.
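To make the arithmetic concrete, a minimal userspace sketch of the
lookup-offset calculation for exactly this example (round_down below is
a local stand-in for the kernel helper so the snippet builds on its own):

#include <stdio.h>

/* Local stand-in for the kernel's round_down(); align must be a power of two. */
#define round_down(x, align) ((x) & ~((unsigned long)(align) - 1))

int main(void)
{
	unsigned long order = 4;            /* the large entry covers 16 slots    */
	unsigned long index_entry = 0x4000; /* swap offset of its first sub entry */
	unsigned long index = 3;            /* faulting index inside the entry    */

	/* What the fix computes before the swap cache lookup. */
	unsigned long offset = index - round_down(index, 1UL << order);
	unsigned long lookup = index_entry + offset;

	printf("offset = %lu, lookup swap.val = 0x%lx\n", offset, lookup);
	/* Prints: offset = 3, lookup swap.val = 0x4003 */
	return 0;
}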
> I also found a false positive swap-in in your patch 4, seems they are
> related?
It's unrelated; I should have added this code for the
!SWP_SYNCHRONOUS_IO path in that patch:
offset = index - round_down(index, 1 << order);
swap.val = index_entry.val + offset;
>
> > Also add some sanity checks to help understand the code and ensure things
> > won't go wrong.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/shmem.c | 60 +++++++++++++++++++++++++++++-------------------------
> > 1 file changed, 32 insertions(+), 28 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index ea9a105ded5d..9341c51c3d10 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1977,14 +1977,19 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> >
> > static struct folio *shmem_swapin_direct(struct inode *inode,
> > struct vm_area_struct *vma, pgoff_t index,
> > - swp_entry_t entry, int order, gfp_t gfp)
> > + swp_entry_t index_entry, swp_entry_t swap,
> > + int order, gfp_t gfp)
> > {
> > struct shmem_inode_info *info = SHMEM_I(inode);
> > - int nr_pages = 1 << order;
> > struct folio *new;
> > - pgoff_t offset;
> > + swp_entry_t entry;
> > gfp_t swap_gfp;
> > void *shadow;
> > + int nr_pages;
> > +
> > + /* Prefer aligned THP swapin */
> > + entry.val = index_entry.val;
> > + nr_pages = 1 << order;
> >
> > /*
> > * We have arrived here because our zones are constrained, so don't
> > @@ -2011,6 +2016,7 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
> > swap_gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
> > }
> > }
> > +
> > retry:
> > new = shmem_alloc_folio(swap_gfp, order, info, index);
> > if (!new) {
> > @@ -2056,11 +2062,10 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
> > if (!order)
> > return new;
> > /* High order swapin failed, fallback to order 0 and retry */
> > - order = 0;
> > - nr_pages = 1;
> > + entry.val = swap.val;
> > swap_gfp = gfp;
> > - offset = index - round_down(index, nr_pages);
> > - entry = swp_entry(swp_type(entry), swp_offset(entry) + offset);
> > + nr_pages = 1;
> > + order = 0;
> > goto retry;
> > }
> >
> > @@ -2288,20 +2293,21 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> > struct shmem_inode_info *info = SHMEM_I(inode);
> > int error, nr_pages, order, swap_order;
> > + swp_entry_t swap, index_entry;
> > struct swap_info_struct *si;
> > struct folio *folio = NULL;
> > bool skip_swapcache = false;
> > - swp_entry_t swap;
> > + pgoff_t offset;
> >
> > VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> > - swap = radix_to_swp_entry(*foliop);
> > + index_entry = radix_to_swp_entry(*foliop);
> > *foliop = NULL;
> >
> > - if (is_poisoned_swp_entry(swap))
> > + if (is_poisoned_swp_entry(index_entry))
> > return -EIO;
> >
> > - si = get_swap_device(swap);
> > - order = shmem_confirm_swap(mapping, index, swap);
> > + si = get_swap_device(index_entry);
> > + order = shmem_confirm_swap(mapping, index, index_entry);
> > if (unlikely(!si)) {
> > if (order < 0)
> > return -EEXIST;
> > @@ -2313,13 +2319,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > return -EEXIST;
> > }
> >
> > - /* Look it up and read it in.. */
> > + /* @index may points to the middle of a large entry, get the real swap value first */
> > + offset = index - round_down(index, 1 << order);
> > + swap.val = index_entry.val + offset;
> > folio = swap_cache_get_folio(swap, NULL, 0);
> > if (!folio) {
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > /* Direct mTHP swapin without swap cache or readahead */
> > folio = shmem_swapin_direct(inode, vma, index,
> > - swap, order, gfp);
> > + index_entry, swap, order, gfp);
> > if (IS_ERR(folio)) {
> > error = PTR_ERR(folio);
> > folio = NULL;
> > @@ -2341,28 +2349,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > count_memcg_event_mm(fault_mm, PGMAJFAULT);
> > }
> > }
> > +
> > + swap_order = folio_order(folio);
> > + nr_pages = folio_nr_pages(folio);
> > + /* The swap-in should cover both @swap and @index */
> > + swap.val = round_down(swap.val, nr_pages);
> > + VM_WARN_ON_ONCE(swap.val > index_entry.val + offset);
> > + VM_WARN_ON_ONCE(swap.val + nr_pages <= index_entry.val + offset);
> > +
> > /*
> > * We need to split an existing large entry if swapin brought in a
> > * smaller folio due to various of reasons.
> > - *
> > - * And worth noting there is a special case: if there is a smaller
> > - * cached folio that covers @swap, but not @index (it only covers
> > - * first few sub entries of the large entry, but @index points to
> > - * later parts), the swap cache lookup will still see this folio,
> > - * And we need to split the large entry here. Later checks will fail,
> > - * as it can't satisfy the swap requirement, and we will retry
> > - * the swapin from beginning.
> > */
> > - swap_order = folio_order(folio);
> > + index = round_down(index, nr_pages);
> > if (order > swap_order) {
> > - error = shmem_split_swap_entry(inode, index, swap, gfp);
> > + error = shmem_split_swap_entry(inode, index, index_entry, gfp);
> > if (error)
> > goto failed_nolock;
> > }
> >
> > - index = round_down(index, 1 << swap_order);
> > - swap.val = round_down(swap.val, 1 << swap_order);
> > -
> > /* We have to do this with folio locked to prevent races */
> > folio_lock(folio);
> > if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> > @@ -2375,7 +2380,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > goto failed;
> > }
> > folio_wait_writeback(folio);
> > - nr_pages = folio_nr_pages(folio);
> >
> > /*
> > * Some architectures may have to restore extra metadata to the
>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
2025-06-30 9:16 ` Kairui Song
@ 2025-06-30 9:53 ` Baolin Wang
2025-06-30 10:06 ` Kairui Song
0 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2025-06-30 9:53 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/6/30 17:16, Kairui Song wrote:
> On Mon, Jun 30, 2025 at 2:34 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>> On 2025/6/27 14:20, Kairui Song wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> Instead of keeping different paths of splitting the entry and
>>> recalculating the swap entry and index, do it in one place.
>>>
>>> Whenever swapin brought in a folio smaller than the entry, split the
>>> entry. And always recalculate the entry and index, in case it might
>>> read in a folio that's larger than the entry order. This removes
>>> duplicated code and function calls, and makes the code more robust.
>>>
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>> mm/shmem.c | 103 +++++++++++++++++++++--------------------------------
>>> 1 file changed, 41 insertions(+), 62 deletions(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index f85a985167c5..5be9c905396e 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -2178,8 +2178,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>> swap_free_nr(swap, nr_pages);
>>> }
>>>
>>> -static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>> - swp_entry_t swap, gfp_t gfp)
>>> +/*
>>> + * Split an existing large swap entry. @index should point to one sub mapping
>>> + * slot within the entry @swap, this sub slot will be split into order 0.
>>> + */
>>> +static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
>>> + swp_entry_t swap, gfp_t gfp)
>>> {
>>> struct address_space *mapping = inode->i_mapping;
>>> XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>>> @@ -2250,7 +2254,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>> if (xas_error(&xas))
>>> return xas_error(&xas);
>>>
>>> - return entry_order;
>>> + return 0;
>>> }
>>>
>>> /*
>>> @@ -2267,11 +2271,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>> struct address_space *mapping = inode->i_mapping;
>>> struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
>>> struct shmem_inode_info *info = SHMEM_I(inode);
>>> + int error, nr_pages, order, swap_order;
>>> struct swap_info_struct *si;
>>> struct folio *folio = NULL;
>>> bool skip_swapcache = false;
>>> swp_entry_t swap;
>>> - int error, nr_pages, order, split_order;
>>>
>>> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
>>> swap = radix_to_swp_entry(*foliop);
>>> @@ -2321,70 +2325,43 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>> goto failed;
>>> }
>>>
>>> - /*
>>> - * Now swap device can only swap in order 0 folio, then we
>>> - * should split the large swap entry stored in the pagecache
>>> - * if necessary.
>>> - */
>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>> - if (split_order < 0) {
>>> - error = split_order;
>>> - goto failed;
>>> - }
>>> -
>>> - /*
>>> - * If the large swap entry has already been split, it is
>>> - * necessary to recalculate the new swap entry based on
>>> - * the old order alignment.
>>> - */
>>> - if (split_order > 0) {
>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
>>> -
>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>> - }
>>> -
>>> /* Here we actually start the io */
>>> folio = shmem_swapin_cluster(swap, gfp, info, index);
>>> if (!folio) {
>>> error = -ENOMEM;
>>> goto failed;
>>> }
>>> - } else if (order > folio_order(folio)) {
>>> - /*
>>> - * Swap readahead may swap in order 0 folios into swapcache
>>> - * asynchronously, while the shmem mapping can still stores
>>> - * large swap entries. In such cases, we should split the
>>> - * large swap entry to prevent possible data corruption.
>>> - */
>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>> - if (split_order < 0) {
>>> - folio_put(folio);
>>> - folio = NULL;
>>> - error = split_order;
>>> - goto failed;
>>> - }
>>> -
>>> - /*
>>> - * If the large swap entry has already been split, it is
>>> - * necessary to recalculate the new swap entry based on
>>> - * the old order alignment.
>>> - */
>>> - if (split_order > 0) {
>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
>>> -
>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>> - }
>>> - } else if (order < folio_order(folio)) {
>>> - swap.val = round_down(swap.val, 1 << folio_order(folio));
>>> }
>>>
>>> alloced:
>>> + /*
>>> + * We need to split an existing large entry if swapin brought in a
>>> + * smaller folio due to various of reasons.
>>> + *
>>> + * And worth noting there is a special case: if there is a smaller
>>> + * cached folio that covers @swap, but not @index (it only covers
>>> + * first few sub entries of the large entry, but @index points to
>>> + * later parts), the swap cache lookup will still see this folio,
>>> + * And we need to split the large entry here. Later checks will fail,
>>> + * as it can't satisfy the swap requirement, and we will retry
>>> + * the swapin from beginning.
>>> + */
>>> + swap_order = folio_order(folio);
>>
>> Nit: 'swap_order' is confusing, and can you just use folio_order() or a
>> btter name?
>
> Good idea.
>
>>
>>> + if (order > swap_order) {
>>> + error = shmem_split_swap_entry(inode, index, swap, gfp);
>>> + if (error)
>>> + goto failed_nolock;
>>> + }
>>> +
>>> + index = round_down(index, 1 << swap_order);
>>> + swap.val = round_down(swap.val, 1 << swap_order);
>>
>> The round_down() of index and swap value here may cause
>> shmem_add_to_page_cache() to fail to insert a new folio, because the
>> swap value stored at that index in the shmem mapping does not match,
>> leading to another swapin page fault for correction.
>>
>> For example, shmem stores a large swap entry of order 4 in the range of
>> index 0-64. When a swapin fault occurs at index = 3, with swap.val =
>> 0x4000, if a split happens and this round_down() logic is applied, then
>> index = 3, swap.val = 0x4000. However, the actual swap.val should be
>> 0x4003 stored in the shmem mapping. This would cause another swapin fault.
>
> Oops, I missed a swap value fixup in the !SWP_SYNCHRONOUS_IO path
> above, it should re-calculate the swap value there. It's fixed in the
> final patch but left unhandled here. I'll update this part.
>
>>
>> I still prefer my original alignment method, and do you find this will
>> cause any issues?
>>
>> "
>> if (split_order > 0) {
>> pgoff_t offset = index - round_down(index, 1 << split_order);
>>
>> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>> }
>> "
>
> It only fits the cached swapin and uncached swapin, not the cache hit
> case. Cache hits may see a larger folio so split didn't happen, but
> the round_down is still needed.
IMO, this only fits the large swap entry split case.
> And there is another racy case: before this patch, the split may
> happen first, but shmem_swapin_cluster brought in a large folio due to
> race in the swap cache layer.
shmem_swapin_cluster() can only allocate order 0 folio, right?
> And I'm not sure if split_order is always reliable here, for example
> concurrent split may return an inaccurate value here.
We've held the xas lock to ensure the split is reliable. Even though
concurrent splits may occur, only one split can get the large
'split_order'; the other will return 0 (since it will see the large swap
entry has already been split).
Based on your current patch, would the following modifications be clearer?
diff --git a/mm/shmem.c b/mm/shmem.c
index 5be9c905396e..91c071fb7b67 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2254,7 +2254,7 @@ static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
if (xas_error(&xas))
return xas_error(&xas);
- return 0;
+ return split_order;
}
/*
@@ -2351,10 +2351,23 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
error = shmem_split_swap_entry(inode, index, swap, gfp);
if (error)
goto failed_nolock;
- }
- index = round_down(index, 1 << swap_order);
- swap.val = round_down(swap.val, 1 << swap_order);
+ /*
+ * If the large swap entry has already been split, it is
+ * necessary to recalculate the new swap entry based on
+ * the old order alignment.
+ */
+ if (split_order > 0) {
+ pgoff_t offset = index - round_down(index, 1 << split_order);
+
+ swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+ }
+ } else if (order < folio_order(folio)) {
+ /*
+ * TODO: explain the possible race...
+ */
+ swap.val = round_down(swap.val, 1 << folio_order(folio));
+ }
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
@@ -2382,7 +2395,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
- error = shmem_add_to_page_cache(folio, mapping, index,
+ error = shmem_add_to_page_cache(folio, mapping,
+ round_down(index, nr_pages),
swp_to_radix_entry(swap), gfp);
if (error)
goto failed;
> So I wanted to simplify it: by round_down(folio_order(folio)) we
> simply get the index and swap that will be covered by this specific
> folio, if the covered range still has the corresponding swap entries
> (check and ensured by shmem_add_to_page_cache which holds the
> xa_lock), then the folio will be inserted back safely and
> successfully.
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
2025-06-30 9:53 ` Baolin Wang
@ 2025-06-30 10:06 ` Kairui Song
2025-06-30 11:59 ` Baolin Wang
0 siblings, 1 reply; 23+ messages in thread
From: Kairui Song @ 2025-06-30 10:06 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
On Mon, Jun 30, 2025 at 5:53 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
> On 2025/6/30 17:16, Kairui Song wrote:
> > On Mon, Jun 30, 2025 at 2:34 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >> On 2025/6/27 14:20, Kairui Song wrote:
> >>> From: Kairui Song <kasong@tencent.com>
> >>>
> >>> Instead of keeping different paths of splitting the entry and
> >>> recalculating the swap entry and index, do it in one place.
> >>>
> >>> Whenever swapin brought in a folio smaller than the entry, split the
> >>> entry. And always recalculate the entry and index, in case it might
> >>> read in a folio that's larger than the entry order. This removes
> >>> duplicated code and function calls, and makes the code more robust.
> >>>
> >>> Signed-off-by: Kairui Song <kasong@tencent.com>
> >>> ---
> >>> mm/shmem.c | 103 +++++++++++++++++++++--------------------------------
> >>> 1 file changed, 41 insertions(+), 62 deletions(-)
> >>>
> >>> diff --git a/mm/shmem.c b/mm/shmem.c
> >>> index f85a985167c5..5be9c905396e 100644
> >>> --- a/mm/shmem.c
> >>> +++ b/mm/shmem.c
> >>> @@ -2178,8 +2178,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
> >>> swap_free_nr(swap, nr_pages);
> >>> }
> >>>
> >>> -static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> >>> - swp_entry_t swap, gfp_t gfp)
> >>> +/*
> >>> + * Split an existing large swap entry. @index should point to one sub mapping
> >>> + * slot within the entry @swap, this sub slot will be split into order 0.
> >>> + */
> >>> +static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
> >>> + swp_entry_t swap, gfp_t gfp)
> >>> {
> >>> struct address_space *mapping = inode->i_mapping;
> >>> XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
> >>> @@ -2250,7 +2254,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> >>> if (xas_error(&xas))
> >>> return xas_error(&xas);
> >>>
> >>> - return entry_order;
> >>> + return 0;
> >>> }
> >>>
> >>> /*
> >>> @@ -2267,11 +2271,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>> struct address_space *mapping = inode->i_mapping;
> >>> struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> >>> struct shmem_inode_info *info = SHMEM_I(inode);
> >>> + int error, nr_pages, order, swap_order;
> >>> struct swap_info_struct *si;
> >>> struct folio *folio = NULL;
> >>> bool skip_swapcache = false;
> >>> swp_entry_t swap;
> >>> - int error, nr_pages, order, split_order;
> >>>
> >>> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> >>> swap = radix_to_swp_entry(*foliop);
> >>> @@ -2321,70 +2325,43 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>> goto failed;
> >>> }
> >>>
> >>> - /*
> >>> - * Now swap device can only swap in order 0 folio, then we
> >>> - * should split the large swap entry stored in the pagecache
> >>> - * if necessary.
> >>> - */
> >>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> >>> - if (split_order < 0) {
> >>> - error = split_order;
> >>> - goto failed;
> >>> - }
> >>> -
> >>> - /*
> >>> - * If the large swap entry has already been split, it is
> >>> - * necessary to recalculate the new swap entry based on
> >>> - * the old order alignment.
> >>> - */
> >>> - if (split_order > 0) {
> >>> - pgoff_t offset = index - round_down(index, 1 << split_order);
> >>> -
> >>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> >>> - }
> >>> -
> >>> /* Here we actually start the io */
> >>> folio = shmem_swapin_cluster(swap, gfp, info, index);
> >>> if (!folio) {
> >>> error = -ENOMEM;
> >>> goto failed;
> >>> }
> >>> - } else if (order > folio_order(folio)) {
> >>> - /*
> >>> - * Swap readahead may swap in order 0 folios into swapcache
> >>> - * asynchronously, while the shmem mapping can still stores
> >>> - * large swap entries. In such cases, we should split the
> >>> - * large swap entry to prevent possible data corruption.
> >>> - */
> >>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> >>> - if (split_order < 0) {
> >>> - folio_put(folio);
> >>> - folio = NULL;
> >>> - error = split_order;
> >>> - goto failed;
> >>> - }
> >>> -
> >>> - /*
> >>> - * If the large swap entry has already been split, it is
> >>> - * necessary to recalculate the new swap entry based on
> >>> - * the old order alignment.
> >>> - */
> >>> - if (split_order > 0) {
> >>> - pgoff_t offset = index - round_down(index, 1 << split_order);
> >>> -
> >>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> >>> - }
> >>> - } else if (order < folio_order(folio)) {
> >>> - swap.val = round_down(swap.val, 1 << folio_order(folio));
> >>> }
> >>>
> >>> alloced:
> >>> + /*
> >>> + * We need to split an existing large entry if swapin brought in a
> >>> + * smaller folio due to various of reasons.
> >>> + *
> >>> + * And worth noting there is a special case: if there is a smaller
> >>> + * cached folio that covers @swap, but not @index (it only covers
> >>> + * first few sub entries of the large entry, but @index points to
> >>> + * later parts), the swap cache lookup will still see this folio,
> >>> + * And we need to split the large entry here. Later checks will fail,
> >>> + * as it can't satisfy the swap requirement, and we will retry
> >>> + * the swapin from beginning.
> >>> + */
> >>> + swap_order = folio_order(folio);
> >>
> >> Nit: 'swap_order' is confusing, and can you just use folio_order() or a
> >> btter name?
> >
> > Good idea.
> >
> >>
> >>> + if (order > swap_order) {
> >>> + error = shmem_split_swap_entry(inode, index, swap, gfp);
> >>> + if (error)
> >>> + goto failed_nolock;
> >>> + }
> >>> +
> >>> + index = round_down(index, 1 << swap_order);
> >>> + swap.val = round_down(swap.val, 1 << swap_order);
> >>
> >> The round_down() of index and swap value here may cause
> >> shmem_add_to_page_cache() to fail to insert a new folio, because the
> >> swap value stored at that index in the shmem mapping does not match,
> >> leading to another swapin page fault for correction.
> >>
> >> For example, shmem stores a large swap entry of order 4 in the range of
> >> index 0-64. When a swapin fault occurs at index = 3, with swap.val =
> >> 0x4000, if a split happens and this round_down() logic is applied, then
> >> index = 3, swap.val = 0x4000. However, the actual swap.val should be
> >> 0x4003 stored in the shmem mapping. This would cause another swapin fault.
> >
> > Oops, I missed a swap value fixup in the !SWP_SYNCHRONOUS_IO path
> > above, it should re-calculate the swap value there. It's fixed in the
> > final patch but left unhandled here. I'll update this part.
> >
> >>
> >> I still prefer my original alignment method, and do you find this will
> >> cause any issues?
> >>
> >> "
> >> if (split_order > 0) {
> >> pgoff_t offset = index - round_down(index, 1 << split_order);
> >>
> >> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> >> }
> >> "
> >
> > It only fits the cached swapin and uncached swapin, not the cache hit
> > case. Cache hits may see a larger folio so split didn't happen, but
> > the round_down is still needed.
>
> IMO, this only fits for the large swap entry split case.
>
> > And there is another racy case: before this patch, the split may
> > happen first, but shmem_swapin_cluster brought in a large folio due to
> > race in the swap cache layer.
>
> shmem_swapin_cluster() can only allocate order 0 folio, right?
It can only allocate an order 0 folio, but it can still hit a large
folio: e.g. a parallel swapout/swapin happened and the large folio stays
in the swap cache while we are handling a swapin here.
>
> > And I'm not sure if split_order is always reliable here, for example
> > concurrent split may return an inaccurate value here.
>
> We've held the xas lock to ensure the split is reliable, even though
> concurrent splits may occur, only one split can get the large
> 'split_order', another will return 0 (since it will see the large swao
> entry has already been split).
Yes, it may return 0, so we can get a large folio here but with
`split_order = 0`?
And if a concurrent swapout/swapin happened, couldn't `split_order` end
up being a different value?
>
> Based on your current patch, would the following modifications be clearer?
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 5be9c905396e..91c071fb7b67 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2254,7 +2254,7 @@ static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
> if (xas_error(&xas))
> return xas_error(&xas);
>
> - return 0;
> + return split_order;
> }
>
> /*
> @@ -2351,10 +2351,23 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> error = shmem_split_swap_entry(inode, index, swap, gfp);
> if (error)
> goto failed_nolock;
> - }
>
> - index = round_down(index, 1 << swap_order);
> - swap.val = round_down(swap.val, 1 << swap_order);
> + /*
> + * If the large swap entry has already been split, it is
> + * necessary to recalculate the new swap entry based on
> + * the old order alignment.
> + */
> + if (split_order > 0) {
> + pgoff_t offset = index - round_down(index, 1 << split_order);
> +
> + swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> + }
> + } else if (order < folio_order(folio)) {
> + /*
> + * TODO; explain the posible race...
> + */
> + swap.val = round_down(swap.val, 1 << folio_order(folio));
> + }
>
> /* We have to do this with folio locked to prevent races */
> folio_lock(folio);
> @@ -2382,7 +2395,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> goto failed;
> }
>
> - error = shmem_add_to_page_cache(folio, mapping, index,
> + error = shmem_add_to_page_cache(folio, mapping,
> + round_down(index, nr_pages),
> swp_to_radix_entry(swap), gfp);
> if (error)
> goto failed;
>
> > So I wanted to simplify it: by round_down(folio_order(folio)) we
> > simply get the index and swap that will be covered by this specific
> > folio, if the covered range still has the corresponding swap entries
> > (check and ensured by shmem_add_to_page_cache which holds the
> > xa_lock), then the folio will be inserted back safely and
> > successfully.
>
I think adding the missing swap value fixup in the !SYNC_IO path
should be good enough?
diff --git a/mm/shmem.c b/mm/shmem.c
index 5be9c905396e..2620e4d1b56a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2276,6 +2276,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct folio *folio = NULL;
bool skip_swapcache = false;
swp_entry_t swap;
+ pgoff_t offset;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
swap = radix_to_swp_entry(*foliop);
@@ -2325,7 +2326,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
- /* Here we actually start the io */
+ /* Cached swapin currently only supports order 0 swapin */
+ /* It may hit a large folio but that's OK and handled below */
+ offset = index - round_down(index, 1 << order);
+ swap.val = swap.val + offset;
folio = shmem_swapin_cluster(swap, gfp, info, index);
if (!folio) {
error = -ENOMEM;
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
2025-06-30 10:06 ` Kairui Song
@ 2025-06-30 11:59 ` Baolin Wang
2025-06-30 18:19 ` Kairui Song
0 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2025-06-30 11:59 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/6/30 18:06, Kairui Song wrote:
> On Mon, Jun 30, 2025 at 5:53 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>> On 2025/6/30 17:16, Kairui Song wrote:
>>> On Mon, Jun 30, 2025 at 2:34 PM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>> On 2025/6/27 14:20, Kairui Song wrote:
>>>>> From: Kairui Song <kasong@tencent.com>
>>>>>
>>>>> Instead of keeping different paths of splitting the entry and
>>>>> recalculating the swap entry and index, do it in one place.
>>>>>
>>>>> Whenever swapin brought in a folio smaller than the entry, split the
>>>>> entry. And always recalculate the entry and index, in case it might
>>>>> read in a folio that's larger than the entry order. This removes
>>>>> duplicated code and function calls, and makes the code more robust.
>>>>>
>>>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>>>> ---
>>>>> mm/shmem.c | 103 +++++++++++++++++++++--------------------------------
>>>>> 1 file changed, 41 insertions(+), 62 deletions(-)
>>>>>
>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>> index f85a985167c5..5be9c905396e 100644
>>>>> --- a/mm/shmem.c
>>>>> +++ b/mm/shmem.c
>>>>> @@ -2178,8 +2178,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>>>> swap_free_nr(swap, nr_pages);
>>>>> }
>>>>>
>>>>> -static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>> - swp_entry_t swap, gfp_t gfp)
>>>>> +/*
>>>>> + * Split an existing large swap entry. @index should point to one sub mapping
>>>>> + * slot within the entry @swap, this sub slot will be split into order 0.
>>>>> + */
>>>>> +static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
>>>>> + swp_entry_t swap, gfp_t gfp)
>>>>> {
>>>>> struct address_space *mapping = inode->i_mapping;
>>>>> XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>>>>> @@ -2250,7 +2254,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>> if (xas_error(&xas))
>>>>> return xas_error(&xas);
>>>>>
>>>>> - return entry_order;
>>>>> + return 0;
>>>>> }
>>>>>
>>>>> /*
>>>>> @@ -2267,11 +2271,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>>> struct address_space *mapping = inode->i_mapping;
>>>>> struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
>>>>> struct shmem_inode_info *info = SHMEM_I(inode);
>>>>> + int error, nr_pages, order, swap_order;
>>>>> struct swap_info_struct *si;
>>>>> struct folio *folio = NULL;
>>>>> bool skip_swapcache = false;
>>>>> swp_entry_t swap;
>>>>> - int error, nr_pages, order, split_order;
>>>>>
>>>>> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
>>>>> swap = radix_to_swp_entry(*foliop);
>>>>> @@ -2321,70 +2325,43 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>>> goto failed;
>>>>> }
>>>>>
>>>>> - /*
>>>>> - * Now swap device can only swap in order 0 folio, then we
>>>>> - * should split the large swap entry stored in the pagecache
>>>>> - * if necessary.
>>>>> - */
>>>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>>>> - if (split_order < 0) {
>>>>> - error = split_order;
>>>>> - goto failed;
>>>>> - }
>>>>> -
>>>>> - /*
>>>>> - * If the large swap entry has already been split, it is
>>>>> - * necessary to recalculate the new swap entry based on
>>>>> - * the old order alignment.
>>>>> - */
>>>>> - if (split_order > 0) {
>>>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>> -
>>>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>> - }
>>>>> -
>>>>> /* Here we actually start the io */
>>>>> folio = shmem_swapin_cluster(swap, gfp, info, index);
>>>>> if (!folio) {
>>>>> error = -ENOMEM;
>>>>> goto failed;
>>>>> }
>>>>> - } else if (order > folio_order(folio)) {
>>>>> - /*
>>>>> - * Swap readahead may swap in order 0 folios into swapcache
>>>>> - * asynchronously, while the shmem mapping can still stores
>>>>> - * large swap entries. In such cases, we should split the
>>>>> - * large swap entry to prevent possible data corruption.
>>>>> - */
>>>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>>>> - if (split_order < 0) {
>>>>> - folio_put(folio);
>>>>> - folio = NULL;
>>>>> - error = split_order;
>>>>> - goto failed;
>>>>> - }
>>>>> -
>>>>> - /*
>>>>> - * If the large swap entry has already been split, it is
>>>>> - * necessary to recalculate the new swap entry based on
>>>>> - * the old order alignment.
>>>>> - */
>>>>> - if (split_order > 0) {
>>>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>> -
>>>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>> - }
>>>>> - } else if (order < folio_order(folio)) {
>>>>> - swap.val = round_down(swap.val, 1 << folio_order(folio));
>>>>> }
>>>>>
>>>>> alloced:
>>>>> + /*
>>>>> + * We need to split an existing large entry if swapin brought in a
>>>>> + * smaller folio due to various of reasons.
>>>>> + *
>>>>> + * And worth noting there is a special case: if there is a smaller
>>>>> + * cached folio that covers @swap, but not @index (it only covers
>>>>> + * first few sub entries of the large entry, but @index points to
>>>>> + * later parts), the swap cache lookup will still see this folio,
>>>>> + * And we need to split the large entry here. Later checks will fail,
>>>>> + * as it can't satisfy the swap requirement, and we will retry
>>>>> + * the swapin from beginning.
>>>>> + */
>>>>> + swap_order = folio_order(folio);
>>>>
>>>> Nit: 'swap_order' is confusing, and can you just use folio_order() or a
>>>> btter name?
>>>
>>> Good idea.
>>>
>>>>
>>>>> + if (order > swap_order) {
>>>>> + error = shmem_split_swap_entry(inode, index, swap, gfp);
>>>>> + if (error)
>>>>> + goto failed_nolock;
>>>>> + }
>>>>> +
>>>>> + index = round_down(index, 1 << swap_order);
>>>>> + swap.val = round_down(swap.val, 1 << swap_order);
>>>>
>>>> The round_down() of index and swap value here may cause
>>>> shmem_add_to_page_cache() to fail to insert a new folio, because the
>>>> swap value stored at that index in the shmem mapping does not match,
>>>> leading to another swapin page fault for correction.
>>>>
>>>> For example, shmem stores a large swap entry of order 4 in the range of
>>>> index 0-64. When a swapin fault occurs at index = 3, with swap.val =
>>>> 0x4000, if a split happens and this round_down() logic is applied, then
>>>> index = 3, swap.val = 0x4000. However, the actual swap.val should be
>>>> 0x4003 stored in the shmem mapping. This would cause another swapin fault.
>>>
>>> Oops, I missed a swap value fixup in the !SWP_SYNCHRONOUS_IO path
>>> above, it should re-calculate the swap value there. It's fixed in the
>>> final patch but left unhandled here. I'll update this part.
>>>
>>>>
>>>> I still prefer my original alignment method, and do you find this will
>>>> cause any issues?
>>>>
>>>> "
>>>> if (split_order > 0) {
>>>> pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>
>>>> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>> }
>>>> "
>>>
>>> It only fits the cached swapin and uncached swapin, not the cache hit
>>> case. Cache hits may see a larger folio so split didn't happen, but
>>> the round_down is still needed.
>>
>> IMO, this only fits for the large swap entry split case.
>>
>>> And there is another racy case: before this patch, the split may
>>> happen first, but shmem_swapin_cluster brought in a large folio due to
>>> race in the swap cache layer.
>>
>> shmem_swapin_cluster() can only allocate order 0 folio, right?
>
> It can only allocate order 0 folio, but It can hit a large folio: eg.
> a parallel swapin/swapout happened, and the folio stays in swap cache,
> while we are handling a swapin here.
Yes, I know this. This is exactly the issue that patch 1 wants to fix.
>>> And I'm not sure if split_order is always reliable here, for example
>>> concurrent split may return an inaccurate value here.
>>
>> We've held the xas lock to ensure the split is reliable, even though
>> concurrent splits may occur, only one split can get the large
>> 'split_order', another will return 0 (since it will see the large swao
>> entry has already been split).
>
> Yes, it may return 0, so we can get a large folio here, but get
> `split_order = 0`?
If a split happens, that means 'order' > folio_order(), right? How can
you get a large folio in this context?
> And if concurrently swapout/swapin happened, the `split_order` could
> be a different value?
What do you mean by a different value? The large swap entry can only be
split once, so 'split_order' must be 0 or the original large order.
>> Based on your current patch, would the following modifications be clearer?
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 5be9c905396e..91c071fb7b67 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2254,7 +2254,7 @@ static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
>> if (xas_error(&xas))
>> return xas_error(&xas);
>>
>> - return 0;
>> + return split_order;
>> }
>>
>> /*
>> @@ -2351,10 +2351,23 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>> error = shmem_split_swap_entry(inode, index, swap, gfp);
>> if (error)
>> goto failed_nolock;
>> - }
>>
>> - index = round_down(index, 1 << swap_order);
>> - swap.val = round_down(swap.val, 1 << swap_order);
>> + /*
>> + * If the large swap entry has already been split, it is
>> + * necessary to recalculate the new swap entry based on
>> + * the old order alignment.
>> + */
>> + if (split_order > 0) {
>> + pgoff_t offset = index - round_down(index, 1 << split_order);
>> +
>> + swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>> + }
>> + } else if (order < folio_order(folio)) {
>> + /*
>> + * TODO; explain the posible race...
>> + */
>> + swap.val = round_down(swap.val, 1 << folio_order(folio));
>> + }
Sorry, your changes did not convince me. I still prefer my version,
which lists out the possible races that might require changes to the
swap value, as it would be clearer and more readable. Additionally, this
is a cleanup patch, so I hope there are no implicit functional changes.
>> /* We have to do this with folio locked to prevent races */
>> folio_lock(folio);
>> @@ -2382,7 +2395,8 @@ static int shmem_swapin_folio(struct inode *inode,
>> pgoff_t index,
>> goto failed;
>> }
>>
>> - error = shmem_add_to_page_cache(folio, mapping, index,
>> + error = shmem_add_to_page_cache(folio, mapping,
>> + round_down(index, nr_pages),
>> swp_to_radix_entry(swap), gfp);
>> if (error)
>> goto failed;
>>
>>> So I wanted to simplify it: by round_down(folio_order(folio)) we
>>> simply get the index and swap that will be covered by this specific
>>> folio, if the covered range still has the corresponding swap entries
>>> (check and ensured by shmem_add_to_page_cache which holds the
>>> xa_lock), then the folio will be inserted back safely and
>>> successfully.
>>
>
> I think adding the missing swap value fixup in the !SYNC_IO path
> should be good enough?
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 5be9c905396e..2620e4d1b56a 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2276,6 +2276,7 @@ static int shmem_swapin_folio(struct inode
> *inode, pgoff_t index,
> struct folio *folio = NULL;
> bool skip_swapcache = false;
> swp_entry_t swap;
> + pgoff_t offset;
>
> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> swap = radix_to_swp_entry(*foliop);
> @@ -2325,7 +2326,9 @@ static int shmem_swapin_folio(struct inode
> *inode, pgoff_t index,
> goto failed;
> }
>
> - /* Here we actually start the io */
> + /* Cached swapin currently only supports order 0 swapin */
> + /* It may hit a large folio but that's OK and handled below */
> + offset = index - round_down(index, 1 << order);
> + swap.val = swap.val + offset;
>
> folio = shmem_swapin_cluster(swap, gfp, info, index);
> if (!folio) {
> error = -ENOMEM;
* Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
2025-06-30 11:59 ` Baolin Wang
@ 2025-06-30 18:19 ` Kairui Song
2025-07-01 1:57 ` Baolin Wang
0 siblings, 1 reply; 23+ messages in thread
From: Kairui Song @ 2025-06-30 18:19 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
On Mon, Jun 30, 2025 at 7:59 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/6/30 18:06, Kairui Song wrote:
> > On Mon, Jun 30, 2025 at 5:53 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >> On 2025/6/30 17:16, Kairui Song wrote:
> >>> On Mon, Jun 30, 2025 at 2:34 PM Baolin Wang
> >>> <baolin.wang@linux.alibaba.com> wrote:
> >>>> On 2025/6/27 14:20, Kairui Song wrote:
> >>>>> From: Kairui Song <kasong@tencent.com>
> >>>>>
> >>>>> Instead of keeping different paths of splitting the entry and
> >>>>> recalculating the swap entry and index, do it in one place.
> >>>>>
> >>>>> Whenever swapin brought in a folio smaller than the entry, split the
> >>>>> entry. And always recalculate the entry and index, in case it might
> >>>>> read in a folio that's larger than the entry order. This removes
> >>>>> duplicated code and function calls, and makes the code more robust.
> >>>>>
> >>>>> Signed-off-by: Kairui Song <kasong@tencent.com>
> >>>>> ---
> >>>>> mm/shmem.c | 103 +++++++++++++++++++++--------------------------------
> >>>>> 1 file changed, 41 insertions(+), 62 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/shmem.c b/mm/shmem.c
> >>>>> index f85a985167c5..5be9c905396e 100644
> >>>>> --- a/mm/shmem.c
> >>>>> +++ b/mm/shmem.c
> >>>>> @@ -2178,8 +2178,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
> >>>>> swap_free_nr(swap, nr_pages);
> >>>>> }
> >>>>>
> >>>>> -static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> >>>>> - swp_entry_t swap, gfp_t gfp)
> >>>>> +/*
> >>>>> + * Split an existing large swap entry. @index should point to one sub mapping
> >>>>> + * slot within the entry @swap, this sub slot will be split into order 0.
> >>>>> + */
> >>>>> +static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
> >>>>> + swp_entry_t swap, gfp_t gfp)
> >>>>> {
> >>>>> struct address_space *mapping = inode->i_mapping;
> >>>>> XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
> >>>>> @@ -2250,7 +2254,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> >>>>> if (xas_error(&xas))
> >>>>> return xas_error(&xas);
> >>>>>
> >>>>> - return entry_order;
> >>>>> + return 0;
> >>>>> }
> >>>>>
> >>>>> /*
> >>>>> @@ -2267,11 +2271,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>>>> struct address_space *mapping = inode->i_mapping;
> >>>>> struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> >>>>> struct shmem_inode_info *info = SHMEM_I(inode);
> >>>>> + int error, nr_pages, order, swap_order;
> >>>>> struct swap_info_struct *si;
> >>>>> struct folio *folio = NULL;
> >>>>> bool skip_swapcache = false;
> >>>>> swp_entry_t swap;
> >>>>> - int error, nr_pages, order, split_order;
> >>>>>
> >>>>> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> >>>>> swap = radix_to_swp_entry(*foliop);
> >>>>> @@ -2321,70 +2325,43 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>>>> goto failed;
> >>>>> }
> >>>>>
> >>>>> - /*
> >>>>> - * Now swap device can only swap in order 0 folio, then we
> >>>>> - * should split the large swap entry stored in the pagecache
> >>>>> - * if necessary.
> >>>>> - */
> >>>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> >>>>> - if (split_order < 0) {
> >>>>> - error = split_order;
> >>>>> - goto failed;
> >>>>> - }
> >>>>> -
> >>>>> - /*
> >>>>> - * If the large swap entry has already been split, it is
> >>>>> - * necessary to recalculate the new swap entry based on
> >>>>> - * the old order alignment.
> >>>>> - */
> >>>>> - if (split_order > 0) {
> >>>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
> >>>>> -
> >>>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> >>>>> - }
> >>>>> -
> >>>>> /* Here we actually start the io */
> >>>>> folio = shmem_swapin_cluster(swap, gfp, info, index);
> >>>>> if (!folio) {
> >>>>> error = -ENOMEM;
> >>>>> goto failed;
> >>>>> }
> >>>>> - } else if (order > folio_order(folio)) {
> >>>>> - /*
> >>>>> - * Swap readahead may swap in order 0 folios into swapcache
> >>>>> - * asynchronously, while the shmem mapping can still stores
> >>>>> - * large swap entries. In such cases, we should split the
> >>>>> - * large swap entry to prevent possible data corruption.
> >>>>> - */
> >>>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> >>>>> - if (split_order < 0) {
> >>>>> - folio_put(folio);
> >>>>> - folio = NULL;
> >>>>> - error = split_order;
> >>>>> - goto failed;
> >>>>> - }
> >>>>> -
> >>>>> - /*
> >>>>> - * If the large swap entry has already been split, it is
> >>>>> - * necessary to recalculate the new swap entry based on
> >>>>> - * the old order alignment.
> >>>>> - */
> >>>>> - if (split_order > 0) {
> >>>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
> >>>>> -
> >>>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> >>>>> - }
> >>>>> - } else if (order < folio_order(folio)) {
> >>>>> - swap.val = round_down(swap.val, 1 << folio_order(folio));
> >>>>> }
> >>>>>
> >>>>> alloced:
> >>>>> + /*
> >>>>> + * We need to split an existing large entry if swapin brought in a
> >>>>> + * smaller folio due to various of reasons.
> >>>>> + *
> >>>>> + * And worth noting there is a special case: if there is a smaller
> >>>>> + * cached folio that covers @swap, but not @index (it only covers
> >>>>> + * first few sub entries of the large entry, but @index points to
> >>>>> + * later parts), the swap cache lookup will still see this folio,
> >>>>> + * And we need to split the large entry here. Later checks will fail,
> >>>>> + * as it can't satisfy the swap requirement, and we will retry
> >>>>> + * the swapin from beginning.
> >>>>> + */
> >>>>> + swap_order = folio_order(folio);
> >>>>
> >>>> Nit: 'swap_order' is confusing, and can you just use folio_order() or a
> >>>> btter name?
> >>>
> >>> Good idea.
> >>>
> >>>>
> >>>>> + if (order > swap_order) {
> >>>>> + error = shmem_split_swap_entry(inode, index, swap, gfp);
> >>>>> + if (error)
> >>>>> + goto failed_nolock;
> >>>>> + }
> >>>>> +
> >>>>> + index = round_down(index, 1 << swap_order);
> >>>>> + swap.val = round_down(swap.val, 1 << swap_order);
> >>>>
> >>>> The round_down() of index and swap value here may cause
> >>>> shmem_add_to_page_cache() to fail to insert a new folio, because the
> >>>> swap value stored at that index in the shmem mapping does not match,
> >>>> leading to another swapin page fault for correction.
> >>>>
> >>>> For example, shmem stores a large swap entry of order 4 in the range of
> >>>> index 0-64. When a swapin fault occurs at index = 3, with swap.val =
> >>>> 0x4000, if a split happens and this round_down() logic is applied, then
> >>>> index = 3, swap.val = 0x4000. However, the actual swap.val should be
> >>>> 0x4003 stored in the shmem mapping. This would cause another swapin fault.
> >>>
> >>> Oops, I missed a swap value fixup in the !SWP_SYNCHRONOUS_IO path
> >>> above, it should re-calculate the swap value there. It's fixed in the
> >>> final patch but left unhandled here. I'll update this part.
> >>>
> >>>>
> >>>> I still prefer my original alignment method, and do you find this will
> >>>> cause any issues?
> >>>>
> >>>> "
> >>>> if (split_order > 0) {
> >>>> pgoff_t offset = index - round_down(index, 1 << split_order);
> >>>>
> >>>> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> >>>> }
> >>>> "
> >>>
> >>> It only fits the cached swapin and uncached swapin, not the cache hit
> >>> case. Cache hits may see a larger folio so split didn't happen, but
> >>> the round_down is still needed.
> >>
> >> IMO, this only fits for the large swap entry split case.
> >>
> >>> And there is another racy case: before this patch, the split may
> >>> happen first, but shmem_swapin_cluster brought in a large folio due to
> >>> race in the swap cache layer.
> >>
> >> shmem_swapin_cluster() can only allocate order 0 folio, right?
> >
> > It can only allocate order 0 folio, but It can hit a large folio: eg.
> > a parallel swapin/swapout happened, and the folio stays in swap cache,
> > while we are handling a swapin here.
>
> Yes, I know this. This is exactly the issue that patch 1 wants to fix.
Hmm, patch 1 prevents the hang but doesn't prevent things like a
duplicated fault:
Starts with:
swap entry val = 0x400, order = 4, covering index 0 - 15, faulting index 3:
Before this patch:
CPU0 CPU1
shmem_swapin_folio (see order = 4)
shmem_swapin_folio (see order = 4)
/* fallback to order 0 due to */
/* mem pressure / temporary pin / etc */
shmem_split_large_entry
/* split to order 0 */
/* interrupted */
/* swapin done with order = 4 folio */
/* swapout again, leave the large folio
in cache temporarily */
folio = swap_cluster_readahead(0x403)
/* Gets folio order = 4, folio->swap = 0x400
since swap_cluster_readahead uses swap cache */
folio->swap.val != swap.val
/* ! Above test failed ! */
/* It shouldn't fail the round down is needed */
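As a minimal userspace sketch of the check arithmetic in this example
(round_down() is open-coded here since this is outside the kernel, and
the entry values and order are the hypothetical ones above):

#include <stdio.h>

/* open-coded equivalent of the kernel's round_down() for powers of two */
#define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

int main(void)
{
	unsigned long swap_val = 0x403;    /* entry for faulting index 3 */
	unsigned long folio_swap = 0x400;  /* order 4 folio found in swap cache */
	unsigned int folio_order = 4;

	/* The direct check fails although the folio covers the entry */
	printf("folio->swap == swap? %d\n", folio_swap == swap_val);

	/* Rounding down by the folio's order recovers the covering entry */
	swap_val = round_down(swap_val, 1UL << folio_order);
	printf("after round_down: 0x%lx (matches folio? %d)\n",
	       swap_val, folio_swap == swap_val);
	return 0;
}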
This patch moved the split after the swapin so it should be OK now,
but still the split_order could be unstable, see below:
>
> >>> And I'm not sure if split_order is always reliable here, for example
> >>> concurrent split may return an inaccurate value here.
> >>
> >> We've held the xas lock to ensure the split is reliable, even though
> >> concurrent splits may occur, only one split can get the large
> >> 'split_order', another will return 0 (since it will see the large swao
> >> entry has already been split).
> >
> > Yes, it may return 0, so we can get a large folio here, but get
> > `split_order = 0`?
>
> If split happens, which means the 'order' > folio_order(), right? how
> can you get a large folio in this context?
>
> > And if concurrently swapout/swapin happened, the `split_order` could
> > be a different value?
>
> What do you mean different value? The large swap entry can only be split
> once, so the 'split_order' must be 0 or the original large order.
Since d53c78fffe7ad, shmem_split_large_entry doesn't split every slot
into order 0 IIUC, so things get complex if two CPUs are faulting on
different indexes landing into two different splitting zones:
Before this patch:
swap entry val = 0x400, order = 9, covering index 0 - 511, faulting index 3:
CPU0 CPU1
shmem_swapin_folio (index = 3)
shmem_swapin_folio (index = 510)
/* Gets swap = 0x400 */ /* Gets swap = 0x400 */
/* fallback to order 0 */ /* fallback to order 0 */
split_order = shmem_split_large_entry
/* get split_order = 512 */
/* offset = 3 */
split_order = shmem_split_large_entry
/* get split_order = 0, but no split */
/* map order is 8, offset = 0 */
/* wrong offset */
shmem_swapin_cluster(0x400)
/* It should swapin 0x5fe */
After this patch (with the appended fix, which was left to the latest
patch by mistake) it won't swap in the wrong entry now, but
shmem_split_large_entry may still return an outdated order.
Those are two previous races I can come up with. They no longer exist
after this patch. It's not a bug, just redundant IO as far as I can
see, because other checks will fall back, though it still looks a bit
fragile. But shmem_split_large_entry may still return an invalid order,
just much less likely.
I think the idea here is: because the `order =
shmem_confirm_swap(mapping, index, swap)` ensures the order is stable
and corresponds to the entry value at one point, keeping that value is
better (and so this patch does the offset and swap calculation using the
`order` retrieved there before issuing the swapin).
And after the swapin has brought a folio in, simply round down using
the folio's order, which should ensure the folio can be added
successfully in any case, as long as the folio->swap and index fit the
shmem mapping fine.
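To make the two recalculations being discussed concrete, here is a
small userspace sketch (round_down() open-coded; the base entry 0x4000
with order 4 and faulting index 3 are the hypothetical values used
earlier in this thread):

#include <stdio.h>

#define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

int main(void)
{
	unsigned long base = 0x4000;   /* large entry of order 4, indexes 0..15 */
	unsigned long index = 3;       /* faulting index */
	unsigned int split_order = 4;  /* order reported by a successful split */
	unsigned int folio_order = 4;  /* e.g. a swap cache hit on the old large folio */

	/* Offset-based fixup: select the order 0 sub-entry backing @index */
	unsigned long offset = index - round_down(index, 1UL << split_order);
	printf("offset fixup:     swap = 0x%lx\n", base + offset);

	/* Folio-order round_down: align to whatever folio swapin returned */
	printf("folio round_down: swap = 0x%lx, index = %lu\n",
	       round_down(base + offset, 1UL << folio_order),
	       round_down(index, 1UL << folio_order));
	return 0;
}

With an order 0 folio both approaches end up at 0x4003, while a swap
cache hit on the old order 4 folio only matches after the folio-order
round_down.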
> >> Based on your current patch, would the following modifications be clearer?
> >>
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index 5be9c905396e..91c071fb7b67 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -2254,7 +2254,7 @@ static int shmem_split_swap_entry(struct inode
> >> *inode, pgoff_t index,
> >> if (xas_error(&xas))
> >> return xas_error(&xas);
> >>
> >> - return 0;
> >> + return split_order;
> >> }
> >>
> >> /*
> >> @@ -2351,10 +2351,23 @@ static int shmem_swapin_folio(struct inode
> >> *inode, pgoff_t index,
> >> error = shmem_split_swap_entry(inode, index, swap, gfp);
> >> if (error)
> >> goto failed_nolock;
> >> - }
> >>
> >> - index = round_down(index, 1 << swap_order);
> >> - swap.val = round_down(swap.val, 1 << swap_order);
> >> + /*
> >> + * If the large swap entry has already been split, it is
> >> + * necessary to recalculate the new swap entry based on
> >> + * the old order alignment.
> >> + */
> >> + if (split_order > 0) {
The split_order could still be an outdated value, e.g. we may even get
split_order = 0 with a large folio loaded here.
> >> + pgoff_t offset = index - round_down(index, 1 <<
> >> split_order);
> >> +
> >> + swap = swp_entry(swp_type(swap),
> >> swp_offset(swap) + offset);
> >> + }
> >> + } else if (order < folio_order(folio)) {
> >> + /*
> >> + * TODO; explain the posible race...
> >> + */
> >> + swap.val = round_down(swap.val, 1 << folio_order(folio));
> >> + }
>
> Sorry, you changes did not convince me. I still prefer my changes,
> listing out the possible races that might require changes to the swap
> value, as it would be clearer and more readable. Additionally, this is a
> cleanup patch, so I hope there are no implicit functional changes.
I was thinking that the fewer branches we have, the better; I won't
insist on this though.
>
> >> /* We have to do this with folio locked to prevent races */
> >> folio_lock(folio);
> >> @@ -2382,7 +2395,8 @@ static int shmem_swapin_folio(struct inode *inode,
> >> pgoff_t index,
> >> goto failed;
> >> }
> >>
> >> - error = shmem_add_to_page_cache(folio, mapping, index,
> >> + error = shmem_add_to_page_cache(folio, mapping,
> >> + round_down(index, nr_pages),
> >> swp_to_radix_entry(swap), gfp);
> >> if (error)
> >> goto failed;
> >>
> >>> So I wanted to simplify it: by round_down(folio_order(folio)) we
> >>> simply get the index and swap that will be covered by this specific
> >>> folio, if the covered range still has the corresponding swap entries
> >>> (check and ensured by shmem_add_to_page_cache which holds the
> >>> xa_lock), then the folio will be inserted back safely and
> >>> successfully.
> >>
> >
> > I think adding the missing swap value fixup in the !SYNC_IO path
> > should be good enough?
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 5be9c905396e..2620e4d1b56a 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2276,6 +2276,7 @@ static int shmem_swapin_folio(struct inode
> > *inode, pgoff_t index,
> > struct folio *folio = NULL;
> > bool skip_swapcache = false;
> > swp_entry_t swap;
> > + pgoff_t offset;
> >
> > VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> > swap = radix_to_swp_entry(*foliop);
> > @@ -2325,7 +2326,9 @@ static int shmem_swapin_folio(struct inode
> > *inode, pgoff_t index,
> > goto failed;
> > }
> >
> > - /* Here we actually start the io */
> > + /* Cached swapin currently only supports order 0 swapin */
> > + /* It may hit a large folio but that's OK and handled below */
> > + offset = index - round_down(index, 1 << order);
> > + swap.val = swap.val + offset;
BTW, for this appended fix, it should use a different variable so
`swap` won't be changed, or it will cause a redundant fault again. My bad.
> > folio = shmem_swapin_cluster(swap, gfp, info, index);
> > if (!folio) {
> > error = -ENOMEM;
>
* Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
2025-06-30 18:19 ` Kairui Song
@ 2025-07-01 1:57 ` Baolin Wang
2025-07-01 18:49 ` Kairui Song
0 siblings, 1 reply; 23+ messages in thread
From: Baolin Wang @ 2025-07-01 1:57 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/7/1 02:19, Kairui Song wrote:
> On Mon, Jun 30, 2025 at 7:59 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2025/6/30 18:06, Kairui Song wrote:
>>> On Mon, Jun 30, 2025 at 5:53 PM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>> On 2025/6/30 17:16, Kairui Song wrote:
>>>>> On Mon, Jun 30, 2025 at 2:34 PM Baolin Wang
>>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>>> On 2025/6/27 14:20, Kairui Song wrote:
>>>>>>> From: Kairui Song <kasong@tencent.com>
>>>>>>>
>>>>>>> Instead of keeping different paths of splitting the entry and
>>>>>>> recalculating the swap entry and index, do it in one place.
>>>>>>>
>>>>>>> Whenever swapin brought in a folio smaller than the entry, split the
>>>>>>> entry. And always recalculate the entry and index, in case it might
>>>>>>> read in a folio that's larger than the entry order. This removes
>>>>>>> duplicated code and function calls, and makes the code more robust.
>>>>>>>
>>>>>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>>>>>> ---
>>>>>>> mm/shmem.c | 103 +++++++++++++++++++++--------------------------------
>>>>>>> 1 file changed, 41 insertions(+), 62 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>>>> index f85a985167c5..5be9c905396e 100644
>>>>>>> --- a/mm/shmem.c
>>>>>>> +++ b/mm/shmem.c
>>>>>>> @@ -2178,8 +2178,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>>>>>> swap_free_nr(swap, nr_pages);
>>>>>>> }
>>>>>>>
>>>>>>> -static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>>>> - swp_entry_t swap, gfp_t gfp)
>>>>>>> +/*
>>>>>>> + * Split an existing large swap entry. @index should point to one sub mapping
>>>>>>> + * slot within the entry @swap, this sub slot will be split into order 0.
>>>>>>> + */
>>>>>>> +static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
>>>>>>> + swp_entry_t swap, gfp_t gfp)
>>>>>>> {
>>>>>>> struct address_space *mapping = inode->i_mapping;
>>>>>>> XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>>>>>>> @@ -2250,7 +2254,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>>>> if (xas_error(&xas))
>>>>>>> return xas_error(&xas);
>>>>>>>
>>>>>>> - return entry_order;
>>>>>>> + return 0;
>>>>>>> }
>>>>>>>
>>>>>>> /*
>>>>>>> @@ -2267,11 +2271,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>>>>> struct address_space *mapping = inode->i_mapping;
>>>>>>> struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
>>>>>>> struct shmem_inode_info *info = SHMEM_I(inode);
>>>>>>> + int error, nr_pages, order, swap_order;
>>>>>>> struct swap_info_struct *si;
>>>>>>> struct folio *folio = NULL;
>>>>>>> bool skip_swapcache = false;
>>>>>>> swp_entry_t swap;
>>>>>>> - int error, nr_pages, order, split_order;
>>>>>>>
>>>>>>> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
>>>>>>> swap = radix_to_swp_entry(*foliop);
>>>>>>> @@ -2321,70 +2325,43 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>>>>> goto failed;
>>>>>>> }
>>>>>>>
>>>>>>> - /*
>>>>>>> - * Now swap device can only swap in order 0 folio, then we
>>>>>>> - * should split the large swap entry stored in the pagecache
>>>>>>> - * if necessary.
>>>>>>> - */
>>>>>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>>>>>> - if (split_order < 0) {
>>>>>>> - error = split_order;
>>>>>>> - goto failed;
>>>>>>> - }
>>>>>>> -
>>>>>>> - /*
>>>>>>> - * If the large swap entry has already been split, it is
>>>>>>> - * necessary to recalculate the new swap entry based on
>>>>>>> - * the old order alignment.
>>>>>>> - */
>>>>>>> - if (split_order > 0) {
>>>>>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>> -
>>>>>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>>> - }
>>>>>>> -
>>>>>>> /* Here we actually start the io */
>>>>>>> folio = shmem_swapin_cluster(swap, gfp, info, index);
>>>>>>> if (!folio) {
>>>>>>> error = -ENOMEM;
>>>>>>> goto failed;
>>>>>>> }
>>>>>>> - } else if (order > folio_order(folio)) {
>>>>>>> - /*
>>>>>>> - * Swap readahead may swap in order 0 folios into swapcache
>>>>>>> - * asynchronously, while the shmem mapping can still stores
>>>>>>> - * large swap entries. In such cases, we should split the
>>>>>>> - * large swap entry to prevent possible data corruption.
>>>>>>> - */
>>>>>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>>>>>> - if (split_order < 0) {
>>>>>>> - folio_put(folio);
>>>>>>> - folio = NULL;
>>>>>>> - error = split_order;
>>>>>>> - goto failed;
>>>>>>> - }
>>>>>>> -
>>>>>>> - /*
>>>>>>> - * If the large swap entry has already been split, it is
>>>>>>> - * necessary to recalculate the new swap entry based on
>>>>>>> - * the old order alignment.
>>>>>>> - */
>>>>>>> - if (split_order > 0) {
>>>>>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>> -
>>>>>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>>> - }
>>>>>>> - } else if (order < folio_order(folio)) {
>>>>>>> - swap.val = round_down(swap.val, 1 << folio_order(folio));
>>>>>>> }
>>>>>>>
>>>>>>> alloced:
>>>>>>> + /*
>>>>>>> + * We need to split an existing large entry if swapin brought in a
>>>>>>> + * smaller folio due to various of reasons.
>>>>>>> + *
>>>>>>> + * And worth noting there is a special case: if there is a smaller
>>>>>>> + * cached folio that covers @swap, but not @index (it only covers
>>>>>>> + * first few sub entries of the large entry, but @index points to
>>>>>>> + * later parts), the swap cache lookup will still see this folio,
>>>>>>> + * And we need to split the large entry here. Later checks will fail,
>>>>>>> + * as it can't satisfy the swap requirement, and we will retry
>>>>>>> + * the swapin from beginning.
>>>>>>> + */
>>>>>>> + swap_order = folio_order(folio);
>>>>>>
>>>>>> Nit: 'swap_order' is confusing, and can you just use folio_order() or a
>>>>>> btter name?
>>>>>
>>>>> Good idea.
>>>>>
>>>>>>
>>>>>>> + if (order > swap_order) {
>>>>>>> + error = shmem_split_swap_entry(inode, index, swap, gfp);
>>>>>>> + if (error)
>>>>>>> + goto failed_nolock;
>>>>>>> + }
>>>>>>> +
>>>>>>> + index = round_down(index, 1 << swap_order);
>>>>>>> + swap.val = round_down(swap.val, 1 << swap_order);
>>>>>>
>>>>>> The round_down() of index and swap value here may cause
>>>>>> shmem_add_to_page_cache() to fail to insert a new folio, because the
>>>>>> swap value stored at that index in the shmem mapping does not match,
>>>>>> leading to another swapin page fault for correction.
>>>>>>
>>>>>> For example, shmem stores a large swap entry of order 4 in the range of
>>>>>> index 0-64. When a swapin fault occurs at index = 3, with swap.val =
>>>>>> 0x4000, if a split happens and this round_down() logic is applied, then
>>>>>> index = 3, swap.val = 0x4000. However, the actual swap.val should be
>>>>>> 0x4003 stored in the shmem mapping. This would cause another swapin fault.
>>>>>
>>>>> Oops, I missed a swap value fixup in the !SWP_SYNCHRONOUS_IO path
>>>>> above, it should re-calculate the swap value there. It's fixed in the
>>>>> final patch but left unhandled here. I'll update this part.
>>>>>
>>>>>>
>>>>>> I still prefer my original alignment method, and do you find this will
>>>>>> cause any issues?
>>>>>>
>>>>>> "
>>>>>> if (split_order > 0) {
>>>>>> pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>
>>>>>> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>> }
>>>>>> "
>>>>>
>>>>> It only fits the cached swapin and uncached swapin, not the cache hit
>>>>> case. Cache hits may see a larger folio so split didn't happen, but
>>>>> the round_down is still needed.
>>>>
>>>> IMO, this only fits for the large swap entry split case.
>>>>
>>>>> And there is another racy case: before this patch, the split may
>>>>> happen first, but shmem_swapin_cluster brought in a large folio due to
>>>>> race in the swap cache layer.
>>>>
>>>> shmem_swapin_cluster() can only allocate order 0 folio, right?
>>>
>>> It can only allocate order 0 folio, but It can hit a large folio: eg.
>>> a parallel swapin/swapout happened, and the folio stays in swap cache,
>>> while we are handling a swapin here.
>>
>> Yes, I know this. This is exactly the issue that patch 1 wants to fix.
>
> Hmm, patch 1 prevents the hang but doesn't prevent things like a
> duplicated fault:
>
> Starts with:
> swap entry val = 0x400, order = 4, covering index 0 - 15, faulting index 3:
>
> Before this patch:
> CPU0 CPU1
> shmem_swapin_folio (see order = 4)
> shmem_swapin_folio (see order = 4)
> /* fallback to order 0 due to */
> /* mem pressure / temporary pin / etc */
> shmem_split_large_entry
> /* split to order 0 */
> /* interrupted */
> /* swapin done with order = 4 folio */
> /* swapout again, leave the large folio
> in cache temporarily */
> folio = swap_cluster_readahead(0x403)
> /* Gets folio order = 4, folio->swap = 0x400
> since swap_cluster_readahead uses swap cache */
> folio->swap.val != swap.val
> /* ! Above test failed ! */
> /* It shouldn't fail the round down is needed */
OK. Thanks for the explanation. Yes, this might cause a refault.
> This patch moved the split after the swapin so it should be OK now,
Yes, I agree with 'moving the split after the swapin'.
> but still the split_order could be unstable, see below:
>>>>> And I'm not sure if split_order is always reliable here, for example
>>>>> concurrent split may return an inaccurate value here.
>>>>
>>>> We've held the xas lock to ensure the split is reliable, even though
>>>> concurrent splits may occur, only one split can get the large
>>>> 'split_order', another will return 0 (since it will see the large swao
>>>> entry has already been split).
>>>
>>> Yes, it may return 0, so we can get a large folio here, but get
>>> `split_order = 0`?
>>
>> If split happens, which means the 'order' > folio_order(), right? how
>> can you get a large folio in this context?
>>
>>> And if concurrently swapout/swapin happened, the `split_order` could
>>> be a different value?
>>
>> What do you mean different value? The large swap entry can only be split
>> once, so the 'split_order' must be 0 or the original large order.
>
> Since d53c78fffe7ad, shmem_split_large_entry doesn't split every slot
> into order 0 IIUC, so things get complex if two CPUs are faulting on
> different indexes landing into two different splitting zones:
>
> Before this patch:
> swap entry val = 0x400, order = 9, covering index 0 - 511, faulting index 3:
>
> CPU0 CPU1
> shmem_swapin_folio (index = 3)
> shmem_swapin_folio (index = 510)
> /* Gets swap = 0x400 */ /* Gets swap = 0x400 */
> /* fallback to order 0 */ /* fallback to order 0 */
> split_order = shmem_split_large_entry
> /* get split_order = 512 */
> /* offset = 3 */
> split_order = shmem_split_large_entry
> /* get split_order = 0, but no split */
> /* map order is 8, offset = 0 */
> /* wrong offset */
> shmem_swapin_cluster(0x400)
> /* It should swapin 0x5fe */
Not true. IIUC, CPU1 will fail to split due to
'swp_to_radix_entry(swap) != old' in shmem_split_large_entry(), and
will retry to fix this race.
> After this patch (with the append fix which was left in latest patch
> by mistake) it won't swapin wrong entry now, but
> shmem_split_large_entry may still return a outdated order.
Like I said above, I don't think we can get an outdated order, right?
> That's two previous races I can come up with. These no longer exist
> after this patch, it's not a bug though, just redundant IO as far as I
> can see because other checks will fallback, looks a bit fragile
> though. But shmem_split_large_entry may still return invalid order,
> just much less likely.
>
> I think the ideology here is, because the `order =
> shmem_confirm_swap(mapping, index, swap)` ensures order is stable and
> corresponds to the entry value at one point, so keep using that value
> is better (and so this patch does the offset and calculation using the
> `order` retrieved there before issues the swapin).
I don't think that the 'order' obtained through shmem_confirm_swap()
is stable, because shmem_confirm_swap() is only protected by RCU.
However, I think the 'split_order' obtained from
shmem_split_large_entry() under the xas lock is stable.
> And after the swapin have brought a folio in, simply round down using
> the folio's order, which should ensure the folio can be added
> successfully in any case as long as the folio->swap and index fits the
> shmem mapping fine.
>
>>>> Based on your current patch, would the following modifications be clearer?
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index 5be9c905396e..91c071fb7b67 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -2254,7 +2254,7 @@ static int shmem_split_swap_entry(struct inode
>>>> *inode, pgoff_t index,
>>>> if (xas_error(&xas))
>>>> return xas_error(&xas);
>>>>
>>>> - return 0;
>>>> + return split_order;
>>>> }
>>>>
>>>> /*
>>>> @@ -2351,10 +2351,23 @@ static int shmem_swapin_folio(struct inode
>>>> *inode, pgoff_t index,
>>>> error = shmem_split_swap_entry(inode, index, swap, gfp);
>>>> if (error)
>>>> goto failed_nolock;
>>>> - }
>>>>
>>>> - index = round_down(index, 1 << swap_order);
>>>> - swap.val = round_down(swap.val, 1 << swap_order);
>>>> + /*
>>>> + * If the large swap entry has already been split, it is
>>>> + * necessary to recalculate the new swap entry based on
>>>> + * the old order alignment.
>>>> + */
>>>> + if (split_order > 0) {
>
> The split_order could still be an outdated value, eg. we may even get
> split_order = 0 with a large folio loaded here.
Ditto. I don't see how split_order can be an outdated value.
>>>> + pgoff_t offset = index - round_down(index, 1 <<
>>>> split_order);
>>>> +
>>>> + swap = swp_entry(swp_type(swap),
>>>> swp_offset(swap) + offset);
>>>> + }
>>>> + } else if (order < folio_order(folio)) {
>>>> + /*
>>>> + * TODO; explain the posible race...
>>>> + */
>>>> + swap.val = round_down(swap.val, 1 << folio_order(folio));
>>>> + }
>>
>> Sorry, you changes did not convince me. I still prefer my changes,
>> listing out the possible races that might require changes to the swap
>> value, as it would be clearer and more readable. Additionally, this is a
>> cleanup patch, so I hope there are no implicit functional changes.
>
> I was thinking that the less branch we have the better, I won't insist
> on this though.
But this is under the premise of keeping the code logic clear and simple.
>>>> /* We have to do this with folio locked to prevent races */
>>>> folio_lock(folio);
>>>> @@ -2382,7 +2395,8 @@ static int shmem_swapin_folio(struct inode *inode,
>>>> pgoff_t index,
>>>> goto failed;
>>>> }
>>>>
>>>> - error = shmem_add_to_page_cache(folio, mapping, index,
>>>> + error = shmem_add_to_page_cache(folio, mapping,
>>>> + round_down(index, nr_pages),
>>>> swp_to_radix_entry(swap), gfp);
>>>> if (error)
>>>> goto failed;
>>>>
>>>>> So I wanted to simplify it: by round_down(folio_order(folio)) we
>>>>> simply get the index and swap that will be covered by this specific
>>>>> folio, if the covered range still has the corresponding swap entries
>>>>> (check and ensured by shmem_add_to_page_cache which holds the
>>>>> xa_lock), then the folio will be inserted back safely and
>>>>> successfully.
>>>>
>>>
>>> I think adding the missing swap value fixup in the !SYNC_IO path
>>> should be good enough?
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 5be9c905396e..2620e4d1b56a 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -2276,6 +2276,7 @@ static int shmem_swapin_folio(struct inode
>>> *inode, pgoff_t index,
>>> struct folio *folio = NULL;
>>> bool skip_swapcache = false;
>>> swp_entry_t swap;
>>> + pgoff_t offset;
>>>
>>> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
>>> swap = radix_to_swp_entry(*foliop);
>>> @@ -2325,7 +2326,9 @@ static int shmem_swapin_folio(struct inode
>>> *inode, pgoff_t index,
>>> goto failed;
>>> }
>>>
>>> - /* Here we actually start the io */
>>> + /* Cached swapin currently only supports order 0 swapin */
>>> + /* It may hit a large folio but that's OK and handled below */
>>> + offset = index - round_down(index, 1 << order);
>>> + swap.val = swap.val + offset;
>
> BTW for this append patch, it should be using a different variable so
> swap won't be changed or it will cause redundant fault again. My bad.
>
>>> folio = shmem_swapin_cluster(swap, gfp, info, index);
>>> if (!folio) {
>>> error = -ENOMEM;
>>
* Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
2025-07-01 1:57 ` Baolin Wang
@ 2025-07-01 18:49 ` Kairui Song
2025-07-02 2:33 ` Baolin Wang
0 siblings, 1 reply; 23+ messages in thread
From: Kairui Song @ 2025-07-01 18:49 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
On Tue, Jul 1, 2025 at 9:57 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
> On 2025/7/1 02:19, Kairui Song wrote:
> > On Mon, Jun 30, 2025 at 7:59 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >> On 2025/6/30 18:06, Kairui Song wrote:
... snip ...
> >>> It can only allocate order 0 folio, but It can hit a large folio: eg.
> >>> a parallel swapin/swapout happened, and the folio stays in swap cache,
> >>> while we are handling a swapin here.
> >>
> >> Yes, I know this. This is exactly the issue that patch 1 wants to fix.
> >
> > Hmm, patch 1 prevents the hang but doesn't prevent things like a
> > duplicated fault:
> >
> > Starts with:
> > swap entry val = 0x400, order = 4, covering index 0 - 15, faulting index 3:
> >
> > Before this patch:
> > CPU0 CPU1
> > shmem_swapin_folio (see order = 4)
> > shmem_swapin_folio (see order = 4)
> > /* fallback to order 0 due to */
> > /* mem pressure / temporary pin / etc */
> > shmem_split_large_entry
> > /* split to order 0 */
> > /* interrupted */
> > /* swapin done with order = 4 folio */
> > /* swapout again, leave the large folio
> > in cache temporarily */
> > folio = swap_cluster_readahead(0x403)
> > /* Gets folio order = 4, folio->swap = 0x400
> > since swap_cluster_readahead uses swap cache */
> > folio->swap.val != swap.val
> > /* ! Above test failed ! */
> > /* It shouldn't fail the round down is needed */
>
> OK. Thanks for the explanation. Yes, this might cause a refault.
>
> > This patch moved the split after the swapin so it should be OK now,
>
> Yes, I agree 'moving the split after the swapin'.
>
> > but still the split_order could be unstable, see below:
>
> >>>>> And I'm not sure if split_order is always reliable here, for example
> >>>>> concurrent split may return an inaccurate value here.
> >>>>
> >>>> We've held the xas lock to ensure the split is reliable, even though
> >>>> concurrent splits may occur, only one split can get the large
> >>>> 'split_order', another will return 0 (since it will see the large swao
> >>>> entry has already been split).
> >>>
> >>> Yes, it may return 0, so we can get a large folio here, but get
> >>> `split_order = 0`?
> >>
> >> If split happens, which means the 'order' > folio_order(), right? how
> >> can you get a large folio in this context?
> >>
> >>> And if concurrently swapout/swapin happened, the `split_order` could
> >>> be a different value?
> >>
> >> What do you mean different value? The large swap entry can only be split
> >> once, so the 'split_order' must be 0 or the original large order.
> >
> > Since d53c78fffe7ad, shmem_split_large_entry doesn't split every slot
> > into order 0 IIUC, so things get complex if two CPUs are faulting on
> > different indexes landing into two different splitting zones:
> >
> > Before this patch:
> > swap entry val = 0x400, order = 9, covering index 0 - 511, faulting index 3:
> >
> > CPU0 CPU1
> > shmem_swapin_folio (index = 3)
> > shmem_swapin_folio (index = 510)
> > /* Gets swap = 0x400 */ /* Gets swap = 0x400 */
> > /* fallback to order 0 */ /* fallback to order 0 */
> > split_order = shmem_split_large_entry
> > /* get split_order = 512 */
> > /* offset = 3 */
> > split_order = shmem_split_large_entry
> > /* get split_order = 0, but no split */
> > /* map order is 8, offset = 0 */
> > /* wrong offset */
> > shmem_swapin_cluster(0x400)
> > /* It should swapin 0x5fe */
>
> Not ture. IIUC, the CPU1 will failed to split due to
> 'swp_to_radix_entry(swap) != old' in shmem_split_large_entry(), and will
> retry again to fix this race.
>
> > After this patch (with the append fix which was left in latest patch
> > by mistake) it won't swapin wrong entry now, but
> > shmem_split_large_entry may still return a outdated order.
>
> Like I said above, I don't think we can get a outdated order,right?
>
> > That's two previous races I can come up with. These no longer exist
> > after this patch, it's not a bug though, just redundant IO as far as I
> > can see because other checks will fallback, looks a bit fragile
> > though. But shmem_split_large_entry may still return invalid order,
> > just much less likely.
> >
> > I think the ideology here is, because the `order =
> > shmem_confirm_swap(mapping, index, swap)` ensures order is stable and
> > corresponds to the entry value at one point, so keep using that value
> > is better (and so this patch does the offset and calculation using the
> > `order` retrieved there before issues the swapin).
>
> I don't think that the 'order' obtained through the shmem_confirm_swap()
> is stable, because shmem_confirm_swap() is only protected by RCU.
> However, I think the 'split_order' obtained from
> shmem_split_large_entry() under the xas lock is stable.
>
> > And after the swapin have brought a folio in, simply round down using
> > the folio's order, which should ensure the folio can be added
> > successfully in any case as long as the folio->swap and index fits the
> > shmem mapping fine.
> >
> >>>> Based on your current patch, would the following modifications be clearer?
> >>>>
> >>>> diff --git a/mm/shmem.c b/mm/shmem.c
> >>>> index 5be9c905396e..91c071fb7b67 100644
> >>>> --- a/mm/shmem.c
> >>>> +++ b/mm/shmem.c
> >>>> @@ -2254,7 +2254,7 @@ static int shmem_split_swap_entry(struct inode
> >>>> *inode, pgoff_t index,
> >>>> if (xas_error(&xas))
> >>>> return xas_error(&xas);
> >>>>
> >>>> - return 0;
> >>>> + return split_order;
> >>>> }
Sorry about the bad example, I got confused looking at the wrong code
base and deduced a wrong stack. Before this series it was returning
entry_order here, so yeah, it's OK.
> >>>>
> >>>> /*
> >>>> @@ -2351,10 +2351,23 @@ static int shmem_swapin_folio(struct inode
> >>>> *inode, pgoff_t index,
> >>>> error = shmem_split_swap_entry(inode, index, swap, gfp);
> >>>> if (error)
> >>>> goto failed_nolock;
> >>>> - }
> >>>>
> >>>> - index = round_down(index, 1 << swap_order);
> >>>> - swap.val = round_down(swap.val, 1 << swap_order);
> >>>> + /*
> >>>> + * If the large swap entry has already been split, it is
> >>>> + * necessary to recalculate the new swap entry based on
> >>>> + * the old order alignment.
> >>>> + */
> >>>> + if (split_order > 0) {
> >
> > The split_order could still be an outdated value, eg. we may even get
> > split_order = 0 with a large folio loaded here.
>
> Ditto. I didn't see split_order can be an outdated value.
The problem is: could split_order be 0 while we get a large folio
here? That is like the problem in patch 1.
This patch changed the race window a lot; even if that happens,
shmem_split_swap_entry will just fail and retry though.
Or, could it get a split_order != 0 while the folio order == 0?
So assume we do the following (checking split_order instead, trimming
all comments):
if (order > folio_order(folio)) {
	split_order = shmem_split_swap_entry(inode, index, swap, gfp);
	if (split_order < 0)
		goto failed_nolock;
	if (split_order > 0) {
		offset = index - round_down(index, 1 << split_order);
		swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
		/* Could we get unnecessary `folio->swap != swap`? */
	} else {
		/* Do we need any rounding here? */
		/* eg. swap.val = round_down(swap.val, 1 << folio_order(folio)) */
		/* Could be helpful */
	}
} else if (order < folio_order(folio)) {
	swap.val = round_down(swap.val, 1 << folio_order(folio));
}
I was scratching my head at midnight to figure out whether we need a
round down in the /* Do we need any rounding here */ part.
And it seems we don't? There could be a lot of races, but
shmem_split_swap_entry is now effectively guarded by the swap cache
(or swap_map's HAS_CACHE): to split an index, we must first load the
folio into the swap cache (or set the HAS_CACHE bit). So if a large
folio is loaded here, the underlying entries must remain un-split, as
that prevents any parallel swapin of the entries it covers.
But this also requires us to call shmem_split_swap_entry with the
right swap and index values covered by the folio (if not, it's still
OK, just a redundant split or a redundant fault in the worst case, it
seems; this cleanup can be done later instead).
But there is still a missing issue that I posted previously: it really
should do an early fixup of the swap entry for the cached swapin in
this patch (it's delayed to the final patch in this series):
So the shmem_swapin_cluster call should be:
	/* Cached swapin currently only supports order 0 swapin */
	offset = index - round_down(index, 1 << order);
	folio = shmem_swapin_cluster(swp_entry(swp_type(swap),
					swp_offset(swap) + offset),
				     gfp, info, index);
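Plugging in the hypothetical values from the first example in this
thread (base entry 0x400, order 4, faulting index 3), the fixup selects
the sub-entry that actually backs @index:

	/*
	 * index = 3, order = 4:
	 *   offset = 3 - round_down(3, 1 << 4) = 3
	 *   swp_offset(swap) + offset = 0x400 + 3 = 0x403
	 * so the cached swapin reads 0x403 instead of the base entry 0x400.
	 */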
So far I think that's enough for this patch. I can send a V4 to keep
this part's change as minimal as possible, but in later commits things
may still have to change a bit... especially for avoiding redundant
calculation of offset and swap, minimizing the overhead, and ensuring
shmem_split_swap_entry is called with the right value to avoid
re-faults.
I'll try my best to keep this part as unchanged as possible, but I do
find it somewhat harder to track as the swapin path gets more
refactoring. My bad for making it confusing in the first place; the
patches are still not split well enough for easy reviewing, I think.
My apologies if this is causing any inconvenience for you.
* Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
2025-07-01 18:49 ` Kairui Song
@ 2025-07-02 2:33 ` Baolin Wang
0 siblings, 0 replies; 23+ messages in thread
From: Baolin Wang @ 2025-07-02 2:33 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi,
Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel
On 2025/7/2 02:49, Kairui Song wrote:
> On Tue, Jul 1, 2025 at 9:57 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>> On 2025/7/1 02:19, Kairui Song wrote:
>>> On Mon, Jun 30, 2025 at 7:59 PM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>> On 2025/6/30 18:06, Kairui Song wrote:
> ... snip ...
>>>>> It can only allocate order 0 folio, but It can hit a large folio: eg.
>>>>> a parallel swapin/swapout happened, and the folio stays in swap cache,
>>>>> while we are handling a swapin here.
>>>>
>>>> Yes, I know this. This is exactly the issue that patch 1 wants to fix.
>>>
>>> Hmm, patch 1 prevents the hang but doesn't prevent things like a
>>> duplicated fault:
>>>
>>> Starts with:
>>> swap entry val = 0x400, order = 4, covering index 0 - 15, faulting index 3:
>>>
>>> Before this patch:
>>> CPU0 CPU1
>>> shmem_swapin_folio (see order = 4)
>>> shmem_swapin_folio (see order = 4)
>>> /* fallback to order 0 due to */
>>> /* mem pressure / temporary pin / etc */
>>> shmem_split_large_entry
>>> /* split to order 0 */
>>> /* interrupted */
>>> /* swapin done with order = 4 folio */
>>> /* swapout again, leave the large folio
>>> in cache temporarily */
>>> folio = swap_cluster_readahead(0x403)
>>> /* Gets folio order = 4, folio->swap = 0x400
>>> since swap_cluster_readahead uses swap cache */
>>> folio->swap.val != swap.val
>>> /* ! Above test failed ! */
>>> /* It shouldn't fail the round down is needed */
>>
>> OK. Thanks for the explanation. Yes, this might cause a refault.
>>
>>> This patch moved the split after the swapin so it should be OK now,
>>
>> Yes, I agree 'moving the split after the swapin'.
>>
>>> but still the split_order could be unstable, see below:
>>
>>>>>>> And I'm not sure if split_order is always reliable here, for example
>>>>>>> concurrent split may return an inaccurate value here.
>>>>>>
>>>>>> We've held the xas lock to ensure the split is reliable, even though
>>>>>> concurrent splits may occur, only one split can get the large
>>>>>> 'split_order', another will return 0 (since it will see the large swao
>>>>>> entry has already been split).
>>>>>
>>>>> Yes, it may return 0, so we can get a large folio here, but get
>>>>> `split_order = 0`?
>>>>
>>>> If split happens, which means the 'order' > folio_order(), right? how
>>>> can you get a large folio in this context?
>>>>
>>>>> And if concurrently swapout/swapin happened, the `split_order` could
>>>>> be a different value?
>>>>
>>>> What do you mean different value? The large swap entry can only be split
>>>> once, so the 'split_order' must be 0 or the original large order.
>>>
>>> Since d53c78fffe7ad, shmem_split_large_entry doesn't split every slot
>>> into order 0 IIUC, so things get complex if two CPUs are faulting on
>>> different indexes landing into two different splitting zones:
>>>
>>> Before this patch:
>>> swap entry val = 0x400, order = 9, covering index 0 - 511, faulting index 3:
>>>
>>> CPU0 CPU1
>>> shmem_swapin_folio (index = 3)
>>> shmem_swapin_folio (index = 510)
>>> /* Gets swap = 0x400 */ /* Gets swap = 0x400 */
>>> /* fallback to order 0 */ /* fallback to order 0 */
>>> split_order = shmem_split_large_entry
>>> /* get split_order = 512 */
>>> /* offset = 3 */
>>> split_order = shmem_split_large_entry
>>> /* get split_order = 0, but no split */
>>> /* map order is 8, offset = 0 */
>>> /* wrong offset */
>>> shmem_swapin_cluster(0x400)
>>> /* It should swapin 0x5fe */
>>
>> Not ture. IIUC, the CPU1 will failed to split due to
>> 'swp_to_radix_entry(swap) != old' in shmem_split_large_entry(), and will
>> retry again to fix this race.
>>
>>> After this patch (with the append fix which was left in latest patch
>>> by mistake) it won't swapin wrong entry now, but
>>> shmem_split_large_entry may still return a outdated order.
>>
>> Like I said above, I don't think we can get a outdated order,right?
>>
>>> That's two previous races I can come up with. These no longer exist
>>> after this patch, it's not a bug though, just redundant IO as far as I
>>> can see because other checks will fallback, looks a bit fragile
>>> though. But shmem_split_large_entry may still return invalid order,
>>> just much less likely.
>>>
>>> I think the ideology here is, because the `order =
>>> shmem_confirm_swap(mapping, index, swap)` ensures order is stable and
>>> corresponds to the entry value at one point, so keep using that value
>>> is better (and so this patch does the offset and calculation using the
>>> `order` retrieved there before issues the swapin).
>>
>> I don't think that the 'order' obtained through the shmem_confirm_swap()
>> is stable, because shmem_confirm_swap() is only protected by RCU.
>> However, I think the 'split_order' obtained from
>> shmem_split_large_entry() under the xas lock is stable.
>>
>>> And after the swapin have brought a folio in, simply round down using
>>> the folio's order, which should ensure the folio can be added
>>> successfully in any case as long as the folio->swap and index fits the
>>> shmem mapping fine.
>>>
>>>>>> Based on your current patch, would the following modifications be clearer?
>>>>>>
>>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>>> index 5be9c905396e..91c071fb7b67 100644
>>>>>> --- a/mm/shmem.c
>>>>>> +++ b/mm/shmem.c
>>>>>> @@ -2254,7 +2254,7 @@ static int shmem_split_swap_entry(struct inode
>>>>>> *inode, pgoff_t index,
>>>>>> if (xas_error(&xas))
>>>>>> return xas_error(&xas);
>>>>>>
>>>>>> - return 0;
>>>>>> + return split_order;
>>>>>> }
>
> Sorry about the bad example, I got confused looking at the wrong code
> base and deduced a wrong stack. Before this series it was returning
> entry_order here so yeah it's OK.
>
>>>>>>
>>>>>> /*
>>>>>> @@ -2351,10 +2351,23 @@ static int shmem_swapin_folio(struct inode
>>>>>> *inode, pgoff_t index,
>>>>>> error = shmem_split_swap_entry(inode, index, swap, gfp);
>>>>>> if (error)
>>>>>> goto failed_nolock;
>>>>>> - }
>>>>>>
>>>>>> - index = round_down(index, 1 << swap_order);
>>>>>> - swap.val = round_down(swap.val, 1 << swap_order);
>>>>>> + /*
>>>>>> + * If the large swap entry has already been split, it is
>>>>>> + * necessary to recalculate the new swap entry based on
>>>>>> + * the old order alignment.
>>>>>> + */
>>>>>> + if (split_order > 0) {
>>>
>>> The split_order could still be an outdated value, eg. we may even get
>>> split_order = 0 with a large folio loaded here.
>>
>> Ditto. I didn't see split_order can be an outdated value.
>
> The problem is, could split_order be 0 while getting a large folio,
> which is like the problem in patch 1.
> This patch changed the race window by a lot, even if that happens,
> shmem_split_swap_entry will just fail and retry though.
>
> Or, can it get a split_order != 0 while folio order == 0:
>
> So assume we do (check split_order instead, trimming all comments):
> if (order > folio_order(folio)) {
> split_order = shmem_split_swap_entry(inode, index, swap, gfp);
> if (split_order < 0)
> goto failed_nolock;
> if (split_order > 0) {
> offset = index - round_down(index, 1 << split_order);
> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> /* Could we get unnecessary `folio->swap != swap`? */
> } else {
> /* Do we need any rounding here? */
> /* eg. swap.val = round_down(swap.val, 1 << folio_order(folio)) */
> /* Could be helpful */
> }
> } else if (order < folio_order(folio)) {
> swap.val = round_down(swap.val, 1 << folio_order(folio));
> }
>
> I was scratching my head in the midnight to figure out if we
> need a round down in the /* Do we need any rounding here */ part.
>
> And it seems, we don't? There could be a lot of races, but because
> shmem_split_swap_entry is now actually
> guarded by swap cache (or swap_map's HAS_CACHE) now: to split an
> index, we must load the folio in the swap cache (or the HAS_CACHE bit)
> first now. So if a large folio is loaded here, the underlying entries
> must remain un-splitted, as it prevents any parallel swapin for
> entries is covered.
At least in this patch, I think we do not need a round down in the /* Do
we need any rounding here */ part.
Because concurrent swapin races always exist, we need to ensure that
the folio can be correctly inserted into the shmem mapping. Moreover,
the 'swap.val' check, shmem_split_swap_entry(), and
shmem_add_to_page_cache() all contain race-condition checks to prevent
errors, and will refault to fix the race issue.
Therefore, for this uncommon race, as long as there's no need to add
additional complex logic, I prefer to leverage the current
race-checking mechanism and refault to address it, unless an issue
arises that cannot be resolved by a refault, as described in your
patch 1.
> But this also requires us to call shmem_split_swap_entry with the
> right swap and index value covered by the folio (if not, it's still
> OK, just redundant split or redundant fault in the worst case it
> seems, this clean up can be done later instead).
>
> But there is still a missing issue that I posted previously, it really
> should do a ealy fixup of the swap entry for the cached swapin in this
> patch (it's delayed to the final patch in this series):
>
> So the shmem_swapin_cluster call should be:
>
> /* Cached swapin currently only supports order 0 swapin */
> offset = index - round_down(index, 1 << order);
> folio = shmem_swapin_cluster(swp_entry(swp_type(swap),
> swp_offset(swap) + offset), gfp, info, index);
Actually, this race will be caught by the checks, and a refault will
fix it. Yeah, this needs an extra refault, but can this really cause a
real issue in real products?
> So far I think it's enough for this patch now. I can send a V4 to
> keep this part's change as minimized as possible, but in later commits
> things may still have to change a bit... especially for avoiding
> redundant calculation of offset and swap, or minimizing the overhead,
> ensuring shmem_split_swap_entry is called with the right value to
> avoid re-faults.
Yes. This patch is for cleanup, so we should not add more changes to
avoid refaults. :)
And we can discuss in your following patches whether it is worth
avoiding some uncommon refaults, especially when complex logic needs to
be added.
> I'll try my best to keep this part as unchanged as possible, but I do find it
> somehow harder to track as the swapin path gets more refactors.
>
> My bad for making it confusing in the first place, the patches are
> still not very well-splitted enough for easier reviewing I think,
> please have my apologies if this is causing any inconvenience for you.
No worries :) I think your patches are going in the right direction (I
originally wanted to do some cleanup of the swapin logic as well). I am
happy to help review and test your patches.
Thread overview: 23+ messages
2025-06-27 6:20 [PATCH v3 0/7] mm/shmem, swap: bugfix and improvement of mTHP swap in Kairui Song
2025-06-27 6:20 ` [PATCH v3 1/7] mm/shmem, swap: improve cached mTHP handling and fix potential hung Kairui Song
2025-06-30 3:44 ` Baolin Wang
2025-06-27 6:20 ` [PATCH v3 2/7] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
2025-06-27 6:20 ` [PATCH v3 3/7] mm/shmem, swap: tidy up THP swapin checks Kairui Song
2025-06-30 4:47 ` Baolin Wang
2025-06-27 6:20 ` [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting Kairui Song
2025-06-30 6:34 ` Baolin Wang
2025-06-30 9:16 ` Kairui Song
2025-06-30 9:53 ` Baolin Wang
2025-06-30 10:06 ` Kairui Song
2025-06-30 11:59 ` Baolin Wang
2025-06-30 18:19 ` Kairui Song
2025-07-01 1:57 ` Baolin Wang
2025-07-01 18:49 ` Kairui Song
2025-07-02 2:33 ` Baolin Wang
2025-06-27 6:20 ` [PATCH v3 5/7] mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO Kairui Song
2025-06-30 7:24 ` Baolin Wang
2025-06-27 6:20 ` [PATCH v3 6/7] mm/shmem, swap: fix major fault counting Kairui Song
2025-06-30 7:05 ` Baolin Wang
2025-06-27 6:20 ` [PATCH v3 7/7] mm/shmem, swap: avoid false positive swap cache lookup Kairui Song
2025-06-30 7:21 ` Baolin Wang
2025-06-30 9:17 ` Kairui Song