From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Matthew Wilcox <willy@infradead.org>,
	Kemeng Shi <shikemeng@huaweicloud.com>,
	Chris Li <chrisl@kernel.org>, Nhat Pham <nphamcs@gmail.com>,
	Baoquan He <bhe@redhat.com>, Barry Song <baohua@kernel.org>,
	linux-kernel@vger.kernel.org, Kairui Song <kasong@tencent.com>
Subject: [PATCH v5 7/8] mm/shmem, swap: rework swap entry and index calculation for large swapin
Date: Thu, 10 Jul 2025 11:37:05 +0800
Message-ID: <20250710033706.71042-8-ryncsn@gmail.com>
In-Reply-To: <20250710033706.71042-1-ryncsn@gmail.com>

From: Kairui Song <kasong@tencent.com>

Instead of calculating the swap entry differently in different swapin
paths, calculate it early, before the swap cache lookup, and use it for
both the lookup and the later swapin. After swapin has brought in a
folio, simply round the swap entry down to the folio size.

This is simple yet effective enough to verify the swap value: a folio's
swap entry is always aligned to its size. Any kind of parallel split or
race is acceptable, because the final shmem_add_to_page_cache ensures
that all entries covered by the folio are correct, so there can be no
data corruption.
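
Concretely, once swapin has produced a folio, the entry and index only
need to be aligned to that folio; a condensed view of what the hunk
below ends up doing:

  nr_pages = folio_nr_pages(folio);
  if (nr_pages > 1) {
          /* a swap folio's entry is always aligned by its size */
          swap.val = round_down(swap.val, nr_pages);
          index = round_down(index, nr_pages);
  }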

This also prevents false-positive cache lookups. If a shmem read
request's index points to the middle of a large swap entry, shmem
previously attempted the swap cache lookup using the large swap entry's
starting value (which is the first sub swap entry of this large entry).
That leads to false-positive lookup results if only the first few swap
entries are cached while the swap entry actually requested by the index
is not. This is not a rare event, as swap readahead always tries to
cache order 0 folios when possible.
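
For the lookup itself, the sub entry is derived from the index offset
within the large entry, sketching the hunk added below:

  /* index may point to the middle of a large entry, get the sub entry */
  if (order) {
          offset = index - round_down(index, 1 << order);
          swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
  }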

Nor should this cause any increase in repeated faults: no matter how
the shmem mapping is split in parallel, as long as the mapping still
contains the right entries, the swapin will succeed.

The final object size and stack usage are also reduced thanks to the
simplified code:

./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-233 (-233)
Function                                     old     new   delta
shmem_swapin_folio                          4040    3807    -233
Total: Before=33152, After=32919, chg -0.70%

Stack usage (Before vs After):
mm/shmem.c:2277:12:shmem_swapin_folio   264     static
mm/shmem.c:2277:12:shmem_swapin_folio   256     static

While at it, round down the index too whenever the swap entry is
rounded down. The index is used either for folio reallocation or for
confirming the mapping content; in either case it should be aligned
with the swap folio.
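
Since the index is aligned up front, the final insertion no longer
needs its own rounding, as the last hunk below shows:

  /* index and swap are already folio-aligned at this point */
  error = shmem_add_to_page_cache(folio, mapping, index,
                                  swp_to_radix_entry(swap), gfp);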

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c | 66 ++++++++++++++++++++++++++----------------------------
 1 file changed, 32 insertions(+), 34 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 80f5b8c73eb8..9c50607ac455 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2265,7 +2265,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 	if (xas_error(&xas))
 		return xas_error(&xas);
 
-	return entry_order;
+	return 0;
 }
 
 /*
@@ -2286,7 +2286,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
 	bool skip_swapcache = false;
-	int error, nr_pages, order, split_order;
+	int error, nr_pages, order;
 	pgoff_t offset;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
@@ -2294,11 +2294,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	swap = index_entry;
 	*foliop = NULL;
 
-	if (is_poisoned_swp_entry(swap))
+	if (is_poisoned_swp_entry(index_entry))
 		return -EIO;
 
-	si = get_swap_device(swap);
-	order = shmem_confirm_swap(mapping, index, swap);
+	si = get_swap_device(index_entry);
+	order = shmem_confirm_swap(mapping, index, index_entry);
 	if (unlikely(!si)) {
 		if (order < 0)
 			return -EEXIST;
@@ -2310,6 +2310,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		return -EEXIST;
 	}
 
+	/* index may point to the middle of a large entry, get the sub entry */
+	if (order) {
+		offset = index - round_down(index, 1 << order);
+		swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+	}
+
 	/* Look it up and read it in.. */
 	folio = swap_cache_get_folio(swap, NULL, 0);
 	if (!folio) {
@@ -2322,7 +2328,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
 			/* Direct mTHP swapin skipping swap cache & readhaed */
-			folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
+			folio = shmem_swap_alloc_folio(inode, vma, index,
+						       index_entry, order, gfp);
 			if (IS_ERR(folio)) {
 				error = PTR_ERR(folio);
 				folio = NULL;
@@ -2330,16 +2337,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			}
 			skip_swapcache = true;
 		} else {
-			/*
-			 * Cached swapin only supports order 0 folio, it is
-			 * necessary to recalculate the new swap entry based on
-			 * the offset, as the swapin index might be unalgined.
-			 */
-			if (order) {
-				offset = index - round_down(index, 1 << order);
-				swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
-			}
-
+			/* Cached swapin only supports order 0 folio */
 			folio = shmem_swapin_cluster(swap, gfp, info, index);
 			if (!folio) {
 				error = -ENOMEM;
@@ -2356,23 +2354,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		 * large swap entries. In such cases, we should split the
 		 * large swap entry to prevent possible data corruption.
 		 */
-		split_order = shmem_split_large_entry(inode, index, index_entry, gfp);
-		if (split_order < 0) {
-			error = split_order;
+		error = shmem_split_large_entry(inode, index, index_entry, gfp);
+		if (error)
 			goto failed_nolock;
-		}
+	}
 
-		/*
-		 * If the large swap entry has already been split, it is
-		 * necessary to recalculate the new swap entry based on
-		 * the old order alignment.
-		 */
-		if (split_order > 0) {
-			offset = index - round_down(index, 1 << split_order);
-			swap = swp_entry(swp_type(swap), swp_offset(index_entry) + offset);
-		}
-	} else if (order < folio_order(folio)) {
-		swap.val = round_down(swap.val, 1 << folio_order(folio));
+	/*
+	 * If the folio is large, round down swap and index by folio size.
+	 * No matter what race occurs, the swap layer ensures we either get
+	 * a valid folio that has its swap entry aligned by size, or a
+	 * temporarily invalid one which we'll abort very soon and retry.
+	 *
+	 * shmem_add_to_page_cache ensures the whole range contains expected
+	 * entries and prevents any corruption, so any race split is fine
+	 * too, it will succeed as long as the entries are still there.
+	 */
+	nr_pages = folio_nr_pages(folio);
+	if (nr_pages > 1) {
+		swap.val = round_down(swap.val, nr_pages);
+		index = round_down(index, nr_pages);
 	}
 
 	/* We have to do this with folio locked to prevent races */
@@ -2387,7 +2387,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		goto failed;
 	}
 	folio_wait_writeback(folio);
-	nr_pages = folio_nr_pages(folio);
 
 	/*
 	 * Some architectures may have to restore extra metadata to the
@@ -2401,8 +2400,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			goto failed;
 	}
 
-	error = shmem_add_to_page_cache(folio, mapping,
-					round_down(index, nr_pages),
+	error = shmem_add_to_page_cache(folio, mapping, index,
 					swp_to_radix_entry(swap), gfp);
 	if (error)
 		goto failed;
-- 
2.50.0


