From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
Hugh Dickins <hughd@google.com>, Chris Li <chrisl@kernel.org>,
David Hildenbrand <david@redhat.com>,
Yosry Ahmed <yosryahmed@google.com>,
"Huang, Ying" <ying.huang@linux.alibaba.com>,
Nhat Pham <nphamcs@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Baoquan He <bhe@redhat.com>, Barry Song <baohua@kernel.org>,
Kalesh Singh <kaleshsingh@google.com>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Tim Chen <tim.c.chen@linux.intel.com>,
Ryan Roberts <ryan.roberts@arm.com>,
linux-kernel@vger.kernel.org, Kairui Song <kasong@tencent.com>
Subject: [PATCH 12/28] mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO
Date: Thu, 15 May 2025 04:17:12 +0800
Message-ID: <20250514201729.48420-13-ryncsn@gmail.com>
In-Reply-To: <20250514201729.48420-1-ryncsn@gmail.com>
From: Kairui Song <kasong@tencent.com>

Now that the overhead of the swap cache is trivial to none, bypassing
the swap cache is no longer a valid optimization.

This commit is more than a code simplification; it changes the swapin
behaviour in multiple ways:

We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as
the indicator for bypassing both the swap cache and readahead. For many
workloads, bypassing readahead is the more helpful part:
SWP_SYNCHRONOUS_IO devices have extremely low latency, so readahead
brings them little benefit. But `SWP_SYNCHRONOUS_IO &&
__swap_count(entry) == 1` was never a good indicator in the first
place: obviously, readahead has nothing to do with the swap count. It
was more of a workaround for a limitation of the current
implementation, in which readahead bypassing is strictly coupled with
swap cache bypassing, and an entry with a swap count > 1 can't bypass
the swap cache because that would result in redundant IO or wasted CPU
time.

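For reference, the bypass condition in do_swap_page() changes roughly
as follows (simplified from the diff below):

	/* Before: skip swap cache and readahead only for single-mapped entries */
	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1)

	/* After: always skip readahead on synchronous devices, never skip the cache */
	if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
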
So the first change in this commit is that readahead is now always
disabled for SWP_SYNCHRONOUS_IO devices. This is a good thing: these
devices (ZRAM, RAMDISK) have extremely low latency and queued IO does
not affect them, so readahead isn't helpful.

The second change is that mTHP swapin is now enabled for all faults on
SWP_SYNCHRONOUS_IO devices. Previously, mTHP swapin was also coupled
with swap cache bypassing, but again it clearly makes little sense for
an entry's reference count to affect its mTHP swapin behaviour.

To catch potential issues with mTHP swapin, especially around page
exclusiveness, more debug sanity checks and comments are added. The
code is still simpler, with a reduced line count.

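With the bypass removed, the SWP_SYNCHRONOUS_IO branch of
do_swap_page() reduces to roughly the following (a condensed sketch of
the hunk below; the !folio backout path is omitted):

	if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
		/* No readahead: try an (m)THP-sized allocation directly */
		folio = alloc_swap_folio(vmf);
		if (folio) {
			/* swapin_entry() returns whatever folio ends up in the swap cache */
			swapcache = swapin_entry(entry, folio);
			if (swapcache != folio)
				folio_put(folio);	/* another folio won the race, drop ours */
		}
	} else {
		swapcache = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
	}
	folio = swapcache;
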
For a real mTHP workload this may cause more serious thrashing; that is
not a problem introduced by this commit but a generic mTHP issue. For a
4K workload, this commit boosts the performance:

Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 267 +++++++++++++++++++++++-----------------------------
1 file changed, 116 insertions(+), 151 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 1b6e192de6ec..0b41d15c6d7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -87,6 +87,7 @@
#include <asm/tlbflush.h>
#include "pgalloc-track.h"
+#include "swap_table.h"
#include "internal.h"
#include "swap.h"
@@ -4477,7 +4478,33 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
+/* Check if a folio should be exclusive, with sanity tests */
+static bool check_swap_exclusive(struct folio *folio, swp_entry_t entry,
+ pte_t *ptep, unsigned int fault_nr)
+{
+ pgoff_t offset = swp_offset(entry);
+ struct page *page = folio_file_page(folio, offset);
+
+ if (!pte_swp_exclusive(ptep_get(ptep)))
+ return false;
+
+ /* For exclusive swapin, it must not be mapped */
+ if (fault_nr == 1)
+ VM_WARN_ON_ONCE_PAGE(atomic_read(&page->_mapcount) != -1, page);
+ else
+ VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
+ /*
+ * Check if the swap count is consistent with exclusiveness. The folio
+ * lock and PTL keep the swap count stable.
+ */
+ if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+ for (int i = 0; i < fault_nr; i++) {
+ VM_WARN_ON_FOLIO(__swap_count(entry) != 1, folio);
+ entry.val++;
+ }
+ }
+ return true;
+}
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
@@ -4490,17 +4517,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
vm_fault_t do_swap_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- struct folio *swapcache, *folio = NULL;
- DECLARE_WAITQUEUE(wait, current);
+ struct folio *swapcache = NULL, *folio;
struct page *page;
struct swap_info_struct *si = NULL;
rmap_t rmap_flags = RMAP_NONE;
- bool need_clear_cache = false;
bool exclusive = false;
swp_entry_t entry;
pte_t pte;
vm_fault_t ret = 0;
- void *shadow = NULL;
int nr_pages;
unsigned long page_idx;
unsigned long address;
@@ -4571,56 +4595,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio = swap_cache_get_folio(entry);
swapcache = folio;
if (!folio) {
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
- __swap_count(entry) == 1) {
- /* skip swapcache */
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
folio = alloc_swap_folio(vmf);
if (folio) {
- __folio_set_locked(folio);
- __folio_set_swapbacked(folio);
-
- nr_pages = folio_nr_pages(folio);
- if (folio_test_large(folio))
- entry.val = ALIGN_DOWN(entry.val, nr_pages);
- /*
- * Prevent parallel swapin from proceeding with
- * the cache flag. Otherwise, another thread
- * may finish swapin first, free the entry, and
- * swapout reusing the same entry. It's
- * undetectable as pte_same() returns true due
- * to entry reuse.
- */
- if (swapcache_prepare(entry, nr_pages)) {
- /*
- * Relax a bit to prevent rapid
- * repeated page faults.
- */
- add_wait_queue(&swapcache_wq, &wait);
- schedule_timeout_uninterruptible(1);
- remove_wait_queue(&swapcache_wq, &wait);
- goto out_page;
- }
- need_clear_cache = true;
-
- memcg1_swapin(entry, nr_pages);
-
- shadow = swap_cache_get_shadow(entry);
- if (shadow)
- workingset_refault(folio, shadow);
-
- folio_add_lru(folio);
-
- /* To provide entry to swap_read_folio() */
- folio->swap = entry;
- swap_read_folio(folio, NULL);
- folio->private = NULL;
+ swapcache = swapin_entry(entry, folio);
+ if (swapcache != folio)
+ folio_put(folio);
}
} else {
- folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
- vmf);
- swapcache = folio;
+ swapcache = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
}
+ folio = swapcache;
if (!folio) {
/*
* Back out if somebody else faulted in this pte
@@ -4644,57 +4630,56 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (ret & VM_FAULT_RETRY)
goto out_release;
+ /*
+ * Make sure folio_free_swap() or swapoff did not release the
+ * swapcache from under us. The page pin, and pte_same test
+ * below, are not enough to exclude that. Even if it is still
+ * swapcache, we need to check that the page's swap has not
+ * changed.
+ */
+ if (!folio_swap_contains(folio, entry))
+ goto out_page;
page = folio_file_page(folio, swp_offset(entry));
- if (swapcache) {
- /*
- * Make sure folio_free_swap() or swapoff did not release the
- * swapcache from under us. The page pin, and pte_same test
- * below, are not enough to exclude that. Even if it is still
- * swapcache, we need to check that the page's swap has not
- * changed.
- */
- if (!folio_swap_contains(folio, entry))
- goto out_page;
- if (PageHWPoison(page)) {
- /*
- * hwpoisoned dirty swapcache pages are kept for killing
- * owner processes (which may be unknown at hwpoison time)
- */
- ret = VM_FAULT_HWPOISON;
- goto out_page;
- }
-
- swap_update_readahead(folio, vma, vmf->address);
+ /*
+ * hwpoisoned dirty swapcache pages are kept for killing
+ * owner processes (which may be unknown at hwpoison time)
+ */
+ if (PageHWPoison(page)) {
+ ret = VM_FAULT_HWPOISON;
+ goto out_page;
+ }
- /*
- * KSM sometimes has to copy on read faults, for example, if
- * page->index of !PageKSM() pages would be nonlinear inside the
- * anon VMA -- PageKSM() is lost on actual swapout.
- */
- folio = ksm_might_need_to_copy(folio, vma, vmf->address);
- if (unlikely(!folio)) {
- ret = VM_FAULT_OOM;
- folio = swapcache;
- goto out_page;
- } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
- ret = VM_FAULT_HWPOISON;
- folio = swapcache;
- goto out_page;
- } else if (folio != swapcache)
- page = folio_page(folio, 0);
+ swap_update_readahead(folio, vma, vmf->address);
- /*
- * If we want to map a page that's in the swapcache writable, we
- * have to detect via the refcount if we're really the exclusive
- * owner. Try removing the extra reference from the local LRU
- * caches if required.
- */
- if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
- !folio_test_ksm(folio) && !folio_test_lru(folio))
- lru_add_drain();
+ /*
+ * KSM sometimes has to copy on read faults, for example, if
+ * page->index of !PageKSM() pages would be nonlinear inside the
+ * anon VMA -- PageKSM() is lost on actual swapout.
+ */
+ folio = ksm_might_need_to_copy(folio, vma, vmf->address);
+ if (unlikely(!folio)) {
+ ret = VM_FAULT_OOM;
+ folio = swapcache;
+ goto out_page;
+ } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
+ ret = VM_FAULT_HWPOISON;
+ folio = swapcache;
+ goto out_page;
+ } else if (folio != swapcache) {
+ page = folio_file_page(folio, swp_offset(entry));
}
+ /*
+ * If we want to map a page that's in the swapcache writable, we
+ * have to detect via the refcount if we're really the exclusive
+ * owner. Try removing the extra reference from the local LRU
+ * caches if required.
+ */
+ if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
+ !folio_test_ksm(folio) && !folio_test_lru(folio))
+ lru_add_drain();
+
folio_throttle_swaprate(folio, GFP_KERNEL);
/*
@@ -4710,44 +4695,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_nomap;
}
- /* allocated large folios for SWP_SYNCHRONOUS_IO */
- if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
- unsigned long nr = folio_nr_pages(folio);
- unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
- unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
- pte_t *folio_ptep = vmf->pte - idx;
- pte_t folio_pte = ptep_get(folio_ptep);
-
- if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
- swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
- goto out_nomap;
-
- page_idx = idx;
- address = folio_start;
- ptep = folio_ptep;
- goto check_folio;
- }
-
nr_pages = 1;
page_idx = 0;
address = vmf->address;
ptep = vmf->pte;
- if (folio_test_large(folio) && folio_test_swapcache(folio)) {
+ if (folio_test_large(folio)) {
unsigned long nr = folio_nr_pages(folio);
unsigned long idx = folio_page_idx(folio, page);
- unsigned long folio_address = address - idx * PAGE_SIZE;
+ unsigned long folio_address = vmf->address - idx * PAGE_SIZE;
pte_t *folio_ptep = vmf->pte - idx;
- if (!can_swapin_thp(vmf, folio_ptep, folio_address, nr))
+ if (can_swapin_thp(vmf, folio_ptep, folio_address, nr)) {
+ page_idx = idx;
+ address = folio_address;
+ ptep = folio_ptep;
+ nr_pages = nr;
+ entry = folio->swap;
+ page = &folio->page;
goto check_folio;
-
- page_idx = idx;
- address = folio_address;
- ptep = folio_ptep;
- nr_pages = nr;
- entry = folio->swap;
- page = &folio->page;
+ }
+ /*
+ * If it's a fresh large folio in the swap cache but the
+ * page table supporting it is gone, drop it and fallback
+ * to order 0 swap in again.
+ *
+ * The folio must be clean, nothing should have touched
+ * it, shmem removes the folio from swap cache upon
+ * swapin, and anon flag won't be gone once set.
+ * TODO: We might want to split or partially map it.
+ */
+ if (!folio_test_anon(folio)) {
+ WARN_ON_ONCE(folio_test_dirty(folio));
+ delete_from_swap_cache(folio);
+ goto out_nomap;
+ }
}
check_folio:
@@ -4767,7 +4749,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* the swap entry concurrently) for certainly exclusive pages.
*/
if (!folio_test_ksm(folio)) {
- exclusive = pte_swp_exclusive(vmf->orig_pte);
+ exclusive = check_swap_exclusive(folio, entry, ptep, nr_pages);
if (folio != swapcache) {
/*
* We have a fresh page that is not exposed to the
@@ -4805,15 +4787,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
arch_swap_restore(folio_swap(entry, folio), folio);
- /*
- * Remove the swap entry and conditionally try to free up the swapcache.
- * We're already holding a reference on the page but haven't mapped it
- * yet.
- */
- swap_free_nr(entry, nr_pages);
- if (should_try_to_free_swap(folio, vma, vmf->flags))
- folio_free_swap(folio);
-
add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
pte = mk_pte(page, vma->vm_page_prot);
@@ -4849,14 +4822,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
} else if (!folio_test_anon(folio)) {
- /*
- * We currently only expect small !anon folios which are either
- * fully exclusive or fully shared, or new allocated large
- * folios which are fully exclusive. If we ever get large
- * folios within swapcache here, we have to be careful.
- */
- VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
- VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
} else {
folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
@@ -4869,7 +4836,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
arch_do_swap_page_nr(vma->vm_mm, vma, address,
pte, pte, nr_pages);
+ /*
+ * Remove the swap entry and conditionally try to free up the
+ * swapcache then unlock the folio. Do this after the PTEs are
+ * set, so raced faults will see updated PTEs.
+ */
+ swap_free_nr(entry, nr_pages);
+ if (should_try_to_free_swap(folio, vma, vmf->flags))
+ folio_free_swap(folio);
folio_unlock(folio);
+
if (folio != swapcache && swapcache) {
/*
* Hold the lock to avoid the swap entry to be reused
@@ -4896,12 +4872,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
- /* Clear the swap cache pin for direct swapin after PTL unlock */
- if (need_clear_cache) {
- swapcache_clear(si, entry, nr_pages);
- if (waitqueue_active(&swapcache_wq))
- wake_up(&swapcache_wq);
- }
if (si)
put_swap_device(si);
return ret;
@@ -4916,11 +4886,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_unlock(swapcache);
folio_put(swapcache);
}
- if (need_clear_cache) {
- swapcache_clear(si, entry, nr_pages);
- if (waitqueue_active(&swapcache_wq))
- wake_up(&swapcache_wq);
- }
if (si)
put_swap_device(si);
return ret;
--
2.49.0