[RFC PATCH] mm: bypass swap readahead for zswap

* [RFC PATCH] mm: bypass swap readahead for zswap
@ 2026-06-24  7:55 Alexandre Ghiti
  2026-06-24 10:30 ` Kairui Song
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Alexandre Ghiti @ 2026-06-24  7:55 UTC (permalink / raw)
  To: akpm, hannes, yosry, nphamcs
  Cc: chengming.zhou, david, ljs, liam, vbabka, rppt, surenb, mhocko,
	kasong, chrisl, baohua, usama.arif, linux-mm, linux-kernel,
	Alexandre Ghiti

Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.

zswap is the same kind of in-memory, synchronous backend as zram, but it
is not a swap device flagged SWP_SYNCHRONOUS_IO, so it still goes through
swapin_readahead().

Here are the results from bypassing readahead for zswap too, measured
with a kernel build (make -j16) in a memcg, zswap=zstd, shrinker off, on
Sapphire Rapids, over 3 iterations.

  768M memcg (sustained swap thrash):
    metric                 mm-new    + bypass    delta
    build time (s)          405.0       341.7    -15.6%
    zswap-in (GB)            79.5        53.0     -33%
    zswap-out (GB)          144.8       115.6     -20%
    swap readahead (pages)  6.79M       0.45M     -93%
    swap_ra hit (%)          72.1        89.9     +18pp

  1G memcg (light pressure, build not memory-bound):
    metric                 mm-new    + bypass    delta
    build time (s)          177.7       176.0    ~same (no regression)
    zswap-in (GB)            10.2         7.5     -26%
    zswap-out (GB)           27.7        25.1      -9%
    swap readahead (pages)  1.07M       0.08M     -93%
    swap_ra hit (%)          68.6        87.2     +19pp

The gain is from no longer prefetching pages that are pointless for an
in-memory backend: readahead inflates anon residency and thrashes the
page cache (file pages get evicted and re-read), lengthens each fault by
synchronously (de)compressing a cluster of neighbours, and adds
compression traffic when those extra pages are reclaimed.

Bypassing swap readahead for zswap therefore makes sense.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---

- This bypass originally comes from Usama's series that implements
  large folio zswapin: while working on improving this series, I noticed
  the gains I got only came from the bypass of readahead.

 include/linux/zswap.h |  6 ++++++
 mm/memory.c           |  5 +++--
 mm/zswap.c            | 11 +++++++++++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..b6f0e6198b6f 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -35,6 +35,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
 void zswap_folio_swapin(struct folio *folio);
 bool zswap_is_enabled(void);
 bool zswap_never_enabled(void);
+bool zswap_present_test(swp_entry_t swp);
 #else
 
 struct zswap_lruvec_state {};
@@ -69,6 +70,11 @@ static inline bool zswap_never_enabled(void)
 	return true;
 }
 
+static inline bool zswap_present_test(swp_entry_t swp)
+{
+	return false;
+}
+
 #endif
 
 #endif /* _LINUX_ZSWAP_H */
diff --git a/mm/memory.c b/mm/memory.c
index ff338c2abe92..5aa1ea9eb48a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (folio)
 		swap_update_readahead(folio, vma, vmf->address);
 	if (!folio) {
-		/* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
-		if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+		/* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
+		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
+		    zswap_present_test(entry))
 			folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
 					    thp_swapin_suitable_orders(vmf) | BIT(0),
 					    vmf, NULL, 0);
diff --git a/mm/zswap.c b/mm/zswap.c
index 761cd699e0a3..5b85b4d17647 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -234,6 +234,17 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 		>> ZSWAP_ADDRESS_SPACE_SHIFT];
 }
 
+/**
+ * zswap_present_test - check if a swap entry is currently backed by zswap
+ * @swp: the swap entry to test
+ *
+ * Return: true if @swp has a zswap entry, false otherwise.
+ */
+bool zswap_present_test(swp_entry_t swp)
+{
+	return xa_load(swap_zswap_tree(swp), swp_offset(swp));
+}
+
 #define zswap_pool_debug(msg, p)			\
 	pr_debug("%s pool %s\n", msg, (p)->tfm_name)
 
-- 
2.54.0




* Re: [RFC PATCH] mm: bypass swap readahead for zswap
  2026-06-24  7:55 [RFC PATCH] mm: bypass swap readahead for zswap Alexandre Ghiti
@ 2026-06-24 10:30 ` Kairui Song
  2026-06-24 17:43   ` Nhat Pham
  2026-06-24 14:58 ` David Hildenbrand (Arm)
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 6+ messages in thread
From: Kairui Song @ 2026-06-24 10:30 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: akpm, hannes, yosry, nphamcs, chengming.zhou, david, ljs, liam,
	vbabka, rppt, surenb, mhocko, chrisl, baohua, usama.arif,
	linux-mm, linux-kernel

On Wed, Jun 24, 2026 at 3:59 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
> device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.
>
> zswap is the same kind of in-memory, synchronous backend as zram, not a
> swap device flagged SWP_SYNCHRONOUS_IO so it still goes through
> swapin_readahead().
>
> Here are the results from bypassing readahead for zswap too: it was
> measured with a kernel build (make -j16) in a memcg, zswap=zstd, shrinker
> off, on Sapphire Rapids and 3 iterations.
>
>   768M memcg (sustained swap thrash):
>     metric                 mm-new    + bypass    delta
>     build time (s)          405.0       341.7    -15.6%
>     zswap-in (GB)            79.5        53.0     -33%
>     zswap-out (GB)          144.8       115.6     -20%
>     swap readahead (pages)  6.79M       0.45M     -93%
>     swap_ra hit (%)          72.1        89.9     +18pp
>
>   1G memcg (light pressure, build not memory-bound):
>     metric                 mm-new    + bypass    delta
>     build time (s)          177.7       176.0    ~same (no regression)
>     zswap-in (GB)            10.2         7.5     -26%
>     zswap-out (GB)           27.7        25.1      -9%
>     swap readahead (pages)  1.07M       0.08M     -93%
>     swap_ra hit (%)          68.6        87.2     +19pp
>
> The gain is from no longer prefetching pages that are pointless for an
> in-memory backend: readahead inflates anon residency and thrashes the
> page cache (file pages get evicted and re-read), lengthens each fault by
> synchronously (de)compressing a cluster of neighbours, and adds
> compression traffic when those extra pages are reclaimed.
>
> Bypassing swap readahead for zswap therefore makes sense.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>
> - This bypass originally comes from Usama's series that implements
>   large folio zswapin: while working on improving this series, I noticed
>   the gains I got only came from the bypass of readahead.
>
>  include/linux/zswap.h |  6 ++++++
>  mm/memory.c           |  5 +++--
>  mm/zswap.c            | 11 +++++++++++
>  3 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 30c193a1207e..b6f0e6198b6f 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -35,6 +35,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
>  void zswap_folio_swapin(struct folio *folio);
>  bool zswap_is_enabled(void);
>  bool zswap_never_enabled(void);
> +bool zswap_present_test(swp_entry_t swp);
>  #else
>
>  struct zswap_lruvec_state {};
> @@ -69,6 +70,11 @@ static inline bool zswap_never_enabled(void)
>         return true;
>  }
>
> +static inline bool zswap_present_test(swp_entry_t swp)
> +{
> +       return false;
> +}
> +
>  #endif
>
>  #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/memory.c b/mm/memory.c
> index ff338c2abe92..5aa1ea9eb48a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (folio)
>                 swap_update_readahead(folio, vma, vmf->address);
>         if (!folio) {
> -               /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
> -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
> +               /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
> +               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
> +                   zswap_present_test(entry))

Hi Alexandre

Thanks for the test and patch, very interesting idea.

> diff --git a/mm/zswap.c b/mm/zswap.c
> index 761cd699e0a3..5b85b4d17647 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -234,6 +234,17 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>                 >> ZSWAP_ADDRESS_SPACE_SHIFT];
>  }
>
> +/**
> + * zswap_present_test - check if a swap entry is currently backed by zswap
> + * @swp: the swap entry to test
> + *
> + * Return: true if @swp has a zswap entry, false otherwise.
> + */
> +bool zswap_present_test(swp_entry_t swp)
> +{
> +       return xa_load(swap_zswap_tree(swp), swp_offset(swp));

Better check zswap_never_enabled first to avoid a xa_load if not needed.



* Re: [RFC PATCH] mm: bypass swap readahead for zswap
  2026-06-24  7:55 [RFC PATCH] mm: bypass swap readahead for zswap Alexandre Ghiti
  2026-06-24 10:30 ` Kairui Song
@ 2026-06-24 14:58 ` David Hildenbrand (Arm)
  2026-06-24 18:01 ` Yosry Ahmed
  2026-06-24 19:24 ` Barry Song
  3 siblings, 0 replies; 6+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-24 14:58 UTC (permalink / raw)
  To: Alexandre Ghiti, akpm, hannes, yosry, nphamcs
  Cc: chengming.zhou, ljs, liam, vbabka, rppt, surenb, mhocko, kasong,
	chrisl, baohua, usama.arif, linux-mm, linux-kernel

On 6/24/26 09:55, Alexandre Ghiti wrote:
> Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
> device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.
> 
> zswap is the same kind of in-memory, synchronous backend as zram, not a
> swap device flagged SWP_SYNCHRONOUS_IO so it still goes through
> swapin_readahead().
> 
> Here are the results from bypassing readahead for zswap too: it was
> measured with a kernel build (make -j16) in a memcg, zswap=zstd, shrinker
> off, on Sapphire Rapids and 3 iterations.
> 
>   768M memcg (sustained swap thrash):
>     metric                 mm-new    + bypass    delta
>     build time (s)          405.0       341.7    -15.6%
>     zswap-in (GB)            79.5        53.0     -33%
>     zswap-out (GB)          144.8       115.6     -20%
>     swap readahead (pages)  6.79M       0.45M     -93%
>     swap_ra hit (%)          72.1        89.9     +18pp
> 
>   1G memcg (light pressure, build not memory-bound):
>     metric                 mm-new    + bypass    delta
>     build time (s)          177.7       176.0    ~same (no regression)
>     zswap-in (GB)            10.2         7.5     -26%
>     zswap-out (GB)           27.7        25.1      -9%
>     swap readahead (pages)  1.07M       0.08M     -93%
>     swap_ra hit (%)          68.6        87.2     +19pp
> 
> The gain is from no longer prefetching pages that are pointless for an
> in-memory backend: readahead inflates anon residency and thrashes the
> page cache (file pages get evicted and re-read), lengthens each fault by
> synchronously (de)compressing a cluster of neighbours, and adds
> compression traffic when those extra pages are reclaimed.
> 
> Bypassing swap readahead for zswap therefore makes sense.
> 
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---

[...]

>  #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/memory.c b/mm/memory.c
> index ff338c2abe92..5aa1ea9eb48a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (folio)
>  		swap_update_readahead(folio, vma, vmf->address);
>  	if (!folio) {
> -		/* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
> -		if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
> +		/* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
> +		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
> +		    zswap_present_test(entry))

This should really be abstracted into a reasonably-named helper that can live in
swap code.
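
A minimal userspace sketch of what such a helper could look like. The
names swap_bypass_readahead_model(), zswap_present_model() and
struct swap_info_model are hypothetical stand-ins, the flag value is a
placeholder, and the stub lookup is invented purely for illustration; the
real helper would take a swp_entry_t and live next to the swap code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit for the model; not the kernel's actual flag value. */
#define SWP_SYNCHRONOUS_IO (1u << 0)

/* Userspace stand-in for struct swap_info_struct (assumption). */
struct swap_info_model {
	unsigned int flags;
};

/* Hypothetical stub: pretend only offsets below 4 are held by zswap. */
static bool zswap_present_model(unsigned long offset)
{
	return offset < 4;
}

/*
 * Hypothetical helper: true when swapin should skip readahead, either
 * because the device is synchronous or because the entry is in zswap.
 */
static bool swap_bypass_readahead_model(const struct swap_info_model *si,
					unsigned long offset)
{
	assert(si != NULL);
	if (si->flags & SWP_SYNCHRONOUS_IO)
		return true;
	return zswap_present_model(offset);
}
```

With a helper of this shape, the caller in do_swap_page() would only need
a single condition, and the policy stays in one place.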

-- 
Cheers,

David



* Re: [RFC PATCH] mm: bypass swap readahead for zswap
  2026-06-24 10:30 ` Kairui Song
@ 2026-06-24 17:43   ` Nhat Pham
  0 siblings, 0 replies; 6+ messages in thread
From: Nhat Pham @ 2026-06-24 17:43 UTC (permalink / raw)
  To: Kairui Song
  Cc: Alexandre Ghiti, akpm, hannes, yosry, chengming.zhou, david, ljs,
	liam, vbabka, rppt, surenb, mhocko, chrisl, baohua, usama.arif,
	linux-mm, linux-kernel

On Wed, Jun 24, 2026 at 3:31 AM Kairui Song <ryncsn@gmail.com> wrote:
>
>
> Better check zswap_never_enabled first to avoid a xa_load if not needed.

+1.

Maybe also xa_empty() while we're at it? :)
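
A rough userspace model of the ordering suggested here, cheapest checks
first. Everything in it is an assumption made for illustration: the
struct, the global flag and the *_model names are hypothetical, the
nr_entries test stands in for xa_empty(), and the slot read stands in for
xa_load(). The lookups counter only exists to show when the full lookup
is skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SLOTS 64

/* Userspace model of one zswap tree: slot i is non-NULL if offset i is stored. */
struct zswap_tree_model {
	void *slots[MODEL_SLOTS];
	size_t nr_entries;      /* lets the model answer "is it empty?" cheaply */
	unsigned long lookups;  /* number of full lookups actually performed */
};

/* Stand-in for zswap_never_enabled(); starts out true, as on a fresh boot. */
static bool zswap_never_enabled_model = true;

static void model_store(struct zswap_tree_model *t, unsigned long off, void *e)
{
	assert(off < MODEL_SLOTS && e != NULL);
	if (!t->slots[off])
		t->nr_entries++;
	t->slots[off] = e;
}

static bool zswap_present_model(struct zswap_tree_model *t, unsigned long off)
{
	assert(off < MODEL_SLOTS);
	if (zswap_never_enabled_model)  /* cheapest check first */
		return false;
	if (t->nr_entries == 0)         /* stands in for xa_empty() */
		return false;
	t->lookups++;                   /* stands in for xa_load() */
	return t->slots[off] != NULL;
}
```

In this model the full lookup only runs once zswap has been enabled and
the tree holds at least one entry.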



* Re: [RFC PATCH] mm: bypass swap readahead for zswap
  2026-06-24  7:55 [RFC PATCH] mm: bypass swap readahead for zswap Alexandre Ghiti
  2026-06-24 10:30 ` Kairui Song
  2026-06-24 14:58 ` David Hildenbrand (Arm)
@ 2026-06-24 18:01 ` Yosry Ahmed
  2026-06-24 19:24 ` Barry Song
  3 siblings, 0 replies; 6+ messages in thread
From: Yosry Ahmed @ 2026-06-24 18:01 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: akpm, hannes, nphamcs, chengming.zhou, david, ljs, liam, vbabka,
	rppt, surenb, mhocko, kasong, chrisl, baohua, usama.arif,
	linux-mm, linux-kernel

On Wed, Jun 24, 2026 at 12:57 AM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
> device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.
>
> zswap is the same kind of in-memory, synchronous backend as zram, not a
> swap device flagged SWP_SYNCHRONOUS_IO so it still goes through
> swapin_readahead().
>
> Here are the results from bypassing readahead for zswap too: it was
> measured with a kernel build (make -j16) in a memcg, zswap=zstd, shrinker
> off, on Sapphire Rapids and 3 iterations.
>
>   768M memcg (sustained swap thrash):
>     metric                 mm-new    + bypass    delta
>     build time (s)          405.0       341.7    -15.6%
>     zswap-in (GB)            79.5        53.0     -33%
>     zswap-out (GB)          144.8       115.6     -20%
>     swap readahead (pages)  6.79M       0.45M     -93%
>     swap_ra hit (%)          72.1        89.9     +18pp
>
>   1G memcg (light pressure, build not memory-bound):
>     metric                 mm-new    + bypass    delta
>     build time (s)          177.7       176.0    ~same (no regression)
>     zswap-in (GB)            10.2         7.5     -26%
>     zswap-out (GB)           27.7        25.1      -9%
>     swap readahead (pages)  1.07M       0.08M     -93%
>     swap_ra hit (%)          68.6        87.2     +19pp
>
> The gain is from no longer prefetching pages that are pointless for an
> in-memory backend: readahead inflates anon residency and thrashes the
> page cache (file pages get evicted and re-read), lengthens each fault by
> synchronously (de)compressing a cluster of neighbours, and adds
> compression traffic when those extra pages are reclaimed.
>
> Bypassing swap readahead for zswap therefore makes sense.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>
> - This bypass originally comes from Usama's series that implements
>   large folio zswapin: while working on improving this series, I noticed
>   the gains I got only came from the bypass of readahead.
>
>  include/linux/zswap.h |  6 ++++++
>  mm/memory.c           |  5 +++--
>  mm/zswap.c            | 11 +++++++++++
>  3 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 30c193a1207e..b6f0e6198b6f 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -35,6 +35,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
>  void zswap_folio_swapin(struct folio *folio);
>  bool zswap_is_enabled(void);
>  bool zswap_never_enabled(void);
> +bool zswap_present_test(swp_entry_t swp);
>  #else
>
>  struct zswap_lruvec_state {};
> @@ -69,6 +70,11 @@ static inline bool zswap_never_enabled(void)
>         return true;
>  }
>
> +static inline bool zswap_present_test(swp_entry_t swp)
> +{
> +       return false;
> +}
> +
>  #endif
>
>  #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/memory.c b/mm/memory.c
> index ff338c2abe92..5aa1ea9eb48a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (folio)
>                 swap_update_readahead(folio, vma, vmf->address);
>         if (!folio) {
> -               /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
> -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
> +               /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
> +               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
> +                   zswap_present_test(entry))

This assumes that if the swap entry is in zswap, then the remaining
entries (covered by the readahead window) will also be in zswap,
right? While not very likely, it's possible that the remaining entries
are not in zswap but on disk, right?
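
To make the concern concrete, here is a small userspace sketch that
counts how many entries of a readahead window are not in zswap. The
function name and the in_zswap[] array are hypothetical; the array simply
stands in for a per-offset zswap lookup and is not how the kernel tracks
this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Count how many swap offsets in [start, start + nr) are NOT held in the
 * modelled zswap. Offsets at or beyond len are ignored.
 */
static size_t window_entries_on_disk(const bool *in_zswap, size_t len,
				     size_t start, size_t nr)
{
	size_t on_disk = 0;

	assert(in_zswap != NULL && start <= len);
	for (size_t i = start; i < start + nr && i < len; i++)
		if (!in_zswap[i])
			on_disk++;
	return on_disk;
}
```

A non-zero result for the window around the faulting entry is the case
where skipping readahead gives up prefetching from the disk.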

>                         folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
>                                             thp_swapin_suitable_orders(vmf) | BIT(0),
>                                             vmf, NULL, 0);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 761cd699e0a3..5b85b4d17647 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -234,6 +234,17 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>                 >> ZSWAP_ADDRESS_SPACE_SHIFT];
>  }
>
> +/**
> + * zswap_present_test - check if a swap entry is currently backed by zswap
> + * @swp: the swap entry to test
> + *
> + * Return: true if @swp has a zswap entry, false otherwise.
> + */
> +bool zswap_present_test(swp_entry_t swp)

zswap_is_present()?

> +{
> +       return xa_load(swap_zswap_tree(swp), swp_offset(swp));
> +}
> +
>  #define zswap_pool_debug(msg, p)                       \
>         pr_debug("%s pool %s\n", msg, (p)->tfm_name)
>
> --
> 2.54.0
>
>



* Re: [RFC PATCH] mm: bypass swap readahead for zswap
  2026-06-24  7:55 [RFC PATCH] mm: bypass swap readahead for zswap Alexandre Ghiti
                   ` (2 preceding siblings ...)
  2026-06-24 18:01 ` Yosry Ahmed
@ 2026-06-24 19:24 ` Barry Song
  3 siblings, 0 replies; 6+ messages in thread
From: Barry Song @ 2026-06-24 19:24 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: akpm, hannes, yosry, nphamcs, chengming.zhou, david, ljs, liam,
	vbabka, rppt, surenb, mhocko, kasong, chrisl, usama.arif,
	linux-mm, linux-kernel

On Wed, Jun 24, 2026 at 3:57 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
> device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.
>
> zswap is the same kind of in-memory, synchronous backend as zram, not a
> swap device flagged SWP_SYNCHRONOUS_IO so it still goes through
> swapin_readahead().
>
> Here are the results from bypassing readahead for zswap too: it was
> measured with a kernel build (make -j16) in a memcg, zswap=zstd, shrinker
> off, on Sapphire Rapids and 3 iterations.
>
>   768M memcg (sustained swap thrash):
>     metric                 mm-new    + bypass    delta
>     build time (s)          405.0       341.7    -15.6%
>     zswap-in (GB)            79.5        53.0     -33%
>     zswap-out (GB)          144.8       115.6     -20%
>     swap readahead (pages)  6.79M       0.45M     -93%
>     swap_ra hit (%)          72.1        89.9     +18pp
>
>   1G memcg (light pressure, build not memory-bound):
>     metric                 mm-new    + bypass    delta
>     build time (s)          177.7       176.0    ~same (no regression)
>     zswap-in (GB)            10.2         7.5     -26%
>     zswap-out (GB)           27.7        25.1      -9%
>     swap readahead (pages)  1.07M       0.08M     -93%
>     swap_ra hit (%)          68.6        87.2     +19pp
>
> The gain is from no longer prefetching pages that are pointless for an
> in-memory backend: readahead inflates anon residency and thrashes the
> page cache (file pages get evicted and re-read), lengthens each fault by
> synchronously (de)compressing a cluster of neighbours, and adds
> compression traffic when those extra pages are reclaimed.
>
> Bypassing swap readahead for zswap therefore makes sense.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>
> - This bypass originally comes from Usama's series that implements
>   large folio zswapin: while working on improving this series, I noticed
>   the gains I got only came from the bypass of readahead.
>
>  include/linux/zswap.h |  6 ++++++
>  mm/memory.c           |  5 +++--
>  mm/zswap.c            | 11 +++++++++++
>  3 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 30c193a1207e..b6f0e6198b6f 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -35,6 +35,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
>  void zswap_folio_swapin(struct folio *folio);
>  bool zswap_is_enabled(void);
>  bool zswap_never_enabled(void);
> +bool zswap_present_test(swp_entry_t swp);
>  #else
>
>  struct zswap_lruvec_state {};
> @@ -69,6 +70,11 @@ static inline bool zswap_never_enabled(void)
>         return true;
>  }
>
> +static inline bool zswap_present_test(swp_entry_t swp)
> +{
> +       return false;
> +}
> +
>  #endif
>
>  #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/memory.c b/mm/memory.c
> index ff338c2abe92..5aa1ea9eb48a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (folio)
>                 swap_update_readahead(folio, vma, vmf->address);
>         if (!folio) {
> -               /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
> -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
> +               /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
> +               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
> +                   zswap_present_test(entry))
>                         folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
>                                             thp_swapin_suitable_orders(vmf) | BIT(0),
>                                             vmf, NULL, 0);

Basically, I have been seeing the same issue recently. If the
readahead swap entries are also in zswap, we end up doing the
decompression during one page fault, but then need another page fault
to fetch the page from the swap cache and install the mapping. In that
case, readahead may not be beneficial.

On the other hand, if the readahead swap entries are not in zswap, the
situation is different.

For example, suppose we fault on the swap entry for address 1 MB and
readahead brings in the entry for 1 MB + 4 KB. If both entries are in
zswap, readahead does not seem like a good trade-off. However, if the
1 MB + 4 KB entry is not in zswap and would otherwise require storage
I/O, then readahead can be beneficial.

So I implemented a rather ugly fault_around-like mechanism in
do_swap_page(). At least with page-cluster == 1, I am seeing a
performance improvement, as the readahead folios can be mapped
directly and do not require a second page fault.

It is admittedly quite ugly and is only meant as a proof of concept :-)

Subject: [PATCH PoC] mm: enable do_swap_page fault_around

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 mm/memory.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index c00a31a6d1d0..1db79f45a575 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4736,6 +4736,100 @@ static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
         } while (--nr_pages);
 }

+static void do_swap_map_around(struct vm_fault *vmf, struct swap_info_struct *si)
+{
+        struct vm_area_struct *vma = vmf->vma;
+        int nr_around = 1 << page_cluster;
+        unsigned long start = max3(vma->vm_start, vmf->address - (nr_around - 1) * PAGE_SIZE,
+                        vmf->address & PMD_MASK);
+        unsigned long end = min3(vma->vm_end, vmf->address + nr_around * PAGE_SIZE,
+                        (vmf->address & PMD_MASK) + PMD_SIZE);
+        unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
+        unsigned long delta_pages = (vmf->address - start) >> PAGE_SHIFT;
+        pte_t *ptep = vmf->pte - delta_pages;
+
+        for (int i = 0; i < nr_pages; i++, ptep++) {
+                unsigned long address = start + (i << PAGE_SHIFT);
+                rmap_t rmap_flags = RMAP_NONE;
+                pte_t orig_pte, pte;
+                struct folio *folio;
+                struct page *page;
+                softleaf_t entry;
+                bool exclusive;
+
+                if (ptep == vmf->pte)
+                        continue;
+                orig_pte = ptep_get(ptep);
+                exclusive = pte_swp_exclusive(orig_pte);
+                if (!exclusive)
+                        continue;
+                entry = softleaf_from_pte(orig_pte);
+                if (!softleaf_is_swap(entry))
+                        continue;
+                folio = swap_cache_get_folio(entry);
+                if (!folio)
+                        continue;
+                if (unlikely(!folio_matches_swap_entry(folio, entry)))
+                        goto skip;
+                if (folio_test_locked(folio))
+                        goto skip;
+                if (!folio_test_uptodate(folio))
+                        goto skip;
+                if (!folio_trylock(folio))
+                        goto skip;
+                if (folio_test_ksm(folio) || folio_test_large(folio) ||
+                        !folio_test_uptodate(folio))
+                        goto unlock;
+                if (exclusive && folio_test_writeback(folio) &&
+                                data_race(si->flags & SWP_STABLE_WRITES))
+                        exclusive = false;
+
+                arch_swap_restore(folio_swap(entry, folio), folio);
+
+                page = folio_page(folio, 0);
+                add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+                add_mm_counter(vma->vm_mm, MM_SWAPENTS, -1);
+                pte = mk_pte(page, vma->vm_page_prot);
+                if (pte_swp_soft_dirty(orig_pte))
+                        pte = pte_mksoft_dirty(pte);
+                if (pte_swp_uffd_wp(orig_pte))
+                        pte = pte_mkuffd_wp(pte);
+
+                if (exclusive) {
+                        if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
+                                        !pte_needs_soft_dirty_wp(vma, pte)) {
+                                pte = pte_mkwrite(pte, vma);
+                        }
+                        rmap_flags |= RMAP_EXCLUSIVE;
+                }
+                flush_icache_pages(vma, page, 1);
+
+                if (!folio_test_anon(folio)) {
+                        folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
+                        folio_put_swap(folio, NULL);
+                } else {
+                        folio_add_anon_rmap_ptes(folio, page, 1, vma, address,
+                                        rmap_flags);
+                        folio_put_swap(folio, page);
+                }
+
+                set_ptes(vma->vm_mm, address, ptep, pte, 1);
+                arch_do_swap_page_nr(vma->vm_mm, vma, address,
+                                pte, pte, 1);
+
+                if (should_try_to_free_swap(si, folio, vma, 1, vmf->flags))
+                        folio_free_swap(folio);
+                folio_unlock(folio);
+                swap_update_readahead(folio, vma, address);
+                update_mmu_cache_range(vmf, vma, address, ptep, 1);
+                continue;
+unlock:
+                folio_unlock(folio);
+skip:
+                folio_put(folio);
+        };
+}
+
 /*
  * We enter with either the VMA lock or the mmap_lock held (see
  * FAULT_FLAG_VMA_LOCK), and pte mapped but not yet locked.
@@ -5121,6 +5215,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)

         /* No need to invalidate - it was non-present before */
         update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
+        do_swap_map_around(vmf, si);
 unlock:
         if (vmf->pte)
                 pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.34.1
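
The window arithmetic at the top of do_swap_map_around() can be tried out
in userspace. The sketch below mirrors the max3()/min3() clamping from the
PoC, assuming 4 KiB pages and a 2 MiB PMD (both are assumptions of this
model, as are the MODEL_* macros, struct map_window and the function
names). It also assumes a page-aligned fault address that is large enough
for the subtraction not to wrap.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12UL
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)
#define MODEL_PMD_SIZE   (1UL << 21)
#define MODEL_PMD_MASK   (~(MODEL_PMD_SIZE - 1))

struct map_window {
	unsigned long start;        /* first address considered */
	unsigned long end;          /* one past the last address considered */
	unsigned long nr_pages;     /* number of PTEs in the window */
	unsigned long delta_pages;  /* index of the faulting page in the window */
};

static unsigned long max3ul(unsigned long a, unsigned long b, unsigned long c)
{
	unsigned long m = a > b ? a : b;

	return m > c ? m : c;
}

static unsigned long min3ul(unsigned long a, unsigned long b, unsigned long c)
{
	unsigned long m = a < b ? a : b;

	return m < c ? m : c;
}

/* Clamp the window to the VMA and to the PMD that contains the fault. */
static struct map_window swap_map_window(unsigned long vm_start,
					 unsigned long vm_end,
					 unsigned long address,
					 unsigned int page_cluster)
{
	unsigned long nr_around = 1UL << page_cluster;
	unsigned long pmd_base = address & MODEL_PMD_MASK;
	struct map_window w;

	assert(address >= vm_start && address < vm_end);
	assert(address >= (nr_around - 1) * MODEL_PAGE_SIZE); /* no wraparound */

	w.start = max3ul(vm_start,
			 address - (nr_around - 1) * MODEL_PAGE_SIZE,
			 pmd_base);
	w.end = min3ul(vm_end,
		       address + nr_around * MODEL_PAGE_SIZE,
		       pmd_base + MODEL_PMD_SIZE);
	w.nr_pages = (w.end - w.start) >> MODEL_PAGE_SHIFT;
	w.delta_pages = (address - w.start) >> MODEL_PAGE_SHIFT;
	return w;
}
```

For a fault at 1 MB with page-cluster == 1 in a VMA spanning 0x80000 to
0x400000, this gives a three-page window from 0xff000 to 0x102000, with
the faulting page at index 1, so the 1 MB + 4 KB neighbour from the
example above falls inside it.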

