* [PATCH] mm/zswap: store <PAGE_SIZE compression failed page as-is
@ 2025-08-07 18:16 SeongJae Park
From: SeongJae Park @ 2025-08-07 18:16 UTC
Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
Johannes Weiner, Nhat Pham, Yosry Ahmed, kernel-team,
linux-kernel, linux-mm, Takero Funaki
When zswap writeback is enabled and compression of a given page fails,
the page is swapped out to the backing swap device. This behavior
breaks zswap's writeback LRU order, and hence users can experience
unexpected latency spikes. If the page is compressed without failure
but the result is PAGE_SIZE in size, the LRU order is kept, but the
decompression overhead for loading the page back on a later access is
unnecessary.
Keep the LRU order and avoid the unnecessary decompression overhead in
those cases by storing the original content as-is in the zpool. The
length field of the zswap_entry is set to PAGE_SIZE accordingly.
Hence, whether an entry is saved as-is (and decompression is
unnecessary) is identified by 'zswap_entry->length == PAGE_SIZE'.
Because the uncompressed data is saved in the zpool, same as the
compressed data, this introduces no change in terms of memory
management, including the movability and migratability of the involved
pages.
This change also does not increase the per-entry zswap metadata
overhead. But as the number of incompressible pages increases, the
total zswap metadata overhead increases proportionally. The overhead
should not be problematic in usual cases, since the zswap metadata for
a single entry is much smaller than PAGE_SIZE, and in common zswap use
cases there should be a sufficient amount of compressible pages. It
can also be mitigated by zswap writeback.
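For a rough sense of the scale (my back-of-the-envelope estimate, not
a measured number): assuming a zswap_entry of around 56 bytes on a
64-bit kernel, storing a 4096-byte page as-is adds about 56 / 4096,
i.e. under 1.5%, of metadata on top of the stored content.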
When writeback is disabled, the additional overhead could be
problematic. For that case, keep the current behavior of returning
the failure and letting swap_writeout() put the page back on the
active LRU list. This is known to be suboptimal when the
incompressible pages are cold, since such pages will repeatedly be
selected for zswap-out, burning CPU cycles on compression attempts
that will fail anyway. One imaginable solution for that problem is
reusing the swapped-out page and its struct page to store the content
in the zswap pool. But that is out of the scope of this patch.
Tests
-----
I tested this patch using a simple self-written microbenchmark that is
available on GitHub[1]. You can reproduce my test by executing
run_tests.sh from the repo on your system. Note that the repo's
documentation is not great as of this writing, so you may need to read
and use the code.
The basic test scenario is simple: run a test program that makes
artificial accesses to memory with artificial content under a
memory.high-set memory limit, and measure how many accesses were made
in a given time.

The test program repeatedly and randomly accesses three anonymous
memory regions. The regions are each 500 MiB in size and are accessed
with equal probability. Two of them are filled with simple content
that compresses easily, while the remaining one is filled with content
read from /dev/urandom, which usually fails to compress to less than
PAGE_SIZE. The program runs for two minutes and prints out the number
of accesses made every five seconds.
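For illustration, below is a minimal sketch of such a test program.
This is not the actual eval_zswap code; error handling is mostly
omitted and the fill contents are my simplifications.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NR_REGIONS	3
#define REGION_SZ	(500UL << 20)	/* 500 MiB per region */
#define RUNTIME_SECS	120
#define REPORT_SECS	5

int main(void)
{
	char *regions[NR_REGIONS];
	FILE *urandom = fopen("/dev/urandom", "r");
	time_t start = time(NULL), last_report = start;
	unsigned long nr_accesses = 0;
	int i;

	if (!urandom)
		return 1;
	for (i = 0; i < NR_REGIONS; i++) {
		regions[i] = malloc(REGION_SZ);
		if (i < NR_REGIONS - 1)
			/* two regions: easily compressible content */
			memset(regions[i], 'x', REGION_SZ);
		else
			/* last region: incompressible content */
			fread(regions[i], 1, REGION_SZ, urandom);
	}

	while (time(NULL) - start < RUNTIME_SECS) {
		/* access the regions randomly, with equal probability */
		regions[rand() % NR_REGIONS][rand() % REGION_SZ]++;
		nr_accesses++;
		if (time(NULL) - last_report >= REPORT_SECS) {
			printf("%lu\n", nr_accesses);
			nr_accesses = 0;
			last_report = time(NULL);
		}
	}
	return 0;
}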
The test script runs the program under the four configurations below.
- 0: memory.high is set to 2 GiB, zswap is disabled.
- 1-1: memory.high is set to 1350 MiB, zswap is disabled.
- 1-2: On 1-1, zswap is enabled without this patch.
- 1-3: On 1-2, this patch is applied.
For all zswap-enabled cases, the zswap shrinker is enabled.
Configuration '0' shows the original memory performance.
Configurations 1-1, 1-2 and 1-3 show the performance of swap, zswap,
and this patch under a level of memory pressure (~10% of the working
set).
Because the per-5-seconds performance is not very reliable, I measured
its average over the last one minute of each test program run. I also
measured a few vmstat counters, including zswpin, zswpout, zswpwb,
pswpin and pswpout, during the test runs.
The measurement results are below. To save space, I show only
performance numbers normalized to that of configuration '0' (no memory
pressure). The average number of accesses per 5 seconds for
configuration '0' was 36493417.75.
config 0 1-1 1-2 1-3
perf_normalized 1.0000 0.0057 0.0235 0.0367
perf_stdev_ratio 0.0582 0.0652 0.0167 0.0346
zswpin 0 0 3548424 1999335
zswpout 0 0 3588817 2361689
zswpwb 0 0 10214 340270
pswpin 0 485806 772038 340967
pswpout 0 649543 144773 340270
'perf_normalized' is the performance metric, normalized to that of
configuration '0' (no pressure). 'perf_stdev_ratio' is the standard
deviation of the averaged data points, as a ratio to the averaged
metric value. For example, configuration '0' showed 5.8% stdev, and
configurations 1-1 and 1-3 showed about 6.5% and 3.5% stdev,
respectively. The results were also highly variable between multiple
runs. So these numbers are not very stable and give only ballpark
figures. Please keep this in mind when reading them.
Under about 10% working-set memory pressure, performance dropped to
about 0.57% of the no-pressure case when normal swap was used (1-1).
Note that ~10% working-set pressure is already extreme, at least on
this test setup. No one would want a system setup that degrades
performance to 0.57% of the best case.
Turning zswap on (1-2) improved performance about 4x, to about 2.35%
of the no-pressure case. Because of the incompressible pages in the
third memory region, though, a significant amount of (non-zswap) swap
I/O was still made.
Applying this patch (1-3) improved performance by about 56%, to about
3.67% of the no-pressure case. The reduced pswpin of 1-3 compared to
1-2 shows where this improvement came from.
Related Works
-------------
This is not an entirely new attempt. Nhat Pham and Takero Funaki
tried very similar approaches in October 2023[2] and April 2024[3],
respectively. The two approaches didn't get merged mainly due to
metadata overhead concerns. I described at the beginning of this
changelog why I think that shouldn't be a problem for this change,
which is automatically disabled when writeback is disabled.

This patch is not particularly different from those, and is actually
built upon them, though I wrote it from scratch again. Hence the
Suggested-by tags. Nhat actually first suggested this to me off-list.
[1] https://github.com/sjp38/eval_zswap/blob/master/run.sh
[2] https://lore.kernel.org/20231017003519.1426574-3-nphamcs@gmail.com
[3] https://lore.kernel.org/20240706022523.1104080-6-flintglass@gmail.com
Suggested-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Takero Funaki <flintglass@gmail.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
Changes from RFC v2
(https://lore.kernel.org/20250805002954.1496-1-sj@kernel.org)
- Fix race conditions at decompressed pages identification.
- Remove the parameter and make saving as-is the default behavior.
- Open-code main changes.
- Clarify there are no memory management changes in the cover letter.
- Remove 20% pressure case from test results, since it is arguably too
extreme and only adds confusion.
- Drop RFC tag.
Changes from RFC v1
(https://lore.kernel.org/20250730234059.4603-1-sj@kernel.org)
- Consider PAGE_SIZE compression successes as failures.
- Use zpool for storing incompressible pages.
- Test with zswap shrinker enabled.
- Wordsmith changelog and comments.
- Add documentation of save_incompressible_pages parameter.
mm/zswap.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 3c0fd8a13718..2db2da130ec4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -976,8 +976,25 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
*/
comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
dlen = acomp_ctx->req->dlen;
- if (comp_ret)
- goto unlock;
+
+ /*
+ * If a page cannot be compressed into a size smaller than PAGE_SIZE,
+ * save the content as is without a compression, to keep the LRU order
+ * of writebacks. If writeback is disabled, reject the page since it
+ * only adds metadata overhead. swap_writeout() will put the page back
+ * to the active LRU list in the case.
+ */
+ if (comp_ret || dlen >= PAGE_SIZE) {
+ if (mem_cgroup_zswap_writeback_enabled(
+ folio_memcg(page_folio(page)))) {
+ comp_ret = 0;
+ dlen = PAGE_SIZE;
+ memcpy_from_page(dst, page, 0, dlen);
+ } else {
+ comp_ret = comp_ret ? : -EINVAL;
+ goto unlock;
+ }
+ }
zpool = pool->zpool;
gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
@@ -1012,6 +1029,14 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
+ /* zswap entries of PAGE_SIZE are not compressed. */
+ if (entry->length == PAGE_SIZE) {
+ memcpy_to_folio(folio, 0, obj, entry->length);
+ zpool_obj_read_end(zpool, entry->handle, obj);
+ acomp_ctx_put_unlock(acomp_ctx);
+ return true;
+ }
+
/*
* zpool_obj_read_begin() might return a kmap address of highmem when
* acomp_ctx->buffer is not used. However, sg_init_one() does not
base-commit: 2ec534125ae474292175ae198483c545eac2161d
--
2.39.5
* Re: [PATCH] mm/zswap: store <PAGE_SIZE compression failed page as-is
From: SeongJae Park @ 2025-08-07 18:32 UTC
To: SeongJae Park
Cc: Andrew Morton, Chengming Zhou, David Hildenbrand, Johannes Weiner,
Nhat Pham, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
Takero Funaki
On Thu, 7 Aug 2025 11:16:16 -0700 SeongJae Park <sj@kernel.org> wrote:
> Tests
> -----
>
> I tested this patch using a simple self-written microbenchmark that is
> available on GitHub[1]. You can reproduce my test by executing
> run_tests.sh from the repo on your system. Note that the repo's
> documentation is not great as of this writing, so you may need to read
> and use the code.
The test script assumes the kernel has the save_incompressible_pages
zswap parameter, which was dropped in this version of the patch. To
run the test script as is, you may need to build your test kernel with
a patch that reintroduces the parameter. An example of such a patch is
available[1] in my GitHub tree.
[1] https://github.com/sjp38/linux/commit/3a8122e9cb947c13878f5fab1cd2d106be3f3d2f
Thanks,
SJ
[...]
* Re: [PATCH] mm/zswap: store <PAGE_SIZE compression failed page as-is
From: Nhat Pham @ 2025-08-07 23:03 UTC
To: SeongJae Park
Cc: Andrew Morton, Chengming Zhou, David Hildenbrand, Johannes Weiner,
Yosry Ahmed, kernel-team, linux-kernel, linux-mm, Takero Funaki
On Thu, Aug 7, 2025 at 11:16 AM SeongJae Park <sj@kernel.org> wrote:
>
> When zswap writeback is enabled and compression of a given page fails,
> the page is swapped out to the backing swap device. This behavior
> breaks zswap's writeback LRU order, and hence users can experience
> unexpected latency spikes. If the page is compressed without failure
> but the result is PAGE_SIZE in size, the LRU order is kept, but the
> decompression overhead for loading the page back on a later access is
> unnecessary.
Maybe add the writeback-disabled story - we added it to get around this
latency issue, but that solution was not satisfactory either: wasting
CPU cycles retrying incompressible pages, especially when we're under
reclaim/memory pressure and we've reclaimed most if not all other
sources.
This problem was pointed out by Yosry:
https://lore.kernel.org/all/CAJD7tkZXS-UJVAFfvxJ0nNgTzWBiqepPYA4hEozi01_qktkitg@mail.gmail.com/
[...]
> Applying this patch (1-3) improved performance by about 56%, to about
> 3.67% of the no-pressure case. The reduced pswpin of 1-3 compared to
> 1-2 shows where this improvement came from.
Nice.
[...]
> @@ -976,8 +976,25 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> */
> comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> dlen = acomp_ctx->req->dlen;
> - if (comp_ret)
> - goto unlock;
> +
> + /*
> + * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> + * save the content as is without a compression, to keep the LRU order
> + * of writebacks. If writeback is disabled, reject the page since it
> + * only adds metadata overhead. swap_writeout() will put the page back
> + * to the active LRU list in the case.
> + */
> + if (comp_ret || dlen >= PAGE_SIZE) {
> + if (mem_cgroup_zswap_writeback_enabled(
> + folio_memcg(page_folio(page)))) {
> + comp_ret = 0;
> + dlen = PAGE_SIZE;
> + memcpy_from_page(dst, page, 0, dlen);
I wonder if we can save one memcpy here. Would it be safe to map the page:
dst = kmap_local_page(page);
then, after we're done with storing (i.e. after zpool_obj_write()), do:
if (dlen == PAGE_SIZE)
kunmap(dst);
(don't forget to unmap the page in the failure paths too!)
> + } else {
> + comp_ret = comp_ret ? : -EINVAL;
Does this keep the value of comp_ret if comp_ret != 0 lol. Syntax
looks weird to me.
> + goto unlock;
> + }
> + }
Also, can we fix the counter value?
I assume we want:
else if (comp_ret || dlen == PAGE_SIZE)
zswap_reject_compress_fail++;
or something like that.
And what happened to the incompressible page stored counter? :)
>
> zpool = pool->zpool;
> gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> @@ -1012,6 +1029,14 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
> obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
>
> + /* zswap entries of PAGE_SIZE are not compressed. */
of length?
> + if (entry->length == PAGE_SIZE) {
> + memcpy_to_folio(folio, 0, obj, entry->length);
> + zpool_obj_read_end(zpool, entry->handle, obj);
> + acomp_ctx_put_unlock(acomp_ctx);
> + return true;
> + }
> +
> /*
> * zpool_obj_read_begin() might return a kmap address of highmem when
> * acomp_ctx->buffer is not used. However, sg_init_one() does not
* Re: [PATCH] mm/zswap: store <PAGE_SIZE compression failed page as-is
From: Nhat Pham @ 2025-08-07 23:07 UTC
To: SeongJae Park
Cc: Andrew Morton, Chengming Zhou, David Hildenbrand, Johannes Weiner,
Yosry Ahmed, kernel-team, linux-kernel, linux-mm, Takero Funaki
On Thu, Aug 7, 2025 at 4:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
>
> if (dlen == PAGE_SIZE)
> kunmap(dst);
should be kunmap_local - apparently it's a thing :)
* Re: [PATCH] mm/zswap: store <PAGE_SIZE compression failed page as-is
From: SeongJae Park @ 2025-08-07 23:54 UTC
To: Nhat Pham
Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
Johannes Weiner, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
Takero Funaki
On Thu, 7 Aug 2025 16:03:54 -0700 Nhat Pham <nphamcs@gmail.com> wrote:
> On Thu, Aug 7, 2025 at 11:16 AM SeongJae Park <sj@kernel.org> wrote:
[...]
>
> Maybe add the writeback disabled story - we added to get around this
> latency issue, but that solution was not satisfactory either: wasting
> cpu cycles retrying incompressible pages, especially when we're under
> reclaim/memory pressure, and we've reclaimed most if not all other
> sources.
>
> This problem was pointed out by Yosry:
>
> https://lore.kernel.org/all/CAJD7tkZXS-UJVAFfvxJ0nNgTzWBiqepPYA4hEozi01_qktkitg@mail.gmail.com/
Thank you for sharing the detailed context. I will add this history,
with the link, to the last paragraph of this section or to the
'Related Works' section.
[...]
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -976,8 +976,25 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > */
> > comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > dlen = acomp_ctx->req->dlen;
> > - if (comp_ret)
> > - goto unlock;
> > +
> > + /*
> > + * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > + * save the content as is without a compression, to keep the LRU order
> > + * of writebacks. If writeback is disabled, reject the page since it
> > + * only adds metadata overhead. swap_writeout() will put the page back
> > + * to the active LRU list in the case.
> > + */
> > + if (comp_ret || dlen >= PAGE_SIZE) {
> > + if (mem_cgroup_zswap_writeback_enabled(
> > + folio_memcg(page_folio(page)))) {
> > + comp_ret = 0;
> > + dlen = PAGE_SIZE;
> > + memcpy_from_page(dst, page, 0, dlen);
>
> I wonder if we can save one memcpy here. Would it be safe to map the page:
>
> dst = kmap_local_page(page);
Sounds good, I will do so in the next version.
>
> then, after we're done with storing (i.e after zpool_obj_write()), do:
>
> > if (dlen == PAGE_SIZE)
> kunmap(dst);
>
> (don't forget to unmap the page in the failure paths too!)
Sure, I think the 'unlock' label would be an appropriate place to add
the unmap code. I also got your s/kunmap/kunmap_local/ correction from
the reply. I will not miss that in the next version.
>
> > + } else {
> > + comp_ret = comp_ret ? : -EINVAL;
>
> Does this keep the value of comp_ret if comp_ret != 0 lol.
Yes.
> Syntax
> looks weird to me.
I agree it could look weird. I prefer keeping the code in a way more
familiar to those who will keep maintaining the file. So I will modify
it as below in the next version.

comp_ret ? comp_ret : -EINVAL
>
> > + goto unlock;
> > + }
> > + }
>
> Also, can we fix the counter value?
>
> I assume we want:
>
> > else if (comp_ret || dlen == PAGE_SIZE)
> zswap_reject_compress_fail++;
>
> or something like that.
I'm not clearly getting your point.

I was thinking we should increase the counter when we "reject" the page
(do not save the content in the zpool) due to failing to compress the
page's content into a size smaller than PAGE_SIZE. This patch
implements that behavior.

Am I missing a mis-implementation of that behavior in this patch, or is
the behavior not what you think it should be? More elaboration of your
point would be helpful for me.
>
> And what happened to the incompressible page stored counter? :)
I don't get what counter you are asking about. Could you please elaborate?
>
>
> >
> > zpool = pool->zpool;
> > gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> > @@ -1012,6 +1029,14 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> > acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
> > obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
> >
> > + /* zswap entries of PAGE_SIZE are not compressed. */
> of length?
Thank you for this nice suggestion. I will change the comment
accordingly in the next version.
Thank you for your nice review and comments :)
Thanks,
SJ
* Re: [PATCH] mm/zswap: store <PAGE_SIZE compression failed page as-is
From: Nhat Pham @ 2025-08-08 23:37 UTC
To: SeongJae Park
Cc: Andrew Morton, Chengming Zhou, David Hildenbrand, Johannes Weiner,
Yosry Ahmed, kernel-team, linux-kernel, linux-mm, Takero Funaki
On Thu, Aug 7, 2025 at 4:54 PM SeongJae Park <sj@kernel.org> wrote:
>
> On Thu, 7 Aug 2025 16:03:54 -0700 Nhat Pham <nphamcs@gmail.com> wrote:
[...]
> > Also, can we fix the counter value?
> >
> > I assume we want:
> >
> > > else if (comp_ret || dlen == PAGE_SIZE)
> > zswap_reject_compress_fail++;
> >
> > or something like that.
>
> I'm not clearly getting your point.
>
> I was thinking we should increase the counter when we "reject" the page
> (do not save the content in the zpool) due to failing to compress the
> page's content into a size smaller than PAGE_SIZE. This patch
> implements that behavior.
>
> Am I missing a mis-implementation of that behavior in this patch, or is
> the behavior not what you think it should be? More elaboration of your
> point would be helpful for me.
Ah yeah, maybe "reject compress fail" is not a good name here. But
sometimes I like to know how many times we fail to compress, even if
we do save them.
We can rename it to just "zswap_compress_fail", but that's breaking
API, so it's kind of annoying. Maybe "zswap_stored_uncompressed_pages"
suffices (see comment below).
Johannes, any suggestions on what to do here?
>
> >
> > And what happened to the incompressible page stored counter? :)
>
> I don't get what counter you are asking about. Could you please elaborate?
I meant "zswap_stored_uncompressed_pages" in your RFC v1.
That could give us a nice breakdown of how much memory in zswap is
actually compressed memory, and how much is uncompressed.
* Re: [PATCH] mm/zswap: store <PAGE_SIZE compression failed page as-is
From: SeongJae Park @ 2025-08-09 0:18 UTC
To: Nhat Pham
Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
Johannes Weiner, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
Takero Funaki
On Fri, 8 Aug 2025 16:37:15 -0700 Nhat Pham <nphamcs@gmail.com> wrote:
> On Thu, Aug 7, 2025 at 4:54 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Thu, 7 Aug 2025 16:03:54 -0700 Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > > On Thu, Aug 7, 2025 at 11:16 AM SeongJae Park <sj@kernel.org> wrote:
[...]
> > > Also, can we fix the counter value?
> > >
> > > I assume we want:
> > >
> > > else if (comp_ret || dlen == PAGE_SIZE)
> > > zswap_reject_compress_fail++;
> > >
> > > or something like that.
[...]
>
> Ah yeah, maybe "reject compress fail" is not a good name here. But
> sometimes I like to know how many times we fail to compress, even if
> we do save them.
Thank you for clarifying, that makes sense to me.
>
> We can rename it to just "zswap_compress_fail", but that's breaking
> API, so it's kind of annoying. Maybe "zswap_stored_uncompressed_pages"
> suffices (see comment below).
The suggested name sounds good to me.
>
> Johannes, any suggestions on what to do here?
+1
>
> >
> > >
> > > And what happened to the incompressible page stored counter? :)
> >
> > I don't get what counter you are asking about. Could you please elaborate?
>
> I meant "zswap_stored_uncompressed_pages" in your RFC v1.
Thank you for kindly elaborating. I implemented that counter not to
provide additional observability, but only to keep zswap_total_pages()
accounting for the uncompressed pages, since that version was not
using the zpool. The internal counter was dropped in RFC v2, when we
started using the zpool, thanks to feedback from reviewers including
you.
>
> That could give us a nice breakdown of how much memory in zswap is
> actually compressed memory, and how much is uncompressed.
I agree it could be useful information. Unless others raise different
opinions, I will implement this in the next version, with your suggested name.
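Something like below is what I have in mind. Note this is an untested
sketch; the counter type and the exact hook points may differ in the
actual patch:

/* untested sketch: count of zswap entries stored without compression */
static atomic_t zswap_stored_uncompressed_pages = ATOMIC_INIT(0);

/* in zswap_compress(), when the content is stored as-is: */
	atomic_inc(&zswap_stored_uncompressed_pages);

/* in zswap_entry_free(), for entries with entry->length == PAGE_SIZE: */
	atomic_dec(&zswap_stored_uncompressed_pages);

/* in zswap_debugfs_init(): */
	debugfs_create_atomic_t("stored_uncompressed_pages", 0444,
				zswap_debugfs_root,
				&zswap_stored_uncompressed_pages);

Since it would be incremented at store time and decremented at free
time, the value would be a snapshot of how many uncompressed pages
currently sit in the pool, which matches the breakdown you described.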
Thanks,
SJ
[...]