* [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
@ 2025-08-12 17:00 SeongJae Park
  2025-08-12 18:52 ` Nhat Pham
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: SeongJae Park @ 2025-08-12 17:00 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki

When zswap writeback is enabled and it fails compressing a given page,
the page is swapped out to the backing swap device.  This behavior
breaks zswap's writeback LRU order, and hence users can experience
unexpected latency spikes.  If the page is compressed without failure,
but results in a size of PAGE_SIZE, the LRU order is kept, but the
decompression overhead for loading the page back on a later access is
unnecessary.

Keep the LRU order and avoid the unnecessary decompression overhead in
those cases, by storing the original content as-is in zpool.  The
length field of zswap_entry will be set appropriately, to PAGE_SIZE.
Hence whether it is saved as-is or not (whether decompression is
unnecessary) is identified by 'zswap_entry->length == PAGE_SIZE'.
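
For illustration, a minimal sketch of the load-side identification
(mirroring the zswap_decompress() hunk below):

	/* entries of length PAGE_SIZE were stored as-is */
	if (entry->length == PAGE_SIZE) {
		/* copy the raw bytes back; no decompression needed */
		memcpy_to_folio(folio, 0, obj, PAGE_SIZE);
		return true;
	}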

Because the uncompressed data is saved in zpool, the same as the
compressed data, this introduces no change in terms of memory
management, including movability and migratability of the involved
pages.

This change also does not increase the per-entry zswap metadata
overhead.  But as the number of incompressible pages increases, the
total zswap metadata overhead increases proportionally.  The overhead
should not be problematic in usual cases, since the zswap metadata for
a single zswap entry is much smaller than PAGE_SIZE, and in common
zswap use cases there should be a sufficient amount of compressible
pages.  It can also be mitigated by zswap writeback.
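
(For a rough sense of scale: assuming on the order of 100 bytes of
metadata per entry, an as-is stored page adds about 100 / 4096, i.e.
~2.5%, on top of the PAGE_SIZE of data kept in zpool.)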

When writeback is disabled, the additional overhead could be
problematic.  For that case, keep the current behavior that just
returns the failure and lets swap_writeout() put the page back to the
active LRU list.

Knowing how many compression failures happened will be useful for
future investigations.  Add a new debugfs file, compress_fail, for
that purpose.
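
For example, assuming debugfs is mounted at the usual location, the new
counter could be read with a trivial program like the sketch below (or
just 'cat' on the same path):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long fails;
		FILE *f = fopen("/sys/kernel/debug/zswap/compress_fail", "r");

		if (!f || fscanf(f, "%llu", &fails) != 1)
			return 1;
		printf("compress_fail: %llu\n", fails);
		fclose(f);
		return 0;
	}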

Tests
-----

I tested this patch using a simple self-written microbenchmark that is
available at GitHub[1].  You can reproduce the test I did by executing
run_tests.sh of the repo on your system.  Note that the repo's
documentation is not good as of this writing, so you may need to read
and use the code.

The basic test scenario is simple.  Run a test program that makes
artificial accesses to memory with artificial content under a
memory.high-set memory limit, and measure how many accesses were made
in a given time.

The test program repeatedly and randomly accesses three anonymous
memory regions.  The regions are each 500 MiB in size, and are accessed
with the same probability.  Two of those are filled with simple content
that can easily be compressed, while the remaining one is filled with
content read from /dev/urandom, which is likely to fail compressing to
a <PAGE_SIZE size.  The program runs for two minutes and prints out the
number of accesses made every five seconds.
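
A minimal sketch of such a test program (not the actual eval_zswap
code; error handling is mostly elided, and the sizes follow the
description above):

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <time.h>
	#include <unistd.h>

	#define REGION_SZ	(500UL << 20)	/* 500 MiB per region */
	#define NR_REGIONS	3

	int main(void)
	{
		char *regions[NR_REGIONS];
		unsigned long accesses = 0, off;
		time_t start = time(NULL), last = start;
		int i, fd = open("/dev/urandom", O_RDONLY);

		for (i = 0; i < NR_REGIONS; i++)
			regions[i] = mmap(NULL, REGION_SZ,
					  PROT_READ | PROT_WRITE,
					  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		/* two easily compressible regions, one incompressible */
		memset(regions[0], 'a', REGION_SZ);
		memset(regions[1], 'a', REGION_SZ);
		for (off = 0; off < REGION_SZ; )
			off += read(fd, regions[2] + off, REGION_SZ - off);
		/* access the three regions randomly, with equal probability */
		while (time(NULL) - start < 120) {
			regions[rand() % NR_REGIONS][rand() % REGION_SZ]++;
			accesses++;
			if (time(NULL) - last >= 5) {
				printf("%lu accesses\n", accesses);
				accesses = 0;
				last = time(NULL);
			}
		}
		return 0;
	}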

The test script runs the program under the four configurations below.

- 0: memory.high is set to 2 GiB, zswap is disabled.
- 1-1: memory.high is set to 1350 MiB, zswap is disabled.
- 1-2: On 1-1, zswap is enabled without this patch.
- 1-3: On 1-2, this patch is applied.

For all zswap-enabled cases, the zswap shrinker is enabled.

Configuration '0' is for showing the original memory performance.
Configurations 1-1, 1-2 and 1-3 are for showing the performance of swap,
zswap, and this patch under a level of memory pressure (~10% of working
set).

Because the per-5-seconds performance numbers are not very reliable, I
measured their average over the last one-minute period of the test
program run.  I also measured a few vmstat counters, including zswpin,
zswpout, zswpwb, pswpin and pswpout, during the test runs.

The measurement results are as below.  To save space, I show only
performance numbers normalized to that of configuration '0' (no memory
pressure).  The averaged accesses per 5 seconds of configuration '0'
was 36493417.75.
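
(That is, perf_normalized(config) = averaged accesses per 5 seconds of
that config / 36493417.75.)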

    config            0       1-1     1-2      1-3
    perf_normalized   1.0000  0.0057  0.0235   0.0367
    perf_stdev_ratio  0.0582  0.0652  0.0167   0.0346
    zswpin            0       0       3548424  1999335
    zswpout           0       0       3588817  2361689
    zswpwb            0       0       10214    340270
    pswpin            0       485806  772038   340967
    pswpout           0       649543  144773   340270

'perf_normalized' is the performance metric, normalized to that of
configuration '0' (no pressure).  'perf_stdev_ratio' is the standard
deviation of the averaged data points, as a ratio to the averaged
metric value.  For example, configuration '0' showed about 5.8% stdev,
while configurations 1-1 and 1-3 showed about 6.5% and 3.5% stdev,
respectively.  The results were also highly variable between multiple
runs, so they are not very stable and show only ballpark figures.
Please keep this in mind when reading them.
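
For clarity, the aggregation described above boils down to a mean and a
stdev-to-mean ratio over the last-minute samples; a minimal sketch
(the sample values are to be filled in before running):

	#include <math.h>
	#include <stdio.h>

	#define NR_SAMPLES 12	/* per-5s samples over the last minute */

	int main(void)
	{
		double s[NR_SAMPLES] = {0};	/* fill in measured counts */
		double sum = 0, var = 0, mean;
		int i;

		for (i = 0; i < NR_SAMPLES; i++)
			sum += s[i];
		mean = sum / NR_SAMPLES;
		for (i = 0; i < NR_SAMPLES; i++)
			var += (s[i] - mean) * (s[i] - mean);
		printf("perf %f stdev_ratio %f\n",
		       mean, sqrt(var / NR_SAMPLES) / mean);
		return 0;
	}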

Under about 10% of working set memory pressure, the performance dropped
to about 0.57% of the no-pressure case when the normal swap was used
(1-1).  Note that ~10% working set pressure is already extreme, at
least on this test setup.  No one would want a system setup that can
degrade performance to 0.57% of the best case.

Turning zswap on (1-2) improved the performance about 4x, to about
2.35% of the no-pressure case.  Because of the incompressible pages in
the third memory region, though, a significant amount of (non-zswap)
swap I/O was still made.

Applying this patch (1-3) improved the performance by about 56% over
1-2, to about 3.67% of the no-pressure case.  The reduced pswpin count
of 1-3 compared to 1-2 shows where this improvement came from.

Related Work
------------

This is not an entirely new attempt.  Nhat Pham and Takero Funaki tried
very similar approaches in October 2023[2] and April 2024[3],
respectively.  The two approaches didn't get merged, mainly due to the
metadata overhead concern.  I described at the beginning of this
changelog why I think that shouldn't be a problem for this change,
which is automatically disabled when writeback is disabled.

This patch is not particularly different from those, and is actually
built upon them.  I wrote it from scratch again, though, hence the
Suggested-by tags for them.  Nhat actually first suggested this to me
off-list.

Historically, writeback disabling was introduced partially as a way to
solve the LRU order issue.  Yosry pointed out[4] that this is still
suboptimal when the incompressible pages are cold, since zswap will
continuously try to compress those pages again, burning CPU cycles on
compression attempts that will fail anyway.  One imaginable solution
for that problem is reusing the swapped-out page and its struct page
for storage in the zswap pool.  But that's out of the scope of this
patch.

[1] https://github.com/sjp38/eval_zswap/blob/master/run.sh
[2] https://lore.kernel.org/20231017003519.1426574-3-nphamcs@gmail.com
[3] https://lore.kernel.org/20240706022523.1104080-6-flintglass@gmail.com
[4] https://lore.kernel.org/CAJD7tkZXS-UJVAFfvxJ0nNgTzWBiqepPYA4hEozi01_qktkitg@mail.gmail.com

Suggested-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Takero Funaki <flintglass@gmail.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
---
Changes from v1
(https://lore.kernel.org/20250807181616.1895-1-sj@kernel.org)
- Optimize out memcpy() per incompressible page saving, using
  k[un]map_local().
- Add a debugfs file for counting compression failures.
- Use a clear form of a ternary operation.
- Add the history of writeback disabling with a link.
- Wordsmith comments.

Changes from RFC v2
(https://lore.kernel.org/20250805002954.1496-1-sj@kernel.org)
- Fix race conditions at decompressed pages identification.
- Remove the parameter and make saving as-is the default behavior.
- Open-code main changes.
- Clarify there are no memory management changes in the cover letter.
- Remove 20% pressure case from test results, since it is arguably too
  extreme and only adds confusion.
- Drop RFC tag.

Changes from RFC v1
(https://lore.kernel.org/20250730234059.4603-1-sj@kernel.org)
- Consider PAGE_SIZE compression successes as failures.
- Use zpool for storing incompressible pages.
- Test with zswap shrinker enabled.
- Wordsmith changelog and comments.
- Add documentation of save_incompressible_pages parameter.

 mm/zswap.c | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 3c0fd8a13718..0fb940d03268 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -60,6 +60,8 @@ static u64 zswap_written_back_pages;
 static u64 zswap_reject_reclaim_fail;
 /* Store failed due to compression algorithm failure */
 static u64 zswap_reject_compress_fail;
+/* Compression into a size of <PAGE_SIZE failed */
+static u64 zswap_compress_fail;
 /* Compressed page was too big for the allocator to (optimally) store */
 static u64 zswap_reject_compress_poor;
 /* Load or writeback failed due to decompression failure */
@@ -976,8 +978,26 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 */
 	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
 	dlen = acomp_ctx->req->dlen;
-	if (comp_ret)
-		goto unlock;
+
+	/*
+	 * If a page cannot be compressed into a size smaller than PAGE_SIZE,
+	 * save the content as is without a compression, to keep the LRU order
+	 * of writebacks.  If writeback is disabled, reject the page since it
+	 * only adds metadata overhead.  swap_writeout() will put the page back
+	 * to the active LRU list in the case.
+	 */
+	if (comp_ret || dlen >= PAGE_SIZE) {
+		zswap_compress_fail++;
+		if (mem_cgroup_zswap_writeback_enabled(
+					folio_memcg(page_folio(page)))) {
+			comp_ret = 0;
+			dlen = PAGE_SIZE;
+			dst = kmap_local_page(page);
+		} else {
+			comp_ret = comp_ret ? comp_ret : -EINVAL;
+			goto unlock;
+		}
+	}
 
 	zpool = pool->zpool;
 	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
@@ -990,6 +1010,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	entry->length = dlen;
 
 unlock:
+	if (dst != acomp_ctx->buffer)
+		kunmap_local(dst);
 	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
 		zswap_reject_compress_poor++;
 	else if (comp_ret)
@@ -1012,6 +1034,14 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
 	obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
 
+	/* zswap entries of length PAGE_SIZE are not compressed. */
+	if (entry->length == PAGE_SIZE) {
+		memcpy_to_folio(folio, 0, obj, entry->length);
+		zpool_obj_read_end(zpool, entry->handle, obj);
+		acomp_ctx_put_unlock(acomp_ctx);
+		return true;
+	}
+
 	/*
 	 * zpool_obj_read_begin() might return a kmap address of highmem when
 	 * acomp_ctx->buffer is not used.  However, sg_init_one() does not
@@ -1809,6 +1839,8 @@ static int zswap_debugfs_init(void)
 			   zswap_debugfs_root, &zswap_reject_kmemcache_fail);
 	debugfs_create_u64("reject_compress_fail", 0444,
 			   zswap_debugfs_root, &zswap_reject_compress_fail);
+	debugfs_create_u64("compress_fail", 0444,
+			   zswap_debugfs_root, &zswap_compress_fail);
 	debugfs_create_u64("reject_compress_poor", 0444,
 			   zswap_debugfs_root, &zswap_reject_compress_poor);
 	debugfs_create_u64("decompress_fail", 0444,

base-commit: 44fa6646d975349f6499d1aeb0ed826680d0bb5c
-- 
2.39.5


* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-12 17:00 [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is SeongJae Park
@ 2025-08-12 18:52 ` Nhat Pham
  2025-08-13 17:07 ` Chris Li
  2025-08-14 16:23 ` Johannes Weiner
  2 siblings, 0 replies; 24+ messages in thread
From: Nhat Pham @ 2025-08-12 18:52 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Chengming Zhou, David Hildenbrand, Johannes Weiner,
	Yosry Ahmed, linux-kernel, linux-mm, Takero Funaki

On Tue, Aug 12, 2025 at 10:00 AM SeongJae Park <sj@kernel.org> wrote:
[...]

I'll let Johannes chime in as well, but LGTM.

Acked-by: Nhat Pham <nphamcs@gmail.com>

Thanks!


* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-12 17:00 [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is SeongJae Park
  2025-08-12 18:52 ` Nhat Pham
@ 2025-08-13 17:07 ` Chris Li
  2025-08-13 18:20   ` SeongJae Park
                     ` (3 more replies)
  2025-08-14 16:23 ` Johannes Weiner
  2 siblings, 4 replies; 24+ messages in thread
From: Chris Li @ 2025-08-13 17:07 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Chengming Zhou, David Hildenbrand, Johannes Weiner,
	Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm, Takero Funaki,
	Hugh Dickins

Hi SeongJae,

Thanks for doing that. I am working on something related to this as
well. Please consider CC me on your next version.

On Tue, Aug 12, 2025 at 10:00 AM SeongJae Park <sj@kernel.org> wrote:
>
> When zswap writeback is enabled and it fails compressing a given page,
> the page is swapped out to the backing swap device.  This behavior
> breaks zswap's writeback LRU order, and hence users can experience
> unexpected latency spikes.  If the page is compressed without failure,
> but results in a size of PAGE_SIZE, the LRU order is kept, but the
> decompression overhead for loading the page back on a later access is
> unnecessary.
>
> Keep the LRU order and avoid the unnecessary decompression overhead in
> those cases, by storing the original content as-is in zpool.  The
> length field of zswap_entry will be set appropriately, to PAGE_SIZE.
> Hence whether it is saved as-is or not (whether decompression is
> unnecessary) is identified by 'zswap_entry->length == PAGE_SIZE'.
>
> Because the uncompressed data is saved in zpool, the same as the
> compressed data, this introduces no change in terms of memory
> management, including movability and migratability of the involved
> pages.

If you store uncompressed data in the zpool, zpool has metadata
overhead, e.g. allocating the entry->handle for uncompressed pages.
If the page is not compressed, another idea is to just skip the zpool
and store it as a page in the zswap entry. We can make a union of
entry->handle and entry->incompressible_page. If entry->length ==
PAGE_SIZE, use entry->incompressible_page as a page.

The pro is that, on the page fault, there is no need to allocate a
new page. You can just move the uncompressed page into the swap cache.
You save one memcpy on the page fault as well. That will make the
incompressible page fault behave very similarly to a minor page fault.

The con is that zpool now does not account for uncompressed pages. On
second thought, that can be a pro as well: the page is not compressed,
so why should it be accounted as a compressed page?

I know Hugh has some idea to store incompressible pages in the swap
cache as well. Hugh?
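
A rough sketch of that union (the field name is hypothetical, and the
other zswap_entry fields are elided):

	struct zswap_entry {
		/* ... swpentry, pool, lru, etc. elided ... */
		unsigned int length;
		union {
			unsigned long handle;	/* compressed data in zpool */
			struct page *incompressible_page; /* as-is page */
		};
	};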

> This change also does not increase the per-entry zswap metadata
> overhead.  But as the number of incompressible pages increases, the
> total zswap metadata overhead increases proportionally.  The overhead
> should not be problematic in usual cases, since the zswap metadata for
> a single zswap entry is much smaller than PAGE_SIZE, and in common
> zswap use cases there should be a sufficient amount of compressible
> pages.  It can also be mitigated by zswap writeback.
>
> When writeback is disabled, the additional overhead could be
> problematic.  For that case, keep the current behavior that just
> returns the failure and lets swap_writeout() put the page back to the
> active LRU list.
>
> Knowing how many compression failures happened will be useful for
> future investigations.  Add a new debugfs file, compress_fail, for
> that purpose.

Again, I think we should distinguish crypto engine failures from
compression ratio failures.

[...]
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 3c0fd8a13718..0fb940d03268 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -60,6 +60,8 @@ static u64 zswap_written_back_pages;
>  static u64 zswap_reject_reclaim_fail;
>  /* Store failed due to compression algorithm failure */
>  static u64 zswap_reject_compress_fail;
> +/* Compression into a size of <PAGE_SIZE failed */
> +static u64 zswap_compress_fail;
>  /* Compressed page was too big for the allocator to (optimally) store */
>  static u64 zswap_reject_compress_poor;
>  /* Load or writeback failed due to decompression failure */
> @@ -976,8 +978,26 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>          */
>         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
>         dlen = acomp_ctx->req->dlen;
> -       if (comp_ret)
> -               goto unlock;
> +
> +       /*
> +        * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> +        * save the content as is without a compression, to keep the LRU order
> +        * of writebacks.  If writeback is disabled, reject the page since it
> +        * only adds metadata overhead.  swap_writeout() will put the page back
> +        * to the active LRU list in the case.
> +        */
> +       if (comp_ret || dlen >= PAGE_SIZE) {
> +               zswap_compress_fail++;

I feel that mixing the comp_ret and dlen tests here complicates the
nested if statement, and the behavior as well.
One behavior change is that, if comp_ret != 0, it means the compression
crypto engine has some error, e.g. a bad chip that always fails the
compression. That is a different error case from the compression being
successful but the compression ratio not being good. In this case the
patch stores the page in zpool on a crypto engine failure, which does
not help.

Is there any reason you want to store the page in zpool when the
compression engine (not the ratio) fails?

What do you say about the following alternative, which keeps the
original behavior if the compression engine fails?

	if (comp_ret) {
		zswap_compress_engine_fail++;
		goto unlock;
	}

	if (dlen >= PAGE_SIZE) {
		zswap_compress_fail++;
		if (!mem_cgroup_zswap_writeback_enabled(
					folio_memcg(page_folio(page)))) {
			comp_ret = -EINVAL;
			goto unlock;
		}
		dlen = PAGE_SIZE;
		dst = kmap_local_page(page);
	}

Overall I feel this alternative is simpler and easier to read. I can
be biased towards my own code :-).
I think we should treat the compression engine failure separately from
the compression ratio failure. The engine failure is rare and we should
know about it as a real error. The compression ratio failure is normal.

Chris

[...]


* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-13 17:07 ` Chris Li
@ 2025-08-13 18:20   ` SeongJae Park
  2025-08-15 22:28     ` Chris Li
  2025-08-13 18:32   ` Nhat Pham
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: SeongJae Park @ 2025-08-13 18:20 UTC (permalink / raw)
  To: Chris Li
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

Hi Chris,

On Wed, 13 Aug 2025 10:07:18 -0700 Chris Li <chrisl@kernel.org> wrote:

> Hi SeongJae,
> 
> Thanks for doing that. I am working on something related to this as
> well. Please consider CC me on your next version.

Thank you for chiming in, I will do so.

> 
> On Tue, Aug 12, 2025 at 10:00 AM SeongJae Park <sj@kernel.org> wrote:
> >
> > When zswap writeback is enabled and it fails compressing a given
> > page, the page is swapped out to the backing swap device.  This
> > behavior breaks zswap's writeback LRU order, and hence users can
> > experience unexpected latency spikes.  If the page is compressed
> > without failure, but results in a size of PAGE_SIZE, the LRU order
> > is kept, but the decompression overhead for loading the page back on
> > a later access is unnecessary.
> >
> > Keep the LRU order and avoid the unnecessary decompression overhead
> > in those cases, by storing the original content as-is in zpool.  The
> > length field of zswap_entry will be set appropriately, to PAGE_SIZE.
> > Hence whether it is saved as-is or not (whether decompression is
> > unnecessary) is identified by 'zswap_entry->length == PAGE_SIZE'.
> >
> > Because the uncompressed data is saved in zpool, the same as the
> > compressed data, this introduces no change in terms of memory
> > management, including movability and migratability of the involved
> > pages.
> 
> If you store uncompressed data in the zpool, zpool has metadata
> overhead, e.g. allocating the entry->handle for uncompressed pages.
> If the page is not compressed, another idea is to just skip the zpool
> and store it as a page in the zswap entry. We can make a union of
> entry->handle and entry->incompressible_page. If entry->length ==
> PAGE_SIZE, use entry->incompressible_page as a page.

My RFC v1[1] used a kind of such approach.  I changed it to use zpool
starting from RFC v2, following the suggestions from Nhat, Johannes and Joshua.
It was suggested to use zpool to reuse its support of migration, highmem
handling, and stats.  And I agree with their suggestions.  David also raised a
question on RFC v2, about the movability behavior changes[2].

I agree we will get more metadata overhead.  Nonetheless, I argue it is not a
big problem and can be mitigated in writeback-enabled environments.  Hence this
feature is disabled in the writeback-disabled case.  The next paragraph of the
original cover letter explains my points about this.

> 
> The pro is that, on the page fault, there is no need to allocate a
> new page. You can just move the uncompressed page into the swap cache.
> You save one memcpy on the page fault as well. That will make the
> incompressible page fault behave very similarly to a minor page fault.

I agree this can save one memcpy, and that's a cool idea!  Nonetheless, this
patch is not making the [z]swap-in overhead worse.  Rather, it even slightly
improves it for the incompressible pages case, by skipping the decompression.
Also, if we took that route, I think we would need a good answer for
zswapped-page migration and the like.

So, if we agree on my justification of the metadata overhead, I think this
could be done as followup work to this patch?

> 
> The con is that zpool now does not account for uncompressed pages. On
> second thought, that can be a pro as well: the page is not compressed,
> so why should it be accounted as a compressed page?

I agree it might look weird.  But, in my humble opinion, if we view zpool as
zswap's backend storage and think of the definition of compression in a
flexible way, this may be understandable and not cause real user problems?

> 
> I know Hugh has some idea to store incompressible pages in the swap
> cache as well. Hugh?
> 
> > This change also does not increase the per-entry zswap metadata
> > overhead.  But as the number of incompressible pages increases, the
> > total zswap metadata overhead increases proportionally.  The
> > overhead should not be problematic in usual cases, since the zswap
> > metadata for a single zswap entry is much smaller than PAGE_SIZE,
> > and in common zswap use cases there should be a sufficient amount of
> > compressible pages.  It can also be mitigated by zswap writeback.
> >
> > When writeback is disabled, the additional overhead could be
> > problematic.  For that case, keep the current behavior that just
> > returns the failure and lets swap_writeout() put the page back to
> > the active LRU list.
> >
> > Knowing how many compression failures happened will be useful for
> > future investigations.  Add a new debugfs file, compress_fail, for
> > that purpose.
> 
> Again, I think we should distinguish crypto engine failures from
> compression ratio failures.

I see you left more detailed opinions about this below.  I will reply
there.

[...]
> >  mm/zswap.c | 36 ++++++++++++++++++++++++++++++++++--
> >  1 file changed, 34 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 3c0fd8a13718..0fb940d03268 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -60,6 +60,8 @@ static u64 zswap_written_back_pages;
> >  static u64 zswap_reject_reclaim_fail;
> >  /* Store failed due to compression algorithm failure */
> >  static u64 zswap_reject_compress_fail;
> > +/* Compression into a size of <PAGE_SIZE failed */
> > +static u64 zswap_compress_fail;
> >  /* Compressed page was too big for the allocator to (optimally) store */
> >  static u64 zswap_reject_compress_poor;
> >  /* Load or writeback failed due to decompression failure */
> > @@ -976,8 +978,26 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >          */
> >         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> >         dlen = acomp_ctx->req->dlen;
> > -       if (comp_ret)
> > -               goto unlock;
> > +
> > +       /*
> > +        * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > +        * save the content as is without a compression, to keep the LRU order
> > +        * of writebacks.  If writeback is disabled, reject the page since it
> > +        * only adds metadata overhead.  swap_writeout() will put the page back
> > +        * to the active LRU list in the case.
> > +        */
> > +       if (comp_ret || dlen >= PAGE_SIZE) {
> > +               zswap_compress_fail++;
> 
> I feel that mixing the comp_ret and dlen tests here complicates the
> nested if statement, and the behavior as well.
> One behavior change is that, if comp_ret != 0, it means the compression
> crypto engine has some error, e.g. a bad chip that always fails the
> compression. That is a different error case from the compression being
> successful but the compression ratio not being good. In this case the
> patch stores the page in zpool on a crypto engine failure, which does
> not help.

I agree the code has room to improve, in terms of readability and keeping
fine-grained observability.

> 
> Is there any reason you want to store the page in zpool when the
> compression engine (not the ratio) fails?

The main problem this patch tries to solve is the LRU order corruption.  In any
case, I want to keep the order if possible.

> 
> What do you say about the following alternative, which keeps the
> original behavior if the compression engine fails?
> 
>      if (comp_ret) {
>           zswap_compress_engine_fail++;

I agree this counter can be useful.

>           goto unlock;

This will make the page go directly to the underlying, slower swap device, and
hence cause LRU order inversion.  So unfortunately I have to say this is not
the behavior I want.

I agree engine failures are not frequent, so this behavior might not really be
a problem for me.  But I still don't see a reason to let the page go directly
to swap and make the LRU order unexpected.

>      }
> 
>      if (dlen >= PAGE_SIZE) {
>         zswap_compress_fail++;

I define compress failure here as a failure of an attempt to compress the
given page's content into a size smaller than PAGE_SIZE.  It is a superset
including both "ratio" failures and "engine" failures.  Hence I think
zswap_compress_fail should be increased even in the engine failure case above.

We can discuss re-defining and re-naming what each stat means, of course, if
you want.  But I think even if we keep the current definitions and names, we
can still get the ratio failures if we add your nicely suggested
zswap_compress_engine_fail counter, as
'zswap_compress_fail - zswap_compress_engine_fail'.  So we might not really
need the re-definition and re-naming?

>         if (!mem_cgroup_zswap_writeback_enabled(
>                                       folio_memcg(page_folio(page)))) {
>               comp_ret = -EINVAL;
>               goto unlock;
>         }
>        dlen = PAGE_SIZE;
>        dst = kmap_local_page(page);
>     }
> 
> Overall I feel this alternative is simpler and easier to read.

I agree this code is nice to read.  Nonetheless I have to say the behavior is
somewhat different from what I want.

> I can
> be biased towards my own code :-).
> I think we should treat the compression engine failure separately from
> the compression ratio failure. The engine failure is rare and we should
> know about it as a real error. The compression ratio failure is normal.

Again, I agree having the observability would be nice.  I can add a new counter
for that, like the patch attached below, for example.

[1] https://lore.kernel.org/20250730234059.4603-1-sj@kernel.org
[2] https://lore.kernel.org/761a2899-6fd9-4bfe-aeaf-23bce0baa0f1@redhat.com


Thanks,
SJ

[...]

==== >8 ====
From 839459619cf2c1a66d0a82361808bab499feefda Mon Sep 17 00:00:00 2001
From: SeongJae Park <sj@kernel.org>
Date: Wed, 13 Aug 2025 11:13:24 -0700
Subject: [PATCH] mm/zswap: add crypto engine failures debugfs counter

compress_fail counts failures of attempts to compress a given page
into a size smaller than PAGE_SIZE.  This can happen because the
content of the page is not easily compressible, or because the crypto
engine errored for some reason.  Add a new debugfs file, namely
compress_engine_fail, for counting the latter case.  Keep the
compress_fail count as is, since failures due to the compression ratio
can be calculated by subtracting the new counter
(compress_engine_fail) value from that of the existing one
(compress_fail).

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/zswap.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/zswap.c b/mm/zswap.c
index 0fb940d03268..6b2144126fba 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
 static u64 zswap_reject_compress_fail;
 /* Compression into a size of <PAGE_SIZE failed */
 static u64 zswap_compress_fail;
+/* Compression failed by the crypto engine */
+static u64 zswap_compress_engine_fail;
 /* Compressed page was too big for the allocator to (optimally) store */
 static u64 zswap_reject_compress_poor;
 /* Load or writeback failed due to decompression failure */
@@ -988,6 +990,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 */
 	if (comp_ret || dlen >= PAGE_SIZE) {
 		zswap_compress_fail++;
+		if (comp_ret)
+			zswap_compress_engine_fail++;
 		if (mem_cgroup_zswap_writeback_enabled(
 					folio_memcg(page_folio(page)))) {
 			comp_ret = 0;
@@ -1841,6 +1845,8 @@ static int zswap_debugfs_init(void)
 			   zswap_debugfs_root, &zswap_reject_compress_fail);
 	debugfs_create_u64("compress_fail", 0444,
 			   zswap_debugfs_root, &zswap_compress_fail);
+	debugfs_create_u64("compress_engine_fail", 0444,
+			   zswap_debugfs_root, &zswap_compress_engine_fail);
 	debugfs_create_u64("reject_compress_poor", 0444,
 			   zswap_debugfs_root, &zswap_reject_compress_poor);
 	debugfs_create_u64("decompress_fail", 0444,
-- 
2.39.5


* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-13 17:07 ` Chris Li
  2025-08-13 18:20   ` SeongJae Park
@ 2025-08-13 18:32   ` Nhat Pham
  2025-08-15 22:34     ` Chris Li
  2025-08-13 18:58   ` Hugh Dickins
  2025-08-13 19:42   ` Shakeel Butt
  3 siblings, 1 reply; 24+ messages in thread
From: Nhat Pham @ 2025-08-13 18:32 UTC (permalink / raw)
  To: Chris Li
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Wed, Aug 13, 2025 at 10:07 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi SeongJae,
>
> Thanks for doing that. I am working on something related to this as
> well. Please consider CC me on your next version.
>
> On Tue, Aug 12, 2025 at 10:00 AM SeongJae Park <sj@kernel.org> wrote:
> >
> > When zswap writeback is enabled and it fails compressing a given
> > page, the page is swapped out to the backing swap device.  This
> > behavior breaks zswap's writeback LRU order, and hence users can
> > experience unexpected latency spikes.  If the page is compressed
> > without failure, but results in a size of PAGE_SIZE, the LRU order
> > is kept, but the decompression overhead for loading the page back on
> > a later access is unnecessary.
> >
> > Keep the LRU order and avoid the unnecessary decompression overhead
> > in those cases, by storing the original content as-is in zpool.  The
> > length field of zswap_entry will be set appropriately, to PAGE_SIZE.
> > Hence whether it is saved as-is or not (whether decompression is
> > unnecessary) is identified by 'zswap_entry->length == PAGE_SIZE'.
> >
> > Because the uncompressed data is saved in zpool, the same as the
> > compressed data, this introduces no change in terms of memory
> > management, including movability and migratability of the involved
> > pages.
>
> If you store uncompressed data in the zpool, zpool has metadata
> overhead, e.g. allocating the entry->handle for uncompressed pages.
> If the page is not compressed, another idea is to just skip the zpool
> and store it as a page in the zswap entry. We can make a union of
> entry->handle and entry->incompressible_page. If entry->length ==
> PAGE_SIZE, use entry->incompressible_page as a page.
>
> The pro is that, on the page fault, there is no need to allocate a
> new page. You can just move the uncompressed page into the swap cache.
> You save one memcpy on the page fault as well. That will make the
> incompressible page fault behave very similarly to a minor page fault.
>
> The con is that zpool now does not account for uncompressed pages. On
> second thought, that can be a pro as well: the page is not compressed,
> so why should it be accounted as a compressed page?
>
> I know Hugh has some idea to store incompressible pages in the swap
> cache as well. Hugh?

I've also proposed that approach internally - keeping the pages in the
swapcache, while adding them to the zswap LRU for writeback to disk
(and so that we do not consider them for zswap again in the future).

But after a while, we decided against it, mostly due to the complexity
of the solution. On the zswap side, we need to distinguish between the
ordinary struct zswap_entry and the struct page on zswap's LRU list.
Externally, we need to handle moving a page currently in the zswap LRU
to the main memory anon LRUs too.

Migration is another concern. Zswap needs to be notified that the
"backend" of a zswap entry has changed underneath it. Not impossible,
but again that's just more surgery.

So we decided to start with a simple solution (this one), and iterate
as issues cropped up. At least then, we have production justifications
for any future improvements.

>
> > This change is also not increasing per zswap entry metadata overhead.
> > But as the number of incompressible pages increases, total zswap
> > metadata overhead is proportionally increased.  The overhead should not
> > be problematic in usual cases, since the zswap metadata for single zswap
> > entry is much smaller than PAGE_SIZE, and in common zswap use cases
> > there should be a sufficient amount of compressible pages.  Also it can
> > be mitigated by the zswap writeback.
> >
> > When the writeback is disabled, the additional overhead could be
> > problematic.  For the case, keep the current behavior that just returns
> > the failure and let swap_writeout() put the page back to the active LRU
> > list in the case.
> >
> > Knowing how many compression failures happened will be useful for future
> > investigations.  investigations.  Add a new debugfs file, compress_fail,
> > for the purpose.
>
> Again, I think we should distinguish the crypto engine fail vs ration failure.
>
> >
> > Tests
> > -----
> >
> > I tested this patch using a simple self-written microbenchmark that is
> > available at GitHub[1].  You can reproduce the test I did by executing
> > run_tests.sh of the repo on your system.  Note that the repo's
> > documentation is not good as of this writing, so you may need to read
> > and use the code.
> >
> > The basic test scenario is simple.  Run a test program making artificial
> > accesses to memory having artificial content under memory.high-set
> > memory limit and measure how many accesses were made in given time.
> >
> > The test program repeatedly and randomly access three anonymous memory
> > regions.  The regions are all 500 MiB size, and accessed in the same
> > probability.  Two of those are filled up with a simple content that can
> > easily be compressed, while the remaining one is filled up with a
> > content that read from /dev/urandom, which is easy to fail at
> > compressing to <PAGE_SIZE size.  The program runs for two minutes and
> > prints out the number of accesses made every five seconds.
> >
> > The test script runs the program under below seven configurations.
> >
> > - 0: memory.high is set to 2 GiB, zswap is disabled.
> > - 1-1: memory.high is set to 1350 MiB, zswap is disabled.
> > - 1-2: On 1-1, zswap is enabled without this patch.
> > - 1-3: On 1-2, this patch is applied.
> >
> > For all zswap enabled cases, zswap shrinker is enabled.
> >
> > Configuration '0' is for showing the original memory performance.
> > Configurations 1-1, 1-2 and 1-3 are for showing the performance of swap,
> > zswap, and this patch under a level of memory pressure (~10% of working
> > set).
> >
> > Because the per-5 seconds performance is not very reliable, I measured
> > the average of that for the last one minute period of the test program
> > run.  I also measured a few vmstat counters including zswpin, zswpout,
> > zswpwb, pswpin and pswpout during the test runs.
> >
> > The measurement results are as below.  To save space, I show performance
> > numbers that are normalized to that of the configuration '0' (no memory
> > pressure), only.  The averaged accesses per 5 seconds of configuration
> > '0' was 36493417.75.
> >
> >     config            0       1-1     1-2      1-3
> >     perf_normalized   1.0000  0.0057  0.0235   0.0367
> >     perf_stdev_ratio  0.0582  0.0652  0.0167   0.0346
> >     zswpin            0       0       3548424  1999335
> >     zswpout           0       0       3588817  2361689
> >     zswpwb            0       0       10214    340270
> >     pswpin            0       485806  772038   340967
> >     pswpout           0       649543  144773   340270
> >
> > 'perf_normalized' is the performance metric, normalized to that of
> > configuration '0' (no pressure).  'perf_stdev_ratio' is the standard
> > deviation of the averaged data points, as a ratio to the averaged metric
> > value.  For example, configuration '0' performance showed a 5.8% stdev.
> > Configurations 1-1 and 1-3 had about 6.5% and 3.5% stdev.  Also the
> > results were highly variable between multiple runs.  So this result is
> > not very stable and only shows ballpark figures.  Please keep this in
> > mind when reading these results.
> >
> > Under about 10% of working set memory pressure, the performance dropped
> > to about 0.57% of the no-pressure one when the normal swap is used
> > (1-1).  Note that ~10% working set pressure is already extreme, at least
> > on this test setup.  No one would desire a system setup that degrades
> > performance to 0.57% of the best case.
> >
> > By turning zswap on (1-2), the performance improved about 4x, to about
> > 2.35% of the no-pressure one.  Because of the incompressible pages in the
> > third memory region, a significant amount of (non-zswap) swap I/O
> > operations were made, though.
> >
> > By applying this patch (1-3), the performance improved by about 56% more,
> > to about 3.67% of the no-pressure one.  The reduced pswpin of 1-3
> > compared to 1-2 shows where this improvement came from.
> >
> > Related Works
> > -------------
> >
> > This is not an entirely new attempt.  Nhat Pham and Takero Funaki tried
> > very similar approaches in October 2023[2] and April 2024[3],
> > respectively.  The two approaches didn't get merged mainly due to the
> > metadata overhead concern.  I described why I think that shouldn't be a
> > problem for this change, which is automatically disabled when writeback
> > is disabled, at the beginning of this changelog.
> >
> > This patch is not particularly different from those, and is actually
> > built upon them.  I wrote this from scratch again, though.  Hence the
> > Suggested-by tags for them.  Actually, Nhat first suggested this to me
> > offlist.
> >
> > Historically, writeback disabling was introduced partially as a way to
> > solve the LRU order issue.  Yosry pointed out[4] this is still
> > suboptimal when the incompressible pages are cold, since those pages
> > will continuously be retried for zswap-out, burning CPU cycles on
> > compression attempts that will fail anyway.  One
> > imaginable solution for the problem is reusing the swapped-out page and
> > its struct page to store in the zswap pool.  But that's out of the scope
> > of this patch.
> >
> > [1] https://github.com/sjp38/eval_zswap/blob/master/run.sh
> > [2] https://lore.kernel.org/20231017003519.1426574-3-nphamcs@gmail.com
> > [3] https://lore.kernel.org/20240706022523.1104080-6-flintglass@gmail.com
> > [4] https://lore.kernel.org/CAJD7tkZXS-UJVAFfvxJ0nNgTzWBiqepPYA4hEozi01_qktkitg@mail.gmail.com
> >
> > Suggested-by: Nhat Pham <nphamcs@gmail.com>
> > Suggested-by: Takero Funaki <flintglass@gmail.com>
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> > Changes from v1
> > (https://lore.kernel.org/20250807181616.1895-1-sj@kernel.org)
> > - Optimize out memcpy() per incompressible page saving, using
> >   k[un]map_local().
> > - Add a debugfs file for counting compression failures.
> > - Use a clear form of a ternary operation.
> > - Add the history of writeback disabling with a link.
> > - Wordsmith comments.
> >
> > Changes from RFC v2
> > (https://lore.kernel.org/20250805002954.1496-1-sj@kernel.org)
> > - Fix race conditions at decompressed pages identification.
> > - Remove the parameter and make saving as-is the default behavior.
> > - Open-code main changes.
> > - Clarify there is no memory management changes on the cover letter.
> > - Remove 20% pressure case from test results, since it is arguably too
> >   extreme and only adds confusion.
> > - Drop RFC tag.
> >
> > Changes from RFC v1
> > (https://lore.kernel.org/20250730234059.4603-1-sj@kernel.org)
> > - Consider PAGE_SIZE compression successes as failures.
> > - Use zpool for storing incompressible pages.
> > - Test with zswap shrinker enabled.
> > - Wordsmith changelog and comments.
> > - Add documentation of save_incompressible_pages parameter.
> >
> >  mm/zswap.c | 36 ++++++++++++++++++++++++++++++++++--
> >  1 file changed, 34 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 3c0fd8a13718..0fb940d03268 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -60,6 +60,8 @@ static u64 zswap_written_back_pages;
> >  static u64 zswap_reject_reclaim_fail;
> >  /* Store failed due to compression algorithm failure */
> >  static u64 zswap_reject_compress_fail;
> > +/* Compression into a size of <PAGE_SIZE failed */
> > +static u64 zswap_compress_fail;
> >  /* Compressed page was too big for the allocator to (optimally) store */
> >  static u64 zswap_reject_compress_poor;
> >  /* Load or writeback failed due to decompression failure */
> > @@ -976,8 +978,26 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >          */
> >         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> >         dlen = acomp_ctx->req->dlen;
> > -       if (comp_ret)
> > -               goto unlock;
> > +
> > +       /*
> > +        * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > +        * save the content as is without a compression, to keep the LRU order
> > +        * of writebacks.  If writeback is disabled, reject the page since it
> > +        * only adds metadata overhead.  swap_writeout() will put the page back
> > +        * to the active LRU list in the case.
> > +        */
> > +       if (comp_ret || dlen >= PAGE_SIZE) {
> > +               zswap_compress_fail++;
>
> I feel that mixing the comp_ret and dlen size tests here complicates
> the nested if statement and the behavior as well.
> One behavior change is that, if comp_ret != 0, it means the
> compression crypto engine has some error, e.g. some bad chip that
> always fails the compression.  That is a different error case than one
> where the compression was successful but the compression ratio is not
> good.  In this case the patch makes the page be stored in zpool for a
> crypto engine failure, which does not help.
>
> Is there any reason you want to store the page in zpool when the
> compression engine (not the ratio) fails?

It helps if you want to write them back to disk later, in the LRU
order. SJ pointed that out in the patch changelog, I believe.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-13 17:07 ` Chris Li
  2025-08-13 18:20   ` SeongJae Park
  2025-08-13 18:32   ` Nhat Pham
@ 2025-08-13 18:58   ` Hugh Dickins
  2025-08-15 22:38     ` Chris Li
  2025-08-13 19:42   ` Shakeel Butt
  3 siblings, 1 reply; 24+ messages in thread
From: Hugh Dickins @ 2025-08-13 18:58 UTC (permalink / raw)
  To: Chris Li
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Wed, 13 Aug 2025, Chris Li wrote:
> 
> I know Hugh has some idea to store incompressible pages in the swap
> cache as well. Hugh?

No, I don't have any idea or plan to store incompressible pages in the
swap cache myself.  But yes, incompressible pages could well be "stored"
in the swap cache.

Hugh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-13 17:07 ` Chris Li
                     ` (2 preceding siblings ...)
  2025-08-13 18:58   ` Hugh Dickins
@ 2025-08-13 19:42   ` Shakeel Butt
  2025-08-13 20:48     ` Johannes Weiner
  2025-08-15 22:42     ` Chris Li
  3 siblings, 2 replies; 24+ messages in thread
From: Shakeel Butt @ 2025-08-13 19:42 UTC (permalink / raw)
  To: Chris Li
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Wed, Aug 13, 2025 at 10:07:18AM -0700, Chris Li wrote:
> 
> If you store uncompressed data in the zpool, zpool has metadata
> overhead, e.g. allocating the entry->handle for uncompressed pages.
> If the page is not compressed, another idea is to just skip the zpool
> and store it as a page in the zswap entry.  We can make a union of
> entry->handle and entry->incompressible_page.  If entry->length ==
> PAGE_SIZE, use entry->incompressible_page as a page.

The main problem being solved here is to avoid the scenario where the
incompressible pages are being rotated in LRUs and zswapped multiple
times and wasting CPU on compressing incompressible pages. SJ's approach
solves the issue but with some memory overhead (zswap entry). With your
suggestion and to solve the mentioned issue, we will need to change some
core parts of reclaim (__remove_mapping()), LRU handling (swap cache
pages not in LRUs) and refault (putting such pages back in LRU and
should it handle read and write faults differently). So, the cons of
that approach is more complex code.

Personally I would prefer a simple solution with some overhead over a
more complicated and error-prone solution without overhead.  Or maybe you
have a more simplified approach instead?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-13 19:42   ` Shakeel Butt
@ 2025-08-13 20:48     ` Johannes Weiner
  2025-08-15 22:44       ` Chris Li
  2025-08-15 22:42     ` Chris Li
  1 sibling, 1 reply; 24+ messages in thread
From: Johannes Weiner @ 2025-08-13 20:48 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Chris Li, SeongJae Park, Andrew Morton, Chengming Zhou,
	David Hildenbrand, Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Wed, Aug 13, 2025 at 12:42:32PM -0700, Shakeel Butt wrote:
> On Wed, Aug 13, 2025 at 10:07:18AM -0700, Chris Li wrote:
> > 
> > If you store uncompressed data in the zpool, zpool has metadata
> > overhead, e.g. allocating the entry->handle for uncompressed pages.
> > If the page is not compressed, another idea is to just skip the zpool
> > and store it as a page in the zswap entry.  We can make a union of
> > entry->handle and entry->incompressible_page.  If entry->length ==
> > PAGE_SIZE, use entry->incompressible_page as a page.
> 
> The main problem being solved here is to avoid the scenario where the
> incompressible pages are being rotated in LRUs and zswapped multiple
> times and wasting CPU on compressing incompressible pages. SJ's approach
> solves the issue but with some memory overhead (zswap entry). With your
> suggestion and to solve the mentioned issue, we will need to change some
> core parts of reclaim (__remove_mapping()), LRU handling (swap cache
> pages not in LRUs) and refault (putting such pages back in LRU and
> should it handle read and write faults differently). So, the cons of
> that approach is more complex code.

What Chris is proposing would also fix that, even for configurations
without writeback. So I'm not opposed to it.

However, for deployments where writeback *is* enabled, this code is an
improvement over the status quo. And it's not in conflict with a
broader fix for !writeback setups, so it's not an either-or scenario.

Specifically for the writeback case, the metadata overhead is not much
of a concern: we can just write back more zswap tail to make up for
it; the more important thing is that we can now do so in LRU order.

The premise being that writing an additional cold page from zswap to
disk to make room for a slightly inefficiently stored warm page is
better than rejecting and sending the *warm* page to disk instead.

So I agree with you Chris. But also think that's follow-up work for
somebody who cares about the !writeback case.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-12 17:00 [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is SeongJae Park
  2025-08-12 18:52 ` Nhat Pham
  2025-08-13 17:07 ` Chris Li
@ 2025-08-14 16:23 ` Johannes Weiner
  2025-08-14 17:04   ` SeongJae Park
  2 siblings, 1 reply; 24+ messages in thread
From: Johannes Weiner @ 2025-08-14 16:23 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Chengming Zhou, David Hildenbrand, Nhat Pham,
	Yosry Ahmed, linux-kernel, linux-mm, Takero Funaki

On Tue, Aug 12, 2025 at 10:00:46AM -0700, SeongJae Park wrote:
> When zswap writeback is enabled and it fails compressing a given page,
> the page is swapped out to the backing swap device.  This behavior
> breaks the zswap's writeback LRU order, and hence users can experience
> unexpected latency spikes.  If the page is compressed without failure,
> but results in a size of PAGE_SIZE, the LRU order is kept, but the
> decompression overhead for loading the page back on the later access is
> unnecessary.
> 
> Keep the LRU order and optimize unnecessary decompression overheads in
> those cases, by storing the original content as-is in zpool.  The length
> field of zswap_entry will be set appropriately, as PAGE_SIZE.  Hence
> whether it is saved as-is or not (whether decompression is unnecessary)
> is identified by 'zswap_entry->length == PAGE_SIZE'.
> 
> Because the uncompressed data is saved in zpool, the same as the
> compressed ones, this introduces no change in terms of memory management
> including movability and migratability of involved pages.
> 
> This change is also not increasing per zswap entry metadata overhead.
> But as the number of incompressible pages increases, total zswap
> metadata overhead is proportionally increased.  The overhead should not
> be problematic in usual cases, since the zswap metadata for single zswap
> entry is much smaller than PAGE_SIZE, and in common zswap use cases
> there should be a sufficient amount of compressible pages.  Also it can
> be mitigated by the zswap writeback.
> 
> When the writeback is disabled, the additional overhead could be
> problematic.  For that case, keep the current behavior that just returns
> the failure and lets swap_writeout() put the page back to the active LRU
> list.
> 
> Knowing how many compression failures happened will be useful for future
> investigations.  Add a new debugfs file, compress_fail, for the purpose.
> 
> Tests
> -----
> 
> I tested this patch using a simple self-written microbenchmark that is
> available at GitHub[1].  You can reproduce the test I did by executing
> run_tests.sh of the repo on your system.  Note that the repo's
> documentation is not good as of this writing, so you may need to read
> and use the code.
> 
> The basic test scenario is simple.  Run a test program that makes
> artificial accesses to memory with artificial content under a
> memory.high-set memory limit, and measure how many accesses were made in
> a given time.
> 
> The test program repeatedly and randomly accesses three anonymous memory
> regions.  The regions are all 500 MiB in size, and accessed with the same
> probability.  Two of those are filled up with simple content that can
> easily be compressed, while the remaining one is filled up with content
> read from /dev/urandom, which is likely to fail at compressing to a
> <PAGE_SIZE size.  The program runs for two minutes and prints out the
> number of accesses made every five seconds.
> 
> The test script runs the program under below seven configurations.

                                               four?

> - 0: memory.high is set to 2 GiB, zswap is disabled.
> - 1-1: memory.high is set to 1350 MiB, zswap is disabled.
> - 1-2: On 1-1, zswap is enabled without this patch.
> - 1-3: On 1-2, this patch is applied.

I'm not sure 0 and 1-1 are super interesting, since we care about
zswap performance under pressure before and after the patch.

> For all zswap enabled cases, zswap shrinker is enabled.

It would, however, be good to test without the shrinker as well, since
it's currently optional and not enabled by default.

The performance with the shrinker looks great, but if it regresses for
setups without the shrinker we need additional gating.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-14 16:23 ` Johannes Weiner
@ 2025-08-14 17:04   ` SeongJae Park
  0 siblings, 0 replies; 24+ messages in thread
From: SeongJae Park @ 2025-08-14 17:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm, Takero Funaki

On Thu, 14 Aug 2025 12:23:50 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, Aug 12, 2025 at 10:00:46AM -0700, SeongJae Park wrote:
> > When zswap writeback is enabled and it fails compressing a given page,
> > the page is swapped out to the backing swap device.  This behavior
> > breaks the zswap's writeback LRU order, and hence users can experience
> > unexpected latency spikes.  If the page is compressed without failure,
> > but results in a size of PAGE_SIZE, the LRU order is kept, but the
> > decompression overhead for loading the page back on the later access is
> > unnecessary.
> > 
> > Keep the LRU order and optimize unnecessary decompression overheads in
> > those cases, by storing the original content as-is in zpool.  The length
> > field of zswap_entry will be set appropriately, as PAGE_SIZE.  Hence
> > whether it is saved as-is or not (whether decompression is unnecessary)
> > is identified by 'zswap_entry->length == PAGE_SIZE'.
[...]
> > Tests
> > -----
> > 
> > I tested this patch using a simple self-written microbenchmark that is
> > available at GitHub[1].  You can reproduce the test I did by executing
> > run_tests.sh of the repo on your system.  Note that the repo's
> > documentation is not good as of this writing, so you may need to read
> > and use the code.
> > 
> > The basic test scenario is simple.  Run a test program that makes
> > artificial accesses to memory with artificial content under a
> > memory.high-set memory limit, and measure how many accesses were made in
> > a given time.
> > 
> > The test program repeatedly and randomly accesses three anonymous memory
> > regions.  The regions are all 500 MiB in size, and accessed with the same
> > probability.  Two of those are filled up with simple content that can
> > easily be compressed, while the remaining one is filled up with content
> > read from /dev/urandom, which is likely to fail at compressing to a
> > <PAGE_SIZE size.  The program runs for two minutes and prints out the
> > number of accesses made every five seconds.
> > 
> > The test script runs the program under below seven configurations.
> 
>                                                four?

Nice catch, thank you!  I will fix this in the next version.

> 
> > - 0: memory.high is set to 2 GiB, zswap is disabled.
> > - 1-1: memory.high is set to 1350 MiB, zswap is disabled.
> > - 1-2: On 1-1, zswap is enabled without this patch.
> > - 1-3: On 1-2, this patch is applied.
> 
> I'm not sure 0 and 1-1 are super interesting, since we care about
> zswap performance under pressure before and after the patch.

You're right, the main focus should be 1-2 vs 1-3.

I added 0 and 1-1, too, though, just for showing how artificial and micro the
given test setup is.

That is, as I also mentioned in the original mail, this test setup is for a
pretty extreme pressure case.  To quote,

        config            0       1-1     1-2      1-3
        perf_normalized   1.0000  0.0057  0.0235   0.0367
    [...]
    Note that ~10% working set pressure is already extreme, at least
    on this test setup.  No one would desire system setups that can degrade
    performance to 0.57% of the best case.

Enabling zswap improves the performance to 2.35% of the ideal setup, and this
patch further improves it to 3.67%.  But still no one would want a setup that
achieves only 3.67% of the ideal performance.  The metric itself is also
pure random memory access performance, which is again far from that of
real workloads.

So, this is a microbenchmarking test that is pretty far from the real world.

I will add more description about the intention of the configs in the next
version.  Let me know if you think it only disturbs reading of the patch
and would better be just dropped.
 
> 
> > For all zswap enabled cases, zswap shrinker is enabled.
> 
> It would, however, be good to test without the shrinker as well, since
> it's currently optional and not enabled by default.

Agreed.  Actually we ran without the shrinker for RFC v1[1] of this patch.
That version also showed performance improvement in that configuration.
Since we made no big change after that version, I expect no big difference.

Nonetheless, I will run the test again with the zswap shrinker disabled on
this version, and share the results.  Unless the results are far from my
expectation, I will directly post the next version of this patch with the
changelog updates promised above.

> 
> The performance with the shrinker looks great, but if it regresses for
> setups without the shrinker we need additional gating.

Agreed.  If I find a regression in the rerun of the test promised above, I
will share that as a reply to this thread, and start a discussion about the
possible additional gating.

[1] https://lore.kernel.org/20250730234059.4603-1-sj@kernel.org


Thanks,
SJ

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-13 18:20   ` SeongJae Park
@ 2025-08-15 22:28     ` Chris Li
  2025-08-15 23:08       ` Nhat Pham
  2025-08-16  0:07       ` SeongJae Park
  0 siblings, 2 replies; 24+ messages in thread
From: Chris Li @ 2025-08-15 22:28 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Chengming Zhou, David Hildenbrand, Johannes Weiner,
	Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm, Takero Funaki,
	Hugh Dickins

On Wed, Aug 13, 2025 at 11:20 AM SeongJae Park <sj@kernel.org> wrote:
>
> My RFC v1[1] used a kind of such approach.  I changed it to use zpool
> starting from RFC v2, following the suggestions from Nhat, Johannes and
> Joshua.  It was suggested to use zpool to reuse its support of migration,
> highmem handling, and stats.  And I agree with their suggestions.  David
> also raised a question on RFC v2, about the movability behavior changes[2].

Thanks for the pointer, that is an interesting read.

> I agree we will get more metadata overhead.  Nonetheless, I argue it is not a

Keep in mind that it is more than just metadata.  There is another hidden
overhead of using zpool to store your full page compared to directly
using the page.  When the page is read back, because the zpool backend
most likely did not store it at a page-aligned position, that zpool page
cannot be freed immediately; the fragmented zpool memory will need to
wait for compaction to be freed.

> big problem and can be mitigated in writeback-enabled environments.  Hence this
> feature is disabled in the writeback-disabled case.  The next paragraph
> of the original cover letter explains my points about this.
>
> >
> > The pro is that, on the page fault, there is no need to allocate a
> > new page.  You can just move the uncompressed_page into the swap_cache.
> > You save one memcpy on the page fault as well.  That will make the
> > incompressible page fault behave very similarly to the minor page fault.
>
> I agree this can save one memcpy, and that's a cool idea!  Nonetheless, this
> patch is not making the [z]swap-in overhead worse.  Rather, this patch also

We might still want to evaluate the loss/gain vs storing the
incompressible pages in swap cache.  Google actually has an internal
patch that stores the incompressible pages in swap cache and moves them
out of the LRU to another already existing list.  I can clean it up a
bit and send it to the list for comparison.  This approach will not mess
with the zswap LRU because the incompressible data is not moving into
zswap.  No page fault is needed to handle the incompressible pages.

> slightly improves it for the incompressible pages case by skipping the
> decompression.  Also, if we take this path, I think we also need a good
> answer for the zswapped-page migrations etc.

Yes, if we keep the store-page-in-zswap approach, we might have to
optimize inside zsmalloc to handle full-page storing better.

> So, if we agree on my justification about the metadata overhead, I think this
> could be done as a followup work of this patch?

Sure.  I am most interested in getting the best overall solution.  No
objection to getting it now vs later.

>
> >
> > The con is that now zpool does not account for uncompressed pages.  On
> > second thought, that can be a pro as well: the page is not compressed,
> > why should it be accounted as compressed pages?
>
> I agree it might look weird.  But, in my humble opinion, from the
> perspective of treating it as zswap backend storage, and with a flexible
> definition of compression, this may be understandable and not cause real
> user problems?

I think the real problem is hiding real errors behind non-errors like
incompressible data.  If you want to store uncompressed data in the
zswap layer by design, that is fine.  We just need to reason about the
trade-off vs other options like storing in the swap cache etc.

> [...]
> > >  mm/zswap.c | 36 ++++++++++++++++++++++++++++++++++--
> > >  1 file changed, 34 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 3c0fd8a13718..0fb940d03268 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -60,6 +60,8 @@ static u64 zswap_written_back_pages;
> > >  static u64 zswap_reject_reclaim_fail;
> > >  /* Store failed due to compression algorithm failure */
> > >  static u64 zswap_reject_compress_fail;
> > > +/* Compression into a size of <PAGE_SIZE failed */
> > > +static u64 zswap_compress_fail;
> > >  /* Compressed page was too big for the allocator to (optimally) store */
> > >  static u64 zswap_reject_compress_poor;
> > >  /* Load or writeback failed due to decompression failure */
> > > @@ -976,8 +978,26 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > >          */
> > >         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > >         dlen = acomp_ctx->req->dlen;
> > > -       if (comp_ret)
> > > -               goto unlock;
> > > +
> > > +       /*
> > > +        * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > > +        * save the content as is without a compression, to keep the LRU order
> > > +        * of writebacks.  If writeback is disabled, reject the page since it
> > > +        * only adds metadata overhead.  swap_writeout() will put the page back
> > > +        * to the active LRU list in the case.
> > > +        */
> > > +       if (comp_ret || dlen >= PAGE_SIZE) {
> > > +               zswap_compress_fail++;
> >
> > I feel that mixing the comp_ret and dlen size tests here complicates
> > the nested if statement and the behavior as well.
> > One behavior change is that, if comp_ret != 0, it means the
> > compression crypto engine has some error, e.g. some bad chip that
> > always fails the compression.  That is a different error case than one
> > where the compression was successful but the compression ratio is not
> > good.  In this case the patch makes the page be stored in zpool for a
> > crypto engine failure, which does not help.
>
I agree the code has room to improve, in terms of readability and
keeping fine-grained observability.
>
> >
> > Is there any reason you want to store the page in zpool when the
> > compression engine (not the ratio) fails?
>
> The main problem this patch tries to solve is the LRU order corruption.  In any
> case, I want to keep the order if possible.

In that case, does it mean that in order to keep the LRU, you never
want to write from zswap to a real back end device? That will make
zswap behave more like a standalone tier.

>
> >
> > What do you say about the following alternative, this keeps the
> > original behavior if compression engines fail.
> >
> >      if (comp_ret) {
> >           zswap_compress_engine_fail++;
>
> I agree this counter can be useful.

Ack.

>
> >           goto unlock;
>
> This will make the page go directly to the underlying slower swap device,
> and hence cause LRU order inversion.  So unfortunately I have to say this
> is not the behavior I want.

Ack, that is a design choice. I understand now.

>
> I agree engine failure is not frequent, so this behavior might not really
> be a problem to me.  But still, I don't see a reason to let the page go
> directly to swap and make the LRU order unexpected.
>
> >      }
> >
> >      if (dlen >= PAGE_SIZE) {
> >         zswap_compress_fail++;
>
> I define compress failure here as a failure of the attempt to compress the
> given page's content into a size smaller than PAGE_SIZE.  It is a superset
> including both "ratio" failure and "engine" failure.  Hence I think
> zswap_compress_fail should be increased even in the engine failure case.

I slept on it a bit.  Now I think we should make this a counter of
how many uncompressed pages are currently stored in zswap, preferably
as a per-memcg counter.
I saw that you implemented it as a counter in your V1.  Does zsmalloc
already track this information in the zs_pool size classes?  Having this
per-cgroup information, we can evaluate how much zswap is saving.  Some
jobs like databases might store a lot of hash values and encrypted data
which are not compressible.  In that case, we might just bypass the
zswap tier as a whole.  The overall compression ratio will reflect that
as well.  Having the separate per-memcg counter is kind of useful as
well.
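
For clarity, what I have in mind is a gauge, not a monotonically
increasing failure counter; a minimal sketch (the name is hypothetical):

    static atomic_long_t zswap_stored_uncompressed_pages = ATOMIC_LONG_INIT(0);

    /* when a page is stored as-is: */
    atomic_long_inc(&zswap_stored_uncompressed_pages);

    /* when an entry with entry->length == PAGE_SIZE is freed: */
    atomic_long_dec(&zswap_stored_uncompressed_pages);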

>
> We can discuss re-defining and re-naming what each stat means, of course,
> if you want.  But even if we keep the current definitions and names, we
> can still get the ratio failures once we add your nicely suggested
> zswap_compress_engine_fail, as
> 'zswap_compress_fail - zswap_compress_engine_fail'.  So we might not
> really need re-definition and re-naming?

I am actually less interested in the absolute failure number, which
keeps increasing, and more in how much incompressible data is stored in
zswap.  That counter + the number of engine errors should be perfect.

>
> >         if (!mem_cgroup_zswap_writeback_enabled(
> >                                       folio_memcg(page_folio(page)))) {
> >               comp_ret = -EINVAL;
> >               goto unlock;
> >         }
> >        dlen = PAGE_SIZE;
> >        dst = kmap_local_page(page);
> >     }
> >
> > Overall I feel this alternative is simpler and easier to read.
>
> I agree this code is nice to read.  Nonetheless I have to say the behavior is
> somewhat different from what I want.

Even if you keep the current behavior, you can invert the test condition
and then remove the "else + goto", similar to the above.  That will make
your code shorter and flatter.  I will need to think about whether we can
assign the return value less often.
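
For illustration, a minimal sketch of the flattened structure, keeping
the patch's store-as-is behavior (a sketch, not the actual v2 diff):

    comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
                               &acomp_ctx->wait);
    dlen = acomp_ctx->req->dlen;

    if (comp_ret || dlen >= PAGE_SIZE) {
            /* compression failed, or saved nothing */
            zswap_compress_fail++;
            /* without writeback, storing as-is only adds overhead */
            if (!mem_cgroup_zswap_writeback_enabled(
                            folio_memcg(page_folio(page)))) {
                    comp_ret = -EINVAL;
                    goto unlock;
            }
            /* store the original content as-is */
            dlen = PAGE_SIZE;
            dst = kmap_local_page(page);
    }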

Another point I would like to make is that you currently set the cut-off
threshold at page size.  The ideal threshold might be something slightly
smaller than page size.  The reason is that zsmalloc has fixed-size
chunks to store it.  If the compressed result is close to a full page,
it will be stored in the same class as the full page.  You are not
gaining anything from zsmalloc if the stored page's compression ratio is
95%.  That 5% will be wasted in zsmalloc class fragmentation anyway.  So
the trade-off is: decompress 95% of a page vs memcpy 100% of a page.  It
is likely memcpy 100% is faster.  I don't know the exact ideal cut-off
threshold.  If I had to guess, I would guess below 90%.
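
As a sketch of that idea (both the name and the 90% value below are
hypothetical, not part of the patch):

    /* Hypothetical cut-off: ~90% of PAGE_SIZE instead of PAGE_SIZE. */
    #define ZSWAP_STORE_AS_IS_THRESHOLD     ((PAGE_SIZE * 9) / 10)

    if (comp_ret || dlen >= ZSWAP_STORE_AS_IS_THRESHOLD) {
            /* take the store-as-is path instead of storing dlen bytes */
    }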

>
> > I can
> > be biased towards my own code :-).
> > I think we should treat the compression engine failure separately from
> > the compression rate failure. The engine failure is rare and we should
> > know about it as a real error. The compression rate failure is normal.
>
Again, I agree having the observability would be nice.  I can add a new
counter for that, like the patch attached below, for example.

I would love that. Make that per memcg if possible :-)

>
> [1] https://lore.kernel.org/20250730234059.4603-1-sj@kernel.org
> [2] https://lore.kernel.org/761a2899-6fd9-4bfe-aeaf-23bce0baa0f1@redhat.com

Thanks for the pointer, good read.

Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-13 18:32   ` Nhat Pham
@ 2025-08-15 22:34     ` Chris Li
  2025-08-15 22:44       ` Nhat Pham
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Li @ 2025-08-15 22:34 UTC (permalink / raw)
  To: Nhat Pham
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Wed, Aug 13, 2025 at 11:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > I know Hugh has some idea to store incompressible pages in the swap
> > cache as well. Hugh?
>
> I've also proposed that approach internally - keeping the page in
> swapcache, while adding them to the zswap LRU for writeback to disk
> (and so that we do not consider them for zswap again in the future).
>
> But after a while, we decided against it, mostly due to the complexity
> of the solution. On the zswap side, we need to distinguish between the

Google actually has an internal patch to keep incompressible pages in a
separate LRU outside of zswap.  But that breaks the zswap LRU order as
well.  If there is interest and I can find the time, I can send it out so
we can compare notes.  I do see the value of maintaining the LRU in the
zswap tier as a whole.

> ordinary struct zswap_entry and the struct page on zswap's LRU list.
> Externally, we need to handle moving a page currently in the zswap LRU
> to the main memory anon LRUs too.
>
> Migration is another concern. Zswap needs to be notified that the
> "backend" of a zswap entry has changed underneath it. Not impossible,
> but again that's just more surgery.

Ack. We might need to get that operation inside zsmalloc.

>
> So we decided to start with a simple solution (this one), and iterate
> as issues cropped up. At least then, we have production justifications
> for any future improvements.

Ack.

Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-13 18:58   ` Hugh Dickins
@ 2025-08-15 22:38     ` Chris Li
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Li @ 2025-08-15 22:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki

On Wed, Aug 13, 2025 at 11:58 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Wed, 13 Aug 2025, Chris Li wrote:
> >
> > I know Hugh has some idea to store incompressible pages in the swap
> > cache as well. Hugh?
>
> No, I don't have any idea or plan to store incompressible pages in the
> swap cache myself.  But yes, incompressible pages could well be "stored"
> in the swap cache.

Sorry Hugh, I don't mean to assign you more work. Just brainstorming.
Yes it could.  The pro would be no swap fault.  The con is breaking the
zswap LRU ordering.

Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-13 19:42   ` Shakeel Butt
  2025-08-13 20:48     ` Johannes Weiner
@ 2025-08-15 22:42     ` Chris Li
  1 sibling, 0 replies; 24+ messages in thread
From: Chris Li @ 2025-08-15 22:42 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Wed, Aug 13, 2025 at 12:42 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Aug 13, 2025 at 10:07:18AM -0700, Chris Li wrote:
> >
> > If you store uncompressed data in the zpool, zpool has metadata
> > overhead, e.g. allocating the entry->handle for uncompressed pages.
> > If the page is not compressed, another idea is to just skip the zpool
> > and store it as a page in the zswap entry.  We can make a union of
> > entry->handle and entry->incompressible_page.  If entry->length ==
> > PAGE_SIZE, use entry->incompressible_page as a page.
>
> The main problem being solved here is to avoid the scenario where the
> incompressible pages are being rotated in LRUs and zswapped multiple
> times and wasting CPU on compressing incompressible pages. SJ's approach
> solves the issue but with some memory overhead (zswap entry). With your
> suggestion and to solve the mentioned issue, we will need to change some
> core parts of reclaim (__remove_mapping()), LRU handling (swap cache
> pages not in LRUs) and refault (putting such pages back in LRU and

No, that is not what I have in mind.  I mean storing into a separate page
with a memcpy, keeping the swap cache folio reclaimed as normal.  Not much
different from a store in the zpool.
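
A minimal sketch of that, reusing the hypothetical
entry->incompressible_page union field from earlier in the thread:

    /* Sketch only: copy into a separate page instead of the zpool. */
    struct page *copy = alloc_page(GFP_KERNEL);

    if (!copy)
            return false;
    copy_highpage(copy, page);      /* copies the full PAGE_SIZE content */
    entry->incompressible_page = copy;
    entry->length = PAGE_SIZE;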

The cons really are migration etc. We might have to get it in zsmalloc later.

> should it handle read and write faults differently). So, the cons of
> that approach is more complex code.
>
> Personally I would prefer a simple solution with some overhead over a
> more complicated and error-prone solution without overhead.  Or maybe you
> have a more simplified approach instead?

Ack, there is more complexity to get the last bit of performance gain.
Maybe in zsmalloc later.

Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-13 20:48     ` Johannes Weiner
@ 2025-08-15 22:44       ` Chris Li
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Li @ 2025-08-15 22:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shakeel Butt, SeongJae Park, Andrew Morton, Chengming Zhou,
	David Hildenbrand, Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Wed, Aug 13, 2025 at 1:48 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Aug 13, 2025 at 12:42:32PM -0700, Shakeel Butt wrote:
> > On Wed, Aug 13, 2025 at 10:07:18AM -0700, Chris Li wrote:
> > >
> > > If you store uncompressed data in the zpool, zpool has metadata
> > > overhead, e.g. allocating the entry->handle for uncompressed pages.
> > > If the page is not compressed, another idea is to just skip the zpool
> > > and store it as a page in the zswap entry.  We can make a union of
> > > entry->handle and entry->incompressible_page.  If entry->length ==
> > > PAGE_SIZE, use entry->incompressible_page as a page.
> >
> > The main problem being solved here is to avoid the scenario where the
> > incompressible pages are being rotated in LRUs and zswapped multiple
> > times and wasting CPU on compressing incompressible pages. SJ's approach
> > solves the issue but with some memory overhead (zswap entry). With your
> > suggestion and to solve the mentioned issue, we will need to change some
> > core parts of reclaim (__remove_mapping()), LRU handling (swap cache
> > pages not in LRUs) and refault (putting such pages back in LRU and
> > should it handle read and write faults differently). So, the cons of
> > that approach is more complex code.
>
> What Chris is proposing would also fix that, even for configurations
> without writeback. So I'm not opposed to it.
>
> However, for deployments where writeback *is* enabled, this code is an
> improvement over the status quo. And it's not in conflict with a
> broader fix for !writeback setups, so it's not an either-or scenario.
>
> Specifically for the writeback case, the metadata overhead is not much
> of a concern: we can just write back more zswap tail to make up for
> it; the more important thing is that we can now do so in LRU order.
>
> The premise being that writing an additional cold page from zswap to
> disk to make room for a slightly inefficiently stored warm page is
> better than rejecting and sending the *warm* page to disk instead.
>
> So I agree with you Chris. But also think that's follow-up work for
> somebody who cares about the !writeback case.

Thank you and I agree with your assessment.

Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-15 22:34     ` Chris Li
@ 2025-08-15 22:44       ` Nhat Pham
  0 siblings, 0 replies; 24+ messages in thread
From: Nhat Pham @ 2025-08-15 22:44 UTC (permalink / raw)
  To: Chris Li
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Fri, Aug 15, 2025 at 3:34 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Wed, Aug 13, 2025 at 11:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > I know Hugh has some idea to store incompressible pages in the swap
> > > cache as well. Hugh?
> >
> > I've also proposed that approach internally - keeping the page in
> > swapcache, while adding them to the zswap LRU for writeback to disk
> > (and so that we do not consider them for zswap again in the future).
> >
> > But after a while, we decided against it, mostly due to the complexity
> > of the solution. On the zswap side, we need to distinguish between the
>
> Google actually has an internal patch to keep incompressible pages in a
> separate LRU outside of zswap.  But that breaks the zswap LRU order as
> well.  If there is interest and I can find the time, I can send it out so
> we can compare notes.  I do see the value of maintaining the LRU in the
> zswap tier as a whole.

It would be very valuable. Much appreciated, Chris!

And yes, we also discussed that approach.  What I described above
was an attempt to get the best of all worlds.

There are a couple of desiderata:

1. Maintain LRU ordering, in case we want to do zswap writeback.
2. Do not retry incompressible pages, until they are accessed.
3. Minimize the cost of page fault (memcpy, page allocation/free), as
much as possible.
4. Do not retry incompressible pages, until they are dirtied (a
stronger guarantee than 2).
5. Keep the stored data migratable.

In the end, it got too complicated. So we decided to go for 1, 2, and
5, with this approach. Regarding 3, this is still an improvement over
vanilla zswap (which writes to disk, and tends to have an even higher
cost of page fault).

I believe Google's approach gets us 2, 3, 4.


>
> > ordinary struct zswap_entry and the struct page on zswap's LRU list.
> > Externally, we need to handle moving a page currently in the zswap LRU
> > to the main memory anon LRUs too.
> >
> > Migration is another concern. Zswap needs to be notified that the
> > "backend" of a zswap entry has changed underneath it. Not impossible,
> > but again that's just more surgery.
>
> Ack. We might need to get that operation inside zsmalloc.
>
> >
> > So we decided to start with a simple solution (this one), and iterate
> > as issues cropped up. At least then, we have production justifications
> > for any future improvements.
>
> Ack.
>
> Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-15 22:28     ` Chris Li
@ 2025-08-15 23:08       ` Nhat Pham
  2025-08-16  0:14         ` SeongJae Park
  2025-08-16  0:30         ` Chris Li
  2025-08-16  0:07       ` SeongJae Park
  1 sibling, 2 replies; 24+ messages in thread
From: Nhat Pham @ 2025-08-15 23:08 UTC (permalink / raw)
  To: Chris Li
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Fri, Aug 15, 2025 at 3:29 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Wed, Aug 13, 2025 at 11:20 AM SeongJae Park <sj@kernel.org> wrote:
> >
> > My RFC v1[1] used a kind of such approach.  I changed it to use zpool
> > starting from RFC v2, following the suggestions from Nhat, Johannes and
> > Joshua.  It was suggested to use zpool to reuse its support of migration,
> > highmem handling, and stats.  And I agree with their suggestions.  David
> > also raised a question on RFC v2, about the movability behavior changes[2].
>
> Thanks for the pointer, that is an interesting read.
>
> > I agree we will get more metadata overhead.  Nonetheless, I argue it is not a
>
> Keep in mind that it is more than just metadata.  There is another hidden
> overhead of using zpool to store your full page compared to directly
> using the page.  When the page is read back, because the zpool backend
> most likely did not store it at a page-aligned position, that zpool page
> cannot be freed immediately; the fragmented zpool memory will need to
> wait for compaction to be freed.

It's been a while since I last worked on zsmalloc, but IIRC zsmalloc
handles these PAGE_SIZE-sized objects by just handing out a full page.
I skimmed through the zsmalloc code, and it seems like for this
size_class, each zspage just contains one base page?

(check out the logic around ZsHugePage and
calculate_zspage_chain_size() in mm/zsmalloc.c. Confusing name, I
know).

So we won't need compaction to get rid of these buffers for
incompressible pages, at free time. There might still be some extra
overhead (metadata), but at least I assume we can free these pages
easily?

>
> > big problem and can be mitigated in writeback-enabled environments.  Hence this
> > feature is disabled in the writeback-disabled case.  The next paragraph
> > of the original cover letter explains my points about this.
> >
> > >
> > > The pro is that, on the page fault, there is no need to allocate a
> > > new page.  You can just move the uncompressed_page into the swap_cache.
> > > You save one memcpy on the page fault as well.  That will make the
> > > incompressible page fault behave very similarly to the minor page fault.
> >
> > I agree this can save one memcpy, and that's a cool idea!  Nonetheless, this
> > patch is not making the [z]swap-in overhead worse.  Rather, this patch also
>
> We might still want to evaluate the loss/gain vs storing the
> incompressible pages in swap cache.  Google actually has an internal
> patch that stores the incompressible pages in swap cache and moves them
> out of the LRU to another already existing list.  I can clean it up a
> bit and send it to the list for comparison.  This approach will not mess
> with the zswap LRU because the incompressible data is not moving into
> zswap.  No page fault is needed to handle the incompressible pages.

We can always iterate to get the best of all worlds :) One objective at a time.

BTW, may I ask - how/when do we move the incompressible pages out of
the "incompressible LRU"?  I believe there needs to be some sort of
scanning mechanism to detect dirtying, right?

>
> > slightly improves it for the incompressible pages case by skipping the
> > decompression.  Also, if we take this path, I think we also need a good
> > answer for the zswapped-page migrations etc.
>
> Yes, if we keep the store-page-in-zswap approach, we might have to
> optimize inside zsmalloc to handle full-page storing better.

I believe it already does - check my response above. But I can be wrong.

>
> > So, if we agree on my justification about the metadata overhead, I think this
> > could be done as a followup work of this patch?
>
> Sure.  I am most interested in getting the best overall solution.  No
> objection to getting it now vs later.

Agree. We can always iterate on solutions. No big deal.

>
> >
> > >
> > > The con is that now zpool does not account for uncompressed pages.  On
> > > second thought, that can be a pro as well: the page is not compressed,
> > > why should it be accounted as compressed pages?
> >
> > I agree it might look weird.  But, in my humble opinion, from the
> > perspective of treating it as zswap backend storage, and with a flexible
> > definition of compression, this may be understandable and not cause real
> > user problems?
>
> I think the real problem is hiding real errors behind non-errors like
> incompressible data.  If you want to store uncompressed data in the
> zswap layer by design, that is fine.  We just need to reason about the
> trade-off vs other options like storing in the swap cache etc.
>
> > [...]
> > > >  mm/zswap.c | 36 ++++++++++++++++++++++++++++++++++--
> > > >  1 file changed, 34 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 3c0fd8a13718..0fb940d03268 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -60,6 +60,8 @@ static u64 zswap_written_back_pages;
> > > >  static u64 zswap_reject_reclaim_fail;
> > > >  /* Store failed due to compression algorithm failure */
> > > >  static u64 zswap_reject_compress_fail;
> > > > +/* Compression into a size of <PAGE_SIZE failed */
> > > > +static u64 zswap_compress_fail;
> > > >  /* Compressed page was too big for the allocator to (optimally) store */
> > > >  static u64 zswap_reject_compress_poor;
> > > >  /* Load or writeback failed due to decompression failure */
> > > > @@ -976,8 +978,26 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > > >          */
> > > >         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > > >         dlen = acomp_ctx->req->dlen;
> > > > -       if (comp_ret)
> > > > -               goto unlock;
> > > > +
> > > > +       /*
> > > > +        * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > > > +        * save the content as is without a compression, to keep the LRU order
> > > > +        * of writebacks.  If writeback is disabled, reject the page since it
> > > > +        * only adds metadata overhead.  swap_writeout() will put the page back
> > > > +        * to the active LRU list in the case.
> > > > +        */
> > > > +       if (comp_ret || dlen >= PAGE_SIZE) {
> > > > +               zswap_compress_fail++;
> > >
> > > I feel that mixing the comp_ret and dlen size tests here complicates
> > > the nested if statement and the behavior as well.
> > > One behavior change is that, if comp_ret != 0, it means the
> > > compression crypto engine has some error, e.g. some bad chip that
> > > always fails the compression.  That is a different error case than one
> > > where the compression was successful but the compression ratio is not
> > > good.  In this case the patch makes the page be stored in zpool for a
> > > crypto engine failure, which does not help.
> >
> > I agree the code has room to improve, in terms of readability and
> > keeping fine-grained observability.
> >
> > >
> > > Is there any reason you want to store the page in zpool when the
> > > compression engine (not the ratio) fails?
> >
> > The main problem this patch tries to solve is the LRU order corruption.  In any
> > case, I want to keep the order if possible.
>
> In that case, does it mean that in order to keep the LRU, you never
> want to write from zswap to a real back end device? That will make
> zswap behave more like a standalone tier.
>
> >
> > >
> > > What do you say about the following alternative, this keeps the
> > > original behavior if compression engines fail.
> > >
> > >      if (comp_ret) {
> > >           zswap_compress_engine_fail++;
> >
> > I agree this counter can be useful.
>
> Ack.
>
> >
> > >           goto unlock;
> >
> > This will make the page go directly to the underlying slower swap device,
> > and hence cause LRU order inversion.  So unfortunately I have to say this
> > is not the behavior I want.
>
> Ack, that is a design choice. I understand now.
>
> >
> > I agree engine failure is not frequent, so this behavior might not really
> > be a problem to me.  But still, I don't see a reason to let the page go
> > directly to swap and make the LRU order unexpected.
> >
> > >      }
> > >
> > >      if (dlen >= PAGE_SIZE) {
> > >         zswap_compress_fail++;
> >
> > I define compress failure here as a failure of the attempt to compress the
> > given page's content into a size smaller than PAGE_SIZE.  It is a superset
> > including both "ratio" failure and "engine" failure.  Hence I think
> > zswap_compress_fail should be increased even in the engine failure case.
>
> I slept on it a bit.  Now I think we should make this a counter of
> how many uncompressed pages are currently stored in zswap, preferably
> as a per-memcg counter.

Actually, yeah I asked about this counter in a review in an earlier
version as well, then I completely forgot about it :)


> I saw that you implemented it as a counter in your V1.  Does zsmalloc
> already track this information in the zs_pool size classes?  Having this

Kinda sorta. If we build the kernel with CONFIG_ZSMALLOC_STAT, we can
get the number of objects in each size_class.

Each time we read, I believe we have to read every size class though.
So it's kinda annoying. Whereas here, we can just read an atomic
counter? :)

> per-cgroup information, we can evaluate how much zswap is saving.  Some
> jobs like databases might store a lot of hash values and encrypted data
> which are not compressible.  In that case, we might just bypass the

Hmmm actually, this might be a good point. Lemme think about it a bit more.

(but again, worst case scenario, we can send a follow up patch to add
these counters).

> zswap tier as a whole.  The overall compression ratio will reflect that
> as well.  Having the separate per-memcg counter is kind of useful as
> well.
>
> >
> > We can discuss re-defining and re-naming what each stat means, of course,
> > if you want.  But even if we keep the current definitions and names, we
> > can still get the ratio failures once we add your nicely suggested
> > zswap_compress_engine_fail, as
> > 'zswap_compress_fail - zswap_compress_engine_fail'.  So we might not
> > really need re-definition and re-naming?
>
> I am actually less interested in the absolute failure number, which
> keeps increasing, and more in how much incompressible data is stored in
> zswap.  That counter + the number of engine errors should be perfect.
>
> >
> > >         if (!mem_cgroup_zswap_writeback_enabled(
> > >                                       folio_memcg(page_folio(page)))) {
> > >               comp_ret = -EINVAL;
> > >               goto unlock;
> > >         }
> > >        dlen = PAGE_SIZE;
> > >        dst = kmap_local_page(page);
> > >     }
> > >
> > > Overall I feel this alternative is simpler and easier to read.
> >
> > I agree this code is nice to read.  Nonetheless I have to say the behavior is
> > somewhat different from what I want.
>
> Even if you keep the current behavior, you can invert the test condition
> and then remove the "else + goto", similar to the above.  That will make
> your code shorter and flatter.  I will need to think about whether we can
> assign the return value less often.
>
> Another point I would like to make is that you currently set the cut-off
> threshold at page size.  The ideal threshold might be something slightly
> smaller than page size.  The reason is that zsmalloc has fixed-size
> chunks to store it.  If the compressed result is close to a full page,
> it will be stored in the same class as the full page.  You are not
> gaining anything from zsmalloc if the stored page's compression ratio is
> 95%.  That 5% will be wasted in zsmalloc class fragmentation anyway.  So
> the trade-off is: decompress 95% of a page vs memcpy 100% of a page.  It
> is likely memcpy 100% is faster.  I don't know the exact ideal cut-off
> threshold.  If I had to guess, I would guess below 90%.
>
> >
> > > I can
> > > be biased towards my own code :-).
> > > I think we should treat the compression engine failure separately from
> > > the compression rate failure. The engine failure is rare and we should
> > > know about it as a real error. The compression rate failure is normal.
> >
> > Again, I agree having the observability would be nice.  I can add a new counter
> > for that, like below attached patch, for example.
>
> I would love that. Make that per memcg if possible :-)
>
> >
> > [1] https://lore.kernel.org/20250730234059.4603-1-sj@kernel.org
> > [2] https://lore.kernel.org/761a2899-6fd9-4bfe-aeaf-23bce0baa0f1@redhat.com
>
> Thanks for the pointer, good read.
>
> Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-15 22:28     ` Chris Li
  2025-08-15 23:08       ` Nhat Pham
@ 2025-08-16  0:07       ` SeongJae Park
  2025-08-16  2:20         ` Chris Li
  1 sibling, 1 reply; 24+ messages in thread
From: SeongJae Park @ 2025-08-16  0:07 UTC (permalink / raw)
  To: Chris Li
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Fri, 15 Aug 2025 15:28:59 -0700 Chris Li <chrisl@kernel.org> wrote:

> On Wed, Aug 13, 2025 at 11:20 AM SeongJae Park <sj@kernel.org> wrote:
[...]
> We might still want to evaluate the loss/gain vs storing the
> incompressible pages in the swap cache. Google actually has an
> internal patch to store the incompressible pages in the swap cache
> and move them out of the LRU to another already existing list. I can
> clean it up a bit and send it to the list for comparison.

This would be really nice!

[...]
> > So, if we agree on my justification about the metadata overhead, I think this
> > could be done as a followup work of this patch?
> 
> Sure. I am most interested in getting the best overall solution. No
> objection to getting it now vs later.

Thank you for being flexible, Chris.  I'm also looking forward to continuing
to collaborate with you on the followup work!

[...]
> > > Is there any reason you want to store the page in zpool when the
> > > compression engine (not the ratio) fails?
> >
> > The main problem this patch tries to solve is the LRU order corruption.  In any
> > case, I want to keep the order if possible.
> 
> In that case, does it mean that in order to keep the LRU, you never
> want to write from zswap to a real back end device?

Not always, but only once we have to write zswapped pages back to the real
backend device, and all zswapped pages of lower LRU order have already been
written back.

[...]
> > >      if (dlen >= PAGE_SIZE) {
> > >         zswap_compress_fail++;
> >
> > I define compress failure here as a failure of the attempt to compress the
> > given page's content into a size smaller than PAGE_SIZE.  It is a superset
> > including both "ratio" failure and "engine" failure.  Hence I think
> > zswap_compress_fail should be increased even in the above case.
> 
> I slept on it a bit. Now I think we should make this a counter of
> how many uncompressed pages are stored in zswap. Preferably as a
> per-memcg counter.

I agree that could be useful.  I will add the counter in the next version (v4).
But making it for each memcg may be out of the scope of this patch, in my
opinion.  Would you mind if the per-memcg counter implementation is done as a
followup?

> I saw that you implemented it as a counter in your V1.

Yes, though it was only for internal usage and therefore not exposed to the
user space.  I will make it again and expose it to the user space via debugfs.
Say, stored_uncompressed_pages?

> Does the zsmalloc
> already track this information in the zspool class?

zsmalloc provides such information when CONFIG_ZSMALLOC_STAT is enabled, to my
understanding.

[...]
> I am actually less interested in the absolute failure number, which
> keeps increasing, and more in how much incompressible data is stored
> in zswap. That counter + the number of engine errors should be perfect.

Sounds good.  So the next version (v4) of this patch will provide two new
debugfs counters, namely compress_engine_fail and stored_uncompressed_pages.
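
A minimal sketch of how they could be wired up, assuming plain u64 counters
registered from zswap_debugfs_init() next to the existing zswap debugfs
counters (names per this discussion; the exact types and placement are my
assumption):

        /* global debugfs counters, to be bumped from zswap_compress() */
        static u64 zswap_compress_engine_fail;
        static u64 zswap_stored_uncompressed_pages;

        debugfs_create_u64("compress_engine_fail", 0444,
                           zswap_debugfs_root, &zswap_compress_engine_fail);
        debugfs_create_u64("stored_uncompressed_pages", 0444,
                           zswap_debugfs_root, &zswap_stored_uncompressed_pages);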

[...]
> > I agree this code is nice to read.  Nonetheless I have to say the behavior is
> > somewhat different from what I want.
> 
> Even if you keep the current behavior, you can invert the test
> condition and then remove the "else + goto", similar to the above.
> That will make your code shorter and flatter. I will need to think
> about whether we can assign the return value less often.

Nice idea.  What about below?

        if (comp_ret) {
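                /* crypto engine error: treat like an incompressible page */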
                zswap_compress_engine_fail++;
                dlen = PAGE_SIZE;
        }
        if (dlen >= PAGE_SIZE) {
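                /* not compressed below PAGE_SIZE: store the page as-is */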
                zswap_compress_fail++;
                if (!mem_cgroup_zswap_writeback_enabled(
                                        folio_memcg(page_folio(page)))) {
                        comp_ret = -EINVAL;
                        goto unlock;
                }
                comp_ret = 0;
                dlen = PAGE_SIZE;
                dst = kmap_local_page(page);
        }

> 
> Another point I would like to make is that you currently set the
> cut-off threshold to the page size. The ideal threshold might be
> something slightly smaller than the page size. The reason is that
> zsmalloc uses fixed size chunks for storage. If your page is close
> to full, it will store the data in the same class as a full page.
> You are not gaining anything from zsmalloc if the compressed page is
> still 95% of the original size. That 5% will be wasted in zsmalloc
> size class fragmentation anyway. So the trade-off is: decompress 95%
> of a page vs memcpy 100% of a page. It is likely the 100% memcpy is
> faster. I don't know the exact ideal cut-off threshold. If I had to
> guess, I would guess below 90%.

Agreed, this could be another nice followup work to do.
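
For reference, a minimal sketch of that cut-off on top of the code above (the
90% figure is only your guess, and the constant name is hypothetical):

        /* hypothetical cut-off: store as-is when compression saves <10% */
        #define ZSWAP_STORE_AS_IS_THRESHOLD (PAGE_SIZE * 9 / 10)

        if (!comp_ret && dlen >= ZSWAP_STORE_AS_IS_THRESHOLD)
                dlen = PAGE_SIZE; /* falls through to the store-as-is path */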

> 
> >
> > > I can
> > > be biased towards my own code :-).
> > > I think we should treat the compression engine failure separately from
> > > the compression rate failure. The engine failure is rare and we should
> > > know about it as a real error. The compression rate failure is normal.
> >
> > Again, I agree having the observability would be nice.  I can add a new counter
> > for that, like the attached patch below, for example.
> 
> I would love that. Make that per memcg if possible :-)

As mentioned above, I would like to make only a global counter on debugfs for
now, if you don't mind.  Let me know if you mind.


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-15 23:08       ` Nhat Pham
@ 2025-08-16  0:14         ` SeongJae Park
  2025-08-16  2:23           ` Chris Li
  2025-08-16  0:30         ` Chris Li
  1 sibling, 1 reply; 24+ messages in thread
From: SeongJae Park @ 2025-08-16  0:14 UTC (permalink / raw)
  To: Nhat Pham
  Cc: SeongJae Park, Chris Li, Andrew Morton, Chengming Zhou,
	David Hildenbrand, Johannes Weiner, Yosry Ahmed, linux-kernel,
	linux-mm, Takero Funaki, Hugh Dickins

On Fri, 15 Aug 2025 16:08:50 -0700 Nhat Pham <nphamcs@gmail.com> wrote:

> On Fri, Aug 15, 2025 at 3:29 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Wed, Aug 13, 2025 at 11:20 AM SeongJae Park <sj@kernel.org> wrote:
[...]
> > I slept on it a bit. Now I think we should make this a counter of
> > how many uncompressed pages are stored in zswap. Preferably as a
> > per-memcg counter.
> 
> Actually, yeah I asked about this counter in a review in an earlier
> version as well, then I completely forgot about it :)
> 
> 
> > I saw that you implemented it as a counter in your V1. Does the zsmalloc
> > already track this information in the zspool class? Having this per
> 
> Kinda sorta. If we build the kernel with CONFIG_ZSMALLOC_STAT, we can
> get the number of objects in each size_class.
> 
> Each time we read, I believe we have to read every size class though.
> So it's kinda annoying. Whereas here, we can just read an atomic
> counter? :)

Sounds good.  So in the next version (v4), I will drop compress_fail.  Instead,
I will add two new counters, namely compress_engine_fail and the new atomic
counter, say, stored_uncompressed_pages.  Please suggest better names or
correct me if I'm missing some of your points.


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-15 23:08       ` Nhat Pham
  2025-08-16  0:14         ` SeongJae Park
@ 2025-08-16  0:30         ` Chris Li
  1 sibling, 0 replies; 24+ messages in thread
From: Chris Li @ 2025-08-16  0:30 UTC (permalink / raw)
  To: Nhat Pham
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Fri, Aug 15, 2025 at 4:09 PM Nhat Pham <nphamcs@gmail.com> wrote:
>

> > Keep in mind that is more than just metadata. There is another hidden
> > overhead of using zpool to store your full page compared to directly
> > using the page.  When the page is read, because the zpool backend most
> > likely does not store it at a page-aligned position, that zpool page
> > cannot be freed immediately; those fragmented zpool pages will need
> > to wait for compaction to be freed.
>
> It's been a while since I last worked on zsmalloc, but IIRC zsmalloc
> handles these PAGE_SIZE sized objects by just handing out a full
> page. I skimmed through the zsmalloc code, and it seems like for this
> size_class, each zspage just contains one base page?

Yes, they have to be full size for the size class that holds full
pages. That is exactly what I expected. What I previously said about
the page possibly being fragmented and unable to be released from
zpool is wrong, now that you mention it, because of the full-size
class.  That is also why I make the other point that the ideal cut-off
threshold can be less than PAGE_SIZE, e.g. after compressing to 95% of
a full page, you might just as well store it uncompressed. That 5%
will be wasted in the class margin anyway.

>
> (check out the logic around ZsHugePage and
> calculate_zspage_chain_size() in mm/zsmalloc.c. Confusing name, I
> know).
>
> So we won't need compaction to get rid of these buffers for
> incompressible pages at free time. There might still be some extra
> overhead (metadata), but at least I assume we can free these pages
> easily?

Maybe, or steal them into the swap cache.

> > We might still want to evaluate the loss/gain vs storing the
> > incompressible pages in the swap cache. Google actually has an
> > internal patch to store the incompressible pages in the swap cache
> > and move them out of the LRU to another already existing list. I can
> > clean it up a bit and send it to the list for comparison. This
> > approach will not mess with the zswap LRU because the incompressible
> > data does not move into zswap. There is no page fault needed to
> > handle the incompressible pages.
>
> We can always iterate to get the best of all worlds :) One objective at a time.
>
> BTW, may I ask - how/when do we move the incompressible pages out of
> the "incompressible LRU"? I believe there needs to be some sort of
> scanning mechanism to detect dirtying, right?
>
> >
> > > slightly improve it for the incompressible pages case by skipping the
> > > decompression.  Also, if we take this path, I think we also need to have
> > > a good answer for zswapped-page migration etc.
> >
> > Yes, if we keep the store-the-page-in-zswap approach, we might have to
> > optimize inside zsmalloc to handle full page storing better.

Sure. Ack.

>
> I believe it already does - check my response above. But I can be wrong.
>

> > I slept on it a bit. Now I think we should make this a counter of
> > how many uncompressed pages are stored in zswap. Preferably as a
> > per-memcg counter.
>
> Actually, yeah I asked about this counter in a review in an earlier
> version as well, then I completely forgot about it :)

Ack.

>
>
> > I saw that you implemented it as a counter in your V1. Does the zsmalloc
> > already track this information in the zspool class? Having this per
>
> Kinda sorta. If we build the kernel with CONFIG_ZSMALLOC_STAT, we can
> get the number of objects in each size_class.
>
> Each time we read, I believe we have to read every size class though.
> So it's kinda annoying. Whereas here, we can just read an atomic
> counter? :)
>

Whether the data is compressible is very dependent on the job. A
per-memcg counter will be more useful than a global overall counter.
To play devil's advocate against the per-memcg incompressible counter:
if most of the data is not compressible, the per-memcg overall
compression ratio (compressed size vs pre-compression size) can
reflect that as well. It depends on how fine-grained we want this data
to be.

> > cgroup information we can evaluate how much zswap is saving. Some
> > jobs like databases might store a lot of hash values and encrypted data
> > which is not compressible. In that case, we might just pass the whole
>
> Hmmm actually, this might be a good point. Lemme think about it a bit more.
>
> (but again, worst case scenario, we can send a follow up patch to add
> these counters).

Agree.

Thanks

Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-16  0:07       ` SeongJae Park
@ 2025-08-16  2:20         ` Chris Li
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Li @ 2025-08-16  2:20 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Chengming Zhou, David Hildenbrand, Johannes Weiner,
	Nhat Pham, Yosry Ahmed, linux-kernel, linux-mm, Takero Funaki,
	Hugh Dickins

On Fri, Aug 15, 2025 at 5:07 PM SeongJae Park <sj@kernel.org> wrote:
>
> On Fri, 15 Aug 2025 15:28:59 -0700 Chris Li <chrisl@kernel.org> wrote:
> > Sure. I am most interested in getting the best overall solution. No
> > objection to getting it now vs later.
>
> Thank you for being flexible, Chris.  I'm also looking forward to continuing
> to collaborate with you on the followup work!

Likewise.

>
> [...]
> > > > Is there any reason you want to store the page in zpool when the
> > > > compression engine (not the ratio) fails?
> > >
> > > The main problem this patch tries to solve is the LRU order corruption.  In any
> > > case, I want to keep the order if possible.
> >
> > In that case, does it mean that in order to keep the LRU, you never
> > want to write from zswap to a real back end device?
>
> Not always, but only once we have to write zswapped pages back to the real
> backend device, and all zswapped pages of lower LRU order have already been
> written back.

I see, that makes sense. Only write to the lower tier while
maintaining LRU order.
Thanks for the explanation. I can see the bigger picture now.

> > I slept on it a bit. Now I think we should make this a counter of
> > how many uncompressed pages are stored in zswap. Preferably as a
> > per-memcg counter.
>
> I agree that could be useful.  I will add the counter in the next version (v4).
> But making it for each memcg may be out of the scope of this patch, in my
> opinion.  Would you mind if the per-memcg counter implementation is done as
> a followup?

No objections. If the incompressible page count is very bad, it will
drag down the overall compression ratio as well. So we do have some
per-memcg signal to pick it up. The trade-off is how complex we want
to go for those marginal finer-grained measurements.

> > I saw that you implemented it as a counter in your V1.
>
> Yes, though it was only for internal usage and therefore not exposed to the
> user space.  I will make it again and expose it to the user space via debugfs.
> Say, stored_uncompressed_pages?

I am not very good at naming either. Maybe
"stored_incompressible_pages"? "Uncompressed" describes the resulting
state; it sounds like we chose to store them uncompressed. The root
cause is that those pages are incompressible. That is the nature of
the page; we don't have a choice to compress them.

>
> > Does the zsmalloc
> > already track this information in the zspool class?
>
> zsmalloc provides such information when CONFIG_ZSMALLOC_STAT is enabled, to my
> understanding.

Thanks. I think that is likely global, not per memcg.

>
> [...]
> > I am actually less interested in the absolute failure number, which
> > keeps increasing, and more in how much incompressible data is stored
> > in zswap. That counter + the number of engine errors should be perfect.
>
> Sounds good.  So the next version (v4) of this patch will provide two new
> debugfs counters, namely compress_engine_fail and stored_uncompressed_pages.

Sounds good. Maybe "crypto_compress_fail"; "engine" is the common
name as I call it. Here the engine refers to the crypto library, so
"crypto_compress_fail" matches the code better.

>
> [...]
> > > I agree this code is nice to read.  Nonetheless I have to say the behavior is
> > > somewhat different from what I want.
> >
> > Even if you keep the current behavior, you can invert the test
> > condition and then remove the "else + goto", similar to the above.
> > That will make your code shorter and flatter. I will need to think
> > about whether we can assign the return value less often.
>
> Nice idea.  What about below?
>
>         if (comp_ret) {

You can consider adding (comp_ret || !dlen), because if dlen == 0,
something is seriously wrong with the compression; that is a real
error.

>                 zswap_compress_engine_fail++;
>                 dlen = PAGE_SIZE;
>         }
>         if (dlen >= PAGE_SIZE) {
>                 zswap_compress_fail++;
>                 if (!mem_cgroup_zswap_writeback_enabled(
>                                         folio_memcg(page_folio(page)))) {
>                         comp_ret = -EINVAL;
>                         goto unlock;
>                 }
>                 comp_ret = 0;
>                 dlen = PAGE_SIZE;
>                 dst = kmap_local_page(page);
>         }

Looks very good to me. Much cleaner than the V2. Thanks for adopting that.
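
With the (comp_ret || !dlen) suggestion folded in, the first check might read
like below (a sketch of the combination, not the final patch):

        if (comp_ret || !dlen) {
                /* real engine error, including a bogus zero-length result */
                zswap_compress_engine_fail++;
                dlen = PAGE_SIZE;
        }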

> > Another point I would like to make is that you currently set the
> > cut-off threshold to the page size. The ideal threshold might be
> > something slightly smaller than the page size. The reason is that
> > zsmalloc uses fixed size chunks for storage. If your page is close
> > to full, it will store the data in the same class as a full page.
> > You are not gaining anything from zsmalloc if the compressed page is
> > still 95% of the original size. That 5% will be wasted in zsmalloc
> > size class fragmentation anyway. So the trade-off is: decompress 95%
> > of a page vs memcpy 100% of a page. It is likely the 100% memcpy is
> > faster. I don't know the exact ideal cut-off threshold. If I had to
> > guess, I would guess below 90%.
>
> Agreed, this could be another nice followup work to do.

Ack.

>
> >
> > >
> > > > I can
> > > > be biased towards my own code :-).
> > > > I think we should treat the compression engine failure separately from
> > > > the compression rate failure. The engine failure is rare and we should
> > > > know about it as a real error. The compression rate failure is normal.
> > >
> > > Again, I agree having the observability would be nice.  I can add a new counter
> > > for that, like the attached patch below, for example.
> >
> > I would love that. Make that per memcg if possible :-)
>
> As mentioned above, I would like to make only a global counter on debugfs for
> now, if you don't mind.  Let me know if you mind.

I don't mind. We can take the incremental approach.

Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-16  0:14         ` SeongJae Park
@ 2025-08-16  2:23           ` Chris Li
  2025-08-18 18:18             ` SeongJae Park
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Li @ 2025-08-16  2:23 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Nhat Pham, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Fri, Aug 15, 2025 at 5:14 PM SeongJae Park <sj@kernel.org> wrote:
>
> Sounds good.  So in the next version (v4), I will drop compress_fail.  Instead,
> I will add two new counters, namely compress_engine_fail and the new atomic
> counter, say, stored_uncompressed_pages.  Please suggest better names or
> correct me if I'm missing some of your points.

Sounds great. As I suggested in my other email, if it were up to me:

"crypto_compress_fail" and "stored_incompressible_pages"

Thanks!

Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-16  2:23           ` Chris Li
@ 2025-08-18 18:18             ` SeongJae Park
  2025-08-18 20:33               ` Chris Li
  0 siblings, 1 reply; 24+ messages in thread
From: SeongJae Park @ 2025-08-18 18:18 UTC (permalink / raw)
  To: Chris Li
  Cc: SeongJae Park, Nhat Pham, Andrew Morton, Chengming Zhou,
	David Hildenbrand, Johannes Weiner, Yosry Ahmed, linux-kernel,
	linux-mm, Takero Funaki, Hugh Dickins

On Fri, 15 Aug 2025 19:23:10 -0700 Chris Li <chrisl@kernel.org> wrote:

> On Fri, Aug 15, 2025 at 5:14 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > Sounds good.  So in the next version (v4), I will drop compress_fail.  Instead,
> > I will add two new counters, namely compress_engine_fail and the new atomic
> > counter, say, stored_uncompressed_pages.  Please suggest better names or
> > correct me if I'm missing some of your points.
> 
> Sounds great. As I suggested in my other email, if it were up to me:
> 
> "crypto_compress_fail" and "stored_incompressible_pages"

Nice names, thank you!  I'll send the next version soon, following all your
suggestions[1] together.

[1] https://lore.kernel.org/CAF8kJuPCpnGAoz=1CBwe46ytpcR0ZUMAEWLFQce7eUWkb+0ERA@mail.gmail.com


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-18 18:18             ` SeongJae Park
@ 2025-08-18 20:33               ` Chris Li
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Li @ 2025-08-18 20:33 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Nhat Pham, Andrew Morton, Chengming Zhou, David Hildenbrand,
	Johannes Weiner, Yosry Ahmed, linux-kernel, linux-mm,
	Takero Funaki, Hugh Dickins

On Mon, Aug 18, 2025 at 11:18 AM SeongJae Park <sj@kernel.org> wrote:
>
> On Fri, 15 Aug 2025 19:23:10 -0700 Chris Li <chrisl@kernel.org> wrote:
>
> > On Fri, Aug 15, 2025 at 5:14 PM SeongJae Park <sj@kernel.org> wrote:
> > >
> > > Sounds good.  So in the next version (v4), I will drop compress_fail.  Instead,
> > > I will add two new counters, namely compress_engine_fail and the new atomic
> > > counter, say, stored_uncompressed_pages.  Please suggest better names or
> > > correct me if I'm missing some of your points.
> >
> > Sounds great. As I suggested in my other email, if it were up to me:
> >
> > "crypto_compress_fail" and "stored_incompressible_pages"
>
> Nice names, thank you!  I'll send the next version soon, following all your
> suggestions[1] together.

Thanks, looking forward to it.

Chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2025-08-18 20:33 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-12 17:00 [PATCH v2] mm/zswap: store <PAGE_SIZE compression failed page as-is SeongJae Park
2025-08-12 18:52 ` Nhat Pham
2025-08-13 17:07 ` Chris Li
2025-08-13 18:20   ` SeongJae Park
2025-08-15 22:28     ` Chris Li
2025-08-15 23:08       ` Nhat Pham
2025-08-16  0:14         ` SeongJae Park
2025-08-16  2:23           ` Chris Li
2025-08-18 18:18             ` SeongJae Park
2025-08-18 20:33               ` Chris Li
2025-08-16  0:30         ` Chris Li
2025-08-16  0:07       ` SeongJae Park
2025-08-16  2:20         ` Chris Li
2025-08-13 18:32   ` Nhat Pham
2025-08-15 22:34     ` Chris Li
2025-08-15 22:44       ` Nhat Pham
2025-08-13 18:58   ` Hugh Dickins
2025-08-15 22:38     ` Chris Li
2025-08-13 19:42   ` Shakeel Butt
2025-08-13 20:48     ` Johannes Weiner
2025-08-15 22:44       ` Chris Li
2025-08-15 22:42     ` Chris Li
2025-08-14 16:23 ` Johannes Weiner
2025-08-14 17:04   ` SeongJae Park
