[PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
@ 2025-08-19 19:34 SeongJae Park
  2025-08-20  1:13 ` Barry Song
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: SeongJae Park @ 2025-08-19 19:34 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, Johannes Weiner,
	Nhat Pham, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
	Takero Funaki, David Hildenbrand, Baoquan He, Barry Song,
	Chris Li, Kairui Song

When zswap writeback is enabled and it fails compressing a given page, the
page is swapped out to the backing swap device.  This behavior breaks the
zswap's writeback LRU order, and hence users can experience unexpected
latency spikes.  If the page is compressed without failure, but results in
a size of PAGE_SIZE, the LRU order is kept, but the decompression overhead
for loading the page back on the later access is unnecessary.

Keep the LRU order and optimize unnecessary decompression overheads in
those cases, by storing the original content as-is in zpool.  The length
field of zswap_entry will be set appropriately, as PAGE_SIZE.  Hence
whether it is saved as-is or not (whether decompression is unnecessary)
is identified by 'zswap_entry->length == PAGE_SIZE'.

Because the uncompressed data is saved in zpool, same to the compressed
ones, this introduces no change in terms of memory management including
movability and migratability of involved pages.

This change is also not increasing per zswap entry metadata overhead.  But
as the number of incompressible pages increases, total zswap metadata
overhead is proportionally increased.  The overhead should not be
problematic in usual cases, since the zswap metadata for single zswap
entry is much smaller than PAGE_SIZE, and in common zswap use cases there
should be a sufficient amount of compressible pages.  Also it can be
mitigated by the zswap writeback.

When the writeback is disabled, the additional overhead could be
problematic.  For the case, keep the current behavior that just returns
the failure and let swap_writeout() put the page back to the active LRU
list in the case.

Knowing how many compression failures from the crypto engine happened so
far, and how many incompressible pages are stored at the given moment
will be useful for future investigations.  Add two new debugfs files,
crypto_compress_fail and stored_incompressible_pages, for the two
counts, respectively.

Tests
-----

I tested this patch using a simple self-written microbenchmark that is
available at GitHub[1].  You can reproduce the test I did by executing
run_tests.sh of the repo on your system.  Note that the repo's
documentation is not good as of this writing, so you may need to read and
use the code.

The basic test scenario is simple.  Run a test program making artificial
accesses to memory having artificial content under memory.high-set memory
limit and measure how many accesses were made in a given time.

The test program repeatedly and randomly access three anonymous memory
regions.  The regions are all 500 MiB size, and be accessed in the same
probability.  Two of those are filled up with a simple content that can
easily be compressed, while the remaining one is filled up with a
content that s read from /dev/urandom, which is easy to fail at
compressing to a size smaller than PAGE_SIZE.  The program runs for two
minutes and prints out the number of accesses made every five seconds.

The test script runs the program under below four configurations.

- 0: memory.high is set to 2 GiB, zswap is disabled.
- 1-1: memory.high is set to 1350 MiB, zswap is disabled.
- 1-2: On 1-1, zswap is enabled without this patch.
- 1-3: On 1-2, this patch is applied.

For all zswap enabled cases, zswap shrinker is enabled.

Configuration '0' is for showing the original memory performance.
Configurations 1-1, 1-2 and 1-3 are for showing the performance of swap,
zswap, and this patch under a level of memory pressure (~10% of working
set).  Configurations 0 and 1-1 are not the main focus of this patch, but
I'm adding those since their results transparently show how far this
microbenchmark test is from the real world.

Because the per-5 seconds performance is not very reliable, I measured the
average of that for the last one minute period of the test program run.  I
also measured a few vmstat counters including zswpin, zswpout, zswpwb,
pswpin and pswpout during the test runs.

The measurement results are as below.  To save space, I show performance
numbers that are normalized to that of the configuration '0' (no memory
pressure).  The averaged accesses per 5 seconds of configuration '0' was
36493417.75.

    config            0       1-1     1-2      1-3
    perf_normalized   1.0000  0.0057  0.0235   0.0367
    perf_stdev_ratio  0.0582  0.0652  0.0167   0.0346
    zswpin            0       0       3548424  1999335
    zswpout           0       0       3588817  2361689
    zswpwb            0       0       10214    340270
    pswpin            0       485806  772038   340967
    pswpout           0       649543  144773   340270

'perf_normalized' is the performance metric, normalized to that of
configuration '0' (no pressure).  'perf_stdev_ratio' is the standard
deviation of the averaged data points, as a ratio to the averaged metric
value.  For example, configuration '0' performance was showing 5.8% stdev.
Configurations 1-1 and 1-3 were having about 6.5% and 6.1% stdev.  Also
the results were highly variable between multiple runs.  So this result is
not very stable but just showing ball park figures.  Please keep this in
your mind when reading these results.

Under about 10% of working set memory pressure, the performance was
dropped to about 0.57% of no-pressure one, when the normal swap is used
(1-1).  Note that ~10% working set pressure is already extreme, at least
on this test setup.  No one would desire system setups that can degrade
performance to 0.57% of the best case.

By turning zswap on (1-2), the performance was improved about 4x,
resulting in about 2.35% of no-pressure one.  Because of the
incompressible pages in the third memory region, a significant amount of
(non-zswap) swap I/O operations were made, though.

By applying this patch (1-3), about 56% performance improvement was made,
resulting in about 3.67% of no-pressure one.  Reduced pswpin of 1-3
compared to 1-2 let us see where this improvement came from.

Tests without Zswap Shrinker
----------------------------

Zswap shrinker is not enabled by default, so I ran the above test after
disabling zswap shrinker.  The results are as below.

    config            0       1-1     1-2      1-3
    perf_normalized   1.0000  0.0056  0.0185   0.0260
    perf_stdev_ratio  0.0467  0.0348  0.1832   0.3387
    zswpin            0       0       2506765  6049078
    zswpout           0       0       2534357  6115426
    zswpwb            0       0       0        0
    pswpin            0       463694  472978   0
    pswpout           0       686227  612149   0

The overall normalized performance of the different configs are very
similar to those of zswap shrinker enabled case.  By adding the memory
pressure, the performance was dropped to 0.56% of the original one.  By
enabling zswap without zswap shrinker, the performance was increased to
1.85% of the original one.  By applying this patch on it, the performance
was further increased to 2.6% of the original one.

Even though zswap shrinker is disabled, 1-2 shows high numbers of pswpin
and pswpout because the incompressible pages are directly swapped out.
In the case of 1-3, it shows zero pswpin and pswpout since it saves
incompressible pages in the memory, and shows higher performance.

Note that the performance of 1-2 and 1-3 varies pretty much.  Standard
deviation of the performance for 1-2 was about 18.32% of the
performance, while that for 1-3 was about 33.87%.  Because zswap
shrinker is disabled and the memory pressure is induced by memory.high,
the workload got penalty_jiffies sleeps, and this resulted in the
unstabilized performance.

Related Works
-------------

This is not an entirely new attempt.  Nhat Pham and Takero Funaki tried
very similar approaches in October 2023[2] and April 2024[3],
respectively.  The two approaches didn't get merged mainly due to the
metadata overhead concern.  I described why I think that shouldn't be a
problem for this change, which is automatically disabled when writeback is
disabled, at the beginning of this changelog.

This patch is not particularly different from those, and actually built
upon those.  I wrote this from scratch again, though.  Hence adding
Suggested-by tags for them.  Actually Nhat first suggested this to me
offlist.

Historically, writeback disabling was introduced partially as a way to
solve the LRU order issue.  Yosry pointed out[4] this is still suboptimal
when the incompressible pages are cold, since the incompressible pages
will continuously be tried to be zswapped out, and burn CPU cycles for
compression attempts that will anyway fail.  One imaginable solution for
the problem is reusing the swapped-out page and its struct page to store
in the zswap pool.  But that's out of the scope of this patch.

[1] https://github.com/sjp38/eval_zswap/blob/master/run.sh
[2] https://lore.kernel.org/20231017003519.1426574-3-nphamcs@gmail.com
[3] https://lore.kernel.org/20240706022523.1104080-6-flintglass@gmail.com
[4] https://lore.kernel.org/CAJD7tkZXS-UJVAFfvxJ0nNgTzWBiqepPYA4hEozi01_qktkitg@mail.gmail.com

Signed-off-by: SeongJae Park <sj@kernel.org>
Suggested-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Takero Funaki <flintglass@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
---
Changes from v3
(https://lore.kernel.org/20250815213020.89327-1-sj@kernel.org)
(discussions for changes from v3 were made on v2 thread)
- Drop the cumulated compression failure counter (compress_fail)
- Add a cumulated crypto-failure only counter (crypto_compress_fail)
- Add a not cumulated stored incompressible pages counter
  (stored_incompressible_pages)
- Cleanup compression failure handling code for readability

Changes from v2
(https://lore.kernel.org/20250812170046.56468-1-sj@kernel.org)
- No code change bug changelog updates
  - Add zswap shrinker disabled case test results.
  - Fix a typo on changelog.
  - Add a clarification of intention of 0 and 1-1 test configs.

Changes from v1
(https://lore.kernel.org/20250807181616.1895-1-sj@kernel.org)
- Optimize out memcpy() per incompressible page saving, using
  k[un]map_local().
- Add a debugfs file for counting compression failures.
- Use a clear form of a ternary operation.
- Add the history of writeback disabling with a link.
- Wordsmith comments.

Changes from RFC v2
(https://lore.kernel.org/20250805002954.1496-1-sj@kernel.org)
- Fix race conditions at decompressed pages identification.
- Remove the parameter and make saving as-is the default behavior.
- Open-code main changes.
- Clarify there is no memory management changes on the cover letter.
- Remove 20% pressure case from test results, since it is arguably too
  extreme and only adds confusion.
- Drop RFC tag.

Changes from RFC v1
(https://lore.kernel.org/20250730234059.4603-1-sj@kernel.org)
- Consider PAGE_SIZE compression successes as failures.
- Use zpool for storing incompressible pages.
- Test with zswap shrinker enabled.
- Wordsmith changelog and comments.
- Add documentation of save_incompressible_pages parameter.

 mm/zswap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 54 insertions(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 3c0fd8a13718..1f1ac043a2d9 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -42,8 +42,10 @@
 /*********************************
 * statistics
 **********************************/
-/* The number of compressed pages currently stored in zswap */
+/* The number of pages currently stored in zswap */
 atomic_long_t zswap_stored_pages = ATOMIC_LONG_INIT(0);
+/* The number of incompressible pages currently stored in zswap */
+atomic_long_t zswap_stored_incompressible_pages = ATOMIC_LONG_INIT(0);

 /*
  * The statistics below are not protected from concurrent access for
@@ -60,6 +62,8 @@ static u64 zswap_written_back_pages;
 static u64 zswap_reject_reclaim_fail;
 /* Store failed due to compression algorithm failure */
 static u64 zswap_reject_compress_fail;
+/* Compression failed by the crypto library */
+static u64 zswap_crypto_compress_fail;
 /* Compressed page was too big for the allocator to (optimally) store */
 static u64 zswap_reject_compress_poor;
 /* Load or writeback failed due to decompression failure */
@@ -811,6 +815,8 @@ static void zswap_entry_free(struct zswap_entry *entry)
 		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
 		obj_cgroup_put(entry->objcg);
 	}
+	if (entry->length == PAGE_SIZE)
+		atomic_long_dec(&zswap_stored_incompressible_pages);
 	zswap_entry_cache_free(entry);
 	atomic_long_dec(&zswap_stored_pages);
 }
@@ -976,8 +982,28 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 */
 	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
 	dlen = acomp_ctx->req->dlen;
-	if (comp_ret)
-		goto unlock;
+
+	/*
+	 * If a page cannot be compressed into a size smaller than PAGE_SIZE,
+	 * save the content as is without a compression, to keep the LRU order
+	 * of writebacks.  If writeback is disabled, reject the page since it
+	 * only adds metadata overhead.  swap_writeout() will put the page back
+	 * to the active LRU list in the case.
+	 */
+	if (comp_ret || !dlen) {
+		zswap_crypto_compress_fail++;
+		dlen = PAGE_SIZE;
+	}
+	if (dlen >= PAGE_SIZE) {
+		if (!mem_cgroup_zswap_writeback_enabled(
+					folio_memcg(page_folio(page)))) {
+			comp_ret = -EINVAL;
+			goto unlock;
+		}
+		comp_ret = 0;
+		dlen = PAGE_SIZE;
+		dst = kmap_local_page(page);
+	}

 	zpool = pool->zpool;
 	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
@@ -990,6 +1016,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	entry->length = dlen;

 unlock:
+	if (dst != acomp_ctx->buffer)
+		kunmap_local(dst);
 	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
 		zswap_reject_compress_poor++;
 	else if (comp_ret)
@@ -1012,6 +1040,14 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
 	obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);

+	/* zswap entries of length PAGE_SIZE are not compressed. */
+	if (entry->length == PAGE_SIZE) {
+		memcpy_to_folio(folio, 0, obj, entry->length);
+		zpool_obj_read_end(zpool, entry->handle, obj);
+		acomp_ctx_put_unlock(acomp_ctx);
+		return true;
+	}
+
 	/*
 	 * zpool_obj_read_begin() might return a kmap address of highmem when
 	 * acomp_ctx->buffer is not used.  However, sg_init_one() does not
@@ -1524,6 +1560,8 @@ static bool zswap_store_page(struct page *page,
 		obj_cgroup_charge_zswap(objcg, entry->length);
 	}
 	atomic_long_inc(&zswap_stored_pages);
+	if (entry->length == PAGE_SIZE)
+		atomic_long_inc(&zswap_stored_incompressible_pages);

 	/*
 	 * We finish initializing the entry while it's already in xarray.
@@ -1792,6 +1830,14 @@ static int debugfs_get_stored_pages(void *data, u64 *val)
 }
 DEFINE_DEBUGFS_ATTRIBUTE(stored_pages_fops, debugfs_get_stored_pages, NULL, "%llu\n");

+static int debugfs_get_stored_incompressible_pages(void *data, u64 *val)
+{
+	*val = atomic_long_read(&zswap_stored_incompressible_pages);
+	return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(stored_incompressible_pages_fops,
+		debugfs_get_stored_incompressible_pages, NULL, "%llu\n");
+
 static int zswap_debugfs_init(void)
 {
 	if (!debugfs_initialized())
@@ -1809,6 +1855,8 @@ static int zswap_debugfs_init(void)
 			   zswap_debugfs_root, &zswap_reject_kmemcache_fail);
 	debugfs_create_u64("reject_compress_fail", 0444,
 			   zswap_debugfs_root, &zswap_reject_compress_fail);
+	debugfs_create_u64("crypto_compress_fail", 0444,
+			   zswap_debugfs_root, &zswap_crypto_compress_fail);
 	debugfs_create_u64("reject_compress_poor", 0444,
 			   zswap_debugfs_root, &zswap_reject_compress_poor);
 	debugfs_create_u64("decompress_fail", 0444,
@@ -1819,6 +1867,9 @@ static int zswap_debugfs_init(void)
 			    zswap_debugfs_root, NULL, &total_size_fops);
 	debugfs_create_file("stored_pages", 0444,
 			    zswap_debugfs_root, NULL, &stored_pages_fops);
+	debugfs_create_file("stored_incompressible_pages", 0444,
+			    zswap_debugfs_root, NULL,
+			    &stored_incompressible_pages_fops);

 	return 0;
 }

base-commit: 803d261a97f9b4025282723d2930e58d49adcbf9
-- 
2.39.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-19 19:34 [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is SeongJae Park
@ 2025-08-20  1:13 ` Barry Song
  2025-08-20  1:20   ` Herbert Xu
  2025-08-20  4:49   ` Chris Li
  2025-08-21 16:17 ` SeongJae Park
  2025-08-21 21:21 ` Chris Li
  2 siblings, 2 replies; 24+ messages in thread
From: Barry Song @ 2025-08-20  1:13 UTC (permalink / raw)
  To: SeongJae Park, Herbert Xu
  Cc: Andrew Morton, Chengming Zhou, Johannes Weiner, Nhat Pham,
	Yosry Ahmed, kernel-team, linux-kernel, linux-mm, Takero Funaki,
	David Hildenbrand, Baoquan He, Chris Li, Kairui Song

[...]
> +
> +       /*
> +        * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> +        * save the content as is without a compression, to keep the LRU order
> +        * of writebacks.  If writeback is disabled, reject the page since it
> +        * only adds metadata overhead.  swap_writeout() will put the page back
> +        * to the active LRU list in the case.
> +        */
> +       if (comp_ret || !dlen) {
> +               zswap_crypto_compress_fail++;
> +               dlen = PAGE_SIZE;
> +       }

I’m not entirely sure about this. As long as we pass 2*PAGE_SIZE as
dst_buf, any error returned by crypto drivers should indicate a bug in
the driver that needs to be fixed.

There have also been attempts to unify crypto drivers so they consistently
return -ENOSPC when the destination buffer would overflow. If that has
been achieved, we might be able to reduce the buffer from 2*PAGE_SIZE to
just PAGE_SIZE in zswap. There was a long discussion on this here:
https://lore.kernel.org/linux-mm/Z7dnPh4tPxLO1UEo@google.com/

I’m not sure of the current status — do all crypto drivers now return a
consistent -ENOSPC when the compressed size exceeds PAGE_SIZE? From
what I recall during that discussion, most drivers already behaved this
way, but Sergey Senozhatsky pointed out one or two exceptions.

Let’s sync with Herbert: have we reached the stage where all drivers
reliably return -ENOSPC when dst_buf is PAGE_SIZE but the compressed
size would exceed it?

Thanks
Barry

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-20  1:13 ` Barry Song
@ 2025-08-20  1:20   ` Herbert Xu
  2025-08-20  1:34     ` Barry Song
  2025-08-20  4:55     ` Chris Li
  2025-08-20  4:49   ` Chris Li
  1 sibling, 2 replies; 24+ messages in thread
From: Herbert Xu @ 2025-08-20  1:20 UTC (permalink / raw)
  To: Barry Song
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, Johannes Weiner,
	Nhat Pham, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
	Takero Funaki, David Hildenbrand, Baoquan He, Chris Li,
	Kairui Song

On Wed, Aug 20, 2025 at 01:13:13PM +1200, Barry Song wrote:
>
> Let’s sync with Herbert: have we reached the stage where all drivers
> reliably return -ENOSPC when dst_buf is PAGE_SIZE but the compressed
> size would exceed it?

It doesn't matter.  Software compression should never fail, and
if it does fail due to a bug, that's not something that the user
of the API should worry about.

Hardware compression should always fall back to software compression
in case of failure.

IOW all compression errors should be treated as incompressible
data.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-20  1:20   ` Herbert Xu
@ 2025-08-20  1:34     ` Barry Song
  2025-08-20  1:37       ` Herbert Xu
  2025-08-20  5:07       ` Chris Li
  2025-08-20  4:55     ` Chris Li
  1 sibling, 2 replies; 24+ messages in thread
From: Barry Song @ 2025-08-20  1:34 UTC (permalink / raw)
  To: Herbert Xu
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, Johannes Weiner,
	Nhat Pham, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
	Takero Funaki, David Hildenbrand, Baoquan He, Chris Li,
	Kairui Song

On Wed, Aug 20, 2025 at 1:21 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Wed, Aug 20, 2025 at 01:13:13PM +1200, Barry Song wrote:
> >
> > Let’s sync with Herbert: have we reached the stage where all drivers
> > reliably return -ENOSPC when dst_buf is PAGE_SIZE but the compressed
> > size would exceed it?
>
> It doesn't matter.  Software compression should never fail, and
> if it does fail due to a bug, that's not something that the user
> of the API should worry about.
>
> Hardware compression should always fall back to software compression
> in case of failure.
>
> IOW all compression errors should be treated as incompressible
> data.

That is why I think it makes little sense to add a counter
zswap_crypto_compress_fail, since zswap already passes 2*PAGE_SIZE as
dst_buf, which is large enough to hold even incompressible data. Adding
a counter that will always stay at zero seems meaningless.

We might want to revisit the old thread to check whether it is now safe for us
to move to PAGE_SIZE in zswap now.

Thanks
Barry


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-20  1:34     ` Barry Song
@ 2025-08-20  1:37       ` Herbert Xu
  2025-08-20 17:32         ` Nhat Pham
  2025-08-20  5:07       ` Chris Li
  1 sibling, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2025-08-20  1:37 UTC (permalink / raw)
  To: Barry Song
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, Johannes Weiner,
	Nhat Pham, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
	Takero Funaki, David Hildenbrand, Baoquan He, Chris Li,
	Kairui Song

On Wed, Aug 20, 2025 at 01:34:01PM +1200, Barry Song wrote:
>
> We might want to revisit the old thread to check whether it is now safe for us
> to move to PAGE_SIZE in zswap now.

It's perfectly safe as LZO was fixed months ago.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-20  1:13 ` Barry Song
  2025-08-20  1:20   ` Herbert Xu
@ 2025-08-20  4:49   ` Chris Li
  1 sibling, 0 replies; 24+ messages in thread
From: Chris Li @ 2025-08-20  4:49 UTC (permalink / raw)
  To: Barry Song
  Cc: SeongJae Park, Herbert Xu, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, kernel-team,
	linux-kernel, linux-mm, Takero Funaki, David Hildenbrand,
	Baoquan He, Kairui Song

On Tue, Aug 19, 2025 at 6:13 PM Barry Song <21cnbao@gmail.com> wrote:
>
> [...]
> > +
> > +       /*
> > +        * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > +        * save the content as is without a compression, to keep the LRU order
> > +        * of writebacks.  If writeback is disabled, reject the page since it
> > +        * only adds metadata overhead.  swap_writeout() will put the page back
> > +        * to the active LRU list in the case.
> > +        */
> > +       if (comp_ret || !dlen) {
> > +               zswap_crypto_compress_fail++;
> > +               dlen = PAGE_SIZE;
> > +       }
>
> I’m not entirely sure about this. As long as we pass 2*PAGE_SIZE as
> dst_buf, any error returned by crypto drivers should indicate a bug in
> the driver that needs to be fixed.

That is what I have in mind for that counter, if that counter is not
zero it is something with the crypto has gone wrong. If we are sure
that it can never fail, maybe we shouldn't check the error return?

> There have also been attempts to unify crypto drivers so they consistently
> return -ENOSPC when the destination buffer would overflow. If that has
> been achieved, we might be able to reduce the buffer from 2*PAGE_SIZE to
> just PAGE_SIZE in zswap. There was a long discussion on this here:
> https://lore.kernel.org/linux-mm/Z7dnPh4tPxLO1UEo@google.com/
>
> I’m not sure of the current status — do all crypto drivers now return a
> consistent -ENOSPC when the compressed size exceeds PAGE_SIZE? From
> what I recall during that discussion, most drivers already behaved this
> way, but Sergey Senozhatsky pointed out one or two exceptions.
>
> Let’s sync with Herbert: have we reached the stage where all drivers
> reliably return -ENOSPC when dst_buf is PAGE_SIZE but the compressed
> size would exceed it?

I agree -ENOSPC should treat the compression ratio the same too low.
However, is the crypto library able to return any error other than
-ENOSPC? I am tempted to do something like BUG_ON() or WARN_ONCE() if
other errors we think are never possible. However even the BUG_ON and
WARN are discouraged in the kernel so we should handle the error. The
one error is we can log it and monitor it to make sure it never
happens.

If we are absolutely sure the other error should not happen. We should
remove the free form error from the interface to reflect that. Make
the function should just return bool success or failure as a buffer
too small.

Chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-20  1:20   ` Herbert Xu
  2025-08-20  1:34     ` Barry Song
@ 2025-08-20  4:55     ` Chris Li
  1 sibling, 0 replies; 24+ messages in thread
From: Chris Li @ 2025-08-20  4:55 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Barry Song, SeongJae Park, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, kernel-team,
	linux-kernel, linux-mm, Takero Funaki, David Hildenbrand,
	Baoquan He, Kairui Song

On Tue, Aug 19, 2025 at 6:21 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Wed, Aug 20, 2025 at 01:13:13PM +1200, Barry Song wrote:
> >
> > Let’s sync with Herbert: have we reached the stage where all drivers
> > reliably return -ENOSPC when dst_buf is PAGE_SIZE but the compressed
> > size would exceed it?
>
> It doesn't matter.  Software compression should never fail, and
> if it does fail due to a bug, that's not something that the user
> of the API should worry about.
>
> Hardware compression should always fall back to software compression
> in case of failure.
>
> IOW all compression errors should be treated as incompressible
> data.

In that case, you seem to suggest that the crypto library can NEVER
return an error other than the dst buffer is too small. In that case I
would suggest the crypto library remove that free form error return
code. Just return true for success compression or false on failure.

It is also strange to return a freeform error code and declare the
caller shouldn't check on it. Ideally the caller shouldn't warn or bug
on it, so it should handle the possible error other than -ENOSPC.

Chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-20  1:34     ` Barry Song
  2025-08-20  1:37       ` Herbert Xu
@ 2025-08-20  5:07       ` Chris Li
  2025-08-20  5:10         ` Herbert Xu
  1 sibling, 1 reply; 24+ messages in thread
From: Chris Li @ 2025-08-20  5:07 UTC (permalink / raw)
  To: Barry Song
  Cc: Herbert Xu, SeongJae Park, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, kernel-team,
	linux-kernel, linux-mm, Takero Funaki, David Hildenbrand,
	Baoquan He, Kairui Song

On Tue, Aug 19, 2025 at 6:34 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Aug 20, 2025 at 1:21 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > On Wed, Aug 20, 2025 at 01:13:13PM +1200, Barry Song wrote:
> > >
> > > Let’s sync with Herbert: have we reached the stage where all drivers
> > > reliably return -ENOSPC when dst_buf is PAGE_SIZE but the compressed
> > > size would exceed it?
> >
> > It doesn't matter.  Software compression should never fail, and
> > if it does fail due to a bug, that's not something that the user
> > of the API should worry about.
> >
> > Hardware compression should always fall back to software compression
> > in case of failure.
> >
> > IOW all compression errors should be treated as incompressible
> > data.
>
> That is why I think it makes little sense to add a counter
> zswap_crypto_compress_fail, since zswap already passes 2*PAGE_SIZE as
> dst_buf, which is large enough to hold even incompressible data. Adding
> a counter that will always stay at zero seems meaningless.

I was thinking the crypto might be able to return errors other than
the destination buffer being too small, here we give 2x the dst buffer
already, shouldn't see a buffer to small error. In that case, having a
counter mostly stay zero, only an exceptional error case is none zero
is fine.

But if that error case can never happen, we should convert the crypto
compression from returning error number to bool instead. It does not
make sense to return a free form error code and also demand the caller
does not check on it. If you insist that the caller should check
return value like a boolean, just let the API return a boolean.

Chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-20  5:07       ` Chris Li
@ 2025-08-20  5:10         ` Herbert Xu
  2025-08-20  5:20           ` Chris Li
  0 siblings, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2025-08-20  5:10 UTC (permalink / raw)
  To: Chris Li
  Cc: Barry Song, SeongJae Park, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, kernel-team,
	linux-kernel, linux-mm, Takero Funaki, David Hildenbrand,
	Baoquan He, Kairui Song

On Tue, Aug 19, 2025 at 10:07:57PM -0700, Chris Li wrote:
>
> But if that error case can never happen, we should convert the crypto
> compression from returning error number to bool instead. It does not
> make sense to return a free form error code and also demand the caller
> does not check on it. If you insist that the caller should check
> return value like a boolean, just let the API return a boolean.

No we need the error value for other things, such as asynchronous
return -EINPROGRESS.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-20  5:10         ` Herbert Xu
@ 2025-08-20  5:20           ` Chris Li
  2025-08-20  5:22             ` Herbert Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Li @ 2025-08-20  5:20 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Barry Song, SeongJae Park, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, kernel-team,
	linux-kernel, linux-mm, Takero Funaki, David Hildenbrand,
	Baoquan He, Kairui Song

On Tue, Aug 19, 2025 at 10:11 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Aug 19, 2025 at 10:07:57PM -0700, Chris Li wrote:
> >
> > But if that error case can never happen, we should convert the crypto
> > compression from returning error number to bool instead. It does not
> > make sense to return a free form error code and also demand the caller
> > does not check on it. If you insist that the caller should check
> > return value like a boolean, just let the API return a boolean.
>
> No we need the error value for other things, such as asynchronous
> return -EINPROGRESS.

In that case it is fair game for the caller to check for the error
other than -ENOSPC. If the return value is -EINPROGRESS during
synchronous crypto compression, it is a bug. We are not encouraged to
use WARN_ON, then having an error count to check that something truly
unlikely but possible to happen is better than folding it into
incompressible and pretend the error never happened. At least we can
know such an error case happened.

Chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-20  5:20           ` Chris Li
@ 2025-08-20  5:22             ` Herbert Xu
  2025-08-21 20:42               ` Chris Li
  0 siblings, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2025-08-20  5:22 UTC (permalink / raw)
  To: Chris Li
  Cc: Barry Song, SeongJae Park, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, kernel-team,
	linux-kernel, linux-mm, Takero Funaki, David Hildenbrand,
	Baoquan He, Kairui Song

On Tue, Aug 19, 2025 at 10:20:45PM -0700, Chris Li wrote:
>
> In that case it is fair game for the caller to check for the error
> other than -ENOSPC. If the return value is -EINPROGRESS during
> synchronous crypto compression, it is a bug. We are not encouraged to
> use WARN_ON, then having an error count to check that something truly
> unlikely but possible to happen is better than folding it into
> incompressible and pretend the error never happened. At least we can
> know such an error case happened.

Sure warning on -EINPROGRESS is fine.  But every other non-zero
return value should be treated as incompressible.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-20  1:37       ` Herbert Xu
@ 2025-08-20 17:32         ` Nhat Pham
  2025-08-21 10:27           ` Barry Song
  0 siblings, 1 reply; 24+ messages in thread
From: Nhat Pham @ 2025-08-20 17:32 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Barry Song, SeongJae Park, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
	Takero Funaki, David Hildenbrand, Baoquan He, Chris Li,
	Kairui Song

On Tue, Aug 19, 2025 at 6:37 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Wed, Aug 20, 2025 at 01:34:01PM +1200, Barry Song wrote:
> >
> > We might want to revisit the old thread to check whether it is now safe for us
> > to move to PAGE_SIZE in zswap now.
>
> It's perfectly safe as LZO was fixed months ago.

Perfect. Then I'll revive Chengming's patch (see [1]) to reduce the
compression buffer :)

[1]: https://lore.kernel.org/linux-mm/20231213-zswap-dstmem-v4-1-f228b059dd89@bytedance.com/

>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-20 17:32         ` Nhat Pham
@ 2025-08-21 10:27           ` Barry Song
  2025-08-21 16:42             ` SeongJae Park
  0 siblings, 1 reply; 24+ messages in thread
From: Barry Song @ 2025-08-21 10:27 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Herbert Xu, SeongJae Park, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
	Takero Funaki, David Hildenbrand, Baoquan He, Chris Li,
	Kairui Song

On Thu, Aug 21, 2025 at 1:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Aug 19, 2025 at 6:37 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > On Wed, Aug 20, 2025 at 01:34:01PM +1200, Barry Song wrote:
> > >
> > > We might want to revisit the old thread to check whether it is now safe for us
> > > to move to PAGE_SIZE in zswap now.
> >
> > It's perfectly safe as LZO was fixed months ago.
>
> Perfect. Then I'll revive Chengming's patch (see [1]) to reduce the
> compression buffer :)

Nice!

But perhaps we should wait until SeongJae sends a new version that
addresses the counter issue? Also, I noticed the following code may
have problems with the patch:

        if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
                zswap_reject_compress_poor++;

Can we still reach the code comp_ret == -ENOSPC since we already
handled comp_ret by ...

+       if (comp_ret || !dlen) {
+               zswap_crypto_compress_fail++;
+               dlen = PAGE_SIZE;
+       }
+       if (dlen >= PAGE_SIZE) {
+               if (!mem_cgroup_zswap_writeback_enabled(
+                                       folio_memcg(page_folio(page)))) {
+                       comp_ret = -EINVAL;
+                       goto unlock;
+               }
+               comp_ret = 0;
+               dlen = PAGE_SIZE;
+               dst = kmap_local_page(page);
+       }

Thanks
Barry


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-19 19:34 [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is SeongJae Park
  2025-08-20  1:13 ` Barry Song
@ 2025-08-21 16:17 ` SeongJae Park
  2025-08-21 21:21 ` Chris Li
  2 siblings, 0 replies; 24+ messages in thread
From: SeongJae Park @ 2025-08-21 16:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Chengming Zhou, Johannes Weiner, Nhat Pham,
	Yosry Ahmed, kernel-team, linux-kernel, linux-mm, Takero Funaki,
	David Hildenbrand, Baoquan He, Barry Song, Chris Li, Kairui Song,
	kernel test robot

On Tue, 19 Aug 2025 12:34:04 -0700 SeongJae Park <sj@kernel.org> wrote:

[...]
> Knowing how many compression failures from the crypto engine happened so
> far, and how many incompressible pages are stored at the given moment
> will be useful for future investigations.  Add two new debugfs files,
> crypto_compress_fail and stored_incompressible_pages, for the two
> counts, respectively.
[...]
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 3c0fd8a13718..1f1ac043a2d9 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -42,8 +42,10 @@
>  /*********************************
>  * statistics
>  **********************************/
> -/* The number of compressed pages currently stored in zswap */
> +/* The number of pages currently stored in zswap */
>  atomic_long_t zswap_stored_pages = ATOMIC_LONG_INIT(0);
> +/* The number of incompressible pages currently stored in zswap */
> +atomic_long_t zswap_stored_incompressible_pages = ATOMIC_LONG_INIT(0);

Kernel test robot reported a sparse warning for the above line.  Andrew, could
you please add below attached fixup?


Thanks,
SJ

[...]

==== Attachment 0 (0001-mm-zswap-mark-zswap_stored_incompressible_pages-as-s.patch) ====
From 1d41d75c47a6d3ef3cbf58636faf2b4dc04616ba Mon Sep 17 00:00:00 2001
From: SeongJae Park <sj@kernel.org>
Date: Thu, 21 Aug 2025 09:10:57 -0700
Subject: [PATCH] mm/zswap: mark zswap_stored_incompressible_pages as static

Only zswap.c uses zswap_stored_incompressible_pages, but it is not
marked as static.  This incurs a sparse warning that reported by kernel
teset robot.  Mark it as a static variable to eliminate the warning.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202508211706.DnJPQQMn-lkp@intel.com/
Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/zswap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 5dd282c5b626..ee443b317ac7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -45,7 +45,7 @@
 /* The number of pages currently stored in zswap */
 atomic_long_t zswap_stored_pages = ATOMIC_LONG_INIT(0);
 /* The number of incompressible pages currently stored in zswap */
-atomic_long_t zswap_stored_incompressible_pages = ATOMIC_LONG_INIT(0);
+static atomic_long_t zswap_stored_incompressible_pages = ATOMIC_LONG_INIT(0);
 
 /*
  * The statistics below are not protected from concurrent access for
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-21 10:27           ` Barry Song
@ 2025-08-21 16:42             ` SeongJae Park
  2025-08-21 20:49               ` Chris Li
  0 siblings, 1 reply; 24+ messages in thread
From: SeongJae Park @ 2025-08-21 16:42 UTC (permalink / raw)
  To: Barry Song
  Cc: SeongJae Park, Nhat Pham, Herbert Xu, Andrew Morton,
	Chengming Zhou, Johannes Weiner, Yosry Ahmed, kernel-team,
	linux-kernel, linux-mm, Takero Funaki, David Hildenbrand,
	Baoquan He, Chris Li, Kairui Song

On Thu, 21 Aug 2025 18:27:52 +0800 Barry Song <21cnbao@gmail.com> wrote:

> On Thu, Aug 21, 2025 at 1:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Aug 19, 2025 at 6:37 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > >
> > > On Wed, Aug 20, 2025 at 01:34:01PM +1200, Barry Song wrote:
> > > >
> > > > We might want to revisit the old thread to check whether it is now safe for us
> > > > to move to PAGE_SIZE in zswap now.
> > >
> > > It's perfectly safe as LZO was fixed months ago.
> >
> > Perfect. Then I'll revive Chengming's patch (see [1]) to reduce the
> > compression buffer :)
> 
> Nice!
> 
> But perhaps we should wait until SeongJae sends a new version that
> addresses the counter issue?

Is there a reason to wait?  I was thinking those are orthogonal problems?

Anyway, for the counter (crypto_compress_fail), I don't have a strong opinion.
To my understanding, the options for path forward are...

1. remove it,
2. keep it as is, or
3. keep it, but account only -EINPROGRESS[1]

If I'm not missing other options, I'm tempted to the first option (remove it)
since it doesn't change any existing things, and we can revisit later.

Please let me know if I'm missing another options or you have other preferrence
on what option to take.

> Also, I noticed the following code may
> have problems with the patch:
> 
>         if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
>                 zswap_reject_compress_poor++;
> 
> Can we still reach the code comp_ret == -ENOSPC since we already
> handled comp_ret by ...
> 
> +       if (comp_ret || !dlen) {
> +               zswap_crypto_compress_fail++;
> +               dlen = PAGE_SIZE;
> +       }
> +       if (dlen >= PAGE_SIZE) {
> +               if (!mem_cgroup_zswap_writeback_enabled(
> +                                       folio_memcg(page_folio(page)))) {
> +                       comp_ret = -EINVAL;
> +                       goto unlock;
> +               }
> +               comp_ret = 0;
> +               dlen = PAGE_SIZE;
> +               dst = kmap_local_page(page);
> +       }

Nice catch, thank you Berry.  Actually there was a check for keeping the code
reachable, but I forgot keeping it while updating the version from v3[2] to
this one.  I will make below change on the next version to restore it.

    @@ -992,7 +992,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
            if (dlen >= PAGE_SIZE) {
                    if (!mem_cgroup_zswap_writeback_enabled(
                                            folio_memcg(page_folio(page)))) {
    -                       comp_ret = -EINVAL;
    +                       comp_ret = comp_ret ? comp_ret : -EINVAL;
                            goto unlock;
                    }
                    comp_ret = 0;

[1] https://lore.kernel.org/CAF8kJuPPsLzWu8+xm2A_UPHMBhb7OTjJNErM1Kp3hPmvHXNDUQ@mail.gmail.com
[2] https://lore.kernel.org/20250815213020.89327-1-sj@kernel.org


Thanks,
SJ


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-20  5:22             ` Herbert Xu
@ 2025-08-21 20:42               ` Chris Li
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Li @ 2025-08-21 20:42 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Barry Song, SeongJae Park, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Nhat Pham, Yosry Ahmed, kernel-team,
	linux-kernel, linux-mm, Takero Funaki, David Hildenbrand,
	Baoquan He, Kairui Song

On Tue, Aug 19, 2025 at 10:23 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Aug 19, 2025 at 10:20:45PM -0700, Chris Li wrote:
> >
> > In that case it is fair game for the caller to check for the error
> > other than -ENOSPC. If the return value is -EINPROGRESS during
> > synchronous crypto compression, it is a bug. We are not encouraged to
> > use WARN_ON, then having an error count to check that something truly
> > unlikely but possible to happen is better than folding it into
> > incompressible and pretend the error never happened. At least we can
> > know such an error case happened.
>
> Sure warning on -EINPROGRESS is fine.  But every other non-zero
> return value should be treated as incompressible.

Ack.

Thanks for the feedback.

Chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-21 16:42             ` SeongJae Park
@ 2025-08-21 20:49               ` Chris Li
  2025-08-21 21:36                 ` SeongJae Park
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Li @ 2025-08-21 20:49 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Barry Song, Nhat Pham, Herbert Xu, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
	Takero Funaki, David Hildenbrand, Baoquan He, Kairui Song

On Thu, Aug 21, 2025 at 9:43 AM SeongJae Park <sj@kernel.org> wrote:
>
> On Thu, 21 Aug 2025 18:27:52 +0800 Barry Song <21cnbao@gmail.com> wrote:
>
> > On Thu, Aug 21, 2025 at 1:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > On Tue, Aug 19, 2025 at 6:37 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > > >
> > > > On Wed, Aug 20, 2025 at 01:34:01PM +1200, Barry Song wrote:
> > > > >
> > > > > We might want to revisit the old thread to check whether it is now safe for us
> > > > > to move to PAGE_SIZE in zswap now.
> > > >
> > > > It's perfectly safe as LZO was fixed months ago.
> > >
> > > Perfect. Then I'll revive Chengming's patch (see [1]) to reduce the
> > > compression buffer :)
> >
> > Nice!
> >
> > But perhaps we should wait until SeongJae sends a new version that
> > addresses the counter issue?
>
> Is there a reason to wait?  I was thinking those are orthogonal problems?
>
> Anyway, for the counter (crypto_compress_fail), I don't have a strong opinion.
> To my understanding, the options for path forward are...
>
> 1. remove it,
> 2. keep it as is, or
> 3. keep it, but account only -EINPROGRESS[1]
>
> If I'm not missing other options, I'm tempted to the first option (remove it)
> since it doesn't change any existing things, and we can revisit later.

I am fine with 1) removing it. Maybe add a log once print error on the
error code if -EINPROGRESS, just to know such extreme error has been
triggered.
>
> Please let me know if I'm missing other options or if you have other preferences.

I just don't want to hide the extreme error case but I am also fine
with just removing it. It is your call.

Chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-19 19:34 [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is SeongJae Park
  2025-08-20  1:13 ` Barry Song
  2025-08-21 16:17 ` SeongJae Park
@ 2025-08-21 21:21 ` Chris Li
  2025-08-21 21:41   ` SeongJae Park
  2 siblings, 1 reply; 24+ messages in thread
From: Chris Li @ 2025-08-21 21:21 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Chengming Zhou, Johannes Weiner, Nhat Pham,
	Yosry Ahmed, kernel-team, linux-kernel, linux-mm, Takero Funaki,
	David Hildenbrand, Baoquan He, Barry Song, Kairui Song

On Tue, Aug 19, 2025 at 12:34 PM SeongJae Park <sj@kernel.org> wrote:
>
> When zswap writeback is enabled and it fails compressing a given page, the
> page is swapped out to the backing swap device.  This behavior breaks the
> zswap's writeback LRU order, and hence users can experience unexpected
> latency spikes.  If the page is compressed without failure, but results in
> a size of PAGE_SIZE, the LRU order is kept, but the decompression overhead
> for loading the page back on the later access is unnecessary.
>
> Keep the LRU order and optimize unnecessary decompression overheads in
> those cases, by storing the original content as-is in zpool.  The length
> field of zswap_entry will be set appropriately, as PAGE_SIZE.  Hence
> whether it is saved as-is or not (whether decompression is unnecessary)
> is identified by 'zswap_entry->length == PAGE_SIZE'.
>
> Because the uncompressed data is saved in zpool, same to the compressed
> ones, this introduces no change in terms of memory management including
> movability and migratability of involved pages.
>
> This change is also not increasing per zswap entry metadata overhead.  But
> as the number of incompressible pages increases, total zswap metadata
> overhead is proportionally increased.  The overhead should not be
> problematic in usual cases, since the zswap metadata for single zswap
> entry is much smaller than PAGE_SIZE, and in common zswap use cases there
> should be a sufficient amount of compressible pages.  Also it can be
> mitigated by the zswap writeback.
>
> When the writeback is disabled, the additional overhead could be
> problematic.  For the case, keep the current behavior that just returns
> the failure and let swap_writeout() put the page back to the active LRU
> list in the case.
>
> Knowing how many compression failures from the crypto engine happened so
> far, and how many incompressible pages are stored at the given moment
> will be useful for future investigations.  Add two new debugfs files,
> crypto_compress_fail and stored_incompressible_pages, for the two
> counts, respectively.
>
> Tests
> -----
>
> I tested this patch using a simple self-written microbenchmark that is
> available at GitHub[1].  You can reproduce the test I did by executing
> run_tests.sh of the repo on your system.  Note that the repo's
> documentation is not good as of this writing, so you may need to read and
> use the code.
>
> The basic test scenario is simple.  Run a test program making artificial
> accesses to memory having artificial content under memory.high-set memory
> limit and measure how many accesses were made in a given time.
>
> The test program repeatedly and randomly access three anonymous memory
> regions.  The regions are all 500 MiB size, and be accessed in the same
> probability.  Two of those are filled up with a simple content that can
> easily be compressed, while the remaining one is filled up with a
> content that s read from /dev/urandom, which is easy to fail at
> compressing to a size smaller than PAGE_SIZE.  The program runs for two
> minutes and prints out the number of accesses made every five seconds.
>
> The test script runs the program under below four configurations.
>
> - 0: memory.high is set to 2 GiB, zswap is disabled.
> - 1-1: memory.high is set to 1350 MiB, zswap is disabled.
> - 1-2: On 1-1, zswap is enabled without this patch.
> - 1-3: On 1-2, this patch is applied.
>
> For all zswap enabled cases, zswap shrinker is enabled.
>
> Configuration '0' is for showing the original memory performance.
> Configurations 1-1, 1-2 and 1-3 are for showing the performance of swap,
> zswap, and this patch under a level of memory pressure (~10% of working
> set).  Configurations 0 and 1-1 are not the main focus of this patch, but
> I'm adding those since their results transparently show how far this
> microbenchmark test is from the real world.
>
> Because the per-5 seconds performance is not very reliable, I measured the
> average of that for the last one minute period of the test program run.  I
> also measured a few vmstat counters including zswpin, zswpout, zswpwb,
> pswpin and pswpout during the test runs.
>
> The measurement results are as below.  To save space, I show performance
> numbers that are normalized to that of the configuration '0' (no memory
> pressure).  The averaged accesses per 5 seconds of configuration '0' was
> 36493417.75.
>
>     config            0       1-1     1-2      1-3
>     perf_normalized   1.0000  0.0057  0.0235   0.0367
>     perf_stdev_ratio  0.0582  0.0652  0.0167   0.0346
>     zswpin            0       0       3548424  1999335
>     zswpout           0       0       3588817  2361689
>     zswpwb            0       0       10214    340270
>     pswpin            0       485806  772038   340967
>     pswpout           0       649543  144773   340270
>
> 'perf_normalized' is the performance metric, normalized to that of
> configuration '0' (no pressure).  'perf_stdev_ratio' is the standard
> deviation of the averaged data points, as a ratio to the averaged metric
> value.  For example, configuration '0' performance was showing 5.8% stdev.
> Configurations 1-1 and 1-3 were having about 6.5% and 6.1% stdev.  Also
> the results were highly variable between multiple runs.  So this result is
> not very stable but just showing ball park figures.  Please keep this in
> your mind when reading these results.
>
> Under about 10% of working set memory pressure, the performance was
> dropped to about 0.57% of no-pressure one, when the normal swap is used
> (1-1).  Note that ~10% working set pressure is already extreme, at least
> on this test setup.  No one would desire system setups that can degrade
> performance to 0.57% of the best case.
>
> By turning zswap on (1-2), the performance was improved about 4x,
> resulting in about 2.35% of no-pressure one.  Because of the
> incompressible pages in the third memory region, a significant amount of
> (non-zswap) swap I/O operations were made, though.
>
> By applying this patch (1-3), about 56% performance improvement was made,
> resulting in about 3.67% of no-pressure one.  Reduced pswpin of 1-3
> compared to 1-2 let us see where this improvement came from.
>
> Tests without Zswap Shrinker
> ----------------------------
>
> Zswap shrinker is not enabled by default, so I ran the above test after
> disabling zswap shrinker.  The results are as below.
>
>     config            0       1-1     1-2      1-3
>     perf_normalized   1.0000  0.0056  0.0185   0.0260
>     perf_stdev_ratio  0.0467  0.0348  0.1832   0.3387
>     zswpin            0       0       2506765  6049078
>     zswpout           0       0       2534357  6115426
>     zswpwb            0       0       0        0
>     pswpin            0       463694  472978   0
>     pswpout           0       686227  612149   0
>
> The overall normalized performance of the different configs are very
> similar to those of zswap shrinker enabled case.  By adding the memory
> pressure, the performance was dropped to 0.56% of the original one.  By
> enabling zswap without zswap shrinker, the performance was increased to
> 1.85% of the original one.  By applying this patch on it, the performance
> was further increased to 2.6% of the original one.
>
> Even though zswap shrinker is disabled, 1-2 shows high numbers of pswpin
> and pswpout because the incompressible pages are directly swapped out.
> In the case of 1-3, it shows zero pswpin and pswpout since it saves
> incompressible pages in the memory, and shows higher performance.
>
> Note that the performance of 1-2 and 1-3 varies pretty much.  Standard
> deviation of the performance for 1-2 was about 18.32% of the
> performance, while that for 1-3 was about 33.87%.  Because zswap
> shrinker is disabled and the memory pressure is induced by memory.high,
> the workload got penalty_jiffies sleeps, and this resulted in the
> unstabilized performance.
>
> Related Works
> -------------
>
> This is not an entirely new attempt.  Nhat Pham and Takero Funaki tried
> very similar approaches in October 2023[2] and April 2024[3],
> respectively.  The two approaches didn't get merged mainly due to the
> metadata overhead concern.  I described why I think that shouldn't be a
> problem for this change, which is automatically disabled when writeback is
> disabled, at the beginning of this changelog.
>
> This patch is not particularly different from those, and actually built
> upon those.  I wrote this from scratch again, though.  Hence adding
> Suggested-by tags for them.  Actually Nhat first suggested this to me
> offlist.
>
> Historically, writeback disabling was introduced partially as a way to
> solve the LRU order issue.  Yosry pointed out[4] this is still suboptimal
> when the incompressible pages are cold, since the incompressible pages
> will continuously be tried to be zswapped out, and burn CPU cycles for
> compression attempts that will anyway fail.  One imaginable solution for
> the problem is reusing the swapped-out page and its struct page to store
> in the zswap pool.  But that's out of the scope of this patch.
>
> [1] https://github.com/sjp38/eval_zswap/blob/master/run.sh
> [2] https://lore.kernel.org/20231017003519.1426574-3-nphamcs@gmail.com
> [3] https://lore.kernel.org/20240706022523.1104080-6-flintglass@gmail.com
> [4] https://lore.kernel.org/CAJD7tkZXS-UJVAFfvxJ0nNgTzWBiqepPYA4hEozi01_qktkitg@mail.gmail.com
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Suggested-by: Takero Funaki <flintglass@gmail.com>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> Cc: Chengming Zhou <chengming.zhou@linux.dev>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Chris Li <chrisl@kernel.org>
> Cc: Kairui Song <kasong@tencent.com>
> ---
> Changes from v3
> (https://lore.kernel.org/20250815213020.89327-1-sj@kernel.org)
> (discussions for changes from v3 were made on v2 thread)
> - Drop the cumulated compression failure counter (compress_fail)
> - Add a cumulated crypto-failure only counter (crypto_compress_fail)
> - Add a not cumulated stored incompressible pages counter
>   (stored_incompressible_pages)
> - Cleanup compression failure handling code for readability
>
> Changes from v2
> (https://lore.kernel.org/20250812170046.56468-1-sj@kernel.org)
> - No code change bug changelog updates
>   - Add zswap shrinker disabled case test results.
>   - Fix a typo on changelog.
>   - Add a clarification of intention of 0 and 1-1 test configs.
>
> Changes from v1
> (https://lore.kernel.org/20250807181616.1895-1-sj@kernel.org)
> - Optimize out memcpy() per incompressible page saving, using
>   k[un]map_local().
> - Add a debugfs file for counting compression failures.
> - Use a clear form of a ternary operation.
> - Add the history of writeback disabling with a link.
> - Wordsmith comments.
>
> Changes from RFC v2
> (https://lore.kernel.org/20250805002954.1496-1-sj@kernel.org)
> - Fix race conditions at decompressed pages identification.
> - Remove the parameter and make saving as-is the default behavior.
> - Open-code main changes.
> - Clarify there is no memory management changes on the cover letter.
> - Remove 20% pressure case from test results, since it is arguably too
>   extreme and only adds confusion.
> - Drop RFC tag.
>
> Changes from RFC v1
> (https://lore.kernel.org/20250730234059.4603-1-sj@kernel.org)
> - Consider PAGE_SIZE compression successes as failures.
> - Use zpool for storing incompressible pages.
> - Test with zswap shrinker enabled.
> - Wordsmith changelog and comments.
> - Add documentation of save_incompressible_pages parameter.
>
>  mm/zswap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 54 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 3c0fd8a13718..1f1ac043a2d9 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -42,8 +42,10 @@
>  /*********************************
>  * statistics
>  **********************************/
> -/* The number of compressed pages currently stored in zswap */
> +/* The number of pages currently stored in zswap */
>  atomic_long_t zswap_stored_pages = ATOMIC_LONG_INIT(0);
> +/* The number of incompressible pages currently stored in zswap */
> +atomic_long_t zswap_stored_incompressible_pages = ATOMIC_LONG_INIT(0);
>
>  /*
>   * The statistics below are not protected from concurrent access for
> @@ -60,6 +62,8 @@ static u64 zswap_written_back_pages;
>  static u64 zswap_reject_reclaim_fail;
>  /* Store failed due to compression algorithm failure */
>  static u64 zswap_reject_compress_fail;
> +/* Compression failed by the crypto library */
> +static u64 zswap_crypto_compress_fail;
>  /* Compressed page was too big for the allocator to (optimally) store */
>  static u64 zswap_reject_compress_poor;
>  /* Load or writeback failed due to decompression failure */
> @@ -811,6 +815,8 @@ static void zswap_entry_free(struct zswap_entry *entry)
>                 obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
>                 obj_cgroup_put(entry->objcg);
>         }
> +       if (entry->length == PAGE_SIZE)
> +               atomic_long_dec(&zswap_stored_incompressible_pages);
>         zswap_entry_cache_free(entry);
>         atomic_long_dec(&zswap_stored_pages);
>  }
> @@ -976,8 +982,28 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>          */
>         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
>         dlen = acomp_ctx->req->dlen;
> -       if (comp_ret)
> -               goto unlock;
> +
> +       /*
> +        * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> +        * save the content as is without a compression, to keep the LRU order
> +        * of writebacks.  If writeback is disabled, reject the page since it
> +        * only adds metadata overhead.  swap_writeout() will put the page back
> +        * to the active LRU list in the case.
> +        */
> +       if (comp_ret || !dlen) {

Looks good other than the feedback provided by Barry as well. Need to
handle the -ENOSPC.
Other errors will depend on your plan to drop this counter or not. I
will wait for your next version.


> +               zswap_crypto_compress_fail++;
> +               dlen = PAGE_SIZE;
> +       }
> +       if (dlen >= PAGE_SIZE) {
> +               if (!mem_cgroup_zswap_writeback_enabled(
> +                                       folio_memcg(page_folio(page)))) {
> +                       comp_ret = -EINVAL;
> +                       goto unlock;
I saw you mention this in the cover letter, so just to confirm we are
on the same page. Current patch still has the issue [4] of write back
disabled cases, the incompressible page will stay in the page LRU and
possibly attempt to reclaim over and over again, right?

Chris

> +               }
> +               comp_ret = 0;
> +               dlen = PAGE_SIZE;
> +               dst = kmap_local_page(page);
> +       }
>
>         zpool = pool->zpool;
>         gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> @@ -990,6 +1016,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>         entry->length = dlen;
>
>  unlock:
> +       if (dst != acomp_ctx->buffer)
> +               kunmap_local(dst);
>         if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
>                 zswap_reject_compress_poor++;
>         else if (comp_ret)
> @@ -1012,6 +1040,14 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>         acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
>         obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
>
> +       /* zswap entries of length PAGE_SIZE are not compressed. */
> +       if (entry->length == PAGE_SIZE) {
> +               memcpy_to_folio(folio, 0, obj, entry->length);
> +               zpool_obj_read_end(zpool, entry->handle, obj);
> +               acomp_ctx_put_unlock(acomp_ctx);
> +               return true;
> +       }
> +
>         /*
>          * zpool_obj_read_begin() might return a kmap address of highmem when
>          * acomp_ctx->buffer is not used.  However, sg_init_one() does not
> @@ -1524,6 +1560,8 @@ static bool zswap_store_page(struct page *page,
>                 obj_cgroup_charge_zswap(objcg, entry->length);
>         }
>         atomic_long_inc(&zswap_stored_pages);
> +       if (entry->length == PAGE_SIZE)
> +               atomic_long_inc(&zswap_stored_incompressible_pages);
>
>         /*
>          * We finish initializing the entry while it's already in xarray.
> @@ -1792,6 +1830,14 @@ static int debugfs_get_stored_pages(void *data, u64 *val)
>  }
>  DEFINE_DEBUGFS_ATTRIBUTE(stored_pages_fops, debugfs_get_stored_pages, NULL, "%llu\n");
>
> +static int debugfs_get_stored_incompressible_pages(void *data, u64 *val)
> +{
> +       *val = atomic_long_read(&zswap_stored_incompressible_pages);
> +       return 0;
> +}
> +DEFINE_DEBUGFS_ATTRIBUTE(stored_incompressible_pages_fops,
> +               debugfs_get_stored_incompressible_pages, NULL, "%llu\n");
> +
>  static int zswap_debugfs_init(void)
>  {
>         if (!debugfs_initialized())
> @@ -1809,6 +1855,8 @@ static int zswap_debugfs_init(void)
>                            zswap_debugfs_root, &zswap_reject_kmemcache_fail);
>         debugfs_create_u64("reject_compress_fail", 0444,
>                            zswap_debugfs_root, &zswap_reject_compress_fail);
> +       debugfs_create_u64("crypto_compress_fail", 0444,
> +                          zswap_debugfs_root, &zswap_crypto_compress_fail);
>         debugfs_create_u64("reject_compress_poor", 0444,
>                            zswap_debugfs_root, &zswap_reject_compress_poor);
>         debugfs_create_u64("decompress_fail", 0444,
> @@ -1819,6 +1867,9 @@ static int zswap_debugfs_init(void)
>                             zswap_debugfs_root, NULL, &total_size_fops);
>         debugfs_create_file("stored_pages", 0444,
>                             zswap_debugfs_root, NULL, &stored_pages_fops);
> +       debugfs_create_file("stored_incompressible_pages", 0444,
> +                           zswap_debugfs_root, NULL,
> +                           &stored_incompressible_pages_fops);
>
>         return 0;
>  }
>
> base-commit: 803d261a97f9b4025282723d2930e58d49adcbf9
> --
> 2.39.5
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-21 20:49               ` Chris Li
@ 2025-08-21 21:36                 ` SeongJae Park
  2025-08-22  0:48                   ` Barry Song
  0 siblings, 1 reply; 24+ messages in thread
From: SeongJae Park @ 2025-08-21 21:36 UTC (permalink / raw)
  To: Chris Li
  Cc: SeongJae Park, Barry Song, Nhat Pham, Herbert Xu, Andrew Morton,
	Chengming Zhou, Johannes Weiner, Yosry Ahmed, kernel-team,
	linux-kernel, linux-mm, Takero Funaki, David Hildenbrand,
	Baoquan He, Kairui Song

On Thu, 21 Aug 2025 13:49:31 -0700 Chris Li <chrisl@kernel.org> wrote:

> On Thu, Aug 21, 2025 at 9:43 AM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Thu, 21 Aug 2025 18:27:52 +0800 Barry Song <21cnbao@gmail.com> wrote:
> >
> > > On Thu, Aug 21, 2025 at 1:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > >
> > > > On Tue, Aug 19, 2025 at 6:37 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > > > >
> > > > > On Wed, Aug 20, 2025 at 01:34:01PM +1200, Barry Song wrote:
> > > > > >
> > > > > > We might want to revisit the old thread to check whether it is now safe for us
> > > > > > to move to PAGE_SIZE in zswap now.
> > > > >
> > > > > It's perfectly safe as LZO was fixed months ago.
> > > >
> > > > Perfect. Then I'll revive Chengming's patch (see [1]) to reduce the
> > > > compression buffer :)
> > >
> > > Nice!
> > >
> > > But perhaps we should wait until SeongJae sends a new version that
> > > addresses the counter issue?
> >
> > Is there a reason to wait?  I was thinking those are orthogonal problems?
> >
> > Anyway, for the counter (crypto_compress_fail), I don't have a strong opinion.
> > To my understanding, the options for path forward are...
> >
> > 1. remove it,
> > 2. keep it as is, or
> > 3. keep it, but account only -EINPROGRESS[1]
> >
> > If I'm not missing other options, I'm tempted to the first option (remove it)
> > since it doesn't change any existing things, and we can revisit later.
> 
> I am fine with 1) removing it. Maybe add a log once print error on the
> error code if -EINPROGRESS, just to know such extreme error has been
> triggered.
> >
> > Please let me know if I'm missing other options or if you have other preferences.
> 
> I just don't want to hide the extreme error case but I am also fine
> with just removing it. It is your call.

Thank you for your opinion, Chris!  Unless others have different opinions, I
will only remove the counter (option 1), since it is simplest and we can
consider adding another counter or error logs on top of it.


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-21 21:21 ` Chris Li
@ 2025-08-21 21:41   ` SeongJae Park
  0 siblings, 0 replies; 24+ messages in thread
From: SeongJae Park @ 2025-08-21 21:41 UTC (permalink / raw)
  To: Chris Li
  Cc: SeongJae Park, Andrew Morton, Chengming Zhou, Johannes Weiner,
	Nhat Pham, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
	Takero Funaki, David Hildenbrand, Baoquan He, Barry Song,
	Kairui Song

On Thu, 21 Aug 2025 14:21:11 -0700 Chris Li <chrisl@kernel.org> wrote:

> On Tue, Aug 19, 2025 at 12:34 PM SeongJae Park <sj@kernel.org> wrote:
[...]
> > Historically, writeback disabling was introduced partially as a way to
> > solve the LRU order issue.  Yosry pointed out[4] this is still suboptimal
> > when the incompressible pages are cold, since the incompressible pages
> > will continuously be tried to be zswapped out, and burn CPU cycles for
> > compression attempts that will anyway fail.  One imaginable solution for
> > the problem is reusing the swapped-out page and its struct page to store
> > in the zswap pool.  But that's out of the scope of this patch.
> >
> > [1] https://github.com/sjp38/eval_zswap/blob/master/run.sh
> > [2] https://lore.kernel.org/20231017003519.1426574-3-nphamcs@gmail.com
> > [3] https://lore.kernel.org/20240706022523.1104080-6-flintglass@gmail.com
> > [4] https://lore.kernel.org/CAJD7tkZXS-UJVAFfvxJ0nNgTzWBiqepPYA4hEozi01_qktkitg@mail.gmail.com
[...]
> > +       /*
> > +        * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > +        * save the content as is without a compression, to keep the LRU order
> > +        * of writebacks.  If writeback is disabled, reject the page since it
> > +        * only adds metadata overhead.  swap_writeout() will put the page back
> > +        * to the active LRU list in the case.
> > +        */
> > +       if (comp_ret || !dlen) {
> 
> Looks good other than the feedback provided by Barry as well. Need to
> handle the -ENOSPC.
> Other errors will depend on your plan to drop this counter or not. I
> will wait for your next version.

Ack.  The next version will keep -ENOSPC comp_ret value so that
reject_compress_poor counter is not broken, like I replied to Barry.

> 
> 
> > +               zswap_crypto_compress_fail++;
> > +               dlen = PAGE_SIZE;
> > +       }
> > +       if (dlen >= PAGE_SIZE) {
> > +               if (!mem_cgroup_zswap_writeback_enabled(
> > +                                       folio_memcg(page_folio(page)))) {
> > +                       comp_ret = -EINVAL;
> > +                       goto unlock;
> I saw you mention this in the cover letter, so just to confirm we are
> on the same page. Current patch still has the issue [4] of write back
> disabled cases, the incompressible page will stay in the page LRU and
> possibly attempt to reclaim over and over again, right?

You are correct.  This patch is not making a change for writeback disabled
cases.


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-21 21:36                 ` SeongJae Park
@ 2025-08-22  0:48                   ` Barry Song
  2025-08-22  5:54                     ` Herbert Xu
  2025-08-22 16:44                     ` SeongJae Park
  0 siblings, 2 replies; 24+ messages in thread
From: Barry Song @ 2025-08-22  0:48 UTC (permalink / raw)
  To: SeongJae Park, Herbert Xu
  Cc: Chris Li, Nhat Pham, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
	Takero Funaki, David Hildenbrand, Baoquan He, Kairui Song

> > >
> > > 1. remove it,
> > > 2. keep it as is, or
> > > 3. keep it, but account only -EINPROGRESS[1]
> > >
> > > If I'm not missing other options, I'm tempted to the first option (remove it)
> > > since it doesn't change any existing things, and we can revisit later.
> >
> > I am fine with 1) removing it. Maybe add a log once print error on the
> > error code if -EINPROGRESS, just to know such extreme error has been
> > triggered.
> > >
> > > Please let me know if I'm missing other options or if you have other preferences.
> >
> > I just don't want to hide the extreme error case but I am also fine
> > with just removing it. It is your call.
>
> Thank you for your opinion, Chris!  Unless others have different opinions, I
> will only remove the counter (option 1), since it is simplest and we can
> consider adding another counter or error logs on top of it.

Yes, that seems the best option—to remove the counter for now.

And I still need Herbert’s help to understand why crypto_wait_req() might return
-EINPROGRESS, given the code below:

static inline int crypto_wait_req(int err, struct crypto_wait *wait)
{
    switch (err) {
    case -EINPROGRESS:
    case -EBUSY:
        wait_for_completion(&wait->completion);
        reinit_completion(&wait->completion);
        err = wait->err;
        break;
    }

    return err;
}

void crypto_req_done(void *data, int err)
{
    struct crypto_wait *wait = data;

    if (err == -EINPROGRESS)
        return;

    wait->err = err;
    complete(&wait->completion);
}

Is it even possible for crypto_wait_req() to return -EINPROGRESS, since
crypto_req_done() will not call complete(&wait->completion) in that case at
all?

Thanks
Barry


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-22  0:48                   ` Barry Song
@ 2025-08-22  5:54                     ` Herbert Xu
  2025-08-22  7:30                       ` Barry Song
  2025-08-22 16:44                     ` SeongJae Park
  1 sibling, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2025-08-22  5:54 UTC (permalink / raw)
  To: Barry Song
  Cc: SeongJae Park, Chris Li, Nhat Pham, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
	Takero Funaki, David Hildenbrand, Baoquan He, Kairui Song

On Fri, Aug 22, 2025 at 12:48:27PM +1200, Barry Song wrote:
>
> Is it even possible for crypto_wait_req() to return -EINPROGRESS, since
> crypto_req_done() will not call complete(&wait->completion) in that case at
> all?

Of course crypto_wait_req cannot return -EINPROGRESS.  However,
I was responding to a request to make the crypto_acomp_compress
call return a boolean instead of an error code.  That is not
possible because it has to be able to return -EINPROGRESS.

Nor is it possible to change crypto_wait_req to be boolean since
other Crypto API operations (including decompression) can indeed
fail.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-22  5:54                     ` Herbert Xu
@ 2025-08-22  7:30                       ` Barry Song
  0 siblings, 0 replies; 24+ messages in thread
From: Barry Song @ 2025-08-22  7:30 UTC (permalink / raw)
  To: Herbert Xu
  Cc: SeongJae Park, Chris Li, Nhat Pham, Andrew Morton, Chengming Zhou,
	Johannes Weiner, Yosry Ahmed, kernel-team, linux-kernel, linux-mm,
	Takero Funaki, David Hildenbrand, Baoquan He, Kairui Song

On Fri, Aug 22, 2025 at 1:54 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Fri, Aug 22, 2025 at 12:48:27PM +1200, Barry Song wrote:
> >
> > Is it even possible for crypto_wait_req() to return -EINPROGRESS, since
> > crypto_req_done() will not call complete(&wait->completion) in that case at
> > all?
>
> Of course crypto_wait_req cannot return -EINPROGRESS.  However,
> I was responding to a request to make the crypto_acomp_compress
> call return a boolean instead of an error code.  That is not
> possible because it has to be able to return -EINPROGRESS.
>
> Nor is it possible to change crypto_wait_req to be boolean since
> other Crypto API operations (including decompression) can indeed
> fail.
>

Ok. zswap only cares about the return value of crypto_wait_req(),
not crypto_acomp_compress(). So it’s clear now—any error return
means the page is incompressible.

Thanks
Barry


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is
  2025-08-22  0:48                   ` Barry Song
  2025-08-22  5:54                     ` Herbert Xu
@ 2025-08-22 16:44                     ` SeongJae Park
  1 sibling, 0 replies; 24+ messages in thread
From: SeongJae Park @ 2025-08-22 16:44 UTC (permalink / raw)
  To: Barry Song
  Cc: SeongJae Park, Herbert Xu, Chris Li, Nhat Pham, Andrew Morton,
	Chengming Zhou, Johannes Weiner, Yosry Ahmed, kernel-team,
	linux-kernel, linux-mm, Takero Funaki, David Hildenbrand,
	Baoquan He, Kairui Song

On Fri, 22 Aug 2025 12:48:27 +1200 Barry Song <21cnbao@gmail.com> wrote:

> > > >
> > > > 1. remove it,
> > > > 2. keep it as is, or
> > > > 3. keep it, but account only -EINPROGRESS[1]
> > > >
> > > > If I'm not missing other options, I'm tempted to the first option (remove it)
> > > > since it doesn't change any existing things, and we can revisit later.
> > >
> > > I am fine with 1) removing it. Maybe add a log once print error on the
> > > error code if -EINPROGRESS, just to know such extreme error has been
> > > triggered.
> > > >
> > > > Please let me know if I'm missing other options or if you have other preferences.
> > >
> > > I just don't want to hide the extreme error case but I am also fine
> > > with just removing it. It is your call.
> >
> > Thank you for your opinion, Chris!  Unless others have different opinions, I
> > will only remove the counter (option 1), since it is simplest and we can
> > consider adding another counter or error logs on top of it.
> 
> Yes, that seems the best option—to remove the counter for now.

Thank you for making your opinion clear, Barry.  I will post v5 soon.


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2025-08-22 16:44 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-19 19:34 [PATCH v4] mm/zswap: store <PAGE_SIZE compression failed page as-is SeongJae Park
2025-08-20  1:13 ` Barry Song
2025-08-20  1:20   ` Herbert Xu
2025-08-20  1:34     ` Barry Song
2025-08-20  1:37       ` Herbert Xu
2025-08-20 17:32         ` Nhat Pham
2025-08-21 10:27           ` Barry Song
2025-08-21 16:42             ` SeongJae Park
2025-08-21 20:49               ` Chris Li
2025-08-21 21:36                 ` SeongJae Park
2025-08-22  0:48                   ` Barry Song
2025-08-22  5:54                     ` Herbert Xu
2025-08-22  7:30                       ` Barry Song
2025-08-22 16:44                     ` SeongJae Park
2025-08-20  5:07       ` Chris Li
2025-08-20  5:10         ` Herbert Xu
2025-08-20  5:20           ` Chris Li
2025-08-20  5:22             ` Herbert Xu
2025-08-21 20:42               ` Chris Li
2025-08-20  4:55     ` Chris Li
2025-08-20  4:49   ` Chris Li
2025-08-21 16:17 ` SeongJae Park
2025-08-21 21:21 ` Chris Li
2025-08-21 21:41   ` SeongJae Park

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).