[PATCH 0/2] minimize swapping on zswap store failure

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] minimize swapping on zswap store failure
@ 2023-10-17  0:35 Nhat Pham
  2023-10-17  0:35 ` [PATCH 1/2] swap: allows swap bypassing " Nhat Pham
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Nhat Pham @ 2023-10-17  0:35 UTC (permalink / raw)
  To: akpm
  Cc: hannes, cerasuolodomenico, yosryahmed, sjenning, ddstreet,
	vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky, rppt,
	linux-mm, kernel-team, linux-kernel, david

Currently, when a zswap store attempt fails, the page is immediately
swapped out. This could happen for a variety of reasons. For instance,
the compression algorithm could fail (such as when the data is not
compressible), or the backend allocator might not be able to find a
suitable slot for the compressed page. If these pages are needed
later on, users will incur IOs from swapins.

This issue prevents the adoption of zswap for potential users who
cannot tolerate the latency associated with swapping. In many cases,
these IOs are avoidable if we just keep in memory the pages that zswap
fail to store.

This patch series add two new features for zswap that will alleviate
the risk of swapping:

a) When a store attempt fail, keep the page untouched in memory
instead of swapping it out.

b) If the store attempt fails at the compression step, allow the page
to be stored in its uncompressed form in the zswap pool. This maintains
the LRU ordering of pages, which will be helpful for accurate
memory reclaim (zswap writeback in particular).

These features could be enabled independently via two new zswap module
parameters.

Nhat Pham (2):
  swap: allows swap bypassing on zswap store failure
  zswap: store uncompressed pages when compression algorithm fails

 Documentation/admin-guide/mm/zswap.rst | 16 +++++++
 include/linux/zswap.h                  |  9 ++++
 mm/page_io.c                           |  6 +++
 mm/shmem.c                             |  8 +++-
 mm/zswap.c                             | 64 +++++++++++++++++++++++---
 5 files changed, 95 insertions(+), 8 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] swap: allows swap bypassing on zswap store failure
  2023-10-17  0:35 [PATCH 0/2] minimize swapping on zswap store failure Nhat Pham
@ 2023-10-17  0:35 ` Nhat Pham
  2023-10-17  0:35 ` [PATCH 2/2] zswap: store uncompressed pages when compression algorithm fails Nhat Pham
  2023-10-17  0:57 ` [PATCH 0/2] minimize swapping on zswap store failure Yosry Ahmed
  2 siblings, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2023-10-17  0:35 UTC (permalink / raw)
  To: akpm
  Cc: hannes, cerasuolodomenico, yosryahmed, sjenning, ddstreet,
	vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky, rppt,
	linux-mm, kernel-team, linux-kernel, david

During our experiment with zswap, we sometimes observe swap IOs even
though the zswap pool limit is never hit. This is due to occasional
zswap store failures, in which case the page will be written straight to
the swapping device. This prevents many users who cannot tolerate
swapping from adopting zswap to save memory where possible.

This patch adds the option to bypass swap when a zswap store fails. The
feature is disabled by default (to preserve the existing behavior), and
can be enabled via a new zswap module parameter. When enabled, swapping
is all but prevented (except for when the zswap pool is full and have to
write pages back to swap).

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 Documentation/admin-guide/mm/zswap.rst | 9 +++++++++
 include/linux/zswap.h                  | 9 +++++++++
 mm/page_io.c                           | 6 ++++++
 mm/shmem.c                             | 8 ++++++--
 mm/zswap.c                             | 4 ++++
 5 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index ae8597a67804..82fa8148a65a 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -153,6 +153,15 @@ attribute, e. g.::
 
 Setting this parameter to 100 will disable the hysteresis.
 
+Many users cannot tolerate the swapping that comes with zswap store failures,
+due to the IO incurred if these pages are needed later on. In this scenario,
+users can bypass swapping when zswap store attempts fail (and keep the pages
+in memory) as follows:
+
+	echo Y > /sys/module/zswap/parameters/bypass_swap_when_store_fail_enabled
+
+Note that swapping due to writeback is not disabled with this option.
+
 When there is a sizable amount of cold memory residing in the zswap pool, it
 can be advantageous to proactively write these cold pages to swap and reclaim
 the memory for other use cases. By default, the zswap shrinker is disabled.
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 04f80b64a09b..c67da5223894 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -7,6 +7,7 @@
 
 extern u64 zswap_pool_total_size;
 extern atomic_t zswap_stored_pages;
+extern bool zswap_bypass_swap_when_store_fail_enabled;
 
 #ifdef CONFIG_ZSWAP
 
@@ -18,6 +19,10 @@ void zswap_swapoff(int type);
 bool zswap_remove_swpentry_from_lru(swp_entry_t swpentry);
 void zswap_insert_swpentry_into_lru(swp_entry_t swpentry);
 
+static inline bool zswap_bypass_swap_when_store_fail(void)
+{
+	return zswap_bypass_swap_when_store_fail_enabled;
+}
 #else
 
 static inline bool zswap_store(struct folio *folio)
@@ -41,6 +46,10 @@ static inline bool zswap_remove_swpentry_from_lru(swp_entry_t swpentry)
 
 static inline void zswap_insert_swpentry_into_lru(swp_entry_t swpentry) {}
 
+static inline bool zswap_bypass_swap_when_store_fail(void)
+{
+	return false;
+}
 #endif
 
 #endif /* _LINUX_ZSWAP_H */
diff --git a/mm/page_io.c b/mm/page_io.c
index cb559ae324c6..482f56d27bcd 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -201,6 +201,12 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		folio_end_writeback(folio);
 		return 0;
 	}
+
+	if (zswap_bypass_swap_when_store_fail()) {
+		folio_mark_dirty(folio);
+		return AOP_WRITEPAGE_ACTIVATE;
+	}
+
 	__swap_writepage(&folio->page, wbc);
 	return 0;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 6503910b0f54..8614d7fbe18c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1514,8 +1514,12 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 
 		mutex_unlock(&shmem_swaplist_mutex);
 		BUG_ON(folio_mapped(folio));
-		swap_writepage(&folio->page, wbc);
-		return 0;
+		/*
+		 * Seeing AOP_WRITEPAGE_ACTIVATE here indicates swapping is disabled on
+		 * zswap store failure. Note that in that case the folio is already
+		 * re-marked dirty by swap_writepage()
+		 */
+		return swap_writepage(&folio->page, wbc);
 	}
 
 	mutex_unlock(&shmem_swaplist_mutex);
diff --git a/mm/zswap.c b/mm/zswap.c
index d545516fb5de..db2674548670 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -138,6 +138,10 @@ static bool zswap_non_same_filled_pages_enabled = true;
 module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
 		   bool, 0644);
 
+bool zswap_bypass_swap_when_store_fail_enabled;
+module_param_named(bypass_swap_when_store_fail_enabled,
+		   zswap_bypass_swap_when_store_fail_enabled, bool, 0644);
+
 static bool zswap_exclusive_loads_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON);
 module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] zswap: store uncompressed pages when compression algorithm fails
  2023-10-17  0:35 [PATCH 0/2] minimize swapping on zswap store failure Nhat Pham
  2023-10-17  0:35 ` [PATCH 1/2] swap: allows swap bypassing " Nhat Pham
@ 2023-10-17  0:35 ` Nhat Pham
  2023-10-17  0:57 ` [PATCH 0/2] minimize swapping on zswap store failure Yosry Ahmed
  2 siblings, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2023-10-17  0:35 UTC (permalink / raw)
  To: akpm
  Cc: hannes, cerasuolodomenico, yosryahmed, sjenning, ddstreet,
	vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky, rppt,
	linux-mm, kernel-team, linux-kernel, david

In a zswap store, the compression step could fail, leaving the pages
out of the zswap pool. This could happen, for instance, when the pages
hold random data (such as /dev/urandom). This behavior can potentially
create unnecessary LRU inversion: warm but incompressible pages are
swapped out, whereas cold compressible pages reside in the zswap pool.
This could incur IO later when the warmer pages are indeed required
again.

We can instead just store the original page in the zswap pool. Storing
PAGE_SIZE objects is already supported by zsmalloc, zswap's default
backend allocator. When these pages are needed (either for a zswap load
or for writeback), we can simply copy the original content into the
desired buffer.

This patch allows zswap to store uncompressed pages. The feature can be
enabled at runtime via a new parameter. In addition, we add a counter to
track the number of times the compression step fails for diagnostics.

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 Documentation/admin-guide/mm/zswap.rst |  7 +++
 mm/zswap.c                             | 60 +++++++++++++++++++++++---
 2 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 82fa8148a65a..ac1018cb5373 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -162,6 +162,13 @@ in memory) as follows:
 
 Note that swapping due to writeback is not disabled with this option.
 
+Compression could also fail during a zswap store attempt. In many cases, it is
+nevertheless beneficial to store the page in the zswap pool (in its
+uncompressed form) for the sake of maintaining the LRU ordering, which will
+be useful for reclaim. This can be enabled as follows:
+
+	echo Y > /sys/module/zswap/parameters/uncompressed_pages_enabled
+
 When there is a sizable amount of cold memory residing in the zswap pool, it
 can be advantageous to proactively write these cold pages to swap and reclaim
 the memory for other use cases. By default, the zswap shrinker is disabled.
diff --git a/mm/zswap.c b/mm/zswap.c
index db2674548670..096266d35602 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -62,6 +62,8 @@ static u64 zswap_pool_limit_hit;
 static u64 zswap_written_back_pages;
 /* Store failed due to a reclaim failure after pool limit was reached */
 static u64 zswap_reject_reclaim_fail;
+/* Compression step fails during store attempt */
+static u64 zswap_reject_compress_fail;
 /* Compressed page was too big for the allocator to (optimally) store */
 static u64 zswap_reject_compress_poor;
 /* Store failed because underlying allocator could not get memory */
@@ -142,6 +144,10 @@ bool zswap_bypass_swap_when_store_fail_enabled;
 module_param_named(bypass_swap_when_store_fail_enabled,
 		   zswap_bypass_swap_when_store_fail_enabled, bool, 0644);
 
+static bool zswap_uncompressed_pages_enabled;
+module_param_named(uncompressed_pages_enabled,
+		   zswap_uncompressed_pages_enabled, bool, 0644);
+
 static bool zswap_exclusive_loads_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON);
 module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
@@ -224,6 +230,7 @@ struct zswap_pool {
  * value - value of the same-value filled pages which have same content
  * objcg - the obj_cgroup that the compressed memory is charged to
  * lru - handle to the pool's lru used to evict pages.
+ * is_uncompressed - whether the page is stored in its uncompressed form.
  */
 struct zswap_entry {
 	struct rb_node rbnode;
@@ -238,6 +245,7 @@ struct zswap_entry {
 	struct obj_cgroup *objcg;
 	int nid;
 	struct list_head lru;
+	bool is_uncompressed;
 };
 
 /*
@@ -1307,7 +1315,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct crypto_acomp_ctx *acomp_ctx;
 	struct zpool *pool = zswap_find_zpool(entry);
 	bool page_was_allocated;
-	u8 *src, *tmp = NULL;
+	u8 *src, *dst, *tmp = NULL;
 	unsigned int dlen;
 	int ret;
 	struct writeback_control wbc = {
@@ -1356,6 +1364,19 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	dlen = PAGE_SIZE;
 
 	src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
+	if (entry->is_uncompressed) {
+		if (!zpool_can_sleep_mapped(pool))
+			kfree(tmp);
+
+		dst = kmap_local_page(page);
+		copy_page(dst, src);
+		kunmap_local(dst);
+		zpool_unmap_handle(pool, entry->handle);
+
+		ret = 0;
+		goto success;
+	}
+
 	if (!zpool_can_sleep_mapped(pool)) {
 		memcpy(tmp, src, entry->length);
 		src = tmp;
@@ -1376,6 +1397,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	else
 		zpool_unmap_handle(pool, entry->handle);
 
+success:
 	BUG_ON(ret);
 	BUG_ON(dlen != PAGE_SIZE);
 
@@ -1454,7 +1476,7 @@ bool zswap_store(struct folio *folio)
 	char *buf;
 	u8 *src, *dst;
 	gfp_t gfp;
-	int ret;
+	int ret, compress_ret;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1569,11 +1591,15 @@ bool zswap_store(struct folio *folio)
 	 * but in different threads running on different cpu, we have different
 	 * acomp instance, so multiple threads can do (de)compression in parallel.
 	 */
-	ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
+	compress_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
+						&acomp_ctx->wait);
 	dlen = acomp_ctx->req->dlen;
 
-	if (ret)
-		goto put_dstmem;
+	if (compress_ret) {
+		zswap_reject_compress_fail++;
+		if (!zswap_uncompressed_pages_enabled)
+			goto put_dstmem;
+	}
 
 	/* store */
 	zpool = zswap_find_zpool(entry);
@@ -1590,7 +1616,15 @@ bool zswap_store(struct folio *folio)
 		goto put_dstmem;
 	}
 	buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
-	memcpy(buf, dst, dlen);
+
+	/* Compressor failed. Store the page in its uncompressed form. */
+	if (compress_ret) {
+		dlen = PAGE_SIZE;
+		src = kmap_local_page(page);
+		copy_page(buf, src);
+		kunmap_local(src);
+	} else
+		memcpy(buf, dst, dlen);
 	zpool_unmap_handle(zpool, handle);
 	mutex_unlock(acomp_ctx->mutex);
 
@@ -1598,6 +1632,7 @@ bool zswap_store(struct folio *folio)
 	entry->swpentry = swp_entry(type, offset);
 	entry->handle = handle;
 	entry->length = dlen;
+	entry->is_uncompressed = compress_ret;
 
 insert_entry:
 	entry->objcg = objcg;
@@ -1687,6 +1722,17 @@ bool zswap_load(struct folio *folio)
 	}
 
 	zpool = zswap_find_zpool(entry);
+
+	if (entry->is_uncompressed) {
+		src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
+		dst = kmap_local_page(page);
+		copy_page(dst, src);
+		kunmap_local(dst);
+		zpool_unmap_handle(zpool, entry->handle);
+		ret = true;
+		goto stats;
+	}
+
 	if (!zpool_can_sleep_mapped(zpool)) {
 		tmp = kmalloc(entry->length, GFP_KERNEL);
 		if (!tmp) {
@@ -1855,6 +1901,8 @@ static int zswap_debugfs_init(void)
 			   zswap_debugfs_root, &zswap_reject_alloc_fail);
 	debugfs_create_u64("reject_kmemcache_fail", 0444,
 			   zswap_debugfs_root, &zswap_reject_kmemcache_fail);
+	debugfs_create_u64("reject_compress_fail", 0444,
+			   zswap_debugfs_root, &zswap_reject_compress_fail);
 	debugfs_create_u64("reject_compress_poor", 0444,
 			   zswap_debugfs_root, &zswap_reject_compress_poor);
 	debugfs_create_u64("written_back_pages", 0444,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2023-10-17  0:35 [PATCH 0/2] minimize swapping on zswap store failure Nhat Pham
  2023-10-17  0:35 ` [PATCH 1/2] swap: allows swap bypassing " Nhat Pham
  2023-10-17  0:35 ` [PATCH 2/2] zswap: store uncompressed pages when compression algorithm fails Nhat Pham
@ 2023-10-17  0:57 ` Yosry Ahmed
  2023-10-17  4:47   ` Johannes Weiner
                     ` (2 more replies)
  2 siblings, 3 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-10-17  0:57 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, hannes, cerasuolodomenico, sjenning, ddstreet, vitaly.wool,
	hughd, corbet, konrad.wilk, senozhatsky, rppt, linux-mm,
	kernel-team, linux-kernel, david

On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Currently, when a zswap store attempt fails, the page is immediately
> swapped out. This could happen for a variety of reasons. For instance,
> the compression algorithm could fail (such as when the data is not
> compressible), or the backend allocator might not be able to find a
> suitable slot for the compressed page. If these pages are needed
> later on, users will incur IOs from swapins.
>
> This issue prevents the adoption of zswap for potential users who
> cannot tolerate the latency associated with swapping. In many cases,
> these IOs are avoidable if we just keep in memory the pages that zswap
> fail to store.
>
> This patch series add two new features for zswap that will alleviate
> the risk of swapping:
>
> a) When a store attempt fail, keep the page untouched in memory
> instead of swapping it out.

What about writeback when the zswap limit is hit? I understand the
problem, but I am wondering if this is the correct way of fixing it.
We really need to make zswap work without a backing swapfile, which I
think is the correct way to fix all these problems. I was working on
that, but unfortunately I had to pivot to something else before I had
something that was working.

At Google, we have "ghost" swapfiles that we use just to use zswap
without a swapfile. They are sparse files, and we have internal kernel
patches to flag them and never try to actually write to them.

I am not sure how many bandaids we can afford before doing the right
thing. I understand it's a much larger surgery, perhaps there is a way
to get a short-term fix that is also a step towards the final state we
want to reach instead?

>
> b) If the store attempt fails at the compression step, allow the page
> to be stored in its uncompressed form in the zswap pool. This maintains
> the LRU ordering of pages, which will be helpful for accurate
> memory reclaim (zswap writeback in particular).

This is dangerous. Johannes and I discussed this before. This means
that reclaim can end up allocating more memory instead of freeing.
Allocations made in the reclaim path are made under the assumption
that we will eventually free memory. In this case, we won't. In the
worst case scenario, reclaim can leave the system/memcg in a worse
state than before it started.

Perhaps there is a way we can do this without allocating a zswap entry?

I thought before about having a special list_head that allows us to
use the lower bits of the pointers as markers, similar to the xarray.
The markers can be used to place different objects on the same list.
We can have a list that is a mixture of struct page and struct
zswap_entry. I never pursued this idea, and I am sure someone will
scream at me for suggesting it. Maybe there is a less convoluted way
to keep the LRU ordering intact without allocating memory on the
reclaim path.

>
> These features could be enabled independently via two new zswap module
> parameters.
>
> Nhat Pham (2):
>   swap: allows swap bypassing on zswap store failure
>   zswap: store uncompressed pages when compression algorithm fails
>
>  Documentation/admin-guide/mm/zswap.rst | 16 +++++++
>  include/linux/zswap.h                  |  9 ++++
>  mm/page_io.c                           |  6 +++
>  mm/shmem.c                             |  8 +++-
>  mm/zswap.c                             | 64 +++++++++++++++++++++++---
>  5 files changed, 95 insertions(+), 8 deletions(-)
>
> --
> 2.34.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2023-10-17  0:57 ` [PATCH 0/2] minimize swapping on zswap store failure Yosry Ahmed
@ 2023-10-17  4:47   ` Johannes Weiner
  2023-10-17  5:33     ` Yosry Ahmed
  2023-10-17 19:24     ` Nhat Pham
  2023-10-17 19:03   ` Nhat Pham
  2025-04-02 20:06   ` Joshua Hahn
  2 siblings, 2 replies; 20+ messages in thread
From: Johannes Weiner @ 2023-10-17  4:47 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Nhat Pham, akpm, cerasuolodomenico, sjenning, ddstreet,
	vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky, rppt,
	linux-mm, kernel-team, linux-kernel, david

On Mon, Oct 16, 2023 at 05:57:31PM -0700, Yosry Ahmed wrote:
> On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Currently, when a zswap store attempt fails, the page is immediately
> > swapped out. This could happen for a variety of reasons. For instance,
> > the compression algorithm could fail (such as when the data is not
> > compressible), or the backend allocator might not be able to find a
> > suitable slot for the compressed page. If these pages are needed
> > later on, users will incur IOs from swapins.
> >
> > This issue prevents the adoption of zswap for potential users who
> > cannot tolerate the latency associated with swapping. In many cases,
> > these IOs are avoidable if we just keep in memory the pages that zswap
> > fail to store.
> >
> > This patch series add two new features for zswap that will alleviate
> > the risk of swapping:
> >
> > a) When a store attempt fail, keep the page untouched in memory
> > instead of swapping it out.
> 
> What about writeback when the zswap limit is hit? I understand the
> problem, but I am wondering if this is the correct way of fixing it.
> We really need to make zswap work without a backing swapfile, which I
> think is the correct way to fix all these problems. I was working on
> that, but unfortunately I had to pivot to something else before I had
> something that was working.
>
> At Google, we have "ghost" swapfiles that we use just to use zswap
> without a swapfile. They are sparse files, and we have internal kernel
> patches to flag them and never try to actually write to them.
> 
> I am not sure how many bandaids we can afford before doing the right
> thing. I understand it's a much larger surgery, perhaps there is a way
> to get a short-term fix that is also a step towards the final state we
> want to reach instead?

I agree it should also stop writeback due to the limit.

Note that a knob like this is still useful even when zswap space is
decoupled from disk swap slots. We still are using disk swap broadly
in the fleet as well, so a static ghost file setup wouldn't be a good
solution for us. Swapoff with common swapfile sizes is often not an
option during runtime, due to how slow it is, and the destabilizing
effect it can have on the system unless that's basically completely
idle. As such, we expect to continue deploying swap files for physical
use, and switch the zswap-is-terminal knob depending on the workload.

The other aspect to this is that workloads that do not want the
swapout/swapin overhead for themselves are usually co-located with
system management software, and/or can share the host with less
latency sensitive workloads, that should continue to use disk swap.

Due to the latter case, I wonder if a global knob is actually
enough. More likely we'd need per-cgroup control over this.

[ It's at this point where the historic coupling of zswap to disk
  space is especially unfortunate. Because of it, zswap usage counts
  toward the memory.swap allowance. If these were separate, we could
  have easily set memory.zswap.max=max, memory.swap.max=0 to achieve
  the desired effect.

  Alas, that ship has sailed. Even after a decoupling down the line it
  would be difficult to change established memory.swap semantics. ]

So I obviously agree that we still need to invest in decoupling zswap
space from physical disk slots. It's insanely wasteful, especially
with larger memory capacities. But while it would be a fantastic
optimization, I don't see how it would be an automatic solution to the
problem that inspired this proposal.

We still need some way to reasonably express desired workload policy
in a setup that supports multiple, simultaneous modes of operation.

> > b) If the store attempt fails at the compression step, allow the page
> > to be stored in its uncompressed form in the zswap pool. This maintains
> > the LRU ordering of pages, which will be helpful for accurate
> > memory reclaim (zswap writeback in particular).
> 
> This is dangerous. Johannes and I discussed this before. This means
> that reclaim can end up allocating more memory instead of freeing.
> Allocations made in the reclaim path are made under the assumption
> that we will eventually free memory. In this case, we won't. In the
> worst case scenario, reclaim can leave the system/memcg in a worse
> state than before it started.

Yeah, this is a concern. It's not such a big deal if it's only a few
pages, and we're still shrinking the footprint on aggregate. But it's
conceivable this can happen systematically with some datasets, in
which case reclaim will drive up the memory consumption and cause
OOMs, or potentially deplete the reserves with PF_MEMALLOC and cause
memory deadlocks.

This isn't something we can reasonably accept as worst-case behavior.

> Perhaps there is a way we can do this without allocating a zswap entry?
>
> I thought before about having a special list_head that allows us to
> use the lower bits of the pointers as markers, similar to the xarray.
> The markers can be used to place different objects on the same list.
> We can have a list that is a mixture of struct page and struct
> zswap_entry. I never pursued this idea, and I am sure someone will
> scream at me for suggesting it. Maybe there is a less convoluted way
> to keep the LRU ordering intact without allocating memory on the
> reclaim path.

That should work. Once zswap has exclusive control over the page, it
is free to muck with its lru linkage. A lower bit tag on the next or
prev pointer should suffice to distinguish between struct page and
struct zswap_entry when pulling stuff from the list.

We'd also have to teach vmscan.c to hand off the page. It currently
expects that it either frees the page back to the allocator, or puts
it back on the LRU. We'd need a compromise where it continues to tear
down the page and remove the mapping, but then leaves it to zswap.

Neither of those sound impossible. But since it's a bigger
complication than this proposal, it probably needs a new cost/benefit
analysis, with potentially more data on the problem of LRU inversions.

Thanks for your insightful feedback, Yosry.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2023-10-17  4:47   ` Johannes Weiner
@ 2023-10-17  5:33     ` Yosry Ahmed
  2023-10-17 14:51       ` Johannes Weiner
  2023-10-17 19:24     ` Nhat Pham
  1 sibling, 1 reply; 20+ messages in thread
From: Yosry Ahmed @ 2023-10-17  5:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Nhat Pham, akpm, cerasuolodomenico, sjenning, ddstreet,
	vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky, rppt,
	linux-mm, kernel-team, linux-kernel, david, Wei Xu, Chris Li,
	Greg Thelen

On Mon, Oct 16, 2023 at 9:47 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Oct 16, 2023 at 05:57:31PM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > Currently, when a zswap store attempt fails, the page is immediately
> > > swapped out. This could happen for a variety of reasons. For instance,
> > > the compression algorithm could fail (such as when the data is not
> > > compressible), or the backend allocator might not be able to find a
> > > suitable slot for the compressed page. If these pages are needed
> > > later on, users will incur IOs from swapins.
> > >
> > > This issue prevents the adoption of zswap for potential users who
> > > cannot tolerate the latency associated with swapping. In many cases,
> > > these IOs are avoidable if we just keep in memory the pages that zswap
> > > fail to store.
> > >
> > > This patch series add two new features for zswap that will alleviate
> > > the risk of swapping:
> > >
> > > a) When a store attempt fail, keep the page untouched in memory
> > > instead of swapping it out.
> >
> > What about writeback when the zswap limit is hit? I understand the
> > problem, but I am wondering if this is the correct way of fixing it.
> > We really need to make zswap work without a backing swapfile, which I
> > think is the correct way to fix all these problems. I was working on
> > that, but unfortunately I had to pivot to something else before I had
> > something that was working.
> >
> > At Google, we have "ghost" swapfiles that we use just to use zswap
> > without a swapfile. They are sparse files, and we have internal kernel
> > patches to flag them and never try to actually write to them.
> >
> > I am not sure how many bandaids we can afford before doing the right
> > thing. I understand it's a much larger surgery, perhaps there is a way
> > to get a short-term fix that is also a step towards the final state we
> > want to reach instead?
>
> I agree it should also stop writeback due to the limit.
>
> Note that a knob like this is still useful even when zswap space is
> decoupled from disk swap slots. We still are using disk swap broadly
> in the fleet as well, so a static ghost file setup wouldn't be a good
> solution for us. Swapoff with common swapfile sizes is often not an
> option during runtime, due to how slow it is, and the destabilizing
> effect it can have on the system unless that's basically completely
> idle. As such, we expect to continue deploying swap files for physical
> use, and switch the zswap-is-terminal knob depending on the workload.
>
> The other aspect to this is that workloads that do not want the
> swapout/swapin overhead for themselves are usually co-located with
> system management software, and/or can share the host with less
> latency sensitive workloads, that should continue to use disk swap.
>
> Due to the latter case, I wonder if a global knob is actually
> enough. More likely we'd need per-cgroup control over this.

In conjunction with ghost swapfiles, we have a knob to determine the
type of swapfile to use for each cgroup (normal, ghost, either, or
none). This achieves what you are describing, allowing different
workloads on the same machine to use zswap only or disk swap, although
in practice we only use zswap now.

I am not saying that's necessarily the correct way of doing it. Having
a zswap-is-terminal knob per-cgroup is another way to achieve this. I
will loop in folks maintaining this code internally to see what they
think.

>
> [ It's at this point where the historic coupling of zswap to disk
>   space is especially unfortunate. Because of it, zswap usage counts
>   toward the memory.swap allowance. If these were separate, we could
>   have easily set memory.zswap.max=max, memory.swap.max=0 to achieve
>   the desired effect.
>
>   Alas, that ship has sailed. Even after a decoupling down the line it
>   would be difficult to change established memory.swap semantics. ]

Fully agree here. This is unfortunate.

>
> So I obviously agree that we still need to invest in decoupling zswap
> space from physical disk slots. It's insanely wasteful, especially
> with larger memory capacities. But while it would be a fantastic
> optimization, I don't see how it would be an automatic solution to the
> problem that inspired this proposal.

Well, in my head, I imagine such a world where we have multiple
separate swapping backends with cgroup knob(s) that control what
backends are allowed for each cgroup. A zswap-is-terminal knob is
hacky-ish way of doing that where the backends are only zswap and disk
swap.

>
> We still need some way to reasonably express desired workload policy
> in a setup that supports multiple, simultaneous modes of operation.
>
> > > b) If the store attempt fails at the compression step, allow the page
> > > to be stored in its uncompressed form in the zswap pool. This maintains
> > > the LRU ordering of pages, which will be helpful for accurate
> > > memory reclaim (zswap writeback in particular).
> >
> > This is dangerous. Johannes and I discussed this before. This means
> > that reclaim can end up allocating more memory instead of freeing.
> > Allocations made in the reclaim path are made under the assumption
> > that we will eventually free memory. In this case, we won't. In the
> > worst case scenario, reclaim can leave the system/memcg in a worse
> > state than before it started.
>
> Yeah, this is a concern. It's not such a big deal if it's only a few
> pages, and we're still shrinking the footprint on aggregate. But it's
> conceivable this can happen systematically with some datasets, in
> which case reclaim will drive up the memory consumption and cause
> OOMs, or potentially deplete the reserves with PF_MEMALLOC and cause
> memory deadlocks.
>
> This isn't something we can reasonably accept as worst-case behavior.

Right.

>
> > Perhaps there is a way we can do this without allocating a zswap entry?
> >
> > I thought before about having a special list_head that allows us to
> > use the lower bits of the pointers as markers, similar to the xarray.
> > The markers can be used to place different objects on the same list.
> > We can have a list that is a mixture of struct page and struct
> > zswap_entry. I never pursued this idea, and I am sure someone will
> > scream at me for suggesting it. Maybe there is a less convoluted way
> > to keep the LRU ordering intact without allocating memory on the
> > reclaim path.
>
> That should work. Once zswap has exclusive control over the page, it
> is free to muck with its lru linkage. A lower bit tag on the next or
> prev pointer should suffice to distinguish between struct page and
> struct zswap_entry when pulling stuff from the list.

Right.

We handle incompressible memory internally in a different way, we put
them back on the unevictable list with an incompressible page flag.
This achieves a similar effect.

A missing point here is that those pages, if dirtied, may be
compressible again. When we have them on the LRUs, we rely on periodic
scanning (similar to the MGLRU-based periodic scanning we proposed
before) to check the dirty bit and make those pages evictable again.

If we leave them on the zswap LRU, we will incur a fault instead to
pull them back to the LRUs. For anon pages, that's probably fine, as
in both cases by the time we reach zswap the page has been unmapped,
and accessing it again incurs a fault anyway (whether it's in zswap
LRUs or in the reclaim LRUs). For shmem though, we put the
incompressible pages back in the page cache, preventing a page fault
on the next access. This is a drawback of the zswap LRU approach
AFAICT. Not sure how much it matters in practice.

>
> We'd also have to teach vmscan.c to hand off the page. It currently
> expects that it either frees the page back to the allocator, or puts
> it back on the LRU. We'd need a compromise where it continues to tear
> down the page and remove the mapping, but then leaves it to zswap.

Right.

>
> Neither of those sound impossible. But since it's a bigger
> complication than this proposal, it probably needs a new cost/benefit
> analysis, with potentially more data on the problem of LRU inversions.

Makes sense.

>
> Thanks for your insightful feedback, Yosry.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2023-10-17  5:33     ` Yosry Ahmed
@ 2023-10-17 14:51       ` Johannes Weiner
  2023-10-17 15:51         ` Yosry Ahmed
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2023-10-17 14:51 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Nhat Pham, akpm, cerasuolodomenico, sjenning, ddstreet,
	vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky, rppt,
	linux-mm, kernel-team, linux-kernel, david, Wei Xu, Chris Li,
	Greg Thelen

On Mon, Oct 16, 2023 at 10:33:23PM -0700, Yosry Ahmed wrote:
> On Mon, Oct 16, 2023 at 9:47 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Mon, Oct 16, 2023 at 05:57:31PM -0700, Yosry Ahmed wrote:
> > > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > So I obviously agree that we still need to invest in decoupling zswap
> > space from physical disk slots. It's insanely wasteful, especially
> > with larger memory capacities. But while it would be a fantastic
> > optimization, I don't see how it would be an automatic solution to the
> > problem that inspired this proposal.
> 
> Well, in my head, I imagine such a world where we have multiple
> separate swapping backends with cgroup knob(s) that control what
> backends are allowed for each cgroup. A zswap-is-terminal knob is
> hacky-ish way of doing that where the backends are only zswap and disk
> swap.

"I want compression" vs "I want disk offloading" is a more reasonable
question to ask at the cgroup level. We've had historically a variety
of swap configurations across the fleet. E.g. it's a lot easier to add
another swapfile than it is to grow an existing one at runtime. In
other cases, one storage config might have one swapfile, another
machine model might want to spread it out over multiple disks etc.

This doesn't matter much with ghost files. But with conventional
swapfiles this requires an unnecessary awareness of the backend
topology in order to express container policy. That's no bueno.

> > > Perhaps there is a way we can do this without allocating a zswap entry?
> > >
> > > I thought before about having a special list_head that allows us to
> > > use the lower bits of the pointers as markers, similar to the xarray.
> > > The markers can be used to place different objects on the same list.
> > > We can have a list that is a mixture of struct page and struct
> > > zswap_entry. I never pursued this idea, and I am sure someone will
> > > scream at me for suggesting it. Maybe there is a less convoluted way
> > > to keep the LRU ordering intact without allocating memory on the
> > > reclaim path.
> >
> > That should work. Once zswap has exclusive control over the page, it
> > is free to muck with its lru linkage. A lower bit tag on the next or
> > prev pointer should suffice to distinguish between struct page and
> > struct zswap_entry when pulling stuff from the list.
> 
> Right.
> 
> We handle incompressible memory internally in a different way, we put
> them back on the unevictable list with an incompressible page flag.
> This achieves a similar effect.

It doesn't. We want those incompressible pages to continue aging
alongside their compressible peers, and eventually get written back to
disk with them.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2023-10-17 14:51       ` Johannes Weiner
@ 2023-10-17 15:51         ` Yosry Ahmed
  0 siblings, 0 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-10-17 15:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Nhat Pham, akpm, cerasuolodomenico, sjenning, ddstreet,
	vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky, rppt,
	linux-mm, kernel-team, linux-kernel, david, Wei Xu, Chris Li,
	Greg Thelen

On Tue, Oct 17, 2023 at 7:51 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Oct 16, 2023 at 10:33:23PM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 16, 2023 at 9:47 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Mon, Oct 16, 2023 at 05:57:31PM -0700, Yosry Ahmed wrote:
> > > > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > > So I obviously agree that we still need to invest in decoupling zswap
> > > space from physical disk slots. It's insanely wasteful, especially
> > > with larger memory capacities. But while it would be a fantastic
> > > optimization, I don't see how it would be an automatic solution to the
> > > problem that inspired this proposal.
> >
> > Well, in my head, I imagine such a world where we have multiple
> > separate swapping backends with cgroup knob(s) that control what
> > backends are allowed for each cgroup. A zswap-is-terminal knob is
> > hacky-ish way of doing that where the backends are only zswap and disk
> > swap.
>
> "I want compression" vs "I want disk offloading" is a more reasonable
> question to ask at the cgroup level. We've had historically a variety
> of swap configurations across the fleet. E.g. it's a lot easier to add
> another swapfile than it is to grow an existing one at runtime. In
> other cases, one storage config might have one swapfile, another
> machine model might want to spread it out over multiple disks etc.
>
> This doesn't matter much with ghost files. But with conventional
> swapfiles this requires an unnecessary awareness of the backend
> topology in order to express container policy. That's no bueno.

Oh I didn't mean that cgroups would designate specific swapfiles, but
rather swapping backends, which would be "zswap" or "disk" or both in
this case. I just imagined an interface that is more generic and
extensible rather than a specific zswap-is-terminal knob.

>
> > > > Perhaps there is a way we can do this without allocating a zswap entry?
> > > >
> > > > I thought before about having a special list_head that allows us to
> > > > use the lower bits of the pointers as markers, similar to the xarray.
> > > > The markers can be used to place different objects on the same list.
> > > > We can have a list that is a mixture of struct page and struct
> > > > zswap_entry. I never pursued this idea, and I am sure someone will
> > > > scream at me for suggesting it. Maybe there is a less convoluted way
> > > > to keep the LRU ordering intact without allocating memory on the
> > > > reclaim path.
> > >
> > > That should work. Once zswap has exclusive control over the page, it
> > > is free to muck with its lru linkage. A lower bit tag on the next or
> > > prev pointer should suffice to distinguish between struct page and
> > > struct zswap_entry when pulling stuff from the list.
> >
> > Right.
> >
> > We handle incompressible memory internally in a different way, we put
> > them back on the unevictable list with an incompressible page flag.
> > This achieves a similar effect.
>
> It doesn't. We want those incompressible pages to continue aging
> alongside their compressible peers, and eventually get written back to
> disk with them.

Sorry I wasn't clear, I was talking about the case where zswap is
terminal. When zswap is not, in our approach we just skip zswap for
incompressible pages and write them directly to disk. Aging them on
the LRU is probably the better approach here.

For the case where zswap is terminal, making them unevictable incurs
less page faults, at least for shmem.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2023-10-17  0:57 ` [PATCH 0/2] minimize swapping on zswap store failure Yosry Ahmed
  2023-10-17  4:47   ` Johannes Weiner
@ 2023-10-17 19:03   ` Nhat Pham
  2023-10-17 19:04     ` Nhat Pham
  2025-04-02 20:06   ` Joshua Hahn
  2 siblings, 1 reply; 20+ messages in thread
From: Nhat Pham @ 2023-10-17 19:03 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: akpm, hannes, cerasuolodomenico, sjenning, ddstreet, vitaly.wool,
	hughd, corbet, konrad.wilk, senozhatsky, rppt, linux-mm,
	kernel-team, linux-kernel, david

On Mon, Oct 16, 2023 at 5:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Currently, when a zswap store attempt fails, the page is immediately
> > swapped out. This could happen for a variety of reasons. For instance,
> > the compression algorithm could fail (such as when the data is not
> > compressible), or the backend allocator might not be able to find a
> > suitable slot for the compressed page. If these pages are needed
> > later on, users will incur IOs from swapins.
> >
> > This issue prevents the adoption of zswap for potential users who
> > cannot tolerate the latency associated with swapping. In many cases,
> > these IOs are avoidable if we just keep in memory the pages that zswap
> > fail to store.
> >
> > This patch series add two new features for zswap that will alleviate
> > the risk of swapping:
> >
> > a) When a store attempt fail, keep the page untouched in memory
> > instead of swapping it out.
>
> What about writeback when the zswap limit is hit? I understand the
> problem, but I am wondering if this is the correct way of fixing it.
> We really need to make zswap work without a backing swapfile, which I
> think is the correct way to fix all these problems. I was working on
> that, but unfortunately I had to pivot to something else before I had
> something that was working.
>
> At Google, we have "ghost" swapfiles that we use just to use zswap
> without a swapfile. They are sparse files, and we have internal kernel
> patches to flag them and never try to actually write to them.
>
> I am not sure how many bandaids we can afford before doing the right
> thing. I understand it's a much larger surgery, perhaps there is a way
> to get a short-term fix that is also a step towards the final state we
> want to reach instead?

Regarding the writeback - I'll make sure to also short-circuit writeback
when the bypass_swap option is enabled in v2 :) I'll probably send out
v2 after

I absolutely agree that we must decouple zswap and swap (and would
be happy to help out in any capacity I could - we have heard similar
concerns/complaints about swap wastage from internal parties as well).

However, as Johannes has pointed out, this feature still has its place,
given our already existing swapfile deployments. I do agree that a
global knob is insufficient tho. I'll add a per-cgroup knob in v2 so that
we can enable/disable this feature on a per-workload basis.

>
> >
> > b) If the store attempt fails at the compression step, allow the page
> > to be stored in its uncompressed form in the zswap pool. This maintains
> > the LRU ordering of pages, which will be helpful for accurate
> > memory reclaim (zswap writeback in particular).
>
> This is dangerous. Johannes and I discussed this before. This means
> that reclaim can end up allocating more memory instead of freeing.
> Allocations made in the reclaim path are made under the assumption
> that we will eventually free memory. In this case, we won't. In the
> worst case scenario, reclaim can leave the system/memcg in a worse
> state than before it started.
>
> Perhaps there is a way we can do this without allocating a zswap entry?
>
> I thought before about having a special list_head that allows us to
> use the lower bits of the pointers as markers, similar to the xarray.
> The markers can be used to place different objects on the same list.
> We can have a list that is a mixture of struct page and struct
> zswap_entry. I never pursued this idea, and I am sure someone will
> scream at me for suggesting it. Maybe there is a less convoluted way
> to keep the LRU ordering intact without allocating memory on the
> reclaim path.

Hmm yeah you're right about these concerns. That seems like a lot more
involved than what I envisioned initially.

Let's put this aside for now. I'll just send the first patch in v2, and we can
work on + discuss more about uncompressed pages storing later on.

>
> >
> > These features could be enabled independently via two new zswap module
> > parameters.
> >
> > Nhat Pham (2):
> >   swap: allows swap bypassing on zswap store failure
> >   zswap: store uncompressed pages when compression algorithm fails
> >
> >  Documentation/admin-guide/mm/zswap.rst | 16 +++++++
> >  include/linux/zswap.h                  |  9 ++++
> >  mm/page_io.c                           |  6 +++
> >  mm/shmem.c                             |  8 +++-
> >  mm/zswap.c                             | 64 +++++++++++++++++++++++---
> >  5 files changed, 95 insertions(+), 8 deletions(-)
> >
> > --
> > 2.34.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2023-10-17 19:03   ` Nhat Pham
@ 2023-10-17 19:04     ` Nhat Pham
  0 siblings, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2023-10-17 19:04 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: akpm, hannes, cerasuolodomenico, sjenning, ddstreet, vitaly.wool,
	hughd, corbet, konrad.wilk, senozhatsky, rppt, linux-mm,
	kernel-team, linux-kernel, david

On Tue, Oct 17, 2023 at 12:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Oct 16, 2023 at 5:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > Currently, when a zswap store attempt fails, the page is immediately
> > > swapped out. This could happen for a variety of reasons. For instance,
> > > the compression algorithm could fail (such as when the data is not
> > > compressible), or the backend allocator might not be able to find a
> > > suitable slot for the compressed page. If these pages are needed
> > > later on, users will incur IOs from swapins.
> > >
> > > This issue prevents the adoption of zswap for potential users who
> > > cannot tolerate the latency associated with swapping. In many cases,
> > > these IOs are avoidable if we just keep in memory the pages that zswap
> > > fail to store.
> > >
> > > This patch series add two new features for zswap that will alleviate
> > > the risk of swapping:
> > >
> > > a) When a store attempt fail, keep the page untouched in memory
> > > instead of swapping it out.
> >
> > What about writeback when the zswap limit is hit? I understand the
> > problem, but I am wondering if this is the correct way of fixing it.
> > We really need to make zswap work without a backing swapfile, which I
> > think is the correct way to fix all these problems. I was working on
> > that, but unfortunately I had to pivot to something else before I had
> > something that was working.
> >
> > At Google, we have "ghost" swapfiles that we use just to use zswap
> > without a swapfile. They are sparse files, and we have internal kernel
> > patches to flag them and never try to actually write to them.
> >
> > I am not sure how many bandaids we can afford before doing the right
> > thing. I understand it's a much larger surgery, perhaps there is a way
> > to get a short-term fix that is also a step towards the final state we
> > want to reach instead?
>
> Regarding the writeback - I'll make sure to also short-circuit writeback
> when the bypass_swap option is enabled in v2 :) I'll probably send out
> v2 after

... I gather all these feedbacks.

> I absolutely agree that we must decouple zswap and swap (and would
> be happy to help out in any capacity I could - we have heard similar
> concerns/complaints about swap wastage from internal parties as well).
>
> However, as Johannes has pointed out, this feature still has its place,
> given our already existing swapfile deployments. I do agree that a
> global knob is insufficient tho. I'll add a per-cgroup knob in v2 so that
> we can enable/disable this feature on a per-workload basis.
>
> >
> > >
> > > b) If the store attempt fails at the compression step, allow the page
> > > to be stored in its uncompressed form in the zswap pool. This maintains
> > > the LRU ordering of pages, which will be helpful for accurate
> > > memory reclaim (zswap writeback in particular).
> >
> > This is dangerous. Johannes and I discussed this before. This means
> > that reclaim can end up allocating more memory instead of freeing.
> > Allocations made in the reclaim path are made under the assumption
> > that we will eventually free memory. In this case, we won't. In the
> > worst case scenario, reclaim can leave the system/memcg in a worse
> > state than before it started.
> >
> > Perhaps there is a way we can do this without allocating a zswap entry?
> >
> > I thought before about having a special list_head that allows us to
> > use the lower bits of the pointers as markers, similar to the xarray.
> > The markers can be used to place different objects on the same list.
> > We can have a list that is a mixture of struct page and struct
> > zswap_entry. I never pursued this idea, and I am sure someone will
> > scream at me for suggesting it. Maybe there is a less convoluted way
> > to keep the LRU ordering intact without allocating memory on the
> > reclaim path.
>
> Hmm yeah you're right about these concerns. That seems like a lot more
> involved than what I envisioned initially.
>
> Let's put this aside for now. I'll just send the first patch in v2, and we can
> work on + discuss more about uncompressed pages storing later on.
>
> >
> > >
> > > These features could be enabled independently via two new zswap module
> > > parameters.
> > >
> > > Nhat Pham (2):
> > >   swap: allows swap bypassing on zswap store failure
> > >   zswap: store uncompressed pages when compression algorithm fails
> > >
> > >  Documentation/admin-guide/mm/zswap.rst | 16 +++++++
> > >  include/linux/zswap.h                  |  9 ++++
> > >  mm/page_io.c                           |  6 +++
> > >  mm/shmem.c                             |  8 +++-
> > >  mm/zswap.c                             | 64 +++++++++++++++++++++++---
> > >  5 files changed, 95 insertions(+), 8 deletions(-)
> > >
> > > --
> > > 2.34.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2023-10-17  4:47   ` Johannes Weiner
  2023-10-17  5:33     ` Yosry Ahmed
@ 2023-10-17 19:24     ` Nhat Pham
  1 sibling, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2023-10-17 19:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yosry Ahmed, akpm, cerasuolodomenico, sjenning, ddstreet,
	vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky, rppt,
	linux-mm, kernel-team, linux-kernel, david

On Mon, Oct 16, 2023 at 9:47 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Oct 16, 2023 at 05:57:31PM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > Currently, when a zswap store attempt fails, the page is immediately
> > > swapped out. This could happen for a variety of reasons. For instance,
> > > the compression algorithm could fail (such as when the data is not
> > > compressible), or the backend allocator might not be able to find a
> > > suitable slot for the compressed page. If these pages are needed
> > > later on, users will incur IOs from swapins.
> > >
> > > This issue prevents the adoption of zswap for potential users who
> > > cannot tolerate the latency associated with swapping. In many cases,
> > > these IOs are avoidable if we just keep in memory the pages that zswap
> > > fail to store.
> > >
> > > This patch series add two new features for zswap that will alleviate
> > > the risk of swapping:
> > >
> > > a) When a store attempt fail, keep the page untouched in memory
> > > instead of swapping it out.
> >
> > What about writeback when the zswap limit is hit? I understand the
> > problem, but I am wondering if this is the correct way of fixing it.
> > We really need to make zswap work without a backing swapfile, which I
> > think is the correct way to fix all these problems. I was working on
> > that, but unfortunately I had to pivot to something else before I had
> > something that was working.
> >
> > At Google, we have "ghost" swapfiles that we use just to use zswap
> > without a swapfile. They are sparse files, and we have internal kernel
> > patches to flag them and never try to actually write to them.
> >
> > I am not sure how many bandaids we can afford before doing the right
> > thing. I understand it's a much larger surgery, perhaps there is a way
> > to get a short-term fix that is also a step towards the final state we
> > want to reach instead?
>
> I agree it should also stop writeback due to the limit.
>
> Note that a knob like this is still useful even when zswap space is
> decoupled from disk swap slots. We still are using disk swap broadly
> in the fleet as well, so a static ghost file setup wouldn't be a good
> solution for us. Swapoff with common swapfile sizes is often not an
> option during runtime, due to how slow it is, and the destabilizing
> effect it can have on the system unless that's basically completely
> idle. As such, we expect to continue deploying swap files for physical
> use, and switch the zswap-is-terminal knob depending on the workload.
>
> The other aspect to this is that workloads that do not want the
> swapout/swapin overhead for themselves are usually co-located with
> system management software, and/or can share the host with less
> latency sensitive workloads, that should continue to use disk swap.
>
> Due to the latter case, I wonder if a global knob is actually
> enough. More likely we'd need per-cgroup control over this.
>
> [ It's at this point where the historic coupling of zswap to disk
>   space is especially unfortunate. Because of it, zswap usage counts
>   toward the memory.swap allowance. If these were separate, we could
>   have easily set memory.zswap.max=max, memory.swap.max=0 to achieve
>   the desired effect.
>
>   Alas, that ship has sailed. Even after a decoupling down the line it
>   would be difficult to change established memory.swap semantics. ]
>
> So I obviously agree that we still need to invest in decoupling zswap
> space from physical disk slots. It's insanely wasteful, especially
> with larger memory capacities. But while it would be a fantastic
> optimization, I don't see how it would be an automatic solution to the
> problem that inspired this proposal.
>
> We still need some way to reasonably express desired workload policy
> in a setup that supports multiple, simultaneous modes of operation.
>
> > > b) If the store attempt fails at the compression step, allow the page
> > > to be stored in its uncompressed form in the zswap pool. This maintains
> > > the LRU ordering of pages, which will be helpful for accurate
> > > memory reclaim (zswap writeback in particular).
> >
> > This is dangerous. Johannes and I discussed this before. This means
> > that reclaim can end up allocating more memory instead of freeing.
> > Allocations made in the reclaim path are made under the assumption
> > that we will eventually free memory. In this case, we won't. In the
> > worst case scenario, reclaim can leave the system/memcg in a worse
> > state than before it started.
>
> Yeah, this is a concern. It's not such a big deal if it's only a few
> pages, and we're still shrinking the footprint on aggregate. But it's
> conceivable this can happen systematically with some datasets, in
> which case reclaim will drive up the memory consumption and cause
> OOMs, or potentially deplete the reserves with PF_MEMALLOC and cause
> memory deadlocks.
>
> This isn't something we can reasonably accept as worst-case behavior.
>
> > Perhaps there is a way we can do this without allocating a zswap entry?
> >
> > I thought before about having a special list_head that allows us to
> > use the lower bits of the pointers as markers, similar to the xarray.
> > The markers can be used to place different objects on the same list.
> > We can have a list that is a mixture of struct page and struct
> > zswap_entry. I never pursued this idea, and I am sure someone will
> > scream at me for suggesting it. Maybe there is a less convoluted way
> > to keep the LRU ordering intact without allocating memory on the
> > reclaim path.
>
> That should work. Once zswap has exclusive control over the page, it
> is free to muck with its lru linkage. A lower bit tag on the next or
> prev pointer should suffice to distinguish between struct page and
> struct zswap_entry when pulling stuff from the list.

Hmm I'm a bit iffy about pointers bit hacking, but I guess that's
the least involved way to store the uncompressed page in the zswap
LRU without allocating a zswap_entry for it.

Lemme give this a try. If it looks decently clean I'll send it out :)

>
> We'd also have to teach vmscan.c to hand off the page. It currently
> expects that it either frees the page back to the allocator, or puts
> it back on the LRU. We'd need a compromise where it continues to tear
> down the page and remove the mapping, but then leaves it to zswap.
>
> Neither of those sound impossible. But since it's a bigger
> complication than this proposal, it probably needs a new cost/benefit
> analysis, with potentially more data on the problem of LRU inversions.
>
> Thanks for your insightful feedback, Yosry.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2023-10-17  0:57 ` [PATCH 0/2] minimize swapping on zswap store failure Yosry Ahmed
  2023-10-17  4:47   ` Johannes Weiner
  2023-10-17 19:03   ` Nhat Pham
@ 2025-04-02 20:06   ` Joshua Hahn
  2025-04-03 20:38     ` Nhat Pham
                       ` (2 more replies)
  2 siblings, 3 replies; 20+ messages in thread
From: Joshua Hahn @ 2025-04-02 20:06 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Nhat Pham, akpm, hannes, cerasuolodomenico, sjenning, ddstreet,
	vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky, rppt,
	linux-mm, kernel-team, linux-kernel, david

On Mon, 16 Oct 2023 17:57:31 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:

> On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:

> I thought before about having a special list_head that allows us to
> use the lower bits of the pointers as markers, similar to the xarray.
> The markers can be used to place different objects on the same list.
> We can have a list that is a mixture of struct page and struct
> zswap_entry. I never pursued this idea, and I am sure someone will
> scream at me for suggesting it. Maybe there is a less convoluted way
> to keep the LRU ordering intact without allocating memory on the
> reclaim path.

Hi Yosry,

Apologies for reviving an old thread, but I wasn't sure whether opening an
entirely new thread was a better choice : -)

So I've implemented your idea, using the lower 2 bits of the list_head's prev
pointer (last bit indicates whether the list_head belongs to a page or a
zswap_entry, and the second to last bit was repurposed for the second chance
algorithm).

For a very high level overview what I did in the patch:
- When a page fails to compress, I remove the page mapping and tag both the
  xarray entry (tag == set lowest bit to 1) and the page's list_head prev ptr,
  then store the page directly into the zswap LRU.
- In zswap_load, we take the entry out of the xarray and check if it's tagged.
  - If it is tagged, then instead of decompressing, we just copy the page's
    contents to the newly allocated page. 
- (More details about how to teach vmscan / page_io / list iterators how to
  handle this, but we can gloss over those details for now)

I have a working version, but have been holding off because I have only been
seeing regressions. I wasn't really sure where they were coming from, but
after going through some perf traces with Nhat, found out that the regressions
come from the associated page faults that come from initially unmapping the
page, and then re-allocating it for every load. This causes (1) more memcg
flushing, and (2) extra allocations ==> more pressure ==> more reclaim, even
though we only temporarily keep the extra page.

Just wanted to put this here in case you were still thinking about this idea.
What do you think? Ideally, there would be a way to keep the page around in
the zswap LRU, but do not have to re-allocate a new page on a fault, but this
seems like a bigger task.

Ultimately the goal is to prevent an incompressible page from hoarding the
compression algorithm on multiple reclaim attempts, but if we are spending
more time by allocating new pages... maybe this isn't the correct approach :(

Please let me know if you have any thoughts on this : -)
Have a great day!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2025-04-02 20:06   ` Joshua Hahn
@ 2025-04-03 20:38     ` Nhat Pham
  2025-04-04  1:46       ` Sergey Senozhatsky
  2025-04-04 15:39     ` Nhat Pham
  2025-04-22 11:27     ` Yosry Ahmed
  2 siblings, 1 reply; 20+ messages in thread
From: Nhat Pham @ 2025-04-03 20:38 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Yosry Ahmed, akpm, hannes, cerasuolodomenico, sjenning, ddstreet,
	vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky, rppt,
	linux-mm, kernel-team, linux-kernel, david, Minchan Kim,
	Shakeel Butt, Chengming Zhou, Kairui Song

On Wed, Apr 2, 2025 at 1:06 PM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> On Mon, 16 Oct 2023 17:57:31 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> > I thought before about having a special list_head that allows us to
> > use the lower bits of the pointers as markers, similar to the xarray.
> > The markers can be used to place different objects on the same list.
> > We can have a list that is a mixture of struct page and struct
> > zswap_entry. I never pursued this idea, and I am sure someone will
> > scream at me for suggesting it. Maybe there is a less convoluted way
> > to keep the LRU ordering intact without allocating memory on the
> > reclaim path.
>
> Hi Yosry,
>
> Apologies for reviving an old thread, but I wasn't sure whether opening an
> entirely new thread was a better choice : -)
>
> So I've implemented your idea, using the lower 2 bits of the list_head's prev
> pointer (last bit indicates whether the list_head belongs to a page or a
> zswap_entry, and the second to last bit was repurposed for the second chance
> algorithm).
>
> For a very high level overview what I did in the patch:
> - When a page fails to compress, I remove the page mapping and tag both the
>   xarray entry (tag == set lowest bit to 1) and the page's list_head prev ptr,
>   then store the page directly into the zswap LRU.
> - In zswap_load, we take the entry out of the xarray and check if it's tagged.
>   - If it is tagged, then instead of decompressing, we just copy the page's
>     contents to the newly allocated page.
> - (More details about how to teach vmscan / page_io / list iterators how to
>   handle this, but we can gloss over those details for now)
>
> I have a working version, but have been holding off because I have only been
> seeing regressions. I wasn't really sure where they were coming from, but
> after going through some perf traces with Nhat, found out that the regressions
> come from the associated page faults that come from initially unmapping the
> page, and then re-allocating it for every load. This causes (1) more memcg
> flushing, and (2) extra allocations ==> more pressure ==> more reclaim, even
> though we only temporarily keep the extra page.

Thanks for your effort on this idea :)

>
> Just wanted to put this here in case you were still thinking about this idea.
> What do you think? Ideally, there would be a way to keep the page around in
> the zswap LRU, but do not have to re-allocate a new page on a fault, but this
> seems like a bigger task.

I wonder if we can return the page in the event of a page fault. We'll
need to keep it in the swap cache for this to work:

1. On reclaim, do the same thing as your prototype but keep the page
in swap cache (i.e do not remove_mapping() it).

2. On page fault (do_swap_page), before returning check if the page is
in zswap LRU. If it is, invalidate the zswap LRU linkage, and put it
back to one of the proper LRUs.

Johannes, do you feel like this is possible?

>
> Ultimately the goal is to prevent an incompressible page from hoarding the
> compression algorithm on multiple reclaim attempts, but if we are spending
> more time by allocating new pages... maybe this isn't the correct approach :(

Hmmm, IIUC this problem also exists with zram, since zram allocates a
PAGE_SIZE sized buffer to hold the original page's content. I will
note though that zram seems to favor these kinds of pages for
writeback :) Maybe this is why...?

(+ Minchan)

>
> Please let me know if you have any thoughts on this : -)

Well worst case scenario there is still the special incompressible LRU
idea. We'll need some worker thread to check for write access to these
pages to promote them though.

(+ Shakeel)

> Have a great day!
> Joshua
>
> Sent using hkml (https://github.com/sjp38/hackermail)
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2025-04-03 20:38     ` Nhat Pham
@ 2025-04-04  1:46       ` Sergey Senozhatsky
  2025-04-04 14:06         ` Joshua Hahn
  0 siblings, 1 reply; 20+ messages in thread
From: Sergey Senozhatsky @ 2025-04-04  1:46 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Joshua Hahn, Yosry Ahmed, akpm, hannes, cerasuolodomenico,
	sjenning, ddstreet, vitaly.wool, hughd, corbet, konrad.wilk,
	senozhatsky, rppt, linux-mm, kernel-team, linux-kernel, david,
	Minchan Kim, Shakeel Butt, Chengming Zhou, Kairui Song

On (25/04/03 13:38), Nhat Pham wrote:
> > Ultimately the goal is to prevent an incompressible page from hoarding the
> > compression algorithm on multiple reclaim attempts, but if we are spending
> > more time by allocating new pages... maybe this isn't the correct approach :(
> 
> Hmmm, IIUC this problem also exists with zram, since zram allocates a
> PAGE_SIZE sized buffer to hold the original page's content. I will
> note though that zram seems to favor these kinds of pages for
> writeback :) Maybe this is why...?

zram is a generic block device, it must store whatever comes in,
compressible or incompressible.  E.g. when we have, say, ext4
running atop of the zram device we cannot reject page stores.

And you are right, when we use zram for swap, there is some benefit
in storing incompressible pages.  First, those pages are candidates
for zram writeback, which achieves the goal of removing the page from
RAM after all, we give up on the incompressible page reclamation with
"return it back to LRU" approach.  Second, on some zram setups we do
re-compression (with a slower and more efficient algorithm) and in
certain number of cases what is incompressible with the primary (fast)
algorithm is compressible with the secondary algorithm.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2025-04-04  1:46       ` Sergey Senozhatsky
@ 2025-04-04 14:06         ` Joshua Hahn
  2025-04-04 15:29           ` Nhat Pham
  2025-04-08  3:33           ` Sergey Senozhatsky
  0 siblings, 2 replies; 20+ messages in thread
From: Joshua Hahn @ 2025-04-04 14:06 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Nhat Pham, Yosry Ahmed, akpm, hannes, cerasuolodomenico, sjenning,
	ddstreet, vitaly.wool, hughd, corbet, konrad.wilk, rppt, linux-mm,
	kernel-team, linux-kernel, david, Minchan Kim, Shakeel Butt,
	Chengming Zhou, Kairui Song

On Fri, 4 Apr 2025 10:46:22 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:

> On (25/04/03 13:38), Nhat Pham wrote:
> > > Ultimately the goal is to prevent an incompressible page from hoarding the
> > > compression algorithm on multiple reclaim attempts, but if we are spending
> > > more time by allocating new pages... maybe this isn't the correct approach :(
> > 
> > Hmmm, IIUC this problem also exists with zram, since zram allocates a
> > PAGE_SIZE sized buffer to hold the original page's content. I will
> > note though that zram seems to favor these kinds of pages for
> > writeback :) Maybe this is why...?
> 
> zram is a generic block device, it must store whatever comes in,
> compressible or incompressible.  E.g. when we have, say, ext4
> running atop of the zram device we cannot reject page stores.
> 
> And you are right, when we use zram for swap, there is some benefit
> in storing incompressible pages.  First, those pages are candidates
> for zram writeback, which achieves the goal of removing the page from
> RAM after all, we give up on the incompressible page reclamation with
> "return it back to LRU" approach.  Second, on some zram setups we do
> re-compression (with a slower and more efficient algorithm) and in
> certain number of cases what is incompressible with the primary (fast)
> algorithm is compressible with the secondary algorithm.

Hello Sergey,

Thank you for your insight, I did not know this is how zram handled
incompressible pages. In the case of this prototype, I expected to see the most
gains from storing incompressible pages in the zswap LRU when writeback was
disabled (if writeback is enabled, then we expect to see less differences with
just writing the page back).

On the note of trying a second compression algorithm -- do you know how much
of the initially incompressible pages get compressed later? I can certainly
imagine that trying different compression algorithms makes a difference, I am
wondering if zswap should attempt this as well, or if it is not worth spending
even more CPU trying to re-comprses the page.

Thank you again for your response! Have a great day : -)
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2025-04-04 14:06         ` Joshua Hahn
@ 2025-04-04 15:29           ` Nhat Pham
  2025-04-08  3:33           ` Sergey Senozhatsky
  1 sibling, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2025-04-04 15:29 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Sergey Senozhatsky, Yosry Ahmed, akpm, hannes, cerasuolodomenico,
	sjenning, ddstreet, vitaly.wool, hughd, corbet, konrad.wilk, rppt,
	linux-mm, kernel-team, linux-kernel, david, Minchan Kim,
	Shakeel Butt, Chengming Zhou, Kairui Song

On Fri, Apr 4, 2025 at 7:06 AM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> On Fri, 4 Apr 2025 10:46:22 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> > On (25/04/03 13:38), Nhat Pham wrote:
> > > > Ultimately the goal is to prevent an incompressible page from hoarding the
> > > > compression algorithm on multiple reclaim attempts, but if we are spending
> > > > more time by allocating new pages... maybe this isn't the correct approach :(
> > >
> > > Hmmm, IIUC this problem also exists with zram, since zram allocates a
> > > PAGE_SIZE sized buffer to hold the original page's content. I will
> > > note though that zram seems to favor these kinds of pages for
> > > writeback :) Maybe this is why...?
> >
> > zram is a generic block device, it must store whatever comes in,
> > compressible or incompressible.  E.g. when we have, say, ext4
> > running atop of the zram device we cannot reject page stores.
> >
> > And you are right, when we use zram for swap, there is some benefit
> > in storing incompressible pages.  First, those pages are candidates
> > for zram writeback, which achieves the goal of removing the page from
> > RAM after all, we give up on the incompressible page reclamation with
> > "return it back to LRU" approach.  Second, on some zram setups we do
> > re-compression (with a slower and more efficient algorithm) and in
> > certain number of cases what is incompressible with the primary (fast)
> > algorithm is compressible with the secondary algorithm.
>
> Hello Sergey,
>
> Thank you for your insight, I did not know this is how zram handled
> incompressible pages. In the case of this prototype, I expected to see the most
> gains from storing incompressible pages in the zswap LRU when writeback was
> disabled (if writeback is enabled, then we expect to see less differences with
> just writing the page back).
>
> On the note of trying a second compression algorithm -- do you know how much
> of the initially incompressible pages get compressed later? I can certainly
> imagine that trying different compression algorithms makes a difference, I am
> wondering if zswap should attempt this as well, or if it is not worth spending
> even more CPU trying to re-comprses the page.

It wouldn't help us :) The algorithm we use, zstd, is usually already
the slow algorithm in this context. We can try higher levels of zstd,
but there are always data that are simply incompressible - think
random values, or memory already compressed by userspace.

Yeah we can target them for writeback to swap in zswap as well. It
wouldn't help your (micro)benchmark though, because IIRC you don't do
writeback and/or do not writeback before it is faulted back in :)

>
> Thank you again for your response! Have a great day : -)
> Joshua
>
> Sent using hkml (https://github.com/sjp38/hackermail)
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2025-04-02 20:06   ` Joshua Hahn
  2025-04-03 20:38     ` Nhat Pham
@ 2025-04-04 15:39     ` Nhat Pham
  2025-04-22 11:27     ` Yosry Ahmed
  2 siblings, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2025-04-04 15:39 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Yosry Ahmed, akpm, hannes, cerasuolodomenico, sjenning, ddstreet,
	vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky, rppt,
	linux-mm, kernel-team, linux-kernel, david

Wed, Apr 2, 2025 at 1:06 PM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> On Mon, 16 Oct 2023 17:57:31 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> > I thought before about having a special list_head that allows us to
> > use the lower bits of the pointers as markers, similar to the xarray.
> > The markers can be used to place different objects on the same list.
> > We can have a list that is a mixture of struct page and struct
> > zswap_entry. I never pursued this idea, and I am sure someone will
> > scream at me for suggesting it. Maybe there is a less convoluted way
> > to keep the LRU ordering intact without allocating memory on the
> > reclaim path.
>
> Hi Yosry,
>
> Apologies for reviving an old thread, but I wasn't sure whether opening an
> entirely new thread was a better choice : -)
>
> So I've implemented your idea, using the lower 2 bits of the list_head's prev
> pointer (last bit indicates whether the list_head belongs to a page or a
> zswap_entry, and the second to last bit was repurposed for the second chance
> algorithm).
>
> For a very high level overview what I did in the patch:
> - When a page fails to compress, I remove the page mapping and tag both the
>   xarray entry (tag == set lowest bit to 1) and the page's list_head prev ptr,
>   then store the page directly into the zswap LRU.
> - In zswap_load, we take the entry out of the xarray and check if it's tagged.
>   - If it is tagged, then instead of decompressing, we just copy the page's
>     contents to the newly allocated page.
> - (More details about how to teach vmscan / page_io / list iterators how to
>   handle this, but we can gloss over those details for now)
>
> I have a working version, but have been holding off because I have only been
> seeing regressions. I wasn't really sure where they were coming from, but
> after going through some perf traces with Nhat, found out that the regressions
> come from the associated page faults that come from initially unmapping the
> page, and then re-allocating it for every load. This causes (1) more memcg
> flushing, and (2) extra allocations ==> more pressure ==> more reclaim, even
> though we only temporarily keep the extra page.

Also, double check your benchmark :) If you only cycle through the
pages only in the LRU, or if you will need the pages soon, by
definition you cannot do better than the status quo.

The gains will only materialize when you have longer workload, which
can (and often will) cycle the incompressible pages through the LRU
multiple times, recompressing (and failing) them multiple times with
zswap, etc.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2025-04-04 14:06         ` Joshua Hahn
  2025-04-04 15:29           ` Nhat Pham
@ 2025-04-08  3:33           ` Sergey Senozhatsky
  1 sibling, 0 replies; 20+ messages in thread
From: Sergey Senozhatsky @ 2025-04-08  3:33 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Sergey Senozhatsky, Nhat Pham, Yosry Ahmed, akpm, hannes,
	cerasuolodomenico, sjenning, ddstreet, vitaly.wool, hughd, corbet,
	konrad.wilk, rppt, linux-mm, kernel-team, linux-kernel, david,
	Minchan Kim, Shakeel Butt, Chengming Zhou, Kairui Song

Hi,

Sorry for the delay

On (25/04/04 07:06), Joshua Hahn wrote:
> On Fri, 4 Apr 2025 10:46:22 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
> 
> > On (25/04/03 13:38), Nhat Pham wrote:
> > > > Ultimately the goal is to prevent an incompressible page from hoarding the
> > > > compression algorithm on multiple reclaim attempts, but if we are spending
> > > > more time by allocating new pages... maybe this isn't the correct approach :(
> > > 
> > > Hmmm, IIUC this problem also exists with zram, since zram allocates a
> > > PAGE_SIZE sized buffer to hold the original page's content. I will
> > > note though that zram seems to favor these kinds of pages for
> > > writeback :) Maybe this is why...?
> > 
> > zram is a generic block device, it must store whatever comes in,
> > compressible or incompressible.  E.g. when we have, say, ext4
> > running atop of the zram device we cannot reject page stores.
> > 
> > And you are right, when we use zram for swap, there is some benefit
> > in storing incompressible pages.  First, those pages are candidates
> > for zram writeback, which achieves the goal of removing the page from
> > RAM after all, we give up on the incompressible page reclamation with
> > "return it back to LRU" approach.  Second, on some zram setups we do
> > re-compression (with a slower and more efficient algorithm) and in
> > certain number of cases what is incompressible with the primary (fast)
> > algorithm is compressible with the secondary algorithm.
> 
> Hello Sergey,
> 
> Thank you for your insight, I did not know this is how zram handled
> incompressible pages.

Well, yes, zram doesn't have a freedom to reject writes, to the
fs/vfs that would look like a block device error.

[..]
> On the note of trying a second compression algorithm -- do you know how much
> of the initially incompressible pages get compressed later?

So I don't recall the exact numbers, but, if I'm not mistaken, in
my tests (on chromeos) I think I saw something like 20+% (a little
higher than just 20%) success rate (successful re-compression with
a secondary algorithm), but like you said this is very data patterns
specific.

> Thank you again for your response! Have a great day : -)

Thanks, you too!


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2025-04-02 20:06   ` Joshua Hahn
  2025-04-03 20:38     ` Nhat Pham
  2025-04-04 15:39     ` Nhat Pham
@ 2025-04-22 11:27     ` Yosry Ahmed
  2025-04-22 15:00       ` Joshua Hahn
  2 siblings, 1 reply; 20+ messages in thread
From: Yosry Ahmed @ 2025-04-22 11:27 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Yosry Ahmed, Nhat Pham, akpm, hannes, cerasuolodomenico, sjenning,
	ddstreet, vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky,
	rppt, linux-mm, kernel-team, linux-kernel, david

On Wed, Apr 02, 2025 at 01:06:49PM -0700, Joshua Hahn wrote:
> On Mon, 16 Oct 2023 17:57:31 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
> 
> > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> 
> > I thought before about having a special list_head that allows us to
> > use the lower bits of the pointers as markers, similar to the xarray.
> > The markers can be used to place different objects on the same list.
> > We can have a list that is a mixture of struct page and struct
> > zswap_entry. I never pursued this idea, and I am sure someone will
> > scream at me for suggesting it. Maybe there is a less convoluted way
> > to keep the LRU ordering intact without allocating memory on the
> > reclaim path.
> 
> Hi Yosry,
> 
> Apologies for reviving an old thread, but I wasn't sure whether opening an
> entirely new thread was a better choice : -)

This seems like the right choice to me :)

> 
> So I've implemented your idea, using the lower 2 bits of the list_head's prev
> pointer (last bit indicates whether the list_head belongs to a page or a
> zswap_entry, and the second to last bit was repurposed for the second chance
> algorithm).

Thanks a lot for spending time looking into this, and sorry for the
delayed resposne (I am technically on leave right now).

> 
> For a very high level overview what I did in the patch:
> - When a page fails to compress, I remove the page mapping and tag both the
>   xarray entry (tag == set lowest bit to 1) and the page's list_head prev ptr,
>   then store the page directly into the zswap LRU.

What do you mean by 'remove the page mapping'? Do you mean
__remove_mapping()?

This is already called by reclaim, so I assume vmscan code hands over
ownership of the page to zswap and doesn't call __remove_mapping(), so
you end up doing that in zswap instead.

> - In zswap_load, we take the entry out of the xarray and check if it's tagged.
>   - If it is tagged, then instead of decompressing, we just copy the page's
>     contents to the newly allocated page. 
> - (More details about how to teach vmscan / page_io / list iterators how to
>   handle this, but we can gloss over those details for now)
> 
> I have a working version, but have been holding off because I have only been
> seeing regressions. I wasn't really sure where they were coming from, but
> after going through some perf traces with Nhat, found out that the regressions
> come from the associated page faults that come from initially unmapping the
> page, and then re-allocating it for every load. This causes (1) more memcg
> flushing, and (2) extra allocations ==> more pressure ==> more reclaim, even
> though we only temporarily keep the extra page.

Hmm how is this worse than the status quo though? IIUC currently
incompressible pages will skip zswap and go to the backing swapfile.
Surely reading them from disk is slower than copying them?

Unless of course, writeback is disabled, in which case these pages are
not being reclaimed at all today. In this case, it makes sense that
reclaiming them makes accessing them slower, even if we don't actually
need to decompress them.

I have a few thoughts in mind:

- As Nhat said, if we can keep the pages in the swapcache, we can avoid
  making a new allocation and copying the page. We'd need to move it
  back from zswap LRUs to the reclaim LRUs though.

- One advantage of keeping incompressible pages in zswap is preserving
  LRU ordering. IOW, if some compressible pages go to zswap first (old),
  then some incompressible pages (new), then the old compressible pages
  should go to disk via writeback first. Otherwise, accessing the hotter
  incompressible pages will be slower than accessing the colder
  compressible pages. This happens today because incompressible pages go
  straight to disk.

  The above will only materialize for a workload that has writeback
  enabled and a mixture of both incompressible and compressible
  workingset.

  The other advantage, as you mention below, is preventing repeatedly
  sending incompressible pages to zswap when writeback is disabled, but
  that could be offset by the extra cost of allocations/copying.

- The above being said, we should not regress workloads that have
  writeback disabled, so we either need to keep the pages in the
  swapcache to avoid the extra allocations/copies -- or avoid storing
  the pages in zswap completely if writeback is disabled. If writeback
  is disabled and the page is incompressible, we could probably just put
  it in the unevictable LRU because that's what it really is. We'd need
  to make sure we remove it when it becomes compressible again. The
  first approach is probably simpler.

> 
> Just wanted to put this here in case you were still thinking about this idea.
> What do you think? Ideally, there would be a way to keep the page around in
> the zswap LRU, but do not have to re-allocate a new page on a fault, but this
> seems like a bigger task.
> 
> Ultimately the goal is to prevent an incompressible page from hoarding the
> compression algorithm on multiple reclaim attempts, but if we are spending
> more time by allocating new pages... maybe this isn't the correct approach :(
> 
> Please let me know if you have any thoughts on this : -)
> Have a great day!
> Joshua
> 
> Sent using hkml (https://github.com/sjp38/hackermail)
> 
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] minimize swapping on zswap store failure
  2025-04-22 11:27     ` Yosry Ahmed
@ 2025-04-22 15:00       ` Joshua Hahn
  0 siblings, 0 replies; 20+ messages in thread
From: Joshua Hahn @ 2025-04-22 15:00 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Yosry Ahmed, Nhat Pham, akpm, hannes, cerasuolodomenico, sjenning,
	ddstreet, vitaly.wool, hughd, corbet, konrad.wilk, senozhatsky,
	rppt, linux-mm, kernel-team, linux-kernel, david


> > > I thought before about having a special list_head that allows us to
> > > use the lower bits of the pointers as markers, similar to the xarray.
> > > The markers can be used to place different objects on the same list.
> > > We can have a list that is a mixture of struct page and struct
> > > zswap_entry. I never pursued this idea, and I am sure someone will
> > > scream at me for suggesting it. Maybe there is a less convoluted way
> > > to keep the LRU ordering intact without allocating memory on the
> > > reclaim path.

> > So I've implemented your idea, using the lower 2 bits of the list_head's prev
> > pointer (last bit indicates whether the list_head belongs to a page or a
> > zswap_entry, and the second to last bit was repurposed for the second chance
> > algorithm).
> 
> Thanks a lot for spending time looking into this, and sorry for the
> delayed resposne (I am technically on leave right now).

Hi Yosry,

Thank you for getting back to me! I hope you are enjoying your leave : -)

> > For a very high level overview what I did in the patch:
> > - When a page fails to compress, I remove the page mapping and tag both the
> >   xarray entry (tag == set lowest bit to 1) and the page's list_head prev ptr,
> >   then store the page directly into the zswap LRU.
> 
> What do you mean by 'remove the page mapping'? Do you mean
> __remove_mapping()?

Yes -- but I am calling remove_mapping() to unfreeze with a refcount of 1
(zswap is now the sole owner of the page). 

> This is already called by reclaim, so I assume vmscan code hands over
> ownership of the page to zswap and doesn't call __remove_mapping(), so
> you end up doing that in zswap instead.

Yes! I changed reclaim logic to be aware that zswap can do this, so there
is a new switch case that simply continues through the folio list when
zswap steals the incompressible page (but we don't want to drop the page).

> > - In zswap_load, we take the entry out of the xarray and check if it's tagged.
> >   - If it is tagged, then instead of decompressing, we just copy the page's
> >     contents to the newly allocated page. 
> > - (More details about how to teach vmscan / page_io / list iterators how to
> >   handle this, but we can gloss over those details for now)
> > 
> > I have a working version, but have been holding off because I have only been
> > seeing regressions. I wasn't really sure where they were coming from, but
> > after going through some perf traces with Nhat, found out that the regressions
> > come from the associated page faults that come from initially unmapping the
> > page, and then re-allocating it for every load. This causes (1) more memcg
> > flushing, and (2) extra allocations ==> more pressure ==> more reclaim, even
> > though we only temporarily keep the extra page.
> 
> Hmm how is this worse than the status quo though? IIUC currently
> incompressible pages will skip zswap and go to the backing swapfile.
> Surely reading them from disk is slower than copying them?
> 
> Unless of course, writeback is disabled, in which case these pages are
> not being reclaimed at all today. In this case, it makes sense that
> reclaiming them makes accessing them slower, even if we don't actually
> need to decompress them.

Yes, sorry for the ambiguity -- this was specifically for the writeback
disabled case. My focus currently is on reducing the amount of CPU cycles
spent stuck on trying to compress incompressible pages. 

> I have a few thoughts in mind:
> 
> - As Nhat said, if we can keep the pages in the swapcache, we can avoid
>   making a new allocation and copying the page. We'd need to move it
>   back from zswap LRUs to the reclaim LRUs though.

Yes, Nhat and Shakeel have both offered the same perspective. I'm currently
working on this approach with help from Nhat. 

> - One advantage of keeping incompressible pages in zswap is preserving
>   LRU ordering. IOW, if some compressible pages go to zswap first (old),
>   then some incompressible pages (new), then the old compressible pages
>   should go to disk via writeback first. Otherwise, accessing the hotter
>   incompressible pages will be slower than accessing the colder
>   compressible pages. This happens today because incompressible pages go
>   straight to disk.
> 
>   The above will only materialize for a workload that has writeback
>   enabled and a mixture of both incompressible and compressible
>   workingset.

This makes sense to me. I'll take this into consideration when writing
benchmarks for this patch!

>   The other advantage, as you mention below, is preventing repeatedly
>   sending incompressible pages to zswap when writeback is disabled, but
>   that could be offset by the extra cost of allocations/copying.

Yes -- hopefully, keeping it in swapcache allows us to reap the benefits of
both worlds, minus the duplicated allocation / copying.

> - The above being said, we should not regress workloads that have
>   writeback disabled, so we either need to keep the pages in the
>   swapcache to avoid the extra allocations/copies -- or avoid storing
>   the pages in zswap completely if writeback is disabled. If writeback
>   is disabled and the page is incompressible, we could probably just put
>   it in the unevictable LRU because that's what it really is. We'd need
>   to make sure we remove it when it becomes compressible again. The
>   first approach is probably simpler.

This is a good point. While I don't have the numbers to back this up, I have
an intuition that this patch really sees the most benefits from reducing
CPU time spent trying to compress incompressible pages, rather than from
maintaining a better LRU ordering. For that reason I also suspect that we
see much better performance gains when writeback is disabled. Following that
logic... maybe there is some future work to be done that just moves these
incompressible pages to an unevictable LRU when writeback is disabled?

For now, I'm still experimenting with keeping the page in swapcache. I'll
be sure to report back with cool findings! Thank you again for your review
of this idea Yosry, I hope you have a great day!
Joshua


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2025-04-22 15:00 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-10-17  0:35 [PATCH 0/2] minimize swapping on zswap store failure Nhat Pham
2023-10-17  0:35 ` [PATCH 1/2] swap: allows swap bypassing " Nhat Pham
2023-10-17  0:35 ` [PATCH 2/2] zswap: store uncompressed pages when compression algorithm fails Nhat Pham
2023-10-17  0:57 ` [PATCH 0/2] minimize swapping on zswap store failure Yosry Ahmed
2023-10-17  4:47   ` Johannes Weiner
2023-10-17  5:33     ` Yosry Ahmed
2023-10-17 14:51       ` Johannes Weiner
2023-10-17 15:51         ` Yosry Ahmed
2023-10-17 19:24     ` Nhat Pham
2023-10-17 19:03   ` Nhat Pham
2023-10-17 19:04     ` Nhat Pham
2025-04-02 20:06   ` Joshua Hahn
2025-04-03 20:38     ` Nhat Pham
2025-04-04  1:46       ` Sergey Senozhatsky
2025-04-04 14:06         ` Joshua Hahn
2025-04-04 15:29           ` Nhat Pham
2025-04-08  3:33           ` Sergey Senozhatsky
2025-04-04 15:39     ` Nhat Pham
2025-04-22 11:27     ` Yosry Ahmed
2025-04-22 15:00       ` Joshua Hahn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).