linux-mm.kvack.org archive mirror
* [PATCH v4 0/6] mm/zswap: dstmem reuse optimizations and cleanups
@ 2023-12-26 15:54 Chengming Zhou
  2023-12-26 15:54 ` [PATCH v4 1/6] mm/zswap: change dstmem size to one page Chengming Zhou
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Chengming Zhou @ 2023-12-26 15:54 UTC (permalink / raw)
  To: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman
  Cc: linux-kernel, Chengming Zhou, linux-mm, Nhat Pham, Yosry Ahmed,
	Chris Li

Hi everyone,

Changes in v4:
- Collect Reviewed-by and Acked-by tags.
- Fold in the comment fix in zswap_writeback_entry() from Yosry Ahmed.
- Add patch to change per-cpu mutex and dstmem to per-acomp_ctx.
- Just rename crypto_acomp_ctx->dstmem field to buffer.
- Link to v3: https://lore.kernel.org/r/20231213-zswap-dstmem-v3-0-4eac09b94ece@bytedance.com

Changes in v3:
- Collect Reviewed-by tag.
- Drop the __zswap_store() refactoring part.
- Link to v2: https://lore.kernel.org/r/20231213-zswap-dstmem-v2-0-daa5d9ae41a7@bytedance.com

Changes in v2:
- Add more changelog and test data about changing dstmem to one page.
- Reorder patches to put the dstmem reuse and __zswap_load() refactoring
  together, but still refactor after the dstmem reuse since we don't want
  to handle __zswap_load() failure due to memory allocation failure
  in zswap_writeback_entry().
- Append a patch to directly use percpu mutex and buffer in load/store
  and refactor out __zswap_store() to simplify zswap_store().
- Link to v1: https://lore.kernel.org/r/20231213-zswap-dstmem-v1-0-896763369d04@bytedance.com

This series is split from [1] and only includes the zswap dstmem reuse
optimizations and cleanups; the rbtree breakdown part is deferred and
will be retested after the rbtree is converted to an xarray.

The problem this series optimizes is that zswap_load() and
zswap_writeback_entry() have to kmalloc a temporary buffer to support
the !zpool_can_sleep_mapped() case. We can avoid that allocation by
reusing the per-CPU crypto_acomp_ctx->dstmem, which is already used by
zswap_store() and protected by the same per-CPU crypto_acomp_ctx->mutex.

[1] https://lore.kernel.org/all/20231206-zswap-lock-optimize-v1-0-e25b059f9c3a@bytedance.com/
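
To make the change concrete, below is a condensed before/after sketch of
the !zpool_can_sleep_mapped() path, condensed from the actual mm/zswap.c
changes in this series (error handling and the decompression call itself
are elided):

	/* Before: every zswap_load()/zswap_writeback_entry() call allocates
	 * a bounce buffer just to copy the compressed data out of the zpool.
	 */
	if (!zpool_can_sleep_mapped(zpool)) {
		tmp = kmalloc(entry->length, GFP_KERNEL);
		if (!tmp)
			return -ENOMEM;		/* extra failure path */
	}
	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
	if (!zpool_can_sleep_mapped(zpool)) {
		memcpy(tmp, src, entry->length);
		src = tmp;
		zpool_unmap_handle(zpool, entry->handle);
	}
	/* ... take acomp_ctx->mutex, decompress into the page ... */
	kfree(tmp);

	/* After: reuse the per-CPU dstmem that zswap_store() already uses,
	 * protected by the same per-CPU acomp_ctx->mutex, so no allocation
	 * and no extra failure path are needed.
	 */
	mutex_lock(acomp_ctx->mutex);
	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
	if (!zpool_can_sleep_mapped(zpool)) {
		memcpy(acomp_ctx->dstmem, src, entry->length);
		src = acomp_ctx->dstmem;
		zpool_unmap_handle(zpool, entry->handle);
	}
	/* ... decompress into the page ... */
	mutex_unlock(acomp_ctx->mutex);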

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Chengming Zhou (6):
      mm/zswap: change dstmem size to one page
      mm/zswap: reuse dstmem when decompress
      mm/zswap: refactor out __zswap_load()
      mm/zswap: cleanup zswap_load()
      mm/zswap: cleanup zswap_writeback_entry()
      mm/zswap: change per-cpu mutex and buffer to per-acomp_ctx

 include/linux/cpuhotplug.h |   1 -
 mm/zswap.c                 | 246 +++++++++++++--------------------------------
 2 files changed, 71 insertions(+), 176 deletions(-)
---
base-commit: 1f242c1964cf9b8d663a2fd72159b296205a8126
change-id: 20231213-zswap-dstmem-d828f563303d

Best regards,
-- 
Chengming Zhou <zhouchengming@bytedance.com>



* [PATCH v4 1/6] mm/zswap: change dstmem size to one page
  2023-12-26 15:54 [PATCH v4 0/6] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
@ 2023-12-26 15:54 ` Chengming Zhou
  2023-12-27  1:07   ` Barry Song
  2023-12-26 15:54 ` [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress Chengming Zhou
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Chengming Zhou @ 2023-12-26 15:54 UTC (permalink / raw)
  To: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman
  Cc: linux-kernel, Chengming Zhou, linux-mm, Nhat Pham, Yosry Ahmed,
	Chris Li

Change the dstmem size from 2 * PAGE_SIZE to one page, since we need
at most one page when compressing, and the "dlen" passed to
acomp_request_set_params() is also PAGE_SIZE. If the output size is
larger than PAGE_SIZE, we don't want to store the output in zswap anyway.

So change it to one page, and delete the stale comment.

There is no recorded history of why 2 pages were needed; it has been
2 * PAGE_SIZE since zswap was first merged.

According to Yosry and Nhat, one potential reason is that we used to
store a zswap header containing the swap entry in the compressed page
for writeback purposes, but we don't do that anymore.

This patch works well in kernel build testing, even when the input data
doesn't compress at all (i.e. dlen == PAGE_SIZE), which we can see
with the bpftrace tool:

bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'
@[1]: 2
@[0]: 12011430

Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Chris Li <chrisl@kernel.org> (Google)
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 7ee54a3d8281..976f278aa507 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -707,7 +707,7 @@ static int zswap_dstmem_prepare(unsigned int cpu)
 	struct mutex *mutex;
 	u8 *dst;
 
-	dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+	dst = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
 	if (!dst)
 		return -ENOMEM;
 
@@ -1662,8 +1662,7 @@ bool zswap_store(struct folio *folio)
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
-	/* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
-	sg_init_one(&output, dst, PAGE_SIZE * 2);
+	sg_init_one(&output, dst, PAGE_SIZE);
 	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
 	/*
 	 * it maybe looks a little bit silly that we send an asynchronous request,

-- 
b4 0.10.1



* [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress
  2023-12-26 15:54 [PATCH v4 0/6] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
  2023-12-26 15:54 ` [PATCH v4 1/6] mm/zswap: change dstmem size to one page Chengming Zhou
@ 2023-12-26 15:54 ` Chengming Zhou
  2023-12-27  1:24   ` Barry Song
  2023-12-26 15:54 ` [PATCH v4 3/6] mm/zswap: refactor out __zswap_load() Chengming Zhou
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Chengming Zhou @ 2023-12-26 15:54 UTC (permalink / raw)
  To: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman
  Cc: linux-kernel, Chengming Zhou, linux-mm, Nhat Pham, Yosry Ahmed,
	Chris Li

In the !zpool_can_sleep_mapped() case, such as zsmalloc, we need to first
copy the entry->handle memory to a temporary buffer, which is allocated
with kmalloc.

We can instead reuse the per-compressor dstmem to avoid allocating on
every decompression, since it is per-CPU per-compressor and protected by
the per-CPU mutex.

Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Chris Li <chrisl@kernel.org> (Google)
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 44 ++++++++++++--------------------------------
 1 file changed, 12 insertions(+), 32 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 976f278aa507..6b872744e962 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1417,19 +1417,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct crypto_acomp_ctx *acomp_ctx;
 	struct zpool *pool = zswap_find_zpool(entry);
 	bool page_was_allocated;
-	u8 *src, *tmp = NULL;
+	u8 *src;
 	unsigned int dlen;
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
 	};
 
-	if (!zpool_can_sleep_mapped(pool)) {
-		tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
-		if (!tmp)
-			return -ENOMEM;
-	}
-
 	/* try to allocate swap cache page */
 	mpol = get_task_policy(current);
 	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
@@ -1465,15 +1459,15 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	/* decompress */
 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
 	dlen = PAGE_SIZE;
+	mutex_lock(acomp_ctx->mutex);
 
 	src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
 	if (!zpool_can_sleep_mapped(pool)) {
-		memcpy(tmp, src, entry->length);
-		src = tmp;
+		memcpy(acomp_ctx->dstmem, src, entry->length);
+		src = acomp_ctx->dstmem;
 		zpool_unmap_handle(pool, entry->handle);
 	}
 
-	mutex_lock(acomp_ctx->mutex);
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_page(&output, page, PAGE_SIZE, 0);
@@ -1482,9 +1476,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	dlen = acomp_ctx->req->dlen;
 	mutex_unlock(acomp_ctx->mutex);
 
-	if (!zpool_can_sleep_mapped(pool))
-		kfree(tmp);
-	else
+	if (zpool_can_sleep_mapped(pool))
 		zpool_unmap_handle(pool, entry->handle);
 
 	BUG_ON(ret);
@@ -1508,9 +1500,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	return ret;
 
 fail:
-	if (!zpool_can_sleep_mapped(pool))
-		kfree(tmp);
-
 	/*
 	 * If we get here because the page is already in swapcache, a
 	 * load may be happening concurrently. It is safe and okay to
@@ -1771,7 +1760,7 @@ bool zswap_load(struct folio *folio)
 	struct zswap_entry *entry;
 	struct scatterlist input, output;
 	struct crypto_acomp_ctx *acomp_ctx;
-	u8 *src, *dst, *tmp;
+	u8 *src, *dst;
 	struct zpool *zpool;
 	unsigned int dlen;
 	bool ret;
@@ -1796,26 +1785,19 @@ bool zswap_load(struct folio *folio)
 	}
 
 	zpool = zswap_find_zpool(entry);
-	if (!zpool_can_sleep_mapped(zpool)) {
-		tmp = kmalloc(entry->length, GFP_KERNEL);
-		if (!tmp) {
-			ret = false;
-			goto freeentry;
-		}
-	}
 
 	/* decompress */
 	dlen = PAGE_SIZE;
-	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
+	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+	mutex_lock(acomp_ctx->mutex);
 
+	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
 	if (!zpool_can_sleep_mapped(zpool)) {
-		memcpy(tmp, src, entry->length);
-		src = tmp;
+		memcpy(acomp_ctx->dstmem, src, entry->length);
+		src = acomp_ctx->dstmem;
 		zpool_unmap_handle(zpool, entry->handle);
 	}
 
-	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
-	mutex_lock(acomp_ctx->mutex);
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_page(&output, page, PAGE_SIZE, 0);
@@ -1826,15 +1808,13 @@ bool zswap_load(struct folio *folio)
 
 	if (zpool_can_sleep_mapped(zpool))
 		zpool_unmap_handle(zpool, entry->handle);
-	else
-		kfree(tmp);
 
 	ret = true;
 stats:
 	count_vm_event(ZSWPIN);
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPIN);
-freeentry:
+
 	spin_lock(&tree->lock);
 	if (ret && zswap_exclusive_loads_enabled) {
 		zswap_invalidate_entry(tree, entry);

-- 
b4 0.10.1



* [PATCH v4 3/6] mm/zswap: refactor out __zswap_load()
  2023-12-26 15:54 [PATCH v4 0/6] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
  2023-12-26 15:54 ` [PATCH v4 1/6] mm/zswap: change dstmem size to one page Chengming Zhou
  2023-12-26 15:54 ` [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress Chengming Zhou
@ 2023-12-26 15:54 ` Chengming Zhou
  2023-12-26 15:54 ` [PATCH v4 4/6] mm/zswap: cleanup zswap_load() Chengming Zhou
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Chengming Zhou @ 2023-12-26 15:54 UTC (permalink / raw)
  To: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman
  Cc: linux-kernel, Chengming Zhou, linux-mm, Nhat Pham, Yosry Ahmed,
	Chris Li

zswap_load() and zswap_writeback_entry() share the same code that
decompresses the data from a zswap_entry into a page, so refactor the
common part out into __zswap_load(entry, page).

Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Chris Li <chrisl@kernel.org> (Google)
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 92 ++++++++++++++++++++++----------------------------------------
 1 file changed, 32 insertions(+), 60 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 6b872744e962..3433bd6b3cef 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1392,6 +1392,35 @@ static int zswap_enabled_param_set(const char *val,
 	return ret;
 }
 
+static void __zswap_load(struct zswap_entry *entry, struct page *page)
+{
+	struct zpool *zpool = zswap_find_zpool(entry);
+	struct scatterlist input, output;
+	struct crypto_acomp_ctx *acomp_ctx;
+	u8 *src;
+
+	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+	mutex_lock(acomp_ctx->mutex);
+
+	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
+	if (!zpool_can_sleep_mapped(zpool)) {
+		memcpy(acomp_ctx->dstmem, src, entry->length);
+		src = acomp_ctx->dstmem;
+		zpool_unmap_handle(zpool, entry->handle);
+	}
+
+	sg_init_one(&input, src, entry->length);
+	sg_init_table(&output, 1);
+	sg_set_page(&output, page, PAGE_SIZE, 0);
+	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
+	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
+	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
+	mutex_unlock(acomp_ctx->mutex);
+
+	if (zpool_can_sleep_mapped(zpool))
+		zpool_unmap_handle(zpool, entry->handle);
+}
+
 /*********************************
 * writeback code
 **********************************/
@@ -1413,12 +1442,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	swp_entry_t swpentry = entry->swpentry;
 	struct page *page;
 	struct mempolicy *mpol;
-	struct scatterlist input, output;
-	struct crypto_acomp_ctx *acomp_ctx;
-	struct zpool *pool = zswap_find_zpool(entry);
 	bool page_was_allocated;
-	u8 *src;
-	unsigned int dlen;
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
@@ -1456,31 +1480,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	}
 	spin_unlock(&tree->lock);
 
-	/* decompress */
-	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
-	dlen = PAGE_SIZE;
-	mutex_lock(acomp_ctx->mutex);
-
-	src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
-	if (!zpool_can_sleep_mapped(pool)) {
-		memcpy(acomp_ctx->dstmem, src, entry->length);
-		src = acomp_ctx->dstmem;
-		zpool_unmap_handle(pool, entry->handle);
-	}
-
-	sg_init_one(&input, src, entry->length);
-	sg_init_table(&output, 1);
-	sg_set_page(&output, page, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
-	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
-	mutex_unlock(acomp_ctx->mutex);
-
-	if (zpool_can_sleep_mapped(pool))
-		zpool_unmap_handle(pool, entry->handle);
-
-	BUG_ON(ret);
-	BUG_ON(dlen != PAGE_SIZE);
+	__zswap_load(entry, page);
 
 	/* page is up to date */
 	SetPageUptodate(page);
@@ -1758,11 +1758,7 @@ bool zswap_load(struct folio *folio)
 	struct page *page = &folio->page;
 	struct zswap_tree *tree = zswap_trees[type];
 	struct zswap_entry *entry;
-	struct scatterlist input, output;
-	struct crypto_acomp_ctx *acomp_ctx;
-	u8 *src, *dst;
-	struct zpool *zpool;
-	unsigned int dlen;
+	u8 *dst;
 	bool ret;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
@@ -1784,31 +1780,7 @@ bool zswap_load(struct folio *folio)
 		goto stats;
 	}
 
-	zpool = zswap_find_zpool(entry);
-
-	/* decompress */
-	dlen = PAGE_SIZE;
-	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
-	mutex_lock(acomp_ctx->mutex);
-
-	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
-	if (!zpool_can_sleep_mapped(zpool)) {
-		memcpy(acomp_ctx->dstmem, src, entry->length);
-		src = acomp_ctx->dstmem;
-		zpool_unmap_handle(zpool, entry->handle);
-	}
-
-	sg_init_one(&input, src, entry->length);
-	sg_init_table(&output, 1);
-	sg_set_page(&output, page, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
-	if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait))
-		WARN_ON(1);
-	mutex_unlock(acomp_ctx->mutex);
-
-	if (zpool_can_sleep_mapped(zpool))
-		zpool_unmap_handle(zpool, entry->handle);
-
+	__zswap_load(entry, page);
 	ret = true;
 stats:
 	count_vm_event(ZSWPIN);

-- 
b4 0.10.1



* [PATCH v4 4/6] mm/zswap: cleanup zswap_load()
  2023-12-26 15:54 [PATCH v4 0/6] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
                   ` (2 preceding siblings ...)
  2023-12-26 15:54 ` [PATCH v4 3/6] mm/zswap: refactor out __zswap_load() Chengming Zhou
@ 2023-12-26 15:54 ` Chengming Zhou
  2023-12-26 15:54 ` [PATCH v4 5/6] mm/zswap: cleanup zswap_writeback_entry() Chengming Zhou
  2023-12-26 15:54 ` [PATCH v4 6/6] mm/zswap: change per-cpu mutex and buffer to per-acomp_ctx Chengming Zhou
  5 siblings, 0 replies; 24+ messages in thread
From: Chengming Zhou @ 2023-12-26 15:54 UTC (permalink / raw)
  To: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman
  Cc: linux-kernel, Chengming Zhou, linux-mm, Nhat Pham, Yosry Ahmed,
	Chris Li

After the common decompression part is moved into __zswap_load(), we can
clean up zswap_load() a little.

Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Chris Li <chrisl@kernel.org> (Google)
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 3433bd6b3cef..86886276cb81 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1759,7 +1759,6 @@ bool zswap_load(struct folio *folio)
 	struct zswap_tree *tree = zswap_trees[type];
 	struct zswap_entry *entry;
 	u8 *dst;
-	bool ret;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 
@@ -1776,19 +1775,16 @@ bool zswap_load(struct folio *folio)
 		dst = kmap_local_page(page);
 		zswap_fill_page(dst, entry->value);
 		kunmap_local(dst);
-		ret = true;
-		goto stats;
+	} else {
+		__zswap_load(entry, page);
 	}
 
-	__zswap_load(entry, page);
-	ret = true;
-stats:
 	count_vm_event(ZSWPIN);
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPIN);
 
 	spin_lock(&tree->lock);
-	if (ret && zswap_exclusive_loads_enabled) {
+	if (zswap_exclusive_loads_enabled) {
 		zswap_invalidate_entry(tree, entry);
 		folio_mark_dirty(folio);
 	} else if (entry->length) {
@@ -1798,7 +1794,7 @@ bool zswap_load(struct folio *folio)
 	zswap_entry_put(tree, entry);
 	spin_unlock(&tree->lock);
 
-	return ret;
+	return true;
 }
 
 void zswap_invalidate(int type, pgoff_t offset)

-- 
b4 0.10.1



* [PATCH v4 5/6] mm/zswap: cleanup zswap_writeback_entry()
  2023-12-26 15:54 [PATCH v4 0/6] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
                   ` (3 preceding siblings ...)
  2023-12-26 15:54 ` [PATCH v4 4/6] mm/zswap: cleanup zswap_load() Chengming Zhou
@ 2023-12-26 15:54 ` Chengming Zhou
  2023-12-26 15:54 ` [PATCH v4 6/6] mm/zswap: change per-cpu mutex and buffer to per-acomp_ctx Chengming Zhou
  5 siblings, 0 replies; 24+ messages in thread
From: Chengming Zhou @ 2023-12-26 15:54 UTC (permalink / raw)
  To: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman
  Cc: linux-kernel, Chengming Zhou, linux-mm, Nhat Pham, Yosry Ahmed,
	Chris Li

Likewise, after the common decompression part is moved into __zswap_load(),
we can clean up zswap_writeback_entry() a little.

Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Chris Li <chrisl@kernel.org> (Google)
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 86886276cb81..40ee9f109f98 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1443,7 +1443,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct page *page;
 	struct mempolicy *mpol;
 	bool page_was_allocated;
-	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
 	};
@@ -1452,16 +1451,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	mpol = get_task_policy(current);
 	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
 				NO_INTERLEAVE_INDEX, &page_was_allocated, true);
-	if (!page) {
-		ret = -ENOMEM;
-		goto fail;
-	}
+	if (!page)
+		return -ENOMEM;
 
-	/* Found an existing page, we raced with load/swapin */
+	/*
+	 * Found an existing page, we raced with load/swapin. We generally
+	 * writeback cold pages from zswap, and swapin means the page just
+	 * became hot. Skip this page and let the caller find another one.
+	 */
 	if (!page_was_allocated) {
 		put_page(page);
-		ret = -EEXIST;
-		goto fail;
+		return -EEXIST;
 	}
 
 	/*
@@ -1475,8 +1475,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
 		spin_unlock(&tree->lock);
 		delete_from_swap_cache(page_folio(page));
-		ret = -ENOMEM;
-		goto fail;
+		return -ENOMEM;
 	}
 	spin_unlock(&tree->lock);
 
@@ -1497,15 +1496,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	__swap_writepage(page, &wbc);
 	put_page(page);
 
-	return ret;
-
-fail:
-	/*
-	 * If we get here because the page is already in swapcache, a
-	 * load may be happening concurrently. It is safe and okay to
-	 * not free the entry. It is also okay to return !0.
-	 */
-	return ret;
+	return 0;
 }
 
 static int zswap_is_page_same_filled(void *ptr, unsigned long *value)

-- 
b4 0.10.1



* [PATCH v4 6/6] mm/zswap: change per-cpu mutex and buffer to per-acomp_ctx
  2023-12-26 15:54 [PATCH v4 0/6] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
                   ` (4 preceding siblings ...)
  2023-12-26 15:54 ` [PATCH v4 5/6] mm/zswap: cleanup zswap_writeback_entry() Chengming Zhou
@ 2023-12-26 15:54 ` Chengming Zhou
  2023-12-26 19:08   ` Nhat Pham
  5 siblings, 1 reply; 24+ messages in thread
From: Chengming Zhou @ 2023-12-26 15:54 UTC (permalink / raw)
  To: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman
  Cc: linux-kernel, Chengming Zhou, linux-mm, Nhat Pham, Yosry Ahmed,
	Chris Li

First of all, rename the acomp_ctx->dstmem field to buffer, since we
are now using it for purposes other than compression.

Then change the per-CPU mutex and buffer to be per-acomp_ctx, since
they belong to the acomp_ctx and are necessary parts whenever it is
used in the compress/decompress contexts.

With that, the old per-CPU mutex and dstmem can be removed.

Acked-by: Chris Li <chrisl@kernel.org> (Google)
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/cpuhotplug.h |  1 -
 mm/zswap.c                 | 98 +++++++++++++---------------------------------
 2 files changed, 28 insertions(+), 71 deletions(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index efc0c0b07efb..c3e06e21766a 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -124,7 +124,6 @@ enum cpuhp_state {
 	CPUHP_ARM_BL_PREPARE,
 	CPUHP_TRACE_RB_PREPARE,
 	CPUHP_MM_ZS_PREPARE,
-	CPUHP_MM_ZSWP_MEM_PREPARE,
 	CPUHP_MM_ZSWP_POOL_PREPARE,
 	CPUHP_KVM_PPC_BOOK3S_PREPARE,
 	CPUHP_ZCOMP_PREPARE,
diff --git a/mm/zswap.c b/mm/zswap.c
index 40ee9f109f98..8014509736ad 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -166,8 +166,8 @@ struct crypto_acomp_ctx {
 	struct crypto_acomp *acomp;
 	struct acomp_req *req;
 	struct crypto_wait wait;
-	u8 *dstmem;
-	struct mutex *mutex;
+	u8 *buffer;
+	struct mutex mutex;
 };
 
 /*
@@ -694,63 +694,26 @@ static void zswap_alloc_shrinker(struct zswap_pool *pool)
 /*********************************
 * per-cpu code
 **********************************/
-static DEFINE_PER_CPU(u8 *, zswap_dstmem);
-/*
- * If users dynamically change the zpool type and compressor at runtime, i.e.
- * zswap is running, zswap can have more than one zpool on one cpu, but they
- * are sharing dtsmem. So we need this mutex to be per-cpu.
- */
-static DEFINE_PER_CPU(struct mutex *, zswap_mutex);
-
-static int zswap_dstmem_prepare(unsigned int cpu)
-{
-	struct mutex *mutex;
-	u8 *dst;
-
-	dst = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
-	if (!dst)
-		return -ENOMEM;
-
-	mutex = kmalloc_node(sizeof(*mutex), GFP_KERNEL, cpu_to_node(cpu));
-	if (!mutex) {
-		kfree(dst);
-		return -ENOMEM;
-	}
-
-	mutex_init(mutex);
-	per_cpu(zswap_dstmem, cpu) = dst;
-	per_cpu(zswap_mutex, cpu) = mutex;
-	return 0;
-}
-
-static int zswap_dstmem_dead(unsigned int cpu)
-{
-	struct mutex *mutex;
-	u8 *dst;
-
-	mutex = per_cpu(zswap_mutex, cpu);
-	kfree(mutex);
-	per_cpu(zswap_mutex, cpu) = NULL;
-
-	dst = per_cpu(zswap_dstmem, cpu);
-	kfree(dst);
-	per_cpu(zswap_dstmem, cpu) = NULL;
-
-	return 0;
-}
-
 static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 {
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	struct crypto_acomp *acomp;
 	struct acomp_req *req;
+	int ret;
+
+	mutex_init(&acomp_ctx->mutex);
+
+	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->buffer)
+		return -ENOMEM;
 
 	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
 	if (IS_ERR(acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
 				pool->tfm_name, PTR_ERR(acomp));
-		return PTR_ERR(acomp);
+		ret = PTR_ERR(acomp);
+		goto acomp_fail;
 	}
 	acomp_ctx->acomp = acomp;
 
@@ -758,8 +721,8 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	if (!req) {
 		pr_err("could not alloc crypto acomp_request %s\n",
 		       pool->tfm_name);
-		crypto_free_acomp(acomp_ctx->acomp);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto req_fail;
 	}
 	acomp_ctx->req = req;
 
@@ -772,10 +735,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
-	acomp_ctx->mutex = per_cpu(zswap_mutex, cpu);
-	acomp_ctx->dstmem = per_cpu(zswap_dstmem, cpu);
-
 	return 0;
+
+req_fail:
+	crypto_free_acomp(acomp_ctx->acomp);
+acomp_fail:
+	kfree(acomp_ctx->buffer);
+	return ret;
 }
 
 static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
@@ -788,6 +754,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 			acomp_request_free(acomp_ctx->req);
 		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 			crypto_free_acomp(acomp_ctx->acomp);
+		kfree(acomp_ctx->buffer);
 	}
 
 	return 0;
@@ -1400,12 +1367,12 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
 	u8 *src;
 
 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
-	mutex_lock(acomp_ctx->mutex);
+	mutex_lock(&acomp_ctx->mutex);
 
 	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
 	if (!zpool_can_sleep_mapped(zpool)) {
-		memcpy(acomp_ctx->dstmem, src, entry->length);
-		src = acomp_ctx->dstmem;
+		memcpy(acomp_ctx->buffer, src, entry->length);
+		src = acomp_ctx->buffer;
 		zpool_unmap_handle(zpool, entry->handle);
 	}
 
@@ -1415,7 +1382,7 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
 	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
 	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
 	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
-	mutex_unlock(acomp_ctx->mutex);
+	mutex_unlock(&acomp_ctx->mutex);
 
 	if (zpool_can_sleep_mapped(zpool))
 		zpool_unmap_handle(zpool, entry->handle);
@@ -1636,9 +1603,9 @@ bool zswap_store(struct folio *folio)
 	/* compress */
 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
 
-	mutex_lock(acomp_ctx->mutex);
+	mutex_lock(&acomp_ctx->mutex);
 
-	dst = acomp_ctx->dstmem;
+	dst = acomp_ctx->buffer;
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
@@ -1681,7 +1648,7 @@ bool zswap_store(struct folio *folio)
 	buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
 	memcpy(buf, dst, dlen);
 	zpool_unmap_handle(zpool, handle);
-	mutex_unlock(acomp_ctx->mutex);
+	mutex_unlock(&acomp_ctx->mutex);
 
 	/* populate entry */
 	entry->swpentry = swp_entry(type, offset);
@@ -1724,7 +1691,7 @@ bool zswap_store(struct folio *folio)
 	return true;
 
 put_dstmem:
-	mutex_unlock(acomp_ctx->mutex);
+	mutex_unlock(&acomp_ctx->mutex);
 put_pool:
 	zswap_pool_put(entry->pool);
 freepage:
@@ -1899,13 +1866,6 @@ static int zswap_setup(void)
 		goto cache_fail;
 	}
 
-	ret = cpuhp_setup_state(CPUHP_MM_ZSWP_MEM_PREPARE, "mm/zswap:prepare",
-				zswap_dstmem_prepare, zswap_dstmem_dead);
-	if (ret) {
-		pr_err("dstmem alloc failed\n");
-		goto dstmem_fail;
-	}
-
 	ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
 				      "mm/zswap_pool:prepare",
 				      zswap_cpu_comp_prepare,
@@ -1937,8 +1897,6 @@ static int zswap_setup(void)
 	if (pool)
 		zswap_pool_destroy(pool);
 hp_fail:
-	cpuhp_remove_state(CPUHP_MM_ZSWP_MEM_PREPARE);
-dstmem_fail:
 	kmem_cache_destroy(zswap_entry_cache);
 cache_fail:
 	/* if built-in, we aren't unloaded on failure; don't allow use */

-- 
b4 0.10.1



* Re: [PATCH v4 6/6] mm/zswap: change per-cpu mutex and buffer to per-acomp_ctx
  2023-12-26 15:54 ` [PATCH v4 6/6] mm/zswap: change per-cpu mutex and buffer to per-acomp_ctx Chengming Zhou
@ 2023-12-26 19:08   ` Nhat Pham
  0 siblings, 0 replies; 24+ messages in thread
From: Nhat Pham @ 2023-12-26 19:08 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Chris Li, Yosry Ahmed, Dan Streetman, linux-kernel, linux-mm,
	Chris Li

On Tue, Dec 26, 2023 at 7:55 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> First of all, we need to rename acomp_ctx->dstmem field to buffer,
> since we are now using for purposes other than compression.
>
> Then we change per-cpu mutex and buffer to per-acomp_ctx, since
> them belong to the acomp_ctx and are necessary parts when used
> in the compress/decompress contexts.
>
> So we can remove the old per-cpu mutex and dstmem.
>
> Acked-by: Chris Li <chrisl@kernel.org> (Google)
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Fairly straightforward, and it actually deletes some code :) LGTM.
Reviewed-by: Nhat Pham <nphamcs@gmail.com>

> ---
>  include/linux/cpuhotplug.h |  1 -
>  mm/zswap.c                 | 98 +++++++++++++---------------------------------
>  2 files changed, 28 insertions(+), 71 deletions(-)
>
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index efc0c0b07efb..c3e06e21766a 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -124,7 +124,6 @@ enum cpuhp_state {
>         CPUHP_ARM_BL_PREPARE,
>         CPUHP_TRACE_RB_PREPARE,
>         CPUHP_MM_ZS_PREPARE,
> -       CPUHP_MM_ZSWP_MEM_PREPARE,
>         CPUHP_MM_ZSWP_POOL_PREPARE,
>         CPUHP_KVM_PPC_BOOK3S_PREPARE,
>         CPUHP_ZCOMP_PREPARE,
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 40ee9f109f98..8014509736ad 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -166,8 +166,8 @@ struct crypto_acomp_ctx {
>         struct crypto_acomp *acomp;
>         struct acomp_req *req;
>         struct crypto_wait wait;
> -       u8 *dstmem;
> -       struct mutex *mutex;
> +       u8 *buffer;
> +       struct mutex mutex;
>  };
>
>  /*
> @@ -694,63 +694,26 @@ static void zswap_alloc_shrinker(struct zswap_pool *pool)
>  /*********************************
>  * per-cpu code
>  **********************************/
> -static DEFINE_PER_CPU(u8 *, zswap_dstmem);
> -/*
> - * If users dynamically change the zpool type and compressor at runtime, i.e.
> - * zswap is running, zswap can have more than one zpool on one cpu, but they
> - * are sharing dtsmem. So we need this mutex to be per-cpu.
> - */
> -static DEFINE_PER_CPU(struct mutex *, zswap_mutex);
> -
> -static int zswap_dstmem_prepare(unsigned int cpu)
> -{
> -       struct mutex *mutex;
> -       u8 *dst;
> -
> -       dst = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> -       if (!dst)
> -               return -ENOMEM;
> -
> -       mutex = kmalloc_node(sizeof(*mutex), GFP_KERNEL, cpu_to_node(cpu));
> -       if (!mutex) {
> -               kfree(dst);
> -               return -ENOMEM;
> -       }
> -
> -       mutex_init(mutex);
> -       per_cpu(zswap_dstmem, cpu) = dst;
> -       per_cpu(zswap_mutex, cpu) = mutex;
> -       return 0;
> -}
> -
> -static int zswap_dstmem_dead(unsigned int cpu)
> -{
> -       struct mutex *mutex;
> -       u8 *dst;
> -
> -       mutex = per_cpu(zswap_mutex, cpu);
> -       kfree(mutex);
> -       per_cpu(zswap_mutex, cpu) = NULL;
> -
> -       dst = per_cpu(zswap_dstmem, cpu);
> -       kfree(dst);
> -       per_cpu(zswap_dstmem, cpu) = NULL;
> -
> -       return 0;
> -}
> -
>  static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  {
>         struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
>         struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>         struct crypto_acomp *acomp;
>         struct acomp_req *req;
> +       int ret;
> +
> +       mutex_init(&acomp_ctx->mutex);
> +
> +       acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> +       if (!acomp_ctx->buffer)
> +               return -ENOMEM;
>
>         acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
>         if (IS_ERR(acomp)) {
>                 pr_err("could not alloc crypto acomp %s : %ld\n",
>                                 pool->tfm_name, PTR_ERR(acomp));
> -               return PTR_ERR(acomp);
> +               ret = PTR_ERR(acomp);
> +               goto acomp_fail;
>         }
>         acomp_ctx->acomp = acomp;
>
> @@ -758,8 +721,8 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         if (!req) {
>                 pr_err("could not alloc crypto acomp_request %s\n",
>                        pool->tfm_name);
> -               crypto_free_acomp(acomp_ctx->acomp);
> -               return -ENOMEM;
> +               ret = -ENOMEM;
> +               goto req_fail;
>         }
>         acomp_ctx->req = req;
>
> @@ -772,10 +735,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
>                                    crypto_req_done, &acomp_ctx->wait);
>
> -       acomp_ctx->mutex = per_cpu(zswap_mutex, cpu);
> -       acomp_ctx->dstmem = per_cpu(zswap_dstmem, cpu);
> -
>         return 0;
> +
> +req_fail:
> +       crypto_free_acomp(acomp_ctx->acomp);
> +acomp_fail:
> +       kfree(acomp_ctx->buffer);
> +       return ret;
>  }
>
>  static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> @@ -788,6 +754,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>                         acomp_request_free(acomp_ctx->req);
>                 if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
>                         crypto_free_acomp(acomp_ctx->acomp);
> +               kfree(acomp_ctx->buffer);
>         }
>
>         return 0;
> @@ -1400,12 +1367,12 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
>         u8 *src;
>
>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> -       mutex_lock(acomp_ctx->mutex);
> +       mutex_lock(&acomp_ctx->mutex);
>
>         src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>         if (!zpool_can_sleep_mapped(zpool)) {
> -               memcpy(acomp_ctx->dstmem, src, entry->length);
> -               src = acomp_ctx->dstmem;
> +               memcpy(acomp_ctx->buffer, src, entry->length);
> +               src = acomp_ctx->buffer;
>                 zpool_unmap_handle(zpool, entry->handle);
>         }
>
> @@ -1415,7 +1382,7 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
>         acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
>         BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
>         BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> -       mutex_unlock(acomp_ctx->mutex);
> +       mutex_unlock(&acomp_ctx->mutex);
>
>         if (zpool_can_sleep_mapped(zpool))
>                 zpool_unmap_handle(zpool, entry->handle);
> @@ -1636,9 +1603,9 @@ bool zswap_store(struct folio *folio)
>         /* compress */
>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>
> -       mutex_lock(acomp_ctx->mutex);
> +       mutex_lock(&acomp_ctx->mutex);
>
> -       dst = acomp_ctx->dstmem;
> +       dst = acomp_ctx->buffer;
>         sg_init_table(&input, 1);
>         sg_set_page(&input, page, PAGE_SIZE, 0);
>
> @@ -1681,7 +1648,7 @@ bool zswap_store(struct folio *folio)
>         buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
>         memcpy(buf, dst, dlen);
>         zpool_unmap_handle(zpool, handle);
> -       mutex_unlock(acomp_ctx->mutex);
> +       mutex_unlock(&acomp_ctx->mutex);
>
>         /* populate entry */
>         entry->swpentry = swp_entry(type, offset);
> @@ -1724,7 +1691,7 @@ bool zswap_store(struct folio *folio)
>         return true;
>
>  put_dstmem:
> -       mutex_unlock(acomp_ctx->mutex);
> +       mutex_unlock(&acomp_ctx->mutex);
>  put_pool:
>         zswap_pool_put(entry->pool);
>  freepage:
> @@ -1899,13 +1866,6 @@ static int zswap_setup(void)
>                 goto cache_fail;
>         }
>
> -       ret = cpuhp_setup_state(CPUHP_MM_ZSWP_MEM_PREPARE, "mm/zswap:prepare",
> -                               zswap_dstmem_prepare, zswap_dstmem_dead);
> -       if (ret) {
> -               pr_err("dstmem alloc failed\n");
> -               goto dstmem_fail;
> -       }
> -
>         ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
>                                       "mm/zswap_pool:prepare",
>                                       zswap_cpu_comp_prepare,
> @@ -1937,8 +1897,6 @@ static int zswap_setup(void)
>         if (pool)
>                 zswap_pool_destroy(pool);
>  hp_fail:
> -       cpuhp_remove_state(CPUHP_MM_ZSWP_MEM_PREPARE);
> -dstmem_fail:
>         kmem_cache_destroy(zswap_entry_cache);
>  cache_fail:
>         /* if built-in, we aren't unloaded on failure; don't allow use */
>
> --
> b4 0.10.1



* Re: [PATCH v4 1/6] mm/zswap: change dstmem size to one page
  2023-12-26 15:54 ` [PATCH v4 1/6] mm/zswap: change dstmem size to one page Chengming Zhou
@ 2023-12-27  1:07   ` Barry Song
  2023-12-27  6:11     ` Chengming Zhou
  0 siblings, 1 reply; 24+ messages in thread
From: Barry Song @ 2023-12-27  1:07 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman, linux-kernel,
	linux-mm, Chris Li

On Wed, Dec 27, 2023 at 4:55 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> we only need at most one page when compress, and the "dlen" is also
> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> we don't wanna store the output in zswap anyway.
>
> So change it to one page, and delete the stale comment.
>
> There is no any history about the reason why we needed 2 pages, it has
> been 2 * PAGE_SIZE since the time zswap was first merged.

I remember there was an over-compression case, meaning the compressed
data can be bigger than the source data. A similar thing is also done in
zram, see drivers/block/zram/zcomp.c:

int zcomp_compress(struct zcomp_strm *zstrm,
                const void *src, unsigned int *dst_len)
{
        /*
         * Our dst memory (zstrm->buffer) is always `2 * PAGE_SIZE' sized
         * because sometimes we can endup having a bigger compressed data
         * due to various reasons: for example compression algorithms tend
         * to add some padding to the compressed buffer. Speaking of padding,
         * comp algorithm `842' pads the compressed length to multiple of 8
         * and returns -ENOSP when the dst memory is not big enough, which
         * is not something that ZRAM wants to see. We can handle the
         * `compressed_size > PAGE_SIZE' case easily in ZRAM, but when we
         * receive -ERRNO from the compressing backend we can't help it
         * anymore. To make `842' happy we need to tell the exact size of
         * the dst buffer, zram_drv will take care of the fact that
         * compressed buffer is too big.
         */
        *dst_len = PAGE_SIZE * 2;

        return crypto_comp_compress(zstrm->tfm,
                        src, PAGE_SIZE,
                        zstrm->buffer, dst_len);
}


>
> According to Yosry and Nhat, one potential reason is that we used to
> store a zswap header containing the swap entry in the compressed page
> for writeback purposes, but we don't do that anymore.
>
> This patch works good in kernel build testing even when the input data
> doesn't compress at all (i.e. dlen == PAGE_SIZE), which we can see
> from the bpftrace tool:
>
> bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'
> @[1]: 2
> @[0]: 12011430
>
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> Acked-by: Chris Li <chrisl@kernel.org> (Google)
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  mm/zswap.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 7ee54a3d8281..976f278aa507 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -707,7 +707,7 @@ static int zswap_dstmem_prepare(unsigned int cpu)
>         struct mutex *mutex;
>         u8 *dst;
>
> -       dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> +       dst = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
>         if (!dst)
>                 return -ENOMEM;
>
> @@ -1662,8 +1662,7 @@ bool zswap_store(struct folio *folio)
>         sg_init_table(&input, 1);
>         sg_set_page(&input, page, PAGE_SIZE, 0);
>
> -       /* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
> -       sg_init_one(&output, dst, PAGE_SIZE * 2);
> +       sg_init_one(&output, dst, PAGE_SIZE);
>         acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
>         /*
>          * it maybe looks a little bit silly that we send an asynchronous request,
>
> --
> b4 0.10.1
>

Thanks
Barry



* Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress
  2023-12-26 15:54 ` [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress Chengming Zhou
@ 2023-12-27  1:24   ` Barry Song
  2023-12-27  6:32     ` Chengming Zhou
  0 siblings, 1 reply; 24+ messages in thread
From: Barry Song @ 2023-12-27  1:24 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman, linux-kernel,
	linux-mm, Chris Li

On Wed, Dec 27, 2023 at 4:56 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
> copy the entry->handle memory to a temporary memory, which is allocated
> using kmalloc.
>
> Obviously we can reuse the per-compressor dstmem to avoid allocating
> every time, since it's percpu-compressor and protected in percpu mutex.

What is the benefit of this, since we are actually increasing lock
contention by reusing this buffer between multiple compression and
decompression threads?

This mainly affects zsmalloc, which can't sleep while mapped? Do we have
performance data?

And it seems this patch also negatively affects z3fold and zbud, which
actually don't need to allocate a tmp buffer.

>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Chris Li <chrisl@kernel.org> (Google)
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  mm/zswap.c | 44 ++++++++++++--------------------------------
>  1 file changed, 12 insertions(+), 32 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 976f278aa507..6b872744e962 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1417,19 +1417,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         struct crypto_acomp_ctx *acomp_ctx;
>         struct zpool *pool = zswap_find_zpool(entry);
>         bool page_was_allocated;
> -       u8 *src, *tmp = NULL;
> +       u8 *src;
>         unsigned int dlen;
>         int ret;
>         struct writeback_control wbc = {
>                 .sync_mode = WB_SYNC_NONE,
>         };
>
> -       if (!zpool_can_sleep_mapped(pool)) {
> -               tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
> -               if (!tmp)
> -                       return -ENOMEM;
> -       }
> -
>         /* try to allocate swap cache page */
>         mpol = get_task_policy(current);
>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> @@ -1465,15 +1459,15 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         /* decompress */
>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>         dlen = PAGE_SIZE;
> +       mutex_lock(acomp_ctx->mutex);
>
>         src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
>         if (!zpool_can_sleep_mapped(pool)) {
> -               memcpy(tmp, src, entry->length);
> -               src = tmp;
> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> +               src = acomp_ctx->dstmem;
>                 zpool_unmap_handle(pool, entry->handle);
>         }
>
> -       mutex_lock(acomp_ctx->mutex);
>         sg_init_one(&input, src, entry->length);
>         sg_init_table(&output, 1);
>         sg_set_page(&output, page, PAGE_SIZE, 0);
> @@ -1482,9 +1476,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         dlen = acomp_ctx->req->dlen;
>         mutex_unlock(acomp_ctx->mutex);
>
> -       if (!zpool_can_sleep_mapped(pool))
> -               kfree(tmp);
> -       else
> +       if (zpool_can_sleep_mapped(pool))
>                 zpool_unmap_handle(pool, entry->handle);
>
>         BUG_ON(ret);
> @@ -1508,9 +1500,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         return ret;
>
>  fail:
> -       if (!zpool_can_sleep_mapped(pool))
> -               kfree(tmp);
> -
>         /*
>          * If we get here because the page is already in swapcache, a
>          * load may be happening concurrently. It is safe and okay to
> @@ -1771,7 +1760,7 @@ bool zswap_load(struct folio *folio)
>         struct zswap_entry *entry;
>         struct scatterlist input, output;
>         struct crypto_acomp_ctx *acomp_ctx;
> -       u8 *src, *dst, *tmp;
> +       u8 *src, *dst;
>         struct zpool *zpool;
>         unsigned int dlen;
>         bool ret;
> @@ -1796,26 +1785,19 @@ bool zswap_load(struct folio *folio)
>         }
>
>         zpool = zswap_find_zpool(entry);
> -       if (!zpool_can_sleep_mapped(zpool)) {
> -               tmp = kmalloc(entry->length, GFP_KERNEL);
> -               if (!tmp) {
> -                       ret = false;
> -                       goto freeentry;
> -               }
> -       }
>
>         /* decompress */
>         dlen = PAGE_SIZE;
> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> +       mutex_lock(acomp_ctx->mutex);
>
> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>         if (!zpool_can_sleep_mapped(zpool)) {
> -               memcpy(tmp, src, entry->length);
> -               src = tmp;
> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> +               src = acomp_ctx->dstmem;
>                 zpool_unmap_handle(zpool, entry->handle);
>         }
>
> -       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> -       mutex_lock(acomp_ctx->mutex);
>         sg_init_one(&input, src, entry->length);
>         sg_init_table(&output, 1);
>         sg_set_page(&output, page, PAGE_SIZE, 0);
> @@ -1826,15 +1808,13 @@ bool zswap_load(struct folio *folio)
>
>         if (zpool_can_sleep_mapped(zpool))
>                 zpool_unmap_handle(zpool, entry->handle);
> -       else
> -               kfree(tmp);
>
>         ret = true;
>  stats:
>         count_vm_event(ZSWPIN);
>         if (entry->objcg)
>                 count_objcg_event(entry->objcg, ZSWPIN);
> -freeentry:
> +
>         spin_lock(&tree->lock);
>         if (ret && zswap_exclusive_loads_enabled) {
>                 zswap_invalidate_entry(tree, entry);
>
> --
> b4 0.10.1
>



* Re: [PATCH v4 1/6] mm/zswap: change dstmem size to one page
  2023-12-27  1:07   ` Barry Song
@ 2023-12-27  6:11     ` Chengming Zhou
  2023-12-27  6:32       ` Barry Song
  2023-12-27 20:58       ` Andrew Morton
  0 siblings, 2 replies; 24+ messages in thread
From: Chengming Zhou @ 2023-12-27  6:11 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman, linux-kernel,
	linux-mm, Chris Li

On 2023/12/27 09:07, Barry Song wrote:
> On Wed, Dec 27, 2023 at 4:55 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
>> we only need at most one page when compress, and the "dlen" is also
>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
>> we don't wanna store the output in zswap anyway.
>>
>> So change it to one page, and delete the stale comment.
>>
>> There is no any history about the reason why we needed 2 pages, it has
>> been 2 * PAGE_SIZE since the time zswap was first merged.
> 
> i remember there was an over-compression case,  that means the compressed
> data can be bigger than the source data. the similar thing is also done in zram
> drivers/block/zram/zcomp.c

Right, there is a buffer overflow report[1] that I just added you to (+To).

I think over-compression is all right, but buffer overflow is not acceptable,
so we should fix any buffer overflow problem IMHO. Anyway, 2 pages may be
overflowed too, just with a smaller probability, right?

Thanks.

> 
> int zcomp_compress(struct zcomp_strm *zstrm,
>                 const void *src, unsigned int *dst_len)
> {
>         /*
>          * Our dst memory (zstrm->buffer) is always `2 * PAGE_SIZE' sized
>          * because sometimes we can endup having a bigger compressed data
>          * due to various reasons: for example compression algorithms tend
>          * to add some padding to the compressed buffer. Speaking of padding,
>          * comp algorithm `842' pads the compressed length to multiple of 8
>          * and returns -ENOSP when the dst memory is not big enough, which
>          * is not something that ZRAM wants to see. We can handle the
>          * `compressed_size > PAGE_SIZE' case easily in ZRAM, but when we
>          * receive -ERRNO from the compressing backend we can't help it
>          * anymore. To make `842' happy we need to tell the exact size of
>          * the dst buffer, zram_drv will take care of the fact that
>          * compressed buffer is too big.
>          */
>         *dst_len = PAGE_SIZE * 2;
> 
>         return crypto_comp_compress(zstrm->tfm,
>                         src, PAGE_SIZE,
>                         zstrm->buffer, dst_len);
> }
> 
> 
>>
>> According to Yosry and Nhat, one potential reason is that we used to
>> store a zswap header containing the swap entry in the compressed page
>> for writeback purposes, but we don't do that anymore.
>>
>> This patch works good in kernel build testing even when the input data
>> doesn't compress at all (i.e. dlen == PAGE_SIZE), which we can see
>> from the bpftrace tool:
>>
>> bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'
>> @[1]: 2
>> @[0]: 12011430
>>
>> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
>> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
>> Acked-by: Chris Li <chrisl@kernel.org> (Google)
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  mm/zswap.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 7ee54a3d8281..976f278aa507 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -707,7 +707,7 @@ static int zswap_dstmem_prepare(unsigned int cpu)
>>         struct mutex *mutex;
>>         u8 *dst;
>>
>> -       dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
>> +       dst = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
>>         if (!dst)
>>                 return -ENOMEM;
>>
>> @@ -1662,8 +1662,7 @@ bool zswap_store(struct folio *folio)
>>         sg_init_table(&input, 1);
>>         sg_set_page(&input, page, PAGE_SIZE, 0);
>>
>> -       /* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
>> -       sg_init_one(&output, dst, PAGE_SIZE * 2);
>> +       sg_init_one(&output, dst, PAGE_SIZE);
>>         acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
>>         /*
>>          * it maybe looks a little bit silly that we send an asynchronous request,
>>
>> --
>> b4 0.10.1
>>
> 
> Thanks
> Barry



* Re: [PATCH v4 1/6] mm/zswap: change dstmem size to one page
  2023-12-27  6:11     ` Chengming Zhou
@ 2023-12-27  6:32       ` Barry Song
  2023-12-27 20:58       ` Andrew Morton
  1 sibling, 0 replies; 24+ messages in thread
From: Barry Song @ 2023-12-27  6:32 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman, linux-kernel,
	linux-mm, Chris Li

On Wed, Dec 27, 2023 at 7:11 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2023/12/27 09:07, Barry Song wrote:
> > On Wed, Dec 27, 2023 at 4:55 AM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>
> >> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> >> we only need at most one page when compress, and the "dlen" is also
> >> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> >> we don't wanna store the output in zswap anyway.
> >>
> >> So change it to one page, and delete the stale comment.
> >>
> >> There is no any history about the reason why we needed 2 pages, it has
> >> been 2 * PAGE_SIZE since the time zswap was first merged.
> >
> > i remember there was an over-compression case,  that means the compressed
> > data can be bigger than the source data. the similar thing is also done in zram
> > drivers/block/zram/zcomp.c
>
> Right, there is a buffer overflow report[1] that I just +to you.
>
> I think over-compression is all right, but buffer overflow is not acceptable,
> so we should fix any buffer overflow problem IMHO. Anyway, 2 pages maybe
> overflowed too, just with smaller probability, right?

Practically, the typical page size is 4KB or above, so we have never seen
2 pages being overflowed. We might be able to make CPU-based compression
code return early before overflowing, though that is still very tough.
But for accelerator-based compression in drivers/crypto, the only choice is
to give the DMA engine a buffer whose length is large enough - 2 * PAGE_SIZE.

So I don't think this patch is correct.

>
> Thanks.
>
> >
> > int zcomp_compress(struct zcomp_strm *zstrm,
> >                 const void *src, unsigned int *dst_len)
> > {
> >         /*
> >          * Our dst memory (zstrm->buffer) is always `2 * PAGE_SIZE' sized
> >          * because sometimes we can endup having a bigger compressed data
> >          * due to various reasons: for example compression algorithms tend
> >          * to add some padding to the compressed buffer. Speaking of padding,
> >          * comp algorithm `842' pads the compressed length to multiple of 8
> >          * and returns -ENOSP when the dst memory is not big enough, which
> >          * is not something that ZRAM wants to see. We can handle the
> >          * `compressed_size > PAGE_SIZE' case easily in ZRAM, but when we
> >          * receive -ERRNO from the compressing backend we can't help it
> >          * anymore. To make `842' happy we need to tell the exact size of
> >          * the dst buffer, zram_drv will take care of the fact that
> >          * compressed buffer is too big.
> >          */
> >         *dst_len = PAGE_SIZE * 2;
> >
> >         return crypto_comp_compress(zstrm->tfm,
> >                         src, PAGE_SIZE,
> >                         zstrm->buffer, dst_len);
> > }
> >
> >
> >>
> >> According to Yosry and Nhat, one potential reason is that we used to
> >> store a zswap header containing the swap entry in the compressed page
> >> for writeback purposes, but we don't do that anymore.
> >>
> >> This patch works well in kernel build testing even when the input data
> >> doesn't compress at all (i.e. dlen == PAGE_SIZE), which we can see
> >> from the bpftrace tool:
> >>
> >> bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'
> >> @[1]: 2
> >> @[0]: 12011430
> >>
> >> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> >> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> >> Acked-by: Chris Li <chrisl@kernel.org> (Google)
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >> ---
> >>  mm/zswap.c | 5 ++---
> >>  1 file changed, 2 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/mm/zswap.c b/mm/zswap.c
> >> index 7ee54a3d8281..976f278aa507 100644
> >> --- a/mm/zswap.c
> >> +++ b/mm/zswap.c
> >> @@ -707,7 +707,7 @@ static int zswap_dstmem_prepare(unsigned int cpu)
> >>         struct mutex *mutex;
> >>         u8 *dst;
> >>
> >> -       dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> >> +       dst = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> >>         if (!dst)
> >>                 return -ENOMEM;
> >>
> >> @@ -1662,8 +1662,7 @@ bool zswap_store(struct folio *folio)
> >>         sg_init_table(&input, 1);
> >>         sg_set_page(&input, page, PAGE_SIZE, 0);
> >>
> >> -       /* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
> >> -       sg_init_one(&output, dst, PAGE_SIZE * 2);
> >> +       sg_init_one(&output, dst, PAGE_SIZE);
> >>         acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> >>         /*
> >>          * it maybe looks a little bit silly that we send an asynchronous request,
> >>
> >> --
> >> b4 0.10.1
> >>

Thanks
Barry


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress
  2023-12-27  1:24   ` Barry Song
@ 2023-12-27  6:32     ` Chengming Zhou
  2023-12-28  8:03       ` Barry Song
  0 siblings, 1 reply; 24+ messages in thread
From: Chengming Zhou @ 2023-12-27  6:32 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman, linux-kernel,
	linux-mm, Chris Li

On 2023/12/27 09:24, Barry Song wrote:
> On Wed, Dec 27, 2023 at 4:56 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
>> copy the entry->handle memory to a temporary memory, which is allocated
>> using kmalloc.
>>
>> Obviously we can reuse the per-compressor dstmem to avoid allocating
>> every time, since it's percpu-compressor and protected in percpu mutex.
> 
> what is the benefit of this since we are actually increasing lock contention
> by reusing this buffer between multiple compression and decompression
> threads?

This mutex is already shared by all compress/decompress paths even before
the reuse optimization. I think the best way may be to use separate crypto_acomp
instances for compression and decompression.

Do you think the lock contention will increase because we now put zpool_map_handle()
and memcpy() inside the locked section? Actually, we can move zpool_map_handle() before
the locked section if needed, but that memcpy() has to stay protected by the lock.
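
To make the separate-crypto_acomp idea above a bit more concrete, here is a
purely hypothetical sketch of such a per-CPU context (the field names are
invented for illustration and are not part of this series):

	struct crypto_acomp_ctx {
		struct crypto_acomp *comp_acomp;	/* zswap_store() only */
		struct crypto_acomp *decomp_acomp;	/* load/writeback only */
		struct acomp_req *comp_req;
		struct acomp_req *decomp_req;
		struct crypto_wait comp_wait;
		struct crypto_wait decomp_wait;
		struct mutex comp_mutex;
		struct mutex decomp_mutex;
		u8 *comp_buffer;	/* dstmem for compression results */
		u8 *decomp_buffer;	/* bounce copy for !sleep_mapped pools */
	};

so that loads and stores would no longer serialize on a single per-CPU mutex.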

> 
> this mainly affects zsmalloc which can't sleep? do we have performance
> data?

Right, when I tested last time I remember there was only a very minor performance difference.
The main benefit here is to simplify the code a lot and delete one failure case.

> 
> and it seems this patch is also negatively affecting z3fold and zbud.c
> which actually don't need to allocate a tmp buffer.
> 

As for z3fold and zbud, the impact should be much smaller, since the only difference
here is that zpool_map_handle() moved into the locked section, and it could be moved out
if needed as noted above. There was also no evident performance regression in the testing.

Thanks.

>>
>> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
>> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
>> Acked-by: Chris Li <chrisl@kernel.org> (Google)
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  mm/zswap.c | 44 ++++++++++++--------------------------------
>>  1 file changed, 12 insertions(+), 32 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 976f278aa507..6b872744e962 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -1417,19 +1417,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>>         struct crypto_acomp_ctx *acomp_ctx;
>>         struct zpool *pool = zswap_find_zpool(entry);
>>         bool page_was_allocated;
>> -       u8 *src, *tmp = NULL;
>> +       u8 *src;
>>         unsigned int dlen;
>>         int ret;
>>         struct writeback_control wbc = {
>>                 .sync_mode = WB_SYNC_NONE,
>>         };
>>
>> -       if (!zpool_can_sleep_mapped(pool)) {
>> -               tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> -               if (!tmp)
>> -                       return -ENOMEM;
>> -       }
>> -
>>         /* try to allocate swap cache page */
>>         mpol = get_task_policy(current);
>>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
>> @@ -1465,15 +1459,15 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>>         /* decompress */
>>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>>         dlen = PAGE_SIZE;
>> +       mutex_lock(acomp_ctx->mutex);
>>
>>         src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
>>         if (!zpool_can_sleep_mapped(pool)) {
>> -               memcpy(tmp, src, entry->length);
>> -               src = tmp;
>> +               memcpy(acomp_ctx->dstmem, src, entry->length);
>> +               src = acomp_ctx->dstmem;
>>                 zpool_unmap_handle(pool, entry->handle);
>>         }
>>
>> -       mutex_lock(acomp_ctx->mutex);
>>         sg_init_one(&input, src, entry->length);
>>         sg_init_table(&output, 1);
>>         sg_set_page(&output, page, PAGE_SIZE, 0);
>> @@ -1482,9 +1476,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>>         dlen = acomp_ctx->req->dlen;
>>         mutex_unlock(acomp_ctx->mutex);
>>
>> -       if (!zpool_can_sleep_mapped(pool))
>> -               kfree(tmp);
>> -       else
>> +       if (zpool_can_sleep_mapped(pool))
>>                 zpool_unmap_handle(pool, entry->handle);
>>
>>         BUG_ON(ret);
>> @@ -1508,9 +1500,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>>         return ret;
>>
>>  fail:
>> -       if (!zpool_can_sleep_mapped(pool))
>> -               kfree(tmp);
>> -
>>         /*
>>          * If we get here because the page is already in swapcache, a
>>          * load may be happening concurrently. It is safe and okay to
>> @@ -1771,7 +1760,7 @@ bool zswap_load(struct folio *folio)
>>         struct zswap_entry *entry;
>>         struct scatterlist input, output;
>>         struct crypto_acomp_ctx *acomp_ctx;
>> -       u8 *src, *dst, *tmp;
>> +       u8 *src, *dst;
>>         struct zpool *zpool;
>>         unsigned int dlen;
>>         bool ret;
>> @@ -1796,26 +1785,19 @@ bool zswap_load(struct folio *folio)
>>         }
>>
>>         zpool = zswap_find_zpool(entry);
>> -       if (!zpool_can_sleep_mapped(zpool)) {
>> -               tmp = kmalloc(entry->length, GFP_KERNEL);
>> -               if (!tmp) {
>> -                       ret = false;
>> -                       goto freeentry;
>> -               }
>> -       }
>>
>>         /* decompress */
>>         dlen = PAGE_SIZE;
>> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>> +       mutex_lock(acomp_ctx->mutex);
>>
>> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>>         if (!zpool_can_sleep_mapped(zpool)) {
>> -               memcpy(tmp, src, entry->length);
>> -               src = tmp;
>> +               memcpy(acomp_ctx->dstmem, src, entry->length);
>> +               src = acomp_ctx->dstmem;
>>                 zpool_unmap_handle(zpool, entry->handle);
>>         }
>>
>> -       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>> -       mutex_lock(acomp_ctx->mutex);
>>         sg_init_one(&input, src, entry->length);
>>         sg_init_table(&output, 1);
>>         sg_set_page(&output, page, PAGE_SIZE, 0);
>> @@ -1826,15 +1808,13 @@ bool zswap_load(struct folio *folio)
>>
>>         if (zpool_can_sleep_mapped(zpool))
>>                 zpool_unmap_handle(zpool, entry->handle);
>> -       else
>> -               kfree(tmp);
>>
>>         ret = true;
>>  stats:
>>         count_vm_event(ZSWPIN);
>>         if (entry->objcg)
>>                 count_objcg_event(entry->objcg, ZSWPIN);
>> -freeentry:
>> +
>>         spin_lock(&tree->lock);
>>         if (ret && zswap_exclusive_loads_enabled) {
>>                 zswap_invalidate_entry(tree, entry);
>>
>> --
>> b4 0.10.1
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 1/6] mm/zswap: change dstmem size to one page
  2023-12-27  6:11     ` Chengming Zhou
  2023-12-27  6:32       ` Barry Song
@ 2023-12-27 20:58       ` Andrew Morton
  2023-12-27 23:21         ` Nhat Pham
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2023-12-27 20:58 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Barry Song, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman, linux-kernel,
	linux-mm, Chris Li

On Wed, 27 Dec 2023 14:11:06 +0800 Chengming Zhou <zhouchengming@bytedance.com> wrote:

> > i remember there was an over-compression case,  that means the compressed
> > data can be bigger than the source data. the similar thing is also done in zram
> > drivers/block/zram/zcomp.c
> 
> Right, there is a buffer overflow report[1] that I just +to you.

What does "[1]" refer to?  Is there a bug report about this series?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 1/6] mm/zswap: change dstmem size to one page
  2023-12-27 20:58       ` Andrew Morton
@ 2023-12-27 23:21         ` Nhat Pham
  2023-12-28  6:41           ` Chengming Zhou
  0 siblings, 1 reply; 24+ messages in thread
From: Nhat Pham @ 2023-12-27 23:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chengming Zhou, Barry Song, Seth Jennings, Johannes Weiner,
	Vitaly Wool, Chris Li, Yosry Ahmed, Dan Streetman, linux-kernel,
	linux-mm, Chris Li

On Wed, Dec 27, 2023 at 12:58 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Wed, 27 Dec 2023 14:11:06 +0800 Chengming Zhou <zhouchengming@bytedance.com> wrote:
>
> > > i remember there was an over-compression case,  that means the compressed
> > > data can be bigger than the source data. the similar thing is also done in zram
> > > drivers/block/zram/zcomp.c
> >
> > Right, there is a buffer overflow report[1] that I just +to you.
>
> What does "[1]" refer to?  Is there a bug report about this series?

I think Chengming was referring to this:

https://lore.kernel.org/lkml/0000000000000b05cd060d6b5511@google.com/

Syzkaller/syzbot found an edge case where the page's "compressed" form
was larger than one page, which tripped up the compression code (since
we reduced the compression buffer size to 1 page here).


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 1/6] mm/zswap: change dstmem size to one page
  2023-12-27 23:21         ` Nhat Pham
@ 2023-12-28  6:41           ` Chengming Zhou
  0 siblings, 0 replies; 24+ messages in thread
From: Chengming Zhou @ 2023-12-28  6:41 UTC (permalink / raw)
  To: Nhat Pham, Andrew Morton
  Cc: Barry Song, Seth Jennings, Johannes Weiner, Vitaly Wool, Chris Li,
	Yosry Ahmed, Dan Streetman, linux-kernel, linux-mm, Chris Li

On 2023/12/28 07:21, Nhat Pham wrote:
> On Wed, Dec 27, 2023 at 12:58 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
>>
>> On Wed, 27 Dec 2023 14:11:06 +0800 Chengming Zhou <zhouchengming@bytedance.com> wrote:
>>
>>>> i remember there was an over-compression case,  that means the compressed
>>>> data can be bigger than the source data. the similar thing is also done in zram
>>>> drivers/block/zram/zcomp.c
>>>
>>> Right, there is a buffer overflow report[1] that I just +to you.
>>
>> What does "[1]" refer to?  Is there a bug report about this series?
> 
> I think Chengming was referring to this:
> 
> https://lore.kernel.org/lkml/0000000000000b05cd060d6b5511@google.com/
> 
> Syzkaller/syzbot found an edge case where the page's "compressed" form
> was larger than one page, which tripped up the compression code (since
> we reduced the compression buffer size to 1 page here).

Right, thanks Nhat!

The reported bug can be fixed by a patch I posted:
https://lore.kernel.org/all/20231227093523.2735484-1-chengming.zhou@linux.dev/

Although this bug is fixed, we still have to revert the first patch and go back
to a 2-page buffer in zswap, since not all compressor drivers respect the
buffer size we pass in, and they may overflow our output buffer.

Barry Song has explained the background in:
https://lore.kernel.org/all/CAGsJ_4xuuaPnQzkkQVaRyZL6ZdwkiQ_B7_c2baNaCKVg_O7ZQA@mail.gmail.com/

I will send an updated series later.

Thanks!


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress
  2023-12-27  6:32     ` Chengming Zhou
@ 2023-12-28  8:03       ` Barry Song
  2023-12-28  8:23         ` Chengming Zhou
  2023-12-28  9:49         ` Herbert Xu
  0 siblings, 2 replies; 24+ messages in thread
From: Barry Song @ 2023-12-28  8:03 UTC (permalink / raw)
  To: Chengming Zhou, Herbert Xu
  Cc: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman, linux-kernel,
	linux-mm, Chris Li

On Wed, Dec 27, 2023 at 7:32 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2023/12/27 09:24, Barry Song wrote:
> > On Wed, Dec 27, 2023 at 4:56 AM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>
> >> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
> >> copy the entry->handle memory to a temporary memory, which is allocated
> >> using kmalloc.
> >>
> >> Obviously we can reuse the per-compressor dstmem to avoid allocating
> >> every time, since it's percpu-compressor and protected in percpu mutex.
> >
> > what is the benefit of this since we are actually increasing lock contention
> > by reusing this buffer between multiple compression and decompression
> > threads?
>
> This mutex is already shared by all compress/decompress paths even before
> the reuse optimization. I think the best way may be to use separate crypto_acomp
> instances for compression and decompression.
>
> Do you think the lock contention will increase because we now put zpool_map_handle()
> and memcpy() inside the locked section? Actually, we can move zpool_map_handle() before
> the locked section if needed, but that memcpy() has to stay protected by the lock.
>
> >
> > this mainly affects zsmalloc which can't sleep? do we have performance
> > data?
>
> Right, when I tested last time I remember there was only a very minor performance difference.
> The main benefit here is to simplify the code a lot and delete one failure case.

ok.

For the majority of hardware, people are using CPU-based
compression/decompression,
there is no chance they will sleep. Thus, all
compression/decompression can be done
in a zpool_map section, there is *NO* need to copy at all! Only for
those hardware which
can provide a HW-accelerator to offload CPU, crypto will actually wait
for completion by

static inline int crypto_wait_req(int err, struct crypto_wait *wait)
{
        switch (err) {
        case -EINPROGRESS:
        case -EBUSY:
                wait_for_completion(&wait->completion);
                reinit_completion(&wait->completion);
                err = wait->err;
                break;
        }

        return err;
}

for CPU-based alg, we have completed the compr/decompr within
crypto_acomp_decompress()
synchronously. they won't return EINPROGRESS, EBUSY.
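
For context, the caller side in zswap follows roughly this pattern (a sketch
of what mm/zswap.c does, not a verbatim quote):

	/*
	 * Sync (CPU-based) backends return 0 or -errno right here; only
	 * async accelerator drivers return -EINPROGRESS/-EBUSY and signal
	 * the wait on completion.
	 */
	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
			      &acomp_ctx->wait);
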

The problem is that crypto_acomp won't expose this information to its
users. if it does,
we can use this info, we will totally avoid the code of copying
zsmalloc's data to a tmp
buffer for the most majority users of zswap.

But I am not sure if we can find a way to convince Herbert(+To)  :-)

>
> >
> > and it seems this patch is also negatively affecting z3fold and zbud.c
> > which actually don't need to allocate a tmp buffer.
> >
>
> As for z3fold and zbud, the impact should be much smaller, since the only difference
> here is that zpool_map_handle() moved into the locked section, and it could be moved out
> if needed as noted above. There was also no evident performance regression in the testing.
>
> Thanks.
>
> >>
> >> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> >> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> >> Acked-by: Chris Li <chrisl@kernel.org> (Google)
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >> ---
> >>  mm/zswap.c | 44 ++++++++++++--------------------------------
> >>  1 file changed, 12 insertions(+), 32 deletions(-)
> >>
> >> diff --git a/mm/zswap.c b/mm/zswap.c
> >> index 976f278aa507..6b872744e962 100644
> >> --- a/mm/zswap.c
> >> +++ b/mm/zswap.c
> >> @@ -1417,19 +1417,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >>         struct crypto_acomp_ctx *acomp_ctx;
> >>         struct zpool *pool = zswap_find_zpool(entry);
> >>         bool page_was_allocated;
> >> -       u8 *src, *tmp = NULL;
> >> +       u8 *src;
> >>         unsigned int dlen;
> >>         int ret;
> >>         struct writeback_control wbc = {
> >>                 .sync_mode = WB_SYNC_NONE,
> >>         };
> >>
> >> -       if (!zpool_can_sleep_mapped(pool)) {
> >> -               tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
> >> -               if (!tmp)
> >> -                       return -ENOMEM;
> >> -       }
> >> -
> >>         /* try to allocate swap cache page */
> >>         mpol = get_task_policy(current);
> >>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> >> @@ -1465,15 +1459,15 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >>         /* decompress */
> >>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >>         dlen = PAGE_SIZE;
> >> +       mutex_lock(acomp_ctx->mutex);
> >>
> >>         src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
> >>         if (!zpool_can_sleep_mapped(pool)) {
> >> -               memcpy(tmp, src, entry->length);
> >> -               src = tmp;
> >> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> >> +               src = acomp_ctx->dstmem;
> >>                 zpool_unmap_handle(pool, entry->handle);
> >>         }
> >>
> >> -       mutex_lock(acomp_ctx->mutex);
> >>         sg_init_one(&input, src, entry->length);
> >>         sg_init_table(&output, 1);
> >>         sg_set_page(&output, page, PAGE_SIZE, 0);
> >> @@ -1482,9 +1476,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >>         dlen = acomp_ctx->req->dlen;
> >>         mutex_unlock(acomp_ctx->mutex);
> >>
> >> -       if (!zpool_can_sleep_mapped(pool))
> >> -               kfree(tmp);
> >> -       else
> >> +       if (zpool_can_sleep_mapped(pool))
> >>                 zpool_unmap_handle(pool, entry->handle);
> >>
> >>         BUG_ON(ret);
> >> @@ -1508,9 +1500,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >>         return ret;
> >>
> >>  fail:
> >> -       if (!zpool_can_sleep_mapped(pool))
> >> -               kfree(tmp);
> >> -
> >>         /*
> >>          * If we get here because the page is already in swapcache, a
> >>          * load may be happening concurrently. It is safe and okay to
> >> @@ -1771,7 +1760,7 @@ bool zswap_load(struct folio *folio)
> >>         struct zswap_entry *entry;
> >>         struct scatterlist input, output;
> >>         struct crypto_acomp_ctx *acomp_ctx;
> >> -       u8 *src, *dst, *tmp;
> >> +       u8 *src, *dst;
> >>         struct zpool *zpool;
> >>         unsigned int dlen;
> >>         bool ret;
> >> @@ -1796,26 +1785,19 @@ bool zswap_load(struct folio *folio)
> >>         }
> >>
> >>         zpool = zswap_find_zpool(entry);
> >> -       if (!zpool_can_sleep_mapped(zpool)) {
> >> -               tmp = kmalloc(entry->length, GFP_KERNEL);
> >> -               if (!tmp) {
> >> -                       ret = false;
> >> -                       goto freeentry;
> >> -               }
> >> -       }
> >>
> >>         /* decompress */
> >>         dlen = PAGE_SIZE;
> >> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> >> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >> +       mutex_lock(acomp_ctx->mutex);
> >>
> >> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> >>         if (!zpool_can_sleep_mapped(zpool)) {
> >> -               memcpy(tmp, src, entry->length);
> >> -               src = tmp;
> >> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> >> +               src = acomp_ctx->dstmem;
> >>                 zpool_unmap_handle(zpool, entry->handle);
> >>         }
> >>
> >> -       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >> -       mutex_lock(acomp_ctx->mutex);
> >>         sg_init_one(&input, src, entry->length);
> >>         sg_init_table(&output, 1);
> >>         sg_set_page(&output, page, PAGE_SIZE, 0);
> >> @@ -1826,15 +1808,13 @@ bool zswap_load(struct folio *folio)
> >>
> >>         if (zpool_can_sleep_mapped(zpool))
> >>                 zpool_unmap_handle(zpool, entry->handle);
> >> -       else
> >> -               kfree(tmp);
> >>
> >>         ret = true;
> >>  stats:
> >>         count_vm_event(ZSWPIN);
> >>         if (entry->objcg)
> >>                 count_objcg_event(entry->objcg, ZSWPIN);
> >> -freeentry:
> >> +
> >>         spin_lock(&tree->lock);
> >>         if (ret && zswap_exclusive_loads_enabled) {
> >>                 zswap_invalidate_entry(tree, entry);
> >>
> >> --
> >> b4 0.10.1
> >>

Thanks
Barry


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress
  2023-12-28  8:03       ` Barry Song
@ 2023-12-28  8:23         ` Chengming Zhou
  2023-12-28  9:49         ` Herbert Xu
  1 sibling, 0 replies; 24+ messages in thread
From: Chengming Zhou @ 2023-12-28  8:23 UTC (permalink / raw)
  To: Barry Song, Herbert Xu
  Cc: Andrew Morton, Seth Jennings, Johannes Weiner, Vitaly Wool,
	Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman, linux-kernel,
	linux-mm, Chris Li

On 2023/12/28 16:03, Barry Song wrote:
> On Wed, Dec 27, 2023 at 7:32 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> On 2023/12/27 09:24, Barry Song wrote:
>>> On Wed, Dec 27, 2023 at 4:56 AM Chengming Zhou
>>> <zhouchengming@bytedance.com> wrote:
>>>>
>>>> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
>>>> copy the entry->handle memory to a temporary memory, which is allocated
>>>> using kmalloc.
>>>>
>>>> Obviously we can reuse the per-compressor dstmem to avoid allocating
>>>> every time, since it's percpu-compressor and protected in percpu mutex.
>>>
>>> what is the benefit of this since we are actually increasing lock contention
>>> by reusing this buffer between multiple compression and decompression
>>> threads?
>>
>> This mutex is already shared by all compress/decompress paths even before
>> the reuse optimization. I think the best way may be to use separate crypto_acomp
>> instances for compression and decompression.
>>
>> Do you think the lock contention will increase because we now put zpool_map_handle()
>> and memcpy() inside the locked section? Actually, we can move zpool_map_handle() before
>> the locked section if needed, but that memcpy() has to stay protected by the lock.
>>
>>>
>>> this mainly affects zsmalloc which can't sleep? do we have performance
>>> data?
>>
>> Right, when I tested last time I remember there was only a very minor performance difference.
>> The main benefit here is to simplify the code a lot and delete one failure case.
> 
> ok.
> 
> For the majority of hardware, people are using CPU-based
> compression/decompression,
> there is no chance they will sleep. Thus, all
> compression/decompression can be done
> in a zpool_map section, there is *NO* need to copy at all! Only for

Yes, very good for zsmalloc.

> those hardware which
> can provide a HW-accelerator to offload CPU, crypto will actually wait
> for completion by
> 
> static inline int crypto_wait_req(int err, struct crypto_wait *wait)
> {
>         switch (err) {
>         case -EINPROGRESS:
>         case -EBUSY:
>                 wait_for_completion(&wait->completion);
>                 reinit_completion(&wait->completion);
>                 err = wait->err;
>                 break;
>         }
> 
>         return err;
> }
> 
> for CPU-based alg, we have completed the compr/decompr within
> crypto_acomp_decompress()
> synchronously. they won't return EINPROGRESS, EBUSY.

Ok, this is useful to know.

> 
> The problem is that crypto_acomp won't expose this information to its
> users. if it does,
> we can use this info, we will totally avoid the code of copying
> zsmalloc's data to a tmp
> buffer for the most majority users of zswap.

Agreed, I think it's worthwhile to export this, so zsmalloc users don't need to
prepare a temporary buffer and copy in the majority of cases.
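
A sketch of how the zswap side could use such a flag -- essentially what
Barry's RFC later in this thread does, with acomp_is_async() being the helper
name proposed there and acomp_ctx->buffer the per-acomp_ctx buffer from the
last patch of this series:

	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
	if (acomp_is_async(acomp_ctx->acomp) && !zpool_can_sleep_mapped(zpool)) {
		/*
		 * Only a truly async compressor may sleep while waiting for
		 * completion, so only then is the bounce copy needed.
		 */
		memcpy(acomp_ctx->buffer, src, entry->length);
		src = acomp_ctx->buffer;
		zpool_unmap_handle(zpool, entry->handle);
	}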

Thanks!

> 
> But I am not sure if we can find a way to convince Herbert(+To)  :-)
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress
  2023-12-28  8:03       ` Barry Song
  2023-12-28  8:23         ` Chengming Zhou
@ 2023-12-28  9:49         ` Herbert Xu
  2024-01-03  2:57           ` [PATCH RFC 1/2] crypto: introduce acomp_is_async to expose if a acomp has a scomp backend Barry Song
  1 sibling, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2023-12-28  9:49 UTC (permalink / raw)
  To: Barry Song
  Cc: Chengming Zhou, Andrew Morton, Seth Jennings, Johannes Weiner,
	Vitaly Wool, Nhat Pham, Chris Li, Yosry Ahmed, Dan Streetman,
	linux-kernel, linux-mm, Chris Li

On Thu, Dec 28, 2023 at 09:03:32PM +1300, Barry Song wrote:
>
> for CPU-based alg, we have completed the compr/decompr within
> crypto_acomp_decompress()
> synchronously. they won't return EINPROGRESS, EBUSY.
> 
> The problem is that crypto_acomp won't expose this information to its
> users. if it does,
> we can use this info, we will totally avoid the code of copying
> zsmalloc's data to a tmp
> buffer for the most majority users of zswap.
> 
> But I am not sure if we can find a way to convince Herbert(+To)  :-)

What would you like to expose? The async status of the underlying
algorithm?

We could certainly do that.  But I wonder if it might actually be
better for you to allocate a second sync-only algorithm for such
cases.  I'd like to see some real numbers.
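
A minimal sketch of that second-algorithm idea, assuming the generic crypto
type/mask convention also applies to acomp (an assumption on my side):

	/* ask the crypto API for a synchronous-only transform */
	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, CRYPTO_ALG_ASYNC,
					cpu_to_node(cpu));
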

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH RFC 1/2] crypto: introduce acomp_is_async to expose if a acomp has a scomp backend
  2023-12-28  9:49         ` Herbert Xu
@ 2024-01-03  2:57           ` Barry Song
  2024-01-03  2:57             ` [PATCH RFC 2/2] mm/zswap: remove the memcpy if acomp is not asynchronous Barry Song
  2024-01-03  2:57             ` [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress Barry Song
  0 siblings, 2 replies; 24+ messages in thread
From: Barry Song @ 2024-01-03  2:57 UTC (permalink / raw)
  To: herbert, davem, akpm, chriscli, chrisl, ddstreet, hannes,
	linux-mm, nphamcs, sjenning, vitaly.wool, yosryahmed,
	zhouchengming
  Cc: linux-kernel, linux-crypto, Barry Song

From: Barry Song <v-songbaohua@oppo.com>

Almost all CPU-based compressors/decompressors are actually synchronous
even though they support the acomp APIs. Some chips, such as hisilicon
and intel/qat/, have hardware-based accelerators to offload the CPU's
work, and their drivers operate in async mode.
Letting acomp users know whether an acomp is really async will help them
decide whether the compression and decompression procedure can sleep,
and act accordingly.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 crypto/acompress.c         | 8 ++++++++
 include/crypto/acompress.h | 9 +++++++++
 2 files changed, 17 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index 1c682810a484..99118e879a4a 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -152,6 +152,14 @@ struct crypto_acomp *crypto_alloc_acomp_node(const char *alg_name, u32 type,
 }
 EXPORT_SYMBOL_GPL(crypto_alloc_acomp_node);
 
+bool acomp_is_async(struct crypto_acomp *acomp)
+{
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp);
+
+	return tfm->__crt_alg->cra_type == &crypto_acomp_type;
+}
+EXPORT_SYMBOL_GPL(acomp_is_async);
+
 struct acomp_req *acomp_request_alloc(struct crypto_acomp *acomp)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp);
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 574cffc90730..5831080479e9 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -195,6 +195,15 @@ static inline int crypto_has_acomp(const char *alg_name, u32 type, u32 mask)
  */
 struct acomp_req *acomp_request_alloc(struct crypto_acomp *tfm);
 
+/**
+ * acomp_is_async() -- check if an acomp is asynchronous(can sleep)
+ *
+ * @tfm:	ACOMPRESS tfm handle allocated with crypto_alloc_acomp()
+ *
+ * Return:	true if the acomp is asynchronous, otherwise, false
+ */
+bool acomp_is_async(struct crypto_acomp *tfm);
+
 /**
  * acomp_request_free() -- zeroize and free asynchronous (de)compression
  *			   request as well as the output buffer if allocated
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RFC 2/2] mm/zswap: remove the memcpy if acomp is not asynchronous
  2024-01-03  2:57           ` [PATCH RFC 1/2] crypto: introduce acomp_is_async to expose if a acomp has a scomp backend Barry Song
@ 2024-01-03  2:57             ` Barry Song
  2024-01-03  2:57             ` [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress Barry Song
  1 sibling, 0 replies; 24+ messages in thread
From: Barry Song @ 2024-01-03  2:57 UTC (permalink / raw)
  To: herbert, davem, akpm, chriscli, chrisl, ddstreet, hannes,
	linux-mm, nphamcs, sjenning, vitaly.wool, yosryahmed,
	zhouchengming
  Cc: linux-kernel, linux-crypto, Barry Song

From: Barry Song <v-songbaohua@oppo.com>

Most compressors are actually CPU-based and won't sleep during
decompression, so we should be able to remove the redundant memcpy
for them.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/zswap.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index ca25b676048e..36898614ebcc 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -168,6 +168,7 @@ struct crypto_acomp_ctx {
 	struct crypto_wait wait;
 	u8 *buffer;
 	struct mutex mutex;
+	bool is_async; /* if acomp can sleep */
 };
 
 /*
@@ -716,6 +717,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 		goto acomp_fail;
 	}
 	acomp_ctx->acomp = acomp;
+	acomp_ctx->is_async = acomp_is_async(acomp);
 
 	req = acomp_request_alloc(acomp_ctx->acomp);
 	if (!req) {
@@ -1370,7 +1372,7 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
 	mutex_lock(&acomp_ctx->mutex);
 
 	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
-	if (!zpool_can_sleep_mapped(zpool)) {
+	if (acomp_ctx->is_async && !zpool_can_sleep_mapped(zpool)) {
 		memcpy(acomp_ctx->buffer, src, entry->length);
 		src = acomp_ctx->buffer;
 		zpool_unmap_handle(zpool, entry->handle);
@@ -1384,7 +1386,7 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
 	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
 	mutex_unlock(&acomp_ctx->mutex);
 
-	if (zpool_can_sleep_mapped(zpool))
+	if (!acomp_ctx->is_async || zpool_can_sleep_mapped(zpool))
 		zpool_unmap_handle(zpool, entry->handle);
 }
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress
  2024-01-03  2:57           ` [PATCH RFC 1/2] crypto: introduce acomp_is_async to expose if a acomp has a scomp backend Barry Song
  2024-01-03  2:57             ` [PATCH RFC 2/2] mm/zswap: remove the memcpy if acomp is not asynchronous Barry Song
@ 2024-01-03  2:57             ` Barry Song
  2024-01-25  9:41               ` Herbert Xu
  1 sibling, 1 reply; 24+ messages in thread
From: Barry Song @ 2024-01-03  2:57 UTC (permalink / raw)
  To: herbert, davem, akpm, chriscli, chrisl, ddstreet, hannes,
	linux-mm, nphamcs, sjenning, vitaly.wool, yosryahmed,
	zhouchengming
  Cc: linux-kernel, linux-crypto

>>
>> for CPU-based alg, we have completed the compr/decompr within
>> crypto_acomp_decompress()
>> synchronously. they won't return EINPROGRESS, EBUSY.
>>
>> The problem is that crypto_acomp won't expose this information to its
>> users. if it does,
>> we can use this info, we will totally avoid the code of copying
>> zsmalloc's data to a tmp
>> buffer for the most majority users of zswap.
>>
>> But I am not sure if we can find a way to convince Herbert(+To)  :-)

> What would you like to expose? The async status of the underlying
> algorithm?

Right. I have followed up with an RFC patchset; please take a look.

> 
> We could certainly do that.  But I wonder if it might actually be
> better for you to allocate a second sync-only algorithm for such
> cases.  I'd like to see some real numbers.

Some hardware might want to use an accelerator to help offload the CPU's
work. Their drivers operate in async mode, for example hisilicon
and intel.

I don't have exact numbers for what we can save by removing the redundant
memcpy, nor do I have proper hardware to test and get those numbers.
As Chengming is actively working on zswap, I wonder if you could test
my patches and post some data?

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>

Thanks
Barry



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress
  2024-01-03  2:57             ` [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress Barry Song
@ 2024-01-25  9:41               ` Herbert Xu
  2024-01-27 14:41                 ` Barry Song
  0 siblings, 1 reply; 24+ messages in thread
From: Herbert Xu @ 2024-01-25  9:41 UTC (permalink / raw)
  To: Barry Song
  Cc: davem, akpm, chriscli, chrisl, ddstreet, hannes, linux-mm,
	nphamcs, sjenning, vitaly.wool, yosryahmed, zhouchengming,
	linux-kernel, linux-crypto

On Wed, Jan 03, 2024 at 03:57:59PM +1300, Barry Song wrote:
>
> > We could certainly do that.  But I wonder if it might actually be
> > better for you to allocate a second sync-only algorithm for such
> > cases.  I'd like to see some real numbers.
> 
> Some hardware might want to use an accelerator to help offload the CPU's
> work. Their drivers operate in async mode, for example hisilicon
> and intel.
>
> I don't have exact numbers for what we can save by removing the redundant
> memcpy, nor do I have proper hardware to test and get those numbers.
> As Chengming is actively working on zswap, I wonder if you could test
> my patches and post some data?

I don't have the hardware to test this.  Since you're proposing
the change, please test it to ensure that we're not adding cruft
to the API that's actually detrimental to performance.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress
  2024-01-25  9:41               ` Herbert Xu
@ 2024-01-27 14:41                 ` Barry Song
  0 siblings, 0 replies; 24+ messages in thread
From: Barry Song @ 2024-01-27 14:41 UTC (permalink / raw)
  To: Herbert Xu
  Cc: davem, akpm, chriscli, chrisl, ddstreet, hannes, linux-mm,
	nphamcs, sjenning, vitaly.wool, yosryahmed, zhouchengming,
	linux-kernel, linux-crypto

On Thu, Jan 25, 2024 at 5:41 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Wed, Jan 03, 2024 at 03:57:59PM +1300, Barry Song wrote:
> >
> > > We could certainly do that.  But I wonder if it might actually be
> > > better for you to allocate a second sync-only algorithm for such
> > > cases.  I'd like to see some real numbers.
> >
> > Some hardware might want to use an accelerator to help offload the CPU's
> > work. Their drivers operate in async mode, for example hisilicon
> > and intel.
> >
> > I don't have exact numbers for what we can save by removing the redundant
> > memcpy, nor do I have proper hardware to test and get those numbers.
> > As Chengming is actively working on zswap, I wonder if you could test
> > my patches and post some data?
>
> I don't have the hardware to test this.  Since you're proposing
> the change, please test it to ensure that we're not adding cruft
> to the API that's actually detrimental to performance.

Understood.
Sorry for the grammatical errors. I was actually asking for Chengming's help
with testing, as he has the hardware and was actively working on optimizing
zswap. He has already tested and sent the performance data, which I quoted
in the formal patchset.

>
> Thanks,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Thanks
Barry


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-01-27 14:42 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-26 15:54 [PATCH v4 0/6] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
2023-12-26 15:54 ` [PATCH v4 1/6] mm/zswap: change dstmem size to one page Chengming Zhou
2023-12-27  1:07   ` Barry Song
2023-12-27  6:11     ` Chengming Zhou
2023-12-27  6:32       ` Barry Song
2023-12-27 20:58       ` Andrew Morton
2023-12-27 23:21         ` Nhat Pham
2023-12-28  6:41           ` Chengming Zhou
2023-12-26 15:54 ` [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress Chengming Zhou
2023-12-27  1:24   ` Barry Song
2023-12-27  6:32     ` Chengming Zhou
2023-12-28  8:03       ` Barry Song
2023-12-28  8:23         ` Chengming Zhou
2023-12-28  9:49         ` Herbert Xu
2024-01-03  2:57           ` [PATCH RFC 1/2] crypto: introduce acomp_is_async to expose if a acomp has a scomp backend Barry Song
2024-01-03  2:57             ` [PATCH RFC 2/2] mm/zswap: remove the memcpy if acomp is not asynchronous Barry Song
2024-01-03  2:57             ` [PATCH v4 2/6] mm/zswap: reuse dstmem when decompress Barry Song
2024-01-25  9:41               ` Herbert Xu
2024-01-27 14:41                 ` Barry Song
2023-12-26 15:54 ` [PATCH v4 3/6] mm/zswap: refactor out __zswap_load() Chengming Zhou
2023-12-26 15:54 ` [PATCH v4 4/6] mm/zswap: cleanup zswap_load() Chengming Zhou
2023-12-26 15:54 ` [PATCH v4 5/6] mm/zswap: cleanup zswap_writeback_entry() Chengming Zhou
2023-12-26 15:54 ` [PATCH v4 6/6] mm/zswap: change per-cpu mutex and buffer to per-acomp_ctx Chengming Zhou
2023-12-26 19:08   ` Nhat Pham

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).