Re: [PATCH v6 12/16] mm: zswap: Allocate pool batching resources if the compressor supports batching.

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Yosry Ahmed <yosry.ahmed@linux.dev>
To: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, nphamcs@gmail.com, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, 21cnbao@gmail.com,
	akpm@linux-foundation.org, linux-crypto@vger.kernel.org,
	herbert@gondor.apana.org.au, davem@davemloft.net,
	clabbe@baylibre.com, ardb@kernel.org, ebiggers@google.com,
	surenb@google.com, kristen.c.accardi@intel.com,
	wajdi.k.feghali@intel.com, vinodh.gopal@intel.com
Subject: Re: [PATCH v6 12/16] mm: zswap: Allocate pool batching resources if the compressor supports batching.
Date: Thu, 6 Feb 2025 18:55:17 +0000	[thread overview]
Message-ID: <Z6UFlV2B47vgpXgt@google.com> (raw)
In-Reply-To: <20250206072102.29045-13-kanchana.p.sridhar@intel.com>

On Wed, Feb 05, 2025 at 11:20:58PM -0800, Kanchana P Sridhar wrote:
> This patch adds support for the per-CPU acomp_ctx to track multiple
> compression/decompression requests. The zswap_cpu_comp_prepare() cpu

nit: s/cpu/CPU

> onlining code will check if the compressor supports batching. If so, it
> will allocate the necessary batching resources.
> 
> However, zswap does not use more than one request yet. Follow-up patches
> will actually utilize the multiple acomp_ctx requests/buffers for batch
> compression/decompression of multiple pages.
> 
> The newly added ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used
> for batching. There is no extra memory usage for compressors that do not
> support batching.

That's not entirely accurate, there's a tiny bit of extra overhead to
allocate the arrays. It can be avoided, but I am not sure it's worth the
complexity.

> 
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 132 +++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 98 insertions(+), 34 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index a2baceed3bf9..dc7d1ff04b22 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -78,6 +78,16 @@ static bool zswap_pool_reached_full;
>  
>  #define ZSWAP_PARAM_UNSET ""
>  
> +/*
> + * For compression batching of large folios:
> + * Maximum number of acomp compress requests that will be processed
> + * in a batch, iff the zswap compressor supports batching.
> + * This limit exists because we preallocate enough requests and buffers
> + * in the per-cpu acomp_ctx accordingly. Hence, a higher limit means higher
> + * memory usage.
> + */
> +#define ZSWAP_MAX_BATCH_SIZE 8U
> +
>  static int zswap_setup(void);
>  
>  /* Enable/disable zswap */
> @@ -143,9 +153,10 @@ bool zswap_never_enabled(void)
>  
>  struct crypto_acomp_ctx {
>  	struct crypto_acomp *acomp;
> -	struct acomp_req *req;
> +	struct acomp_req **reqs;
> +	u8 **buffers;
> +	unsigned int nr_reqs;
>  	struct crypto_wait wait;
> -	u8 *buffer;
>  	struct mutex mutex;
>  	bool is_sleepable;
>  };
> @@ -821,15 +832,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
>  	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>  	struct crypto_acomp *acomp = NULL;
> -	struct acomp_req *req = NULL;
> -	u8 *buffer = NULL;
> -	int ret;
> +	unsigned int nr_reqs = 1;
> +	int ret = -ENOMEM;
> +	int i;
>  
> -	buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> -	if (!buffer) {
> -		ret = -ENOMEM;
> -		goto fail;
> -	}
> +	acomp_ctx->buffers = NULL;
> +	acomp_ctx->reqs = NULL;
> +	acomp_ctx->nr_reqs = 0;
>  
>  	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
>  	if (IS_ERR(acomp)) {
> @@ -839,12 +848,30 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  		goto fail;
>  	}
>  
> -	req = acomp_request_alloc(acomp);
> -	if (!req) {
> -		pr_err("could not alloc crypto acomp_request %s\n",
> -		       pool->tfm_name);
> -		ret = -ENOMEM;
> +	if (acomp_has_async_batching(acomp))
> +		nr_reqs = min(ZSWAP_MAX_BATCH_SIZE, crypto_acomp_batch_size(acomp));

Do we need to check acomp_has_async_batching() here? Shouldn't
crypto_acomp_batch_size() just return 1 if batching is not supported?

> +
> +	acomp_ctx->buffers = kcalloc_node(nr_reqs, sizeof(u8 *), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!acomp_ctx->buffers)
> +		goto fail;
> +
> +	for (i = 0; i < nr_reqs; ++i) {
> +		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> +		if (!acomp_ctx->buffers[i])
> +			goto fail;
> +	}
> +
> +	acomp_ctx->reqs = kcalloc_node(nr_reqs, sizeof(struct acomp_req *), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!acomp_ctx->reqs)
>  		goto fail;
> +
> +	for (i = 0; i < nr_reqs; ++i) {
> +		acomp_ctx->reqs[i] = acomp_request_alloc(acomp);
> +		if (!acomp_ctx->reqs[i]) {
> +			pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
> +			       i, pool->tfm_name);
> +			goto fail;
> +		}
>  	}
>  
>  	/*
> @@ -853,6 +880,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  	 * again resulting in a deadlock.
>  	 */
>  	mutex_lock(&acomp_ctx->mutex);

I had moved all the acomp_ctx initializations under the mutex to keep
its state always fully initialized or uninitialized for anyone holding
the lock. With this change, acomp_ctx->reqs will be set to non-NULL
before the mutex is held and the acomp_ctx is fully initialized.

The code in the compression/decompression path uses acomp_ctx->reqes to
check if the acomp_ctx can be used. While I don't believe it's currently
possible for zswap_cpu_comp_prepare() to race with these paths, I did
this to be future proof. I don't want the code to end up initializing
some of the struct under the lock and some of it without it.

So I think there's two options here:
- Do the due diligence check that holding the mutex is not required when
  initializing acomp_ctx here, and remove the mutex locking here
  completely.
- Keep the initializations in the lock critical section (i.e. allocate
  everything first, then initialize under the lock).

> +
> +	/*
> +	 * The crypto_wait is used only in fully synchronous, i.e., with scomp
> +	 * or non-poll mode of acomp, hence there is only one "wait" per
> +	 * acomp_ctx, with callback set to reqs[0], under the assumption that
> +	 * there is at least 1 request per acomp_ctx.
> +	 */
>  	crypto_init_wait(&acomp_ctx->wait);
>  
>  	/*
> @@ -860,20 +894,33 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
>  	 * won't be called, crypto_wait_req() will return without blocking.
>  	 */
> -	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> +	acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
>  				   crypto_req_done, &acomp_ctx->wait);
>  
> -	acomp_ctx->buffer = buffer;
> +	acomp_ctx->nr_reqs = nr_reqs;
>  	acomp_ctx->acomp = acomp;
>  	acomp_ctx->is_sleepable = acomp_is_async(acomp);
> -	acomp_ctx->req = req;
>  	mutex_unlock(&acomp_ctx->mutex);
>  	return 0;
>  
>  fail:
> +	if (acomp_ctx->buffers) {
> +		for (i = 0; i < nr_reqs; ++i)
> +			kfree(acomp_ctx->buffers[i]);
> +		kfree(acomp_ctx->buffers);
> +		acomp_ctx->buffers = NULL;
> +	}
> +
> +	if (acomp_ctx->reqs) {
> +		for (i = 0; i < nr_reqs; ++i)
> +			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> +				acomp_request_free(acomp_ctx->reqs[i]);
> +		kfree(acomp_ctx->reqs);
> +		acomp_ctx->reqs = NULL;
> +	}
> +
>  	if (acomp)
>  		crypto_free_acomp(acomp);
> -	kfree(buffer);
>  	return ret;
>  }
>  
> @@ -883,14 +930,31 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>  	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>  
>  	mutex_lock(&acomp_ctx->mutex);
> +
>  	if (!IS_ERR_OR_NULL(acomp_ctx)) {
> -		if (!IS_ERR_OR_NULL(acomp_ctx->req))
> -			acomp_request_free(acomp_ctx->req);
> -		acomp_ctx->req = NULL;
> +		int i;
> +
> +		if (acomp_ctx->reqs) {
> +			for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> +				if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> +					acomp_request_free(acomp_ctx->reqs[i]);
> +			kfree(acomp_ctx->reqs);
> +			acomp_ctx->reqs = NULL;
> +		}
> +
> +		if (acomp_ctx->buffers) {
> +			for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> +				kfree(acomp_ctx->buffers[i]);
> +			kfree(acomp_ctx->buffers);
> +			acomp_ctx->buffers = NULL;
> +		}
> +

The code here seems to be almost exactly like the failure path in
zswap_cpu_comp_prepare(), would it be better to put it in a helper?

>  		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
>  			crypto_free_acomp(acomp_ctx->acomp);
> -		kfree(acomp_ctx->buffer);
> +
> +		acomp_ctx->nr_reqs = 0;
>  	}
> +
>  	mutex_unlock(&acomp_ctx->mutex);
>  
>  	return 0;
> @@ -903,7 +967,7 @@ static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
>  	for (;;) {
>  		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
>  		mutex_lock(&acomp_ctx->mutex);
> -		if (likely(acomp_ctx->req))
> +		if (likely(acomp_ctx->reqs))
>  			return acomp_ctx;
>  		/*
>  		 * It is possible that we were migrated to a different CPU after
> @@ -935,7 +999,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	u8 *dst;
>  
>  	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> -	dst = acomp_ctx->buffer;
> +	dst = acomp_ctx->buffers[0];
>  	sg_init_table(&input, 1);
>  	sg_set_page(&input, page, PAGE_SIZE, 0);
>  
> @@ -945,7 +1009,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	 * giving the dst buffer with enough length to avoid buffer overflow.
>  	 */
>  	sg_init_one(&output, dst, PAGE_SIZE * 2);
> -	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> +	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
>  
>  	/*
>  	 * it maybe looks a little bit silly that we send an asynchronous request,
> @@ -959,8 +1023,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	 * but in different threads running on different cpu, we have different
>  	 * acomp instance, so multiple threads can do (de)compression in parallel.
>  	 */
> -	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> -	dlen = acomp_ctx->req->dlen;
> +	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> +	dlen = acomp_ctx->reqs[0]->dlen;
>  	if (comp_ret)
>  		goto unlock;
>  
> @@ -1011,19 +1075,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  	 */
>  	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
>  	    !virt_addr_valid(src)) {
> -		memcpy(acomp_ctx->buffer, src, entry->length);
> -		src = acomp_ctx->buffer;
> +		memcpy(acomp_ctx->buffers[0], src, entry->length);
> +		src = acomp_ctx->buffers[0];
>  		zpool_unmap_handle(zpool, entry->handle);
>  	}
>  
>  	sg_init_one(&input, src, entry->length);
>  	sg_init_table(&output, 1);
>  	sg_set_folio(&output, folio, PAGE_SIZE, 0);
> -	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> -	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> -	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> +	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
> +	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
> +	BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
>  
> -	if (src != acomp_ctx->buffer)
> +	if (src != acomp_ctx->buffers[0])
>  		zpool_unmap_handle(zpool, entry->handle);
>  	acomp_ctx_put_unlock(acomp_ctx);
>  }
> -- 
> 2.27.0
>

next prev parent reply	other threads:[~2025-02-06 18:55 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 01/16] crypto: acomp - Add synchronous/asynchronous acomp request chaining Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 02/16] crypto: acomp - Define new interfaces for compress/decompress batching Kanchana P Sridhar
2025-02-16  5:10   ` Herbert Xu
2025-02-28 10:00     ` Sridhar, Kanchana P
2025-02-06  7:20 ` [PATCH v6 03/16] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 04/16] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 05/16] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 06/16] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 07/16] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 08/16] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 09/16] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 10/16] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 11/16] crypto: iaa - Fix for "deflate_generic_tfm" global being accessed without locks Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 12/16] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
2025-02-06 18:55   ` Yosry Ahmed [this message]
2025-02-28 10:00     ` Sridhar, Kanchana P
2025-02-06  7:20 ` [PATCH v6 13/16] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching Kanchana P Sridhar
2025-02-06  7:21 ` [PATCH v6 14/16] mm: zswap: Introduce zswap_compress_folio() to compress all pages in a folio Kanchana P Sridhar
2025-02-06  7:21 ` [PATCH v6 15/16] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
2025-02-06 19:10   ` Yosry Ahmed
2025-02-06 19:24     ` Sridhar, Kanchana P
2025-02-28 10:00       ` Sridhar, Kanchana P
2025-02-06  7:21 ` [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression with 2M folios Kanchana P Sridhar
2025-02-06 19:15   ` Yosry Ahmed
2025-02-28 10:00     ` Sridhar, Kanchana P
2025-02-20 23:28   ` Nhat Pham
2025-02-21  3:24     ` Sridhar, Kanchana P
2025-02-11 17:05 ` [PATCH v6 00/16] zswap IAA compress batching Eric Biggers
2025-02-11 17:52   ` Nhat Pham

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z6UFlV2B47vgpXgt@google.com \
    --to=yosry.ahmed@linux.dev \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=ardb@kernel.org \
    --cc=chengming.zhou@linux.dev \
    --cc=clabbe@baylibre.com \
    --cc=davem@davemloft.net \
    --cc=ebiggers@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=herbert@gondor.apana.org.au \
    --cc=kanchana.p.sridhar@intel.com \
    --cc=kristen.c.accardi@intel.com \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=ryan.roberts@arm.com \
    --cc=surenb@google.com \
    --cc=usamaarif642@gmail.com \
    --cc=vinodh.gopal@intel.com \
    --cc=wajdi.k.feghali@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.