Re: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Johannes Weiner <hannes@cmpxchg.org>
To: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	yosryahmed@google.com, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com,
	akpm@linux-foundation.org, linux-crypto@vger.kernel.org,
	herbert@gondor.apana.org.au, davem@davemloft.net,
	clabbe@baylibre.com, ardb@kernel.org, ebiggers@google.com,
	surenb@google.com, kristen.c.accardi@intel.com,
	zanussi@kernel.org, wajdi.k.feghali@intel.com,
	vinodh.gopal@intel.com
Subject: Re: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
Date: Thu, 7 Nov 2024 13:16:26 -0500	[thread overview]
Message-ID: <20241107181626.GF1172372@cmpxchg.org> (raw)
In-Reply-To: <20241106192105.6731-14-kanchana.p.sridhar@intel.com>

On Wed, Nov 06, 2024 at 11:21:05AM -0800, Kanchana P Sridhar wrote:
> If the system has Intel IAA, and if sysctl vm.compress-batching is set to
> "1", zswap_store() will call crypto_acomp_batch_compress() to compress up
> to SWAP_CRYPTO_BATCH_SIZE (i.e. 8) pages in large folios in parallel using
> the multiple compress engines available in IAA hardware.
> 
> On platforms with multiple IAA devices per socket, compress jobs from all
> cores in a socket will be distributed among all IAA devices on the socket
> by the iaa_crypto driver.
> 
> With deflate-iaa configured as the zswap compressor, and
> sysctl vm.compress-batching is enabled, the first time zswap_store() has to
> swapout a large folio on any given cpu, it will allocate the
> pool->acomp_batch_ctx resources on that cpu, and set pool->can_batch_comp
> to BATCH_COMP_ENABLED. It will then proceed to call the main
> __zswap_store_batch_core() compress batching function. Subsequent calls to
> zswap_store() on the same cpu will go ahead and use the acomp_batch_ctx by
> checking the pool->can_batch_comp status.
> 
> Hence, we allocate the per-cpu pool->acomp_batch_ctx resources only on an
> as-needed basis, to reduce memory footprint cost. The cost is not incurred
> on cores that never get to swapout a large folio.
> 
> This patch introduces the main __zswap_store_batch_core() function for
> compress batching. This interface represents the extensible compress
> batching architecture that can potentially be called with a batch of
> any-order folios from shrink_folio_list(). In other words, although
> zswap_store() calls __zswap_store_batch_core() with exactly one large folio
> in this patch, we can reuse this interface to reclaim a batch of folios, to
> significantly improve the reclaim path efficiency due to IAA's parallel
> compression capability.
> 
> The newly added functions that implement batched stores follow the
> general structure of zswap_store() of a large folio. Some amount of
> restructuring and optimization is done to minimize failure points
> for a batch, fail early and maximize the zswap store pipeline occupancy
> with SWAP_CRYPTO_BATCH_SIZE pages, potentially from multiple
> folios. This is intended to maximize reclaim throughput with the IAA
> hardware parallel compressions.
> 
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  include/linux/zswap.h |  84 ++++++
>  mm/zswap.c            | 625 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 709 insertions(+)
> 
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 9ad27ab3d222..6d3ef4780c69 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -31,6 +31,88 @@ struct zswap_lruvec_state {
>  	atomic_long_t nr_disk_swapins;
>  };
>  
> +/*
> + * struct zswap_store_sub_batch_page:
> + *
> + * This represents one "zswap batching element", namely, the
> + * attributes associated with a page in a large folio that will
> + * be compressed and stored in zswap. The term "batch" is reserved
> + * for a conceptual "batch" of folios that can be sent to
> + * zswap_store() by reclaim. The term "sub-batch" is used to describe
> + * a collection of "zswap batching elements", i.e., an array of
> + * "struct zswap_store_sub_batch_page *".
> + *
> + * The zswap compress sub-batch size is specified by
> + * SWAP_CRYPTO_BATCH_SIZE, currently set as 8UL if the
> + * platform has Intel IAA. This means zswap can store a large folio
> + * by creating sub-batches of up to 8 pages and compressing this
> + * batch using IAA to parallelize the 8 compress jobs in hardware.
> + * For e.g., a 64KB folio can be compressed as 2 sub-batches of
> + * 8 pages each. This can significantly improve the zswap_store()
> + * performance for large folios.
> + *
> + * Although the page itself is represented directly, the structure
> + * adds a "u8 batch_idx" to represent an index for the folio in a
> + * conceptual "batch of folios" that can be passed to zswap_store().
> + * Conceptually, this allows for up to 256 folios that can be passed
> + * to zswap_store(). If this conceptual number of folios sent to
> + * zswap_store() exceeds 256, the "batch_idx" needs to become u16.
> + */
> +struct zswap_store_sub_batch_page {
> +	u8 batch_idx;
> +	swp_entry_t swpentry;
> +	struct obj_cgroup *objcg;
> +	struct zswap_entry *entry;
> +	int error; /* folio error status. */
> +};
> +
> +/*
> + * struct zswap_store_pipeline_state:
> + *
> + * This stores state during IAA compress batching of (conceptually, a batch of)
> + * folios. The term pipelining in this context, refers to breaking down
> + * the batch of folios being reclaimed into sub-batches of
> + * SWAP_CRYPTO_BATCH_SIZE pages, batch compressing and storing the
> + * sub-batch. This concept could be further evolved to use overlap of CPU
> + * computes with IAA computes. For instance, we could stage the post-compress
> + * computes for sub-batch "N-1" to happen in parallel with IAA batch
> + * compression of sub-batch "N".
> + *
> + * We begin by developing the concept of compress batching. Pipelining with
> + * overlap can be future work.
> + *
> + * @errors: The errors status for the batch of reclaim folios passed in from
> + *          a higher mm layer such as swap_writepage().
> + * @pool: A valid zswap_pool.
> + * @acomp_ctx: The per-cpu pointer to the crypto_acomp_ctx for the @pool.
> + * @sub_batch: This is an array that represents the sub-batch of up to
> + *             SWAP_CRYPTO_BATCH_SIZE pages that are being stored
> + *             in zswap.
> + * @comp_dsts: The destination buffers for crypto_acomp_compress() for each
> + *             page being compressed.
> + * @comp_dlens: The destination buffers' lengths from crypto_acomp_compress()
> + *              obtained after crypto_acomp_poll() returns completion status,
> + *              for each page being compressed.
> + * @comp_errors: Compression errors for each page being compressed.
> + * @nr_comp_pages: Total number of pages in @sub_batch.
> + *
> + * Note:
> + * The max sub-batch size is SWAP_CRYPTO_BATCH_SIZE, currently 8UL.
> + * Hence, if SWAP_CRYPTO_BATCH_SIZE exceeds 256, some of the
> + * u8 members (except @comp_dsts) need to become u16.
> + */
> +struct zswap_store_pipeline_state {
> +	int *errors;
> +	struct zswap_pool *pool;
> +	struct crypto_acomp_ctx *acomp_ctx;
> +	struct zswap_store_sub_batch_page *sub_batch;
> +	struct page **comp_pages;
> +	u8 **comp_dsts;
> +	unsigned int *comp_dlens;
> +	int *comp_errors;
> +	u8 nr_comp_pages;
> +};

Why are these in the public header?

>  unsigned long zswap_total_pages(void);
>  bool zswap_store(struct folio *folio);
>  bool zswap_load(struct folio *folio);
> @@ -45,6 +127,8 @@ bool zswap_never_enabled(void);
>  #else
>  
>  struct zswap_lruvec_state {};
> +struct zswap_store_sub_batch_page {};
> +struct zswap_store_pipeline_state {};
> 
>  static inline bool zswap_store(struct folio *folio)
>  {
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2af736e38213..538aac3fb552 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -255,6 +255,12 @@ static int zswap_create_acomp_ctx(unsigned int cpu,
>  				  char *tfm_name,
>  				  unsigned int nr_reqs);
>  
> +static bool __zswap_store_batch_core(
> +	int node_id,
> +	struct folio **folios,
> +	int *errors,
> +	unsigned int nr_folios);
> +

Please reorder the functions to avoid forward decls.

>  /*********************************
>  * pool functions
>  **********************************/
> @@ -1626,6 +1632,12 @@ static ssize_t zswap_store_page(struct page *page,
>  	return -EINVAL;
>  }
>  
> +/*
> + * Modified to use the IAA compress batching framework implemented in
> + * __zswap_store_batch_core() if sysctl vm.compress-batching is 1.
> + * The batching code is intended to significantly improve folio store
> + * performance over the sequential code.

This isn't helpful, please delete.

>  bool zswap_store(struct folio *folio)
>  {
>  	long nr_pages = folio_nr_pages(folio);
> @@ -1638,6 +1650,38 @@ bool zswap_store(struct folio *folio)
>  	bool ret = false;
>  	long index;
>  
> +	/*
> +	 * Improve large folio zswap_store() latency with IAA compress batching,
> +	 * if this is enabled by setting sysctl vm.compress-batching to "1".
> +	 * If enabled, the large folio's pages are compressed in parallel in
> +	 * batches of SWAP_CRYPTO_BATCH_SIZE pages. If disabled, every page in
> +	 * the large folio is compressed sequentially.
> +	 */

Same here. Reduce to "Try to batch compress large folios, fall back to
processing individual subpages if that fails."

> +	if (folio_test_large(folio) && READ_ONCE(compress_batching)) {
> +		pool = zswap_pool_current_get();

There is an existing zswap_pool_current_get() in zswap_store(), please
reorder the sequence so you don't need to add an extra one.

> +		if (!pool) {
> +			pr_err("Cannot setup acomp_batch_ctx for compress batching: no current pool found\n");

This is unnecessary.

> +			goto sequential_store;
> +		}
> +
> +		if (zswap_pool_can_batch(pool)) {

This function is introduced in another patch, where it isn't
used. Please add functions and callers in the same patch.

> +			int error = -1;
> +			bool store_batch = __zswap_store_batch_core(
> +						folio_nid(folio),
> +						&folio, &error, 1);
> +
> +			if (store_batch) {
> +				zswap_pool_put(pool);
> +				if (!error)
> +					ret = true;
> +				return ret;
> +			}
> +		}

Please don't future proof code like this, only implement what is
strictly necessary for the functionality in this patch. You're only
adding a single caller with nr_folios=1, so it shouldn't be a
parameter, and the function shouldn't have a that batch_idx loop.

> +		zswap_pool_put(pool);
> +	}
> +
> +sequential_store:
> +
>  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
>  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>  
> @@ -1724,6 +1768,587 @@ bool zswap_store(struct folio *folio)
>  	return ret;
>  }
>  
> +/*
> + * Note: If SWAP_CRYPTO_BATCH_SIZE exceeds 256, change the
> + * u8 stack variables in the next several functions, to u16.
> + */
> +
> +/*
> + * Propagate the "sbp" error condition to other batch elements belonging to
> + * the same folio as "sbp".
> + */
> +static __always_inline void zswap_store_propagate_errors(
> +	struct zswap_store_pipeline_state *zst,
> +	u8 error_batch_idx)
> +{

Please observe surrounding coding style on how to wrap >80 col
function signatures.

Don't use __always_inline unless there is a clear, spelled out
performance reason. Since it's an error path, that's doubtful.

Please use a consistent namespace for all this:

CONFIG_ZSWAP_BATCH
zswap_batch_store()
zswap_batch_alloc_entries()
zswap_batch_add_folios()
zswap_batch_compress()

etc.

Again, order to avoid forward decls.

Try to keep the overall sequence of events between zswap_store() and
zswap_batch_store() similar as much as possible for readability.

next prev parent reply	other threads:[~2024-11-07 18:16 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 01/13] crypto: acomp - Define two new interfaces for compress/decompress batching Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 02/13] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 03/13] crypto: iaa - Implement compress/decompress batching API in iaa_crypto Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 04/13] crypto: iaa - Make async mode the default Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 05/13] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 06/13] crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to IAAs Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 07/13] crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA node Kanchana P Sridhar
2024-11-06 19:21 ` [PATCH v3 08/13] mm: zswap: acomp_ctx mutex lock/unlock optimizations Kanchana P Sridhar
2024-11-08 20:14   ` Yosry Ahmed
2024-11-08 21:34     ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs Kanchana P Sridhar
2024-11-07 17:20   ` Johannes Weiner
2024-11-07 22:21     ` Sridhar, Kanchana P
2024-11-08 20:22       ` Yosry Ahmed
2024-11-08 21:39         ` Sridhar, Kanchana P
2024-11-08 22:54           ` Yosry Ahmed
2024-11-09  1:03             ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 10/13] mm: zswap: Add a per-cpu "acomp_batch_ctx" to struct zswap_pool Kanchana P Sridhar
2024-11-08 20:23   ` Yosry Ahmed
2024-11-09  1:04     ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 11/13] mm: zswap: Allocate acomp_batch_ctx resources for a given zswap_pool Kanchana P Sridhar
2024-11-07 17:31   ` Johannes Weiner
2024-11-07 22:22     ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout Kanchana P Sridhar
2024-11-06 20:17   ` Andrew Morton
2024-11-06 20:39     ` Sridhar, Kanchana P
2024-11-07 17:34   ` Johannes Weiner
2024-11-07 22:24     ` Sridhar, Kanchana P
2024-11-08 20:23     ` Yosry Ahmed
2024-11-09  1:05       ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
2024-11-07  5:24   ` kernel test robot
2024-11-07 18:16   ` Johannes Weiner [this message]
2024-11-07 22:32     ` Sridhar, Kanchana P
2024-11-07 18:53   ` Johannes Weiner
2024-11-07 22:50     ` Sridhar, Kanchana P
2024-11-06 20:25 ` [PATCH v3 00/13] zswap IAA compress batching Andrew Morton
2024-11-06 20:44   ` Sridhar, Kanchana P

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241107181626.GF1172372@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=ardb@kernel.org \
    --cc=chengming.zhou@linux.dev \
    --cc=clabbe@baylibre.com \
    --cc=davem@davemloft.net \
    --cc=ebiggers@google.com \
    --cc=herbert@gondor.apana.org.au \
    --cc=kanchana.p.sridhar@intel.com \
    --cc=kristen.c.accardi@intel.com \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=ryan.roberts@arm.com \
    --cc=surenb@google.com \
    --cc=usamaarif642@gmail.com \
    --cc=vinodh.gopal@intel.com \
    --cc=wajdi.k.feghali@intel.com \
    --cc=ying.huang@intel.com \
    --cc=yosryahmed@google.com \
    --cc=zanussi@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.