From: Johannes Weiner <hannes@cmpxchg.org>
To: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
yosryahmed@google.com, nphamcs@gmail.com,
chengming.zhou@linux.dev, usamaarif642@gmail.com,
ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com,
akpm@linux-foundation.org, linux-crypto@vger.kernel.org,
herbert@gondor.apana.org.au, davem@davemloft.net,
clabbe@baylibre.com, ardb@kernel.org, ebiggers@google.com,
surenb@google.com, kristen.c.accardi@intel.com,
zanussi@kernel.org, wajdi.k.feghali@intel.com,
vinodh.gopal@intel.com
Subject: Re: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
Date: Thu, 7 Nov 2024 13:16:26 -0500 [thread overview]
Message-ID: <20241107181626.GF1172372@cmpxchg.org> (raw)
In-Reply-To: <20241106192105.6731-14-kanchana.p.sridhar@intel.com>
On Wed, Nov 06, 2024 at 11:21:05AM -0800, Kanchana P Sridhar wrote:
> If the system has Intel IAA, and if sysctl vm.compress-batching is set to
> "1", zswap_store() will call crypto_acomp_batch_compress() to compress up
> to SWAP_CRYPTO_BATCH_SIZE (i.e. 8) pages in large folios in parallel using
> the multiple compress engines available in IAA hardware.
>
> On platforms with multiple IAA devices per socket, compress jobs from all
> cores in a socket will be distributed among all IAA devices on the socket
> by the iaa_crypto driver.
>
> With deflate-iaa configured as the zswap compressor, and
> sysctl vm.compress-batching is enabled, the first time zswap_store() has to
> swapout a large folio on any given cpu, it will allocate the
> pool->acomp_batch_ctx resources on that cpu, and set pool->can_batch_comp
> to BATCH_COMP_ENABLED. It will then proceed to call the main
> __zswap_store_batch_core() compress batching function. Subsequent calls to
> zswap_store() on the same cpu will go ahead and use the acomp_batch_ctx by
> checking the pool->can_batch_comp status.
>
> Hence, we allocate the per-cpu pool->acomp_batch_ctx resources only on an
> as-needed basis, to reduce memory footprint cost. The cost is not incurred
> on cores that never get to swapout a large folio.
>
> This patch introduces the main __zswap_store_batch_core() function for
> compress batching. This interface represents the extensible compress
> batching architecture that can potentially be called with a batch of
> any-order folios from shrink_folio_list(). In other words, although
> zswap_store() calls __zswap_store_batch_core() with exactly one large folio
> in this patch, we can reuse this interface to reclaim a batch of folios, to
> significantly improve the reclaim path efficiency due to IAA's parallel
> compression capability.
>
> The newly added functions that implement batched stores follow the
> general structure of zswap_store() of a large folio. Some amount of
> restructuring and optimization is done to minimize failure points
> for a batch, fail early and maximize the zswap store pipeline occupancy
> with SWAP_CRYPTO_BATCH_SIZE pages, potentially from multiple
> folios. This is intended to maximize reclaim throughput with the IAA
> hardware parallel compressions.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> include/linux/zswap.h | 84 ++++++
> mm/zswap.c | 625 ++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 709 insertions(+)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 9ad27ab3d222..6d3ef4780c69 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -31,6 +31,88 @@ struct zswap_lruvec_state {
> atomic_long_t nr_disk_swapins;
> };
>
> +/*
> + * struct zswap_store_sub_batch_page:
> + *
> + * This represents one "zswap batching element", namely, the
> + * attributes associated with a page in a large folio that will
> + * be compressed and stored in zswap. The term "batch" is reserved
> + * for a conceptual "batch" of folios that can be sent to
> + * zswap_store() by reclaim. The term "sub-batch" is used to describe
> + * a collection of "zswap batching elements", i.e., an array of
> + * "struct zswap_store_sub_batch_page *".
> + *
> + * The zswap compress sub-batch size is specified by
> + * SWAP_CRYPTO_BATCH_SIZE, currently set as 8UL if the
> + * platform has Intel IAA. This means zswap can store a large folio
> + * by creating sub-batches of up to 8 pages and compressing this
> + * batch using IAA to parallelize the 8 compress jobs in hardware.
> + * For e.g., a 64KB folio can be compressed as 2 sub-batches of
> + * 8 pages each. This can significantly improve the zswap_store()
> + * performance for large folios.
> + *
> + * Although the page itself is represented directly, the structure
> + * adds a "u8 batch_idx" to represent an index for the folio in a
> + * conceptual "batch of folios" that can be passed to zswap_store().
> + * Conceptually, this allows for up to 256 folios that can be passed
> + * to zswap_store(). If this conceptual number of folios sent to
> + * zswap_store() exceeds 256, the "batch_idx" needs to become u16.
> + */
> +struct zswap_store_sub_batch_page {
> + u8 batch_idx;
> + swp_entry_t swpentry;
> + struct obj_cgroup *objcg;
> + struct zswap_entry *entry;
> + int error; /* folio error status. */
> +};
> +
> +/*
> + * struct zswap_store_pipeline_state:
> + *
> + * This stores state during IAA compress batching of (conceptually, a batch of)
> + * folios. The term pipelining in this context, refers to breaking down
> + * the batch of folios being reclaimed into sub-batches of
> + * SWAP_CRYPTO_BATCH_SIZE pages, batch compressing and storing the
> + * sub-batch. This concept could be further evolved to use overlap of CPU
> + * computes with IAA computes. For instance, we could stage the post-compress
> + * computes for sub-batch "N-1" to happen in parallel with IAA batch
> + * compression of sub-batch "N".
> + *
> + * We begin by developing the concept of compress batching. Pipelining with
> + * overlap can be future work.
> + *
> + * @errors: The errors status for the batch of reclaim folios passed in from
> + * a higher mm layer such as swap_writepage().
> + * @pool: A valid zswap_pool.
> + * @acomp_ctx: The per-cpu pointer to the crypto_acomp_ctx for the @pool.
> + * @sub_batch: This is an array that represents the sub-batch of up to
> + * SWAP_CRYPTO_BATCH_SIZE pages that are being stored
> + * in zswap.
> + * @comp_dsts: The destination buffers for crypto_acomp_compress() for each
> + * page being compressed.
> + * @comp_dlens: The destination buffers' lengths from crypto_acomp_compress()
> + * obtained after crypto_acomp_poll() returns completion status,
> + * for each page being compressed.
> + * @comp_errors: Compression errors for each page being compressed.
> + * @nr_comp_pages: Total number of pages in @sub_batch.
> + *
> + * Note:
> + * The max sub-batch size is SWAP_CRYPTO_BATCH_SIZE, currently 8UL.
> + * Hence, if SWAP_CRYPTO_BATCH_SIZE exceeds 256, some of the
> + * u8 members (except @comp_dsts) need to become u16.
> + */
> +struct zswap_store_pipeline_state {
> + int *errors;
> + struct zswap_pool *pool;
> + struct crypto_acomp_ctx *acomp_ctx;
> + struct zswap_store_sub_batch_page *sub_batch;
> + struct page **comp_pages;
> + u8 **comp_dsts;
> + unsigned int *comp_dlens;
> + int *comp_errors;
> + u8 nr_comp_pages;
> +};
Why are these in the public header?
> unsigned long zswap_total_pages(void);
> bool zswap_store(struct folio *folio);
> bool zswap_load(struct folio *folio);
> @@ -45,6 +127,8 @@ bool zswap_never_enabled(void);
> #else
>
> struct zswap_lruvec_state {};
> +struct zswap_store_sub_batch_page {};
> +struct zswap_store_pipeline_state {};
>
> static inline bool zswap_store(struct folio *folio)
> {
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2af736e38213..538aac3fb552 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -255,6 +255,12 @@ static int zswap_create_acomp_ctx(unsigned int cpu,
> char *tfm_name,
> unsigned int nr_reqs);
>
> +static bool __zswap_store_batch_core(
> + int node_id,
> + struct folio **folios,
> + int *errors,
> + unsigned int nr_folios);
> +
Please reorder the functions to avoid forward decls.
> /*********************************
> * pool functions
> **********************************/
> @@ -1626,6 +1632,12 @@ static ssize_t zswap_store_page(struct page *page,
> return -EINVAL;
> }
>
> +/*
> + * Modified to use the IAA compress batching framework implemented in
> + * __zswap_store_batch_core() if sysctl vm.compress-batching is 1.
> + * The batching code is intended to significantly improve folio store
> + * performance over the sequential code.
This isn't helpful, please delete.
> bool zswap_store(struct folio *folio)
> {
> long nr_pages = folio_nr_pages(folio);
> @@ -1638,6 +1650,38 @@ bool zswap_store(struct folio *folio)
> bool ret = false;
> long index;
>
> + /*
> + * Improve large folio zswap_store() latency with IAA compress batching,
> + * if this is enabled by setting sysctl vm.compress-batching to "1".
> + * If enabled, the large folio's pages are compressed in parallel in
> + * batches of SWAP_CRYPTO_BATCH_SIZE pages. If disabled, every page in
> + * the large folio is compressed sequentially.
> + */
Same here. Reduce to "Try to batch compress large folios, fall back to
processing individual subpages if that fails."
> + if (folio_test_large(folio) && READ_ONCE(compress_batching)) {
> + pool = zswap_pool_current_get();
There is an existing zswap_pool_current_get() in zswap_store(), please
reorder the sequence so you don't need to add an extra one.
> + if (!pool) {
> + pr_err("Cannot setup acomp_batch_ctx for compress batching: no current pool found\n");
This is unnecessary.
> + goto sequential_store;
> + }
> +
> + if (zswap_pool_can_batch(pool)) {
This function is introduced in another patch, where it isn't
used. Please add functions and callers in the same patch.
> + int error = -1;
> + bool store_batch = __zswap_store_batch_core(
> + folio_nid(folio),
> + &folio, &error, 1);
> +
> + if (store_batch) {
> + zswap_pool_put(pool);
> + if (!error)
> + ret = true;
> + return ret;
> + }
> + }
Please don't future proof code like this, only implement what is
strictly necessary for the functionality in this patch. You're only
adding a single caller with nr_folios=1, so it shouldn't be a
parameter, and the function shouldn't have a that batch_idx loop.
> + zswap_pool_put(pool);
> + }
> +
> +sequential_store:
> +
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>
> @@ -1724,6 +1768,587 @@ bool zswap_store(struct folio *folio)
> return ret;
> }
>
> +/*
> + * Note: If SWAP_CRYPTO_BATCH_SIZE exceeds 256, change the
> + * u8 stack variables in the next several functions, to u16.
> + */
> +
> +/*
> + * Propagate the "sbp" error condition to other batch elements belonging to
> + * the same folio as "sbp".
> + */
> +static __always_inline void zswap_store_propagate_errors(
> + struct zswap_store_pipeline_state *zst,
> + u8 error_batch_idx)
> +{
Please observe surrounding coding style on how to wrap >80 col
function signatures.
Don't use __always_inline unless there is a clear, spelled out
performance reason. Since it's an error path, that's doubtful.
Please use a consistent namespace for all this:
CONFIG_ZSWAP_BATCH
zswap_batch_store()
zswap_batch_alloc_entries()
zswap_batch_add_folios()
zswap_batch_compress()
etc.
Again, order to avoid forward decls.
Try to keep the overall sequence of events between zswap_store() and
zswap_batch_store() similar as much as possible for readability.
next prev parent reply other threads:[~2024-11-07 18:16 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 01/13] crypto: acomp - Define two new interfaces for compress/decompress batching Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 02/13] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 03/13] crypto: iaa - Implement compress/decompress batching API in iaa_crypto Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 04/13] crypto: iaa - Make async mode the default Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 05/13] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 06/13] crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to IAAs Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 07/13] crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA node Kanchana P Sridhar
2024-11-06 19:21 ` [PATCH v3 08/13] mm: zswap: acomp_ctx mutex lock/unlock optimizations Kanchana P Sridhar
2024-11-08 20:14 ` Yosry Ahmed
2024-11-08 21:34 ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs Kanchana P Sridhar
2024-11-07 17:20 ` Johannes Weiner
2024-11-07 22:21 ` Sridhar, Kanchana P
2024-11-08 20:22 ` Yosry Ahmed
2024-11-08 21:39 ` Sridhar, Kanchana P
2024-11-08 22:54 ` Yosry Ahmed
2024-11-09 1:03 ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 10/13] mm: zswap: Add a per-cpu "acomp_batch_ctx" to struct zswap_pool Kanchana P Sridhar
2024-11-08 20:23 ` Yosry Ahmed
2024-11-09 1:04 ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 11/13] mm: zswap: Allocate acomp_batch_ctx resources for a given zswap_pool Kanchana P Sridhar
2024-11-07 17:31 ` Johannes Weiner
2024-11-07 22:22 ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout Kanchana P Sridhar
2024-11-06 20:17 ` Andrew Morton
2024-11-06 20:39 ` Sridhar, Kanchana P
2024-11-07 17:34 ` Johannes Weiner
2024-11-07 22:24 ` Sridhar, Kanchana P
2024-11-08 20:23 ` Yosry Ahmed
2024-11-09 1:05 ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
2024-11-07 18:16 ` Johannes Weiner [this message]
2024-11-07 22:32 ` Sridhar, Kanchana P
2024-11-07 18:53 ` Johannes Weiner
2024-11-07 22:50 ` Sridhar, Kanchana P
2024-11-06 20:25 ` [PATCH v3 00/13] zswap IAA compress batching Andrew Morton
2024-11-06 20:44 ` Sridhar, Kanchana P
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241107181626.GF1172372@cmpxchg.org \
--to=hannes@cmpxchg.org \
--cc=21cnbao@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=ardb@kernel.org \
--cc=chengming.zhou@linux.dev \
--cc=clabbe@baylibre.com \
--cc=davem@davemloft.net \
--cc=ebiggers@google.com \
--cc=herbert@gondor.apana.org.au \
--cc=kanchana.p.sridhar@intel.com \
--cc=kristen.c.accardi@intel.com \
--cc=linux-crypto@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nphamcs@gmail.com \
--cc=ryan.roberts@arm.com \
--cc=surenb@google.com \
--cc=usamaarif642@gmail.com \
--cc=vinodh.gopal@intel.com \
--cc=wajdi.k.feghali@intel.com \
--cc=ying.huang@intel.com \
--cc=yosryahmed@google.com \
--cc=zanussi@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox