From: Nhat Pham <nphamcs@gmail.com>
To: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
hannes@cmpxchg.org, yosry.ahmed@linux.dev,
chengming.zhou@linux.dev, usamaarif642@gmail.com,
ryan.roberts@arm.com, 21cnbao@gmail.com,
ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
herbert@gondor.apana.org.au, davem@davemloft.net,
clabbe@baylibre.com, ardb@kernel.org, ebiggers@google.com,
surenb@google.com, kristen.c.accardi@intel.com,
vinicius.gomes@intel.com, wajdi.k.feghali@intel.com,
vinodh.gopal@intel.com
Subject: Re: [PATCH v11 23/24] mm: zswap: zswap_store() will process a large folio in batches.
Date: Thu, 14 Aug 2025 14:05:04 -0700
Message-ID: <CAKEwX=Ov2X-0EnmWDCf=mahnR57si_QUuXF7F=Eb5P2cKYZEvg@mail.gmail.com>
In-Reply-To: <20250801043642.8103-24-kanchana.p.sridhar@intel.com>
On Thu, Jul 31, 2025 at 9:36 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch modifies zswap_store() to store a batch of pages in large
> folios at a time, instead of storing one page at a time. It does this by
> calling a new procedure zswap_store_pages() with a range of
> "pool->batch_size" indices in the folio.
>
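The batch iteration described above can be pictured with a small user-space sketch. This is only an illustration under assumptions, not the patch's code: store_in_batches(), store_batch_fn and count_pages() are hypothetical names, and the callback merely stands in for zswap_store_pages() operating on the half-open range [start, end).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type standing in for zswap_store_pages(). */
typedef int (*store_batch_fn)(long start, long end, void *ctx);

/*
 * Walk [0, nr_pages) in chunks of at most batch_size pages, calling
 * store_batch() on each half-open range [start, end). Stops at the
 * first failing batch.
 *
 * Returns the number of batches processed, or -1 if a batch failed.
 */
static long store_in_batches(long nr_pages, long batch_size,
			     store_batch_fn store_batch, void *ctx)
{
	long start, end, nr_batches = 0;

	assert(batch_size > 0);

	for (start = 0; start < nr_pages; start += batch_size) {
		/* The last batch may be shorter than batch_size. */
		end = start + batch_size < nr_pages ?
		      start + batch_size : nr_pages;

		if (!store_batch(start, end, ctx))
			return -1;
		nr_batches++;
	}

	return nr_batches;
}

/* Example callback: accumulates the number of pages it was handed. */
static int count_pages(long start, long end, void *ctx)
{
	*(long *)ctx += end - start;
	return 1;
}
```

With nr_pages = 10 and batch_size = 4, the callback would see the ranges [0, 4), [4, 8) and [8, 10); a folio smaller than the batch size yields a single short batch.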
> zswap_store_pages() performs, for multiple pages in a folio (the
> "batch"), all the computation previously done in zswap_store_page()
> for a single page:
>
> 1) It starts by allocating all zswap entries required to store the
> batch. New procedures, zswap_entries_cache_alloc_batch() and
> zswap_entries_cache_free_batch(), call kmem_cache_alloc_bulk() and
> kmem_cache_free_bulk() respectively to optimize the performance of
> this step.
>
> 2) Next, the entries' fields are written. These writes need to happen
> anyway, and doing them here does not modify the zswap xarray/LRU
> publishing order. This improves latency by avoiding having to bring
> the entries into the cache for writing in different code blocks
> within this procedure.
>
> 3) Next, it calls zswap_compress() to sequentially compress each page in
> the batch.
>
> 4) Finally, it adds the batch's zswap entries to the xarray and LRU,
> charges zswap memory and increments zswap stats.
>
> 5) The error handling and cleanup required for all failure scenarios
> that can occur while storing a batch in zswap are consolidated to a
> single "store_pages_failed" label in zswap_store_pages(). Here again,
> we optimize performance by calling kmem_cache_free_bulk().
>
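The consolidated failure handling in step 5 can be illustrated with a user-space sketch. It is a simplification under stated assumptions: struct entry, store_batch(), BATCH_MAX, HANDLE_INVALID and the live_handles counter are hypothetical, malloc()/free() stand in for the slab and zpool allocators, and a zero sentinel replaces the ERR_PTR(-EINVAL) value used in the patch. The point is only that pre-initializing each handle to a sentinel lets a single cleanup label decide which entries own backing memory.

```c
#include <assert.h>
#include <stdlib.h>

#define BATCH_MAX      8	/* hypothetical cap on entries per batch */
#define HANDLE_INVALID 0UL	/* hypothetical sentinel for "no backing memory" */

struct entry {
	unsigned long handle;	/* HANDLE_INVALID until backing memory exists */
};

static int live_handles;	/* outstanding backing allocations (demo only) */

/*
 * Allocate and "compress" nr entries into out[]. fail_at is the index
 * whose compression is forced to fail (-1 for none).
 *
 * Returns 1 on success, in which case the caller owns out[0..nr-1].
 * Returns 0 on failure, after undoing every allocation made here.
 */
static int store_batch(struct entry *out[], int nr, int fail_at)
{
	int i, nr_alloc = 0;

	assert(nr > 0 && nr <= BATCH_MAX);

	/* Allocate all entries up front and mark their handles invalid. */
	for (i = 0; i < nr; i++) {
		out[i] = malloc(sizeof(*out[i]));
		if (!out[i])
			goto failed;
		out[i]->handle = HANDLE_INVALID;
		nr_alloc++;
	}

	/* "Compress" each page; success gives the entry a valid handle. */
	for (i = 0; i < nr; i++) {
		if (i == fail_at)
			goto failed;
		out[i]->handle = (unsigned long)i + 1;
		live_handles++;
	}

	return 1;

failed:
	/* One label: release backing memory only where the handle is valid. */
	for (i = 0; i < nr_alloc; i++) {
		if (out[i]->handle != HANDLE_INVALID)
			live_handles--;
		free(out[i]);
	}
	return 0;
}
```

Calling store_batch(e, 4, 2) in this sketch fails at the third entry and leaves live_handles at zero, since the two entries that already had handles are released by the same label that frees the untouched ones.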
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/zswap.c | 218 ++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 149 insertions(+), 69 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 63a997b999537..8ca69c3f30df2 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -879,6 +879,24 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
> kmem_cache_free(zswap_entry_cache, entry);
> }
>
> +/*
> + * Returns 0 if kmem_cache_alloc_bulk() failed and a positive number otherwise.
> + * The code for __kmem_cache_alloc_bulk() indicates that this positive number
> + * will be the @size requested, i.e., @nr_entries.
> + */
> +static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
> + unsigned int nr_entries,
> + gfp_t gfp)
> +{
> + return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries, entries);
> +}
> +
> +static __always_inline void zswap_entries_cache_free_batch(void **entries,
> + unsigned int nr_entries)
> +{
> + kmem_cache_free_bulk(zswap_entry_cache, nr_entries, entries);
> +}
> +
> /*
> * Carries out the common pattern of freeing and entry's zpool allocation,
> * freeing the entry itself, and decrementing the number of stored pages.
> @@ -1512,93 +1530,154 @@ static void shrink_worker(struct work_struct *w)
> * main API
> **********************************/
>
> -static bool zswap_store_page(struct page *page,
> - struct obj_cgroup *objcg,
> - struct zswap_pool *pool)
> +/*
> + * Store multiple pages in @folio, starting from the page at index @start up to
> + * the page at index @end-1.
> + */
> +static bool zswap_store_pages(struct folio *folio,
> + long start,
> + long end,
> + struct obj_cgroup *objcg,
> + struct zswap_pool *pool,
> + int node_id)
> {
> - swp_entry_t page_swpentry = page_swap_entry(page);
> - struct zswap_entry *entry, *old;
> -
> - /* allocate entry */
> - entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> - if (!entry) {
> - zswap_reject_kmemcache_fail++;
> - return false;
> + struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
> + u8 i, store_fail_idx = 0, nr_pages = end - start;
> +
> + if (unlikely(!zswap_entries_cache_alloc_batch((void **)&entries[0],
> + nr_pages, GFP_KERNEL))) {
> + for (i = 0; i < nr_pages; ++i) {
> + entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> +
> + if (unlikely(!entries[i])) {
> + zswap_reject_kmemcache_fail++;
> + /*
> + * While handling this error, we only need to
> + * call zswap_entries_cache_free_batch() for
> + * entries[0 .. i-1].
> + */
> + nr_pages = i;
> + goto store_pages_failed;
> + }
> + }
> }
>
> - if (!zswap_compress(page, entry, pool))
> - goto compress_failed;
> + /*
> + * Three sets of initializations are done to minimize bringing
> + * @entries into the cache for writing at different parts of this
> + * procedure, since doing so regresses performance:
> + *
> + * 1) Do all the writes to each entry in one code block. These
> + * writes need to be done anyway upon success which is more likely
> + * than not.
> + *
> + * 2) Initialize the handle to an error value. This facilitates
> + * having a consolidated failure handling
> + * 'goto store_pages_failed' that can inspect the value of the
> + * handle to determine whether zpool memory needs to be
> + * de-allocated.
> + *
> + * 3) The page_swap_entry() is obtained once and stored in the entry.
> + * Subsequent store in xarray gets the entry->swpentry instead of
> + * calling page_swap_entry(), minimizing computes.
> + */
> + for (i = 0; i < nr_pages; ++i) {
> + entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> + entries[i]->pool = pool;
> + entries[i]->swpentry = page_swap_entry(folio_page(folio, start + i));
> + entries[i]->objcg = objcg;
> + entries[i]->referenced = true;
> + INIT_LIST_HEAD(&entries[i]->lru);
> + }
>
> - old = xa_store(swap_zswap_tree(page_swpentry),
> - swp_offset(page_swpentry),
> - entry, GFP_KERNEL);
> - if (xa_is_err(old)) {
> - int err = xa_err(old);
> + for (i = 0; i < nr_pages; ++i) {
> + struct page *page = folio_page(folio, start + i);
>
> - WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> - zswap_reject_alloc_fail++;
> - goto store_failed;
> + if (!zswap_compress(page, entries[i], pool))
> + goto store_pages_failed;
> }
>
> - /*
> - * We may have had an existing entry that became stale when
> - * the folio was redirtied and now the new version is being
> - * swapped out. Get rid of the old.
> - */
> - if (old)
> - zswap_entry_free(old);
> + for (i = 0; i < nr_pages; ++i) {
> + struct zswap_entry *old, *entry = entries[i];
>
> - /*
> - * The entry is successfully compressed and stored in the tree, there is
> - * no further possibility of failure. Grab refs to the pool and objcg,
> - * charge zswap memory, and increment zswap_stored_pages.
> - * The opposite actions will be performed by zswap_entry_free()
> - * when the entry is removed from the tree.
> - */
> - zswap_pool_get(pool);
> - if (objcg) {
> - obj_cgroup_get(objcg);
> - obj_cgroup_charge_zswap(objcg, entry->length);
> - }
> - atomic_long_inc(&zswap_stored_pages);
> + old = xa_store(swap_zswap_tree(entry->swpentry),
> + swp_offset(entry->swpentry),
> + entry, GFP_KERNEL);
> + if (unlikely(xa_is_err(old))) {
> + int err = xa_err(old);
>
> - /*
> - * We finish initializing the entry while it's already in xarray.
> - * This is safe because:
> - *
> - * 1. Concurrent stores and invalidations are excluded by folio lock.
> - *
> - * 2. Writeback is excluded by the entry not being on the LRU yet.
> - * The publishing order matters to prevent writeback from seeing
> - * an incoherent entry.
> - */
> - entry->pool = pool;
> - entry->swpentry = page_swpentry;
> - entry->objcg = objcg;
> - entry->referenced = true;
> - if (entry->length) {
> - INIT_LIST_HEAD(&entry->lru);
> - zswap_lru_add(&zswap_list_lru, entry);
> + WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> + zswap_reject_alloc_fail++;
> + /*
> + * Entries up to this point have been stored in the
> + * xarray. zswap_store() will erase them from the xarray
> + * and call zswap_entry_free(). Local cleanup in
> + * 'store_pages_failed' only needs to happen for
> + * entries from [@i to @nr_pages).
> + */
> + store_fail_idx = i;
> + goto store_pages_failed;
> + }
> +
> + /*
> + * We may have had an existing entry that became stale when
> + * the folio was redirtied and now the new version is being
> + * swapped out. Get rid of the old.
> + */
> + if (unlikely(old))
> + zswap_entry_free(old);
> +
> + /*
> + * The entry is successfully compressed and stored in the tree, there is
> + * no further possibility of failure. Grab refs to the pool and objcg,
> + * charge zswap memory, and increment zswap_stored_pages.
> + * The opposite actions will be performed by zswap_entry_free()
> + * when the entry is removed from the tree.
> + */
> + zswap_pool_get(pool);
> + if (objcg) {
> + obj_cgroup_get(objcg);
> + obj_cgroup_charge_zswap(objcg, entry->length);
> + }
> + atomic_long_inc(&zswap_stored_pages);
> +
> + /*
> + * We finish by adding the entry to the LRU while it's already
> + * in xarray. This is safe because:
> + *
> + * 1. Concurrent stores and invalidations are excluded by folio lock.
> + *
> + * 2. Writeback is excluded by the entry not being on the LRU yet.
> + * The publishing order matters to prevent writeback from seeing
> + * an incoherent entry.
> + */
> + if (likely(entry->length))
> + zswap_lru_add(&zswap_list_lru, entry);
> }
>
> return true;
>
> -store_failed:
> - zpool_free(pool->zpool, entry->handle);
> -compress_failed:
> - zswap_entry_cache_free(entry);
> +store_pages_failed:
> + for (i = store_fail_idx; i < nr_pages; ++i) {
> + if (!IS_ERR_VALUE(entries[i]->handle))
> + zpool_free(pool->zpool, entries[i]->handle);
> + }
> + zswap_entries_cache_free_batch((void **)&entries[store_fail_idx],
> + nr_pages - store_fail_idx);
> +
> return false;
> }
>
> bool zswap_store(struct folio *folio)
> {
> long nr_pages = folio_nr_pages(folio);
> + int node_id = folio_nid(folio);
> swp_entry_t swp = folio->swap;
> struct obj_cgroup *objcg = NULL;
> struct mem_cgroup *memcg = NULL;
> struct zswap_pool *pool;
> bool ret = false;
> - long index;
> + long start, end;
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1632,10 +1711,11 @@ bool zswap_store(struct folio *folio)
> mem_cgroup_put(memcg);
> }
>
> - for (index = 0; index < nr_pages; ++index) {
> - struct page *page = folio_page(folio, index);
> + /* Store the folio in batches of @pool->batch_size pages. */
> + for (start = 0; start < nr_pages; start += pool->batch_size) {
> + end = min(start + pool->batch_size, nr_pages);
>
> - if (!zswap_store_page(page, objcg, pool))
> + if (!zswap_store_pages(folio, start, end, objcg, pool, node_id))
> goto put_pool;
> }
>
> @@ -1665,9 +1745,9 @@ bool zswap_store(struct folio *folio)
> struct zswap_entry *entry;
> struct xarray *tree;
>
> - for (index = 0; index < nr_pages; ++index) {
> - tree = swap_zswap_tree(swp_entry(type, offset + index));
> - entry = xa_erase(tree, offset + index);
> + for (start = 0; start < nr_pages; ++start) {
> + tree = swap_zswap_tree(swp_entry(type, offset + start));
> + entry = xa_erase(tree, offset + start);
> if (entry)
> zswap_entry_free(entry);
> }
> --
> 2.27.0
>
This patch LGTM for the most part. Lemme test the series again (I
tested an old version of this patch series), and I will give my Ack.