* [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver
@ 2025-08-01  4:36 Kanchana P Sridhar
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar


Following Andrew's suggestion, the next two paragraphs emphasize generality
and alignment with current kernel efforts.

Architectural considerations for the zswap batching framework:
==============================================================
We have designed the zswap batching framework to be
hardware-agnostic. It has no dependencies on Intel-specific features and
can be leveraged by any hardware accelerator or software-based
compressor. In other words, the framework is open and inclusive by
design.

Other ongoing work that can use batching:
=========================================
This patch-series demonstrates the performance benefits of compress
batching when used in zswap_store() of large folios. shrink_folio_list()
"reclaim batching" of any-order folios is the next major work that will
use this zswap compress batching framework: our testing of
kernel_compilation with writeback and the zswap shrinker, with both
deflate-iaa and zstd, indicates that 10X fewer pages get written back
when we reclaim 32 folios as a batch, as compared to one folio at a
time. We expect to submit a patch-series with this data and the
resulting performance improvements shortly. Reclaim batching relieves
memory pressure faster than reclaiming one folio at a time, and hence
alleviates the need to scan slab memory for writeback.

Many thanks to Nhat for suggesting ideas on using batching with the
ongoing kcompressd work, as well as beneficially using decompression
batching & block IO batching to improve zswap writeback efficiency.

Experiments with kernel compilation benchmark (allmod config) that
combine zswap compress batching, reclaim batching, swapin_readahead()
decompression batching of prefetched pages, and writeback batching show
that 0 pages are written back to disk with deflate-iaa and zstd. For
comparison, the baselines for these compressors see 200K-800K pages
written to disk.

To summarize, these are future clients of the batching framework:

   - shrink_folio_list() reclaim batching of multiple folios:
       Implemented, will submit patch-series.
   - zswap writeback with decompress batching:
       Implemented, will submit patch-series.
   - zram:
       Implemented, will submit patch-series.
   - kcompressd:
       Not yet implemented.
   - file systems:
       Not yet implemented.
   - swapin_readahead() decompression batching of prefetched pages:
       Implemented, will submit patch-series.


iaa_crypto Driver Rearchitecting and Optimizations:
===================================================

The most significant highlight of v11 is a new, lightweight and highly
optimized iaa_crypto driver, resulting directly in the latency and
throughput improvements noted later in this cover letter.

 1) Better stability, more functionally versatile to support zswap and
    zram with better performance on different Intel platforms.

    a) Patches 0002, 0005 and 0010 together resolve a race condition in
       mainline v6.15, reported from internal validation, when IAA
       wqs/devices are disabled while workloads are using IAA.

    b) Patch 0002 introduces a new architecture for mapping cores to
       IAAs based on packages instead of NUMA nodes, and generalizing
       how WQs are used: as package level shared resources for all
       same-package cores (default for compress WQs), or dedicated to
       mapped cores (default for decompress WQs). Further, users are
       able to configure multiple WQs and specify how many of those are
       for compress jobs only vs. decompress jobs only. sysfs iaa_crypto
       driver parameters can be used to change the default settings for
       performance tuning.

    c) idxd descriptor allocation moved from blocking to non-blocking
       with retry limits and mitigations if limits are exceeded.

    d) Code cleanup for readability and clearer code flow.

    e) Fixes IAA re-registration errors, present in mainline v6.15, that
       occur upon disabling/enabling IAA wqs and devices.

    f) Rearchitecting iaa_crypto to be independent of crypto_acomp to
       enable a zram/zcomp backend_deflate_iaa.c, while fully supporting
       the crypto_acomp interfaces for zswap. A new
       include/linux/iaa_comp.h is added.

    g) New Dynamic compression mode for Granite Rapids that achieves a
       better compression ratio; it is selected by echo-ing
       'deflate-iaa-dynamic' as the zswap compressor.

    h) New crypto_acomp API crypto_acomp_batch_size() that returns the
       driver's maximum batch size if the driver has registered the new
       get_batch_size() acomp_alg interface, or 1 if there is no
       driver-specific implementation of get_batch_size(). (A usage
       sketch follows this list.)

       Accordingly, iaa_crypto provides an implementation of
       get_batch_size().

    i) A versatile set of interfaces independent of crypto_acomp for use
       in developing a zram zcomp backend for iaa_crypto.
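
    For illustration only, here is a minimal sketch (not the actual
    patch) of how a kernel user such as zswap might consume
    crypto_acomp_batch_size() at pool-creation time. The names
    pool->compr_batch_size, pool->batch_size and ZSWAP_MAX_BATCH_SIZE
    are from this cover letter; the min_t() clamp and the exact
    placement are assumptions:

       /*
        * Query the compressor's batch size once, at pool creation.
        * Drivers without a get_batch_size() implementation report 1.
        */
       pool->compr_batch_size = min_t(u8,
                       crypto_acomp_batch_size(acomp_ctx->acomp),
                       ZSWAP_MAX_BATCH_SIZE);

       /*
        * Software (non-batching) compressors are still processed in
        * units of ZSWAP_MAX_BATCH_SIZE pages, compressed sequentially.
        */
       pool->batch_size = (pool->compr_batch_size > 1) ?
                          pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;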

 2) Performance optimizations (please refer to the latency data per
    optimization in the commit logs):

    a) Distributing [de]compress jobs in round-robin manner to available
       IAAs on package.

    b) Replacing the compute-intensive iaa_wq_get()/iaa_wq_put() with a
       percpu_ref in struct iaa_wq, thereby eliminating acquiring a
       spinlock in the fast path, while using a combination of the
       iaa_crypto_enabled atomic with spinlocks in the slow path to
       ensure the compress/decompress code sees a consistent state of the
       wq tables.
       
    c) Directly call movdir64b for non-irq use cases, i.e., the most
       common usage, avoiding the overhead of irq-specific computation
       in idxd_submit_desc() and reducing latency.

    d) Batching of compressions/decompressions using async submit-poll
       mechanism to derive the benefits of hardware parallelism.

    e) Batching compressors now manage their own "request" abstraction,
       removing this driver-specific aspect from kernel users such as
       zswap. iaa_crypto maintains per-CPU "struct iaa_req **reqs" to
       submit multiple jobs to the hardware accelerator to run in
       parallel.

    f) Add a "void *kernel_data" member to struct acomp_req for use by
       kernel modules to pass batching data to algorithms that support
       batching. This allows us to enable compress batching with only
       the crypto_acomp_batch_size() API, and without changes to
       existing crypto_acomp API.

    g) Submit the two largest data buffers first for decompression
       batching, so that the longest running jobs get a head start,
       reducing latency for the batch.
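
    For illustration, the largest-first submission in (g) could look
    like the sketch below; the helper name and the "slen" member of
    "struct iaa_req" are assumptions, not the driver's actual code:

       /*
        * Move the two largest source buffers to the front of the batch,
        * so the longest-running decompress jobs are submitted first.
        */
       static void iaa_order_largest_first(struct iaa_req **reqs, int nr)
       {
               int pos, i;

               for (pos = 0; pos < 2 && pos < nr; pos++) {
                       int max = pos;

                       for (i = pos + 1; i < nr; i++)
                               if (reqs[i]->slen > reqs[max]->slen)
                                       max = i;

                       swap(reqs[pos], reqs[max]);
               }
       }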


Main Changes in Zswap Compression Batching:
===========================================

 Note to zswap maintainers:
 --------------------------
 Patches 20 and 21 can be reviewed and improved/merged independently
 of this series, since they are zswap-centric. These two patches help
 batching, but unlike patches 22-24, they do not require the
 crypto_acomp_batch_size() API added by the iaa_crypto commits in this
 series.
 
 1) v11 preserves the simplified pool acomp_ctx resources
    creation/deletion of v9: the resources last from pool creation to
    pool deletion, persisting through CPU hot[un]plug operations.
    Further, zswap no longer needs to create multiple "struct acomp_req"
    in the per-CPU acomp_ctx. zswap only needs to manage multiple
    "u8 **buffers".

 2) We store the compressor's batch-size (@pool->compr_batch_size) and
    the batch-size to use during compression batching
    (@pool->batch_size) directly in struct zswap_pool for quick
    retrieval in the zswap_store() fast path.

 3) Optimizations to avoid regressions in software compressors with
    the introduction of the new unified zswap_compress() procedure that
    implements compression batching for all compressors. Since v9, the
    new zpool_malloc() interface that allocates pool memory on the
    folio's NUMA node caused some performance loss when used in the new
    zswap_compress() batching implementation (verified by replacing
    page_to_nid(page) with NUMA_NO_NODE). The following optimizations
    help recover the performance and are included in this series:

    a) kmem_cache_alloc_bulk(), kmem_cache_free_bulk() to allocate/free
       batch zswap_entry-s. These kmem_cache APIs let the allocator
       optimize multiple allocations under its internal locks.

    b) Write to the zswap_entry right after it is allocated, without
       modifying the publishing order. This avoids different code blocks
       in zswap_store_pages() having to bring the zswap_entries into the
       cache for writing, which could evict other working-set structures
       and hurt performance.

    c) ZSWAP_MAX_BATCH_SIZE is used as the batch-size for software
       compressors, since this gives the best performance with zstd when
       writeback is enabled, and does not regress performance when
       writeback is not enabled.

    d) More likely()/unlikely() annotations to try to minimize branch
       mispredictions.

 4) "struct swap_batch_comp_data" and "struct swap_batch_decomp_data"
     added in mm/swap.h:

     /*
      * A compression algorithm that wants to batch compressions/decompressions
      * must define its own internal data structures that exactly mirror
      * @struct swap_batch_comp_data and @struct swap_batch_decomp_data.
      */

     Accordingly, zswap_compress() uses struct swap_batch_comp_data to
     pass batching data in the acomp_req->kernel_data
     pointer if the compressor supports batching.

     include/linux/iaa_comp.h has matching definitions of
     "struct iaa_batch_comp_data" and "struct iaa_batch_decomp_data".

     Feedback from the zswap maintainers is requested on whether this
     is a good approach. Suggestions for alternative approaches are also
     very welcome.
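
     For concreteness, this is a purely illustrative sketch of what such
     mirrored structures and their hand-off through
     acomp_req->kernel_data could look like. The member names below are
     assumptions for illustration, not the actual definitions in
     mm/swap.h / include/linux/iaa_comp.h:

        /* mm/swap.h (illustrative layout): */
        struct swap_batch_comp_data {
                struct page **pages;    /* pages to compress */
                u8 **dsts;              /* per-page destination buffers */
                unsigned int *dlens;    /* out: compressed lengths */
                int *errors;            /* per-page completion status */
                u8 nr_comps;            /* number of pages in the batch */
        };

        /* zswap_compress(), batching path (illustrative): */
        struct swap_batch_comp_data batch_comp_data = {
                .pages    = pages,
                .dsts     = acomp_ctx->buffers,
                .dlens    = dlens,
                .errors   = errors,
                .nr_comps = nr_pages,
        };

        acomp_ctx->req->kernel_data = &batch_comp_data;
        err = crypto_acomp_compress(acomp_ctx->req);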


Compression Batching:
=====================

This patch-series introduces batch compression of pages in large folios to
improve zswap swapout latency. It preserves the existing zswap protocols
for non-batching software compressors by calling crypto_acomp sequentially
per page in the batch. Additionally, in support of hardware accelerators
that can process a batch as an integral unit, the patch-series allows
zswap to call crypto_acomp without API changes, for compressors
that intrinsically support batching.

The patch series provides a proof point by using the Intel Analytics
Accelerator (IAA) for implementing the compress/decompress batching API
using hardware parallelism in the iaa_crypto driver and another proof point
with a sequential software compressor, zstd.

SUMMARY:
========

  The first proof point is to test with IAA using a sequential call (fully
  synchronous, compress one page at a time) vs. a batching call (fully
  asynchronous, submit a batch to IAA for parallel compression, then poll for
  completion statuses).
  
    The performance testing data with 30 usemem processes/64K folios
    shows 52% throughput gains and 24% elapsed/sys time reductions with
    deflate-iaa; and 11% sys time reduction with zstd for a small
    throughput increase.

    Kernel compilation test with 64K folios using 28 threads and the
    zswap shrinker_enabled set to "Y", demonstrates similar
    improvements: zswap_store() large folios using IAA compress batching
    improves the workload performance by 6.8% and reduces sys time by
    19% as compared to IAA sequential. For zstd, compress batching
    improves workload performance by 5.2% and reduces sys time by
    27.4% as compared to sequentially calling zswap_compress() per page
    in a folio.

  The second proof point is to make sure that software algorithms such as
  zstd do not regress. The data indicates that for sequential software
  algorithms a performance gain is achieved. 
  
    With the performance optimizations implemented in patches 22-24
    of v11:
    *  zstd usemem30 throughput with PMD folios increases by 1%.
       Throughput with 64K folios is within the range of variation, with
       a slight improvement. Workload (elapsed time) performance with
       zstd improves by 8% and 6%, and sys time reduces by 11% and 8%,
       with 64K and PMD folios respectively.

    *  With kernel compilation using zstd with the zswap shrinker, we
       get a 27.4%-28.2% reduction in sys time, a 5.2%-2.1% improvement
       in workload performance, and 65%-59% fewer pages written back to
       disk for 64K/PMD folios respectively.

    These optimizations pertain to ensuring common code paths, removing
    redundant branches/computes, using prefetchw() of the zswap entry
    before it is written, and selectively annotating branches with
    likely()/unlikely() compiler directives to minimize branch
    mis-prediction penalty. Additionally, using the batching code for
    non-batching compressors to sequentially compress/store batches of up
    to ZSWAP_MAX_BATCH_SIZE pages seems to help, most likely due to
    cache locality of working set structures such as the array of
    zswap_entry-s for the batch.
  
    Our internal validation of zstd with the batching interface vs. IAA with
    the batching interface on Emerald Rapids has shown that IAA
    compress/decompress batching gives 21.3% more memory savings as compared
    to zstd, for 5% performance loss as compared to the baseline without any
    memory pressure. IAA batching demonstrates more than 2X the memory
    savings obtained by zstd at this 95% performance KPI.
    The compression ratio with IAA is 2.23, and with zstd 2.96. Even with
    this compression ratio deficit for IAA, batching is extremely
    beneficial. As we improve the compression ratio of the IAA accelerator,
    we expect to see even better memory savings with IAA as compared to
    software compressors.
    

  Batching Roadmap:
  =================

  1) Compression batching within large folios (this series).
  
  2) zswap writeback decompression batching:

     This is being co-developed with Nhat Pham, and shows promising
     results. We plan to submit an RFC shortly.

  3) Reclaim batching of hybrid folios:
  
     We can expect to see even more significant performance and throughput
     improvements if we use the parallelism offered by IAA to do reclaim
     batching of 4K/large folios (really any-order folios), and using the
     zswap_store() high throughput compression pipeline to batch-compress
     pages comprising these folios, not just batching within large
     folios. This is the reclaim batching patch 13 in v1, which we expect
     to submit in a separate patch-series. As mentioned earlier, reclaim
     batching reduces the # of writeback pages by 10X for zstd and
     deflate-iaa.

  4) swapin_readahead() decompression batching:

     We have developed a zswap load batching interface to be used
     for parallel decompression batching, using swapin_readahead().
  
  These capabilities are architected so as to be useful to zswap and
  zram. We are actively working on integrating these components with zram.

 
  v11 Performance Summary:
  ========================

  This is a performance testing summary of results with usemem30
  (30 usemem processes running in a cgroup limited at 150G, each trying to
   allocate 10G).

  zswap shrinker_enabled = N.
  
  usemem30 with 64K folios:
  =========================
  
     -----------------------------------------------------------------------
                     mm-unstable-7-30-2025             v11
     -----------------------------------------------------------------------
     zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
                                                                 vs.
                                                             IAA Sequential
     -----------------------------------------------------------------------
     Total throughput (KB/s)     7,153,359      10,856,388        52%
     Avg throughput (KB/s)         238,445         361,879                
     elapsed time (sec)              92.61           70.50       -24%      
     sys time (sec)               2,193.59        1,675.32       -24%      
     -----------------------------------------------------------------------
    
     -----------------------------------------------------------------------
                     mm-unstable-7-30-2025             v11    
     -----------------------------------------------------------------------
     zswap compressor                 zstd            zstd   v11 zstd    
                                                             improvement  
     -----------------------------------------------------------------------
     Total throughput (KB/s)     6,866,411       6,874,244       0.1%
     Avg throughput (KB/s)         228,880         229,141            
     elapsed time (sec)              96.45           89.05        -8%
     sys time (sec)               2,410.72        2,150.63       -11%         
     -----------------------------------------------------------------------


  usemem30 with 2M folios:
  ========================
  
     -----------------------------------------------------------------------
                     mm-unstable-7-30-2025             v11
     -----------------------------------------------------------------------
     zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
                                                                 vs.
                                                             IAA Sequential
     -----------------------------------------------------------------------
     Total throughput (KB/s)     7,268,929      11,312,195        56%     
     Avg throughput (KB/s)         242,297         377,073                 
     elapsed time (sec)              80.40           68.73       -15%     
     sys time (sec)               1,856.54        1,599.25       -14%     
     -----------------------------------------------------------------------
  
     -----------------------------------------------------------------------
                     mm-unstable-7-30-2025             v11      
     -----------------------------------------------------------------------
     zswap compressor                 zstd            zstd   v11 zstd           
                                                             improvement
     -----------------------------------------------------------------------
     Total throughput (KB/s)     7,560,441       7,627,155       0.9%
     Avg throughput (KB/s)         252,014         254,238            
     elapsed time (sec)              88.89           83.22        -6%
     sys time (sec)               2,132.05        1,952.98        -8%
     -----------------------------------------------------------------------


  This is a performance testing summary of results with
  kernel_compilation test (allmod config, 28 cores, cgroup limited to 2G).

  Writeback to disk is enabled by setting zswap shrinker_enabled = Y.
  
  kernel_compilation with 64K folios:
  ===================================

     --------------------------------------------------------------------------
                        mm-unstable-7-30-2025             v11
     --------------------------------------------------------------------------
     zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
                                                                     vs.
                                                                 IAA Sequential
     --------------------------------------------------------------------------
     real_sec                          901.81          840.60       -6.8%
     sys_sec                         2,672.93        2,171.17        -19%
     zswpout                       34,700,692      24,076,095        -31%
     zswap_written_back_pages       2,612,474       1,451,961        -44%
     --------------------------------------------------------------------------

     --------------------------------------------------------------------------
                        mm-unstable-7-30-2025             v11
     --------------------------------------------------------------------------
     zswap compressor                    zstd            zstd    Improvement
     --------------------------------------------------------------------------
     real_sec                          882.67          837.21       -5.2%
     sys_sec                         3,573.31        2,593.94      -27.4%
     zswpout                       42,768,967      22,660,215        -47%
     zswap_written_back_pages       2,109,739         727,919        -65%
     --------------------------------------------------------------------------


  kernel_compilation with PMD folios:
  ===================================

     --------------------------------------------------------------------------
                        mm-unstable-7-30-2025             v11
     --------------------------------------------------------------------------
     zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
                                                                     vs.
                                                                 IAA Sequential
     --------------------------------------------------------------------------
     real_sec                          838.76          804.83         -4%
     sys_sec                         3,173.57        2,422.63        -24%
     zswpout                       59,544,198      38,093,995        -36%
     zswap_written_back_pages       2,726,367         929,614        -66%
     --------------------------------------------------------------------------
 
 
     --------------------------------------------------------------------------
                        mm-unstable-7-30-2025             v11
     --------------------------------------------------------------------------
     zswap compressor                    zstd            zstd    Improvement
     --------------------------------------------------------------------------
     real_sec                          831.09          813.40       -2.1%
     sys_sec                         4,251.11        3,053.95      -28.2%
     zswpout                       59,452,638      35,832,407        -40%
     zswap_written_back_pages       1,041,721         423,334        -59%
     --------------------------------------------------------------------------



DETAILS:
========

(A) From zswap's perspective, the most significant changes are:
===============================================================

1) A unified zswap_compress() API is added to compress multiple
   pages:

   - If the compressor supports batching, the pages are compressed as a
     batch, with batching data passed to the driver through
     acomp_req->kernel_data. If all pages are successfully compressed,
     the batch is stored in zpool.

   - If the compressor can only compress one page at a time, each page
     is compressed and stored sequentially.

   Many thanks to Yosry for this suggestion, because it is an essential
   component of unifying common code paths between sequential/batching
   compressions.

   prefetchw() is used in zswap_compress() to minimize cache-miss
   latency by moving the zswap entry into the cache before it is written
   to; this reduces sys time by ~1.5% for zstd (non-batching software
   compression). In other words, this optimization helps both batching
   and software compressors.

   Overall, the prefetchw() and likely()/unlikely() annotations prevent
   regressions with software compressors like zstd, and generally improve
   non-batching compressors' performance with the batching code by ~8%.
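
   As an illustration of the prefetchw() idea, the loop below is a
   simplified sketch (not the actual zswap_compress() code); "entries"
   is the batch's pre-allocated zswap_entry array:

      /* prefetchw() is declared in <linux/prefetch.h>. */
      for (i = 0; i < nr_pages; i++) {
              /* Bring the entry to be written into the cache early. */
              prefetchw(entries[i]);

              /* ... compress page i into acomp_ctx->buffers[i] ... */

              entries[i]->handle = handle;
              entries[i]->length = dlen;
      }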

2) A new zswap_store_pages() is added, that stores multiple pages in a
   folio in a range of indices. This is an extension of the earlier
   zswap_store_page(), except it operates on a batch of pages.

3) zswap_store() is modified to store the folio's pages in batches
   by calling zswap_store_pages(). If the compressor supports batching,
   the folio will be compressed in batches of
   "pool->compr_batch_size". If the compressor does not support
   batching, the folio will be compressed in batches of
   ZSWAP_MAX_BATCH_SIZE pages, where each page in the batch is
   compressed sequentially. We see better performance by processing
   the folio in batches of ZSWAP_MAX_BATCH_SIZE, due to cache locality
   of working set structures such as the array of zswap_entry-s for the
   batch.

   Many thanks to Yosry and Johannes for steering towards a common
   design and code paths for sequential and batched compressions (i.e.,
   for software compressors and hardware accelerators such as IAA). As per
   Yosry's suggestion in v8, the "batch_size" is an attribute of the
   compressor/pool, and hence is stored in struct zswap_pool instead of
   in struct crypto_acomp_ctx.

4) Simplifications to the acomp_ctx resources allocation/deletion
   vis-a-vis CPU hot[un]plug. This further improves upon v8 of this
   patch-series based on the discussion with Yosry, and formalizes the
   lifetime of these resources from pool creation to pool
   deletion. zswap does not register a CPU hotplug teardown
   callback. The acomp_ctx resources will persist through CPU
   online/offline transitions. The main changes made to avoid UAF/race
   conditions, and correctly handle process migration, are:

   a) No acomp_ctx mutex locking in zswap_cpu_comp_prepare().
   b) No CPU hotplug teardown callback, no acomp_ctx resources deleted.
   c) New acomp_ctx_dealloc() procedure that cleans up the acomp_ctx
      resources, and is shared by
      zswap_cpu_comp_prepare()/zswap_pool_create() error handling and
      zswap_pool_destroy().
   d) The zswap_pool node list instance is removed right after the node
      list add function in zswap_pool_create().
   e) We directly call mutex_[un]lock(&acomp_ctx->mutex) in
      zswap_[de]compress(). acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock()
      are deleted.

   The commit log of patch 0020 has a more detailed analysis.
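
   As an illustration of (b) above, the hotplug state registration would
   keep only the prepare callback. This is a hedged sketch based on the
   existing zswap_setup() call, not the patch itself:

      /*
       * Register only the prepare callback; a NULL teardown means the
       * per-CPU acomp_ctx resources persist across CPU offline/online
       * and are freed only by acomp_ctx_dealloc() at pool destruction
       * (or on pool-creation error).
       */
      ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
                                    "mm/zswap_pool:prepare",
                                    zswap_cpu_comp_prepare,
                                    NULL);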


(B) Main changes in crypto_acomp and iaa_crypto:
================================================

1) A new architecture is introduced for IAA device WQs' usage as:
   - compress only
   - decompress only
   - generic, i.e., both compress/decompress.

   Further, IAA devices/wqs are assigned to cores based on packages
   instead of NUMA nodes.

   The WQ rebalancing algorithm that is invoked as WQs are
   discovered/deleted has been made very general and flexible so that
   the user can control exactly how IAA WQs are used. In addition to the
   user being able to specify a WQ type as comp/decomp/generic, the user
   can also configure if WQs need to be shared among all same-package
   cores, or, whether the cores should be divided up amongst the
   available IAA devices.

   If distribute_[de]comps is enabled, from a given core's perspective,
   the iaa_crypto driver will distribute comp/decomp jobs among all
   devices' WQs in round-robin manner. This improves batching latency
   and can improve compression/decompression throughput for workloads
   that see a lot of swap activity.

   The commit log of patch 0002 provides more details on new iaa_crypto
   driver parameters added, along with recommended settings (defaults
   are optimal settings).

2) Compress/decompress batching are implemented using
   crypto_acomp_[de]compress() with batching data passed to the driver
   using the acomp_req->kernel_data pointer.
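
   As a rough illustration of the round-robin distribution described in
   (1) above, the per-CPU look-up could work as sketched below; the
   structure and member names are illustrative, not the driver's actual
   definitions:

      /* Per-package compress WQ table, referenced per CPU. */
      struct comp_wq_table {
              struct idxd_wq **wqs;   /* same-package compress WQs */
              unsigned int n_wqs;
              unsigned int cur_wq;
      };

      static struct idxd_wq *comp_wq_next(struct comp_wq_table *t)
      {
              struct idxd_wq *wq = t->wqs[t->cur_wq];

              t->cur_wq = (t->cur_wq + 1) % t->n_wqs;
              return wq;
      }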


(C) The patch-series is organized as follows:
=============================================

 1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
    patches are tagged with "crypto:" in the subject:

    Patch 1) Reorganizes the iaa_crypto driver code into logically related
             sections and avoids forward declarations, in order to facilitate
             subsequent iaa_crypto patches. This patch makes no
             functional changes.

    Patch 2) Makes an infrastructure change in the iaa_crypto driver
             to map IAA devices/work-queues to cores based on packages
             instead of NUMA nodes. This doesn't impact performance on
             the Sapphire Rapids system used for performance
             testing. However, this change fixes functional problems we
             found on Granite Rapids during internal validation, where the
             number of NUMA nodes is greater than the number of packages,
             which was resulting in over-utilization of some IAA devices
             and non-usage of other IAA devices as per the current NUMA
             based mapping infrastructure.

             This patch also develops a new architecture that
             generalizes how IAA device WQs are used. It enables
             designating IAA device WQs as either compress-only or
             decompress-only or generic. Once IAA device WQ types are
             thus defined, it also allows the configuration of whether
             device WQs will be shared by all cores on the package, or
             used only by "mapped cores" obtained by a simple allocation
             of available IAAs to cores on the package.

             As a result of the overhaul of wq_table definition,
             allocation and rebalancing, this patch eliminates
             duplication of device WQs in per-CPU wq_tables, thereby
              saving 140MiB on a 384-core, dual-socket Granite Rapids
              server with 8 IAAs.

             Regardless of how the user has configured the WQs' usage,
             the next WQ to use is obtained through a direct look-up in
             per-CPU "cpu_comp_wqs" and "cpu_decomp_wqs" structures so
             as to minimize latency in the critical path driver compress
             and decompress routines.

    Patch 3) Code cleanup, consistency of function parameters.

    Patch 4) Makes a change to iaa_crypto driver's descriptor allocation,
             from blocking to non-blocking with retries/timeouts and
             mitigations in case of timeouts during compress/decompress
             ops. This prevents tasks getting blocked indefinitely, which
             was observed when testing 30 cores running workloads, with
             only 1 IAA enabled on Sapphire Rapids (out of 4). These
              timeouts are typically encountered, and the associated
              mitigations exercised, only in configurations with 1 IAA
              device shared by 30+ cores.

    Patch 5) Optimize iaa_wq refcounts using a percpu_ref instead of
             spinlocks and "int refcount".

    Patch 6) Code simplification and restructuring for understandability
             in core iaa_compress() and iaa_decompress() routines.

    Patch 7) Refactor hardware descriptor setup to their own procedures
             to reduce code clutter.

    Patch 8) Simplify and optimize (i.e. reduce computes) job submission
             for the most commonly used non-irq async mode by directly
             calling movdir64b.

    Patch 9) Deprecate exporting symbols for adding IAA compression
             modes.

    Patch 10) Rearchitect iaa_crypto to be agnostic of crypto_acomp so
              that it is usable in both zswap and zram. crypto_acomp
              interfaces are maintained as earlier, for use in zswap.

    Patch 11) Descriptor submit and polling mechanisms, enablers for batching.

    Patch 12) Add a "void *kernel_data" member to struct acomp_req. This
              gets initialized to NULL in acomp_request_set_params().

    Patch 13) Implement IAA batching of compressions and decompressions
              for deriving hardware parallelism.

    Patch 14) Enables the "async" mode, sets it as the default.

    Patch 15) Disables verify_compress by default.

    Patch 16) Decompress batching optimization: Find the two largest
              buffers in the batch and submit them first.
             
    Patch 17) Add a new Dynamic compression mode that can be used on
              Granite Rapids.

    Patch 18) Add get_batch_size() to structs acomp_alg/crypto_acomp and
              a crypto_acomp_batch_size() API that returns the compressor's
              batch-size, if it has provided an implementation for
              get_batch_size(); 1 otherwise.

    Patch 19) iaa-crypto implementation for get_batch_size(), that
              returns an iaa_driver specific constant,
              IAA_CRYPTO_MAX_BATCH_SIZE (set to 8U currently).
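
     For illustration, the patch 19 hook could look like the sketch
     below; the names iaa_comp_acompress/iaa_comp_adecompress and
     iaa_acomp_fixed_deflate are from the existing mainline driver,
     while the get_batch_size() signature and member placement are
     assumptions:

        static unsigned int iaa_comp_get_batch_size(struct crypto_acomp *tfm)
        {
                return IAA_CRYPTO_MAX_BATCH_SIZE;       /* 8U */
        }

        static struct acomp_alg iaa_acomp_fixed_deflate = {
                .compress       = iaa_comp_acompress,
                .decompress     = iaa_comp_adecompress,
                .get_batch_size = iaa_comp_get_batch_size,
                /* ... other members unchanged ... */
        };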


 2) zswap modifications to enable compress batching in zswap_store()
    of large folios (including pmd-mappable folios):

    Patch 20) Simplifies the zswap_pool's per-CPU acomp_ctx resource
              management and lifetime to be from pool creation to pool
              deletion.

    Patch 21) Uses IS_ERR_OR_NULL() in zswap_cpu_comp_prepare() to check for
              valid acomp/req, thereby making it consistent with the resource
              de-allocation code.

    Patch 22) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently set
              as 8U) to denote the maximum number of acomp_ctx batching
              resources to allocate, thus limiting the amount of extra
              memory used for batching. Further, the "struct
              crypto_acomp_ctx" is modified to contain multiple buffers.
              New "u8 compr_batch_size" and "u8 batch_size" members are
              added to "struct zswap_pool" to track the number of dst
              buffers associated with the compressor (more than 1 if
              the compressor supports batching) and the unit for storing
              large folios using compression batching respectively.

    Patch 23) Modifies zswap_store() to store the folio in batches of
              pool->batch_size by calling a new zswap_store_pages() that
              takes a range of indices in the folio to be stored.
              zswap_store_pages() pre-allocates zswap entries for the
              batch, calls zswap_compress() for each page in this range,
              and stores the entries in xarray/LRU (a condensed sketch
              follows this list).

    Patch 24) Introduces a new unified implementation of zswap_compress()
              for compressors that do and do not support batching. This
              eliminates code duplication and facilitates maintainability of
              the code with the introduction of compress batching. Further,
              there are many optimizations to this common code that result
              in workload throughput and performance improvements with
              software compressors and hardware accelerators such as IAA.

              zstd performance is better or on par with mm-unstable. We
              see impressive throughput/performance improvements with
              IAA and zstd batching vs. no-batching.
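
     To make patch 23's flow concrete, here is a condensed, illustrative
     sketch of zswap_store() / zswap_store_pages(); the function
     signatures and error handling are simplified assumptions, and the
     names follow this cover letter and mm/zswap.c:

        /* zswap_store(), simplified: process the folio in batches. */
        long start, end;

        for (start = 0; start < folio_nr_pages(folio);
             start += pool->batch_size) {
                end = min_t(long, start + pool->batch_size,
                            folio_nr_pages(folio));

                if (!zswap_store_pages(folio, start, end, pool))
                        goto reject;
        }

        /*
         * zswap_store_pages(), simplified: bulk-allocate the batch's
         * zswap_entry-s, compress the pages, then publish the entries
         * to the xarray/LRU.
         */
        struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
        unsigned int nr = end - start;

        if (unlikely(!kmem_cache_alloc_bulk(zswap_entry_cache, GFP_KERNEL,
                                            nr, (void **)entries)))
                return false;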


With v11 of this patch series, the IAA compress batching feature will be
enabled seamlessly on Intel platforms that have IAA by selecting
'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async'
sync_mode driver attribute (the default).


System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 7-30-2025,
commit 01da54f10fdd, without and with this patch-series. Data was
gathered on an Intel Sapphire Rapids (SPR) server, dual-socket 56 cores
per socket, 4 IAA devices per socket, each IAA has total 128 WQ entries,
503 GiB RAM and 525G SSD disk partition swap. Core frequency was fixed
at 2500MHz.

Other kernel configuration parameters:

    zswap compressor  : zstd, deflate-iaa
    zswap allocator   : zsmalloc
    vm.page-cluster   : 0

IAA "compression verification" is disabled and IAA is run in the async
mode (the defaults with this series).

I ran experiments with these workloads:

1) usemem 30 processes with zswap shrinker_enabled=N. Two sets of
   experiments, one with 64K folios, another with PMD folios.

2) Kernel compilation allmodconfig with 2G max memory, 28 threads, with
   zswap shrinker_enabled=Y to test batching performance impact when
   writeback is enabled. Two sets of experiments, one with 64K folios,
   another with PMD folios.

IAA configuration is done by a CLI script that is included at the end
of the cover letter.


Performance testing (usemem30):
===============================
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and
sleeping for 10 sec before exiting:

 usemem --init-time -w -O -b 1 -s 10 -n 30 10g
 echo 0 > /sys/module/zswap/parameters/shrinker_enabled

 IAA WQ Configuration (script is included at the end of the cover
 letter):

   ./enable_iaa.sh -d 4 -q 1
   
 This enables all 4 IAAs on the socket, and configures 1 WQ per IAA
 device, each containing 128 entries. The driver distributes compress
 jobs from each core to wqX.0 of all same-package IAAs in a
 round-robin manner. Decompress jobs are sent to the wqX.0 of the
 mapped IAA device.

 Since usemem has significantly more swapouts than swapins, this
 configuration is the most optimal.

 64K folios: usemem30: deflate-iaa:
 ==================================

 -------------------------------------------------------------------------------
                    mm-unstable-7-30-2025             v11
 -------------------------------------------------------------------------------
 zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
                                                                 vs.
                                                             IAA Sequential
 -------------------------------------------------------------------------------
 Total throughput (KB/s)        7,153,359      10,856,388         52%
 Avg throughput (KB/s)            238,445         361,879                
 elapsed time (sec)                 92.61           70.50        -24%      
 sys time (sec)                  2,193.59        1,675.32        -24%      
                                                                         
 -------------------------------------------------------------------------------
 memcg_high                     1,061,494       1,340,863                
 memcg_swap_fail                    1,496             240                
 64kB_swpout_fallback               1,496             240                
 zswpout                       61,642,322      71,374,066                
 zswpin                               130             250                
 pswpout                                0               0                
 pswpin                                 0               0                
 ZSWPOUT-64kB                   3,851,135       4,460,571                
 SWPOUT-64kB                            0               0
 pgmajfault                         2,446           2,545
 zswap_reject_compress_fail             0               0
 zswap_reject_reclaim_fail              0               0
 zswap_pool_limit_hit                   0               0
 zswap_written_back_pages               0               0
 IAA incompressible pages               0               0
 -------------------------------------------------------------------------------


 2M folios: usemem30: deflate-iaa:
 =================================

 -------------------------------------------------------------------------------
                    mm-unstable-7-30-2025             v11
 -------------------------------------------------------------------------------
 zswap compressor             deflate-iaa     deflate-iaa     IAA Batching
                                                                  vs.
                                                              IAA Sequential
 -------------------------------------------------------------------------------
 Total throughput (KB/s)        7,268,929      11,312,195         56%     
 Avg throughput (KB/s)            242,297         377,073       
 elapsed time (sec)                 80.40           68.73        -15%     
 sys time (sec)                  1,856.54        1,599.25        -14%     
                                                                
 -------------------------------------------------------------------------------
 memcg_high                        99,426         119,834      
 memcg_swap_fail                      371             293      
 thp_swpout_fallback                  371             293      
 zswpout                       63,227,705      71,567,857      
 zswpin                               456             482      
 pswpout                                0               0      
 pswpin                                 0               0      
 ZSWPOUT-2048kB                   123,119         139,505      
 thp_swpout                             0               0      
 pgmajfault                         2,901           2,813 
 zswap_reject_compress_fail             0               0
 zswap_reject_reclaim_fail              0               0
 zswap_pool_limit_hit                   0               0
 zswap_written_back_pages               0               0
 IAA incompressible pages               0               0
 -------------------------------------------------------------------------------



 64K folios: usemem30: zstd:
 ===========================

 -------------------------------------------------------------------------------
                    mm-unstable-7-30-2025             v11        
 -------------------------------------------------------------------------------
 zswap compressor                    zstd            zstd        v11 zstd    
                                                                 improvement  
 -------------------------------------------------------------------------------
 Total throughput (KB/s)        6,866,411       6,874,244        0.1%
 Avg throughput (KB/s)            228,880         229,141               
 elapsed time (sec)                 96.45           89.05         -8%
 sys time (sec)                  2,410.72        2,150.63        -11%
                                                         
 -------------------------------------------------------------------------------
 memcg_high                     1,070,285       1,075,178
 memcg_swap_fail                    2,404              66
 64kB_swpout_fallback               2,404              66
 zswpout                       49,767,024      49,672,972
 zswpin                               454             192
 pswpout                                0               0
 pswpin                                 0               0
 ZSWPOUT-64kB                   3,108,029       3,104,433
 SWPOUT-64kB                            0               0
 pgmajfault                         2,758           2,481
 zswap_reject_compress_fail             0               0
 zswap_reject_reclaim_fail              0               0
 zswap_pool_limit_hit                   0               0
 zswap_written_back_pages               0               0
 -------------------------------------------------------------------------------
                   

 2M folios: usemem30: zstd:
 ==========================

 -------------------------------------------------------------------------------
                    mm-unstable-7-30-2025             v11      
 -------------------------------------------------------------------------------
 zswap compressor                    zstd            zstd        v11 zstd           
                                                                 improvement
 -------------------------------------------------------------------------------
 Total throughput (KB/s)        7,560,441       7,627,155        0.9%
 Avg throughput (KB/s)            252,014         254,238            
 elapsed time (sec)                 88.89           83.22         -6%
 sys time (sec)                  2,132.05        1,952.98         -8%
                                                         
 -------------------------------------------------------------------------------
 memcg_high                        89,486          88,982
 memcg_swap_fail                      183              41
 thp_swpout_fallback                  183              41
 zswpout                       48,947,054      48,598,306
 zswpin                               472             252
 pswpout                                0               0
 pswpin                                 0               0
 ZSWPOUT-2048kB                    95,420          94,876
 thp_swpout                             0               0
 pgmajfault                         2,789           2,540
 zswap_reject_compress_fail             0               0
 zswap_reject_reclaim_fail              0               0
 zswap_pool_limit_hit                   0               0
 zswap_written_back_pages               0               0
 -------------------------------------------------------------------------------



Performance testing (Kernel compilation, allmodconfig):
=======================================================

The experiments with kernel compilation test use 28 threads and build
the "allmodconfig" that takes ~14 minutes, and has considerable
swapout/swapin activity. The cgroup's memory.max is set to 2G. We
trigger writeback by enabling the zswap shrinker.

 echo 1 > /sys/module/zswap/parameters/shrinker_enabled

 IAA WQ Configuration (script is at the end of the cover letter):

   ./enable_iaa.sh -d 4 -q 2
   
 This enables all 4 IAAs on the socket, and configures 2 WQs per IAA,
 each containing 64 entries. The driver sends decompresses to wqX.0 of
 the mapped IAA device, and distributes compresses to wqX.1 of all
 same-package IAAs in a round-robin manner.

 64K folios: Kernel compilation/allmodconfig: deflate-iaa:
 =========================================================

 -------------------------------------------------------------------------------
                    mm-unstable-7-30-2025             v11
 -------------------------------------------------------------------------------
 zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
                                                                 vs.
                                                             IAA Sequential
 -------------------------------------------------------------------------------
 real_sec                          901.81          840.60       -6.8%
 user_sec                       15,499.45       15,431.54
 sys_sec                         2,672.93        2,171.17        -19%
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB            1,872,984       1,872,884
 -------------------------------------------------------------------------------
 memcg_high                             0               0        
 memcg_swap_fail                    2,633               0        
 64kB_swpout_fallback               2,630               0        
 zswpout                       34,700,692      24,076,095        -31%
 zswpin                         7,791,832       4,937,822        
 pswpout                        2,624,324       1,459,681        
 pswpin                         2,486,667       1,229,416        
 ZSWPOUT-64kB                   1,254,622         896,433        
 SWPOUT-64kB                           36              18
 pgmajfault                    10,613,272       6,305,623
 zswap_reject_compress_fail            64             111
 zswap_reject_reclaim_fail            301              59
 zswap_pool_limit_hit                   0               0
 zswap_written_back_pages       2,612,474       1,451,961        -44%
 IAA incompressible pages              64             111 
 -------------------------------------------------------------------------------


 2M folios: Kernel compilation/allmodconfig: deflate-iaa:
 ========================================================

 -------------------------------------------------------------------------------
                    mm-unstable-7-30-2025             v11
 -------------------------------------------------------------------------------
 zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
                                                                 vs.
                                                             IAA Sequential
 -------------------------------------------------------------------------------
 real_sec                          838.76          804.83         -4%
 user_sec                       15,624.57       15,566.49
 sys_sec                         3,173.57        2,422.63        -24%
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB            1,874,680       1,872,892
 -------------------------------------------------------------------------------
 memcg_high                             0               0   
 memcg_swap_fail                   10,350             906   
 thp_swpout_fallback               10,342             906   
 zswpout                       59,544,198      38,093,995        -36%
 zswpin                        17,933,865      10,220,321   
 pswpout                        2,740,024         935,226   
 pswpin                         3,179,590       1,346,338   
 ZSWPOUT-2048kB                     6,464          10,435   
 thp_swpout                             4               3   
 pgmajfault                    21,609,542      11,819,882
 zswap_reject_compress_fail            50               8
 zswap_reject_reclaim_fail          2,335           2,377
 zswap_pool_limit_hit                   0               0
 zswap_written_back_pages       2,726,367         929,614        -66%
 IAA incompressible pages              50               8
 -------------------------------------------------------------------------------

With the iaa_crypto driver changes for non-blocking descriptor allocations,
no timeouts-with-mitigations were seen in compress/decompress jobs, for all
of the above experiments.


 64K folios: Kernel compilation/allmodconfig: zstd:
 ==================================================

 -------------------------------------------------------------------------------
                    mm-unstable-7-30-2025             v11
 -------------------------------------------------------------------------------
 zswap compressor                    zstd            zstd    Improvement
 -------------------------------------------------------------------------------
 real_sec                          882.67          837.21       -5.2%
 user_sec                       15,533.14       15,434.03
 sys_sec                         3,573.31        2,593.94      -27.4%
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB            1,872,960       1,872,788
 -------------------------------------------------------------------------------
 memcg_high                             0               0        
 memcg_swap_fail                        0               0        
 64kB_swpout_fallback                   0               0        
 zswpout                       42,768,967      22,660,215        -47%
 zswpin                        10,146,520       4,750,133        
 pswpout                        2,118,745         731,419        
 pswpin                         2,114,735         824,655        
 ZSWPOUT-64kB                   1,484,862         824,976        
 SWPOUT-64kB                            6               3
 pgmajfault                    12,698,613       5,697,281
 zswap_reject_compress_fail            13               8
 zswap_reject_reclaim_fail            624             211
 zswap_pool_limit_hit                   0               0
 zswap_written_back_pages       2,109,739         727,919        -65%
 -------------------------------------------------------------------------------


 2M folios: Kernel compilation/allmodconfig: zstd:
 =================================================

 -------------------------------------------------------------------------------
                    mm-unstable-7-30-2025             v11
 -------------------------------------------------------------------------------
 zswap compressor                    zstd            zstd    Improvement
 -------------------------------------------------------------------------------
 real_sec                          831.09          813.40       -2.1%
 user_sec                       15,648.65       15,566.01
 sys_sec                         4,251.11        3,053.95      -28.2%
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB            1,872,892       1,874,684
 -------------------------------------------------------------------------------
 memcg_high                             0               0   
 memcg_swap_fail                    7,525           1,455   
 thp_swpout_fallback                7,499           1,452   
 zswpout                       59,452,638      35,832,407        -40%
 zswpin                        17,690,718       9,550,640   
 pswpout                        1,047,676         426,042   
 pswpin                         2,155,989         840,514   
 ZSWPOUT-2048kB                     8,254           8,651   
 thp_swpout                             4               2   
 pgmajfault                    20,278,921      10,581,456
 zswap_reject_compress_fail            47              20
 zswap_reject_reclaim_fail          2,342             451
 zswap_pool_limit_hit                   0               0
 zswap_written_back_pages       1,041,721         423,334        -59%
 -------------------------------------------------------------------------------



IAA configuration script "enable_iaa.sh":
=========================================

 Acknowledgements: Binuraj Ravindran and Rakib Al-Fahad.

 Usage:
 ------

   ./enable_iaa.sh -d <num_IAAs> -q <num_WQs_per_IAA>


 #---------------------------------<cut here>----------------------------------
 #!/usr/bin/env bash
 #SPDX-License-Identifier: BSD-3-Clause
 #Copyright (c) 2025, Intel Corporation
 #Description: Configure IAA devices
 
 VERIFY_COMPRESS_PATH="/sys/bus/dsa/drivers/crypto/verify_compress"
 
 iax_dev_id="0cfe"
 num_iaa=$(lspci -d:${iax_dev_id} | wc -l)
 sockets=$(lscpu | grep Socket | awk '{print $2}')
 echo "Found ${num_iaa} instances in ${sockets} sockets(s)"
 
 #  The same number of devices will be configured in each socket, if there
 #  are  more than one socket.
 #  Normalize with respect to the number of sockets.
 device_num_per_socket=$(( num_iaa/sockets ))
 num_iaa_per_socket=$(( num_iaa / sockets ))
 
 iaa_wqs=2
 verbose=0
 iaa_engines=8
 mode="dedicated"
 wq_type="kernel"
 iaa_crypto_mode="async"
 verify_compress=0
 
 
 # Function to handle errors
 handle_error() {
     echo "Error: $1"
     exit 1
 }
 
 # Process arguments
 
 while getopts "d:hm:q:vD" opt; do
   case $opt in
     d)
       device_num_per_socket=$OPTARG
       ;;
     m)
       iaa_crypto_mode=$OPTARG
       ;;
     q)
       iaa_wqs=$OPTARG
       ;;
     D)
       verbose=1
       ;;
     v)
       verify_compress=1
       ;;
     h)
       echo "Usage: $0 [-d <device_count>][-q <wq_per_device>][-v]"
       echo "       -d - number of devices"
       echo "       -q - number of WQs per device"
       echo "       -v - verbose mode"
       echo "       -h - help"
       exit
       ;;
     \?)
       echo "Invalid option: -$OPTARG" >&2
       exit
       ;;
   esac
 done
 
 LOG="configure_iaa.log"
 
 # Update wq_size based on number of wqs
 wq_size=$(( 128 / iaa_wqs ))
 
 # Take care of the enumeration, if DSA is enabled.
 dsa=`lspci | grep -c 0b25`
 # set first,step counters to correctly enumerate iax devices based on
 # whether running on guest or host with or without dsa
 first=0
 step=1
 [[ $dsa -gt 0 && -d /sys/bus/dsa/devices/dsa0 ]] && first=1 && step=2
 echo "first index: ${first}, step: ${step}"
 
 
 #
 # Switch to software compressors and disable IAAs to have a clean start
 #
 COMPRESSOR=/sys/module/zswap/parameters/compressor
 last_comp=`cat ${COMPRESSOR}`
 echo lzo > ${COMPRESSOR}
 
 echo "Disable IAA devices before configuring"
 
 for ((i = ${first}; i < ${step} * ${num_iaa}; i += ${step})); do
     for ((j = 0; j < ${iaa_wqs}; j += 1)); do
         cmd="accel-config disable-wq iax${i}/wq${i}.${j} >& /dev/null"
        [[ $verbose == 1 ]] && echo $cmd; eval $cmd
      done
     cmd="accel-config disable-device iax${i} >& /dev/null"
     [[ $verbose == 1 ]] && echo $cmd; eval $cmd
 done
 
 rmmod iaa_crypto
 modprobe iaa_crypto
 
 # apply crypto parameters
 echo $verify_compress > ${VERIFY_COMPRESS_PATH} || handle_error "did not change verify_compress"
 # Note: This is a temporary solution during the kernel transition.
 if [ -f /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa ];then
     echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa || handle_error "did not set g_comp_wqs_per_iaa"
 elif [ -f /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa ];then
     echo 1 > /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa || handle_error "did not set g_wqs_per_iaa"
 fi
 if [ -f /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq ];then
     echo 1 > /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq || handle_error "did not set g_consec_descs_per_gwq"
 fi
 echo ${iaa_crypto_mode} > /sys/bus/dsa/drivers/crypto/sync_mode || handle_error "could not set sync_mode"
 
 
 
 echo "Configuring ${device_num_per_socket} device(s) out of $num_iaa_per_socket per socket"
 if [ "${device_num_per_socket}" -le "${num_iaa_per_socket}" ]; then
     echo "Configuring all devices"
     start=${first}
     end=$(( ${step} * ${device_num_per_socket} ))
 else
    echo "ERROR: Not enough devices"
    exit
 fi
 
 
 #
 # enable all iax devices and wqs
 #
 for (( socket = 0; socket < ${sockets}; socket += 1 )); do
 for ((i = ${start}; i < ${end}; i += ${step})); do
 
     echo "Configuring iaa$i on socket ${socket}"
 
     for ((j = 0; j < ${iaa_engines}; j += 1)); do
         cmd="accel-config config-engine iax${i}/engine${i}.${j} --group-id=0"
         [[ $verbose == 1 ]] && echo $cmd; eval $cmd
     done
 
     # Config  WQs
     for ((j = 0; j < ${iaa_wqs}; j += 1)); do
         # Config WQ: group 0,  priority=10, mode=shared, type = kernel name=kernel, driver_name=crypto
         cmd="accel-config config-wq iax${i}/wq${i}.${j} -g 0 -s ${wq_size} -p 10 -m ${mode} -y ${wq_type} -n iaa_crypto${i}${j} -d crypto"
         [[ $verbose == 1 ]] && echo $cmd; eval $cmd
      done
 
     # Enable Device and WQs
     cmd="accel-config enable-device iax${i}"
     [[ $verbose == 1 ]] && echo $cmd; eval $cmd
 
     for ((j = 0; j < ${iaa_wqs}; j += 1)); do
         cmd="accel-config enable-wq iax${i}/wq${i}.${j}"
         [[ $verbose == 1 ]] && echo $cmd; eval $cmd
      done
 
 done
     start=$(( start + ${step} * ${num_iaa_per_socket} ))
     end=$(( start + (${step} * ${device_num_per_socket}) ))
 done
 
 # Restore the last compressor
 echo "$last_comp" > ${COMPRESSOR}
 
 # Check if the configuration is correct
 echo "Configured IAA devices:"
 accel-config list | grep iax
 
 #---------------------------------<cut here>----------------------------------


Changes since v10:
==================
1) Rebased to mm-unstable as of 7-30-2025, commit 01da54f10fdd.
2) Added changelog text in patch 0024 noting that the batching framework
   has no Intel-specific dependencies, as suggested by Andrew Morton.
   Thanks Andrew!
3) Added changelog text in patch 0024 on other ongoing work that can use
   batching, as per Andrew's suggestion. Thanks Andrew!
4) Added the IAA configuration script in the cover letter, as suggested
   by Nhat Pham. Thanks Nhat!
5) As suggested by Nhat, dropped patch 0020 from v10, which moved CPU
   hotplug procedures to pool functions.
6) Gathered kernel_compilation 'allmod' config performance data with
   writeback and zswap shrinker_enabled=Y.
7) Changed the pool->batch_size for software compressors to be
   ZSWAP_MAX_BATCH_SIZE since this gives better performance with the zswap
   shrinker enabled.
8) Was unable to replicate in v11 the issue seen in v10, where
   memcg_swap_fail was higher than the baseline with usemem30/zstd.

Changes since v9:
=================
1) Rebased to mm-unstable as of 6-24-2025, commit 23b9c0472ea3.
2) iaa_crypto rearchitecting, mainline race condition fix, performance
   optimizations, code cleanup.
3) Addressed Herbert's comments in v9 patch 10, that an array based
   crypto_acomp interface is not acceptable.
4) Optimized the implementation of the batching zswap_compress() and
   zswap_store_pages() added in v9, to recover performance when
   integrated with the changes in commit 56e5a103a721 ("zsmalloc: prefer
   the the original page's node for compressed data").

Changes since v8:
=================
1) Rebased to mm-unstable as of 4-21-2025, commit 2c01d9f3c611.
2) Backported commits for reverting request chaining, since these are
   in cryptodev-2.6 but not yet in mm-unstable: without these backports,
   deflate-iaa is non-functional in mm-unstable:
   commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining")
   commit 5976fe19e240 ("Revert "crypto: testmgr - Add multibuffer acomp
                         testing"")
   Backported this hotfix as well:
   commit 002ba346e3d7 ("crypto: scomp - Fix off-by-one bug when
   calculating last page").
3) crypto_acomp_[de]compress() restored to non-request chained
   implementations since request chaining has been removed from acomp in
   commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining").
4) New IAA WQ architecture to denote the WQ type, and whether a WQ
   should be shared among all package cores or used only by the "mapped"
   cores from an even cores-to-IAA distribution scheme.
5) Compress/decompress batching are implemented in iaa_crypto using new
   crypto_acomp_batch_compress()/crypto_acomp_batch_decompress() API.
6) Defines a "void *data" in struct acomp_req, based on Herbert advising
   against using req->base.data in the driver. This is needed for async
   submit-poll to work.
7) In zswap.c, moved the CPU hotplug callbacks to reside in "pool
   functions", per Yosry's suggestion to move procedures in a distinct
   patch before refactoring patches.
8) A new "u8 nr_reqs" member is added to "struct zswap_pool" to track
   the number of requests/buffers associated with the per-cpu acomp_ctx,
   as per Yosry's suggestion.
9) Simplifications to the acomp_ctx resources allocation, deletion,
   locking, and for these to exist from pool creation to pool deletion,
   based on v8 code review discussions with Yosry.
10) Use IS_ERR_OR_NULL() consistently in zswap_cpu_comp_prepare() and
    acomp_ctx_dealloc(), as per Yosry's v8 comment.
11) zswap_store_folio() is deleted, and instead, the loop over
    zswap_store_pages() is moved inline in zswap_store(), per Yosry's
    suggestion.
12) Better structure in zswap_compress(), unified procedure that
    compresses/stores a batch of pages for both, non-batching and
    batching compressors. Renamed from zswap_batch_compress() to
    zswap_compress(): Thanks Yosry for these suggestions.


Changes since v7:
=================
1) Rebased to mm-unstable as of 3-3-2025, commit 5f089a9aa987.
2) Changed the acomp_ctx->nr_reqs to be u8 since ZSWAP_MAX_BATCH_SIZE is
   defined as 8U, for saving memory in this per-cpu structure.
3) Fixed a typo in code comments in acomp_ctx_get_cpu_lock():
   acomp_ctx->initialized to acomp_ctx->__online.
4) Incorporated suggestions from Yosry, Chengming, Nhat and Johannes,
   thanks to all!
   a) zswap_batch_compress() replaces zswap_compress(). Thanks Yosry
      for this suggestion!
   b) Process the folio in sub-batches of ZSWAP_MAX_BATCH_SIZE, regardless
      of whether or not the compressor supports batching. This gets rid of
      the kmalloc(entries), and allows us to allocate an array of
      ZSWAP_MAX_BATCH_SIZE entries on the stack. This is implemented in
      zswap_store_pages().
   c) Use of a common structure and code paths for compressing a folio in
      batches, either as a request chain (in parallel in IAA hardware) or
      sequentially. No code duplication since zswap_compress() has been
      replaced with zswap_batch_compress(), simplifying maintainability.
5) A key difference between compressors that support batching and
   those that do not, is that for the latter, the acomp_ctx mutex is
   locked/unlocked per ZSWAP_MAX_BATCH_SIZE batch, so that decompressions
   to handle page-faults can make progress. This fixes the zstd kernel
   compilation regression seen in v7. For compressors that support
   batching, e.g. IAA, the mutex is locked/released once for storing
   the folio.
6) Used likely/unlikely compiler directives and prefetchw to restore
   performance with the common code paths.

Changes since v6:
=================
1) Rebased to mm-unstable as of 2-27-2025, commit d58172d128ac.

2) Deleted crypto_acomp_batch_compress() and
   crypto_acomp_batch_decompress() interfaces, as per Herbert's
   suggestion. Batching is instead enabled by chaining the requests. For
   non-batching compressors, there is no request chaining involved. Both,
   batching and non-batching compressions are accomplished by zswap by
   calling:

   crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);

3) iaa_crypto implementation of batch compressions/decompressions using
   request chaining, as per Herbert's suggestions.
4) Simplification of the acomp_ctx resource allocation/deletion with
   respect to CPU hot[un]plug, to address Yosry's suggestions to explore the
   mutex options in zswap_cpu_comp_prepare(). Yosry, please let me know if
   the per-cpu memory cost of this proposed change is acceptable (IAA:
   64.8KB, Software compressors: 8.2KB). On the positive side, I believe
   restarting reclaim on a CPU after it has been through an offline-online
   transition, will be much faster by not deleting the acomp_ctx resources
   when the CPU gets offlined.
5) Use of lockdep assertions rather than comments for internal locking
   rules, as per Yosry's suggestion.
6) No specific references to IAA in zswap.c, as suggested by Yosry.
7) Explored various solutions other than the v6 zswap_store_folio()
   implementation, to fix the zstd regression seen in v5, to attempt to
   unify common code paths, and to allocate smaller arrays for the zswap
   entries on the stack. All these options were found to cause usemem30
   latency regression with zstd. The v6 version of zswap_store_folio() is
   the only implementation that does not cause zstd regression, confirmed
   by 10 consecutive runs, each giving quite consistent latency
   numbers. Hence, the v6 implementation is carried forward to v7, with
   changes for branching for batching vs. sequential compression API
   calls.


Changes since v5:
=================
1) Rebased to mm-unstable as of 2-1-2025, commit 7de6fd8ab650.

Several improvements, regression fixes and bug fixes, based on Yosry's
v5 comments (Thanks Yosry!):

2) Fix for zstd performance regression in v5.
3) Performance debug and fix for marginal improvements with IAA batching
   vs. sequential.
4) Performance testing data compares IAA with and without batching, instead
   of IAA batching against zstd.
5) Commit logs/zswap comments not mentioning crypto_acomp implementation
   details.
6) Delete the pr_info_once() when batching resources are allocated in
   zswap_cpu_comp_prepare().
7) Use kcalloc_node() for the multiple acomp_ctx buffers/reqs in
   zswap_cpu_comp_prepare().
8) Simplify and consolidate error handling cleanup code in
   zswap_cpu_comp_prepare().
9) Introduce zswap_compress_folio() in a separate patch.
10) Bug fix in zswap_store_folio() when xa_store() failure can cause all
    compressed objects and entries to be freed, and UAF when zswap_store()
    tries to free the entries that were already added to the xarray prior
    to the failure.
11) Deleting compressed_bytes/bytes. zswap_store_folio() also comprehends
    the recent fixes in commit bf5eaaaf7941 ("mm/zswap: fix inconsistency
    when zswap_store_page() fails") by Hyeonggon Yoo.

iaa_crypto improvements/fixes/changes:

12) Enables asynchronous mode and makes it the default. With commit
    4ebd9a5ca478 ("crypto: iaa - Fix IAA disabling that occurs when
    sync_mode is set to 'async'"), async mode was previously just sync. We
    now have true async support.
13) Change idxd descriptor allocations from blocking to non-blocking with
    timeouts, and mitigations for compress/decompress ops that fail to
    obtain a descriptor. This is a fix for tasks blocked errors seen in
    configurations where 30+ cores are running workloads under high memory
    pressure, and sending comps/decomps to 1 IAA device.
14) Fixes a bug with unprotected access of "deflate_generic_tfm" in
    deflate_generic_decompress(), which can cause data corruption and
    zswap_decompress() kernel crash.
15) zswap uses crypto_acomp_batch_compress() with async polling instead of
    request chaining for slightly better latency. However, the request
    chaining framework itself is unchanged, preserved from v5.


Changes since v4:
=================
1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6.
2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert!
3) Implemented IAA compress batching using request chaining.
4) zswap_store() batching simplifications suggested by Chengming, Yosry and
   Nhat, thanks to all!
   - New zswap_compress_folio() that is called by zswap_store().
   - Move the loop over folio's pages out of zswap_store() and into a
     zswap_store_folio() that stores all pages.
   - Allocate all zswap entries for the folio upfront.
   - Added zswap_batch_compress().
   - Branch to call zswap_compress() or zswap_batch_compress() inside
     zswap_compress_folio().
   - All iterations over pages kept in same function level.
   - No helpers other than the newly added zswap_store_folio() and
     zswap_compress_folio().


Changes since v3:
=================
1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
2) Major re-write of iaa_crypto driver's mapping of IAA devices to cores,
   based on packages instead of NUMA nodes.
3) Added acomp_has_async_batching() API to crypto acomp, that allows
   zswap/zram to query if a crypto_acomp has registered batch_compress and
   batch_decompress interfaces.
4) Clear the poll bits on the acomp_reqs passed to
   iaa_comp_a[de]compress_batch() so that a module like zswap can be
   confident about the acomp_reqs[0] not having the poll bit set before
   calling the fully synchronous API crypto_acomp_[de]compress().
   Herbert, I would appreciate it if you can review changes 2-4; in patches
   1-8 in v4. I did not want to introduce too many iaa_crypto changes in
   v4, given that patch 7 is already making a major change. I plan to work
   on incorporating the request chaining using the ahash interface in v5
   (I need to understand the basic crypto ahash better). Thanks Herbert!
5) Incorporated Johannes' suggestion to not have a sysctl to enable
   compress batching.
6) Incorporated Yosry's suggestion to allocate batching resources in the
   cpu hotplug onlining code, since there is no longer a sysctl to control
   batching. Thanks Yosry!
7) Incorporated Johannes' suggestions related to making the overall
   sequence of events between zswap_store() and zswap_batch_store() similar
   as much as possible for readability and control flow, better naming of
   procedures, avoiding forward declarations, not inlining error path
   procedures, deleting zswap internal details from zswap.h, etc. Thanks
   Johannes, really appreciate the direction!
   I have tried to explain the minimal future-proofing in terms of the
   zswap_batch_store() signature and the definition of "struct
   zswap_batch_store_sub_batch" in the comments for this struct. I hope the
   new code explains the control flow a bit better.


Changes since v2:
=================
1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
   returned by kmalloc_node() for acomp_ctx->buffers and for
   acomp_ctx->reqs.
3) Fixed a bug in zswap_pool_can_batch() for returning true if
   pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and if
   the per-cpu acomp_batch_ctx tests true for batching resources having
   been allocated on this cpu. Also, changed from per_cpu_ptr() to
   raw_cpu_ptr().
4) Incorporated the zswap_store_propagate_errors() compilation warning fix
   suggested by Dan Carpenter. Thanks Dan!
5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
   zswap.h, with SWAP_CRYPTO_BATCH_SIZE.

Changes since v1:
=================
1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
   async/poll mode, and to encapsulate the polling functionality in the
   iaa_crypto driver. Thanks Herbert!
3) Incorporated Herbert's and Yosry's suggestions to implement the batching
   API in iaa_crypto and to make its use seamless from zswap's
   perspective. Thanks Herbert and Yosry!
4) Incorporated Yosry's suggestion to make it more convenient for the user
   to enable compress batching, while minimizing the memory footprint
   cost. Thanks Yosry!
5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
   reclaim batching patch from this series, since it requires a broader
   discussion.


I would greatly appreciate code review comments for the iaa_crypto driver
and mm patches included in this series!

Thanks,
Kanchana




Kanchana P Sridhar (24):
  crypto: iaa - Reorganize the iaa_crypto driver code.
  crypto: iaa - New architecture for IAA device WQ comp/decomp usage &
    core mapping.
  crypto: iaa - Simplify, consistency of function parameters, minor
    stats bug fix.
  crypto: iaa - Descriptor allocation timeouts with mitigations.
  crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting.
  crypto: iaa - Simplify the code flow in iaa_compress() and
    iaa_decompress().
  crypto: iaa - Refactor hardware descriptor setup into separate
    procedures.
  crypto: iaa - Simplified, efficient job submissions for non-irq mode.
  crypto: iaa - Deprecate exporting add/remove IAA compression modes.
  crypto: iaa - Rearchitect the iaa_crypto driver to be usable by zswap
    and zram.
  crypto: iaa - Enablers for submitting descriptors then polling for
    completion.
  crypto: acomp - Add "void *kernel_data" in "struct acomp_req" for
    kernel users.
  crypto: iaa - IAA Batching for parallel compressions/decompressions.
  crypto: iaa - Enable async mode and make it the default.
  crypto: iaa - Disable iaa_verify_compress by default.
  crypto: iaa - Submit the two largest source buffers first in
    decompress batching.
  crypto: iaa - Add deflate-iaa-dynamic compression mode.
  crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's
    batch-size.
  crypto: iaa - IAA acomp_algs register the get_batch_size() interface.
  mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to
    deletion.
  mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx
    resources.
  mm: zswap: Allocate pool batching resources if the compressor supports
    batching.
  mm: zswap: zswap_store() will process a large folio in batches.
  mm: zswap: Batched zswap_compress() with compress batching of large
    folios.

 .../driver-api/crypto/iaa/iaa-crypto.rst      |  168 +-
 crypto/acompress.c                            |    1 +
 crypto/testmgr.c                              |   10 +
 crypto/testmgr.h                              |   74 +
 drivers/crypto/intel/iaa/Makefile             |    4 +-
 drivers/crypto/intel/iaa/iaa_crypto.h         |   59 +-
 .../intel/iaa/iaa_crypto_comp_dynamic.c       |   22 +
 drivers/crypto/intel/iaa/iaa_crypto_main.c    | 2902 ++++++++++++-----
 drivers/crypto/intel/iaa/iaa_crypto_stats.c   |    8 +
 drivers/crypto/intel/iaa/iaa_crypto_stats.h   |    2 +
 include/crypto/acompress.h                    |   30 +
 include/crypto/internal/acompress.h           |    3 +
 include/linux/iaa_comp.h                      |  159 +
 mm/swap.h                                     |   23 +
 mm/zswap.c                                    |  646 ++--
 15 files changed, 3085 insertions(+), 1026 deletions(-)
 create mode 100644 drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c
 create mode 100644 include/linux/iaa_comp.h

-- 
2.27.0




* [PATCH v11 01/24] crypto: iaa - Reorganize the iaa_crypto driver code.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 02/24] crypto: iaa - New architecture for IAA device WQ comp/decomp usage & core mapping Kanchana P Sridhar
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch merely reorganizes the code in iaa_crypto_main.c, so that
the functions are consolidated into logically related sub-sections of
code, without requiring forward declarations.

This is expected to make the code more maintainable and for it to be
easier to replace functional layers and/or add new features.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 677 +++++++++++----------
 1 file changed, 350 insertions(+), 327 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 23f585219fb4b..760997eee8fe5 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -24,6 +24,10 @@
 
 #define IAA_ALG_PRIORITY               300
 
+/**************************************
+ * Driver internal global variables.
+ **************************************/
+
 /* number of iaa instances probed */
 static unsigned int nr_iaa;
 static unsigned int nr_cpus;
@@ -36,54 +40,6 @@ static unsigned int cpus_per_iaa;
 /* Per-cpu lookup table for balanced wqs */
 static struct wq_table_entry __percpu *wq_table;
 
-static struct idxd_wq *wq_table_next_wq(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	if (++entry->cur_wq >= entry->n_wqs)
-		entry->cur_wq = 0;
-
-	if (!entry->wqs[entry->cur_wq])
-		return NULL;
-
-	pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
-		 entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
-		 entry->wqs[entry->cur_wq]->id, cpu);
-
-	return entry->wqs[entry->cur_wq];
-}
-
-static void wq_table_add(int cpu, struct idxd_wq *wq)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	if (WARN_ON(entry->n_wqs == entry->max_wqs))
-		return;
-
-	entry->wqs[entry->n_wqs++] = wq;
-
-	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
-		 entry->wqs[entry->n_wqs - 1]->idxd->id,
-		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
-}
-
-static void wq_table_free_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	kfree(entry->wqs);
-	memset(entry, 0, sizeof(*entry));
-}
-
-static void wq_table_clear_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	entry->n_wqs = 0;
-	entry->cur_wq = 0;
-	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
-}
-
 LIST_HEAD(iaa_devices);
 DEFINE_MUTEX(iaa_devices_lock);
 
@@ -91,36 +47,11 @@ DEFINE_MUTEX(iaa_devices_lock);
 static bool iaa_crypto_enabled;
 static bool iaa_crypto_registered;
 
+static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
+
 /* Verify results of IAA compress or not */
 static bool iaa_verify_compress = true;
 
-static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
-{
-	return sprintf(buf, "%d\n", iaa_verify_compress);
-}
-
-static ssize_t verify_compress_store(struct device_driver *driver,
-				     const char *buf, size_t count)
-{
-	int ret = -EBUSY;
-
-	mutex_lock(&iaa_devices_lock);
-
-	if (iaa_crypto_enabled)
-		goto out;
-
-	ret = kstrtobool(buf, &iaa_verify_compress);
-	if (ret)
-		goto out;
-
-	ret = count;
-out:
-	mutex_unlock(&iaa_devices_lock);
-
-	return ret;
-}
-static DRIVER_ATTR_RW(verify_compress);
-
 /*
  * The iaa crypto driver supports three 'sync' methods determining how
  * compressions and decompressions are performed:
@@ -155,6 +86,37 @@ static bool async_mode;
 /* Use interrupts */
 static bool use_irq;
 
+/**************************************************
+ * Driver attributes along with get/set functions.
+ **************************************************/
+
+static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", iaa_verify_compress);
+}
+
+static ssize_t verify_compress_store(struct device_driver *driver,
+				     const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtobool(buf, &iaa_verify_compress);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(verify_compress);
+
 /**
  * set_iaa_sync_mode - Set IAA sync mode
  * @name: The name of the sync mode
@@ -217,7 +179,9 @@ static ssize_t sync_mode_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(sync_mode);
 
-static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
+/****************************
+ * Driver compression modes.
+ ****************************/
 
 static int find_empty_iaa_compression_mode(void)
 {
@@ -409,11 +373,6 @@ static void free_device_compression_mode(struct iaa_device *iaa_device,
 						IDXD_OP_FLAG_WR_SRC2_AECS_COMP | \
 						IDXD_OP_FLAG_AECS_RW_TGLS)
 
-static int check_completion(struct device *dev,
-			    struct iax_completion_record *comp,
-			    bool compress,
-			    bool only_once);
-
 static int init_device_compression_mode(struct iaa_device *iaa_device,
 					struct iaa_compression_mode *mode,
 					int idx, struct idxd_wq *wq)
@@ -500,6 +459,11 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
 	}
 }
 
+/***********************************************************
+ * Functions for use in crypto probe and remove interfaces:
+ * allocate/init/query/deallocate devices/wqs.
+ ***********************************************************/
+
 static struct iaa_device *iaa_device_alloc(void)
 {
 	struct iaa_device *iaa_device;
@@ -513,18 +477,6 @@ static struct iaa_device *iaa_device_alloc(void)
 	return iaa_device;
 }
 
-static bool iaa_has_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
-{
-	struct iaa_wq *iaa_wq;
-
-	list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
-		if (iaa_wq->wq == wq)
-			return true;
-	}
-
-	return false;
-}
-
 static struct iaa_device *add_iaa_device(struct idxd_device *idxd)
 {
 	struct iaa_device *iaa_device;
@@ -560,6 +512,27 @@ static void del_iaa_device(struct iaa_device *iaa_device)
 	nr_iaa--;
 }
 
+static void free_iaa_device(struct iaa_device *iaa_device)
+{
+	if (!iaa_device)
+		return;
+
+	remove_device_compression_modes(iaa_device);
+	kfree(iaa_device);
+}
+
+static bool iaa_has_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
+{
+	struct iaa_wq *iaa_wq;
+
+	list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+		if (iaa_wq->wq == wq)
+			return true;
+	}
+
+	return false;
+}
+
 static int add_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq,
 		      struct iaa_wq **new_wq)
 {
@@ -612,23 +585,23 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
 	}
 }
 
-static void clear_wq_table(void)
+static void remove_iaa_wq(struct idxd_wq *wq)
 {
-	int cpu;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_clear_entry(cpu);
-
-	pr_debug("cleared wq table\n");
-}
+	struct iaa_device *iaa_device;
 
-static void free_iaa_device(struct iaa_device *iaa_device)
-{
-	if (!iaa_device)
-		return;
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		if (iaa_has_wq(iaa_device, wq)) {
+			del_iaa_wq(iaa_device, wq);
+			break;
+		}
+	}
 
-	remove_device_compression_modes(iaa_device);
-	kfree(iaa_device);
+	if (nr_iaa) {
+		cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+		if (!cpus_per_iaa)
+			cpus_per_iaa = 1;
+	} else
+		cpus_per_iaa = 1;
 }
 
 static void __free_iaa_wq(struct iaa_wq *iaa_wq)
@@ -655,6 +628,75 @@ static void free_iaa_wq(struct iaa_wq *iaa_wq)
 	idxd_wq_set_private(wq, NULL);
 }
 
+static int save_iaa_wq(struct idxd_wq *wq)
+{
+	struct iaa_device *iaa_device, *found = NULL;
+	struct idxd_device *idxd;
+	struct pci_dev *pdev;
+	struct device *dev;
+	int ret = 0;
+
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		if (iaa_device->idxd == wq->idxd) {
+			idxd = iaa_device->idxd;
+			pdev = idxd->pdev;
+			dev = &pdev->dev;
+			/*
+			 * Check to see that we don't already have this wq.
+			 * Shouldn't happen but we don't control probing.
+			 */
+			if (iaa_has_wq(iaa_device, wq)) {
+				dev_dbg(dev, "same wq probed multiple times for iaa_device %p\n",
+					iaa_device);
+				goto out;
+			}
+
+			found = iaa_device;
+
+			ret = add_iaa_wq(iaa_device, wq, NULL);
+			if (ret)
+				goto out;
+
+			break;
+		}
+	}
+
+	if (!found) {
+		struct iaa_device *new_device;
+		struct iaa_wq *new_wq;
+
+		new_device = add_iaa_device(wq->idxd);
+		if (!new_device) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		ret = add_iaa_wq(new_device, wq, &new_wq);
+		if (ret) {
+			del_iaa_device(new_device);
+			free_iaa_device(new_device);
+			goto out;
+		}
+
+		ret = init_iaa_device(new_device, new_wq);
+		if (ret) {
+			del_iaa_wq(new_device, new_wq->wq);
+			del_iaa_device(new_device);
+			free_iaa_wq(new_wq);
+			goto out;
+		}
+	}
+
+	if (WARN_ON(nr_iaa == 0))
+		return -EINVAL;
+
+	cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+	if (!cpus_per_iaa)
+		cpus_per_iaa = 1;
+out:
+	return 0;
+}
+
 static int iaa_wq_get(struct idxd_wq *wq)
 {
 	struct idxd_device *idxd = wq->idxd;
@@ -702,6 +744,37 @@ static int iaa_wq_put(struct idxd_wq *wq)
 	return ret;
 }
 
+/***************************************************************
+ * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
+ ***************************************************************/
+
+static void wq_table_free_entry(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	kfree(entry->wqs);
+	memset(entry, 0, sizeof(*entry));
+}
+
+static void wq_table_clear_entry(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	entry->n_wqs = 0;
+	entry->cur_wq = 0;
+	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+}
+
+static void clear_wq_table(void)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++)
+		wq_table_clear_entry(cpu);
+
+	pr_debug("cleared wq table\n");
+}
+
 static void free_wq_table(void)
 {
 	int cpu;
@@ -739,92 +812,18 @@ static int alloc_wq_table(int max_wqs)
 	return 0;
 }
 
-static int save_iaa_wq(struct idxd_wq *wq)
+static void wq_table_add(int cpu, struct idxd_wq *wq)
 {
-	struct iaa_device *iaa_device, *found = NULL;
-	struct idxd_device *idxd;
-	struct pci_dev *pdev;
-	struct device *dev;
-	int ret = 0;
-
-	list_for_each_entry(iaa_device, &iaa_devices, list) {
-		if (iaa_device->idxd == wq->idxd) {
-			idxd = iaa_device->idxd;
-			pdev = idxd->pdev;
-			dev = &pdev->dev;
-			/*
-			 * Check to see that we don't already have this wq.
-			 * Shouldn't happen but we don't control probing.
-			 */
-			if (iaa_has_wq(iaa_device, wq)) {
-				dev_dbg(dev, "same wq probed multiple times for iaa_device %p\n",
-					iaa_device);
-				goto out;
-			}
-
-			found = iaa_device;
-
-			ret = add_iaa_wq(iaa_device, wq, NULL);
-			if (ret)
-				goto out;
-
-			break;
-		}
-	}
-
-	if (!found) {
-		struct iaa_device *new_device;
-		struct iaa_wq *new_wq;
-
-		new_device = add_iaa_device(wq->idxd);
-		if (!new_device) {
-			ret = -ENOMEM;
-			goto out;
-		}
-
-		ret = add_iaa_wq(new_device, wq, &new_wq);
-		if (ret) {
-			del_iaa_device(new_device);
-			free_iaa_device(new_device);
-			goto out;
-		}
-
-		ret = init_iaa_device(new_device, new_wq);
-		if (ret) {
-			del_iaa_wq(new_device, new_wq->wq);
-			del_iaa_device(new_device);
-			free_iaa_wq(new_wq);
-			goto out;
-		}
-	}
-
-	if (WARN_ON(nr_iaa == 0))
-		return -EINVAL;
-
-	cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
-	if (!cpus_per_iaa)
-		cpus_per_iaa = 1;
-out:
-	return 0;
-}
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
 
-static void remove_iaa_wq(struct idxd_wq *wq)
-{
-	struct iaa_device *iaa_device;
+	if (WARN_ON(entry->n_wqs == entry->max_wqs))
+		return;
 
-	list_for_each_entry(iaa_device, &iaa_devices, list) {
-		if (iaa_has_wq(iaa_device, wq)) {
-			del_iaa_wq(iaa_device, wq);
-			break;
-		}
-	}
+	entry->wqs[entry->n_wqs++] = wq;
 
-	if (nr_iaa) {
-		cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
-		if (!cpus_per_iaa)
-			cpus_per_iaa = 1;
-	} else
-		cpus_per_iaa = 1;
+	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
+		 entry->wqs[entry->n_wqs - 1]->idxd->id,
+		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
 }
 
 static int wq_table_add_wqs(int iaa, int cpu)
@@ -930,6 +929,44 @@ static void rebalance_wq_table(void)
 	pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
 }
 
+/***************************************************************
+ * Assign work-queues for driver ops using per-cpu wq_tables.
+ ***************************************************************/
+
+static struct idxd_wq *wq_table_next_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	if (++entry->cur_wq >= entry->n_wqs)
+		entry->cur_wq = 0;
+
+	if (!entry->wqs[entry->cur_wq])
+		return NULL;
+
+	pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
+		 entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
+		 entry->wqs[entry->cur_wq]->id, cpu);
+
+	return entry->wqs[entry->cur_wq];
+}
+
+/*************************************************
+ * Core iaa_crypto compress/decompress functions.
+ *************************************************/
+
+static int deflate_generic_decompress(struct acomp_req *req)
+{
+	ACOMP_FBREQ_ON_STACK(fbreq, req);
+	int ret;
+
+	ret = crypto_acomp_decompress(fbreq);
+	req->dlen = fbreq->dlen;
+
+	update_total_sw_decomp_calls();
+
+	return ret;
+}
+
 static inline int check_completion(struct device *dev,
 				   struct iax_completion_record *comp,
 				   bool compress,
@@ -990,27 +1027,132 @@ static inline int check_completion(struct device *dev,
 	return ret;
 }
 
-static int deflate_generic_decompress(struct acomp_req *req)
+static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
+				struct acomp_req *req,
+				dma_addr_t *src_addr, dma_addr_t *dst_addr)
 {
-	ACOMP_FBREQ_ON_STACK(fbreq, req);
-	int ret;
+	int ret = 0;
+	int nr_sgs;
 
-	ret = crypto_acomp_decompress(fbreq);
-	req->dlen = fbreq->dlen;
+	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
 
-	update_total_sw_decomp_calls();
+	nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+	if (nr_sgs <= 0 || nr_sgs > 1) {
+		dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
+			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+			iaa_wq->wq->id, ret);
+		ret = -EIO;
+		goto out;
+	}
+	*src_addr = sg_dma_address(req->src);
+	dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
+		" req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
+		req->src, req->slen, sg_dma_len(req->src));
 
+	nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
+	if (nr_sgs <= 0 || nr_sgs > 1) {
+		dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
+			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+			iaa_wq->wq->id, ret);
+		ret = -EIO;
+		dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+		goto out;
+	}
+	*dst_addr = sg_dma_address(req->dst);
+	dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
+		" req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
+		req->dst, req->dlen, sg_dma_len(req->dst));
+out:
 	return ret;
 }
 
-static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
-				struct acomp_req *req,
-				dma_addr_t *src_addr, dma_addr_t *dst_addr);
-
 static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 			       struct idxd_wq *wq,
 			       dma_addr_t src_addr, unsigned int slen,
-			       dma_addr_t dst_addr, unsigned int *dlen);
+			       dma_addr_t dst_addr, unsigned int *dlen)
+{
+	struct iaa_device_compression_mode *active_compression_mode;
+	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	u32 *compression_crc = acomp_request_ctx(req);
+	struct iaa_device *iaa_device;
+	struct idxd_desc *idxd_desc;
+	struct iax_hw_desc *desc;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	int ret = 0;
+
+	iaa_wq = idxd_wq_get_private(wq);
+	iaa_device = iaa_wq->iaa_device;
+	idxd = iaa_device->idxd;
+	pdev = idxd->pdev;
+	dev = &pdev->dev;
+
+	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
+
+	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	if (IS_ERR(idxd_desc)) {
+		dev_dbg(dev, "idxd descriptor allocation failed\n");
+		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
+			PTR_ERR(idxd_desc));
+		return PTR_ERR(idxd_desc);
+	}
+	desc = idxd_desc->iax_hw;
+
+	/* Verify (optional) - decompress and check crc, suppress dest write */
+
+	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
+	desc->opcode = IAX_OPCODE_DECOMPRESS;
+	desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
+	desc->priv = 0;
+
+	desc->src1_addr = (u64)dst_addr;
+	desc->src1_size = *dlen;
+	desc->dst_addr = (u64)src_addr;
+	desc->max_dst_size = slen;
+	desc->completion_addr = idxd_desc->compl_dma;
+
+	dev_dbg(dev, "(verify) compression mode %s,"
+		" desc->src1_addr %llx, desc->src1_size %d,"
+		" desc->dst_addr %llx, desc->max_dst_size %d,"
+		" desc->src2_addr %llx, desc->src2_size %d\n",
+		active_compression_mode->name,
+		desc->src1_addr, desc->src1_size, desc->dst_addr,
+		desc->max_dst_size, desc->src2_addr, desc->src2_size);
+
+	ret = idxd_submit_desc(wq, idxd_desc);
+	if (ret) {
+		dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
+		goto err;
+	}
+
+	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+	if (ret) {
+		dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
+		goto err;
+	}
+
+	if (*compression_crc != idxd_desc->iax_completion->crc) {
+		ret = -EINVAL;
+		dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
+			" comp=0x%x, decomp=0x%x\n", *compression_crc,
+			idxd_desc->iax_completion->crc);
+		print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
+			       8, 1, idxd_desc->iax_completion, 64, 0);
+		goto err;
+	}
+
+	idxd_free_desc(wq, idxd_desc);
+out:
+	return ret;
+err:
+	idxd_free_desc(wq, idxd_desc);
+	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
+
+	goto out;
+}
 
 static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 			      enum idxd_complete_type comp_type,
@@ -1226,133 +1368,6 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 	goto out;
 }
 
-static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
-				struct acomp_req *req,
-				dma_addr_t *src_addr, dma_addr_t *dst_addr)
-{
-	int ret = 0;
-	int nr_sgs;
-
-	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
-	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
-
-	nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
-	if (nr_sgs <= 0 || nr_sgs > 1) {
-		dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
-			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
-			iaa_wq->wq->id, ret);
-		ret = -EIO;
-		goto out;
-	}
-	*src_addr = sg_dma_address(req->src);
-	dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
-		" req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
-		req->src, req->slen, sg_dma_len(req->src));
-
-	nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
-	if (nr_sgs <= 0 || nr_sgs > 1) {
-		dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
-			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
-			iaa_wq->wq->id, ret);
-		ret = -EIO;
-		dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
-		goto out;
-	}
-	*dst_addr = sg_dma_address(req->dst);
-	dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
-		" req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
-		req->dst, req->dlen, sg_dma_len(req->dst));
-out:
-	return ret;
-}
-
-static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
-			       struct idxd_wq *wq,
-			       dma_addr_t src_addr, unsigned int slen,
-			       dma_addr_t dst_addr, unsigned int *dlen)
-{
-	struct iaa_device_compression_mode *active_compression_mode;
-	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
-	u32 *compression_crc = acomp_request_ctx(req);
-	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
-	struct iax_hw_desc *desc;
-	struct idxd_device *idxd;
-	struct iaa_wq *iaa_wq;
-	struct pci_dev *pdev;
-	struct device *dev;
-	int ret = 0;
-
-	iaa_wq = idxd_wq_get_private(wq);
-	iaa_device = iaa_wq->iaa_device;
-	idxd = iaa_device->idxd;
-	pdev = idxd->pdev;
-	dev = &pdev->dev;
-
-	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
-	if (IS_ERR(idxd_desc)) {
-		dev_dbg(dev, "idxd descriptor allocation failed\n");
-		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
-			PTR_ERR(idxd_desc));
-		return PTR_ERR(idxd_desc);
-	}
-	desc = idxd_desc->iax_hw;
-
-	/* Verify (optional) - decompress and check crc, suppress dest write */
-
-	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
-	desc->opcode = IAX_OPCODE_DECOMPRESS;
-	desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
-	desc->priv = 0;
-
-	desc->src1_addr = (u64)dst_addr;
-	desc->src1_size = *dlen;
-	desc->dst_addr = (u64)src_addr;
-	desc->max_dst_size = slen;
-	desc->completion_addr = idxd_desc->compl_dma;
-
-	dev_dbg(dev, "(verify) compression mode %s,"
-		" desc->src1_addr %llx, desc->src1_size %d,"
-		" desc->dst_addr %llx, desc->max_dst_size %d,"
-		" desc->src2_addr %llx, desc->src2_size %d\n",
-		active_compression_mode->name,
-		desc->src1_addr, desc->src1_size, desc->dst_addr,
-		desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
-	ret = idxd_submit_desc(wq, idxd_desc);
-	if (ret) {
-		dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
-		goto err;
-	}
-
-	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
-	if (ret) {
-		dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
-		goto err;
-	}
-
-	if (*compression_crc != idxd_desc->iax_completion->crc) {
-		ret = -EINVAL;
-		dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
-			" comp=0x%x, decomp=0x%x\n", *compression_crc,
-			idxd_desc->iax_completion->crc);
-		print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
-			       8, 1, idxd_desc->iax_completion, 64, 0);
-		goto err;
-	}
-
-	idxd_free_desc(wq, idxd_desc);
-out:
-	return ret;
-err:
-	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
-
-	goto out;
-}
-
 static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 			  struct idxd_wq *wq,
 			  dma_addr_t src_addr, unsigned int slen,
@@ -1662,6 +1677,10 @@ static void compression_ctx_init(struct iaa_compression_ctx *ctx)
 	ctx->use_irq = use_irq;
 }
 
+/*********************************************
+ * Interfaces to crypto_alg and crypto_acomp.
+ *********************************************/
+
 static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
@@ -1864,6 +1883,10 @@ static struct idxd_device_driver iaa_crypto_driver = {
 	.desc_complete = iaa_desc_complete,
 };
 
+/********************
+ * Module init/exit.
+ ********************/
+
 static int __init iaa_crypto_init_module(void)
 {
 	int ret = 0;
-- 
2.27.0




* [PATCH v11 02/24] crypto: iaa - New architecture for IAA device WQ comp/decomp usage & core mapping.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 01/24] crypto: iaa - Reorganize the iaa_crypto driver code Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 03/24] crypto: iaa - Simplify, consistency of function parameters, minor stats bug fix Kanchana P Sridhar
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch re-architects the iaa_crypto driver in three main aspects, to
make it more robust, stable, generic and functionally versatile: it
supports zswap users on platforms with different numbers of cores/IAAs,
running workloads with different swap characteristics, and, most
importantly, delivers better performance.

 Summary of latency improvement for large folio compression:
 ===========================================================
 When measured in zswap using a simple madvise workload, where 64K
 Folios are stored using IAA batch compressions, this is how the
 per-page compress latency changes just by setting the
 "distribute_comps" driver parameter to "1":

   --------------------------------------------------------------
   zswap compressor: deflate-iaa
   64K Folios: zswap_store() latency normalized to per-page
   --------------------------------------------------------------
                                         p50 (ns)     p99 (ns)
   --------------------------------------------------------------
   Sequential store                         3,503        3,695
   Batch compress, distribute_comps=0       1,356        1,384
   Batch compress, distribute_comps=1         706          763
   --------------------------------------------------------------

The rearchitecting aspects are:

A) Map IAA devices/wqs to cores based on packages instead of NUMA.

B) The WQ rebalancing algorithm that is invoked as WQs are
   discovered/deleted has been made very general and flexible so that
   the user can control exactly how IAA WQs are used, for optimizing
   performance.

C) Additionally, the "iaa_crypto_enabled" driver global has been
   modified to be an atomic, and used for synchronization between
   dynamic/asynchronous WQ discovery/deletion and the fundamental
   routines comp_wq_table_next_wq() and decomp_wq_table_next_wq() that
   are queried by compress/decompress job submissions.

Description/motivation for (A):
===============================
This patch modifies the algorithm for mapping available IAA devices and
WQs to cores based on packages instead of NUMA nodes. This leads to a
more realistic mapping of IAA devices as compression/decompression
resources for a package, rather than for a NUMA node. This also resolves
problems that were observed during internal validation on Intel Granite
Rapids platforms with many more NUMA nodes than packages: for such
cases, the earlier NUMA based allocation caused some IAAs to be
over-subscribed and some to not be utilized at all.

As a result of this change from NUMA to packages, some of the core
functions used by the iaa_crypto driver's "probe" and "remove" API
have been re-written. The new infrastructure maintains a static mapping
of wqs per IAA device, in the "struct iaa_device" itself. The earlier
implementation would allocate memory per-cpu for this data, which never
changes once the IAA devices/wqs have been initialized.

Two main outcomes from this new iaa_crypto driver infrastructure are:

 1) Resolves "task blocked for more than x seconds" errors observed during
    internal validation on Intel systems with the earlier NUMA node based
    mappings, which was root-caused to the non-optimal IAA-to-core mappings
    described earlier.

 2) Results in a NUM_THREADS factor reduction in memory footprint cost of
    initializing IAA devices/wqs, due to eliminating the per-cpu copies of
    each IAA device's wqs. On a 384 cores Intel Granite Rapids server with
    8 IAA devices, this saves 140MiB.

An auxiliary change included in this patch is that the driver's "nr_iaa",
"nr_iaa_per_package" and "cpus_per_iaa" global variables are made
atomic, because iaa_crypto_probe() and iaa_crypto_remove() change the
values of these variables asynchronously and concurrently as wqs get
added/deleted and rebalance_wq_table() is called. This change allows the
rebalance_wq_table() code to see consistent values of the number of IAA
devices.

Description/motivation for (B):
===============================
This builds upon the package-based driver infrastructure, to provide
more flexibility in using particular WQs for compress-only or
decompress-only jobs. It also introduces the notion of using all the IAA
devices on a package as resources that are shared by all cores on the
package: this significantly improves batching (to be added in subsequent
patches) latency and compress/decompress throughput. sysfs driver
parameters provide configurability of these features.

Two main concepts are introduced as part of the rebalancing changes:

 1) An IAA WQ can be used for specific ops, that determines a WQ "type"
    for the iaa_crypto driver to submit compress/decompress jobs:

    - compress only
    - decompress only
    - generic, i.e, for both compresses and decompresses

    The WQ type is decided based on the number of WQs configured for a
    given IAA device, and the new "g_comp_wqs_per_iaa" driver parameter.

 2) An IAA WQ can be mapped to cores using either of the following
    balancing techniques:

    a) Shared by all cores on a package. The iaa_crypto driver will
       dispatch compress/decompress jobs to all WQs of the same type,
       across all IAA devices on the package:
       - IAA compress jobs will be distributed to all same-package IAA
         compress-only/generic WQs.
       - IAA decompress jobs will be distributed to all same-package IAA
         decompress-only/generic WQs.

    b) Handles compress/decompress jobs only from "mapped cores", i.e.,
       the cores derived by evenly dividing the number of IAAs among the
       number of cores, per package.

Server setups that are moderately to highly contended can benefit from
(2.a). When the mix of workloads running on a system needs high compress
throughput, and has relatively lower decompress activity, (2.b) might
be more optimal.

These approaches can be accomplished with the following new iaa_crypto
driver parameters. These parameters are global settings and will apply
to all IAAs on a package, interpreted in the context of the number of
WQs configured per IAA device.

 g_comp_wqs_per_iaa:
 ===================
   Number of compress-only WQs. The default is 1, but is applicable only
   if the device has more than 1 WQ. If the device has exactly 1 WQ
   configured, "g_comp_wqs_per_iaa" is a don't care.

   If the IAA device has more than "g_comp_wqs_per_iaa" WQs configured,
   the last "g_comp_wqs_per_iaa" number of WQs will be considered as
   "compress only". The remaining WQs will be considered as
   "decompress only".

   If the device has less than or equal to "g_comp_wqs_per_iaa" WQs, all
   the device's WQs will be considered "generic", i.e., the driver will
   submit compress and decompress jobs to all the WQs configured for the
   device.

   For example, if an IAA "X" has 2 WQs, this will set up 1 decompress WQ and
   1 compress WQ:

     echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa

     wqX.0: decompress jobs only.
     wqX.1: compress jobs only.

   This setting would typically benefit workloads that see a high
   level of compress and decompress activity.

   If an IAA has 1 WQ, that WQ will be considered "generic": the driver
   will submit compress and decompress jobs to the same WQ (this is
   independent of the "g_comp_wqs_per_iaa" setting):

     wqX.0: compress and decompress jobs.

   This would typically benefit workloads that see significant cold
   memory being reclaimed, and consequently, high swapout and low swapin
   activity.

 distribute_comps:
 =================
   Distribute compressions to all IAAs on package (default is Y).

   Assuming the WQ type has been established as
   compress-only/decompress-only/generic, this setting will determine if
   the driver will distribute compress jobs to all IAAs on a package
   (default behavior) or not.

   If this is turned off, the driver will dispatch compress jobs to a
   given IAA "compression enabled" WQ only from cores that are mapped to
   that IAA using an algorithm that evenly distributes IAAs per package
    to cores per package. For example, on a Sapphire Rapids server with
   56-physical-cores and 4 IAAs per package, with Hyperthreading, 28
   logical cores will be assigned to each IAA. With the
   "distribute_comps" driver parameter turned off, the driver will send
   compress jobs only to its assigned IAA device.

   Enabling "distribute_comps" would typically benefit workloads in
   terms of batch compress latency and throughput.

 distribute_decomps:
 ===================
   Distribute decompressions to all IAAs on package (default is N).

   Assuming the WQ type has been established as
   compress-only/decompress-only/generic, this setting will determine if
   the driver will distribute decompress jobs to all IAAs on a package
   (default behavior) or not.

   We recommend leaving this parameter at its default setting of "N".
   Enabling "distribute_decomps = Y" can be evaluated for workloads that
   are sensitive to p99 decompress latency, and see a high level of
   compress and decompress activity (for e.g. warm memory reclaim/swapin).

Recommended settings for best compress/decompress latency, throughput
and hence memory savings for a moderately contended server, are:

   2 WQs per IAA
   g_comp_wqs_per_iaa = 1 (separate WQ for comps/decomps per IAA)
   distribute_decomps = N
   distribute_comps = Y

For systems that have one IAA device, the distribute_[de]comps settings
will be a no-op. Even for such systems, as long as considerable swapout
and swapin activity is expected, we recommend setting up 2 WQs
for the IAA, one each for compressions/decompressions. If swapouts are
significantly more than swapins, 1 WQ would be a better configuration,
as mentioned earlier.
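
A minimal sketch of applying these recommended settings, assuming the
"enable_iaa.sh" script from the cover letter is used to configure 2 WQs
per device, and that the sysfs parameter paths below exist on the
running kernel:

   ./enable_iaa.sh -d <num_IAAs_per_socket> -q 2
   echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
   echo 0 > /sys/bus/dsa/drivers/crypto/distribute_decomps
   echo 1 > /sys/bus/dsa/drivers/crypto/distribute_comps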

 Examples:
 =========
   For a Sapphire Rapids server with 2 packages, 56 cores and 4 IAAs per
   package, each IAA has 2 WQs, and these settings are in effect:

     echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
     echo 1 > /sys/bus/dsa/drivers/crypto/distribute_comps
     echo 0 > /sys/bus/dsa/drivers/crypto/distribute_decomps

     wqX.0: decompress jobs only.
     wqX.1: compress jobs only.

   Compress jobs from all cores on package-0 will be distributed in
   round-robin manner to [iax1, iax3, iax5, iax7]'s wqX.1, to maximize
   compression throughput/latency/memory savings:

     wq1.1
     wq3.1
     wq5.1
     wq7.1

   Likewise, compress jobs from all cores on package-1 will be
   distributed in round-robin manner to [iax9, iax11, iax13, iax15]'s
   wqX.1, to maximize compression throughput/latency/memory savings for
   workloads running on package-1:

     wq9.1
     wq11.1
     wq13.1
     wq15.1

   Decompress jobs will be submitted from mapped logical cores only, as
   follows:

     package-0:

       CPU   0-13,112-125   14-27,126-139  28-41,140-153  42-55,154-167
       IAA:  iax1           iax3           iax5           iax7
       WQ:   wq1.0          wq3.0          wq5.0          wq7.0

     package-1:

       CPU   56-69,168-181  70-83,182-195  84-97,196-209   98-111,210-223
       IAA:  iax9           iax11          iax13           iax15
       WQ:   wq9.0          wq11.0         wq13.0          wq15.0

IAA WQs can be configured using higher-level scripts as described in
Documentation/driver-api/crypto/iaa/iaa-crypto.rst. That documentation
has been updated to describe the new parameters above.
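
Note that, per the new attribute store functions in this patch, these
driver parameters can only be written while iaa_crypto is not enabled
(the stores return -EBUSY otherwise), so they should be set before the
IAA WQs are enabled. A quick sanity check after setting them (sysfs
paths as documented above):

   cat /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
   cat /sys/bus/dsa/drivers/crypto/distribute_comps
   cat /sys/bus/dsa/drivers/crypto/distribute_decomps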

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 .../driver-api/crypto/iaa/iaa-crypto.rst      | 136 +++
 drivers/crypto/intel/iaa/iaa_crypto.h         |  18 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c    | 889 ++++++++++++++----
 3 files changed, 872 insertions(+), 171 deletions(-)

diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index 8e50b900d51c2..1c4c25f0dc5e4 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -290,6 +290,142 @@ The available attributes are:
     'sync' mode. This is to ensure correct iaa_crypto behavior until true
     async polling without interrupts is enabled in iaa_crypto.
 
+  - g_comp_wqs_per_iaa
+
+    Number of compress-only WQs per IAA device. The default is 1, but it
+    takes effect only if the device has more than 1 WQ. If the device has
+    exactly 1 WQ configured, "g_comp_wqs_per_iaa" has no effect.
+
+    If the IAA device has more than "g_comp_wqs_per_iaa" WQs configured,
+    the last "g_comp_wqs_per_iaa" number of WQs will be considered as
+    "compress only". The remaining WQs will be considered as "decomp only".
+
+    If the device has less than or equal to "g_comp_wqs_per_iaa" WQs, all
+    the device's WQs will be considered "generic", i.e., the driver will
+    submit compress and decompress jobs to all the WQs configured for the
+    device.
+
+    For example, if an IAA "X" has 2 WQs, this will set up 1 decompress WQ and
+    1 compress WQ::
+
+      echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
+
+     wqX.0: decompress jobs only.
+     wqX.1: compress jobs only.
+
+    This setting would typically benefit workloads that see a high
+    level of compress and decompress activity.
+
+    If an IAA has 1 WQ, that WQ will be considered "generic": the driver
+    will submit compress and decompress jobs to the same WQ (this is
+    independent of the "g_comp_wqs_per_iaa" setting):
+
+     wqX.0: compress and decompress jobs.
+
+    This would typically benefit workloads that see significant cold
+    memory being reclaimed, and consequently, high swapout and low swapin
+    activity.
+
+  - distribute_comps
+
+    Distribute compressions to all IAAs on package (default is Y).
+
+    Assuming the WQ type has been established as
+    compress-only/decompress-only/generic, this setting will determine if
+    the driver will distribute compress jobs to all IAAs on a package
+    (default behavior) or not.
+
+    If this is turned off, the driver will dispatch compress jobs to a
+    given IAA "compression enabled" WQ only from cores that are mapped to
+    that IAA using an algorithm that evenly distributes IAAs per package
+    to cores per package. For example, on a Sapphire Rapids server with
+    56-physical-cores and 4 IAAs per package, with Hyperthreading, 28
+    logical cores will be assigned to each IAA. With the
+    "distribute_comps" driver parameter turned off, the driver will send
+    compress jobs only to its assigned IAA device.
+
+    Enabling "distribute_comps" would typically benefit workloads in
+    terms of batch compress latency and throughput.
+
+  - distribute_decomps
+
+    Distribute decompressions to all IAAs on package (default is N).
+
+    Assuming the WQ type has been established as
+    compress-only/decompress-only/generic, this setting will determine if
+    the driver will distribute decompress jobs to all IAAs on a package
+    or not. The default is to not distribute decompressions.
+
+    Enabling "distribute_decomps" would typically benefit workloads that
+    see a high level of compress and decompress activity, especially
+    p99 decompress latency.
+
+    Recommended settings for best compress/decompress latency, throughput
+    and hence memory savings for a moderately contended server that
+    has more than 1 IAA device enabled on a given package:
+
+      2 WQs per IAA
+      g_comp_wqs_per_iaa = 1 (separate WQ for comps/decomps per IAA)
+      distribute_decomps = Y
+      distribute_comps = Y
+
+    For a system that has only 1 IAA device enabled on a given package,
+    the recommended settings are:
+
+      1 WQ per IAA
+      g_comp_wqs_per_iaa = 0 (same WQ for comps/decomps)
+      distribute_decomps = N
+      distribute_comps = N
+
+    Examples:
+
+    For a Sapphire Rapids server with 2 packages, 56 cores and 4 IAAs per
+    package, each IAA has 2 WQs, and these settings are in effect::
+
+      echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
+      echo 1 > /sys/bus/dsa/drivers/crypto/distribute_comps
+      echo 0 > /sys/bus/dsa/drivers/crypto/distribute_decomps
+
+    This enables the following behavior:
+
+      wqX.0: decompress jobs only.
+      wqX.1: compress jobs only.
+
+    Compress jobs from all cores on package-0 will be distributed in
+    round-robin manner to [iax1, iax3, iax5, iax7]'s wqX.1, to maximize
+    compression throughput and memory savings while improving latency:
+
+      wq1.1
+      wq3.1
+      wq5.1
+      wq7.1
+
+    Likewise, compress jobs from all cores on package-1 will be
+    distributed in round-robin manner to [iax9, iax11, iax13, iax15]'s
+    wqX.1, to maximize compression throughput and memory savings while
+    improving latency for workloads running on package-1:
+
+      wq9.1
+      wq11.1
+      wq13.1
+      wq15.1
+
+    Decompress jobs will be submitted from mapped logical cores only, as
+    follows:
+
+      package-0:
+
+        CPU   0-13,112-125   14-27,126-139  28-41,140-153  42-55,154-167
+        IAA:  iax1           iax3           iax5           iax7
+        WQ:   wq1.0          wq3.0          wq5.0          wq7.0
+
+      package-1:
+
+        CPU   56-69,168-181  70-83,182-195  84-97,196-209   98-111,210-223
+        IAA:  iax9           iax11          iax13           iax15
+        WQ:   wq9.0          wq11.0         wq13.0          wq15.0
+
+
 .. _iaa_default_config:
 
 IAA Default Configuration
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 56985e3952637..549ac98a9366e 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -46,6 +46,7 @@ struct iaa_wq {
 	struct idxd_wq		*wq;
 	int			ref;
 	bool			remove;
+	bool			mapped;
 
 	struct iaa_device	*iaa_device;
 
@@ -63,6 +64,13 @@ struct iaa_device_compression_mode {
 	dma_addr_t			aecs_comp_table_dma_addr;
 };
 
+struct wq_table_entry {
+	struct idxd_wq	**wqs;
+	unsigned int	max_wqs;
+	unsigned int	n_wqs;
+	unsigned int	cur_wq;
+};
+
 /* Representation of IAA device with wqs, populated by probe */
 struct iaa_device {
 	struct list_head		list;
@@ -73,19 +81,15 @@ struct iaa_device {
 	int				n_wq;
 	struct list_head		wqs;
 
+	struct wq_table_entry		*generic_wq_table;
+	struct wq_table_entry		*comp_wq_table;
+
 	atomic64_t			comp_calls;
 	atomic64_t			comp_bytes;
 	atomic64_t			decomp_calls;
 	atomic64_t			decomp_bytes;
 };
 
-struct wq_table_entry {
-	struct idxd_wq **wqs;
-	int	max_wqs;
-	int	n_wqs;
-	int	cur_wq;
-};
-
 #define IAA_AECS_ALIGN			32
 
 /*
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 760997eee8fe5..c6db721eaa799 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -23,32 +23,86 @@
 #define pr_fmt(fmt)			"idxd: " IDXD_SUBDRIVER_NAME ": " fmt
 
 #define IAA_ALG_PRIORITY               300
+#define MAX_PKG_IAA   8
+#define MAX_IAA_WQ    8
 
 /**************************************
  * Driver internal global variables.
  **************************************/
 
 /* number of iaa instances probed */
-static unsigned int nr_iaa;
+static atomic_t nr_iaa = ATOMIC_INIT(0);
 static unsigned int nr_cpus;
-static unsigned int nr_nodes;
-static unsigned int nr_cpus_per_node;
+static unsigned int nr_packages;
+static unsigned int nr_cpus_per_package;
+static atomic_t nr_iaa_per_package = ATOMIC_INIT(0);
 
 /* Number of physical cpus sharing each iaa instance */
-static unsigned int cpus_per_iaa;
+static atomic_t cpus_per_iaa = ATOMIC_INIT(0);
 
-/* Per-cpu lookup table for balanced wqs */
-static struct wq_table_entry __percpu *wq_table;
+/* Per-cpu lookup table for decomp wqs. */
+static struct wq_table_entry __percpu *cpu_decomp_wqs;
+
+/* Per-cpu lookup table for comp wqs. */
+static struct wq_table_entry __percpu *cpu_comp_wqs;
+
+/* All decomp wqs from IAAs on a package. */
+static struct wq_table_entry **pkg_global_decomp_wqs;
+/* All comp wqs from IAAs on a package. */
+static struct wq_table_entry **pkg_global_comp_wqs;
 
 LIST_HEAD(iaa_devices);
 DEFINE_MUTEX(iaa_devices_lock);
 
-/* If enabled, IAA hw crypto algos are registered, unavailable otherwise */
-static bool iaa_crypto_enabled;
+/*
+ * If enabled, IAA hw crypto algos are registered, unavailable otherwise:
+ *
+ * We use the atomic @iaa_crypto_enabled to know if the per-CPU
+ * compress/decompress wq tables have been setup successfully.
+ * Since @iaa_crypto_enabled is atomic, the core functions that
+ * return a wq for compression/decompression, namely,
+ * comp_wq_table_next_wq() and decomp_wq_table_next_wq() will
+ * test this atomic before proceeding to query the per-cpu wq tables.
+ *
+ * These events will set @iaa_crypto_enabled to 1:
+ * - Successful rebalance_wq_table() after individual wq addition/removal.
+ *
+ * These events will set @iaa_crypto_enabled to 0:
+ * - Error during rebalance_wq_table() after individual wq addition/removal.
+ * - check_completion() timeouts.
+ * - @nr_iaa is 0.
+ * - module cleanup.
+ */
+static atomic_t iaa_crypto_enabled = ATOMIC_INIT(0);
+
+/*
+ * First wq probed, to use until @iaa_crypto_enabled is 1:
+ *
+ * The first wq probed will be entered in the per-CPU comp/decomp wq tables
+ * until the IAA compression modes are registered. This is done to facilitate
+ * the compress/decompress calls from the crypto testmgr resulting from
+ * calling crypto_register_acomp().
+ *
+ * With the new dynamic package-level rebalancing of WQs being
+ * discovered asynchronously and concurrently with tests
+ * triggered from device registration, this is needed to
+ * determine when it is safe for the rebalancing of decomp/comp
+ * WQs to de-allocate the per-package WQs and re-allocate them
+ * based on the latest number of IAA devices and WQs.
+ */
+static struct idxd_wq *first_wq_found;
+DEFINE_MUTEX(first_wq_found_lock);
+
 static bool iaa_crypto_registered;
 
 static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
 
+/* Distribute decompressions across all IAAs on the package. */
+static bool iaa_distribute_decomps;
+
+/* Distribute compressions across all IAAs on the package. */
+static bool iaa_distribute_comps = true;
+
 /* Verify results of IAA compress or not */
 static bool iaa_verify_compress = true;
 
@@ -86,6 +140,9 @@ static bool async_mode;
 /* Use interrupts */
 static bool use_irq;
 
+/* Number of compress-only WQs per IAA. */
+static unsigned int g_comp_wqs_per_iaa = 1;
+
 /**************************************************
  * Driver attributes along with get/set functions.
  **************************************************/
@@ -102,7 +159,7 @@ static ssize_t verify_compress_store(struct device_driver *driver,
 
 	mutex_lock(&iaa_devices_lock);
 
-	if (iaa_crypto_enabled)
+	if (atomic_read(&iaa_crypto_enabled))
 		goto out;
 
 	ret = kstrtobool(buf, &iaa_verify_compress);
@@ -166,7 +223,7 @@ static ssize_t sync_mode_store(struct device_driver *driver,
 
 	mutex_lock(&iaa_devices_lock);
 
-	if (iaa_crypto_enabled)
+	if (atomic_read(&iaa_crypto_enabled))
 		goto out;
 
 	ret = set_iaa_sync_mode(buf);
@@ -179,6 +236,87 @@ static ssize_t sync_mode_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(sync_mode);
 
+static ssize_t g_comp_wqs_per_iaa_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%u\n", g_comp_wqs_per_iaa);
+}
+
+static ssize_t g_comp_wqs_per_iaa_store(struct device_driver *driver,
+				   const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (atomic_read(&iaa_crypto_enabled))
+		goto out;
+
+	ret = kstrtouint(buf, 10, &g_comp_wqs_per_iaa);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(g_comp_wqs_per_iaa);
+
+static ssize_t distribute_decomps_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", iaa_distribute_decomps);
+}
+
+static ssize_t distribute_decomps_store(struct device_driver *driver,
+					const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (atomic_read(&iaa_crypto_enabled))
+		goto out;
+
+	ret = kstrtobool(buf, &iaa_distribute_decomps);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(distribute_decomps);
+
+static ssize_t distribute_comps_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", iaa_distribute_comps);
+}
+
+static ssize_t distribute_comps_store(struct device_driver *driver,
+				      const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (atomic_read(&iaa_crypto_enabled))
+		goto out;
+
+	ret = kstrtobool(buf, &iaa_distribute_comps);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(distribute_comps);
+
 /****************************
  * Driver compression modes.
  ****************************/
@@ -464,32 +602,81 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
  * allocate/init/query/deallocate devices/wqs.
  ***********************************************************/
 
-static struct iaa_device *iaa_device_alloc(void)
+static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 {
 	struct iaa_device *iaa_device;
+	struct wq_table_entry *wqt;
 
 	iaa_device = kzalloc(sizeof(*iaa_device), GFP_KERNEL);
 	if (!iaa_device)
-		return NULL;
+		goto err;
+
+	iaa_device->idxd = idxd;
+
+	/* IAA device's generic/decomp wqs. */
+	iaa_device->generic_wq_table = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+	if (!iaa_device->generic_wq_table)
+		goto err;
+
+	wqt = iaa_device->generic_wq_table;
+
+	wqt->wqs = kcalloc(iaa_device->idxd->max_wqs, sizeof(struct idxd_wq *), GFP_KERNEL);
+	if (!wqt->wqs)
+		goto err;
+
+	wqt->max_wqs = iaa_device->idxd->max_wqs;
+	wqt->n_wqs = 0;
+
+	/*
+	 * IAA device's comp wqs (optional). If the device has more than
+	 * "g_comp_wqs_per_iaa" WQs configured, the last "g_comp_wqs_per_iaa"
+	 * number of WQs will be considered as "comp only". The remaining
+	 * WQs will be considered as "decomp only".
+	 * If the device has <= "g_comp_wqs_per_iaa" WQs, all the
+	 * device's WQs will be considered "generic", i.e., cores can submit
+	 * comp and decomp jobs to all the WQs configured for the device.
+	 */
+	iaa_device->comp_wq_table = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+	if (!iaa_device->comp_wq_table)
+		goto err;
+
+	wqt = iaa_device->comp_wq_table;
+
+	wqt->wqs = kcalloc(iaa_device->idxd->max_wqs, sizeof(struct idxd_wq *), GFP_KERNEL);
+	if (!wqt->wqs)
+		goto err;
+
+	wqt->max_wqs = iaa_device->idxd->max_wqs;
+	wqt->n_wqs = 0;
 
 	INIT_LIST_HEAD(&iaa_device->wqs);
 
 	return iaa_device;
+
+err:
+	if (iaa_device) {
+		if (iaa_device->generic_wq_table) {
+			kfree(iaa_device->generic_wq_table->wqs);
+			kfree(iaa_device->generic_wq_table);
+		}
+		kfree(iaa_device->comp_wq_table);
+		kfree(iaa_device);
+	}
+
+	return NULL;
 }
 
 static struct iaa_device *add_iaa_device(struct idxd_device *idxd)
 {
 	struct iaa_device *iaa_device;
 
-	iaa_device = iaa_device_alloc();
+	iaa_device = iaa_device_alloc(idxd);
 	if (!iaa_device)
 		return NULL;
 
-	iaa_device->idxd = idxd;
-
 	list_add_tail(&iaa_device->list, &iaa_devices);
 
-	nr_iaa++;
+	atomic_inc(&nr_iaa);
 
 	return iaa_device;
 }
@@ -509,7 +696,7 @@ static void del_iaa_device(struct iaa_device *iaa_device)
 {
 	list_del(&iaa_device->list);
 
-	nr_iaa--;
+	atomic_dec(&nr_iaa);
 }
 
 static void free_iaa_device(struct iaa_device *iaa_device)
@@ -518,6 +705,17 @@ static void free_iaa_device(struct iaa_device *iaa_device)
 		return;
 
 	remove_device_compression_modes(iaa_device);
+
+	if (iaa_device->generic_wq_table) {
+		kfree(iaa_device->generic_wq_table->wqs);
+		kfree(iaa_device->generic_wq_table);
+	}
+
+	if (iaa_device->comp_wq_table) {
+		kfree(iaa_device->comp_wq_table->wqs);
+		kfree(iaa_device->comp_wq_table);
+	}
+
 	kfree(iaa_device);
 }
 
@@ -576,7 +774,7 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
 
 			dev_dbg(dev, "removed wq %d from iaa_device %d, n_wq %d, nr_iaa %d\n",
 				wq->id, iaa_device->idxd->id,
-				iaa_device->n_wq, nr_iaa);
+				iaa_device->n_wq, atomic_read(&nr_iaa));
 
 			if (iaa_device->n_wq == 0)
 				del_iaa_device(iaa_device);
@@ -588,6 +786,7 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
 static void remove_iaa_wq(struct idxd_wq *wq)
 {
 	struct iaa_device *iaa_device;
+	unsigned int num_pkg_iaa = 0;
 
 	list_for_each_entry(iaa_device, &iaa_devices, list) {
 		if (iaa_has_wq(iaa_device, wq)) {
@@ -596,12 +795,20 @@ static void remove_iaa_wq(struct idxd_wq *wq)
 		}
 	}
 
-	if (nr_iaa) {
-		cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
-		if (!cpus_per_iaa)
-			cpus_per_iaa = 1;
-	} else
-		cpus_per_iaa = 1;
+	if (atomic_read(&nr_iaa)) {
+		atomic_set(&cpus_per_iaa, (nr_packages * nr_cpus_per_package) / atomic_read(&nr_iaa));
+		if (!atomic_read(&cpus_per_iaa))
+			atomic_set(&cpus_per_iaa, 1);
+
+		num_pkg_iaa = atomic_read(&nr_iaa) / nr_packages;
+		if (!num_pkg_iaa)
+			num_pkg_iaa = 1;
+	} else {
+		atomic_set(&cpus_per_iaa, 1);
+		num_pkg_iaa = 1;
+	}
+
+	atomic_set(&nr_iaa_per_package, num_pkg_iaa);
 }
 
 static void __free_iaa_wq(struct iaa_wq *iaa_wq)
@@ -635,6 +842,7 @@ static int save_iaa_wq(struct idxd_wq *wq)
 	struct pci_dev *pdev;
 	struct device *dev;
 	int ret = 0;
+	unsigned int num_pkg_iaa = 0;
 
 	list_for_each_entry(iaa_device, &iaa_devices, list) {
 		if (iaa_device->idxd == wq->idxd) {
@@ -687,12 +895,19 @@ static int save_iaa_wq(struct idxd_wq *wq)
 		}
 	}
 
-	if (WARN_ON(nr_iaa == 0))
+	if (WARN_ON(atomic_read(&nr_iaa) == 0))
 		return -EINVAL;
 
-	cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
-	if (!cpus_per_iaa)
-		cpus_per_iaa = 1;
+	atomic_set(&cpus_per_iaa, (nr_packages * nr_cpus_per_package) / atomic_read(&nr_iaa));
+	if (!atomic_read(&cpus_per_iaa))
+		atomic_set(&cpus_per_iaa, 1);
+
+	num_pkg_iaa = atomic_read(&nr_iaa) / nr_packages;
+	if (!num_pkg_iaa)
+		num_pkg_iaa = 1;
+
+	atomic_set(&nr_iaa_per_package, num_pkg_iaa);
+
 out:
 	return 0;
 }
@@ -748,105 +963,284 @@ static int iaa_wq_put(struct idxd_wq *wq)
  * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
  ***************************************************************/
 
-static void wq_table_free_entry(int cpu)
+/*
+ * Given a cpu, find the closest IAA instance.
+ */
+static inline int cpu_to_iaa(int cpu)
 {
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+	int package_id, base_iaa, iaa = 0;
+
+	if (!nr_packages || !atomic_read(&nr_iaa_per_package) || !atomic_read(&nr_iaa))
+		return -1;
+
+	package_id = topology_logical_package_id(cpu);
+	base_iaa = package_id * atomic_read(&nr_iaa_per_package);
+	iaa = base_iaa + ((cpu % nr_cpus_per_package) / atomic_read(&cpus_per_iaa));
 
-	kfree(entry->wqs);
-	memset(entry, 0, sizeof(*entry));
+	pr_debug("cpu = %d, package_id = %d, base_iaa = %d, iaa = %d",
+		 cpu, package_id, base_iaa, iaa);
+
+	if (iaa >= 0 && iaa < atomic_read(&nr_iaa))
+		return iaa;
+
+	return (atomic_read(&nr_iaa) - 1);
 }
 
-static void wq_table_clear_entry(int cpu)
+static void free_wq_tables(void)
 {
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+	if (cpu_decomp_wqs) {
+		free_percpu(cpu_decomp_wqs);
+		cpu_decomp_wqs = NULL;
+	}
 
-	entry->n_wqs = 0;
-	entry->cur_wq = 0;
-	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+	if (cpu_comp_wqs) {
+		free_percpu(cpu_comp_wqs);
+		cpu_comp_wqs = NULL;
+	}
+
+	pr_debug("freed comp/decomp wq tables\n");
 }
 
-static void clear_wq_table(void)
+static void pkg_global_wqs_dealloc(void)
 {
-	int cpu;
+	int i;
 
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_clear_entry(cpu);
+	if (pkg_global_decomp_wqs) {
+		for (i = 0; i < nr_packages; ++i) {
+			kfree(pkg_global_decomp_wqs[i]->wqs);
+			kfree(pkg_global_decomp_wqs[i]);
+		}
+		kfree(pkg_global_decomp_wqs);
+		pkg_global_decomp_wqs = NULL;
+	}
 
-	pr_debug("cleared wq table\n");
+	if (pkg_global_comp_wqs) {
+		for (i = 0; i < nr_packages; ++i) {
+			kfree(pkg_global_comp_wqs[i]->wqs);
+			kfree(pkg_global_comp_wqs[i]);
+		}
+		kfree(pkg_global_comp_wqs);
+		pkg_global_comp_wqs = NULL;
+	}
 }
 
-static void free_wq_table(void)
+static bool pkg_global_wqs_alloc(void)
 {
-	int cpu;
+	int i;
+
+	pkg_global_decomp_wqs = kcalloc(nr_packages, sizeof(*pkg_global_decomp_wqs), GFP_KERNEL);
+	if (!pkg_global_decomp_wqs)
+		return false;
+
+	for (i = 0; i < nr_packages; ++i) {
+		pkg_global_decomp_wqs[i] = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+		if (!pkg_global_decomp_wqs[i])
+			goto err;
+
+		pkg_global_decomp_wqs[i]->wqs = kcalloc(MAX_PKG_IAA * MAX_IAA_WQ, sizeof(struct idxd_wq *), GFP_KERNEL);
+		if (!pkg_global_decomp_wqs[i]->wqs)
+			goto err;
+
+		pkg_global_decomp_wqs[i]->max_wqs = MAX_PKG_IAA * MAX_IAA_WQ;
+	}
+
+	pkg_global_comp_wqs = kcalloc(nr_packages, sizeof(*pkg_global_comp_wqs), GFP_KERNEL);
+	if (!pkg_global_comp_wqs)
+		goto err;
+
+	for (i = 0; i < nr_packages; ++i) {
+		pkg_global_comp_wqs[i] = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+		if (!pkg_global_comp_wqs[i])
+			goto err;
 
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_free_entry(cpu);
+		pkg_global_comp_wqs[i]->wqs = kcalloc(MAX_PKG_IAA * MAX_IAA_WQ, sizeof(struct idxd_wq *), GFP_KERNEL);
+		if (!pkg_global_comp_wqs[i]->wqs)
+			goto err;
+
+		pkg_global_comp_wqs[i]->max_wqs = MAX_PKG_IAA * MAX_IAA_WQ;
+	}
 
-	free_percpu(wq_table);
+	return true;
 
-	pr_debug("freed wq table\n");
+err:
+	pkg_global_wqs_dealloc();
+	return false;
 }
 
 static int alloc_wq_table(int max_wqs)
 {
-	struct wq_table_entry *entry;
-	int cpu;
-
-	wq_table = alloc_percpu(struct wq_table_entry);
-	if (!wq_table)
+	cpu_decomp_wqs = alloc_percpu_gfp(struct wq_table_entry, GFP_KERNEL | __GFP_ZERO);
+	if (!cpu_decomp_wqs)
 		return -ENOMEM;
 
-	for (cpu = 0; cpu < nr_cpus; cpu++) {
-		entry = per_cpu_ptr(wq_table, cpu);
-		entry->wqs = kcalloc(max_wqs, sizeof(*entry->wqs), GFP_KERNEL);
-		if (!entry->wqs) {
-			free_wq_table();
-			return -ENOMEM;
-		}
+	cpu_comp_wqs = alloc_percpu_gfp(struct wq_table_entry, GFP_KERNEL | __GFP_ZERO);
+	if (!cpu_comp_wqs)
+		goto err;
 
-		entry->max_wqs = max_wqs;
-	}
+	if (!pkg_global_wqs_alloc())
+		goto err;
 
 	pr_debug("initialized wq table\n");
 
 	return 0;
+
+err:
+	free_wq_tables();
+	return -ENOMEM;
+}
+
+/*
+ * The caller should have established that device_iaa_wqs is not empty,
+ * i.e., every IAA device in "iaa_devices" has at least one WQ.
+ */
+static void add_device_wqs_to_wq_table(struct wq_table_entry *dst_wq_table,
+				       struct wq_table_entry *device_wq_table)
+{
+	int i;
+
+	for (i = 0; i < device_wq_table->n_wqs; ++i)
+		dst_wq_table->wqs[dst_wq_table->n_wqs++] = device_wq_table->wqs[i];
+}
+
+static bool reinit_pkg_global_wqs(bool comp)
+{
+	int cur_iaa = 0, pkg = 0;
+	struct iaa_device *iaa_device;
+	struct wq_table_entry **pkg_wqs = comp ? pkg_global_comp_wqs : pkg_global_decomp_wqs;
+
+	for (pkg = 0; pkg < nr_packages; ++pkg)
+		pkg_wqs[pkg]->n_wqs = 0;
+
+	pkg = 0;
+
+one_iaa_special_case:
+	/* Re-initialize per-package wqs. */
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		struct wq_table_entry *device_wq_table = comp ?
+			((iaa_device->comp_wq_table->n_wqs > 0) ?
+				iaa_device->comp_wq_table : iaa_device->generic_wq_table) :
+			iaa_device->generic_wq_table;
+
+		if (pkg_wqs[pkg]->n_wqs + device_wq_table->n_wqs > pkg_wqs[pkg]->max_wqs) {
+			pkg_wqs[pkg]->wqs = krealloc(pkg_wqs[pkg]->wqs,
+						     ksize(pkg_wqs[pkg]->wqs) +
+						     max((MAX_PKG_IAA * MAX_IAA_WQ), iaa_device->n_wq) * sizeof(struct idxd_wq *),
+						     GFP_KERNEL | __GFP_ZERO);
+			if (!pkg_wqs[pkg]->wqs)
+				return false;
+
+			pkg_wqs[pkg]->max_wqs = ksize(pkg_wqs[pkg]->wqs)/sizeof(struct idxd_wq *);
+		}
+
+		add_device_wqs_to_wq_table(pkg_wqs[pkg], device_wq_table);
+
+		pr_debug("pkg_global_%s_wqs[%d] has %u n_wqs %u max_wqs",
+			 (comp ? "comp" : "decomp"), pkg, pkg_wqs[pkg]->n_wqs, pkg_wqs[pkg]->max_wqs);
+
+		if (++cur_iaa == atomic_read(&nr_iaa_per_package)) {
+			if (++pkg == nr_packages)
+				break;
+			cur_iaa = 0;
+			if (atomic_read(&nr_iaa) == 1)
+				goto one_iaa_special_case;
+		}
+	}
+
+	return true;
 }
 
-static void wq_table_add(int cpu, struct idxd_wq *wq)
+static void create_cpu_wq_table(int cpu, struct wq_table_entry *wq_table, bool comp)
 {
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+	struct wq_table_entry *entry = comp ?
+		per_cpu_ptr(cpu_comp_wqs, cpu) :
+		per_cpu_ptr(cpu_decomp_wqs, cpu);
+
+	if (!atomic_read(&iaa_crypto_enabled)) {
+		mutex_lock(&first_wq_found_lock);
+
+		BUG_ON(!first_wq_found && !wq_table->n_wqs);
+
+		if (!first_wq_found)
+			first_wq_found = wq_table->wqs[0];
+
+		mutex_unlock(&first_wq_found_lock);
 
-	if (WARN_ON(entry->n_wqs == entry->max_wqs))
+		entry->wqs = &first_wq_found;
+		entry->max_wqs = 1;
+		entry->n_wqs = 1;
+		entry->cur_wq = 0;
+		pr_debug("%s: cpu %d: added %u first_wq_found for %s wqs up to wq %d.%d\n", __func__,
+			 cpu, entry->n_wqs, comp ? "comp":"decomp",
+			 entry->wqs[entry->n_wqs - 1]->idxd->id,
+			 entry->wqs[entry->n_wqs - 1]->id);
 		return;
+	}
+
+	entry->wqs = wq_table->wqs;
+	entry->max_wqs = wq_table->max_wqs;
+	entry->n_wqs = wq_table->n_wqs;
+	entry->cur_wq = 0;
+
+	if (entry->n_wqs)
+		pr_debug("%s: cpu %d: added %u iaa %s wqs up to wq %d.%d: entry->max_wqs = %u\n", __func__,
+			 cpu, entry->n_wqs, comp ? "comp":"decomp",
+			 entry->wqs[entry->n_wqs - 1]->idxd->id, entry->wqs[entry->n_wqs - 1]->id,
+			 entry->max_wqs);
+}
+
+static void set_cpu_wq_table_start_wq(int cpu, bool comp)
+{
+	struct wq_table_entry *entry = comp ?
+		per_cpu_ptr(cpu_comp_wqs, cpu) :
+		per_cpu_ptr(cpu_decomp_wqs, cpu);
+	unsigned int num_pkg_iaa = atomic_read(&nr_iaa_per_package);
+
+	int start_wq = (entry->n_wqs / num_pkg_iaa) * (cpu_to_iaa(cpu) % num_pkg_iaa);
+
+	if ((start_wq >= 0) && (start_wq < entry->n_wqs))
+		entry->cur_wq = start_wq;
+}
 
-	entry->wqs[entry->n_wqs++] = wq;
+static void create_cpu_wq_table_from_pkg_wqs(bool comp)
+{
+	int cpu;
 
-	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
-		 entry->wqs[entry->n_wqs - 1]->idxd->id,
-		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+	/*
+	 * All CPUs on the same package share the same "package global"
+	 * [de]comp_wqs.
+	 */
+	for (cpu = 0; cpu < nr_cpus; cpu += nr_cpus_per_package) {
+		int package_id = topology_logical_package_id(cpu);
+		struct wq_table_entry *pkg_wq_table = comp ?
+			((pkg_global_comp_wqs[package_id]->n_wqs > 0) ?
+				pkg_global_comp_wqs[package_id] : pkg_global_decomp_wqs[package_id])
+			: pkg_global_decomp_wqs[package_id];
+		int pkg_cpu;
+
+		for (pkg_cpu = cpu; pkg_cpu < cpu + nr_cpus_per_package; ++pkg_cpu) {
+			/* Initialize decomp/comp wq_table for CPU. */
+			create_cpu_wq_table(pkg_cpu, pkg_wq_table, comp);
+			/* Stagger the starting WQ in the package WQ table, for each CPU. */
+			set_cpu_wq_table_start_wq(pkg_cpu, comp);
+		}
+	}
 }
 
-static int wq_table_add_wqs(int iaa, int cpu)
+static int add_mapped_device_wq_table_for_cpu(int iaa, int cpu, bool comp)
 {
 	struct iaa_device *iaa_device, *found_device = NULL;
-	int ret = 0, cur_iaa = 0, n_wqs_added = 0;
-	struct idxd_device *idxd;
-	struct iaa_wq *iaa_wq;
-	struct pci_dev *pdev;
-	struct device *dev;
+	struct wq_table_entry *device_wq_table;
+	int ret = 0, cur_iaa = 0;
 
 	list_for_each_entry(iaa_device, &iaa_devices, list) {
-		idxd = iaa_device->idxd;
-		pdev = idxd->pdev;
-		dev = &pdev->dev;
-
 		if (cur_iaa != iaa) {
 			cur_iaa++;
 			continue;
 		}
 
 		found_device = iaa_device;
-		dev_dbg(dev, "getting wq from iaa_device %d, cur_iaa %d\n",
+		dev_dbg(&found_device->idxd->pdev->dev,
+			"getting wq from iaa_device %d, cur_iaa %d\n",
 			found_device->idxd->id, cur_iaa);
 		break;
 	}
@@ -861,93 +1255,219 @@ static int wq_table_add_wqs(int iaa, int cpu)
 		}
 		cur_iaa = 0;
 
-		idxd = found_device->idxd;
-		pdev = idxd->pdev;
-		dev = &pdev->dev;
-		dev_dbg(dev, "getting wq from only iaa_device %d, cur_iaa %d\n",
+		dev_dbg(&found_device->idxd->pdev->dev,
+			"getting wq from only iaa_device %d, cur_iaa %d\n",
 			found_device->idxd->id, cur_iaa);
 	}
 
-	list_for_each_entry(iaa_wq, &found_device->wqs, list) {
-		wq_table_add(cpu, iaa_wq->wq);
-		pr_debug("rebalance: added wq for cpu=%d: iaa wq %d.%d\n",
-			 cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
-		n_wqs_added++;
+	device_wq_table = comp ?
+		((found_device->comp_wq_table->n_wqs > 0) ?
+			found_device->comp_wq_table : found_device->generic_wq_table) :
+		found_device->generic_wq_table;
+
+	create_cpu_wq_table(cpu, device_wq_table, comp);
+
+out:
+	return ret;
+}
+
+static void create_cpu_wq_table_from_mapped_device(bool comp)
+{
+	int cpu, iaa;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		iaa = cpu_to_iaa(cpu);
+		pr_debug("rebalance: cpu=%d iaa=%d\n", cpu, iaa);
+
+		if (WARN_ON(iaa == -1)) {
+			pr_debug("rebalance (cpu_to_iaa(%d)) failed!\n", cpu);
+			return;
+		}
+
+		if (WARN_ON(add_mapped_device_wq_table_for_cpu(iaa, cpu, comp))) {
+			pr_debug("could not add any wqs of iaa %d to cpu %d!\n", iaa, cpu);
+			return;
+		}
+	}
+}
+
+static int map_iaa_device_wqs(struct iaa_device *iaa_device)
+{
+	struct wq_table_entry *generic, *for_comps;
+	int ret = 0, n_wqs_added = 0;
+	struct iaa_wq *iaa_wq;
+
+	generic = iaa_device->generic_wq_table;
+	for_comps = iaa_device->comp_wq_table;
+
+	list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+		if (iaa_wq->mapped && ++n_wqs_added)
+			continue;
+
+		pr_debug("iaa_device %p: processing wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+
+		if ((!n_wqs_added || ((n_wqs_added + g_comp_wqs_per_iaa) < iaa_device->n_wq)) &&
+			(generic->n_wqs < generic->max_wqs)) {
+
+			generic->wqs[generic->n_wqs++] = iaa_wq->wq;
+			pr_debug("iaa_device %p: added decomp wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+		} else {
+			if (WARN_ON(for_comps->n_wqs == for_comps->max_wqs))
+				break;
+
+			for_comps->wqs[for_comps->n_wqs++] = iaa_wq->wq;
+			pr_debug("iaa_device %p: added comp wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+		}
+
+		iaa_wq->mapped = true;
+		++n_wqs_added;
 	}
 
-	if (!n_wqs_added) {
-		pr_debug("couldn't find any iaa wqs!\n");
+	if (!n_wqs_added && !iaa_device->n_wq) {
+		pr_debug("iaa_device %d: couldn't find any iaa wqs!\n", iaa_device->idxd->id);
 		ret = -EINVAL;
-		goto out;
 	}
-out:
+
 	return ret;
 }
 
+static void map_iaa_devices(void)
+{
+	struct iaa_device *iaa_device;
+
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		BUG_ON(map_iaa_device_wqs(iaa_device));
+	}
+}
+
 /*
- * Rebalance the wq table so that given a cpu, it's easy to find the
- * closest IAA instance.  The idea is to try to choose the most
- * appropriate IAA instance for a caller and spread available
- * workqueues around to clients.
+ * Rebalance the per-cpu wq table based on available IAA devices/WQs.
+ * Three driver parameters control how this algorithm works:
+ *
+ * - g_comp_wqs_per_iaa:
+ *
+ *   If multiple WQs are configured for a given device, this setting determines
+ *   the number of WQs to be used as "compress only" WQs. The remaining WQs will
+ *   be used as "decompress only WQs".
+ *   Note that the comp WQ can be the same as the decomp WQ, for e.g., if
+ *   g_comp_wqs_per_iaa is 0 (regardless of the # of available WQs per device), or,
+ *   if there is only 1 WQ configured for a device (regardless of
+ *   g_comp_wqs_per_iaa).
+ *
+ * - distribute_decomps, distribute_comps:
+ *
+ *   If this is enabled, all [de]comp WQs found from the IAA devices on a
+ *   package, will be aggregated into pkg_global_[de]comp_wqs, then assigned to
+ *   each CPU on the package.
+ *
+ * Note:
+ * -----
+ * rebalance_wq_table() will return true if it was able to successfully
+ * configure comp/decomp wqs for all CPUs, without changing the
+ * @iaa_crypto_enabled atomic. The caller can re-enable the use of the wq
+ * tables after rebalance_wq_table() returns true, by setting the
+ * @iaa_crypto_enabled atomic to 1.
+ * In case of any errors, the @iaa_crypto_enabled atomic will be set to 0,
+ * and rebalance_wq_table() will return false.
  */
-static void rebalance_wq_table(void)
+static bool rebalance_wq_table(void)
 {
-	const struct cpumask *node_cpus;
-	int node_cpu, node, cpu, iaa = 0;
+	int cpu;
 
-	if (nr_iaa == 0)
-		return;
+	if (atomic_read(&nr_iaa) == 0)
+		goto err;
 
-	pr_debug("rebalance: nr_nodes=%d, nr_cpus %d, nr_iaa %d, cpus_per_iaa %d\n",
-		 nr_nodes, nr_cpus, nr_iaa, cpus_per_iaa);
+	map_iaa_devices();
 
-	clear_wq_table();
+	pr_info("rebalance: nr_packages=%d, nr_cpus %d, nr_iaa %d, nr_iaa_per_package %d, cpus_per_iaa %d\n",
+		nr_packages, nr_cpus, atomic_read(&nr_iaa),
+		atomic_read(&nr_iaa_per_package), atomic_read(&cpus_per_iaa));
 
-	if (nr_iaa == 1) {
-		for_each_possible_cpu(cpu) {
-			if (WARN_ON(wq_table_add_wqs(0, cpu)))
-				goto err;
-		}
+	if (iaa_distribute_decomps) {
+		/* Each CPU uses all IAA devices on package for decomps. */
+		if (!reinit_pkg_global_wqs(false))
+			goto err;
+		create_cpu_wq_table_from_pkg_wqs(false);
+	} else {
+		/*
+		 * Each CPU uses the decomp WQ on the mapped IAA device using
+		 * a balanced mapping of cores to IAA.
+		 */
+		create_cpu_wq_table_from_mapped_device(false);
+	}
 
-		return;
+	if (iaa_distribute_comps) {
+		/* Each CPU uses all IAA devices on package for comps. */
+		if (!reinit_pkg_global_wqs(true))
+			goto err;
+		create_cpu_wq_table_from_pkg_wqs(true);
+	} else {
+		/*
+		 * Each CPU uses the comp WQ on the mapped IAA device using
+		 * a balanced mapping of cores to IAA.
+		 */
+		create_cpu_wq_table_from_mapped_device(true);
 	}
 
-	for_each_node_with_cpus(node) {
-		cpu = 0;
-		node_cpus = cpumask_of_node(node);
+	/* Verify that each cpu has comp and decomp wqs.*/
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		struct wq_table_entry *entry = per_cpu_ptr(cpu_decomp_wqs, cpu);
 
-		for_each_cpu(node_cpu, node_cpus) {
-			iaa = cpu / cpus_per_iaa;
-			if (WARN_ON(wq_table_add_wqs(iaa, node_cpu)))
-				goto err;
-			cpu++;
+		if (!entry->wqs || !entry->n_wqs) {
+			pr_err("%s: cpu %d does not have decomp_wqs", __func__, cpu);
+			goto err;
+		}
+
+		entry = per_cpu_ptr(cpu_comp_wqs, cpu);
+		if (!entry->wqs || !entry->n_wqs) {
+			pr_err("%s: cpu %d does not have comp_wqs", __func__, cpu);
+			goto err;
 		}
 	}
 
-	return;
+	pr_debug("Finished rebalance decomp/comp wqs.");
+	return true;
+
 err:
-	pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
+	atomic_set(&iaa_crypto_enabled, 0);
+	pr_debug("Error during rebalance decomp/comp wqs.");
+	return false;
 }
 
 /***************************************************************
  * Assign work-queues for driver ops using per-cpu wq_tables.
  ***************************************************************/
 
-static struct idxd_wq *wq_table_next_wq(int cpu)
+static struct idxd_wq *decomp_wq_table_next_wq(int cpu)
 {
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+	struct wq_table_entry *entry = per_cpu_ptr(cpu_decomp_wqs, cpu);
+	struct idxd_wq *wq;
+
+	if (!atomic_read(&iaa_crypto_enabled))
+		return NULL;
+
+	wq = entry->wqs[entry->cur_wq];
 
-	if (++entry->cur_wq >= entry->n_wqs)
+	if (++entry->cur_wq == entry->n_wqs)
 		entry->cur_wq = 0;
 
-	if (!entry->wqs[entry->cur_wq])
+	return wq;
+}
+
+static struct idxd_wq *comp_wq_table_next_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(cpu_comp_wqs, cpu);
+	struct idxd_wq *wq;
+
+	if (!atomic_read(&iaa_crypto_enabled))
 		return NULL;
 
-	pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
-		 entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
-		 entry->wqs[entry->cur_wq]->id, cpu);
+	wq = entry->wqs[entry->cur_wq];
 
-	return entry->wqs[entry->cur_wq];
+	if (++entry->cur_wq == entry->n_wqs)
+		entry->cur_wq = 0;
+
+	return wq;
 }
 
 /*************************************************
@@ -985,7 +1505,7 @@ static inline int check_completion(struct device *dev,
 			dev_err(dev, "%s completion timed out - "
 				"assuming broken hw, iaa_crypto now DISABLED\n",
 				op_str);
-			iaa_crypto_enabled = false;
+			atomic_set(&iaa_crypto_enabled, 0);
 			ret = -ETIMEDOUT;
 			goto out;
 		}
@@ -1501,18 +2021,13 @@ static int iaa_comp_acompress(struct acomp_req *req)
 
 	compression_ctx = crypto_tfm_ctx(tfm);
 
-	if (!iaa_crypto_enabled) {
-		pr_debug("iaa_crypto disabled, not compressing\n");
-		return -ENODEV;
-	}
-
 	if (!req->src || !req->slen) {
 		pr_debug("invalid src, not compressing\n");
 		return -EINVAL;
 	}
 
 	cpu = get_cpu();
-	wq = wq_table_next_wq(cpu);
+	wq = comp_wq_table_next_wq(cpu);
 	put_cpu();
 	if (!wq) {
 		pr_debug("no wq configured for cpu=%d\n", cpu);
@@ -1599,18 +2114,13 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 	struct device *dev;
 	struct idxd_wq *wq;
 
-	if (!iaa_crypto_enabled) {
-		pr_debug("iaa_crypto disabled, not decompressing\n");
-		return -ENODEV;
-	}
-
 	if (!req->src || !req->slen) {
 		pr_debug("invalid src, not decompressing\n");
 		return -EINVAL;
 	}
 
 	cpu = get_cpu();
-	wq = wq_table_next_wq(cpu);
+	wq = decomp_wq_table_next_wq(cpu);
 	put_cpu();
 	if (!wq) {
 		pr_debug("no wq configured for cpu=%d\n", cpu);
@@ -1725,6 +2235,8 @@ static int iaa_register_compression_device(void)
 
 static int iaa_unregister_compression_device(void)
 {
+	atomic_set(&iaa_crypto_enabled, 0);
+
 	if (iaa_crypto_registered)
 		crypto_unregister_acomp(&iaa_acomp_fixed_deflate);
 
@@ -1746,10 +2258,13 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 	if (data->type != IDXD_TYPE_IAX)
 		return -ENODEV;
 
+	mutex_lock(&iaa_devices_lock);
+
 	mutex_lock(&wq->wq_lock);
 
 	if (idxd_wq_get_private(wq)) {
 		mutex_unlock(&wq->wq_lock);
+		mutex_unlock(&iaa_devices_lock);
 		return -EBUSY;
 	}
 
@@ -1771,8 +2286,6 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 		goto err;
 	}
 
-	mutex_lock(&iaa_devices_lock);
-
 	if (list_empty(&iaa_devices)) {
 		ret = alloc_wq_table(wq->idxd->max_wqs);
 		if (ret)
@@ -1784,24 +2297,33 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 	if (ret)
 		goto err_save;
 
-	rebalance_wq_table();
+	if (!rebalance_wq_table()) {
+		dev_dbg(dev, "%s: IAA rebalancing device wq tables failed\n", __func__);
+		goto err_register;
+	}
+	atomic_set(&iaa_crypto_enabled, 1);
 
 	if (first_wq) {
-		iaa_crypto_enabled = true;
 		ret = iaa_register_compression_device();
 		if (ret != 0) {
-			iaa_crypto_enabled = false;
 			dev_dbg(dev, "IAA compression device registration failed\n");
 			goto err_register;
 		}
+
+		if (!rebalance_wq_table()) {
+			dev_dbg(dev, "%s: Rerun after registration: IAA rebalancing device wq tables failed\n", __func__);
+			goto err_register;
+		}
+		atomic_set(&iaa_crypto_enabled, 1);
+
 		try_module_get(THIS_MODULE);
 
 		pr_info("iaa_crypto now ENABLED\n");
 	}
 
-	mutex_unlock(&iaa_devices_lock);
 out:
 	mutex_unlock(&wq->wq_lock);
+	mutex_unlock(&iaa_devices_lock);
 
 	return ret;
 
@@ -1810,9 +2332,8 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 	free_iaa_wq(idxd_wq_get_private(wq));
 err_save:
 	if (first_wq)
-		free_wq_table();
+		free_wq_tables();
 err_alloc:
-	mutex_unlock(&iaa_devices_lock);
 	idxd_drv_disable_wq(wq);
 err:
 	wq->type = IDXD_WQT_NONE;
@@ -1827,13 +2348,17 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 	struct iaa_wq *iaa_wq;
 	bool free = false;
 
+	atomic_set(&iaa_crypto_enabled, 0);
 	idxd_wq_quiesce(wq);
 
-	mutex_lock(&wq->wq_lock);
 	mutex_lock(&iaa_devices_lock);
+	mutex_lock(&wq->wq_lock);
 
 	remove_iaa_wq(wq);
 
+	if (!rebalance_wq_table())
+		pr_debug("%s: IAA rebalancing device wq tables failed\n", __func__);
+
 	spin_lock(&idxd->dev_lock);
 	iaa_wq = idxd_wq_get_private(wq);
 	if (!iaa_wq) {
@@ -1856,18 +2381,22 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 	}
 
 	idxd_drv_disable_wq(wq);
-	rebalance_wq_table();
 
-	if (nr_iaa == 0) {
-		iaa_crypto_enabled = false;
-		free_wq_table();
+	if (atomic_read(&nr_iaa) == 0) {
+		atomic_set(&iaa_crypto_enabled, 0);
+		pkg_global_wqs_dealloc();
+		free_wq_tables();
+		BUG_ON(!list_empty(&iaa_devices));
+		INIT_LIST_HEAD(&iaa_devices);
 		module_put(THIS_MODULE);
 
 		pr_info("iaa_crypto now DISABLED\n");
+	} else {
+		atomic_set(&iaa_crypto_enabled, 1);
 	}
 out:
-	mutex_unlock(&iaa_devices_lock);
 	mutex_unlock(&wq->wq_lock);
+	mutex_unlock(&iaa_devices_lock);
 }
 
 static enum idxd_dev_type dev_types[] = {
@@ -1890,16 +2419,12 @@ static struct idxd_device_driver iaa_crypto_driver = {
 static int __init iaa_crypto_init_module(void)
 {
 	int ret = 0;
-	int node;
+
+	INIT_LIST_HEAD(&iaa_devices);
 
 	nr_cpus = num_possible_cpus();
-	for_each_node_with_cpus(node)
-		nr_nodes++;
-	if (!nr_nodes) {
-		pr_err("IAA couldn't find any nodes with cpus\n");
-		return -ENODEV;
-	}
-	nr_cpus_per_node = nr_cpus / nr_nodes;
+	nr_cpus_per_package = topology_num_cores_per_package();
+	nr_packages = topology_max_packages();
 
 	ret = iaa_aecs_init_fixed();
 	if (ret < 0) {
@@ -1913,6 +2438,27 @@ static int __init iaa_crypto_init_module(void)
 		goto err_driver_reg;
 	}
 
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				&driver_attr_g_comp_wqs_per_iaa);
+	if (ret) {
+		pr_debug("IAA g_comp_wqs_per_iaa attr creation failed\n");
+		goto err_g_comp_wqs_per_iaa_attr_create;
+	}
+
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				 &driver_attr_distribute_decomps);
+	if (ret) {
+		pr_debug("IAA distribute_decomps attr creation failed\n");
+		goto err_distribute_decomps_attr_create;
+	}
+
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				 &driver_attr_distribute_comps);
+	if (ret) {
+		pr_debug("IAA distribute_comps attr creation failed\n");
+		goto err_distribute_comps_attr_create;
+	}
+
 	ret = driver_create_file(&iaa_crypto_driver.drv,
 				 &driver_attr_verify_compress);
 	if (ret) {
@@ -1938,6 +2484,15 @@ static int __init iaa_crypto_init_module(void)
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
 err_verify_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_distribute_comps);
+err_distribute_comps_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_distribute_decomps);
+err_distribute_decomps_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_comp_wqs_per_iaa);
+err_g_comp_wqs_per_iaa_attr_create:
 	idxd_driver_unregister(&iaa_crypto_driver);
 err_driver_reg:
 	iaa_aecs_cleanup_fixed();
@@ -1956,6 +2511,12 @@ static void __exit iaa_crypto_cleanup_module(void)
 			   &driver_attr_sync_mode);
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_distribute_comps);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_distribute_decomps);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_comp_wqs_per_iaa);
 	idxd_driver_unregister(&iaa_crypto_driver);
 	iaa_aecs_cleanup_fixed();
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 03/24] crypto: iaa - Simplify, consistency of function parameters, minor stats bug fix.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 01/24] crypto: iaa - Reorganize the iaa_crypto driver code Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 02/24] crypto: iaa - New architecture for IAA device WQ comp/decomp usage & core mapping Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 04/24] crypto: iaa - Descriptor allocation timeouts with mitigations Kanchana P Sridhar
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch further simplifies the code in some places and makes it more
consistent and readable:

1) Change iaa_compress_verify() @dlen parameter to be a value instead of
   a pointer, because @dlen's value is only read, not modified by this
   procedure.

2) Simplify the success/error return paths in iaa_compress(),
   iaa_decompress() and iaa_compress_verify().

3) Delete dev_dbg() statements to make the code more readable.

4) Change the return value for descriptor allocation failures to
   -ENODEV, for better maintainability.

5) Fix a minor statistics bug in iaa_decompress(), where decomp_bytes
   was being updated even when the operation failed.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 107 +++++----------------
 1 file changed, 22 insertions(+), 85 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index c6db721eaa799..ed3325bb32918 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1590,7 +1590,7 @@ static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
 static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 			       struct idxd_wq *wq,
 			       dma_addr_t src_addr, unsigned int slen,
-			       dma_addr_t dst_addr, unsigned int *dlen)
+			       dma_addr_t dst_addr, unsigned int dlen)
 {
 	struct iaa_device_compression_mode *active_compression_mode;
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
@@ -1614,10 +1614,8 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
 	if (IS_ERR(idxd_desc)) {
-		dev_dbg(dev, "idxd descriptor allocation failed\n");
-		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
-			PTR_ERR(idxd_desc));
-		return PTR_ERR(idxd_desc);
+		dev_dbg(dev, "iaa compress_verify failed: idxd descriptor allocation failure: ret=%ld\n", PTR_ERR(idxd_desc));
+		return -ENODEV;
 	}
 	desc = idxd_desc->iax_hw;
 
@@ -1629,19 +1627,11 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 	desc->priv = 0;
 
 	desc->src1_addr = (u64)dst_addr;
-	desc->src1_size = *dlen;
+	desc->src1_size = dlen;
 	desc->dst_addr = (u64)src_addr;
 	desc->max_dst_size = slen;
 	desc->completion_addr = idxd_desc->compl_dma;
 
-	dev_dbg(dev, "(verify) compression mode %s,"
-		" desc->src1_addr %llx, desc->src1_size %d,"
-		" desc->dst_addr %llx, desc->max_dst_size %d,"
-		" desc->src2_addr %llx, desc->src2_size %d\n",
-		active_compression_mode->name,
-		desc->src1_addr, desc->src1_size, desc->dst_addr,
-		desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
 	ret = idxd_submit_desc(wq, idxd_desc);
 	if (ret) {
 		dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
@@ -1664,14 +1654,10 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 		goto err;
 	}
 
-	idxd_free_desc(wq, idxd_desc);
-out:
-	return ret;
 err:
 	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
 
-	goto out;
+	return ret;
 }
 
 static void iaa_desc_complete(struct idxd_desc *idxd_desc,
@@ -1751,7 +1737,7 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 		}
 
 		ret = iaa_compress_verify(ctx->tfm, ctx->req, iaa_wq->wq, src_addr,
-					  ctx->req->slen, dst_addr, &ctx->req->dlen);
+					  ctx->req->slen, dst_addr, ctx->req->dlen);
 		if (ret) {
 			dev_dbg(dev, "%s: compress verify failed ret=%d\n", __func__, ret);
 			err = -EIO;
@@ -1777,7 +1763,7 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 	iaa_wq_put(idxd_desc->wq);
 }
 
-static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
+static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 			struct idxd_wq *wq,
 			dma_addr_t src_addr, unsigned int slen,
 			dma_addr_t dst_addr, unsigned int *dlen)
@@ -1804,9 +1790,9 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 
 	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
 	if (IS_ERR(idxd_desc)) {
-		dev_dbg(dev, "idxd descriptor allocation failed\n");
-		dev_dbg(dev, "iaa compress failed: ret=%ld\n", PTR_ERR(idxd_desc));
-		return PTR_ERR(idxd_desc);
+		dev_dbg(dev, "iaa compress failed: idxd descriptor allocation failure: ret=%ld\n",
+			PTR_ERR(idxd_desc));
+		return -ENODEV;
 	}
 	desc = idxd_desc->iax_hw;
 
@@ -1832,21 +1818,8 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 		idxd_desc->crypto.src_addr = src_addr;
 		idxd_desc->crypto.dst_addr = dst_addr;
 		idxd_desc->crypto.compress = true;
-
-		dev_dbg(dev, "%s use_async_irq: compression mode %s,"
-			" src_addr %llx, dst_addr %llx\n", __func__,
-			active_compression_mode->name,
-			src_addr, dst_addr);
 	}
 
-	dev_dbg(dev, "%s: compression mode %s,"
-		" desc->src1_addr %llx, desc->src1_size %d,"
-		" desc->dst_addr %llx, desc->max_dst_size %d,"
-		" desc->src2_addr %llx, desc->src2_size %d\n", __func__,
-		active_compression_mode->name,
-		desc->src1_addr, desc->src1_size, desc->dst_addr,
-		desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
 	ret = idxd_submit_desc(wq, idxd_desc);
 	if (ret) {
 		dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
@@ -1859,7 +1832,6 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 
 	if (ctx->async_mode) {
 		ret = -EINPROGRESS;
-		dev_dbg(dev, "%s: returning -EINPROGRESS\n", __func__);
 		goto out;
 	}
 
@@ -1877,15 +1849,10 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 
 	*compression_crc = idxd_desc->iax_completion->crc;
 
-	if (!ctx->async_mode)
-		idxd_free_desc(wq, idxd_desc);
-out:
-	return ret;
 err:
 	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
-
-	goto out;
+out:
+	return ret;
 }
 
 static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
@@ -1914,10 +1881,10 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
 	if (IS_ERR(idxd_desc)) {
-		dev_dbg(dev, "idxd descriptor allocation failed\n");
-		dev_dbg(dev, "iaa decompress failed: ret=%ld\n",
+		ret = -ENODEV;
+		dev_dbg(dev, "%s: idxd descriptor allocation failed: ret=%ld\n", __func__,
 			PTR_ERR(idxd_desc));
-		return PTR_ERR(idxd_desc);
+		return ret;
 	}
 	desc = idxd_desc->iax_hw;
 
@@ -1941,21 +1908,8 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 		idxd_desc->crypto.src_addr = src_addr;
 		idxd_desc->crypto.dst_addr = dst_addr;
 		idxd_desc->crypto.compress = false;
-
-		dev_dbg(dev, "%s: use_async_irq compression mode %s,"
-			" src_addr %llx, dst_addr %llx\n", __func__,
-			active_compression_mode->name,
-			src_addr, dst_addr);
 	}
 
-	dev_dbg(dev, "%s: decompression mode %s,"
-		" desc->src1_addr %llx, desc->src1_size %d,"
-		" desc->dst_addr %llx, desc->max_dst_size %d,"
-		" desc->src2_addr %llx, desc->src2_size %d\n", __func__,
-		active_compression_mode->name,
-		desc->src1_addr, desc->src1_size, desc->dst_addr,
-		desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
 	ret = idxd_submit_desc(wq, idxd_desc);
 	if (ret) {
 		dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
@@ -1968,7 +1922,6 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	if (ctx->async_mode) {
 		ret = -EINPROGRESS;
-		dev_dbg(dev, "%s: returning -EINPROGRESS\n", __func__);
 		goto out;
 	}
 
@@ -1990,23 +1943,19 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 		}
 	} else {
 		req->dlen = idxd_desc->iax_completion->output_size;
+
+		/* Update stats */
+		update_total_decomp_bytes_in(slen);
+		update_wq_decomp_bytes(wq, slen);
 	}
 
 	*dlen = req->dlen;
 
-	if (!ctx->async_mode)
+err:
+	if (idxd_desc)
 		idxd_free_desc(wq, idxd_desc);
-
-	/* Update stats */
-	update_total_decomp_bytes_in(slen);
-	update_wq_decomp_bytes(wq, slen);
 out:
 	return ret;
-err:
-	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa decompress failed: ret=%d\n", ret);
-
-	goto out;
 }
 
 static int iaa_comp_acompress(struct acomp_req *req)
@@ -2053,9 +2002,6 @@ static int iaa_comp_acompress(struct acomp_req *req)
 		goto out;
 	}
 	src_addr = sg_dma_address(req->src);
-	dev_dbg(dev, "dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
-		" req->slen %d, sg_dma_len(sg) %d\n", src_addr, nr_sgs,
-		req->src, req->slen, sg_dma_len(req->src));
 
 	nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
 	if (nr_sgs <= 0 || nr_sgs > 1) {
@@ -2066,9 +2012,6 @@ static int iaa_comp_acompress(struct acomp_req *req)
 		goto err_map_dst;
 	}
 	dst_addr = sg_dma_address(req->dst);
-	dev_dbg(dev, "dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
-		" req->dlen %d, sg_dma_len(sg) %d\n", dst_addr, nr_sgs,
-		req->dst, req->dlen, sg_dma_len(req->dst));
 
 	ret = iaa_compress(tfm, req, wq, src_addr, req->slen, dst_addr,
 			   &req->dlen);
@@ -2083,7 +2026,7 @@ static int iaa_comp_acompress(struct acomp_req *req)
 		}
 
 		ret = iaa_compress_verify(tfm, req, wq, src_addr, req->slen,
-					  dst_addr, &req->dlen);
+					  dst_addr, req->dlen);
 		if (ret)
 			dev_dbg(dev, "asynchronous compress verification failed ret=%d\n", ret);
 
@@ -2146,9 +2089,6 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 		goto out;
 	}
 	src_addr = sg_dma_address(req->src);
-	dev_dbg(dev, "dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
-		" req->slen %d, sg_dma_len(sg) %d\n", src_addr, nr_sgs,
-		req->src, req->slen, sg_dma_len(req->src));
 
 	nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
 	if (nr_sgs <= 0 || nr_sgs > 1) {
@@ -2159,9 +2099,6 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 		goto err_map_dst;
 	}
 	dst_addr = sg_dma_address(req->dst);
-	dev_dbg(dev, "dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
-		" req->dlen %d, sg_dma_len(sg) %d\n", dst_addr, nr_sgs,
-		req->dst, req->dlen, sg_dma_len(req->dst));
 
 	ret = iaa_decompress(tfm, req, wq, src_addr, req->slen,
 			     dst_addr, &req->dlen);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 04/24] crypto: iaa - Descriptor allocation timeouts with mitigations.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (2 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 03/24] crypto: iaa - Simplify, consistency of function parameters, minor stats bug fix Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 05/24] crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting Kanchana P Sridhar
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch modifies the descriptor allocation from blocking to
non-blocking with bounded retries or "timeouts".

This is necessary to prevent task blocked errors in high contention
scenarios, for instance, when the platform has only 1 IAA device
enabled. With 1 IAA device enabled per package on a dual-package
Sapphire Rapids with 56 cores/package, there are 112 logical cores
mapped to this single IAA device. In this scenario, the task blocked
errors can occur because idxd_alloc_desc() is called with
IDXD_OP_BLOCK. With batching, multiple descriptors will need to be
allocated per batch. Any process that allocates a batch of descriptors can
cause contention for all other processes that share the same
sbitmap_queue. Under IDXD_OP_BLOCK, this causes
compress/decompress jobs to stall in stress test scenarios
(e.g. zswap_store() of 2M folios).

To make the iaa_crypto driver more fail-safe, this commit
implements the following:

1) Change compress/decompress descriptor allocations to be non-blocking
   with retries ("timeouts").
2) Return a compress error to zswap if descriptor allocation with timeouts
   fails during compress ops. zswap_store() will then return an error and
   the folio is written to the backing swap device.
3) Fall back to software decompress if descriptor allocation with timeouts
   fails during decompress ops.

With these fixes, no task blocked errors are seen under stress testing,
and no performance degradation is observed.
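
Condensed, the non-blocking allocation with bounded retries in 1) follows
this pattern (a sketch of the loop added in this patch; on the decompress
path a failure instead falls back to software decompression):

    struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
    u16 alloc_desc_retries = 0;

    while ((idxd_desc == ERR_PTR(-EAGAIN)) &&
           (alloc_desc_retries++ < ctx->alloc_comp_desc_timeout)) {
            idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
            cpu_relax();
    }

    if (IS_ERR(idxd_desc))
            return -ENODEV; /* compress: error propagates to zswap_store() */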

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |  5 ++
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 58 +++++++++++++++-------
 2 files changed, 44 insertions(+), 19 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 549ac98a9366e..cc76a047b54ad 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -21,6 +21,9 @@
 
 #define IAA_COMPLETION_TIMEOUT		1000000
 
+#define IAA_ALLOC_DESC_COMP_TIMEOUT	   1000
+#define IAA_ALLOC_DESC_DECOMP_TIMEOUT	    500
+
 #define IAA_ANALYTICS_ERROR		0x0a
 #define IAA_ERROR_DECOMP_BUF_OVERFLOW	0x0b
 #define IAA_ERROR_COMP_BUF_OVERFLOW	0x19
@@ -141,6 +144,8 @@ enum iaa_mode {
 
 struct iaa_compression_ctx {
 	enum iaa_mode	mode;
+	u16		alloc_comp_desc_timeout;
+	u16		alloc_decomp_desc_timeout;
 	bool		verify_compress;
 	bool		async_mode;
 	bool		use_irq;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index ed3325bb32918..1169cd44c8e78 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1596,7 +1596,8 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	u32 *compression_crc = acomp_request_ctx(req);
 	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
+	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+	u16 alloc_desc_retries = 0;
 	struct iax_hw_desc *desc;
 	struct idxd_device *idxd;
 	struct iaa_wq *iaa_wq;
@@ -1612,7 +1613,11 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
 
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < ctx->alloc_decomp_desc_timeout)) {
+		idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+		cpu_relax();
+	}
+
 	if (IS_ERR(idxd_desc)) {
 		dev_dbg(dev, "iaa compress_verify failed: idxd descriptor allocation failure: ret=%ld\n", PTR_ERR(idxd_desc));
 		return -ENODEV;
@@ -1772,7 +1777,8 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	u32 *compression_crc = acomp_request_ctx(req);
 	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
+	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+	u16 alloc_desc_retries = 0;
 	struct iax_hw_desc *desc;
 	struct idxd_device *idxd;
 	struct iaa_wq *iaa_wq;
@@ -1788,7 +1794,11 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
 
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < ctx->alloc_comp_desc_timeout)) {
+		idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+		cpu_relax();
+	}
+
 	if (IS_ERR(idxd_desc)) {
 		dev_dbg(dev, "iaa compress failed: idxd descriptor allocation failure: ret=%ld\n",
 			PTR_ERR(idxd_desc));
@@ -1863,7 +1873,8 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	struct iaa_device_compression_mode *active_compression_mode;
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
+	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+	u16 alloc_desc_retries = 0;
 	struct iax_hw_desc *desc;
 	struct idxd_device *idxd;
 	struct iaa_wq *iaa_wq;
@@ -1879,12 +1890,17 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
 
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < ctx->alloc_decomp_desc_timeout)) {
+		idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+		cpu_relax();
+	}
+
 	if (IS_ERR(idxd_desc)) {
 		ret = -ENODEV;
 		dev_dbg(dev, "%s: idxd descriptor allocation failed: ret=%ld\n", __func__,
 			PTR_ERR(idxd_desc));
-		return ret;
+		idxd_desc = NULL;
+		goto fallback_software_decomp;
 	}
 	desc = idxd_desc->iax_hw;
 
@@ -1913,7 +1929,7 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	ret = idxd_submit_desc(wq, idxd_desc);
 	if (ret) {
 		dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
-		goto err;
+		goto fallback_software_decomp;
 	}
 
 	/* Update stats */
@@ -1926,19 +1942,21 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	}
 
 	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+
+fallback_software_decomp:
 	if (ret) {
-		dev_dbg(dev, "%s: check_completion failed ret=%d\n", __func__, ret);
-		if (idxd_desc->iax_completion->status == IAA_ANALYTICS_ERROR) {
+		dev_dbg(dev, "%s: desc allocation/submission/check_completion failed ret=%d\n", __func__, ret);
+		if (idxd_desc && idxd_desc->iax_completion->status == IAA_ANALYTICS_ERROR) {
 			pr_warn("%s: falling back to deflate-generic decompress, "
 				"analytics error code %x\n", __func__,
 				idxd_desc->iax_completion->error_code);
-			ret = deflate_generic_decompress(req);
-			if (ret) {
-				dev_dbg(dev, "%s: deflate-generic failed ret=%d\n",
-					__func__, ret);
-				goto err;
-			}
-		} else {
+		}
+
+		ret = deflate_generic_decompress(req);
+
+		if (ret) {
+			pr_err("%s: iaa decompress failed: deflate-generic fallback error ret=%d\n",
+			       __func__, ret);
 			goto err;
 		}
 	} else {
@@ -2119,6 +2137,8 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 
 static void compression_ctx_init(struct iaa_compression_ctx *ctx)
 {
+	ctx->alloc_comp_desc_timeout = IAA_ALLOC_DESC_COMP_TIMEOUT;
+	ctx->alloc_decomp_desc_timeout = IAA_ALLOC_DESC_DECOMP_TIMEOUT;
 	ctx->verify_compress = iaa_verify_compress;
 	ctx->async_mode = async_mode;
 	ctx->use_irq = use_irq;
@@ -2133,10 +2153,10 @@ static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 
-	compression_ctx_init(ctx);
-
 	ctx->mode = IAA_MODE_FIXED;
 
+	compression_ctx_init(ctx);
+
 	return 0;
 }
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 05/24] crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (3 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 04/24] crypto: iaa - Descriptor allocation timeouts with mitigations Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 06/24] crypto: iaa - Simplify the code flow in iaa_compress() and iaa_decompress() Kanchana P Sridhar
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch changes the reference counting on "struct iaa_wq" to a
percpu_ref in atomic mode, replacing the "int refcount" protected by the
"idxd->dev_lock" spin_lock that is currently used to provide get/put
semantics.

This enables a lighter-weight, cleaner and more effective refcount
implementation for the iaa_wq, significantly reducing the latency per
compress/decompress job submitted to the IAA accelerator:

  p50: -136 ns
  p99: -880 ns
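
For reference, the get/put/teardown pattern this moves to looks roughly as
follows (a minimal sketch of the generic percpu_ref API, not complete
driver code; the callback and field names follow the diff below):

    /* Release callback: runs when the refcount drops to zero. */
    static void __iaa_wq_release(struct percpu_ref *ref)
    {
            struct iaa_wq *iaa_wq = container_of(ref, typeof(*iaa_wq), ref);

            iaa_wq->free = true;
    }

    /* Setup: atomic mode keeps the count in a single atomic counter. */
    int ret = percpu_ref_init(&iaa_wq->ref, __iaa_wq_release,
                              PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);

    /* Per-request fast path: tryget fails once the ref has been killed. */
    if (!percpu_ref_tryget(&iaa_wq->ref))
            return -ENODEV;
    /* ... submit and complete the job ... */
    percpu_ref_put(&iaa_wq->ref);

    /* Teardown: drop the initial reference; the release callback runs at zero. */
    percpu_ref_kill(&iaa_wq->ref);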

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |   4 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 119 +++++++--------------
 2 files changed, 41 insertions(+), 82 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index cc76a047b54ad..9611f2518f42c 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -47,8 +47,8 @@ struct iaa_wq {
 	struct list_head	list;
 
 	struct idxd_wq		*wq;
-	int			ref;
-	bool			remove;
+	struct percpu_ref	ref;
+	bool			free;
 	bool			mapped;
 
 	struct iaa_device	*iaa_device;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 1169cd44c8e78..a12ea3dd5ba80 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -701,7 +701,7 @@ static void del_iaa_device(struct iaa_device *iaa_device)
 
 static void free_iaa_device(struct iaa_device *iaa_device)
 {
-	if (!iaa_device)
+	if (!iaa_device || iaa_device->n_wq)
 		return;
 
 	remove_device_compression_modes(iaa_device);
@@ -731,6 +731,13 @@ static bool iaa_has_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
 	return false;
 }
 
+static void __iaa_wq_release(struct percpu_ref *ref)
+{
+	struct iaa_wq *iaa_wq = container_of(ref, typeof(*iaa_wq), ref);
+
+	iaa_wq->free = true;
+}
+
 static int add_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq,
 		      struct iaa_wq **new_wq)
 {
@@ -738,11 +745,20 @@ static int add_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq,
 	struct pci_dev *pdev = idxd->pdev;
 	struct device *dev = &pdev->dev;
 	struct iaa_wq *iaa_wq;
+	int ret;
 
 	iaa_wq = kzalloc(sizeof(*iaa_wq), GFP_KERNEL);
 	if (!iaa_wq)
 		return -ENOMEM;
 
+	ret = percpu_ref_init(&iaa_wq->ref, __iaa_wq_release,
+			      PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
+
+	if (ret) {
+		kfree(iaa_wq);
+		return -ENOMEM;
+	}
+
 	iaa_wq->wq = wq;
 	iaa_wq->iaa_device = iaa_device;
 	idxd_wq_set_private(wq, iaa_wq);
@@ -818,6 +834,9 @@ static void __free_iaa_wq(struct iaa_wq *iaa_wq)
 	if (!iaa_wq)
 		return;
 
+	WARN_ON(!percpu_ref_is_zero(&iaa_wq->ref));
+	percpu_ref_exit(&iaa_wq->ref);
+
 	iaa_device = iaa_wq->iaa_device;
 	if (iaa_device->n_wq == 0)
 		free_iaa_device(iaa_wq->iaa_device);
@@ -912,53 +931,6 @@ static int save_iaa_wq(struct idxd_wq *wq)
 	return 0;
 }
 
-static int iaa_wq_get(struct idxd_wq *wq)
-{
-	struct idxd_device *idxd = wq->idxd;
-	struct iaa_wq *iaa_wq;
-	int ret = 0;
-
-	spin_lock(&idxd->dev_lock);
-	iaa_wq = idxd_wq_get_private(wq);
-	if (iaa_wq && !iaa_wq->remove) {
-		iaa_wq->ref++;
-		idxd_wq_get(wq);
-	} else {
-		ret = -ENODEV;
-	}
-	spin_unlock(&idxd->dev_lock);
-
-	return ret;
-}
-
-static int iaa_wq_put(struct idxd_wq *wq)
-{
-	struct idxd_device *idxd = wq->idxd;
-	struct iaa_wq *iaa_wq;
-	bool free = false;
-	int ret = 0;
-
-	spin_lock(&idxd->dev_lock);
-	iaa_wq = idxd_wq_get_private(wq);
-	if (iaa_wq) {
-		iaa_wq->ref--;
-		if (iaa_wq->ref == 0 && iaa_wq->remove) {
-			idxd_wq_set_private(wq, NULL);
-			free = true;
-		}
-		idxd_wq_put(wq);
-	} else {
-		ret = -ENODEV;
-	}
-	spin_unlock(&idxd->dev_lock);
-	if (free) {
-		__free_iaa_wq(iaa_wq);
-		kfree(iaa_wq);
-	}
-
-	return ret;
-}
-
 /***************************************************************
  * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
  ***************************************************************/
@@ -1765,7 +1737,7 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 
 	if (free_desc)
 		idxd_free_desc(idxd_desc->wq, idxd_desc);
-	iaa_wq_put(idxd_desc->wq);
+	percpu_ref_put(&iaa_wq->ref);
 }
 
 static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
@@ -1996,19 +1968,13 @@ static int iaa_comp_acompress(struct acomp_req *req)
 	cpu = get_cpu();
 	wq = comp_wq_table_next_wq(cpu);
 	put_cpu();
-	if (!wq) {
-		pr_debug("no wq configured for cpu=%d\n", cpu);
-		return -ENODEV;
-	}
 
-	ret = iaa_wq_get(wq);
-	if (ret) {
+	iaa_wq = wq ? idxd_wq_get_private(wq) : NULL;
+	if (!iaa_wq || !percpu_ref_tryget(&iaa_wq->ref)) {
 		pr_debug("no wq available for cpu=%d\n", cpu);
 		return -ENODEV;
 	}
 
-	iaa_wq = idxd_wq_get_private(wq);
-
 	dev = &wq->idxd->pdev->dev;
 
 	nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
@@ -2061,7 +2027,7 @@ static int iaa_comp_acompress(struct acomp_req *req)
 err_map_dst:
 	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
 out:
-	iaa_wq_put(wq);
+	percpu_ref_put(&iaa_wq->ref);
 
 	return ret;
 }
@@ -2083,19 +2049,13 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 	cpu = get_cpu();
 	wq = decomp_wq_table_next_wq(cpu);
 	put_cpu();
-	if (!wq) {
-		pr_debug("no wq configured for cpu=%d\n", cpu);
-		return -ENODEV;
-	}
 
-	ret = iaa_wq_get(wq);
-	if (ret) {
+	iaa_wq = wq ? idxd_wq_get_private(wq) : NULL;
+	if (!iaa_wq || !percpu_ref_tryget(&iaa_wq->ref)) {
 		pr_debug("no wq available for cpu=%d\n", cpu);
-		return -ENODEV;
+		return deflate_generic_decompress(req);
 	}
 
-	iaa_wq = idxd_wq_get_private(wq);
-
 	dev = &wq->idxd->pdev->dev;
 
 	nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
@@ -2130,7 +2090,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 err_map_dst:
 	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
 out:
-	iaa_wq_put(wq);
+	percpu_ref_put(&iaa_wq->ref);
 
 	return ret;
 }
@@ -2303,7 +2263,6 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 	struct idxd_wq *wq = idxd_dev_to_wq(idxd_dev);
 	struct idxd_device *idxd = wq->idxd;
 	struct iaa_wq *iaa_wq;
-	bool free = false;
 
 	atomic_set(&iaa_crypto_enabled, 0);
 	idxd_wq_quiesce(wq);
@@ -2324,18 +2283,18 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 		goto out;
 	}
 
-	if (iaa_wq->ref) {
-		iaa_wq->remove = true;
-	} else {
-		wq = iaa_wq->wq;
-		idxd_wq_set_private(wq, NULL);
-		free = true;
-	}
+	/* Drop the initial reference. */
+	percpu_ref_kill(&iaa_wq->ref);
+
+	while (!iaa_wq->free)
+		cpu_relax();
+
+	__free_iaa_wq(iaa_wq);
+
+	idxd_wq_set_private(wq, NULL);
 	spin_unlock(&idxd->dev_lock);
-	if (free) {
-		__free_iaa_wq(iaa_wq);
-		kfree(iaa_wq);
-	}
+
+	kfree(iaa_wq);
 
 	idxd_drv_disable_wq(wq);
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 06/24] crypto: iaa - Simplify the code flow in iaa_compress() and iaa_decompress().
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (4 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 05/24] crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 07/24] crypto: iaa - Refactor hardware descriptor setup into separate procedures Kanchana P Sridhar
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This commit simplifies and streamlines the logic in the core
iaa_compress() and iaa_decompress() routines and eliminates redundant
branches.

This makes it easier to add improvements such as polling for job
completions, which is essential for batching with hardware parallelism.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 114 ++++++++++++---------
 1 file changed, 67 insertions(+), 47 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index a12ea3dd5ba80..f80f3ab175a48 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1792,7 +1792,34 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 	desc->src2_size = sizeof(struct aecs_comp_table_record);
 	desc->completion_addr = idxd_desc->compl_dma;
 
-	if (ctx->use_irq) {
+	if (likely(!ctx->use_irq)) {
+		ret = idxd_submit_desc(wq, idxd_desc);
+		if (ret) {
+			dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
+			goto out;
+		}
+
+		/* Update stats */
+		update_total_comp_calls();
+		update_wq_comp_calls(wq);
+
+		if (ctx->async_mode)
+			return -EINPROGRESS;
+
+		ret = check_completion(dev, idxd_desc->iax_completion, true, false);
+		if (ret) {
+			dev_dbg(dev, "check_completion failed ret=%d\n", ret);
+			goto out;
+		}
+
+		*dlen = idxd_desc->iax_completion->output_size;
+
+		/* Update stats */
+		update_total_comp_bytes_out(*dlen);
+		update_wq_comp_bytes(wq, *dlen);
+
+		*compression_crc = idxd_desc->iax_completion->crc;
+	} else {
 		desc->flags |= IDXD_OP_FLAG_RCI;
 
 		idxd_desc->crypto.req = req;
@@ -1800,40 +1827,23 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 		idxd_desc->crypto.src_addr = src_addr;
 		idxd_desc->crypto.dst_addr = dst_addr;
 		idxd_desc->crypto.compress = true;
-	}
-
-	ret = idxd_submit_desc(wq, idxd_desc);
-	if (ret) {
-		dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
-		goto err;
-	}
 
-	/* Update stats */
-	update_total_comp_calls();
-	update_wq_comp_calls(wq);
+		ret = idxd_submit_desc(wq, idxd_desc);
+		if (ret) {
+			dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
+			goto out;
+		}
 
-	if (ctx->async_mode) {
-		ret = -EINPROGRESS;
-		goto out;
-	}
+		/* Update stats */
+		update_total_comp_calls();
+		update_wq_comp_calls(wq);
 
-	ret = check_completion(dev, idxd_desc->iax_completion, true, false);
-	if (ret) {
-		dev_dbg(dev, "check_completion failed ret=%d\n", ret);
-		goto err;
+		return -EINPROGRESS;
 	}
 
-	*dlen = idxd_desc->iax_completion->output_size;
-
-	/* Update stats */
-	update_total_comp_bytes_out(*dlen);
-	update_wq_comp_bytes(wq, *dlen);
-
-	*compression_crc = idxd_desc->iax_completion->crc;
-
-err:
-	idxd_free_desc(wq, idxd_desc);
 out:
+	idxd_free_desc(wq, idxd_desc);
+
 	return ret;
 }
 
@@ -1888,7 +1898,22 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	desc->src1_size = slen;
 	desc->completion_addr = idxd_desc->compl_dma;
 
-	if (ctx->use_irq) {
+	if (likely(!ctx->use_irq)) {
+		ret = idxd_submit_desc(wq, idxd_desc);
+		if (ret) {
+			dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
+			goto fallback_software_decomp;
+		}
+
+		/* Update stats */
+		update_total_decomp_calls();
+		update_wq_decomp_calls(wq);
+
+		if (ctx->async_mode)
+			return -EINPROGRESS;
+
+		ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+	} else {
 		desc->flags |= IDXD_OP_FLAG_RCI;
 
 		idxd_desc->crypto.req = req;
@@ -1896,25 +1921,20 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 		idxd_desc->crypto.src_addr = src_addr;
 		idxd_desc->crypto.dst_addr = dst_addr;
 		idxd_desc->crypto.compress = false;
-	}
 
-	ret = idxd_submit_desc(wq, idxd_desc);
-	if (ret) {
-		dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
-		goto fallback_software_decomp;
-	}
+		ret = idxd_submit_desc(wq, idxd_desc);
+		if (ret) {
+			dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
+			goto fallback_software_decomp;
+		}
 
-	/* Update stats */
-	update_total_decomp_calls();
-	update_wq_decomp_calls(wq);
+		/* Update stats */
+		update_total_decomp_calls();
+		update_wq_decomp_calls(wq);
 
-	if (ctx->async_mode) {
-		ret = -EINPROGRESS;
-		goto out;
+		return -EINPROGRESS;
 	}
 
-	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
-
 fallback_software_decomp:
 	if (ret) {
 		dev_dbg(dev, "%s: desc allocation/submission/check_completion failed ret=%d\n", __func__, ret);
@@ -1929,7 +1949,7 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 		if (ret) {
 			pr_err("%s: iaa decompress failed: deflate-generic fallback error ret=%d\n",
 			       __func__, ret);
-			goto err;
+			goto out;
 		}
 	} else {
 		req->dlen = idxd_desc->iax_completion->output_size;
@@ -1941,10 +1961,10 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	*dlen = req->dlen;
 
-err:
+out:
 	if (idxd_desc)
 		idxd_free_desc(wq, idxd_desc);
-out:
+
 	return ret;
 }
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 07/24] crypto: iaa - Refactor hardware descriptor setup into separate procedures.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (5 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 06/24] crypto: iaa - Simplify the code flow in iaa_compress() and iaa_decompress() Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 08/24] crypto: iaa - Simplified, efficient job submissions for non-irq mode Kanchana P Sridhar
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch refactors the code that sets up the "struct iax_hw_desc" for
compress/decompress ops into distinct procedures, to make the code more
readable.

Also, get_iaa_device_compression_mode() is deleted and the compression
mode is accessed directly from the iaa_device in the calling procedures.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 99 ++++++++++++----------
 1 file changed, 56 insertions(+), 43 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index f80f3ab175a48..a9e6809e63dff 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -483,12 +483,6 @@ int add_iaa_compression_mode(const char *name,
 }
 EXPORT_SYMBOL_GPL(add_iaa_compression_mode);
 
-static struct iaa_device_compression_mode *
-get_iaa_device_compression_mode(struct iaa_device *iaa_device, int idx)
-{
-	return iaa_device->compression_modes[idx];
-}
-
 static void free_device_compression_mode(struct iaa_device *iaa_device,
 					 struct iaa_device_compression_mode *device_mode)
 {
@@ -1564,7 +1558,6 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 			       dma_addr_t src_addr, unsigned int slen,
 			       dma_addr_t dst_addr, unsigned int dlen)
 {
-	struct iaa_device_compression_mode *active_compression_mode;
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	u32 *compression_crc = acomp_request_ctx(req);
 	struct iaa_device *iaa_device;
@@ -1583,8 +1576,6 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 	pdev = idxd->pdev;
 	dev = &pdev->dev;
 
-	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
 	while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < ctx->alloc_decomp_desc_timeout)) {
 		idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
 		cpu_relax();
@@ -1660,8 +1651,7 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 	pdev = idxd->pdev;
 	dev = &pdev->dev;
 
-	active_compression_mode = get_iaa_device_compression_mode(iaa_device,
-								  compression_ctx->mode);
+	active_compression_mode = iaa_device->compression_modes[compression_ctx->mode];
 	dev_dbg(dev, "%s: compression mode %s,"
 		" ctx->src_addr %llx, ctx->dst_addr %llx\n", __func__,
 		active_compression_mode->name,
@@ -1740,12 +1730,63 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 	percpu_ref_put(&iaa_wq->ref);
 }
 
+static __always_inline struct iax_hw_desc *
+iaa_setup_compress_hw_desc(struct idxd_desc *idxd_desc,
+			   dma_addr_t src_addr,
+			   unsigned int slen,
+			   dma_addr_t dst_addr,
+			   unsigned int dlen,
+			   enum iaa_mode mode,
+			   struct iaa_device_compression_mode *active_compression_mode)
+{
+	struct iax_hw_desc *desc = idxd_desc->iax_hw;
+
+	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
+	desc->opcode = IAX_OPCODE_COMPRESS;
+	desc->compr_flags = IAA_COMP_FLAGS;
+	desc->priv = 0;
+
+	desc->src1_addr = (u64)src_addr;
+	desc->src1_size = slen;
+	desc->dst_addr = (u64)dst_addr;
+	desc->max_dst_size = dlen;
+	desc->flags |= IDXD_OP_FLAG_RD_SRC2_AECS;
+	desc->src2_addr = active_compression_mode->aecs_comp_table_dma_addr;
+	desc->src2_size = sizeof(struct aecs_comp_table_record);
+	desc->completion_addr = idxd_desc->compl_dma;
+
+	return desc;
+}
+
+static __always_inline struct iax_hw_desc *
+iaa_setup_decompress_hw_desc(struct idxd_desc *idxd_desc,
+			     dma_addr_t src_addr,
+			     unsigned int slen,
+			     dma_addr_t dst_addr,
+			     unsigned int dlen)
+{
+	struct iax_hw_desc *desc = idxd_desc->iax_hw;
+
+	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
+	desc->opcode = IAX_OPCODE_DECOMPRESS;
+	desc->max_dst_size = PAGE_SIZE;
+	desc->decompr_flags = IAA_DECOMP_FLAGS;
+	desc->priv = 0;
+
+	desc->src1_addr = (u64)src_addr;
+	desc->dst_addr = (u64)dst_addr;
+	desc->max_dst_size = dlen;
+	desc->src1_size = slen;
+	desc->completion_addr = idxd_desc->compl_dma;
+
+	return desc;
+}
+
 static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 			struct idxd_wq *wq,
 			dma_addr_t src_addr, unsigned int slen,
 			dma_addr_t dst_addr, unsigned int *dlen)
 {
-	struct iaa_device_compression_mode *active_compression_mode;
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	u32 *compression_crc = acomp_request_ctx(req);
 	struct iaa_device *iaa_device;
@@ -1764,8 +1805,6 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 	pdev = idxd->pdev;
 	dev = &pdev->dev;
 
-	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
 	while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < ctx->alloc_comp_desc_timeout)) {
 		idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
 		cpu_relax();
@@ -1776,21 +1815,9 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 			PTR_ERR(idxd_desc));
 		return -ENODEV;
 	}
-	desc = idxd_desc->iax_hw;
 
-	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR |
-		IDXD_OP_FLAG_RD_SRC2_AECS | IDXD_OP_FLAG_CC;
-	desc->opcode = IAX_OPCODE_COMPRESS;
-	desc->compr_flags = IAA_COMP_FLAGS;
-	desc->priv = 0;
-
-	desc->src1_addr = (u64)src_addr;
-	desc->src1_size = slen;
-	desc->dst_addr = (u64)dst_addr;
-	desc->max_dst_size = *dlen;
-	desc->src2_addr = active_compression_mode->aecs_comp_table_dma_addr;
-	desc->src2_size = sizeof(struct aecs_comp_table_record);
-	desc->completion_addr = idxd_desc->compl_dma;
+	desc = iaa_setup_compress_hw_desc(idxd_desc, src_addr, slen, dst_addr, *dlen,
+					  ctx->mode, iaa_device->compression_modes[ctx->mode]);
 
 	if (likely(!ctx->use_irq)) {
 		ret = idxd_submit_desc(wq, idxd_desc);
@@ -1852,7 +1879,6 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 			  dma_addr_t src_addr, unsigned int slen,
 			  dma_addr_t dst_addr, unsigned int *dlen)
 {
-	struct iaa_device_compression_mode *active_compression_mode;
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	struct iaa_device *iaa_device;
 	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
@@ -1870,8 +1896,6 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	pdev = idxd->pdev;
 	dev = &pdev->dev;
 
-	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
 	while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < ctx->alloc_decomp_desc_timeout)) {
 		idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
 		cpu_relax();
@@ -1884,19 +1908,8 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 		idxd_desc = NULL;
 		goto fallback_software_decomp;
 	}
-	desc = idxd_desc->iax_hw;
 
-	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
-	desc->opcode = IAX_OPCODE_DECOMPRESS;
-	desc->max_dst_size = PAGE_SIZE;
-	desc->decompr_flags = IAA_DECOMP_FLAGS;
-	desc->priv = 0;
-
-	desc->src1_addr = (u64)src_addr;
-	desc->dst_addr = (u64)dst_addr;
-	desc->max_dst_size = *dlen;
-	desc->src1_size = slen;
-	desc->completion_addr = idxd_desc->compl_dma;
+	desc = iaa_setup_decompress_hw_desc(idxd_desc, src_addr, slen, dst_addr, *dlen);
 
 	if (likely(!ctx->use_irq)) {
 		ret = idxd_submit_desc(wq, idxd_desc);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 08/24] crypto: iaa - Simplified, efficient job submissions for non-irq mode.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (6 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 07/24] crypto: iaa - Refactor hardware descriptor setup into separate procedures Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 09/24] crypto: iaa - Deprecate exporting add/remove IAA compression modes Kanchana P Sridhar
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch adds a new procedure, iaa_submit_desc_movdir64b(), that
directly calls movdir64b. The core iaa_crypto routines that submit
compress and decompress jobs now invoke iaa_submit_desc_movdir64b() in
non-irq driver modes, instead of idxd_submit_desc().

idxd_submit_desc() is called only in irq mode.

This improves latency for the most common iaa_crypto usage
(i.e., async non-irq) in zswap/zram by eliminating redundant computation
that would otherwise be incurred in idxd_submit_desc():

  p50: -32 ns
  p99: -1,048 ns

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 30 ++++++++++++++--------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index a9e6809e63dff..63d0cb4015433 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1782,6 +1782,24 @@ iaa_setup_decompress_hw_desc(struct idxd_desc *idxd_desc,
 	return desc;
 }
 
+/*
+ * Call this for non-irq, non-enqcmds job submissions.
+ */
+static __always_inline void iaa_submit_desc_movdir64b(struct idxd_wq *wq,
+						     struct idxd_desc *desc)
+{
+	void __iomem *portal = idxd_wq_portal_addr(wq);
+
+	/*
+	 * The wmb() flushes writes to coherent DMA data before
+	 * possibly triggering a DMA read. The wmb() is necessary
+	 * even on UP because the recipient is a device.
+	 */
+	wmb();
+
+	iosubmit_cmds512(portal, desc->hw, 1);
+}
+
 static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 			struct idxd_wq *wq,
 			dma_addr_t src_addr, unsigned int slen,
@@ -1820,11 +1838,7 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 					  ctx->mode, iaa_device->compression_modes[ctx->mode]);
 
 	if (likely(!ctx->use_irq)) {
-		ret = idxd_submit_desc(wq, idxd_desc);
-		if (ret) {
-			dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
-			goto out;
-		}
+		iaa_submit_desc_movdir64b(wq, idxd_desc);
 
 		/* Update stats */
 		update_total_comp_calls();
@@ -1912,11 +1926,7 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	desc = iaa_setup_decompress_hw_desc(idxd_desc, src_addr, slen, dst_addr, *dlen);
 
 	if (likely(!ctx->use_irq)) {
-		ret = idxd_submit_desc(wq, idxd_desc);
-		if (ret) {
-			dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
-			goto fallback_software_decomp;
-		}
+		iaa_submit_desc_movdir64b(wq, idxd_desc);
 
 		/* Update stats */
 		update_total_decomp_calls();
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 09/24] crypto: iaa - Deprecate exporting add/remove IAA compression modes.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (7 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 08/24] crypto: iaa - Simplified, efficient job submissions for non-irq mode Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 10/24] crypto: iaa - Rearchitect the iaa_crypto driver to be usable by zswap and zram Kanchana P Sridhar
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

There is no use case right now for kernel users to dynamically
add/remove IAA compression modes; hence this commit deletes the symbol
exports of add_iaa_compression_mode() and remove_iaa_compression_mode().

The only supported usage model of IAA compression modes is for the code
to be statically linked during the iaa_crypto module build,
e.g. iaa_crypto_comp_fixed.c, and for available modes to be registered
when the first IAA device wq is probed.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 63d0cb4015433..182c41816a97c 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -367,10 +367,6 @@ static void free_iaa_compression_mode(struct iaa_compression_mode *mode)
  * These tables are typically generated and captured using statistics
  * collected from running actual compress/decompress workloads.
  *
- * A module or other kernel code can add and remove compression modes
- * with a given name using the exported @add_iaa_compression_mode()
- * and @remove_iaa_compression_mode functions.
- *
  * When a new compression mode is added, the tables are saved in a
  * global compression mode list.  When IAA devices are added, a
  * per-IAA device dma mapping is created for each IAA device, for each
@@ -404,7 +400,6 @@ void remove_iaa_compression_mode(const char *name)
 out:
 	mutex_unlock(&iaa_devices_lock);
 }
-EXPORT_SYMBOL_GPL(remove_iaa_compression_mode);
 
 /**
  * add_iaa_compression_mode - Add an IAA compression mode
@@ -481,7 +476,6 @@ int add_iaa_compression_mode(const char *name,
 	free_iaa_compression_mode(mode);
 	goto out;
 }
-EXPORT_SYMBOL_GPL(add_iaa_compression_mode);
 
 static void free_device_compression_mode(struct iaa_device *iaa_device,
 					 struct iaa_device_compression_mode *device_mode)
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 10/24] crypto: iaa - Rearchitect the iaa_crypto driver to be usable by zswap and zram.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (8 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 09/24] crypto: iaa - Deprecate exporting add/remove IAA compression modes Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 11/24] crypto: iaa - Enablers for submitting descriptors then polling for completion Kanchana P Sridhar
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch rearchitects the iaa_crypto driver to be usable by
non-crypto_acomp kernel users such as zram. The crypto_acomp interface
is also preserved for use by zswap. The core driver code is moved under
a crypto_acomp-agnostic layer that relies only on idxd, dma and
scatterlist.

Additionally, this patch resolves a race condition triggered when
IAA wqs and devices are continuously disabled/enabled while workloads are
using IAA for compression/decompression. This commit, in combination
with patches 0002 ("crypto: iaa - New architecture for IAA device WQ
comp/decomp usage & core mapping.") and 0005 ("crypto: iaa - iaa_wq uses
percpu_refs for get/put reference counting.") in this series, fixes the
race condition. This has been verified by bisection.

The newly added include/linux/iaa_comp.h provides the data structures
and API for use by non-crypto_acomp kernel code such as zram.

This allows kernel users, i.e. zswap and zram, to use IAA's hardware
acceleration for compression/decompression with or without crypto_acomp.

Towards this goal, most of the driver code has been made independent of
crypto_acomp by introducing a new "struct iaa_req" data structure and
lightweight internal translation routines to/from crypto_acomp, namely
acomp_to_iaa() and iaa_to_acomp().

The exception is that the driver defines a "static struct crypto_acomp
*deflate_crypto_acomp" for the software decompress fallback
path. This should not be an issue for zram because the fallback is
encapsulated within the iaa_crypto driver.

The acomp_alg .compress() and .decompress() interfaces call into
iaa_comp_acompress_main() and iaa_comp_adecompress_main(), which are
wrappers around the core crypto-independent driver functions.

A zram/zcomp backend for iaa_crypto will be submitted as a separate
patch series, using these interfaces from iaa_comp.h:

       int iaa_comp_compress(enum iaa_mode mode, struct iaa_req *req);

       int iaa_comp_decompress(enum iaa_mode mode, struct iaa_req *req);
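
For illustration, a non-crypto_acomp caller could use these roughly as
follows (a hypothetical sketch only; the helper and its buffer handling
are not part of this series):

    static int example_compress_page(struct page *page, void *dst_buf,
                                     unsigned int *dst_len)
    {
            struct scatterlist src_sg, dst_sg;
            struct iaa_req req = {};
            enum iaa_mode mode;
            int ret;

            /* Look up the registered IAA mode for this compressor name. */
            mode = iaa_comp_get_compressor_mode("deflate-iaa");
            if (mode == IAA_MODE_NONE)
                    return -ENODEV;

            sg_init_table(&src_sg, 1);
            sg_set_page(&src_sg, page, PAGE_SIZE, 0);
            sg_init_one(&dst_sg, dst_buf, PAGE_SIZE);

            req.src = &src_sg;
            req.dst = &dst_sg;
            req.slen = PAGE_SIZE;
            req.dlen = PAGE_SIZE;

            ret = iaa_comp_compress(mode, &req);
            if (!ret)
                    *dst_len = req.dlen;

            return ret;
    }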

These iaa_crypto interfaces will continue to be available through
crypto_acomp for use in zswap:

       int crypto_acomp_compress(struct acomp_req *req);
       int crypto_acomp_decompress(struct acomp_req *req);

Some other changes introduced by this commit are:

1) iaa_crypto symbol namespace is changed from "IDXD" to
   "CRYPTO_DEV_IAA_CRYPTO".

2) Some constants and data structures are moved to
   include/linux/iaa_comp.h so as to be usable in developing the zram
   iaa_crypto backend.

Fixes: ea7a5cbb4369 ("crypto: iaa - Add Intel IAA Compression Accelerator crypto driver core")
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/Makefile          |   2 +-
 drivers/crypto/intel/iaa/iaa_crypto.h      |   7 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 368 ++++++++++++++++++---
 include/linux/iaa_comp.h                   |  86 +++++
 4 files changed, 403 insertions(+), 60 deletions(-)
 create mode 100644 include/linux/iaa_comp.h

diff --git a/drivers/crypto/intel/iaa/Makefile b/drivers/crypto/intel/iaa/Makefile
index 55bda7770fac7..ebfa1a425f808 100644
--- a/drivers/crypto/intel/iaa/Makefile
+++ b/drivers/crypto/intel/iaa/Makefile
@@ -3,7 +3,7 @@
 # Makefile for IAA crypto device drivers
 #
 
-ccflags-y += -I $(srctree)/drivers/dma/idxd -DDEFAULT_SYMBOL_NAMESPACE='"IDXD"'
+ccflags-y += -I $(srctree)/drivers/dma/idxd -DDEFAULT_SYMBOL_NAMESPACE='"CRYPTO_DEV_IAA_CRYPTO"'
 
 obj-$(CONFIG_CRYPTO_DEV_IAA_CRYPTO) := iaa_crypto.o
 
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 9611f2518f42c..190157967e3ba 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -6,6 +6,7 @@
 
 #include <linux/crypto.h>
 #include <linux/idxd.h>
+#include <linux/iaa_comp.h>
 #include <uapi/linux/idxd.h>
 
 #define IDXD_SUBDRIVER_NAME		"crypto"
@@ -29,8 +30,6 @@
 #define IAA_ERROR_COMP_BUF_OVERFLOW	0x19
 #define IAA_ERROR_WATCHDOG_EXPIRED	0x24
 
-#define IAA_COMP_MODES_MAX		2
-
 #define FIXED_HDR			0x2
 #define FIXED_HDR_SIZE			3
 
@@ -138,10 +137,6 @@ int add_iaa_compression_mode(const char *name,
 
 void remove_iaa_compression_mode(const char *name);
 
-enum iaa_mode {
-	IAA_MODE_FIXED,
-};
-
 struct iaa_compression_ctx {
 	enum iaa_mode	mode;
 	u16		alloc_comp_desc_timeout;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 182c41816a97c..fad0a9274a2de 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -11,6 +11,7 @@
 #include <linux/highmem.h>
 #include <linux/sched/smt.h>
 #include <crypto/internal/acompress.h>
+#include <linux/iaa_comp.h>
 
 #include "idxd.h"
 #include "iaa_crypto.h"
@@ -51,6 +52,9 @@ static struct wq_table_entry **pkg_global_decomp_wqs;
 /* All comp wqs from IAAs on a package. */
 static struct wq_table_entry **pkg_global_comp_wqs;
 
+/* For software deflate fallback compress/decompress. */
+static struct crypto_acomp *deflate_crypto_acomp;
+
 LIST_HEAD(iaa_devices);
 DEFINE_MUTEX(iaa_devices_lock);
 
@@ -93,9 +97,18 @@ static atomic_t iaa_crypto_enabled = ATOMIC_INIT(0);
 static struct idxd_wq *first_wq_found;
 DEFINE_MUTEX(first_wq_found_lock);
 
-static bool iaa_crypto_registered;
+const char *iaa_compression_mode_names[IAA_COMP_MODES_MAX] = {
+	"fixed",
+};
+
+const char *iaa_compression_alg_names[IAA_COMP_MODES_MAX] = {
+	"deflate-iaa",
+};
 
 static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
+static struct iaa_compression_ctx *iaa_ctx[IAA_COMP_MODES_MAX];
+static bool iaa_mode_registered[IAA_COMP_MODES_MAX];
+static u8 num_iaa_modes_registered;
 
 /* Distribute decompressions across all IAAs on the package. */
 static bool iaa_distribute_decomps;
@@ -353,6 +366,20 @@ static struct iaa_compression_mode *find_iaa_compression_mode(const char *name,
 	return NULL;
 }
 
+static bool iaa_alg_is_registered(const char *name, int *idx)
+{
+	int i;
+
+	for (i = 0; i < IAA_COMP_MODES_MAX; ++i) {
+		if (!strcmp(name, iaa_compression_alg_names[i]) && iaa_mode_registered[i]) {
+			*idx = i;
+			return true;
+		}
+	}
+
+	return false;
+}
+
 static void free_iaa_compression_mode(struct iaa_compression_mode *mode)
 {
 	kfree(mode->name);
@@ -466,6 +493,7 @@ int add_iaa_compression_mode(const char *name,
 		 mode->name, idx);
 
 	iaa_compression_modes[idx] = mode;
+	++num_iaa_modes_registered;
 
 	ret = 0;
 out:
@@ -1434,11 +1462,15 @@ static struct idxd_wq *comp_wq_table_next_wq(int cpu)
  * Core iaa_crypto compress/decompress functions.
  *************************************************/
 
-static int deflate_generic_decompress(struct acomp_req *req)
+static int deflate_generic_decompress(struct iaa_req *req)
 {
-	ACOMP_FBREQ_ON_STACK(fbreq, req);
+	ACOMP_REQUEST_ON_STACK(fbreq, deflate_crypto_acomp);
 	int ret;
 
+	acomp_request_set_callback(fbreq, 0, NULL, NULL);
+	acomp_request_set_params(fbreq, req->src, req->dst, req->slen,
+				 PAGE_SIZE);
+
 	ret = crypto_acomp_decompress(fbreq);
 	req->dlen = fbreq->dlen;
 
@@ -1447,6 +1479,24 @@ static int deflate_generic_decompress(struct acomp_req *req)
 	return ret;
 }
 
+static __always_inline void acomp_to_iaa(struct acomp_req *areq,
+					 struct iaa_req *req,
+					 struct iaa_compression_ctx *ctx)
+{
+	req->src = areq->src;
+	req->dst = areq->dst;
+	req->slen = areq->slen;
+	req->dlen = areq->dlen;
+	req->flags = areq->base.flags;
+	if (ctx->use_irq)
+		req->drv_data = areq;
+}
+
+static __always_inline void iaa_to_acomp(struct iaa_req *req, struct acomp_req *areq)
+{
+	areq->dlen = req->dlen;
+}
+
 static inline int check_completion(struct device *dev,
 				   struct iax_completion_record *comp,
 				   bool compress,
@@ -1508,7 +1558,7 @@ static inline int check_completion(struct device *dev,
 }
 
 static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
-				struct acomp_req *req,
+				struct iaa_req *req,
 				dma_addr_t *src_addr, dma_addr_t *dst_addr)
 {
 	int ret = 0;
@@ -1547,13 +1597,11 @@ static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
 	return ret;
 }
 
-static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
+static int iaa_compress_verify(struct iaa_compression_ctx *ctx, struct iaa_req *req,
 			       struct idxd_wq *wq,
 			       dma_addr_t src_addr, unsigned int slen,
 			       dma_addr_t dst_addr, unsigned int dlen)
 {
-	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
-	u32 *compression_crc = acomp_request_ctx(req);
 	struct iaa_device *iaa_device;
 	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
 	u16 alloc_desc_retries = 0;
@@ -1606,10 +1654,10 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 		goto err;
 	}
 
-	if (*compression_crc != idxd_desc->iax_completion->crc) {
+	if (req->compression_crc != idxd_desc->iax_completion->crc) {
 		ret = -EINVAL;
 		dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
-			" comp=0x%x, decomp=0x%x\n", *compression_crc,
+			" comp=0x%x, decomp=0x%x\n", req->compression_crc,
 			idxd_desc->iax_completion->crc);
 		print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
 			       8, 1, idxd_desc->iax_completion, 64, 0);
@@ -1635,6 +1683,7 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 	struct iaa_wq *iaa_wq;
 	struct pci_dev *pdev;
 	struct device *dev;
+	struct iaa_req req;
 	int ret, err = 0;
 
 	compression_ctx = crypto_tfm_ctx(ctx->tfm);
@@ -1660,12 +1709,18 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 			pr_warn("%s: falling back to deflate-generic decompress, "
 				"analytics error code %x\n", __func__,
 				idxd_desc->iax_completion->error_code);
-			ret = deflate_generic_decompress(ctx->req);
+
+			acomp_to_iaa(ctx->req, &req, compression_ctx);
+			ret = deflate_generic_decompress(&req);
+			iaa_to_acomp(&req, ctx->req);
+
 			if (ret) {
 				dev_dbg(dev, "%s: deflate-generic failed ret=%d\n",
 					__func__, ret);
 				err = -EIO;
 				goto err;
+			} else {
+				goto verify;
 			}
 		} else {
 			err = -EIO;
@@ -1684,21 +1739,26 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 		update_wq_decomp_bytes(iaa_wq->wq, ctx->req->slen);
 	}
 
+verify:
 	if (ctx->compress && compression_ctx->verify_compress) {
-		u32 *compression_crc = acomp_request_ctx(ctx->req);
 		dma_addr_t src_addr, dst_addr;
 
-		*compression_crc = idxd_desc->iax_completion->crc;
+		acomp_to_iaa(ctx->req, &req, compression_ctx);
+		req.compression_crc = idxd_desc->iax_completion->crc;
+
+		ret = iaa_remap_for_verify(dev, iaa_wq, &req, &src_addr, &dst_addr);
+		iaa_to_acomp(&req, ctx->req);
 
-		ret = iaa_remap_for_verify(dev, iaa_wq, ctx->req, &src_addr, &dst_addr);
 		if (ret) {
 			dev_dbg(dev, "%s: compress verify remap failed ret=%d\n", __func__, ret);
 			err = -EIO;
 			goto out;
 		}
 
-		ret = iaa_compress_verify(ctx->tfm, ctx->req, iaa_wq->wq, src_addr,
+		ret = iaa_compress_verify(compression_ctx, &req, iaa_wq->wq, src_addr,
 					  ctx->req->slen, dst_addr, ctx->req->dlen);
+		iaa_to_acomp(&req, ctx->req);
+
 		if (ret) {
 			dev_dbg(dev, "%s: compress verify failed ret=%d\n", __func__, ret);
 			err = -EIO;
@@ -1794,13 +1854,11 @@ static __always_inline void iaa_submit_desc_movdir64b(struct idxd_wq *wq,
 	iosubmit_cmds512(portal, desc->hw, 1);
 }
 
-static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
+static int iaa_compress(struct iaa_compression_ctx *ctx, struct iaa_req *req,
 			struct idxd_wq *wq,
 			dma_addr_t src_addr, unsigned int slen,
 			dma_addr_t dst_addr, unsigned int *dlen)
 {
-	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
-	u32 *compression_crc = acomp_request_ctx(req);
 	struct iaa_device *iaa_device;
 	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
 	u16 alloc_desc_retries = 0;
@@ -1848,17 +1906,18 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 		}
 
 		*dlen = idxd_desc->iax_completion->output_size;
+		req->compression_crc = idxd_desc->iax_completion->crc;
 
 		/* Update stats */
 		update_total_comp_bytes_out(*dlen);
 		update_wq_comp_bytes(wq, *dlen);
-
-		*compression_crc = idxd_desc->iax_completion->crc;
 	} else {
+		struct acomp_req *areq = req->drv_data;
+
 		desc->flags |= IDXD_OP_FLAG_RCI;
 
-		idxd_desc->crypto.req = req;
-		idxd_desc->crypto.tfm = tfm;
+		idxd_desc->crypto.req = areq;
+		idxd_desc->crypto.tfm = areq->base.tfm;
 		idxd_desc->crypto.src_addr = src_addr;
 		idxd_desc->crypto.dst_addr = dst_addr;
 		idxd_desc->crypto.compress = true;
@@ -1882,12 +1941,11 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 	return ret;
 }
 
-static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
+static int iaa_decompress(struct iaa_compression_ctx *ctx, struct iaa_req *req,
 			  struct idxd_wq *wq,
 			  dma_addr_t src_addr, unsigned int slen,
 			  dma_addr_t dst_addr, unsigned int *dlen)
 {
-	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	struct iaa_device *iaa_device;
 	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
 	u16 alloc_desc_retries = 0;
@@ -1931,10 +1989,12 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 
 		ret = check_completion(dev, idxd_desc->iax_completion, false, false);
 	} else {
+		struct acomp_req *areq = req->drv_data;
+
 		desc->flags |= IDXD_OP_FLAG_RCI;
 
-		idxd_desc->crypto.req = req;
-		idxd_desc->crypto.tfm = tfm;
+		idxd_desc->crypto.req = areq;
+		idxd_desc->crypto.tfm = areq->base.tfm;
 		idxd_desc->crypto.src_addr = src_addr;
 		idxd_desc->crypto.dst_addr = dst_addr;
 		idxd_desc->crypto.compress = false;
@@ -1985,20 +2045,16 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	return ret;
 }
 
-static int iaa_comp_acompress(struct acomp_req *req)
+static int iaa_comp_acompress(struct iaa_compression_ctx *ctx, struct iaa_req *req)
 {
-	struct iaa_compression_ctx *compression_ctx;
-	struct crypto_tfm *tfm = req->base.tfm;
 	dma_addr_t src_addr, dst_addr;
 	int nr_sgs, cpu, ret = 0;
 	struct iaa_wq *iaa_wq;
 	struct idxd_wq *wq;
 	struct device *dev;
 
-	compression_ctx = crypto_tfm_ctx(tfm);
-
-	if (!req->src || !req->slen) {
-		pr_debug("invalid src, not compressing\n");
+	if (!req->src || !req->slen || !req->dst) {
+		pr_debug("invalid src/dst, not compressing\n");
 		return -EINVAL;
 	}
 
@@ -2034,19 +2090,19 @@ static int iaa_comp_acompress(struct acomp_req *req)
 	}
 	dst_addr = sg_dma_address(req->dst);
 
-	ret = iaa_compress(tfm, req, wq, src_addr, req->slen, dst_addr,
+	ret = iaa_compress(ctx, req, wq, src_addr, req->slen, dst_addr,
 			   &req->dlen);
 	if (ret == -EINPROGRESS)
 		return ret;
 
-	if (!ret && compression_ctx->verify_compress) {
+	if (!ret && ctx->verify_compress) {
 		ret = iaa_remap_for_verify(dev, iaa_wq, req, &src_addr, &dst_addr);
 		if (ret) {
 			dev_dbg(dev, "%s: compress verify remap failed ret=%d\n", __func__, ret);
 			goto out;
 		}
 
-		ret = iaa_compress_verify(tfm, req, wq, src_addr, req->slen,
+		ret = iaa_compress_verify(ctx, req, wq, src_addr, req->slen,
 					  dst_addr, req->dlen);
 		if (ret)
 			dev_dbg(dev, "asynchronous compress verification failed ret=%d\n", ret);
@@ -2069,9 +2125,8 @@ static int iaa_comp_acompress(struct acomp_req *req)
 	return ret;
 }
 
-static int iaa_comp_adecompress(struct acomp_req *req)
+static int iaa_comp_adecompress(struct iaa_compression_ctx *ctx, struct iaa_req *req)
 {
-	struct crypto_tfm *tfm = req->base.tfm;
 	dma_addr_t src_addr, dst_addr;
 	int nr_sgs, cpu, ret = 0;
 	struct iaa_wq *iaa_wq;
@@ -2115,7 +2170,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 	}
 	dst_addr = sg_dma_address(req->dst);
 
-	ret = iaa_decompress(tfm, req, wq, src_addr, req->slen,
+	ret = iaa_decompress(ctx, req, wq, src_addr, req->slen,
 			     dst_addr, &req->dlen);
 	if (ret == -EINPROGRESS)
 		return ret;
@@ -2132,8 +2187,9 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 	return ret;
 }
 
-static void compression_ctx_init(struct iaa_compression_ctx *ctx)
+static void compression_ctx_init(struct iaa_compression_ctx *ctx, enum iaa_mode mode)
 {
+	ctx->mode = mode;
 	ctx->alloc_comp_desc_timeout = IAA_ALLOC_DESC_COMP_TIMEOUT;
 	ctx->alloc_decomp_desc_timeout = IAA_ALLOC_DESC_DECOMP_TIMEOUT;
 	ctx->verify_compress = iaa_verify_compress;
@@ -2141,26 +2197,164 @@ static void compression_ctx_init(struct iaa_compression_ctx *ctx)
 	ctx->use_irq = use_irq;
 }
 
+static __always_inline bool iaa_compressor_enabled(void)
+{
+	return (atomic_read(&iaa_crypto_enabled) && num_iaa_modes_registered);
+}
+
+static __always_inline enum iaa_mode iaa_compressor_is_registered(const char *compressor_name)
+{
+	u8 i;
+
+	if (!atomic_read(&iaa_crypto_enabled) || !num_iaa_modes_registered)
+		return IAA_MODE_NONE;
+
+	for (i = 0; i < IAA_COMP_MODES_MAX; ++i) {
+		if (iaa_mode_registered[i] &&
+		    !strcmp(iaa_compression_alg_names[i], compressor_name))
+			return (enum iaa_mode)i;
+	}
+
+	return IAA_MODE_NONE;
+}
+
+/***********************************************************
+ * Interfaces for non-crypto_acomp kernel users, e.g. zram.
+ ***********************************************************/
+
+__always_inline bool iaa_comp_enabled(void)
+{
+	return iaa_compressor_enabled();
+}
+EXPORT_SYMBOL_GPL(iaa_comp_enabled);
+
+__always_inline enum iaa_mode iaa_comp_get_compressor_mode(const char *compressor_name)
+{
+	return iaa_compressor_is_registered(compressor_name);
+}
+EXPORT_SYMBOL_GPL(iaa_comp_get_compressor_mode);
+
+__always_inline bool iaa_comp_mode_is_registered(enum iaa_mode mode)
+{
+	return iaa_mode_registered[mode];
+}
+EXPORT_SYMBOL_GPL(iaa_comp_mode_is_registered);
+
+void iaa_comp_put_modes(char **iaa_mode_names, enum iaa_mode *iaa_modes, u8 nr_modes)
+{
+	u8 i;
+
+	if (iaa_mode_names) {
+		for (i = 0; i < nr_modes; ++i)
+			kfree(iaa_mode_names[i]);
+		kfree(iaa_mode_names);
+	}
+
+	kfree(iaa_modes);
+}
+EXPORT_SYMBOL_GPL(iaa_comp_put_modes);
+
+u8 iaa_comp_get_modes(char **iaa_mode_names, enum iaa_mode *iaa_modes)
+{
+	u8 i, nr_modes = 0;
+
+	if (!atomic_read(&iaa_crypto_enabled) || !num_iaa_modes_registered)
+		return 0;
+
+	iaa_mode_names = kcalloc(num_iaa_modes_registered, sizeof(char *), GFP_KERNEL);
+	if (!iaa_mode_names)
+		goto err;
+
+	iaa_modes = kcalloc(num_iaa_modes_registered, sizeof(enum iaa_mode), GFP_KERNEL);
+	if (!iaa_modes)
+		goto err;
+
+	for (i = 0; i < IAA_COMP_MODES_MAX; ++i) {
+		if (iaa_mode_registered[i]) {
+			iaa_mode_names[nr_modes] = kzalloc(sizeof(char) * 30, GFP_KERNEL);
+			if (!iaa_mode_names[nr_modes])
+				goto err;
+			strscpy(iaa_mode_names[nr_modes], iaa_compression_alg_names[i],
+				sizeof(iaa_mode_names[nr_modes]));
+			iaa_modes[nr_modes] = (enum iaa_mode)nr_modes;
+			++nr_modes;
+		}
+	}
+
+	return nr_modes;
+
+err:
+	iaa_comp_put_modes(iaa_mode_names, iaa_modes, num_iaa_modes_registered);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iaa_comp_get_modes);
+
+__always_inline int iaa_comp_compress(enum iaa_mode mode, struct iaa_req *req)
+{
+	return iaa_comp_acompress(iaa_ctx[mode], req);
+}
+EXPORT_SYMBOL_GPL(iaa_comp_compress);
+
+__always_inline int iaa_comp_decompress(enum iaa_mode mode, struct iaa_req *req)
+{
+	return iaa_comp_adecompress(iaa_ctx[mode], req);
+}
+EXPORT_SYMBOL_GPL(iaa_comp_decompress);
+
 /*********************************************
  * Interfaces to crypto_alg and crypto_acomp.
  *********************************************/
 
+static __always_inline int iaa_comp_acompress_main(struct acomp_req *areq)
+{
+	struct crypto_tfm *tfm = areq->base.tfm;
+	struct iaa_compression_ctx *ctx;
+	struct iaa_req req;
+	int ret = -ENODEV, idx;
+
+	if (iaa_alg_is_registered(crypto_tfm_alg_driver_name(tfm), &idx)) {
+		ctx = iaa_ctx[idx];
+
+		acomp_to_iaa(areq, &req, ctx);
+		ret = iaa_comp_acompress(ctx, &req);
+		iaa_to_acomp(&req, areq);
+	}
+
+	return ret;
+}
+
+static __always_inline int iaa_comp_adecompress_main(struct acomp_req *areq)
+{
+	struct crypto_tfm *tfm = areq->base.tfm;
+	struct iaa_compression_ctx *ctx;
+	struct iaa_req req;
+	int ret = -ENODEV, idx;
+
+	if (iaa_alg_is_registered(crypto_tfm_alg_driver_name(tfm), &idx)) {
+		ctx = iaa_ctx[idx];
+
+		acomp_to_iaa(areq, &req, ctx);
+		ret = iaa_comp_adecompress(ctx, &req);
+		iaa_to_acomp(&req, areq);
+	}
+
+	return ret;
+}
+
 static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 
-	ctx->mode = IAA_MODE_FIXED;
-
-	compression_ctx_init(ctx);
+	ctx = iaa_ctx[IAA_MODE_FIXED];
 
 	return 0;
 }
 
 static struct acomp_alg iaa_acomp_fixed_deflate = {
 	.init			= iaa_comp_init_fixed,
-	.compress		= iaa_comp_acompress,
-	.decompress		= iaa_comp_adecompress,
+	.compress		= iaa_comp_acompress_main,
+	.decompress		= iaa_comp_adecompress_main,
 	.base			= {
 		.cra_name		= "deflate",
 		.cra_driver_name	= "deflate-iaa",
@@ -2172,29 +2366,89 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
 	}
 };
 
+/*******************************************
+ * Implement idxd_device_driver interfaces.
+ *******************************************/
+
+static void iaa_unregister_compression_device(void)
+{
+	unsigned int i;
+
+	atomic_set(&iaa_crypto_enabled, 0);
+
+	for (i = 0; i < IAA_COMP_MODES_MAX; ++i) {
+		iaa_mode_registered[i] = false;
+		kfree(iaa_ctx[i]);
+		iaa_ctx[i] = NULL;
+	}
+
+	num_iaa_modes_registered = 0;
+}
+
 static int iaa_register_compression_device(void)
 {
-	int ret;
+	struct iaa_compression_mode *mode;
+	int i, idx;
+
+	for (i = 0; i < IAA_COMP_MODES_MAX; ++i) {
+		iaa_mode_registered[i] = false;
+		mode = find_iaa_compression_mode(iaa_compression_mode_names[i], &idx);
+		if (mode) {
+			iaa_ctx[i] = kmalloc(sizeof(struct iaa_compression_ctx), GFP_KERNEL);
+			if (!iaa_ctx[i])
+				goto err;
+
+			compression_ctx_init(iaa_ctx[i], (enum iaa_mode)i);
+			iaa_mode_registered[i] = true;
+		}
+	}
+
+	BUG_ON(!iaa_mode_registered[IAA_MODE_FIXED]);
+	return 0;
+
+err:
+	iaa_unregister_compression_device();
+	return -ENODEV;
+}
+
+static int iaa_register_acomp_compression_device(void)
+{
+	int ret = -ENOMEM;
+
+	deflate_crypto_acomp = crypto_alloc_acomp("deflate", 0, 0);
+	if (IS_ERR_OR_NULL(deflate_crypto_acomp))
+		goto err_deflate_acomp;
 
 	ret = crypto_register_acomp(&iaa_acomp_fixed_deflate);
 	if (ret) {
 		pr_err("deflate algorithm acomp fixed registration failed (%d)\n", ret);
-		goto out;
+		goto err_fixed;
 	}
 
-	iaa_crypto_registered = true;
-out:
+	return 0;
+
+err_fixed:
+	if (!IS_ERR_OR_NULL(deflate_crypto_acomp)) {
+		crypto_free_acomp(deflate_crypto_acomp);
+		deflate_crypto_acomp = NULL;
+	}
+
+err_deflate_acomp:
+	iaa_unregister_compression_device();
 	return ret;
 }
 
-static int iaa_unregister_compression_device(void)
+static void iaa_unregister_acomp_compression_device(void)
 {
 	atomic_set(&iaa_crypto_enabled, 0);
 
-	if (iaa_crypto_registered)
+	if (iaa_mode_registered[IAA_MODE_FIXED])
 		crypto_unregister_acomp(&iaa_acomp_fixed_deflate);
 
-	return 0;
+	if (!IS_ERR_OR_NULL(deflate_crypto_acomp)) {
+		crypto_free_acomp(deflate_crypto_acomp);
+		deflate_crypto_acomp = NULL;
+	}
 }
 
 static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
@@ -2264,6 +2518,12 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 			goto err_register;
 		}
 
+		ret = iaa_register_acomp_compression_device();
+		if (ret != 0) {
+			dev_dbg(dev, "IAA compression device acomp registration failed\n");
+			goto err_register;
+		}
+
 		if (!rebalance_wq_table()) {
 			dev_dbg(dev, "%s: Rerun after registration: IAA rebalancing device wq tables failed\n", __func__);
 			goto err_register;
@@ -2340,6 +2600,8 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 		pkg_global_wqs_dealloc();
 		free_wq_tables();
 		BUG_ON(!list_empty(&iaa_devices));
+		iaa_unregister_acomp_compression_device();
+		iaa_unregister_compression_device();
 		INIT_LIST_HEAD(&iaa_devices);
 		module_put(THIS_MODULE);
 
@@ -2456,8 +2718,8 @@ static int __init iaa_crypto_init_module(void)
 
 static void __exit iaa_crypto_cleanup_module(void)
 {
-	if (iaa_unregister_compression_device())
-		pr_debug("IAA compression device unregister failed\n");
+	iaa_unregister_acomp_compression_device();
+	iaa_unregister_compression_device();
 
 	iaa_crypto_debugfs_cleanup();
 	driver_remove_file(&iaa_crypto_driver.drv,
diff --git a/include/linux/iaa_comp.h b/include/linux/iaa_comp.h
new file mode 100644
index 0000000000000..ec061315f4772
--- /dev/null
+++ b/include/linux/iaa_comp.h
@@ -0,0 +1,86 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2021 Intel Corporation. All rights rsvd. */
+
+#ifndef __IAA_COMP_H__
+#define __IAA_COMP_H__
+
+#if IS_ENABLED(CONFIG_CRYPTO_DEV_IAA_CRYPTO)
+
+#include <linux/scatterlist.h>
+
+#define IAA_COMP_MODES_MAX  IAA_MODE_NONE
+
+enum iaa_mode {
+	IAA_MODE_FIXED = 0,
+	IAA_MODE_NONE = 1,
+};
+
+struct iaa_req {
+	struct scatterlist *src;
+	struct scatterlist *dst;
+	unsigned int slen;
+	unsigned int dlen;
+	u32 flags;
+	u32 compression_crc;
+	void *drv_data; /* for driver internal use */
+};
+
+extern bool iaa_comp_enabled(void);
+
+extern enum iaa_mode iaa_comp_get_compressor_mode(const char *compressor_name);
+
+extern bool iaa_comp_mode_is_registered(enum iaa_mode mode);
+
+extern u8 iaa_comp_get_modes(char **iaa_mode_names, enum iaa_mode *iaa_modes);
+
+extern void iaa_comp_put_modes(char **iaa_mode_names, enum iaa_mode *iaa_modes, u8 nr_modes);
+
+extern int iaa_comp_compress(enum iaa_mode mode, struct iaa_req *req);
+
+extern int iaa_comp_decompress(enum iaa_mode mode, struct iaa_req *req);
+
+#else /* CONFIG_CRYPTO_DEV_IAA_CRYPTO */
+
+enum iaa_mode {
+	IAA_MODE_NONE = 1,
+};
+
+struct iaa_req {};
+
+static inline bool iaa_comp_enabled(void)
+{
+	return false;
+}
+
+static inline enum iaa_mode iaa_comp_get_compressor_mode(const char *compressor_name)
+{
+	return IAA_MODE_NONE;
+}
+
+static inline bool iaa_comp_mode_is_registered(enum iaa_mode mode)
+{
+	return false;
+}
+
+static inline u8 iaa_comp_get_modes(char **iaa_mode_names, enum iaa_mode *iaa_modes)
+{
+	return 0;
+}
+
+static inline void iaa_comp_put_modes(char **iaa_mode_names, enum iaa_mode *iaa_modes, u8 nr_modes)
+{
+}
+
+static inline int iaa_comp_compress(enum iaa_mode mode, struct iaa_req *req)
+{
+	return -EINVAL;
+}
+
+static inline int iaa_comp_decompress(enum iaa_mode mode, struct iaa_req *req)
+{
+	return -EINVAL;
+}
+
+#endif /* CONFIG_CRYPTO_DEV_IAA_CRYPTO */
+
+#endif
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 11/24] crypto: iaa - Enablers for submitting descriptors then polling for completion.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (9 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 10/24] crypto: iaa - Rearchitect the iaa_crypto driver to be usable by zswap and zram Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 12/24] crypto: acomp - Add "void *kernel_data" in "struct acomp_req" for kernel users Kanchana P Sridhar
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch adds capabilities in the IAA driver for kernel users to
benefit from compressing/decompressing multiple jobs in parallel
using IAA hardware acceleration, without the use of interrupts. Instead,
this is accomplished using an async "submit-poll" mechanism.

To achieve this, we break down a compress/decompress job into two
separate activities if the driver is configured for non-irq async mode:

1) Submit a descriptor after caching the "idxd_desc" descriptor in the
   req->drv_data, and return -EINPROGRESS.
2) Poll: Given a request, retrieve the descriptor and poll its completion
   status for success/error.

This is enabled by the following additions in the driver:

1) The idxd_desc is cached in the "drv_data" member of "struct iaa_req".

2) IAA_REQ_POLL_FLAG: if set in the iaa_req's flags, this tells
   the driver that it should submit the descriptor and return
   -EINPROGRESS. If not set, the driver will proceed to call
   check_completion() in fully synchronous mode, until the hardware
   returns a completion status.

3) iaa_comp_poll() procedure: This routine is intended to be called
   after submission returns -EINPROGRESS. It will check the completion
   status once, and return -EAGAIN if the job has not completed. If the
   job has completed, it will return the completion status.

The purpose of this commit is to allow kernel users of iaa_crypto, such
as zswap, to be able to invoke the crypto_acomp_compress() API in fully
synchronous mode for sequential/non-batching use cases (i.e. today's
status-quo), wherein zswap calls:

  crypto_wait_req(crypto_acomp_compress(req), wait);

and to non-intrusively invoke the fully asynchronous batch
compress/decompress functionality that will be introduced in subsequent
patches. Both use cases need to reuse the same code paths in the driver
to interface with the hardware: the IAA_REQ_POLL_FLAG allows this
shared code to determine whether an iaa_req needs to be processed
synchronously or asynchronously. The idea is to simplify iaa_crypto's
sequential/batching interfaces for use by zswap and zram.

Thus, regardless of the iaa_crypto driver's 'sync_mode' setting, it
can still be forced to use synchronous mode by *not setting* the
IAA_REQ_POLL_FLAG in iaa_req->flags: this is the default to support
sequential use cases in zswap today.

When IAA batching functionality is introduced subsequently, it will set
the IAA_REQ_POLL_FLAG for the requests in a batch. We will submit the
descriptors for each request in the batch in iaa_[de]compress(), and
return -EINPROGRESS. The hardware will begin processing each request as
soon as it is submitted; essentially, all compress/decompress jobs will
be parallelized. The polling function, "iaa_comp_poll()", will retrieve
the descriptor from each iaa_req->drv_data to check its completion
status. This enables the iaa_crypto driver to implement true async
"submit-polling" for parallel compressions and decompressions in the IAA
hardware accelerator.

Both these conditions need to be met for a request to be processed in
fully async submit-poll mode:

 1) use_irq should be "false"
 2) iaa_req->flags & IAA_REQ_POLL_FLAG should be "true"
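
As an illustration only (this sketch is not part of the diff below and
the function name is hypothetical), the intended submit-poll flow,
which the batching code in a later patch builds on, is roughly:

  static int example_submit_then_poll(struct iaa_compression_ctx *ctx,
                                      struct iaa_req *req)
  {
          int ret;

          req->flags |= IAA_REQ_POLL_FLAG;     /* submit, don't wait */

          ret = iaa_comp_acompress(ctx, req);  /* expect -EINPROGRESS */
          if (ret != -EINPROGRESS)
                  return ret;                  /* sync fallback or error */

          /* Poll until the hardware reports a completion status. */
          do {
                  ret = iaa_comp_poll(ctx, req);
          } while (ret == -EAGAIN);

          return ret;
  }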

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |  6 ++
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 71 +++++++++++++++++++++-
 2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 190157967e3ba..1cc383c94fb80 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -41,6 +41,12 @@
 					 IAA_DECOMP_CHECK_FOR_EOB | \
 					 IAA_DECOMP_STOP_ON_EOB)
 
+/*
+ * If set, the driver must have a way to submit the req, then
+ * poll its completion status for success/error.
+ */
+#define IAA_REQ_POLL_FLAG		0x00000002
+
 /* Representation of IAA workqueue */
 struct iaa_wq {
 	struct list_head	list;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index fad0a9274a2de..107522142be5c 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1890,13 +1890,14 @@ static int iaa_compress(struct iaa_compression_ctx *ctx, struct iaa_req *req,
 					  ctx->mode, iaa_device->compression_modes[ctx->mode]);
 
 	if (likely(!ctx->use_irq)) {
+		req->drv_data = idxd_desc;
 		iaa_submit_desc_movdir64b(wq, idxd_desc);
 
 		/* Update stats */
 		update_total_comp_calls();
 		update_wq_comp_calls(wq);
 
-		if (ctx->async_mode)
+		if (req->flags & IAA_REQ_POLL_FLAG)
 			return -EINPROGRESS;
 
 		ret = check_completion(dev, idxd_desc->iax_completion, true, false);
@@ -1978,13 +1979,14 @@ static int iaa_decompress(struct iaa_compression_ctx *ctx, struct iaa_req *req,
 	desc = iaa_setup_decompress_hw_desc(idxd_desc, src_addr, slen, dst_addr, *dlen);
 
 	if (likely(!ctx->use_irq)) {
+		req->drv_data = idxd_desc;
 		iaa_submit_desc_movdir64b(wq, idxd_desc);
 
 		/* Update stats */
 		update_total_decomp_calls();
 		update_wq_decomp_calls(wq);
 
-		if (ctx->async_mode)
+		if (req->flags & IAA_REQ_POLL_FLAG)
 			return -EINPROGRESS;
 
 		ret = check_completion(dev, idxd_desc->iax_completion, false, false);
@@ -2187,6 +2189,71 @@ static int iaa_comp_adecompress(struct iaa_compression_ctx *ctx, struct iaa_req
 	return ret;
 }
 
+static int __maybe_unused iaa_comp_poll(struct iaa_compression_ctx *ctx, struct iaa_req *req)
+{
+	struct idxd_desc *idxd_desc;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	struct idxd_wq *wq;
+	bool compress_op;
+	int ret;
+
+	idxd_desc = req->drv_data;
+	if (!idxd_desc)
+		return -EAGAIN;
+
+	compress_op = (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS);
+	wq = idxd_desc->wq;
+	iaa_wq = idxd_wq_get_private(wq);
+	idxd = iaa_wq->iaa_device->idxd;
+	pdev = idxd->pdev;
+	dev = &pdev->dev;
+
+	ret = check_completion(dev, idxd_desc->iax_completion, compress_op, true);
+	if (ret == -EAGAIN)
+		return ret;
+	if (ret)
+		goto out;
+
+	req->dlen = idxd_desc->iax_completion->output_size;
+
+	/* Update stats */
+	if (compress_op) {
+		update_total_comp_bytes_out(req->dlen);
+		update_wq_comp_bytes(wq, req->dlen);
+	} else {
+		update_total_decomp_bytes_in(req->slen);
+		update_wq_decomp_bytes(wq, req->slen);
+	}
+
+	if (compress_op && ctx->verify_compress) {
+		dma_addr_t src_addr, dst_addr;
+
+		req->compression_crc = idxd_desc->iax_completion->crc;
+
+		dma_sync_sg_for_device(dev, req->dst, 1, DMA_FROM_DEVICE);
+		dma_sync_sg_for_device(dev, req->src, 1, DMA_TO_DEVICE);
+
+		src_addr = sg_dma_address(req->src);
+		dst_addr = sg_dma_address(req->dst);
+
+		ret = iaa_compress_verify(ctx, req, wq, src_addr, req->slen,
+					  dst_addr, req->dlen);
+	}
+
+out:
+	/* caller doesn't call crypto_wait_req, so no acomp_request_complete() */
+	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+
+	idxd_free_desc(idxd_desc->wq, idxd_desc);
+	percpu_ref_put(&iaa_wq->ref);
+
+	return ret;
+}
+
 static void compression_ctx_init(struct iaa_compression_ctx *ctx, enum iaa_mode mode)
 {
 	ctx->mode = mode;
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 12/24] crypto: acomp - Add "void *kernel_data" in "struct acomp_req" for kernel users.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (10 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 11/24] crypto: iaa - Enablers for submitting descriptors then polling for completion Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 13/24] crypto: iaa - IAA Batching for parallel compressions/decompressions Kanchana P Sridhar
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This commit adds a "void *kernel_data" member in "struct acomp_req":

  @kernel_data:  Private API kernel code data for kernel users

This allows kernel modules such as zswap and zram to input driver data
without interfering with existing usage of acomp_req->base.data.

Since acomp_request_set_params() is the main interface for kernel users
to initialize the acomp_req members, this routine sets
acomp_req->kernel_data to NULL. Kernel users such as zswap will need to
explicitly set acomp_req->kernel_data for interacting with
crypto_acomp_[de]compress(). This usage model will be covered in a
separate patch-series.
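
For illustration only (the assigned value is a placeholder; the actual
batch-data structures arrive in later patches), the expected usage
pattern for a kernel user is roughly:

  acomp_request_set_params(req, src, dst, slen, dlen); /* kernel_data = NULL */
  req->kernel_data = driver_private_data;  /* hypothetical driver-defined data */
  ret = crypto_acomp_compress(req);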

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/crypto/acompress.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 9eacb9fa375d7..0312322d2ca03 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -79,6 +79,7 @@ struct acomp_req_chain {
  * @dvirt:	Destination virtual address
  * @slen:	Size of the input buffer
  * @dlen:	Size of the output buffer and number of bytes produced
+ * @kernel_data:  Private API kernel code data for kernel users
  * @chain:	Private API code data, do not use
  * @__ctx:	Start of private context data
  */
@@ -95,6 +96,7 @@ struct acomp_req {
 	unsigned int slen;
 	unsigned int dlen;
 
+	void *kernel_data;
 	struct acomp_req_chain chain;
 
 	void *__ctx[] CRYPTO_MINALIGN_ATTR;
@@ -354,6 +356,7 @@ static inline void acomp_request_set_params(struct acomp_req *req,
 	req->dst = dst;
 	req->slen = slen;
 	req->dlen = dlen;
+	req->kernel_data = NULL;
 
 	req->base.flags &= ~(CRYPTO_ACOMP_REQ_SRC_VIRT |
 			     CRYPTO_ACOMP_REQ_SRC_NONDMA |
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 13/24] crypto: iaa - IAA Batching for parallel compressions/decompressions.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (11 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 12/24] crypto: acomp - Add "void *kernel_data" in "struct acomp_req" for kernel users Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 14/24] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch introduces batch compressions/decompressions in
iaa_crypto. Two new interfaces are provided for use in the kernel,
either directly (as in the zram/zcomp backend), or through the
acomp_req->kernel_data pointer when calling crypto_acomp_[de]compress()
(as in the case of zswap).

IAA Batching allows the kernel swap modules to compress/decompress
multiple pages/buffers in parallel in hardware, significantly improving
swapout/swapin latency and throughput.

The patch defines an iaa_crypto constant, IAA_CRYPTO_MAX_BATCH_SIZE
(currently set to 8U). This is the maximum batch-size for IAA, and
represents the maximum number of pages/buffers that can be
compressed/decompressed in parallel.

In order to support IAA batching, the iaa_crypto driver allocates a
per-CPU "struct iaa_req *reqs[]" array of IAA_CRYPTO_MAX_BATCH_SIZE
requests upon initialization. Notably, the task of allocating multiple
requests to submit to the hardware for parallel [de]compressions is
taken over by iaa_crypto, so that zswap/zram don't need to allocate the
reqs.

Batching is called with multiple iaa_reqs and pages, and works as
follows:

1) All input iaa_reqs are submitted to the hardware in async mode, using
   movdir64b. This enables hardware parallelism, because we don't wait
   for one compress/decompress job to finish before submitting the next
   one.

2) The iaa_reqs submitted are polled for completion statuses in a
   non-blocking manner in a while loop: each request that is still
   pending is polled once, and this repeats, until all requests have
   completed.

IAA's maximum batch-size can be queried with the following API:

  unsigned int iaa_comp_get_max_batch_size(void);

This allows swap modules such as zram to allocate the required batching
dst buffers and then invoke fully asynchronous, parallel batch
compression/decompression of pages/buffers on systems with Intel IAA by
invoking these batching APIs, respectively (an illustrative usage
sketch follows the prototypes below):

  int iaa_comp_compress_batch(
        enum iaa_mode mode,
        struct iaa_req *reqs[],
        struct page *pages[],
        u8 *dsts[],
        unsigned int dlens[],
        int errors[],
        int nr_reqs);

  int iaa_comp_decompress_batch(
        enum iaa_mode mode,
        struct iaa_req *reqs[],
        u8 *srcs[],
        struct page *pages[],
        unsigned int slens[],
        unsigned int dlens[],
        int errors[],
        int nr_reqs);
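
An illustrative sketch of the direct interface (assumptions: "reqs",
"pages", the 2*PAGE_SIZE "dsts" buffers and the "dlens"/"errors" arrays
have already been set up by the caller, sized to
iaa_comp_get_max_batch_size()):

  enum iaa_mode mode = iaa_comp_get_compressor_mode("deflate-iaa");
  unsigned int nr = iaa_comp_get_max_batch_size();
  int ret;

  if (mode == IAA_MODE_NONE)
          return -ENODEV;

  /* Compress 'nr' pages in parallel; per-request status in errors[]. */
  ret = iaa_comp_compress_batch(mode, reqs, pages, dsts, dlens,
                                errors, nr);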

A zram/zcomp backend_deflate_iaa.c will be submitted as a separate patch
series, and will enable single-page and batch IAA compress/decompress
ops.

The zswap interface to these batching APIs will be provided by setting
acomp_req->kernel_data to a "struct swap_batch_comp_data *" or a
"struct swap_batch_decomp_data *" for batch compression/decompression
respectively, using the existing
crypto_acomp_compress()/crypto_acomp_decompress() interfaces.
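
For illustration, using the driver-side struct iaa_batch_comp_data
added by this patch (zswap's mirror struct is not part of this series),
the zswap-side batch compress call would look roughly like:

  struct iaa_batch_comp_data bcdata = {
          .pages = pages, .dsts = dsts, .dlens = dlens,
          .errors = errors, .nr_comps = nr,
  };

  req->kernel_data = &bcdata;        /* selects the batch path */
  ret = crypto_acomp_compress(req);  /* 0 on whole-batch success */
  req->kernel_data = NULL;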

The new crypto_acomp-agnostic iaa_comp_[de]compress_batch() APIs result
in impressive latency improvements for zswap batch [de]compression, as
compared to a crypto_acomp-based batching interface, most likely because
we avoid the overhead of crypto_acomp: we observe 17.78 micro-seconds of
p99 latency savings for a decompress batch of 8 with the new
iaa_comp_decompress_batch() API.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |  14 +
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 352 ++++++++++++++++++++-
 include/linux/iaa_comp.h                   |  72 +++++
 3 files changed, 430 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 1cc383c94fb80..3086bf18126e5 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -47,6 +47,20 @@
  */
 #define IAA_REQ_POLL_FLAG		0x00000002
 
+/*
+ * The maximum compress/decompress batch size for IAA's batch compression
+ * and batch decompression functionality.
+ */
+#define IAA_CRYPTO_MAX_BATCH_SIZE 8U
+
+/*
+ * Used to create a per-CPU structure comprising IAA_CRYPTO_MAX_BATCH_SIZE
+ * reqs for batch [de]compressions.
+ */
+struct iaa_batch_ctx {
+	struct iaa_req **reqs;
+};
+
 /* Representation of IAA workqueue */
 struct iaa_wq {
 	struct list_head	list;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 107522142be5c..19f87923e2466 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -55,6 +55,9 @@ static struct wq_table_entry **pkg_global_comp_wqs;
 /* For software deflate fallback compress/decompress. */
 static struct crypto_acomp *deflate_crypto_acomp;
 
+/* Per-cpu iaa_reqs for batching. */
+static struct iaa_batch_ctx __percpu *iaa_batch_ctx;
+
 LIST_HEAD(iaa_devices);
 DEFINE_MUTEX(iaa_devices_lock);
 
@@ -2189,7 +2192,12 @@ static int iaa_comp_adecompress(struct iaa_compression_ctx *ctx, struct iaa_req
 	return ret;
 }
 
-static int __maybe_unused iaa_comp_poll(struct iaa_compression_ctx *ctx, struct iaa_req *req)
+static __always_inline unsigned int iaa_get_max_batch_size(void)
+{
+	return IAA_CRYPTO_MAX_BATCH_SIZE;
+}
+
+static int iaa_comp_poll(struct iaa_compression_ctx *ctx, struct iaa_req *req)
 {
 	struct idxd_desc *idxd_desc;
 	struct idxd_device *idxd;
@@ -2254,6 +2262,224 @@ static int __maybe_unused iaa_comp_poll(struct iaa_compression_ctx *ctx, struct
 	return ret;
 }
 
+static __always_inline void iaa_set_req_poll(
+	struct iaa_req *reqs[],
+	int nr_reqs,
+	bool set_flag)
+{
+	int i;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		set_flag ? (reqs[i]->flags |= IAA_REQ_POLL_FLAG) :
+			   (reqs[i]->flags &= ~IAA_REQ_POLL_FLAG);
+	}
+}
+
+/**
+ * This API provides IAA compress batching functionality for use by swap
+ * modules.
+ *
+ * @ctx:  compression ctx for the requested IAA mode (fixed/dynamic).
+ * @reqs: @nr_reqs compress requests.
+ * @pages: Pages to be compressed by IAA.
+ * @dsts: Pre-allocated destination buffers to store results of IAA
+ *        compression. Each element of @dsts must be of size "PAGE_SIZE * 2".
+ * @dlens: Will contain the compressed lengths.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_reqs: The number of requests, up to IAA_CRYPTO_MAX_BATCH_SIZE,
+ *           to be compressed.
+ *
+ * Returns 0 if all compress requests in the batch complete successfully,
+ * -EINVAL otherwise.
+ */
+static int iaa_comp_acompress_batch(
+	struct iaa_compression_ctx *ctx,
+	struct iaa_req *reqs[],
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs)
+{
+	struct scatterlist inputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	struct scatterlist outputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	bool compressions_done = false;
+	int i, err = 0;
+
+	BUG_ON(nr_reqs > IAA_CRYPTO_MAX_BATCH_SIZE);
+
+	iaa_set_req_poll(reqs, nr_reqs, true);
+
+	/*
+	 * Prepare and submit the batch of iaa_reqs to IAA. IAA will process
+	 * these compress jobs in parallel.
+	 */
+	for (i = 0; i < nr_reqs; ++i) {
+		reqs[i]->src = &inputs[i];
+		reqs[i]->dst = &outputs[i];
+		sg_init_table(reqs[i]->src, 1);
+		sg_set_page(reqs[i]->src, pages[i], PAGE_SIZE, 0);
+
+		/*
+		 * We need PAGE_SIZE * 2 here since there may be an over-compression
+		 * case, and hardware accelerators may not check the dst buffer size,
+		 * so give the dst buffer enough length to avoid a buffer overflow.
+		 */
+		sg_init_one(reqs[i]->dst, dsts[i], PAGE_SIZE * 2);
+		reqs[i]->slen = PAGE_SIZE;
+		reqs[i]->dlen = PAGE_SIZE;
+
+		errors[i] = iaa_comp_acompress(ctx, reqs[i]);
+
+		if (likely(errors[i] == -EINPROGRESS))
+			errors[i] = -EAGAIN;
+		else if (errors[i])
+			err = -EINVAL;
+		else
+			dlens[i] = reqs[i]->dlen;
+	}
+
+	/*
+	 * Asynchronously poll for and process IAA compress job completions.
+	 */
+	while (!compressions_done) {
+		compressions_done = true;
+
+		for (i = 0; i < nr_reqs; ++i) {
+			/*
+			 * Skip, if the compression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = iaa_comp_poll(ctx, reqs[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					compressions_done = false;
+				else
+					err = -EINVAL;
+			} else {
+				dlens[i] = reqs[i]->dlen;
+			}
+		}
+	}
+
+	/*
+	 * For the same 'reqs[]' to be usable by
+	 * iaa_comp_acompress()/iaa_comp_adecompress(),
+	 * clear the IAA_REQ_POLL_FLAG bit on all iaa_reqs.
+	 */
+	iaa_set_req_poll(reqs, nr_reqs, false);
+
+	return err;
+}
+
+/**
+ * This API provides IAA decompress batching functionality for use by swap
+ * modules.
+ *
+ * @ctx:  compression ctx for the requested IAA mode (fixed/dynamic).
+ * @reqs: @nr_reqs decompress requests.
+ * @srcs: The src buffers to be decompressed by IAA.
+ * @pages: The pages to store the decompressed buffers.
+ * @slens: Compressed lengths of @srcs.
+ * @dlens: Will contain the decompressed lengths.
+ * @errors: zero on successful decompression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_reqs: The number of requests, up to IAA_CRYPTO_MAX_BATCH_SIZE,
+ *            to be decompressed.
+ *
+ * The caller should check @errors and handle reqs[i]->dlen != PAGE_SIZE.
+ *
+ * Returns 0 if all decompress requests complete successfully,
+ * -EINVAL otherwise.
+ */
+static int iaa_comp_adecompress_batch(
+	struct iaa_compression_ctx *ctx,
+	struct iaa_req *reqs[],
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs)
+{
+	struct scatterlist inputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	struct scatterlist outputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	bool decompressions_done = false;
+	int i, err = 0;
+
+	BUG_ON(nr_reqs > IAA_CRYPTO_MAX_BATCH_SIZE);
+
+	iaa_set_req_poll(reqs, nr_reqs, true);
+
+	/*
+	 * Prepare and submit the batch of iaa_reqs to IAA. IAA will process
+	 * these decompress jobs in parallel.
+	 */
+	for (i = 0; i < nr_reqs; ++i) {
+		reqs[i]->src = &inputs[i];
+		reqs[i]->dst = &outputs[i];
+		sg_init_one(reqs[i]->src, srcs[i], slens[i]);
+		sg_init_table(reqs[i]->dst, 1);
+		sg_set_page(reqs[i]->dst, pages[i], PAGE_SIZE, 0);
+		reqs[i]->slen = slens[i];
+		reqs[i]->dlen = PAGE_SIZE;
+
+		errors[i] = iaa_comp_adecompress(ctx, reqs[i]);
+
+		/*
+		 * If desc allocation/submission failed, errors[i] can be
+		 * 0 or an error value from the software decompress fallback.
+		 */
+		if (likely(errors[i] == -EINPROGRESS))
+			errors[i] = -EAGAIN;
+		else if (errors[i])
+			err = -EINVAL;
+		else
+			dlens[i] = reqs[i]->dlen;
+	}
+
+	/*
+	 * Asynchronously poll for and process IAA decompress job completions.
+	 */
+	while (!decompressions_done) {
+		decompressions_done = true;
+
+		for (i = 0; i < nr_reqs; ++i) {
+			/*
+			 * Skip, if the decompression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = iaa_comp_poll(ctx, reqs[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					decompressions_done = false;
+				else
+					err = -EINVAL;
+			} else {
+				dlens[i] = reqs[i]->dlen;
+			}
+		}
+	}
+
+	/*
+	 * For the same 'reqs[]' to be usable by
+	 * iaa_comp_acompress()/iaa_comp_adecompress(),
+	 * clear the IAA_REQ_POLL_FLAG bit on all iaa_reqs.
+	 */
+	iaa_set_req_poll(reqs, nr_reqs, false);
+
+	return err;
+}
+
 static void compression_ctx_init(struct iaa_compression_ctx *ctx, enum iaa_mode mode)
 {
 	ctx->mode = mode;
@@ -2356,6 +2582,12 @@ u8 iaa_comp_get_modes(char **iaa_mode_names, enum iaa_mode *iaa_modes)
 }
 EXPORT_SYMBOL_GPL(iaa_comp_get_modes);
 
+__always_inline unsigned int iaa_comp_get_max_batch_size(void)
+{
+	return iaa_get_max_batch_size();
+}
+EXPORT_SYMBOL_GPL(iaa_comp_get_max_batch_size);
+
 __always_inline int iaa_comp_compress(enum iaa_mode mode, struct iaa_req *req)
 {
 	return iaa_comp_acompress(iaa_ctx[mode], req);
@@ -2368,6 +2600,33 @@ __always_inline int iaa_comp_decompress(enum iaa_mode mode, struct iaa_req *req)
 }
 EXPORT_SYMBOL_GPL(iaa_comp_decompress);
 
+__always_inline int iaa_comp_compress_batch(
+	enum iaa_mode mode,
+	struct iaa_req *reqs[],
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs)
+{
+	return iaa_comp_acompress_batch(iaa_ctx[mode], reqs, pages, dsts, dlens, errors, nr_reqs);
+}
+EXPORT_SYMBOL_GPL(iaa_comp_compress_batch);
+
+__always_inline int iaa_comp_decompress_batch(
+	enum iaa_mode mode,
+	struct iaa_req *reqs[],
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs)
+{
+	return iaa_comp_adecompress_batch(iaa_ctx[mode], reqs, srcs, pages, slens, dlens, errors, nr_reqs);
+}
+EXPORT_SYMBOL_GPL(iaa_comp_decompress_batch);
+
 /*********************************************
  * Interfaces to crypto_alg and crypto_acomp.
  *********************************************/
@@ -2382,9 +2641,19 @@ static __always_inline int iaa_comp_acompress_main(struct acomp_req *areq)
 	if (iaa_alg_is_registered(crypto_tfm_alg_driver_name(tfm), &idx)) {
 		ctx = iaa_ctx[idx];
 
-		acomp_to_iaa(areq, &req, ctx);
-		ret = iaa_comp_acompress(ctx, &req);
-		iaa_to_acomp(&req, areq);
+		if (likely(!areq->kernel_data)) {
+			acomp_to_iaa(areq, &req, ctx);
+			ret = iaa_comp_acompress(ctx, &req);
+			iaa_to_acomp(&req, areq);
+			return ret;
+		} else {
+			struct iaa_batch_comp_data *bcdata = (struct iaa_batch_comp_data *)areq->kernel_data;
+			struct iaa_batch_ctx *cpu_ctx = raw_cpu_ptr(iaa_batch_ctx);
+
+			return iaa_comp_acompress_batch(ctx, cpu_ctx->reqs, bcdata->pages,
+							bcdata->dsts, bcdata->dlens,
+							bcdata->errors, bcdata->nr_comps);
+		}
 	}
 
 	return ret;
@@ -2400,9 +2669,19 @@ static __always_inline int iaa_comp_adecompress_main(struct acomp_req *areq)
 	if (iaa_alg_is_registered(crypto_tfm_alg_driver_name(tfm), &idx)) {
 		ctx = iaa_ctx[idx];
 
-		acomp_to_iaa(areq, &req, ctx);
-		ret = iaa_comp_adecompress(ctx, &req);
-		iaa_to_acomp(&req, areq);
+		if (likely(!areq->kernel_data)) {
+			acomp_to_iaa(areq, &req, ctx);
+			ret = iaa_comp_adecompress(ctx, &req);
+			iaa_to_acomp(&req, areq);
+			return ret;
+		} else {
+			struct iaa_batch_decomp_data *bddata = (struct iaa_batch_decomp_data *)areq->kernel_data;
+			struct iaa_batch_ctx *cpu_ctx = raw_cpu_ptr(iaa_batch_ctx);
+
+			return iaa_comp_adecompress_batch(ctx, cpu_ctx->reqs, bddata->srcs, bddata->pages,
+							  bddata->slens, bddata->dlens,
+							  bddata->errors, bddata->nr_decomps);
+		}
 	}
 
 	return ret;
@@ -2698,9 +2977,31 @@ static struct idxd_device_driver iaa_crypto_driver = {
  * Module init/exit.
  ********************/
 
+static void iaa_batch_ctx_dealloc(void)
+{
+	int cpu;
+	u8 i;
+
+	if (!iaa_batch_ctx)
+		return;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		struct iaa_batch_ctx *cpu_ctx = per_cpu_ptr(iaa_batch_ctx, cpu);
+
+		if (cpu_ctx && cpu_ctx->reqs) {
+			for (i = 0; i < IAA_CRYPTO_MAX_BATCH_SIZE; ++i)
+				kfree(cpu_ctx->reqs[i]);
+			kfree(cpu_ctx->reqs);
+		}
+	}
+
+	free_percpu(iaa_batch_ctx);
+}
+
 static int __init iaa_crypto_init_module(void)
 {
-	int ret = 0;
+	int cpu, ret = 0;
+	u8 i;
 
 	INIT_LIST_HEAD(&iaa_devices);
 
@@ -2755,6 +3056,35 @@ static int __init iaa_crypto_init_module(void)
 		goto err_sync_attr_create;
 	}
 
+	/* Allocate batching resources for iaa_crypto. */
+	iaa_batch_ctx = alloc_percpu_gfp(struct iaa_batch_ctx, GFP_KERNEL | __GFP_ZERO);
+	if (!iaa_batch_ctx) {
+		pr_err("Failed to allocate per-cpu iaa_batch_ctx\n");
+		goto batch_ctx_fail;
+	}
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		struct iaa_batch_ctx *cpu_ctx = per_cpu_ptr(iaa_batch_ctx, cpu);
+
+		cpu_ctx->reqs = kcalloc_node(IAA_CRYPTO_MAX_BATCH_SIZE,
+					     sizeof(struct iaa_req *),
+					     GFP_KERNEL,
+					     cpu_to_node(cpu));
+
+		if (!cpu_ctx->reqs)
+			goto reqs_fail;
+
+		for (i = 0; i < IAA_CRYPTO_MAX_BATCH_SIZE; ++i) {
+			cpu_ctx->reqs[i] = kzalloc_node(sizeof(struct iaa_req),
+							GFP_KERNEL,
+							cpu_to_node(cpu));
+			if (!cpu_ctx->reqs[i]) {
+				pr_err("could not alloc iaa_req reqs[%d]\n", i);
+				goto reqs_fail;
+			}
+		}
+	}
+
 	if (iaa_crypto_debugfs_init())
 		pr_warn("debugfs init failed, stats not available\n");
 
@@ -2762,6 +3092,11 @@ static int __init iaa_crypto_init_module(void)
 out:
 	return ret;
 
+reqs_fail:
+	iaa_batch_ctx_dealloc();
+batch_ctx_fail:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_sync_mode);
 err_sync_attr_create:
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
@@ -2788,6 +3123,7 @@ static void __exit iaa_crypto_cleanup_module(void)
 	iaa_unregister_acomp_compression_device();
 	iaa_unregister_compression_device();
 
+	iaa_batch_ctx_dealloc();
 	iaa_crypto_debugfs_cleanup();
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_sync_mode);
diff --git a/include/linux/iaa_comp.h b/include/linux/iaa_comp.h
index ec061315f4772..cbd78f83668d5 100644
--- a/include/linux/iaa_comp.h
+++ b/include/linux/iaa_comp.h
@@ -25,6 +25,27 @@ struct iaa_req {
 	void *drv_data; /* for driver internal use */
 };
 
+/*
+ * These next two data structures should exactly mirror the definitions of
+ * @struct swap_batch_comp_data and @struct swap_batch_decomp_data in mm/swap.h.
+ */
+struct iaa_batch_comp_data {
+	struct page **pages;
+	u8 **dsts;
+	unsigned int *dlens;
+	int *errors;
+	u8 nr_comps;
+};
+
+struct iaa_batch_decomp_data {
+	u8 **srcs;
+	struct page **pages;
+	unsigned int *slens;
+	unsigned int *dlens;
+	int *errors;
+	u8 nr_decomps;
+};
+
 extern bool iaa_comp_enabled(void);
 
 extern enum iaa_mode iaa_comp_get_compressor_mode(const char *compressor_name);
@@ -35,10 +56,31 @@ extern u8 iaa_comp_get_modes(char **iaa_mode_names, enum iaa_mode *iaa_modes);
 
 extern void iaa_comp_put_modes(char **iaa_mode_names, enum iaa_mode *iaa_modes, u8 nr_modes);
 
+extern unsigned int iaa_comp_get_max_batch_size(void);
+
 extern int iaa_comp_compress(enum iaa_mode mode, struct iaa_req *req);
 
 extern int iaa_comp_decompress(enum iaa_mode mode, struct iaa_req *req);
 
+extern int iaa_comp_compress_batch(
+	enum iaa_mode mode,
+	struct iaa_req *reqs[],
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs);
+
+extern int iaa_comp_decompress_batch(
+	enum iaa_mode mode,
+	struct iaa_req *reqs[],
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs);
+
 #else /* CONFIG_CRYPTO_DEV_IAA_CRYPTO */
 
 enum iaa_mode {
@@ -71,6 +113,11 @@ static inline void iaa_comp_put_modes(char **iaa_mode_names, enum iaa_mode *iaa_
 {
 }
 
+static inline unsigned int iaa_comp_get_max_batch_size(void)
+{
+	return 0;
+}
+
 static inline int iaa_comp_compress(enum iaa_mode mode, struct iaa_req *req)
 {
 	return -EINVAL;
@@ -81,6 +128,31 @@ static inline int iaa_comp_decompress(enum iaa_mode mode, struct iaa_req *req)
 	return -EINVAL;
 }
 
+static inline int iaa_comp_compress_batch(
+	enum iaa_mode mode,
+	struct iaa_req *reqs[],
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs)
+{
+	return -EINVAL;
+}
+
+static inline int iaa_comp_decompress_batch(
+	enum iaa_mode mode,
+	struct iaa_req *reqs[],
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs)
+{
+	return -EINVAL;
+}
+
 #endif /* CONFIG_CRYPTO_DEV_IAA_CRYPTO */
 
 #endif
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 14/24] crypto: iaa - Enable async mode and make it the default.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (12 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 13/24] crypto: iaa - IAA Batching for parallel compressions/decompressions Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 15/24] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch enables the 'async' sync_mode in the driver. Further, it sets
the default sync_mode to 'async', which makes it easier for IAA hardware
acceleration in the iaa_crypto driver to be loaded by default in the most
efficient/recommended 'async' mode for parallel
compressions/decompressions, namely, asynchronous submission of
descriptors, followed by polling for job completions. Earlier, the
"sync" mode used to be the default.

The iaa_crypto driver documentation has been updated with these
changes.

This way, anyone who wants to use IAA for zswap/zram can do so after
building the kernel, and without having to go through these steps to use
async mode:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo async > /sys/bus/dsa/drivers/crypto/sync_mode
  5) re-run initialization of the IAA devices and wqs

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 Documentation/driver-api/crypto/iaa/iaa-crypto.rst | 11 ++---------
 drivers/crypto/intel/iaa/iaa_crypto_main.c         |  4 ++--
 2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index 1c4c25f0dc5e4..4c235bf769824 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -272,7 +272,7 @@ The available attributes are:
       echo async_irq > /sys/bus/dsa/drivers/crypto/sync_mode
 
     Async mode without interrupts (caller must poll) can be enabled by
-    writing 'async' to it (please see Caveat)::
+    writing 'async' to it::
 
       echo async > /sys/bus/dsa/drivers/crypto/sync_mode
 
@@ -281,14 +281,7 @@ The available attributes are:
 
       echo sync > /sys/bus/dsa/drivers/crypto/sync_mode
 
-    The default mode is 'sync'.
-
-    Caveat: since the only mechanism that iaa_crypto currently implements
-    for async polling without interrupts is via the 'sync' mode as
-    described earlier, writing 'async' to
-    '/sys/bus/dsa/drivers/crypto/sync_mode' will internally enable the
-    'sync' mode. This is to ensure correct iaa_crypto behavior until true
-    async polling without interrupts is enabled in iaa_crypto.
+    The default mode is 'async'.
 
   - g_comp_wqs_per_iaa
 
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 19f87923e2466..7b5b202a8021a 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -152,7 +152,7 @@ static bool iaa_verify_compress = true;
  */
 
 /* Use async mode */
-static bool async_mode;
+static bool async_mode = true;
 /* Use interrupts */
 static bool use_irq;
 
@@ -206,7 +206,7 @@ static int set_iaa_sync_mode(const char *name)
 		async_mode = false;
 		use_irq = false;
 	} else if (sysfs_streq(name, "async")) {
-		async_mode = false;
+		async_mode = true;
 		use_irq = false;
 	} else if (sysfs_streq(name, "async_irq")) {
 		async_mode = true;
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 15/24] crypto: iaa - Disable iaa_verify_compress by default.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (13 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 14/24] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 16/24] crypto: iaa - Submit the two largest source buffers first in decompress batching Kanchana P Sridhar
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch makes it easier for IAA hardware acceleration in the iaa_crypto
driver to be loaded by default with "iaa_verify_compress" disabled, to
facilitate performance comparisons with software compressors (which also
do not run compress verification by default). Earlier, iaa_crypto compress
verification used to be enabled by default.

The iaa_crypto driver documentation has been updated with this change.

With this patch, if users want to enable compress verification, they can do
so with these steps:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo 1 > /sys/bus/dsa/drivers/crypto/verify_compress
  5) re-run initialization of the IAA devices and wqs

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 Documentation/driver-api/crypto/iaa/iaa-crypto.rst | 2 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c         | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index 4c235bf769824..9d2f3f895bdd8 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -239,7 +239,7 @@ The available attributes are:
 
       echo 0 > /sys/bus/dsa/drivers/crypto/verify_compress
 
-    The default setting is '1' - verify all compresses.
+    The default setting is '0' - to not verify compresses.
 
   - sync_mode
 
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 7b5b202a8021a..1166077900522 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -120,7 +120,7 @@ static bool iaa_distribute_decomps;
 static bool iaa_distribute_comps = true;
 
 /* Verify results of IAA compress or not */
-static bool iaa_verify_compress = true;
+static bool iaa_verify_compress;
 
 /*
  * The iaa crypto driver supports three 'sync' methods determining how
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 16/24] crypto: iaa - Submit the two largest source buffers first in decompress batching.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (14 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 15/24] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 17/24] crypto: iaa - Add deflate-iaa-dynamic compression mode Kanchana P Sridhar
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch finds the two largest source buffers in a given decompression
batch, and submits them first to the IAA decompress engines.

This improves decompress batching latency because the hardware has a
head start on decompressing the highest latency source buffers in the
batch. Workload performance is also significantly improved as a result
of this optimization.
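
For example, for a batch with source lengths {1200, 3800, 900, 2600},
buffers 1 and 3 are submitted first, followed by buffers 0 and 2 in
index order.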

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 60 +++++++++++++++++++++-
 1 file changed, 58 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 1166077900522..2f25e02ca0aa3 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -2377,6 +2377,35 @@ static int iaa_comp_acompress_batch(
 	return err;
 }
 
+/*
+ * Find the two largest source buffers in @slens for a decompress batch,
+ * and pass their indices back in @idx_max and @idx_next_max.
+ *
+ * Returns true if there is no second largest source buffer, only a max buffer.
+ */
+static __always_inline bool decomp_batch_get_max_slens_idx(
+	unsigned int slens[],
+	int nr_pages,
+	int *idx_max,
+	int *idx_next_max)
+{
+	int i, max_i = 0, next_max_i = 0;
+
+	for (i = 0; i < nr_pages; ++i) {
+		if (slens[i] >= slens[max_i]) {
+			next_max_i = max_i;
+			max_i = i;
+		} else if ((next_max_i == max_i) || (slens[i] > slens[next_max_i])) {
+			next_max_i = i;
+		}
+	}
+
+	*idx_max = max_i;
+	*idx_next_max = next_max_i;
+
+	return (next_max_i == max_i);
+}
+
 /**
  * This API provides IAA decompress batching functionality for use by swap
  * modules.
@@ -2409,18 +2438,36 @@ static int iaa_comp_adecompress_batch(
 {
 	struct scatterlist inputs[IAA_CRYPTO_MAX_BATCH_SIZE];
 	struct scatterlist outputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	bool max_processed = false, next_max_processed = false;
 	bool decompressions_done = false;
-	int i, err = 0;
+	int i, max_i, next_max_i, err = 0;
 
 	BUG_ON(nr_reqs > IAA_CRYPTO_MAX_BATCH_SIZE);
 
 	iaa_set_req_poll(reqs, nr_reqs, true);
 
+	/*
+	 * Get the indices of the two largest decomp buffers in the batch.
+	 * Submit them first. This improves latency of the batch.
+	 */
+	next_max_processed = decomp_batch_get_max_slens_idx(slens, nr_reqs,
+							    &max_i, &next_max_i);
+
+	i = max_i;
+
 	/*
 	 * Prepare and submit the batch of iaa_reqs to IAA. IAA will process
 	 * these decompress jobs in parallel.
 	 */
-	for (i = 0; i < nr_reqs; ++i) {
+	for (; i < nr_reqs; ++i) {
+		if ((i == max_i) && max_processed)
+			continue;
+		if ((i == next_max_i) && max_processed && next_max_processed)
+			continue;
+
+		if (max_processed && !next_max_processed)
+			i = next_max_i;
+
 		reqs[i]->src = &inputs[i];
 		reqs[i]->dst = &outputs[i];
 		sg_init_one(reqs[i]->src, srcs[i], slens[i]);
@@ -2441,6 +2488,15 @@ static int iaa_comp_adecompress_batch(
 			err = -EINVAL;
 		else
 			dlens[i] = reqs[i]->dlen;
+
+		if (i == max_i) {
+			max_processed = true;
+			i = -1;
+		}
+		if (i == next_max_i) {
+			next_max_processed = true;
+			i = -1;
+		}
 	}
 
 	/*
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 17/24] crypto: iaa - Add deflate-iaa-dynamic compression mode.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (15 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 16/24] crypto: iaa - Submit the two largest source buffers first in decompress batching Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 18/24] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size Kanchana P Sridhar
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Some versions of Intel IAA support dynamic compression where the hardware
dynamically computes the Huffman tables and generates a Deflate header
if the input size is no larger than 4KB. This patch will use IAA for
dynamic compression if an appropriate IAA is present and the input size is
not too big. If an IAA is not present, the algorithm will not
be available. Otherwise, if the size of the input is greater than
PAGE_SIZE, zlib is used to do the compression. If the algorithm is
selected, IAA will be used for decompression. If the compressed stream
contains a reference whose distance is greater than 4KB, hardware
decompression will fail, and the decompression will be done with zlib.

Intel IAA dynamic compression results in a compression ratio that is
better than or equal to the currently supported "fixed" compression mode
on the same data set. Compressing a data set of 4300 4KB pages sampled
from SPEC CPU17 workloads produces a compression ratio of 3.14 for IAA
dynamic compression and 2.69 for IAA fixed compression.

If an appropriate IAA exists, dynamic mode can be chosen as the IAA
compression mode by selecting the corresponding algorithm.

For example, to use IAA dynamic mode in zswap:

      echo deflate-iaa-dynamic > /sys/module/zswap/parameters/compressor

This patch also adds a deflate_generic_compress() fallback when dynamic
mode is selected and the input size is over 4KB, along with stats
support that counts these software fallback calls as
"total_sw_comp_calls" in the driver's global_stats.

Furthermore, we define IAA_DYN_ALLOC_DESC_COMP_TIMEOUT as 2000 for
dynamic mode compression on Granite Rapids.

Signed-off-by: Andre Glover <andre.glover@linux.intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 .../driver-api/crypto/iaa/iaa-crypto.rst      | 21 ++++
 crypto/testmgr.c                              | 10 ++
 crypto/testmgr.h                              | 74 ++++++++++++++
 drivers/crypto/intel/iaa/Makefile             |  2 +-
 drivers/crypto/intel/iaa/iaa_crypto.h         |  5 +
 .../intel/iaa/iaa_crypto_comp_dynamic.c       | 22 +++++
 drivers/crypto/intel/iaa/iaa_crypto_main.c    | 98 +++++++++++++++++--
 drivers/crypto/intel/iaa/iaa_crypto_stats.c   |  8 ++
 drivers/crypto/intel/iaa/iaa_crypto_stats.h   |  2 +
 include/linux/iaa_comp.h                      |  5 +-
 10 files changed, 236 insertions(+), 11 deletions(-)
 create mode 100644 drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c

diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index 9d2f3f895bdd8..5632c5072a90e 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -33,6 +33,8 @@ compresses and decompresses.
 Currently, there is only one compression modes available, 'fixed'
 mode.
 
+'dynamic' mode is available on certain generations of IAA hardware.
+
 The 'fixed' compression mode implements the compression scheme
 specified by RFC 1951 and is given the crypto algorithm name
 'deflate-iaa'.  (Because the IAA hardware has a 4k history-window
@@ -43,6 +45,25 @@ the IAA fixed mode deflate algorithm is given its own algorithm name
 rather than simply 'deflate').
 
 
+The 'dynamic' compression mode implements a compression scheme where
+the IAA hardware will internally do one pass through the data, compute the
+Huffman tables and generate a Deflate header, then automatically do a
+second pass through the data, generating the final compressed output. IAA
+dynamic compression can be used if an appropriate IAA is present and the
+input size is not too big.  If an appropriate IAA is not present, the
+algorithm will not be available. Otherwise, if the size of the input is too
+big, zlib is used to do the compression. If the algorithm is selected,
+IAA will be used for decompression. If the compressed stream contains a
+reference whose distance is greater than 4KB, hardware decompression will
+fail, and the decompression will be done with zlib. If an appropriate IAA
+exists, 'dynamic' compression is implemented by the 'deflate-iaa-dynamic'
+crypto algorithm.
+
+A zswap user can select the IAA 'dynamic' mode by choosing the
+'deflate-iaa-dynamic' crypto compression algorithm::
+
+  # echo deflate-iaa-dynamic > /sys/module/zswap/parameters/compressor
+
 Config options and other setup
 ==============================
 
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 32f753d6c4302..36ae7a8b42860 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -4703,6 +4703,16 @@ static const struct alg_test_desc alg_test_descs[] = {
 				.decomp = __VECS(deflate_decomp_tv_template)
 			}
 		}
+	}, {
+		.alg = "deflate-iaa-dynamic",
+		.test = alg_test_comp,
+		.fips_allowed = 1,
+		.suite = {
+			.comp = {
+				.comp = __VECS(deflate_iaa_dynamic_comp_tv_template),
+				.decomp = __VECS(deflate_iaa_dynamic_decomp_tv_template)
+			}
+		}
 	}, {
 		.alg = "dh",
 		.test = alg_test_kpp,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 32d099ac9e737..42db2399013eb 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -34575,6 +34575,80 @@ static const struct comp_testvec deflate_decomp_tv_template[] = {
 	},
 };
 
+static const struct comp_testvec deflate_iaa_dynamic_comp_tv_template[] = {
+	{
+		.inlen	= 70,
+		.outlen	= 46,
+		.input	= "Join us now and share the software "
+			"Join us now and share the software ",
+		.output = "\x85\xca\xc1\x09\x00\x20\x08\x05"
+			  "\xd0\x55\xfe\x3c\x6e\x21\x64\xd8"
+			  "\x45\x21\x0d\xd7\xb7\x26\xe8\xf8"
+			  "\xe0\x91\x2f\xc3\x09\x98\x17\xd8"
+			  "\x06\x42\x79\x0b\x52\x05\xe1\x33"
+			  "\xeb\x81\x3e\xe5\xa2\x01",
+	}, {
+		.inlen	= 191,
+		.outlen	= 121,
+		.input	= "This document describes a compression method based on the DEFLATE"
+			"compression algorithm.  This document defines the application of "
+			"the DEFLATE algorithm to the IP Payload Compression Protocol.",
+		.output = "\x5d\x8d\xc1\x0d\xc2\x30\x10\x04"
+			  "\x5b\xd9\x0a\xd2\x03\x82\x20\x21"
+			  "\xf1\xf0\x23\x0d\x5c\xec\x0b\xb6"
+			  "\x64\xfb\x2c\xdf\xf1\xa0\x7b\x12"
+			  "\x3e\x58\x79\xae\x76\x67\x76\x89"
+			  "\x49\x11\xc4\xbf\x0b\x57\x43\x60"
+			  "\xf5\x3d\xad\xac\x20\x78\x29\xad"
+			  "\xb3\x6a\x92\x8a\xc2\x16\x25\x60"
+			  "\x25\xe5\x80\x3d\x5b\x64\xdc\xe6"
+			  "\xfb\xf3\xb2\xcc\xe3\x8c\xf2\x4b"
+			  "\x7a\xb2\x58\x26\xe0\x2c\xde\x52"
+			  "\xdd\xb5\x07\x48\xad\xe5\xe4\xc9"
+			  "\x0e\x42\xb6\xd1\xf5\x17\xc0\xe4"
+			  "\x57\x3c\x1c\x1c\x7d\xb2\x50\xc0"
+			  "\x75\x38\x72\x5d\x4c\xbc\xe4\xe9"
+			  "\x0b",
+	},
+};
+
+static const struct comp_testvec deflate_iaa_dynamic_decomp_tv_template[] = {
+	{
+		.inlen	= 121,
+		.outlen	= 191,
+		.input	= "\x5d\x8d\xc1\x0d\xc2\x30\x10\x04"
+			  "\x5b\xd9\x0a\xd2\x03\x82\x20\x21"
+			  "\xf1\xf0\x23\x0d\x5c\xec\x0b\xb6"
+			  "\x64\xfb\x2c\xdf\xf1\xa0\x7b\x12"
+			  "\x3e\x58\x79\xae\x76\x67\x76\x89"
+			  "\x49\x11\xc4\xbf\x0b\x57\x43\x60"
+			  "\xf5\x3d\xad\xac\x20\x78\x29\xad"
+			  "\xb3\x6a\x92\x8a\xc2\x16\x25\x60"
+			  "\x25\xe5\x80\x3d\x5b\x64\xdc\xe6"
+			  "\xfb\xf3\xb2\xcc\xe3\x8c\xf2\x4b"
+			  "\x7a\xb2\x58\x26\xe0\x2c\xde\x52"
+			  "\xdd\xb5\x07\x48\xad\xe5\xe4\xc9"
+			  "\x0e\x42\xb6\xd1\xf5\x17\xc0\xe4"
+			  "\x57\x3c\x1c\x1c\x7d\xb2\x50\xc0"
+			  "\x75\x38\x72\x5d\x4c\xbc\xe4\xe9"
+			  "\x0b",
+		.output	= "This document describes a compression method based on the DEFLATE"
+			"compression algorithm.  This document defines the application of "
+			"the DEFLATE algorithm to the IP Payload Compression Protocol.",
+	}, {
+		.inlen	= 46,
+		.outlen	= 70,
+		.input	= "\x85\xca\xc1\x09\x00\x20\x08\x05"
+			  "\xd0\x55\xfe\x3c\x6e\x21\x64\xd8"
+			  "\x45\x21\x0d\xd7\xb7\x26\xe8\xf8"
+			  "\xe0\x91\x2f\xc3\x09\x98\x17\xd8"
+			  "\x06\x42\x79\x0b\x52\x05\xe1\x33"
+			  "\xeb\x81\x3e\xe5\xa2\x01",
+		.output	= "Join us now and share the software "
+			"Join us now and share the software ",
+	},
+};
+
 /*
  * LZO test vectors (null-terminated strings).
  */
diff --git a/drivers/crypto/intel/iaa/Makefile b/drivers/crypto/intel/iaa/Makefile
index ebfa1a425f808..96f22cd39924a 100644
--- a/drivers/crypto/intel/iaa/Makefile
+++ b/drivers/crypto/intel/iaa/Makefile
@@ -7,6 +7,6 @@ ccflags-y += -I $(srctree)/drivers/dma/idxd -DDEFAULT_SYMBOL_NAMESPACE='"CRYPTO_
 
 obj-$(CONFIG_CRYPTO_DEV_IAA_CRYPTO) := iaa_crypto.o
 
-iaa_crypto-y := iaa_crypto_main.o iaa_crypto_comp_fixed.o
+iaa_crypto-y := iaa_crypto_main.o iaa_crypto_comp_fixed.o iaa_crypto_comp_dynamic.o
 
 iaa_crypto-$(CONFIG_CRYPTO_DEV_IAA_CRYPTO_STATS) += iaa_crypto_stats.o
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 3086bf18126e5..60c96a83e0ae4 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -19,12 +19,15 @@
 
 #define IAA_COMP_FLUSH_OUTPUT		BIT(1)
 #define IAA_COMP_APPEND_EOB		BIT(2)
+#define IAA_COMP_GEN_HDR_1_PASS		(BIT(12) | BIT(13))
 
 #define IAA_COMPLETION_TIMEOUT		1000000
 
 #define IAA_ALLOC_DESC_COMP_TIMEOUT	   1000
 #define IAA_ALLOC_DESC_DECOMP_TIMEOUT	    500
 
+#define IAA_DYN_ALLOC_DESC_COMP_TIMEOUT	   2000
+
 #define IAA_ANALYTICS_ERROR		0x0a
 #define IAA_ERROR_DECOMP_BUF_OVERFLOW	0x0b
 #define IAA_ERROR_COMP_BUF_OVERFLOW	0x19
@@ -133,6 +136,8 @@ struct aecs_comp_table_record {
 
 int iaa_aecs_init_fixed(void);
 void iaa_aecs_cleanup_fixed(void);
+int iaa_aecs_init_dynamic(void);
+void iaa_aecs_cleanup_dynamic(void);
 
 typedef int (*iaa_dev_comp_init_fn_t) (struct iaa_device_compression_mode *mode);
 typedef int (*iaa_dev_comp_free_fn_t) (struct iaa_device_compression_mode *mode);
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c b/drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c
new file mode 100644
index 0000000000000..3a93d79134431
--- /dev/null
+++ b/drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2023 Intel Corporation. All rights rsvd. */
+
+#include "idxd.h"
+#include "iaa_crypto.h"
+
+int iaa_aecs_init_dynamic(void)
+{
+	int ret;
+
+	ret = add_iaa_compression_mode("dynamic", NULL, 0, NULL, 0, NULL, NULL);
+
+	if (!ret)
+		pr_debug("IAA dynamic compression mode initialized\n");
+
+	return ret;
+}
+
+void iaa_aecs_cleanup_dynamic(void)
+{
+	remove_iaa_compression_mode("dynamic");
+}
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 2f25e02ca0aa3..480e12c1d77a5 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -102,10 +102,12 @@ DEFINE_MUTEX(first_wq_found_lock);
 
 const char *iaa_compression_mode_names[IAA_COMP_MODES_MAX] = {
 	"fixed",
+	"dynamic",
 };
 
 const char *iaa_compression_alg_names[IAA_COMP_MODES_MAX] = {
 	"deflate-iaa",
+	"deflate-iaa-dynamic",
 };
 
 static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
@@ -1482,6 +1484,23 @@ static int deflate_generic_decompress(struct iaa_req *req)
 	return ret;
 }
 
+static int deflate_generic_compress(struct iaa_req *req)
+{
+	ACOMP_REQUEST_ON_STACK(fbreq, deflate_crypto_acomp);
+	int ret;
+
+	acomp_request_set_callback(fbreq, 0, NULL, NULL);
+	acomp_request_set_params(fbreq, req->src, req->dst, req->slen,
+				 PAGE_SIZE);
+
+	ret = crypto_acomp_compress(fbreq);
+	req->dlen = fbreq->dlen;
+
+	update_total_sw_comp_calls();
+
+	return ret;
+}
+
 static __always_inline void acomp_to_iaa(struct acomp_req *areq,
 					 struct iaa_req *req,
 					 struct iaa_compression_ctx *ctx)
@@ -1807,9 +1826,13 @@ iaa_setup_compress_hw_desc(struct idxd_desc *idxd_desc,
 	desc->src1_size = slen;
 	desc->dst_addr = (u64)dst_addr;
 	desc->max_dst_size = dlen;
-	desc->flags |= IDXD_OP_FLAG_RD_SRC2_AECS;
-	desc->src2_addr = active_compression_mode->aecs_comp_table_dma_addr;
-	desc->src2_size = sizeof(struct aecs_comp_table_record);
+	if (mode == IAA_MODE_DYNAMIC) {
+		desc->compr_flags |= IAA_COMP_GEN_HDR_1_PASS;
+	} else {
+		desc->flags |= IDXD_OP_FLAG_RD_SRC2_AECS;
+		desc->src2_addr = active_compression_mode->aecs_comp_table_dma_addr;
+		desc->src2_size = sizeof(struct aecs_comp_table_record);
+	}
 	desc->completion_addr = idxd_desc->compl_dma;
 
 	return desc;
@@ -2063,6 +2086,9 @@ static int iaa_comp_acompress(struct iaa_compression_ctx *ctx, struct iaa_req *r
 		return -EINVAL;
 	}
 
+	if (ctx->mode == IAA_MODE_DYNAMIC && req->slen > PAGE_SIZE)
+		return deflate_generic_compress(req);
+
 	cpu = get_cpu();
 	wq = comp_wq_table_next_wq(cpu);
 	put_cpu();
@@ -2539,7 +2565,9 @@ static int iaa_comp_adecompress_batch(
 static void compression_ctx_init(struct iaa_compression_ctx *ctx, enum iaa_mode mode)
 {
 	ctx->mode = mode;
-	ctx->alloc_comp_desc_timeout = IAA_ALLOC_DESC_COMP_TIMEOUT;
+	ctx->alloc_comp_desc_timeout = (mode == IAA_MODE_DYNAMIC ?
+					IAA_DYN_ALLOC_DESC_COMP_TIMEOUT :
+					IAA_ALLOC_DESC_COMP_TIMEOUT);
 	ctx->alloc_decomp_desc_timeout = IAA_ALLOC_DESC_DECOMP_TIMEOUT;
 	ctx->verify_compress = iaa_verify_compress;
 	ctx->async_mode = async_mode;
@@ -2768,6 +2796,30 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
 	}
 };
 
+static int iaa_comp_init_dynamic(struct crypto_acomp *acomp_tfm)
+{
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
+	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+
+	ctx = iaa_ctx[IAA_MODE_DYNAMIC];
+
+	return 0;
+}
+
+static struct acomp_alg iaa_acomp_dynamic_deflate = {
+	.init			= iaa_comp_init_dynamic,
+	.compress		= iaa_comp_acompress_main,
+	.decompress		= iaa_comp_adecompress_main,
+	.base			= {
+		.cra_name		= "deflate",
+		.cra_driver_name	= "deflate-iaa-dynamic",
+		.cra_flags		= CRYPTO_ALG_ASYNC,
+		.cra_ctxsize		= sizeof(struct iaa_compression_ctx),
+		.cra_module		= THIS_MODULE,
+		.cra_priority		= IAA_ALG_PRIORITY + 1,
+	}
+};
+
 /*******************************************
  * Implement idxd_device_driver interfaces.
  *******************************************/
@@ -2787,7 +2839,7 @@ static void iaa_unregister_compression_device(void)
 	num_iaa_modes_registered = 0;
 }
 
-static int iaa_register_compression_device(void)
+static int iaa_register_compression_device(struct idxd_device *idxd)
 {
 	struct iaa_compression_mode *mode;
 	int i, idx;
@@ -2796,6 +2848,13 @@ static int iaa_register_compression_device(void)
 		iaa_mode_registered[i] = false;
 		mode = find_iaa_compression_mode(iaa_compression_mode_names[i], &idx);
 		if (mode) {
+			/* Header Generation Capability is required for the dynamic algorithm. */
+			if ((!strcmp(mode->name, "dynamic")) && !idxd->hw.iaa_cap.header_gen) {
+				if (num_iaa_modes_registered > 0)
+					--num_iaa_modes_registered;
+				continue;
+			}
+
 			iaa_ctx[i] = kmalloc(sizeof(struct iaa_compression_ctx), GFP_KERNEL);
 			if (!iaa_ctx[i])
 				goto err;
@@ -2813,7 +2872,7 @@ static int iaa_register_compression_device(void)
 	return -ENODEV;
 }
 
-static int iaa_register_acomp_compression_device(void)
+static int iaa_register_acomp_compression_device(struct idxd_device *idxd)
 {
 	int ret = -ENOMEM;
 
@@ -2827,8 +2886,19 @@ static int iaa_register_acomp_compression_device(void)
 		goto err_fixed;
 	}
 
+	if (iaa_mode_registered[IAA_MODE_DYNAMIC]) {
+		ret = crypto_register_acomp(&iaa_acomp_dynamic_deflate);
+		if (ret) {
+			pr_err("deflate algorithm acomp dynamic registration failed (%d)\n", ret);
+			goto err_dynamic;
+		}
+	}
+
 	return 0;
 
+err_dynamic:
+	crypto_unregister_acomp(&iaa_acomp_fixed_deflate);
+
 err_fixed:
 	if (!IS_ERR_OR_NULL(deflate_crypto_acomp)) {
 		crypto_free_acomp(deflate_crypto_acomp);
@@ -2847,6 +2917,9 @@ static void iaa_unregister_acomp_compression_device(void)
 	if (iaa_mode_registered[IAA_MODE_FIXED])
 		crypto_unregister_acomp(&iaa_acomp_fixed_deflate);
 
+	if (iaa_mode_registered[IAA_MODE_DYNAMIC])
+		crypto_unregister_acomp(&iaa_acomp_dynamic_deflate);
+
 	if (!IS_ERR_OR_NULL(deflate_crypto_acomp)) {
 		crypto_free_acomp(deflate_crypto_acomp);
 		deflate_crypto_acomp = NULL;
@@ -2914,13 +2987,13 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 	atomic_set(&iaa_crypto_enabled, 1);
 
 	if (first_wq) {
-		ret = iaa_register_compression_device();
+		ret = iaa_register_compression_device(idxd);
 		if (ret != 0) {
 			dev_dbg(dev, "IAA compression device registration failed\n");
 			goto err_register;
 		}
 
-		ret = iaa_register_acomp_compression_device();
+		ret = iaa_register_acomp_compression_device(idxd);
 		if (ret != 0) {
 			dev_dbg(dev, "IAA compression device acomp registration failed\n");
 			goto err_register;
@@ -3071,6 +3144,12 @@ static int __init iaa_crypto_init_module(void)
 		goto err_aecs_init;
 	}
 
+	ret = iaa_aecs_init_dynamic();
+	if (ret < 0) {
+		pr_debug("IAA dynamic compression mode init failed\n");
+		goto err_dynamic;
+	}
+
 	ret = idxd_driver_register(&iaa_crypto_driver);
 	if (ret) {
 		pr_debug("IAA wq sub-driver registration failed\n");
@@ -3168,6 +3247,8 @@ static int __init iaa_crypto_init_module(void)
 err_g_comp_wqs_per_iaa_attr_create:
 	idxd_driver_unregister(&iaa_crypto_driver);
 err_driver_reg:
+	iaa_aecs_cleanup_dynamic();
+err_dynamic:
 	iaa_aecs_cleanup_fixed();
 err_aecs_init:
 
@@ -3192,6 +3273,7 @@ static void __exit iaa_crypto_cleanup_module(void)
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_g_comp_wqs_per_iaa);
 	idxd_driver_unregister(&iaa_crypto_driver);
+	iaa_aecs_cleanup_dynamic();
 	iaa_aecs_cleanup_fixed();
 
 	pr_debug("cleaned up\n");
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_stats.c b/drivers/crypto/intel/iaa/iaa_crypto_stats.c
index f5cc3d29ca19e..42aae8a738ac1 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_stats.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_stats.c
@@ -19,6 +19,7 @@
 
 static atomic64_t total_comp_calls;
 static atomic64_t total_decomp_calls;
+static atomic64_t total_sw_comp_calls;
 static atomic64_t total_sw_decomp_calls;
 static atomic64_t total_comp_bytes_out;
 static atomic64_t total_decomp_bytes_in;
@@ -43,6 +44,11 @@ void update_total_decomp_calls(void)
 	atomic64_inc(&total_decomp_calls);
 }
 
+void update_total_sw_comp_calls(void)
+{
+	atomic64_inc(&total_sw_comp_calls);
+}
+
 void update_total_sw_decomp_calls(void)
 {
 	atomic64_inc(&total_sw_decomp_calls);
@@ -174,6 +180,8 @@ static int global_stats_show(struct seq_file *m, void *v)
 		   atomic64_read(&total_comp_calls));
 	seq_printf(m, "  total_decomp_calls: %llu\n",
 		   atomic64_read(&total_decomp_calls));
+	seq_printf(m, "  total_sw_comp_calls: %llu\n",
+		   atomic64_read(&total_sw_comp_calls));
 	seq_printf(m, "  total_sw_decomp_calls: %llu\n",
 		   atomic64_read(&total_sw_decomp_calls));
 	seq_printf(m, "  total_comp_bytes_out: %llu\n",
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_stats.h b/drivers/crypto/intel/iaa/iaa_crypto_stats.h
index 3787a5f507eb2..6e0c6f9939bfa 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_stats.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto_stats.h
@@ -11,6 +11,7 @@ void	iaa_crypto_debugfs_cleanup(void);
 void	update_total_comp_calls(void);
 void	update_total_comp_bytes_out(int n);
 void	update_total_decomp_calls(void);
+void	update_total_sw_comp_calls(void);
 void	update_total_sw_decomp_calls(void);
 void	update_total_decomp_bytes_in(int n);
 void	update_completion_einval_errs(void);
@@ -29,6 +30,7 @@ static inline void	iaa_crypto_debugfs_cleanup(void) {}
 static inline void	update_total_comp_calls(void) {}
 static inline void	update_total_comp_bytes_out(int n) {}
 static inline void	update_total_decomp_calls(void) {}
+static inline void	update_total_sw_comp_calls(void) {}
 static inline void	update_total_sw_decomp_calls(void) {}
 static inline void	update_total_decomp_bytes_in(int n) {}
 static inline void	update_completion_einval_errs(void) {}
diff --git a/include/linux/iaa_comp.h b/include/linux/iaa_comp.h
index cbd78f83668d5..97d08702a8ca4 100644
--- a/include/linux/iaa_comp.h
+++ b/include/linux/iaa_comp.h
@@ -12,7 +12,8 @@
 
 enum iaa_mode {
 	IAA_MODE_FIXED = 0,
-	IAA_MODE_NONE = 1,
+	IAA_MODE_DYNAMIC = 1,
+	IAA_MODE_NONE = 2,
 };
 
 struct iaa_req {
@@ -84,7 +85,7 @@ extern int iaa_comp_decompress_batch(
 #else /* CONFIG_CRYPTO_DEV_IAA_CRYPTO */
 
 enum iaa_mode {
-	IAA_MODE_NONE = 1,
+	IAA_MODE_NONE = 2,
 };
 
 struct iaa_req {};
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 18/24] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (16 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 17/24] crypto: iaa - Add deflate-iaa-dynamic compression mode Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-15  5:28   ` Herbert Xu
  2025-08-01  4:36 ` [PATCH v11 19/24] crypto: iaa - IAA acomp_algs register the get_batch_size() interface Kanchana P Sridhar
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This commit adds a get_batch_size() interface to:

  struct acomp_alg
  struct crypto_acomp

A crypto_acomp compression algorithm that supports batching of compressions
and decompressions must implement this API to return the maximum batch-size
that the compressor supports. Kernel users of crypto_acomp, such as zswap,
can then allocate resources for submitting multiple compress/decompress jobs
that can be batched, and invoke batching of [de]compressions.

A new helper function acomp_has_async_batching() can be invoked to query
if a crypto_acomp implements get_batch_size().

The new crypto_acomp_batch_size() API uses this helper function to return
the batch-size for compressors that implement get_batch_size(). If no
implementation is provided by the crypto_acomp, a default of "1" is
returned for the batch-size.

zswap can invoke crypto_acomp_batch_size() to query the maximum number
of requests that can be batch [de]compressed. Based on this, zswap can
allocate batching resources using the minimum of its own upper limit on
batch-size and the compressor's max batch-size.
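
For illustration, here is a minimal sketch (not part of this patch) of how
a kernel user might size its batching resources; "nid" is an assumed NUMA
node id, and ZSWAP_MAX_BATCH_SIZE stands in for a caller-defined cap (zswap
adds such a constant later in this series):

  struct crypto_acomp *acomp;
  unsigned int nr_buffers;

  acomp = crypto_alloc_acomp_node("deflate-iaa", 0, 0, nid);
  if (IS_ERR(acomp))
          return PTR_ERR(acomp);

  /* crypto_acomp_batch_size() returns 1 if get_batch_size() is absent. */
  nr_buffers = min(ZSWAP_MAX_BATCH_SIZE, crypto_acomp_batch_size(acomp));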

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 crypto/acompress.c                  |  1 +
 include/crypto/acompress.h          | 27 +++++++++++++++++++++++++++
 include/crypto/internal/acompress.h |  3 +++
 3 files changed, 31 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index be28cbfd22e32..f440724719655 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -105,6 +105,7 @@ static int crypto_acomp_init_tfm(struct crypto_tfm *tfm)
 
 	acomp->compress = alg->compress;
 	acomp->decompress = alg->decompress;
+	acomp->get_batch_size = alg->get_batch_size;
 	acomp->reqsize = alg->base.cra_reqsize;
 
 	acomp->base.exit = crypto_acomp_exit_tfm;
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 0312322d2ca03..898104745cd24 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -108,6 +108,8 @@ struct acomp_req {
  *
  * @compress:		Function performs a compress operation
  * @decompress:		Function performs a de-compress operation
+ * @get_batch_size:	Maximum batch-size for batching compress/decompress
+ *			operations.
  * @reqsize:		Context size for (de)compression requests
  * @fb:			Synchronous fallback tfm
  * @base:		Common crypto API algorithm data structure
@@ -115,6 +117,7 @@ struct acomp_req {
 struct crypto_acomp {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	unsigned int (*get_batch_size)(void);
 	unsigned int reqsize;
 	struct crypto_tfm base;
 };
@@ -205,6 +208,13 @@ static inline bool acomp_is_async(struct crypto_acomp *tfm)
 	       CRYPTO_ALG_ASYNC;
 }
 
+static inline bool acomp_has_async_batching(struct crypto_acomp *tfm)
+{
+	return (acomp_is_async(tfm) &&
+		(crypto_comp_alg_common(tfm)->base.cra_flags & CRYPTO_ALG_TYPE_ACOMPRESS) &&
+		tfm->get_batch_size);
+}
+
 static inline struct crypto_acomp *crypto_acomp_reqtfm(struct acomp_req *req)
 {
 	return __crypto_acomp_tfm(req->base.tfm);
@@ -545,6 +555,23 @@ int crypto_acomp_compress(struct acomp_req *req);
  */
 int crypto_acomp_decompress(struct acomp_req *req);
 
+/**
+ * crypto_acomp_batch_size() -- Get the algorithm's batch size
+ *
+ * Function returns the algorithm's batch size for batching operations
+ *
+ * @tfm:	ACOMPRESS tfm handle allocated with crypto_alloc_acomp()
+ *
+ * Return:	crypto_acomp's batch size.
+ */
+static inline unsigned int crypto_acomp_batch_size(struct crypto_acomp *tfm)
+{
+	if (acomp_has_async_batching(tfm))
+		return tfm->get_batch_size();
+
+	return 1;
+}
+
 static inline struct acomp_req *acomp_request_on_stack_init(
 	char *buf, struct crypto_acomp *tfm)
 {
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index ffffd88bbbad3..2325ee18e7a10 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -28,6 +28,8 @@
  *
  * @compress:	Function performs a compress operation
  * @decompress:	Function performs a de-compress operation
+ * @get_batch_size:	Maximum batch-size for batching compress/decompress
+ *			operations.
  * @init:	Initialize the cryptographic transformation object.
  *		This function is used to initialize the cryptographic
  *		transformation object. This function is called only once at
@@ -46,6 +48,7 @@
 struct acomp_alg {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	unsigned int (*get_batch_size)(void);
 	int (*init)(struct crypto_acomp *tfm);
 	void (*exit)(struct crypto_acomp *tfm);
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 19/24] crypto: iaa - IAA acomp_algs register the get_batch_size() interface.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (17 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 18/24] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-29  0:16   ` Barry Song
  2025-08-01  4:36 ` [PATCH v11 20/24] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion Kanchana P Sridhar
                   ` (5 subsequent siblings)
  24 siblings, 1 reply; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

The Fixed ("deflate-iaa") and Dynamic ("deflate-iaa-dynamic") IAA
acomp_algs register an implementation for get_batch_size(). zswap can
query crypto_acomp_batch_size() to get the maximum number of requests
that can be batch [de]compressed. zswap can use the minimum of this and
any zswap-specific upper limit on batch-size to allocate batching
resources.

This enables zswap to compress/decompress pages in parallel in the IAA
hardware accelerator to improve swapout/swapin performance and memory
savings.
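
For reference, a minimal sketch of what the registered callback might look
like (the body of iaa_comp_get_max_batch_size() is not shown in this hunk;
IAA_CRYPTO_MAX_BATCH_SIZE is an assumed driver-internal constant):

  static unsigned int iaa_comp_get_max_batch_size(void)
  {
          return IAA_CRYPTO_MAX_BATCH_SIZE;
  }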

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 480e12c1d77a5..b7c6fc334dae7 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -2785,6 +2785,7 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
 	.init			= iaa_comp_init_fixed,
 	.compress		= iaa_comp_acompress_main,
 	.decompress		= iaa_comp_adecompress_main,
+	.get_batch_size		= iaa_comp_get_max_batch_size,
 	.base			= {
 		.cra_name		= "deflate",
 		.cra_driver_name	= "deflate-iaa",
@@ -2810,6 +2811,7 @@ static struct acomp_alg iaa_acomp_dynamic_deflate = {
 	.init			= iaa_comp_init_dynamic,
 	.compress		= iaa_comp_acompress_main,
 	.decompress		= iaa_comp_adecompress_main,
+	.get_batch_size		= iaa_comp_get_max_batch_size,
 	.base			= {
 		.cra_name		= "deflate",
 		.cra_driver_name	= "deflate-iaa-dynamic",
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 20/24] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (18 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 19/24] crypto: iaa - IAA acomp_algs register the get_batch_size() interface Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 21/24] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources Kanchana P Sridhar
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch simplifies the zswap_pool's per-CPU acomp_ctx resource
management. Similar to the per-CPU acomp_ctx itself, the lifetime of the
per-CPU acomp_ctx's resources (acomp, req, buffer) will also be from pool
creation to pool deletion. These resources will persist through CPU
hotplug operations. The zswap_cpu_comp_dead() teardown callback has been
deleted from the call to
cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE). As a result, CPU
offline hotplug operations will be no-ops as far as the acomp_ctx
resources are concerned.

This commit refactors the code from zswap_cpu_comp_dead() into a
new function acomp_ctx_dealloc() that preserves the IS_ERR_OR_NULL()
checks on acomp_ctx, req and acomp from the existing mainline
implementation of zswap_cpu_comp_dead(). acomp_ctx_dealloc() is called
to clean up acomp_ctx resources from all these procedures:

1) zswap_cpu_comp_prepare() when an error is encountered,
2) zswap_pool_create() when an error is encountered, and
3) from zswap_pool_destroy().

The main benefit of using the CPU hotplug multi state instance startup
callback to allocate the acomp_ctx resources is that it prevents the
cores from being offlined until the multi state instance addition call
returns.

  From Documentation/core-api/cpu_hotplug.rst:

    "The node list add/remove operations and the callback invocations are
     serialized against CPU hotplug operations."

Furthermore, zswap_[de]compress() cannot contend with
zswap_cpu_comp_prepare() because:

  - During pool creation/deletion, the pool is not in the zswap_pools
    list.

  - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed
    out. zswap_cpu_comp_prepare() will be executed on a control CPU,
    since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of "enum
    cpuhp_state". Thanks Yosry for sharing this observation!

  In both these cases, any recursions into zswap reclaim from
  zswap_cpu_comp_prepare() will be handled by the old pool.

The above two observations enable the following simplifications:

 1) zswap_cpu_comp_prepare(): CPU cannot be offlined. Reclaim cannot use
    the pool. Considerations for mutex init/locking and handling
    subsequent CPU hotplug online-offlines:

    Should we lock the mutex of the current CPU's acomp_ctx from start to
    end? It doesn't seem like this is required. The CPU hotplug multi state
    instance add/remove operations and callback invocations acquire a
    "cpuhp_state_mutex" before proceeding, hence zswap_cpu_comp_prepare()
    is serialized against CPU hotplug operations.

    If the process gets migrated while zswap_cpu_comp_prepare() is
    running, it will complete on the new CPU. In case of failures, we
    pass the acomp_ctx pointer obtained at the start of
    zswap_cpu_comp_prepare() to acomp_ctx_dealloc(), which, again, can at
    most be migrated to another CPU. There appear to be no contention scenarios
    that might cause inconsistent values of acomp_ctx's members. Hence,
    it seems there is no need for mutex_lock(&acomp_ctx->mutex) in
    zswap_cpu_comp_prepare().

    Since the pool is not yet on zswap_pools list, we don't need to
    initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This
    has been restored to occur in zswap_cpu_comp_prepare().

    zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is
    valid. If so, it returns success. This should handle any CPU
    hotplug online-offline transitions after pool creation is done.

 2) CPU offline vis-a-vis zswap ops: Let's suppose the process is
    migrated to another CPU before the current CPU is dysfunctional. If
    zswap_[de]compress() holds the acomp_ctx->mutex lock of the offlined
    CPU, that mutex will be released once it completes on the new
    CPU. Since there is no teardown callback, there is no possibility of
    UAF.

 3) Pool creation/deletion and process migration to another CPU:

    - During pool creation/deletion, the pool is not in the zswap_pools
      list. Hence it cannot contend with zswap ops on that CPU. However,
      the process can get migrated.

      Pool creation --> zswap_cpu_comp_prepare()
                                --> process migrated:
                                    * CPU offline: no-op.
                                    * zswap_cpu_comp_prepare() continues
                                      to run on the new CPU to finish
                                      allocating acomp_ctx resources for
                                      the offlined CPU.

      Pool deletion --> acomp_ctx_dealloc()
                                --> process migrated:
                                    * CPU offline: no-op.
                                    * acomp_ctx_dealloc() continues
                                      to run on the new CPU to finish
                                      de-allocating acomp_ctx resources
                                      for the offlined CPU.

 4) Pool deletion vis-a-vis CPU onlining:
    To prevent the possibility of race conditions between
    acomp_ctx_dealloc() freeing the acomp_ctx resources and the initial
    check for a valid acomp_ctx->acomp in zswap_cpu_comp_prepare(), we
    need to delete the multi state instance right after it is added, in
    zswap_pool_create().

 Summary of changes based on the above:
 --------------------------------------
 1) Zero-initialization of pool->acomp_ctx in zswap_pool_create() to
    simplify and share common code for different error handling/cleanup
    related to the acomp_ctx.

 2) Remove the node list instance right after node list add function
    call in zswap_pool_create(). This prevents race conditions between
    CPU onlining after initial pool creation, and acomp_ctx_dealloc()
    freeing the acomp_ctx resources.

 3) zswap_pool_destroy() will call acomp_ctx_dealloc() to de-allocate
    the per-CPU acomp_ctx resources.

 4) Changes to zswap_cpu_comp_prepare():

    a) Check if acomp_ctx->acomp is valid at the beginning and return,
       because the acomp_ctx is already initialized.
    b) Move the mutex_init to happen in this procedure, before it
       returns.
    c) All error conditions handled by calling acomp_ctx_dealloc().

 5) New procedure acomp_ctx_dealloc() for common error/cleanup code.

 6) No more multi state instance teardown callback. CPU offlining is a
    no-op as far as acomp_ctx resources are concerned.

 7) Delete acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock(). Directly
    call mutex_lock(&acomp_ctx->mutex)/mutex_unlock(&acomp_ctx->mutex)
    in zswap_[de]compress().

The per-CPU memory cost of not deleting the acomp_ctx resources upon CPU
offlining, and only deleting them when the pool is destroyed, is as
follows, on x86_64:

    IAA with 8 dst buffers for batching:    64.34 KB
    Software compressors with 1 dst buffer:  8.28 KB
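
    As a rough breakdown (assuming 4 KiB pages and the 2*PAGE_SIZE dst
    buffers allocated per CPU): 8 buffers * 8 KiB = 64 KiB for IAA, and
    1 buffer * 8 KiB = 8 KiB for software compressors; the remaining
    ~0.3 KB in each case is the acomp tfm, request and acomp_ctx
    bookkeeping.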

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 195 ++++++++++++++++++++++++++---------------------------
 1 file changed, 94 insertions(+), 101 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 3c0fd8a137182..7970bd67f0109 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -248,6 +248,30 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 **********************************/
 static void __zswap_pool_empty(struct percpu_ref *ref);
 
+/*
+ * The per-cpu pool->acomp_ctx is zero-initialized on allocation. This makes
+ * it easy for different error conditions/cleanup related to the acomp_ctx
+ * to be handled by acomp_ctx_dealloc():
+ * - Errors during zswap_cpu_comp_prepare().
+ * - Partial success/error of cpuhp_state_add_instance() call in
+ *   zswap_pool_create(). Only some cores could have executed
+ *   zswap_cpu_comp_prepare(), not others.
+ * - Cleanup acomp_ctx resources on all cores in zswap_pool_destroy().
+ */
+static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
+{
+	if (IS_ERR_OR_NULL(acomp_ctx))
+		return;
+
+	if (!IS_ERR_OR_NULL(acomp_ctx->req))
+		acomp_request_free(acomp_ctx->req);
+
+	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+		crypto_free_acomp(acomp_ctx->acomp);
+
+	kfree(acomp_ctx->buffer);
+}
+
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 {
 	struct zswap_pool *pool;
@@ -281,19 +305,43 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 
 	strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
 
-	pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
+	/* Many things rely on the zero-initialization. */
+	pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
+					   GFP_KERNEL | __GFP_ZERO);
 	if (!pool->acomp_ctx) {
 		pr_err("percpu alloc failed\n");
 		goto error;
 	}
 
-	for_each_possible_cpu(cpu)
-		mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
-
+	/*
+	 * This is serialized against CPU hotplug operations. Hence, cores
+	 * cannot be offlined until this finishes.
+	 * In case of errors, we need to goto "ref_fail" instead of "error"
+	 * because there is no teardown callback registered anymore, for
+	 * cpuhp_state_add_instance() to de-allocate resources as it rolls back
+	 * state on cores before the CPU on which error was encountered.
+	 */
 	ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
 				       &pool->node);
+
+	/*
+	 * We only needed the multi state instance add operation to invoke the
+	 * startup callback for all cores without cores getting offlined. Since
+	 * the acomp_ctx resources will now only be de-allocated when the pool
+	 * is destroyed, we can safely remove the multi state instance. This
+	 * minimizes (but does not eliminate) the possibility of
+	 * zswap_cpu_comp_prepare() being invoked again due to a CPU
+	 * offline-online transition. Removing the instance also prevents race
+	 * conditions between CPU onlining after initial pool creation, and
+	 * acomp_ctx_dealloc() freeing the acomp_ctx resources.
+	 * Note that we delete the instance before checking the error status of
+	 * the node list add operation because we want the instance removal even
+	 * in case of errors in the former.
+	 */
+	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+
 	if (ret)
-		goto error;
+		goto ref_fail;
 
 	/* being the current pool takes 1 ref; this func expects the
 	 * caller to always add the new pool as the current pool
@@ -309,7 +357,8 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 	return pool;
 
 ref_fail:
-	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+	for_each_possible_cpu(cpu)
+		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
 error:
 	if (pool->acomp_ctx)
 		free_percpu(pool->acomp_ctx);
@@ -363,9 +412,13 @@ static struct zswap_pool *__zswap_pool_create_fallback(void)
 
 static void zswap_pool_destroy(struct zswap_pool *pool)
 {
+	int cpu;
+
 	zswap_pool_debug("destroying", pool);
 
-	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+	for_each_possible_cpu(cpu)
+		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
+
 	free_percpu(pool->acomp_ctx);
 
 	zpool_destroy_pool(pool->zpool);
@@ -822,39 +875,39 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 {
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
-	struct crypto_acomp *acomp = NULL;
-	struct acomp_req *req = NULL;
-	u8 *buffer = NULL;
-	int ret;
+	int ret = -ENOMEM;
 
-	buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-	if (!buffer) {
-		ret = -ENOMEM;
-		goto fail;
-	}
+	/*
+	 * The per-CPU pool->acomp_ctx is zero-initialized on allocation.
+	 * Even though we delete the multi state instance right after successful
+	 * addition of the instance in zswap_pool_create(), we cannot eliminate
+	 * the possibility of the CPU going through offline-online transitions.
+	 * If this does happen, we check if the acomp_ctx has already been
+	 * initialized, and return.
+	 */
+	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+		return 0;
 
-	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
-	if (IS_ERR(acomp)) {
+	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->buffer)
+		return ret;
+
+	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+	if (IS_ERR(acomp_ctx->acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
-				pool->tfm_name, PTR_ERR(acomp));
-		ret = PTR_ERR(acomp);
+				pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
+		ret = PTR_ERR(acomp_ctx->acomp);
 		goto fail;
 	}
+	acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
 
-	req = acomp_request_alloc(acomp);
-	if (!req) {
+	acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
+	if (!acomp_ctx->req) {
 		pr_err("could not alloc crypto acomp_request %s\n",
 		       pool->tfm_name);
-		ret = -ENOMEM;
 		goto fail;
 	}
 
-	/*
-	 * Only hold the mutex after completing allocations, otherwise we may
-	 * recurse into zswap through reclaim and attempt to hold the mutex
-	 * again resulting in a deadlock.
-	 */
-	mutex_lock(&acomp_ctx->mutex);
 	crypto_init_wait(&acomp_ctx->wait);
 
 	/*
@@ -862,81 +915,17 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
 	 * won't be called, crypto_wait_req() will return without blocking.
 	 */
-	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+	acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
-	acomp_ctx->buffer = buffer;
-	acomp_ctx->acomp = acomp;
-	acomp_ctx->is_sleepable = acomp_is_async(acomp);
-	acomp_ctx->req = req;
-	mutex_unlock(&acomp_ctx->mutex);
+	mutex_init(&acomp_ctx->mutex);
 	return 0;
 
 fail:
-	if (acomp)
-		crypto_free_acomp(acomp);
-	kfree(buffer);
+	acomp_ctx_dealloc(acomp_ctx);
 	return ret;
 }
 
-static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
-{
-	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
-	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
-	struct acomp_req *req;
-	struct crypto_acomp *acomp;
-	u8 *buffer;
-
-	if (IS_ERR_OR_NULL(acomp_ctx))
-		return 0;
-
-	mutex_lock(&acomp_ctx->mutex);
-	req = acomp_ctx->req;
-	acomp = acomp_ctx->acomp;
-	buffer = acomp_ctx->buffer;
-	acomp_ctx->req = NULL;
-	acomp_ctx->acomp = NULL;
-	acomp_ctx->buffer = NULL;
-	mutex_unlock(&acomp_ctx->mutex);
-
-	/*
-	 * Do the actual freeing after releasing the mutex to avoid subtle
-	 * locking dependencies causing deadlocks.
-	 */
-	if (!IS_ERR_OR_NULL(req))
-		acomp_request_free(req);
-	if (!IS_ERR_OR_NULL(acomp))
-		crypto_free_acomp(acomp);
-	kfree(buffer);
-
-	return 0;
-}
-
-static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
-{
-	struct crypto_acomp_ctx *acomp_ctx;
-
-	for (;;) {
-		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
-		mutex_lock(&acomp_ctx->mutex);
-		if (likely(acomp_ctx->req))
-			return acomp_ctx;
-		/*
-		 * It is possible that we were migrated to a different CPU after
-		 * getting the per-CPU ctx but before the mutex was acquired. If
-		 * the old CPU got offlined, zswap_cpu_comp_dead() could have
-		 * already freed ctx->req (among other things) and set it to
-		 * NULL. Just try again on the new CPU that we ended up on.
-		 */
-		mutex_unlock(&acomp_ctx->mutex);
-	}
-}
-
-static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
-{
-	mutex_unlock(&acomp_ctx->mutex);
-}
-
 static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 			   struct zswap_pool *pool)
 {
@@ -949,7 +938,10 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	gfp_t gfp;
 	u8 *dst;
 
-	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+
+	mutex_lock(&acomp_ctx->mutex);
+
 	dst = acomp_ctx->buffer;
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
@@ -997,7 +989,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	else if (alloc_ret)
 		zswap_reject_alloc_fail++;
 
-	acomp_ctx_put_unlock(acomp_ctx);
+	mutex_unlock(&acomp_ctx->mutex);
 	return comp_ret == 0 && alloc_ret == 0;
 }
 
@@ -1009,7 +1001,8 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	int decomp_ret, dlen;
 	u8 *src, *obj;
 
-	acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
+	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+	mutex_lock(&acomp_ctx->mutex);
 	obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
 
 	/*
@@ -1033,7 +1026,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	dlen = acomp_ctx->req->dlen;
 
 	zpool_obj_read_end(zpool, entry->handle, obj);
-	acomp_ctx_put_unlock(acomp_ctx);
+	mutex_unlock(&acomp_ctx->mutex);
 
 	if (!decomp_ret && dlen == PAGE_SIZE)
 		return true;
@@ -1846,7 +1839,7 @@ static int zswap_setup(void)
 	ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
 				      "mm/zswap_pool:prepare",
 				      zswap_cpu_comp_prepare,
-				      zswap_cpu_comp_dead);
+				      NULL);
 	if (ret)
 		goto hp_fail;
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 21/24] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (19 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 20/24] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-01  4:36 ` [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch uses IS_ERR_OR_NULL() in zswap_cpu_comp_prepare() to check
for valid acomp/req, thereby making it consistent with
acomp_ctx_dealloc().

This is based on this earlier comment [1] from Yosry, when reviewing v8.

[1] https://patchwork.kernel.org/comment/26282128/

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 7970bd67f0109..efd501a7fe294 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -893,7 +893,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 		return ret;
 
 	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
-	if (IS_ERR(acomp_ctx->acomp)) {
+	if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
 				pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
 		ret = PTR_ERR(acomp_ctx->acomp);
@@ -902,7 +902,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
 
 	acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
-	if (!acomp_ctx->req) {
+	if (IS_ERR_OR_NULL(acomp_ctx->req)) {
 		pr_err("could not alloc crypto acomp_request %s\n",
 		       pool->tfm_name);
 		goto fail;
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (20 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 21/24] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-14 20:58   ` Nhat Pham
  2025-08-26  3:48   ` Barry Song
  2025-08-01  4:36 ` [PATCH v11 23/24] mm: zswap: zswap_store() will process a large folio in batches Kanchana P Sridhar
                   ` (2 subsequent siblings)
  24 siblings, 2 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch sets up zswap for allocating per-CPU resources optimally for
non-batching and batching compressors.

A new ZSWAP_MAX_BATCH_SIZE constant is defined as 8U, to set an upper
limit on the number of pages in large folios that will be batch
compressed.

As per Herbert's comments in [2] in response to the
crypto_acomp_batch_compress() and crypto_acomp_batch_decompress() API
proposed in [1], this series does not create a new crypto_acomp batching
API. Instead, zswap compression batching uses the existing
crypto_acomp_compress() API in combination with the "void *kernel_data"
member added to "struct acomp_req" earlier in this series.

It is up to the compressor to manage multiple requests, as needed, to
accomplish batch parallelism. zswap only needs to allocate the per-CPU
dst buffers according to the batch size supported by the compressor.

A "u8 compr_batch_size" member is added to "struct zswap_pool", as per
Yosry's suggestion. pool->compr_batch_size is set as the minimum of the
compressor's max batch-size and ZSWAP_MAX_BATCH_SIZE. Accordingly,
zswap_cpu_comp_prepare() allocates the necessary compression dst buffers
in the per-CPU acomp_ctx.

Another "u8 batch_size" member is added to "struct zswap_pool" to store
the unit for batching large folio stores: for batching compressors, this
is the pool->compr_batch_size. For non-batching compressors, this is
ZSWAP_MAX_BATCH_SIZE.
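
For example (illustrative values, assuming the IAA compressor advertises a
max batch-size of 8):

  compressor                 compr_batch_size   batch_size
  deflate-iaa[-dynamic]      8                  8
  zstd (no batching)         1                  8 (ZSWAP_MAX_BATCH_SIZE)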

zswap does not use more than one dst buffer yet. Follow-up patches will
actually utilize the multiple acomp_ctx buffers for batch
compression/decompression of multiple pages.

Thus, ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used for
batching. There is a small extra memory overhead of allocating
the acomp_ctx->buffers array for compressors that do not support
batching: On x86_64, the overhead is 1 pointer per-CPU (i.e. 8 bytes).

[1]: https://patchwork.kernel.org/project/linux-mm/patch/20250508194134.28392-11-kanchana.p.sridhar@intel.com/
[2]: https://patchwork.kernel.org/comment/26382610

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 82 +++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 63 insertions(+), 19 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index efd501a7fe294..63a997b999537 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -80,6 +80,9 @@ static bool zswap_pool_reached_full;
 
 #define ZSWAP_PARAM_UNSET ""
 
+/* Limit the batch size to limit per-CPU memory usage for dst buffers. */
+#define ZSWAP_MAX_BATCH_SIZE 8U
+
 static int zswap_setup(void);
 
 /* Enable/disable zswap */
@@ -147,7 +150,7 @@ struct crypto_acomp_ctx {
 	struct crypto_acomp *acomp;
 	struct acomp_req *req;
 	struct crypto_wait wait;
-	u8 *buffer;
+	u8 **buffers;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -166,6 +169,8 @@ struct zswap_pool {
 	struct work_struct release_work;
 	struct hlist_node node;
 	char tfm_name[CRYPTO_MAX_ALG_NAME];
+	u8 compr_batch_size;
+	u8 batch_size;
 };
 
 /* Global LRU lists shared by all zswap pools. */
@@ -258,8 +263,10 @@ static void __zswap_pool_empty(struct percpu_ref *ref);
  *   zswap_cpu_comp_prepare(), not others.
  * - Cleanup acomp_ctx resources on all cores in zswap_pool_destroy().
  */
-static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
+static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
 {
+	u8 i;
+
 	if (IS_ERR_OR_NULL(acomp_ctx))
 		return;
 
@@ -269,7 +276,11 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
 	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 		crypto_free_acomp(acomp_ctx->acomp);
 
-	kfree(acomp_ctx->buffer);
+	if (acomp_ctx->buffers) {
+		for (i = 0; i < nr_buffers; ++i)
+			kfree(acomp_ctx->buffers[i]);
+		kfree(acomp_ctx->buffers);
+	}
 }
 
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
@@ -290,6 +301,7 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 			return NULL;
 	}
 
+	/* Many things rely on the zero-initialization. */
 	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
 	if (!pool)
 		return NULL;
@@ -352,13 +364,28 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 		goto ref_fail;
 	INIT_LIST_HEAD(&pool->list);
 
+	/*
+	 * Set the unit of compress batching for large folios, for quick
+	 * retrieval in the zswap_compress() fast path:
+	 * If the compressor is sequential (@pool->compr_batch_size is 1),
+	 * large folios will be compressed in batches of ZSWAP_MAX_BATCH_SIZE
+	 * pages, where each page in the batch is compressed sequentially.
+	 * We see better performance by processing the folio in batches of
+	 * ZSWAP_MAX_BATCH_SIZE, due to cache locality of working set
+	 * structures.
+	 */
+	pool->batch_size = (pool->compr_batch_size > 1) ?
+				pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
+
 	zswap_pool_debug("created", pool);
 
 	return pool;
 
 ref_fail:
 	for_each_possible_cpu(cpu)
-		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
+		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu),
+				  pool->compr_batch_size);
+
 error:
 	if (pool->acomp_ctx)
 		free_percpu(pool->acomp_ctx);
@@ -417,7 +444,8 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
 	zswap_pool_debug("destroying", pool);
 
 	for_each_possible_cpu(cpu)
-		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
+		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu),
+				  pool->compr_batch_size);
 
 	free_percpu(pool->acomp_ctx);
 
@@ -876,6 +904,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	int ret = -ENOMEM;
+	u8 i;
 
 	/*
 	 * The per-CPU pool->acomp_ctx is zero-initialized on allocation.
@@ -888,10 +917,6 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 		return 0;
 
-	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-	if (!acomp_ctx->buffer)
-		return ret;
-
 	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
 	if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
@@ -904,17 +929,36 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
 	if (IS_ERR_OR_NULL(acomp_ctx->req)) {
 		pr_err("could not alloc crypto acomp_request %s\n",
-		       pool->tfm_name);
+			pool->tfm_name);
 		goto fail;
 	}
 
-	crypto_init_wait(&acomp_ctx->wait);
+	/*
+	 * Allocate up to ZSWAP_MAX_BATCH_SIZE dst buffers if the
+	 * compressor supports batching.
+	 */
+	pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
+				     crypto_acomp_batch_size(acomp_ctx->acomp));
+
+	acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8 *),
+					  GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->buffers)
+		goto fail;
+
+	for (i = 0; i < pool->compr_batch_size; ++i) {
+		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
+						     cpu_to_node(cpu));
+		if (!acomp_ctx->buffers[i])
+			goto fail;
+	}
 
 	/*
 	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
 	 * won't be called, crypto_wait_req() will return without blocking.
 	 */
+	crypto_init_wait(&acomp_ctx->wait);
+
 	acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
@@ -922,7 +966,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	return 0;
 
 fail:
-	acomp_ctx_dealloc(acomp_ctx);
+	acomp_ctx_dealloc(acomp_ctx, pool->compr_batch_size);
 	return ret;
 }
 
@@ -942,7 +986,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 
 	mutex_lock(&acomp_ctx->mutex);
 
-	dst = acomp_ctx->buffer;
+	dst = acomp_ctx->buffers[0];
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
@@ -1003,19 +1047,19 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 
 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
 	mutex_lock(&acomp_ctx->mutex);
-	obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
+	obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffers[0]);
 
 	/*
 	 * zpool_obj_read_begin() might return a kmap address of highmem when
-	 * acomp_ctx->buffer is not used.  However, sg_init_one() does not
-	 * handle highmem addresses, so copy the object to acomp_ctx->buffer.
+	 * acomp_ctx->buffers[0] is not used.  However, sg_init_one() does not
+	 * handle highmem addresses, so copy the object to acomp_ctx->buffers[0].
 	 */
 	if (virt_addr_valid(obj)) {
 		src = obj;
 	} else {
-		WARN_ON_ONCE(obj == acomp_ctx->buffer);
-		memcpy(acomp_ctx->buffer, obj, entry->length);
-		src = acomp_ctx->buffer;
+		WARN_ON_ONCE(obj == acomp_ctx->buffers[0]);
+		memcpy(acomp_ctx->buffers[0], obj, entry->length);
+		src = acomp_ctx->buffers[0];
 	}
 
 	sg_init_one(&input, src, entry->length);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 23/24] mm: zswap: zswap_store() will process a large folio in batches.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (21 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-14 21:05   ` Nhat Pham
  2025-08-28 23:59   ` Barry Song
  2025-08-01  4:36 ` [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios Kanchana P Sridhar
  2025-08-08 23:51 ` [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Nhat Pham
  24 siblings, 2 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch modifies zswap_store() to store a batch of pages in large
folios at a time, instead of storing one page at a time. It does this by
calling a new procedure zswap_store_pages() with a range of
"pool->batch_size" indices in the folio.

zswap_store_pages() implements, for multiple pages in a folio (namely
the "batch"), all the computations done earlier in zswap_store_page()
for a single page:

1) It starts by allocating all zswap entries required to store the
   batch. New procedures, zswap_entries_cache_alloc_batch() and
   zswap_entries_cache_free_batch() call kmem_cache_[free]alloc_bulk()
   to optimize the performance of this step.

2) Next, the entries' fields are written. These are computations that need
   to happen anyway, and are done without modifying the zswap xarray/LRU
   publishing order. This improves latency by avoiding having to bring the
   entries into the cache for writing in different code blocks within this
   procedure.

3) Next, it calls zswap_compress() to sequentially compress each page in
   the batch.

4) Finally, it adds the batch's zswap entries to the xarray and LRU,
   charges zswap memory and increments zswap stats.

5) The error handling and cleanup required for all failure scenarios
   that can occur while storing a batch in zswap are consolidated to a
   single "store_pages_failed" label in zswap_store_pages(). Here again,
   we optimize performance by calling kmem_cache_free_bulk().
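
For illustration, the calling side batches a large folio roughly as follows
(a sketch only; "index", "end", "nr_pages", "nid" and the "put_pool" label
are assumed names, and the actual zswap_store() loop appears later in this
patch):

  for (index = 0; index < nr_pages; index += pool->batch_size) {
          end = min(index + pool->batch_size, nr_pages);
          if (!zswap_store_pages(folio, index, end, objcg, pool, nid))
                  goto put_pool;
  }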

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 218 ++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 149 insertions(+), 69 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 63a997b999537..8ca69c3f30df2 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -879,6 +879,24 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
 	kmem_cache_free(zswap_entry_cache, entry);
 }
 
+/*
+ * Returns 0 if kmem_cache_alloc_bulk() failed and a positive number otherwise.
+ * The code for __kmem_cache_alloc_bulk() indicates that this positive number
+ * will be the @size requested, i.e., @nr_entries.
+ */
+static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
+							   unsigned int nr_entries,
+							   gfp_t gfp)
+{
+	return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries, entries);
+}
+
+static __always_inline void zswap_entries_cache_free_batch(void **entries,
+							   unsigned int nr_entries)
+{
+	kmem_cache_free_bulk(zswap_entry_cache, nr_entries, entries);
+}
+
 /*
  * Carries out the common pattern of freeing and entry's zpool allocation,
  * freeing the entry itself, and decrementing the number of stored pages.
@@ -1512,93 +1530,154 @@ static void shrink_worker(struct work_struct *w)
 * main API
 **********************************/
 
-static bool zswap_store_page(struct page *page,
-			     struct obj_cgroup *objcg,
-			     struct zswap_pool *pool)
+/*
+ * Store multiple pages in @folio, starting from the page at index @start up to
+ * the page at index @end-1.
+ */
+static bool zswap_store_pages(struct folio *folio,
+			      long start,
+			      long end,
+			      struct obj_cgroup *objcg,
+			      struct zswap_pool *pool,
+			      int node_id)
 {
-	swp_entry_t page_swpentry = page_swap_entry(page);
-	struct zswap_entry *entry, *old;
-
-	/* allocate entry */
-	entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
-	if (!entry) {
-		zswap_reject_kmemcache_fail++;
-		return false;
+	struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
+	u8 i, store_fail_idx = 0, nr_pages = end - start;
+
+	if (unlikely(!zswap_entries_cache_alloc_batch((void **)&entries[0],
+						      nr_pages, GFP_KERNEL))) {
+		for (i = 0; i < nr_pages; ++i) {
+			entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
+
+			if (unlikely(!entries[i])) {
+				zswap_reject_kmemcache_fail++;
+				/*
+				 * While handling this error, we only need to
+				 * call zswap_entries_cache_free_batch() for
+				 * entries[0 .. i-1].
+				 */
+				nr_pages = i;
+				goto store_pages_failed;
+			}
+		}
 	}
 
-	if (!zswap_compress(page, entry, pool))
-		goto compress_failed;
+	/*
+	 * Three sets of initializations are done to minimize bringing
+	 * @entries into the cache for writing at different parts of this
+	 * procedure, since doing so regresses performance:
+	 *
+	 * 1) Do all the writes to each entry in one code block. These
+	 *    writes need to be done anyway upon success which is more likely
+	 *    than not.
+	 *
+	 * 2) Initialize the handle to an error value. This facilitates
+	 *    having a consolidated failure handling
+	 *    'goto store_pages_failed' that can inspect the value of the
+	 *    handle to determine whether zpool memory needs to be
+	 *    de-allocated.
+	 *
+	 * 3) The page_swap_entry() is obtained once and stored in the entry.
+	 *    Subsequent store in xarray gets the entry->swpentry instead of
+	 *    calling page_swap_entry(), minimizing computes.
+	 */
+	for (i = 0; i < nr_pages; ++i) {
+		entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
+		entries[i]->pool = pool;
+		entries[i]->swpentry = page_swap_entry(folio_page(folio, start + i));
+		entries[i]->objcg = objcg;
+		entries[i]->referenced = true;
+		INIT_LIST_HEAD(&entries[i]->lru);
+	}
 
-	old = xa_store(swap_zswap_tree(page_swpentry),
-		       swp_offset(page_swpentry),
-		       entry, GFP_KERNEL);
-	if (xa_is_err(old)) {
-		int err = xa_err(old);
+	for (i = 0; i < nr_pages; ++i) {
+		struct page *page = folio_page(folio, start + i);
 
-		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
-		zswap_reject_alloc_fail++;
-		goto store_failed;
+		if (!zswap_compress(page, entries[i], pool))
+			goto store_pages_failed;
 	}
 
-	/*
-	 * We may have had an existing entry that became stale when
-	 * the folio was redirtied and now the new version is being
-	 * swapped out. Get rid of the old.
-	 */
-	if (old)
-		zswap_entry_free(old);
+	for (i = 0; i < nr_pages; ++i) {
+		struct zswap_entry *old, *entry = entries[i];
 
-	/*
-	 * The entry is successfully compressed and stored in the tree, there is
-	 * no further possibility of failure. Grab refs to the pool and objcg,
-	 * charge zswap memory, and increment zswap_stored_pages.
-	 * The opposite actions will be performed by zswap_entry_free()
-	 * when the entry is removed from the tree.
-	 */
-	zswap_pool_get(pool);
-	if (objcg) {
-		obj_cgroup_get(objcg);
-		obj_cgroup_charge_zswap(objcg, entry->length);
-	}
-	atomic_long_inc(&zswap_stored_pages);
+		old = xa_store(swap_zswap_tree(entry->swpentry),
+			       swp_offset(entry->swpentry),
+			       entry, GFP_KERNEL);
+		if (unlikely(xa_is_err(old))) {
+			int err = xa_err(old);
 
-	/*
-	 * We finish initializing the entry while it's already in xarray.
-	 * This is safe because:
-	 *
-	 * 1. Concurrent stores and invalidations are excluded by folio lock.
-	 *
-	 * 2. Writeback is excluded by the entry not being on the LRU yet.
-	 *    The publishing order matters to prevent writeback from seeing
-	 *    an incoherent entry.
-	 */
-	entry->pool = pool;
-	entry->swpentry = page_swpentry;
-	entry->objcg = objcg;
-	entry->referenced = true;
-	if (entry->length) {
-		INIT_LIST_HEAD(&entry->lru);
-		zswap_lru_add(&zswap_list_lru, entry);
+			WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+			zswap_reject_alloc_fail++;
+			/*
+			 * Entries up to this point have been stored in the
+			 * xarray. zswap_store() will erase them from the xarray
+			 * and call zswap_entry_free(). Local cleanup in
+			 * 'store_pages_failed' only needs to happen for
+			 * entries from [@i to @nr_pages).
+			 */
+			store_fail_idx = i;
+			goto store_pages_failed;
+		}
+
+		/*
+		 * We may have had an existing entry that became stale when
+		 * the folio was redirtied and now the new version is being
+		 * swapped out. Get rid of the old.
+		 */
+		if (unlikely(old))
+			zswap_entry_free(old);
+
+		/*
+		 * The entry is successfully compressed and stored in the tree, there is
+		 * no further possibility of failure. Grab refs to the pool and objcg,
+		 * charge zswap memory, and increment zswap_stored_pages.
+		 * The opposite actions will be performed by zswap_entry_free()
+		 * when the entry is removed from the tree.
+		 */
+		zswap_pool_get(pool);
+		if (objcg) {
+			obj_cgroup_get(objcg);
+			obj_cgroup_charge_zswap(objcg, entry->length);
+		}
+		atomic_long_inc(&zswap_stored_pages);
+
+		/*
+		 * We finish by adding the entry to the LRU while it's already
+		 * in xarray. This is safe because:
+		 *
+		 * 1. Concurrent stores and invalidations are excluded by folio lock.
+		 *
+		 * 2. Writeback is excluded by the entry not being on the LRU yet.
+		 *    The publishing order matters to prevent writeback from seeing
+		 *    an incoherent entry.
+		 */
+		if (likely(entry->length))
+			zswap_lru_add(&zswap_list_lru, entry);
 	}
 
 	return true;
 
-store_failed:
-	zpool_free(pool->zpool, entry->handle);
-compress_failed:
-	zswap_entry_cache_free(entry);
+store_pages_failed:
+	for (i = store_fail_idx; i < nr_pages; ++i) {
+		if (!IS_ERR_VALUE(entries[i]->handle))
+			zpool_free(pool->zpool, entries[i]->handle);
+	}
+	zswap_entries_cache_free_batch((void **)&entries[store_fail_idx],
+				       nr_pages - store_fail_idx);
+
 	return false;
 }
 
 bool zswap_store(struct folio *folio)
 {
 	long nr_pages = folio_nr_pages(folio);
+	int node_id = folio_nid(folio);
 	swp_entry_t swp = folio->swap;
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
 	struct zswap_pool *pool;
 	bool ret = false;
-	long index;
+	long start, end;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1632,10 +1711,11 @@ bool zswap_store(struct folio *folio)
 		mem_cgroup_put(memcg);
 	}
 
-	for (index = 0; index < nr_pages; ++index) {
-		struct page *page = folio_page(folio, index);
+	/* Store the folio in batches of @pool->batch_size pages. */
+	for (start = 0; start < nr_pages; start += pool->batch_size) {
+		end = min(start + pool->batch_size, nr_pages);
 
-		if (!zswap_store_page(page, objcg, pool))
+		if (!zswap_store_pages(folio, start, end, objcg, pool, node_id))
 			goto put_pool;
 	}
 
@@ -1665,9 +1745,9 @@ bool zswap_store(struct folio *folio)
 		struct zswap_entry *entry;
 		struct xarray *tree;
 
-		for (index = 0; index < nr_pages; ++index) {
-			tree = swap_zswap_tree(swp_entry(type, offset + index));
-			entry = xa_erase(tree, offset + index);
+		for (start = 0; start < nr_pages; ++start) {
+			tree = swap_zswap_tree(swp_entry(type, offset + start));
+			entry = xa_erase(tree, offset + start);
 			if (entry)
 				zswap_entry_free(entry);
 		}
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios.
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (22 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 23/24] mm: zswap: zswap_store() will process a large folio in batches Kanchana P Sridhar
@ 2025-08-01  4:36 ` Kanchana P Sridhar
  2025-08-14 21:14   ` Nhat Pham
  2025-08-28 23:54   ` Barry Song
  2025-08-08 23:51 ` [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Nhat Pham
  24 siblings, 2 replies; 68+ messages in thread
From: Kanchana P Sridhar @ 2025-08-01  4:36 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch introduces a new unified implementation of zswap_compress()
for compressors that do and do not support batching. This eliminates
code duplication and facilitates maintainability of the code with the
introduction of compress batching.

The earlier implementation, in which zswap_store_pages() called
zswap_compress() sequentially, one page at a time, is replaced with this
new version of zswap_compress() that accepts multiple pages to compress
as a batch.

If the compressor does not support batching, each page in the batch is
compressed and stored sequentially.

If the compressor supports batching, e.g., 'deflate-iaa', the Intel
IAA hardware accelerator, the batch is compressed in parallel in
hardware by setting the acomp_ctx->req->kernel_data to contain the
necessary batching data before calling crypto_acomp_compress(). If all
requests in the batch are compressed without errors, the compressed
buffers are then stored in zpool.
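
A simplified sketch of how the batch is handed to a batching compressor
through the new req->kernel_data pointer (names as introduced in this
series; the actual code is in the diff below):

	struct swap_batch_comp_data batch_comp_data = {
		.pages    = pages,              /* the pages in the batch */
		.dsts     = acomp_ctx->buffers, /* one dst buffer per page */
		.dlens    = dlens,              /* filled in by the driver */
		.errors   = errors,             /* per-page status from the driver */
		.nr_comps = nr_pages,
	};

	acomp_ctx->req->kernel_data = &batch_comp_data;

	/* One call compresses the whole batch in parallel in hardware. */
	if (crypto_acomp_compress(acomp_ctx->req))
		goto compress_error;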

Another important change this patch makes is to the acomp_ctx mutex
locking in zswap_compress(). Earlier, the mutex was held for each page's
compression. With the new code, [un]locking the mutex per page caused
regressions for software compressors when testing with usemem
(30 processes) and also kernel compilation with the 'allmod' config. The
regressions were more egregious when PMD folios were stored. The
implementation in this commit locks/unlocks the mutex once per batch,
which resolves the regression.
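
Schematically, the locking change is (sketch only):

	mutex_lock(&acomp_ctx->mutex);

	for (i = 0; i < nr_pages; i += nr_comps) {
		/* compress nr_comps pages and write them to zpool ... */
	}

	mutex_unlock(&acomp_ctx->mutex);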

The use of prefetchw() on zswap entries and of likely()/unlikely()
annotations prevents regressions with software compressors like zstd, and
generally improves non-batching compressors' performance with the
batching code by ~3%.

Architectural considerations for the zswap batching framework:
==============================================================
We have designed the zswap batching framework to be
hardware-agnostic. It has no dependencies on Intel-specific features and
can be leveraged by any hardware accelerator or software-based
compressor. In other words, the framework is open and inclusive by
design.

Other ongoing work that can use batching:
=========================================
This patch-series demonstrates the performance benefits of compress
batching when used in zswap_store() of large folios. shrink_folio_list()
"reclaim batching" of any-order folios is the major next work that uses
the zswap compress batching framework: our testing of kernel_compilation
with writeback and the zswap shrinker indicates 10X fewer pages get
written back when we reclaim 32 folios as a batch, as compared to one
folio at a time: this is with deflate-iaa and with zstd. We expect to
submit a patch-series with this data and the resulting performance
improvements shortly. Reclaim batching relieves memory pressure faster
than reclaiming one folio at a time, hence alleviates the need to scan
slab memory for writeback.

Nhat has given ideas on using batching with the ongoing kcompressd work,
as well as beneficially using decompression batching & block IO batching
to improve zswap writeback efficiency.

Experiments that combine zswap compress batching, reclaim batching,
swapin_readahead() decompression batching of prefetched pages, and
writeback batching show that 0 pages are written back with deflate-iaa
and zstd. For comparison, the baselines for these compressors see
200K-800K pages written to disk (kernel compilation 'allmod' config).

To summarize, these are future clients of the batching framework:

   - shrink_folio_list() reclaim batching of multiple folios:
       Implemented, will submit patch-series.
   - zswap writeback with decompress batching:
       Implemented, will submit patch-series.
   - zram:
       Implemented, will submit patch-series.
   - kcompressd:
       Not yet implemented.
   - file systems:
       Not yet implemented.
   - swapin_readahead() decompression batching of prefetched pages:
       Implemented, will submit patch-series.

Additionally, any place where folios need to be compressed can
potentially be parallelized.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/swap.h  |  23 ++++++
 mm/zswap.c | 201 ++++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 168 insertions(+), 56 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 911ad5ff0f89f..2afbf00f59fea 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -11,6 +11,29 @@ extern int page_cluster;
 #include <linux/swapops.h> /* for swp_offset */
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
+/* linux/mm/zswap.c */
+/*
+ * A compression algorithm that wants to batch compressions/decompressions
+ * must define its own internal data structures that exactly mirror
+ * @struct swap_batch_comp_data and @struct swap_batch_decomp_data.
+ */
+struct swap_batch_comp_data {
+	struct page **pages;
+	u8 **dsts;
+	unsigned int *dlens;
+	int *errors;
+	u8 nr_comps;
+};
+
+struct swap_batch_decomp_data {
+	u8 **srcs;
+	struct page **pages;
+	unsigned int *slens;
+	unsigned int *dlens;
+	int *errors;
+	u8 nr_decomps;
+};
+
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
diff --git a/mm/zswap.c b/mm/zswap.c
index 8ca69c3f30df2..c30c1f325f573 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -35,6 +35,7 @@
 #include <linux/pagemap.h>
 #include <linux/workqueue.h>
 #include <linux/list_lru.h>
+#include <linux/prefetch.h>
 
 #include "swap.h"
 #include "internal.h"
@@ -988,71 +989,163 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	return ret;
 }
 
-static bool zswap_compress(struct page *page, struct zswap_entry *entry,
-			   struct zswap_pool *pool)
+/*
+ * Unified code path for compressors that do and do not support batching. This
+ * procedure compresses @nr_pages pages in @folio, starting from the
+ * @start index.
+ *
+ * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE. zswap_store() makes
+ * sure of this by design.
+ *
+ * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the compressor does not
+ * support batching.
+ *
+ * If @pool->compr_batch_size is 1, each page is processed sequentially.
+ *
+ * If @pool->compr_batch_size is > 1, compression batching is invoked, except if
+ * @nr_pages is 1: if so, we call the fully synchronous non-batching
+ * crypto_acomp API.
+ *
+ * In both cases, if all compressions are successful, the compressed buffers
+ * are stored in zpool.
+ *
+ * A few important changes were made to avoid regressions, and in fact improve,
+ * compression performance with non-batching software compressors when using
+ * this new batching code:
+ *
+ * 1) acomp_ctx mutex locking:
+ *    Earlier, the mutex was held per page compression. With the new code,
+ *    [un]locking the mutex per page caused regressions for software
+ *    compressors. We now lock the mutex once per batch, which resolves the
+ *    regression.
+ *
+ * 2) The prefetchw() and likely()/unlikely() annotations prevent
+ *    regressions with software compressors like zstd, and generally improve
+ *    non-batching compressors' performance with the batching code by ~3%.
+ */
+static bool zswap_compress(struct folio *folio, long start, unsigned int nr_pages,
+			   struct zswap_entry *entries[], struct zswap_pool *pool,
+			   int node_id)
 {
 	struct crypto_acomp_ctx *acomp_ctx;
 	struct scatterlist input, output;
-	int comp_ret = 0, alloc_ret = 0;
-	unsigned int dlen = PAGE_SIZE;
-	unsigned long handle;
-	struct zpool *zpool;
+	struct zpool *zpool = pool->zpool;
+
+	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
+	int errors[ZSWAP_MAX_BATCH_SIZE];
+
+	unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
+	unsigned int i, j;
+	int err;
 	gfp_t gfp;
-	u8 *dst;
+
+	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
 
 	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
 
 	mutex_lock(&acomp_ctx->mutex);
 
-	dst = acomp_ctx->buffers[0];
-	sg_init_table(&input, 1);
-	sg_set_page(&input, page, PAGE_SIZE, 0);
-
 	/*
-	 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
-	 * and hardware-accelerators may won't check the dst buffer size, so
-	 * giving the dst buffer with enough length to avoid buffer overflow.
+	 * Note:
+	 * [i] refers to the incoming batch space and is used to
+	 *     index into the folio pages, @entries and @errors.
 	 */
-	sg_init_one(&output, dst, PAGE_SIZE * 2);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+	for (i = 0; i < nr_pages; i += nr_comps) {
+		if (nr_comps == 1) {
+			sg_init_table(&input, 1);
+			sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE, 0);
 
-	/*
-	 * it maybe looks a little bit silly that we send an asynchronous request,
-	 * then wait for its completion synchronously. This makes the process look
-	 * synchronous in fact.
-	 * Theoretically, acomp supports users send multiple acomp requests in one
-	 * acomp instance, then get those requests done simultaneously. but in this
-	 * case, zswap actually does store and load page by page, there is no
-	 * existing method to send the second page before the first page is done
-	 * in one thread doing zwap.
-	 * but in different threads running on different cpu, we have different
-	 * acomp instance, so multiple threads can do (de)compression in parallel.
-	 */
-	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
-	if (comp_ret)
-		goto unlock;
+			/*
+			 * We need PAGE_SIZE * 2 here since there may be an over-compression
+			 * case, and hardware accelerators may not check the dst buffer size,
+			 * so give the dst buffer enough length to avoid buffer overflow.
+			 */
+			sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
+			acomp_request_set_params(acomp_ctx->req, &input,
+						 &output, PAGE_SIZE, PAGE_SIZE);
+
+			errors[i] = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
+						    &acomp_ctx->wait);
+			if (unlikely(errors[i]))
+				goto compress_error;
+
+			dlens[i] = acomp_ctx->req->dlen;
+		} else {
+			struct page *pages[ZSWAP_MAX_BATCH_SIZE];
+			unsigned int k;
+
+			for (k = 0; k < nr_pages; ++k)
+				pages[k] = folio_page(folio, start + k);
+
+			struct swap_batch_comp_data batch_comp_data = {
+				.pages = pages,
+				.dsts = acomp_ctx->buffers,
+				.dlens = dlens,
+				.errors = errors,
+				.nr_comps = nr_pages,
+			};
+
+			acomp_ctx->req->kernel_data = &batch_comp_data;
+
+			if (unlikely(crypto_acomp_compress(acomp_ctx->req)))
+				goto compress_error;
+		}
 
-	zpool = pool->zpool;
-	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
-	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page));
-	if (alloc_ret)
-		goto unlock;
-
-	zpool_obj_write(zpool, handle, dst, dlen);
-	entry->handle = handle;
-	entry->length = dlen;
-
-unlock:
-	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
-		zswap_reject_compress_poor++;
-	else if (comp_ret)
-		zswap_reject_compress_fail++;
-	else if (alloc_ret)
-		zswap_reject_alloc_fail++;
+		/*
+		 * All @nr_comps pages were successfully compressed.
+		 * Store the pages in zpool.
+		 *
+		 * Note:
+		 * [j] refers to the incoming batch space and is used to
+		 *     index into the folio pages, @entries, @dlens and @errors.
+		 * [k] refers to the @acomp_ctx space, as determined by
+		 *     @pool->compr_batch_size, and is used to index into
+		 *     @acomp_ctx->buffers.
+		 */
+		for (j = i; j < i + nr_comps; ++j) {
+			unsigned int k = j - i;
+			unsigned long handle;
+
+			/*
+			 * prefetchw() minimizes cache-miss latency by
+			 * moving the zswap entry to the cache before it
+			 * is written to; reducing sys time by ~1.5% for
+			 * non-batching software compressors.
+			 */
+			prefetchw(entries[j]);
+			err = zpool_malloc(zpool, dlens[j], gfp, &handle, node_id);
+
+			if (unlikely(err)) {
+				if (err == -ENOSPC)
+					zswap_reject_compress_poor++;
+				else
+					zswap_reject_alloc_fail++;
+
+				goto err_unlock;
+			}
+
+			zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], dlens[j]);
+			entries[j]->handle = handle;
+			entries[j]->length = dlens[j];
+		}
+	} /* finished compress and store nr_pages. */
 
 	mutex_unlock(&acomp_ctx->mutex);
-	return comp_ret == 0 && alloc_ret == 0;
+	return true;
+
+compress_error:
+	for (j = i; j < i + nr_comps; ++j) {
+		if (errors[j]) {
+			if (errors[j] == -ENOSPC)
+				zswap_reject_compress_poor++;
+			else
+				zswap_reject_compress_fail++;
+		}
+	}
+
+err_unlock:
+	mutex_unlock(&acomp_ctx->mutex);
+	return false;
 }
 
 static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
@@ -1590,12 +1683,8 @@ static bool zswap_store_pages(struct folio *folio,
 		INIT_LIST_HEAD(&entries[i]->lru);
 	}
 
-	for (i = 0; i < nr_pages; ++i) {
-		struct page *page = folio_page(folio, start + i);
-
-		if (!zswap_compress(page, entries[i], pool))
-			goto store_pages_failed;
-	}
+	if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool, node_id)))
+		goto store_pages_failed;
 
 	for (i = 0; i < nr_pages; ++i) {
 		struct zswap_entry *old, *entry = entries[i];
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver
  2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
                   ` (23 preceding siblings ...)
  2025-08-01  4:36 ` [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios Kanchana P Sridhar
@ 2025-08-08 23:51 ` Nhat Pham
  2025-08-09  0:03   ` Sridhar, Kanchana P
  2025-08-15  5:27   ` Herbert Xu
  24 siblings, 2 replies; 68+ messages in thread
From: Nhat Pham @ 2025-08-08 23:51 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
	senozhatsky, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, vinicius.gomes, wajdi.k.feghali,
	vinodh.gopal

On Thu, Jul 31, 2025 at 9:36 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>

Can we get some comments from crypto tree maintainers as well? I feel
like this patch series is more crypto patch than zswap patch, at this
point.

Can we land any zswap parts without the crypto API change? Grasping at
straws here, in case we can parallelize the reviewing and merging
process.

>
> Following Andrew's suggestion, the next two paragraphs emphasize generality
> and alignment with current kernel efforts.
>
> Architectural considerations for the zswap batching framework:
> ==============================================================
> We have designed the zswap batching framework to be
> hardware-agnostic. It has no dependencies on Intel-specific features and
> can be leveraged by any hardware accelerator or software-based
> compressor. In other words, the framework is open and inclusive by
> design.
>
> Other ongoing work that can use batching:
> =========================================
> This patch-series demonstrates the performance benefits of compress
> batching when used in zswap_store() of large folios. shrink_folio_list()
> "reclaim batching" of any-order folios is the next major work that uses
> this zswap compress batching framework: our testing of kernel_compilation
> with writeback and the zswap shrinker indicates 10X fewer pages get
> written back when we reclaim 32 folios as a batch, as compared to one
> folio at a time: this is with deflate-iaa and with zstd. We expect to
> submit a patch-series with this data and the resulting performance
> improvements shortly. Reclaim batching relieves memory pressure faster
> than reclaiming one folio at a time, hence alleviates the need to scan
> slab memory for writeback.
>
> Many thanks to Nhat for suggesting ideas on using batching with the
> ongoing kcompressd work, as well as beneficially using decompression
> batching & block IO batching to improve zswap writeback efficiency.

My pleasure :)

>
> Experiments with kernel compilation benchmark (allmod config) that
> combine zswap compress batching, reclaim batching, swapin_readahead()
> decompression batching of prefetched pages, and writeback batching show
> that 0 pages are written back to disk with deflate-iaa and zstd. For
> comparison, the baselines for these compressors see 200K-800K pages
> written to disk.
>
> To summarize, these are future clients of the batching framework:
>
>    - shrink_folio_list() reclaim batching of multiple folios:
>        Implemented, will submit patch-series.
>    - zswap writeback with decompress batching:
>        Implemented, will submit patch-series.
>    - zram:
>        Implemented, will submit patch-series.
>    - kcompressd:
>        Not yet implemented.
>    - file systems:
>        Not yet implemented.
>    - swapin_readahead() decompression batching of prefetched pages:
>        Implemented, will submit patch-series.
>
>
> iaa_crypto Driver Rearchitecting and Optimizations:
> ===================================================
>
> The most significant highlight of v11 is a new, lightweight and highly
> optimized iaa_crypto driver, resulting directly in the latency and
> throughput improvements noted later in this cover letter.
>
>  1) Better stability, more functionally versatile to support zswap and
>     zram with better performance on different Intel platforms.
>
>     a) Patches 0002, 0005 and 0010 together resolve a race condition in
>        mainline v6.15, reported from internal validation, when IAA
>        wqs/devices are disabled while workloads are using IAA.
>
>     b) Patch 0002 introduces a new architecture for mapping cores to
>        IAAs based on packages instead of NUMA nodes, and generalizing
>        how WQs are used: as package level shared resources for all
>        same-package cores (default for compress WQs), or dedicated to
>        mapped cores (default for decompress WQs). Further, users are
>        able to configure multiple WQs and specify how many of those are
>        for compress jobs only vs. decompress jobs only. sysfs iaa_crypto
>        driver parameters can be used to change the default settings for
>        performance tuning.
>
>     c) idxd descriptor allocation moved from blocking to non-blocking
>        with retry limits and mitigations if limits are exceeded.
>
>     d) Code cleanup for readability and clearer code flow.
>
>     e) Fixes IAA re-registration errors upon disabling/enabling IAA wqs
>        and devices that exists in the mainline v6.15.
>
>     f) Rearchitecting iaa_crypto to be independent of crypto_acomp to
>        enable a zram/zcomp backend_deflate_iaa.c, while fully supporting
>        the crypto_acomp interfaces for zswap. A new
>        include/linux/iaa_comp.h is added.
>
>     g) New Dynamic compression mode for Granite Rapids to get better
>        compression ratio by echo-ing 'deflate-iaa-dynamic' as the zswap
>        compressor.
>
>     h) New crypto_acomp API crypto_acomp_batch_size() that will return
>        the driver's max batch size if the driver has registered the new
>        get_batch_size() acomp_alg interface; or 1 if there is no driver
>        specific implementation of get_batch_size().
>
>        Accordingly, iaa_crypto provides an implementation for
>        get_batch_size().
>
>     i) A versatile set of interfaces independent of crypto_acomp for use
>        in developing a zram zcomp backend for iaa_crypto.
>
>  2) Performance optimizations (please refer to the latency data per
>     optimization in the commit logs):
>
>     a) Distributing [de]compress jobs in round-robin manner to available
>        IAAs on package.
>
>     b) Replacing the compute-intensive iaa_wq_get()/iaa_wq_put() with a
>        percpu_ref in struct iaa_wq, thereby eliminating acquiring a
>        spinlock in the fast path, while using a combination of the
>        iaa_crypto_enabled atomic with spinlocks in the slow path to
>        ensure the compress/decompress code sees a consistent state of the
>        wq tables.
>
>     c) Directly call movdir64b for non-irq use cases, i.e., the most
>        common usage. Avoid the overhead of irq-specific computes in
>        idxd_submit_desc() to gain latency.
>
>     d) Batching of compressions/decompressions using async submit-poll
>        mechanism to derive the benefits of hardware parallelism.
>
>     e) Batching compressors need to manage their own "request"
>        abstraction, and remove this driver-specific aspect from being
>        managed by kernel users such as zswap. iaa_crypto maintains
>        per-CPU "struct iaa_req **reqs" to submit multiple jobs to the
>        hardware accelerator to run in parallel.
>
>     f) Add a "void *kernel_data" member to struct acomp_req for use by
>        kernel modules to pass batching data to algorithms that support
>        batching. This allows us to enable compress batching with only
>        the crypto_acomp_batch_size() API, and without changes to
>        existing crypto_acomp API.
>
>     g) Submit the two largest data buffers first for decompression
>        batching, so that the longest running jobs get a head start,
>        reducing latency for the batch.
>
>
> Main Changes in Zswap Compression Batching:
> ===========================================
>
>  Note to zswap maintainers:
>  --------------------------
>  Patches 20 and 21 can be reviewed and improved/merged independently
>  of this series, since they are zswap centric. These 2 patches help
>  batching but the crypto_acomp_batch_size() from the iaa_crypto commits
>  in this series is not a requirement, unlike patches 22-24.
>
>  1) v11 preserves the pool acomp_ctx resources creation/deletion
>     simplification of v9, namely, lasting from pool creation-deletion,
>     persisting through CPU hot[un]plug operations. Further, zswap no
>     longer needs to create multiple "struct acomp_req" in the per-CPU
>     acomp_ctx. zswap only needs to manage multiple "u8 **buffers".
>
>  2) We store the compressor's batch-size (@pool->compr_batch_size) and
>     the batch-size to use during compression batching
>     (@pool->batch_size) directly in struct zswap_pool for quick
>     retrieval in the zswap_store() fast path.
>
>  3) Optimizations to not cause regressions in software compressors with
>     the introduction of the new unified zswap_compress() procedure that
>     implements compression batching for all compressors. Since v9, the
>     new zpool_malloc() interface that allocates pool memory on the NUMA
>     node, when used in the new zswap_compress() batching implementation,
>     caused some performance loss (verified by replacing
>     page_to_nid(page) with NUMA_NO_NODE). These optimizations help
>     recover the performance and are included in this series:
>
>     a) kmem_cache_alloc_bulk(), kmem_cache_free_bulk() to allocate/free
>        batch zswap_entry-s. These kmem_cache API allow allocator
>        optimizations with internal locks for multiple allocations.
>
>     b) Writes to the zswap_entry right after it is allocated without
>        modifying the publishing order. This avoids different code blocks
>        in zswap_store_pages() having to bring the zswap_entries to the
>        cache for writing, potentially evicting other working set
>        structures, impacting performance.
>
>     c) ZSWAP_MAX_BATCH_SIZE is used as the batch-size for software
>        compressors, since this gives the best performance with zstd when
>        writeback is enabled, and does not regress performance when
>        writeback is not enabled.
>
>     d) More likely()/unlikely() annotations to try and minimize branch
>        mis-predicts.
>
>  4) "struct swap_batch_comp_data" and "struct swap_batch_decomp_data"
>      added in mm/swap.h:
>
>      /*
>       * A compression algorithm that wants to batch compressions/decompressions
>       * must define its own internal data structures that exactly mirror
>       * @struct swap_batch_comp_data and @struct swap_batch_decomp_data.
>       */
>
>      Accordingly, zswap_compress() uses struct swap_batch_comp_data to
>      pass batching data in the acomp_req->kernel_data
>      pointer if the compressor supports batching.
>
>      include/linux/iaa_comp.h has matching definitions of
>      "struct iaa_batch_comp_data" and "struct iaa_batch_decomp_data".
>
>      Feedback from the zswap maintainers is requested on whether this
>      is a good approach. Suggestions for alternative approaches are also
>      very welcome.
>
>
> Compression Batching:
> =====================
>
> This patch-series introduces batch compression of pages in large folios to
> improve zswap swapout latency. It preserves the existing zswap protocols
> for non-batching software compressors by calling crypto_acomp sequentially
> per page in the batch. Additionally, in support of hardware accelerators
> that can process a batch as an integral unit, the patch-series allows
> zswap to call crypto_acomp without API changes, for compressors
> that intrinsically support batching.
>
> The patch series provides a proof point by using the Intel Analytics
> Accelerator (IAA) for implementing the compress/decompress batching API
> using hardware parallelism in the iaa_crypto driver and another proof point
> with a sequential software compressor, zstd.
>
> SUMMARY:
> ========
>
>   The first proof point is to test with IAA using a sequential call (fully
>   synchronous, compress one page at a time) vs. a batching call (fully
>   asynchronous, submit a batch to IAA for parallel compression, then poll for
>   completion statuses).
>
>     The performance testing data with 30 usemem processes/64K folios
>     shows 52% throughput gains and 24% elapsed/sys time reductions with
>     deflate-iaa; and 11% sys time reduction with zstd for a small
>     throughput increase.
>
>     Kernel compilation test with 64K folios using 28 threads and the
>     zswap shrinker_enabled set to "Y", demonstrates similar
>     improvements: zswap_store() large folios using IAA compress batching
>     improves the workload performance by 6.8% and reduces sys time by
>     19% as compared to IAA sequential. For zstd, compress batching
>     improves workload performance by 5.2% and reduces sys time by
>     27.4% as compared to sequentially calling zswap_compress() per page
>     in a folio.
>
>   The second proof point is to make sure that software algorithms such as
>   zstd do not regress. The data indicates that for sequential software
>   algorithms a performance gain is achieved.
>
>     With the performance optimizations implemented in patches 22-24
>     of v11:
>     *  zstd usemem30 throughput with PMD folios increases by
>        1%. Throughput with 64K folios is within range of variation
>        with a slight improvement. Workload performance with zstd
>        improves by 8%-6%, and sys time reduces by 11%-8% with 64K/PMD
>        folios.
>
>     *  With kernel compilation using zstd with the zswap shrinker, we
>        get a 27.4%-28.2% reduction in sys time, a 5.2%-2.1% improvement
>        in workload performance, and 65%-59% fewer pages written back to
>        disk for 64K/PMD folios respectively.
>
>     These optimizations pertain to ensuring common code paths, removing
>     redundant branches/computes, using prefetchw() of the zswap entry
>     before it is written, and selectively annotating branches with
>     likely()/unlikely() compiler directives to minimize branch
>     mis-prediction penalty. Additionally, using the batching code for
>     non-batching compressors to sequentially compress/store batches of up
>     to ZSWAP_MAX_BATCH_SIZE pages seems to help, most likely due to
>     cache locality of working set structures such as the array of
>     zswap_entry-s for the batch.
>
>     Our internal validation of zstd with the batching interface vs. IAA with
>     the batching interface on Emerald Rapids has shown that IAA
>     compress/decompress batching gives 21.3% more memory savings as compared
>     to zstd, for 5% performance loss as compared to the baseline without any
>     memory pressure. IAA batching demonstrates more than 2X the memory
>     savings obtained by zstd at this 95% performance KPI.
>     The compression ratio with IAA is 2.23, and with zstd 2.96. Even with
>     this compression ratio deficit for IAA, batching is extremely
>     beneficial. As we improve the compression ratio of the IAA accelerator,
>     we expect to see even better memory savings with IAA as compared to
>     software compressors.
>
>
>   Batching Roadmap:
>   =================
>
>   1) Compression batching within large folios (this series).
>
>   2) zswap writeback decompression batching:
>
>      This is being co-developed with Nhat Pham, and shows promising
>      results. We plan to submit an RFC shortly.
>
>   3) Reclaim batching of hybrid folios:
>
>      We can expect to see even more significant performance and throughput
>      improvements if we use the parallelism offered by IAA to do reclaim
>      batching of 4K/large folios (really any-order folios), and using the
>      zswap_store() high throughput compression pipeline to batch-compress
>      pages comprising these folios, not just batching within large
>      folios. This is the reclaim batching patch 13 in v1, which we expect
>      to submit in a separate patch-series. As mentioned earlier, reclaim
>      batching reduces the # of writeback pages by 10X for zstd and
>      deflate-iaa.
>
>   4) swapin_readahead() decompression batching:
>
>      We have developed a zswap load batching interface to be used
>      for parallel decompression batching, using swapin_readahead().
>
>   These capabilities are architected so as to be useful to zswap and
>   zram. We are actively working on integrating these components with zram.
>
>
>   v11 Performance Summary:
>   ========================
>
>   This is a performance testing summary of results with usemem30
>   (30 usemem processes running in a cgroup limited at 150G, each trying to
>    allocate 10G).
>
>   zswap shrinker_enabled = N.
>
>   usemem30 with 64K folios:
>   =========================
>
>      -----------------------------------------------------------------------
>                      mm-unstable-7-30-2025             v11
>      -----------------------------------------------------------------------
>      zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
>                                                                  vs.
>                                                              IAA Sequential
>      -----------------------------------------------------------------------
>      Total throughput (KB/s)     7,153,359      10,856,388        52%
>      Avg throughput (KB/s)         238,445         361,879
>      elapsed time (sec)              92.61           70.50       -24%
>      sys time (sec)               2,193.59        1,675.32       -24%
>      -----------------------------------------------------------------------
>
>      -----------------------------------------------------------------------
>                      mm-unstable-7-30-2025             v11
>      -----------------------------------------------------------------------
>      zswap compressor                 zstd            zstd   v11 zstd
>                                                              improvement
>      -----------------------------------------------------------------------
>      Total throughput (KB/s)     6,866,411       6,874,244       0.1%
>      Avg throughput (KB/s)         228,880         229,141
>      elapsed time (sec)              96.45           89.05        -8%
>      sys time (sec)               2,410.72        2,150.63       -11%
>      -----------------------------------------------------------------------
>
>
>   usemem30 with 2M folios:
>   ========================
>
>      -----------------------------------------------------------------------
>                      mm-unstable-7-30-2025             v11
>      -----------------------------------------------------------------------
>      zswap compressor          deflate-iaa     deflate-iaa   IAA Batching
>                                                                  vs.
>                                                              IAA Sequential
>      -----------------------------------------------------------------------
>      Total throughput (KB/s)     7,268,929      11,312,195        56%
>      Avg throughput (KB/s)         242,297         377,073
>      elapsed time (sec)              80.40           68.73       -15%
>      sys time (sec)               1,856.54        1,599.25       -14%
>      -----------------------------------------------------------------------
>
>      -----------------------------------------------------------------------
>                      mm-unstable-7-30-2025             v11
>      -----------------------------------------------------------------------
>      zswap compressor                 zstd            zstd   v11 zstd
>                                                              improvement
>      -----------------------------------------------------------------------
>      Total throughput (KB/s)     7,560,441       7,627,155       0.9%
>      Avg throughput (KB/s)         252,014         254,238
>      elapsed time (sec)              88.89           83.22        -6%
>      sys time (sec)               2,132.05        1,952.98        -8%
>      -----------------------------------------------------------------------
>
>
>   This is a performance testing summary of results with
>   kernel_compilation test (allmod config, 28 cores, cgroup limited to 2G).
>
>   Writeback to disk is enabled by setting zswap shrinker_enabled = Y.
>
>   kernel_compilation with 64K folios:
>   ===================================
>
>      --------------------------------------------------------------------------
>                         mm-unstable-7-30-2025             v11
>      --------------------------------------------------------------------------
>      zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
>                                                                      vs.
>                                                                  IAA Sequential
>      --------------------------------------------------------------------------
>      real_sec                          901.81          840.60       -6.8%
>      sys_sec                         2,672.93        2,171.17        -19%
>      zswpout                       34,700,692      24,076,095        -31%
>      zswap_written_back_pages       2,612,474       1,451,961        -44%
>      --------------------------------------------------------------------------
>
>      --------------------------------------------------------------------------
>                         mm-unstable-7-30-2025             v11
>      --------------------------------------------------------------------------
>      zswap compressor                    zstd            zstd    Improvement
>      --------------------------------------------------------------------------
>      real_sec                          882.67          837.21       -5.2%
>      sys_sec                         3,573.31        2,593.94      -27.4%
>      zswpout                       42,768,967      22,660,215        -47%
>      zswap_written_back_pages       2,109,739         727,919        -65%
>      --------------------------------------------------------------------------
>
>
>   kernel_compilation with PMD folios:
>   ===================================
>
>      --------------------------------------------------------------------------
>                         mm-unstable-7-30-2025             v11
>      --------------------------------------------------------------------------
>      zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
>                                                                      vs.
>                                                                  IAA Sequential
>      --------------------------------------------------------------------------
>      real_sec                          838.76          804.83         -4%
>      sys_sec                         3,173.57        2,422.63        -24%
>      zswpout                       59,544,198      38,093,995        -36%
>      zswap_written_back_pages       2,726,367         929,614        -66%
>      --------------------------------------------------------------------------
>
>
>      --------------------------------------------------------------------------
>                         mm-unstable-7-30-2025             v11
>      --------------------------------------------------------------------------
>      zswap compressor                    zstd            zstd    Improvement
>      --------------------------------------------------------------------------
>      real_sec                          831.09          813.40       -2.1%
>      sys_sec                         4,251.11        3,053.95      -28.2%
>      zswpout                       59,452,638      35,832,407        -40%
>      zswap_written_back_pages       1,041,721         423,334        -59%
>      --------------------------------------------------------------------------

I see a lot of good numbers for both IAA and zstd here. Thanks for
working on it, Kanchana!

>
>
>
> DETAILS:
> ========
>
> (A) From zswap's perspective, the most significant changes are:
> ===============================================================
>
> 1) A unified zswap_compress() API is added to compress multiple
>    pages:
>
>    - If the compressor has multiple acomp requests, i.e., internally
>      supports batching, crypto_acomp_batch_compress() is called. If all
>      pages are successfully compressed, the batch is stored in zpool.
>
>    - If the compressor can only compress one page at a time, each page
>      is compressed and stored sequentially.
>
>    Many thanks to Yosry for this suggestion, because it is an essential
>    component of unifying common code paths between sequential/batching
>    compressions.
>
>    prefetchw() is used in zswap_compress() to minimize cache-miss
>    latency by moving the zswap entry to the cache before it is written
>    to; reducing sys time by ~1.5% for zstd (non-batching software
>    compression). In other words, this optimization helps both batching and
>    software compressors.
>
>    Overall, the prefetchw() and likely()/unlikely() annotations prevent
>    regressions with software compressors like zstd, and generally improve
>    non-batching compressors' performance with the batching code by ~8%.
>
> 2) A new zswap_store_pages() is added, that stores multiple pages in a
>    folio in a range of indices. This is an extension of the earlier
>    zswap_store_page(), except it operates on a batch of pages.
>
> 3) zswap_store() is modified to store the folio's pages in batches
>    by calling zswap_store_pages(). If the compressor supports batching,
>    the folio will be compressed in batches of
>    "pool->compr_batch_size". If the compressor does not support
>    batching, the folio will be compressed in batches of
>    ZSWAP_MAX_BATCH_SIZE pages, where each page in the batch is
>    compressed sequentially. We see better performance by processing
>    the folio in batches of ZSWAP_MAX_BATCH_SIZE, due to cache locality
>    of working set structures such as the array of zswap_entry-s for the
>    batch.
>
>    Many thanks to Yosry and Johannes for steering towards a common
>    design and code paths for sequential and batched compressions (i.e.,
>    for software compressors and hardware accelerators such as IAA). As per
>    Yosry's suggestion in v8, the "batch_size" is an attribute of the
>    compressor/pool, and hence is stored in struct zswap_pool instead of
>    in struct crypto_acomp_ctx.
>
> 4) Simplifications to the acomp_ctx resources allocation/deletion
>    vis-a-vis CPU hot[un]plug. This further improves upon v8 of this
>    patch-series based on the discussion with Yosry, and formalizes the
>    lifetime of these resources from pool creation to pool
>    deletion. zswap does not register a CPU hotplug teardown
>    callback. The acomp_ctx resources will persist through CPU
>    online/offline transitions. The main changes made to avoid UAF/race
>    conditions, and correctly handle process migration, are:
>
>    a) No acomp_ctx mutex locking in zswap_cpu_comp_prepare().
>    b) No CPU hotplug teardown callback, no acomp_ctx resources deleted.
>    c) New acomp_ctx_dealloc() procedure that cleans up the acomp_ctx
>       resources, and is shared by
>       zswap_cpu_comp_prepare()/zswap_pool_create() error handling and
>       zswap_pool_destroy().
>    d) The zswap_pool node list instance is removed right after the node
>       list add function in zswap_pool_create().
>    e) We directly call mutex_[un]lock(&acomp_ctx->mutex) in
>       zswap_[de]compress(). acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock()
>       are deleted.
>
>    The commit log of patch 0020 has a more detailed analysis.
>
>
> (B) Main changes in crypto_acomp and iaa_crypto:
> ================================================
>
> 1) A new architecture is introduced for IAA device WQs' usage as:
>    - compress only
>    - decompress only
>    - generic, i.e., both compress/decompress.
>
>    Further, IAA devices/wqs are assigned to cores based on packages
>    instead of NUMA nodes.
>
>    The WQ rebalancing algorithm that is invoked as WQs are
>    discovered/deleted has been made very general and flexible so that
>    the user can control exactly how IAA WQs are used. In addition to the
>    user being able to specify a WQ type as comp/decomp/generic, the user
>    can also configure if WQs need to be shared among all same-package
>    cores, or, whether the cores should be divided up amongst the
>    available IAA devices.
>
>    If distribute_[de]comps is enabled, from a given core's perspective,
>    the iaa_crypto driver will distribute comp/decomp jobs among all
>    devices' WQs in round-robin manner. This improves batching latency
>    and can improve compression/decompression throughput for workloads
>    that see a lot of swap activity.
>
>    The commit log of patch 0002 provides more details on new iaa_crypto
>    driver parameters added, along with recommended settings (defaults
>    are optimal settings).
>
> 2) Compress/decompress batching are implemented using
>    crypto_acomp_[de]compress() with batching data passed to the driver
>    using the acomp_req->kernel_data pointer.
>
>
> (C) The patch-series is organized as follows:
> =============================================
>
>  1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
>     patches are tagged with "crypto:" in the subject:
>
>     Patch 1) Reorganizes the iaa_crypto driver code into logically related
>              sections and avoids forward declarations, in order to facilitate
>              subsequent iaa_crypto patches. This patch makes no
>              functional changes.
>
>     Patch 2) Makes an infrastructure change in the iaa_crypto driver
>              to map IAA devices/work-queues to cores based on packages
>              instead of NUMA nodes. This doesn't impact performance on
>              the Sapphire Rapids system used for performance
>              testing. However, this change fixes functional problems we
>              found on Granite Rapids during internal validation, where the
>              number of NUMA nodes is greater than the number of packages,
>              which was resulting in over-utilization of some IAA devices
>              and non-usage of other IAA devices as per the current NUMA
>              based mapping infrastructure.
>
>              This patch also develops a new architecture that
>              generalizes how IAA device WQs are used. It enables
>              designating IAA device WQs as either compress-only or
>              decompress-only or generic. Once IAA device WQ types are
>              thus defined, it also allows the configuration of whether
>              device WQs will be shared by all cores on the package, or
>              used only by "mapped cores" obtained by a simple allocation
>              of available IAAs to cores on the package.
>
>              As a result of the overhaul of wq_table definition,
>              allocation and rebalancing, this patch eliminates
>              duplication of device WQs in per-CPU wq_tables, thereby
>              saving 140MiB on a 384 cores dual socket Granite Rapids server
>              with 8 IAAs.
>
>              Regardless of how the user has configured the WQs' usage,
>              the next WQ to use is obtained through a direct look-up in
>              per-CPU "cpu_comp_wqs" and "cpu_decomp_wqs" structures so
>              as to minimize latency in the critical path driver compress
>              and decompress routines.
>
>     Patch 3) Code cleanup, consistency of function parameters.
>
>     Patch 4) Makes a change to iaa_crypto driver's descriptor allocation,
>              from blocking to non-blocking with retries/timeouts and
>              mitigations in case of timeouts during compress/decompress
>              ops. This prevents tasks getting blocked indefinitely, which
>              was observed when testing 30 cores running workloads, with
>              only 1 IAA enabled on Sapphire Rapids (out of 4). These
>              timeouts are typically only encountered, and associated
>              mitigations exercised, only in configurations with 1 IAA
>              device shared by 30+ cores.
>
>     Patch 5) Optimize iaa_wq refcounts using a percpu_ref instead of
>              spinlocks and "int refcount".
>
>     Patch 6) Code simplification and restructuring for understandability
>              in core iaa_compress() and iaa_decompress() routines.
>
>     Patch 7) Refactor hardware descriptor setup to their own procedures
>              to reduce code clutter.
>
>     Patch 8) Simplify and optimize (i.e. reduce computes) job submission
>              for the most commonly used non-irq async mode by directly
>              calling movdir64b.
>
>     Patch 9) Deprecate exporting symbols for adding IAA compression
>              modes.
>
>     Patch 10) Rearchitect iaa_crypto to be agnostic of crypto_acomp for
>               it to be usable in both zswap and zram. The crypto_acomp
>               interfaces are maintained as before, for use in zswap.
>
>     Patch 11) Descriptor submit and polling mechanisms, enablers for batching.
>
>     Patch 12) Add a "void *kernel_data" member to struct acomp_req. This
>               gets initialized to NULL in acomp_request_set_params().
>
>     Patch 13) Implement IAA batching of compressions and decompressions
>               for deriving hardware parallelism.
>
>     Patch 14) Enables the "async" mode, sets it as the default.
>
>     Patch 15) Disables verify_compress by default.
>
>     Patch 16) Decompress batching optimization: Find the two largest
>               buffers in the batch and submit them first.
>
>     Patch 17) Add a new Dynamic compression mode that can be used on
>               Granite Rapids.
>
>     Patch 18) Add get_batch_size() to structs acomp_alg/crypto_acomp and
>               a crypto_acomp_batch_size() API that returns the compressor's
>               batch-size if it implements get_batch_size(), and 1 otherwise.
>
>     Patch 19) iaa-crypto implementation for get_batch_size(), that
>               returns an iaa_driver specific constant,
>               IAA_CRYPTO_MAX_BATCH_SIZE (set to 8U currently).
>
>
>  2) zswap modifications to enable compress batching in zswap_store()
>     of large folios (including pmd-mappable folios):
>
>     Patch 20) Simplifies the zswap_pool's per-CPU acomp_ctx resource
>               management and lifetime to be from pool creation to pool
>               deletion.
>
>     Patch 21) Uses IS_ERR_OR_NULL() in zswap_cpu_comp_prepare() to check for
>               valid acomp/req, thereby making it consistent with the resource
>               de-allocation code.
>
>     Patch 22) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently set
>               as 8U) to denote the maximum number of acomp_ctx batching
>               resources to allocate, thus limiting the amount of extra
>               memory used for batching. Further, the "struct
>               crypto_acomp_ctx" is modified to contain multiple buffers.
>               New "u8 compr_batch_size" and "u8 batch_size" members are
>               added to "struct zswap_pool" to track the number of dst
>               buffers associated with the compressor (more than 1 if
>               the compressor supports batching) and the unit for storing
>               large folios using compression batching respectively.
>
>     Patch 23) Modifies zswap_store() to store the folio in batches of
>               pool->batch_size by calling a new zswap_store_pages() that takes
>               a range of indices in the folio to be stored.
>               zswap_store_pages() pre-allocates zswap entries for the batch,
>               calls zswap_compress() for each page in this range, and stores
>               the entries in xarray/LRU. A simplified sketch of this flow
>               is included after this patch list.
>
>     Patch 24) Introduces a new unified implementation of zswap_compress()
>               for compressors that do and do not support batching. This
>               eliminates code duplication and facilitates maintainability of
>               the code with the introduction of compress batching. Further,
>               there are many optimizations to this common code that result
>               in workload throughput and performance improvements with
>               software compressors and hardware accelerators such as IAA.
>
>               zstd performance is better than or on par with mm-unstable.
>               We see impressive throughput/performance improvements with
>               IAA and zstd batching vs. no-batching.
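>
> To illustrate how patches 18, 22 and 23 fit together, here is a minimal
> sketch of how zswap derives its batching parameters from the compressor
> and then walks a large folio in batches. This is a simplified
> illustration rather than the exact code in this series; in particular,
> the zswap_store_pages() signature and the error label are approximated
> from the descriptions above.
>
>     /*
>      * Simplified sketch; error handling and the actual compress/store
>      * steps are omitted.
>      */
>     #define ZSWAP_MAX_BATCH_SIZE 8U
>
>     /*
>      * At pool creation: clamp the compressor's batch-size (1 if the
>      * compressor does not implement get_batch_size()).
>      */
>     pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
>                                  crypto_acomp_batch_size(acomp_ctx->acomp));
>
>     /*
>      * Unit for storing large folios: the compressor's batch-size if it
>      * batches, ZSWAP_MAX_BATCH_SIZE (pages compressed sequentially)
>      * otherwise.
>      */
>     pool->batch_size = (pool->compr_batch_size > 1) ?
>                             pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
>
>     /* In zswap_store(): process the folio in units of pool->batch_size. */
>     for (index = 0; index < nr_pages; index += pool->batch_size) {
>             unsigned int nr = min_t(unsigned int, pool->batch_size,
>                                     nr_pages - index);
>
>             /* Compresses 'nr' pages, then stores entries in xarray/LRU. */
>             if (!zswap_store_pages(folio, index, index + nr, pool))
>                     goto store_failed;      /* simplified error path */
>     }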
>
>
> With v11 of this patch series, the IAA compress batching feature will be
> enabled seamlessly on Intel platforms that have IAA by selecting
> 'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async'
> sync_mode driver attribute (the default).
>
>
> System setup for testing:
> =========================
> Testing of this patch-series was done with mm-unstable as of 7-30-2025,
> commit 01da54f10fdd, without and with this patch-series. Data was
> gathered on an Intel Sapphire Rapids (SPR) server: dual-socket, 56 cores
> per socket, 4 IAA devices per socket (each IAA has 128 WQ entries in
> total), 503 GiB RAM, and a 525G SSD swap partition. Core frequency was
> fixed at 2500 MHz.
>
> Other kernel configuration parameters:
>
>     zswap compressor  : zstd, deflate-iaa
>     zswap allocator   : zsmalloc
>     vm.page-cluster   : 0
>
> IAA "compression verification" is disabled and IAA is run in the async
> mode (the defaults with this series).
>
> I ran experiments with these workloads:
>
> 1) usemem 30 processes with zswap shrinker_enabled=N. Two sets of
>    experiments, one with 64K folios, another with PMD folios.
>
> 2) Kernel compilation allmodconfig with 2G max memory, 28 threads, with
>    zswap shrinker_enabled=Y to test batching performance impact when
>    writeback is enabled. Two sets of experiments, one with 64K folios,
>    another with PMD folios.
>
> IAA configuration is done via a CLI script, which is included at the end
> of the cover letter.
>
>
> Performance testing (usemem30):
> ===============================
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
> processes were run, each allocating and writing 10G of memory, and
> sleeping for 10 sec before exiting:
>
>  usemem --init-time -w -O -b 1 -s 10 -n 30 10g
>  echo 0 > /sys/module/zswap/parameters/shrinker_enabled
>
>  IAA WQ Configuration (script is included at the end of the cover
>  letter):
>
>    ./enable_iaa.sh -d 4 -q 1
>
>  This enables all 4 IAAs on the socket, and configures 1 WQ per IAA
>  device, each containing 128 entries. The driver distributes compress
>  jobs from each core to wqX.0 of all same-package IAAs in a
>  round-robin manner. Decompress jobs are sent to wqX.0 of the mapped
>  IAA device.
>
>  Since usemem has significantly more swapouts than swapins, this
>  configuration is optimal for this workload.
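>
>  Conceptually, the per-CPU WQ selection described above looks like the
>  following sketch. The structure and function names here are
>  placeholders for illustration only, not the actual iaa_crypto code:
>
>      /* Illustrative per-CPU WQ selection for this configuration. */
>      struct iaa_cpu_wqs {
>              struct idxd_wq **comp_wqs;   /* wqX.0 of all same-package IAAs */
>              unsigned int nr_comp_wqs;
>              unsigned int comp_cursor;    /* round-robin cursor */
>              struct idxd_wq *decomp_wq;   /* wqX.0 of the mapped IAA */
>      };
>
>      static struct idxd_wq *pick_comp_wq(struct iaa_cpu_wqs *cpu_wqs)
>      {
>              /* Compress jobs round-robin across all same-package IAAs. */
>              cpu_wqs->comp_cursor = (cpu_wqs->comp_cursor + 1) % cpu_wqs->nr_comp_wqs;
>              return cpu_wqs->comp_wqs[cpu_wqs->comp_cursor];
>      }
>
>      static struct idxd_wq *pick_decomp_wq(struct iaa_cpu_wqs *cpu_wqs)
>      {
>              /* Decompress jobs go to the mapped IAA device's WQ. */
>              return cpu_wqs->decomp_wq;
>      }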
>
>  64K folios: usemem30: deflate-iaa:
>  ==================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable-7-30-2025             v11
>  -------------------------------------------------------------------------------
>  zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
>                                                                  vs.
>                                                              IAA Sequential
>  -------------------------------------------------------------------------------
>  Total throughput (KB/s)        7,153,359      10,856,388         52%
>  Avg throughput (KB/s)            238,445         361,879
>  elapsed time (sec)                 92.61           70.50        -24%
>  sys time (sec)                  2,193.59        1,675.32        -24%
>
>  -------------------------------------------------------------------------------
>  memcg_high                     1,061,494       1,340,863
>  memcg_swap_fail                    1,496             240
>  64kB_swpout_fallback               1,496             240
>  zswpout                       61,642,322      71,374,066
>  zswpin                               130             250
>  pswpout                                0               0
>  pswpin                                 0               0
>  ZSWPOUT-64kB                   3,851,135       4,460,571
>  SWPOUT-64kB                            0               0
>  pgmajfault                         2,446           2,545
>  zswap_reject_compress_fail             0               0
>  zswap_reject_reclaim_fail              0               0
>  zswap_pool_limit_hit                   0               0
>  zswap_written_back_pages               0               0
>  IAA incompressible pages               0               0
>  -------------------------------------------------------------------------------
>
>
>  2M folios: usemem30: deflate-iaa:
>  =================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable-7-30-2025             v11
>  -------------------------------------------------------------------------------
>  zswap compressor             deflate-iaa     deflate-iaa     IAA Batching
>                                                                   vs.
>                                                               IAA Sequential
>  -------------------------------------------------------------------------------
>  Total throughput (KB/s)        7,268,929      11,312,195         56%
>  Avg throughput (KB/s)            242,297         377,073
>  elapsed time (sec)                 80.40           68.73        -15%
>  sys time (sec)                  1,856.54        1,599.25        -14%
>
>  -------------------------------------------------------------------------------
>  memcg_high                        99,426         119,834
>  memcg_swap_fail                      371             293
>  thp_swpout_fallback                  371             293
>  zswpout                       63,227,705      71,567,857
>  zswpin                               456             482
>  pswpout                                0               0
>  pswpin                                 0               0
>  ZSWPOUT-2048kB                   123,119         139,505
>  thp_swpout                             0               0
>  pgmajfault                         2,901           2,813
>  zswap_reject_compress_fail             0               0
>  zswap_reject_reclaim_fail              0               0
>  zswap_pool_limit_hit                   0               0
>  zswap_written_back_pages               0               0
>  IAA incompressible pages               0               0
>  -------------------------------------------------------------------------------
>
>
>
>  64K folios: usemem30: zstd:
>  ===========================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable-7-30-2025             v11
>  -------------------------------------------------------------------------------
>  zswap compressor                    zstd            zstd        v11 zstd
>                                                                  improvement
>  -------------------------------------------------------------------------------
>  Total throughput (KB/s)        6,866,411       6,874,244        0.1%
>  Avg throughput (KB/s)            228,880         229,141
>  elapsed time (sec)                 96.45           89.05         -8%
>  sys time (sec)                  2,410.72        2,150.63        -11%
>
>  -------------------------------------------------------------------------------
>  memcg_high                     1,070,285       1,075,178
>  memcg_swap_fail                    2,404              66
>  64kB_swpout_fallback               2,404              66
>  zswpout                       49,767,024      49,672,972
>  zswpin                               454             192
>  pswpout                                0               0
>  pswpin                                 0               0
>  ZSWPOUT-64kB                   3,108,029       3,104,433
>  SWPOUT-64kB                            0               0
>  pgmajfault                         2,758           2,481
>  zswap_reject_compress_fail             0               0
>  zswap_reject_reclaim_fail              0               0
>  zswap_pool_limit_hit                   0               0
>  zswap_written_back_pages               0               0
>  -------------------------------------------------------------------------------
>
>
>  2M folios: usemem30: zstd:
>  ==========================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable-7-30-2025             v11
>  -------------------------------------------------------------------------------
>  zswap compressor                    zstd            zstd        v11 zstd
>                                                                  improvement
>  -------------------------------------------------------------------------------
>  Total throughput (KB/s)        7,560,441       7,627,155        0.9%
>  Avg throughput (KB/s)            252,014         254,238
>  elapsed time (sec)                 88.89           83.22         -6%
>  sys time (sec)                  2,132.05        1,952.98         -8%
>
>  -------------------------------------------------------------------------------
>  memcg_high                        89,486          88,982
>  memcg_swap_fail                      183              41
>  thp_swpout_fallback                  183              41
>  zswpout                       48,947,054      48,598,306
>  zswpin                               472             252
>  pswpout                                0               0
>  pswpin                                 0               0
>  ZSWPOUT-2048kB                    95,420          94,876
>  thp_swpout                             0               0
>  pgmajfault                         2,789           2,540
>  zswap_reject_compress_fail             0               0
>  zswap_reject_reclaim_fail              0               0
>  zswap_pool_limit_hit                   0               0
>  zswap_written_back_pages               0               0
>  -------------------------------------------------------------------------------
>
>
>
> Performance testing (Kernel compilation, allmodconfig):
> =======================================================
>
> The kernel compilation experiments use 28 threads and build the
> "allmodconfig" target, which takes ~14 minutes and has considerable
> swapout/swapin activity. The cgroup's memory.max is set to 2G. We
> trigger writeback by enabling the zswap shrinker.
>
>  echo 1 > /sys/module/zswap/parameters/shrinker_enabled
>
>  IAA WQ Configuration (script is at the end of the cover letter):
>
>    ./enable_iaa.sh -d 4 -q 2
>
>  This enables all 4 IAAs on the socket, and configures 2 WQs per IAA,
>  each containing 64 entries. The driver sends decompress jobs to wqX.0
>  of the mapped IAA device, and distributes compress jobs to wqX.1 of all
>  same-package IAAs in a round-robin manner.
>
>  64K folios: Kernel compilation/allmodconfig: deflate-iaa:
>  =========================================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable-7-30-2025             v11
>  -------------------------------------------------------------------------------
>  zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
>                                                                  vs.
>                                                              IAA Sequential
>  -------------------------------------------------------------------------------
>  real_sec                          901.81          840.60       -6.8%
>  user_sec                       15,499.45       15,431.54
>  sys_sec                         2,672.93        2,171.17        -19%
>  -------------------------------------------------------------------------------
>  Max_Res_Set_Size_KB            1,872,984       1,872,884
>  -------------------------------------------------------------------------------
>  memcg_high                             0               0
>  memcg_swap_fail                    2,633               0
>  64kB_swpout_fallback               2,630               0
>  zswpout                       34,700,692      24,076,095        -31%
>  zswpin                         7,791,832       4,937,822
>  pswpout                        2,624,324       1,459,681
>  pswpin                         2,486,667       1,229,416
>  ZSWPOUT-64kB                   1,254,622         896,433
>  SWPOUT-64kB                           36              18
>  pgmajfault                    10,613,272       6,305,623
>  zswap_reject_compress_fail            64             111
>  zswap_reject_reclaim_fail            301              59
>  zswap_pool_limit_hit                   0               0
>  zswap_written_back_pages       2,612,474       1,451,961        -44%
>  IAA incompressible pages              64             111
>  -------------------------------------------------------------------------------
>
>
>  2M folios: Kernel compilation/allmodconfig: deflate-iaa:
>  ========================================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable-7-30-2025             v11
>  -------------------------------------------------------------------------------
>  zswap compressor             deflate-iaa     deflate-iaa    IAA Batching
>                                                                  vs.
>                                                              IAA Sequential
>  -------------------------------------------------------------------------------
>  real_sec                          838.76          804.83         -4%
>  user_sec                       15,624.57       15,566.49
>  sys_sec                         3,173.57        2,422.63        -24%
>  -------------------------------------------------------------------------------
>  Max_Res_Set_Size_KB            1,874,680       1,872,892
>  -------------------------------------------------------------------------------
>  memcg_high                             0               0
>  memcg_swap_fail                   10,350             906
>  thp_swpout_fallback               10,342             906
>  zswpout                       59,544,198      38,093,995        -36%
>  zswpin                        17,933,865      10,220,321
>  pswpout                        2,740,024         935,226
>  pswpin                         3,179,590       1,346,338
>  ZSWPOUT-2048kB                     6,464          10,435
>  thp_swpout                             4               3
>  pgmajfault                    21,609,542      11,819,882
>  zswap_reject_compress_fail            50               8
>  zswap_reject_reclaim_fail          2,335           2,377
>  zswap_pool_limit_hit                   0               0
>  zswap_written_back_pages       2,726,367         929,614        -66%
>  IAA incompressible pages              50               8
>  -------------------------------------------------------------------------------
>
> With the iaa_crypto driver changes for non-blocking descriptor allocations,
> no timeouts-with-mitigations were seen in compress/decompress jobs, for all
> of the above experiments.
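>
> For reference, the non-blocking descriptor allocation added in patch 4
> is conceptually of the following form. This is a simplified sketch: the
> timeout value and the error returned on failure are illustrative, not
> the exact driver code.
>
>     /* Sketch: bounded non-blocking descriptor allocation with timeout. */
>     struct idxd_desc *desc;
>     unsigned long timeout = jiffies + msecs_to_jiffies(10); /* illustrative */
>
>     do {
>             desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
>             if (!IS_ERR(desc))
>                     break;
>             cpu_relax();
>     } while (time_before(jiffies, timeout));
>
>     if (IS_ERR(desc))
>             return -ENODEV;  /* caller exercises a mitigation */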
>
>
>  64K folios: Kernel compilation/allmodconfig: zstd:
>  ==================================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable-7-30-2025             v11
>  -------------------------------------------------------------------------------
>  zswap compressor                    zstd            zstd    Improvement
>  -------------------------------------------------------------------------------
>  real_sec                          882.67          837.21       -5.2%
>  user_sec                       15,533.14       15,434.03
>  sys_sec                         3,573.31        2,593.94      -27.4%
>  -------------------------------------------------------------------------------
>  Max_Res_Set_Size_KB            1,872,960       1,872,788
>  -------------------------------------------------------------------------------
>  memcg_high                             0               0
>  memcg_swap_fail                        0               0
>  64kB_swpout_fallback                   0               0
>  zswpout                       42,768,967      22,660,215        -47%
>  zswpin                        10,146,520       4,750,133
>  pswpout                        2,118,745         731,419
>  pswpin                         2,114,735         824,655
>  ZSWPOUT-64kB                   1,484,862         824,976
>  SWPOUT-64kB                            6               3
>  pgmajfault                    12,698,613       5,697,281
>  zswap_reject_compress_fail            13               8
>  zswap_reject_reclaim_fail            624             211
>  zswap_pool_limit_hit                   0               0
>  zswap_written_back_pages       2,109,739         727,919        -65%
>  -------------------------------------------------------------------------------
>
>
>  2M folios: Kernel compilation/allmodconfig: zstd:
>  =================================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable-7-30-2025             v11
>  -------------------------------------------------------------------------------
>  zswap compressor                    zstd            zstd    Improvement
>  -------------------------------------------------------------------------------
>  real_sec                          831.09          813.40       -2.1%
>  user_sec                       15,648.65       15,566.01
>  sys_sec                         4,251.11        3,053.95      -28.2%
>  -------------------------------------------------------------------------------
>  Max_Res_Set_Size_KB            1,872,892       1,874,684
>  -------------------------------------------------------------------------------
>  memcg_high                             0               0
>  memcg_swap_fail                    7,525           1,455
>  thp_swpout_fallback                7,499           1,452
>  zswpout                       59,452,638      35,832,407        -40%
>  zswpin                        17,690,718       9,550,640
>  pswpout                        1,047,676         426,042
>  pswpin                         2,155,989         840,514
>  ZSWPOUT-2048kB                     8,254           8,651
>  thp_swpout                             4               2
>  pgmajfault                    20,278,921      10,581,456
>  zswap_reject_compress_fail            47              20
>  zswap_reject_reclaim_fail          2,342             451
>  zswap_pool_limit_hit                   0               0
>  zswap_written_back_pages       1,041,721         423,334        -59%
>  -------------------------------------------------------------------------------
>
>
>
> IAA configuration script "enable_iaa.sh":
> =========================================
>
>  Acknowledgements: Binuraj Ravindran and Rakib Al-Fahad.
>
>  Usage:
>  ------
>
>    ./enable_iaa.sh -d <num_IAAs> -q <num_WQs_per_IAA>
>
>
>  #---------------------------------<cut here>----------------------------------
>  #!/usr/bin/env bash
>  #SPDX-License-Identifier: BSD-3-Clause
>  #Copyright (c) 2025, Intel Corporation
>  #Description: Configure IAA devices
>
>  VERIFY_COMPRESS_PATH="/sys/bus/dsa/drivers/crypto/verify_compress"
>
>  iax_dev_id="0cfe"
>  num_iaa=$(lspci -d:${iax_dev_id} | wc -l)
>  sockets=$(lscpu | grep Socket | awk '{print $2}')
>  echo "Found ${num_iaa} instances in ${sockets} sockets(s)"
>
>  #  The same number of devices will be configured in each socket, if
>  #  there is more than one socket.
>  #  Normalize with respect to the number of sockets.
>  device_num_per_socket=$(( num_iaa/sockets ))
>  num_iaa_per_socket=$(( num_iaa / sockets ))
>
>  iaa_wqs=2
>  verbose=0
>  iaa_engines=8
>  mode="dedicated"
>  wq_type="kernel"
>  iaa_crypto_mode="async"
>  verify_compress=0
>
>
>  # Function to handle errors
>  handle_error() {
>      echo "Error: $1"
>      exit 1
>  }
>
>  # Process arguments
>
>  while getopts "d:hm:q:vD" opt; do
>    case $opt in
>      d)
>        device_num_per_socket=$OPTARG
>        ;;
>      m)
>        iaa_crypto_mode=$OPTARG
>        ;;
>      q)
>        iaa_wqs=$OPTARG
>        ;;
>      D)
>        verbose=1
>        ;;
>      v)
>        verify_compress=1
>        ;;
>      h)
>        echo "Usage: $0 [-d <device_count>][-q <wq_per_device>][-v]"
>        echo "       -d - number of devices"
>        echo "       -q - number of WQs per device"
>        echo "       -v - verbose mode"
>        echo "       -h - help"
>        exit
>        ;;
>      \?)
>        echo "Invalid option: -$OPTARG" >&2
>        exit
>        ;;
>    esac
>  done
>
>  LOG="configure_iaa.log"
>
>  # Update wq_size based on number of wqs
>  wq_size=$(( 128 / iaa_wqs ))
>
>  # Take care of the enumeration, if DSA is enabled.
>  dsa=`lspci | grep -c 0b25`
>  # set first,step counters to correctly enumerate iax devices based on
>  # whether running on guest or host with or without dsa
>  first=0
>  step=1
>  [[ $dsa -gt 0 && -d /sys/bus/dsa/devices/dsa0 ]] && first=1 && step=2
>  echo "first index: ${first}, step: ${step}"
>
>
>  #
>  # Switch to software compressors and disable IAAs to have a clean start
>  #
>  COMPRESSOR=/sys/module/zswap/parameters/compressor
>  last_comp=`cat ${COMPRESSOR}`
>  echo lzo > ${COMPRESSOR}
>
>  echo "Disable IAA devices before configuring"
>
>  for ((i = ${first}; i < ${step} * ${num_iaa}; i += ${step})); do
>      for ((j = 0; j < ${iaa_wqs}; j += 1)); do
>          cmd="accel-config disable-wq iax${i}/wq${i}.${j} >& /dev/null"
>         [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>       done
>      cmd="accel-config disable-device iax${i} >& /dev/null"
>      [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>  done
>
>  rmmod iaa_crypto
>  modprobe iaa_crypto
>
>  # apply crypto parameters
>  echo $verify_compress > ${VERIFY_COMPRESS_PATH} || handle_error "did not change verify_compress"
>  # Note: This is a temporary solution during the kernel transition.
>  if [ -f /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa ];then
>      echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa || handle_error "did not set g_comp_wqs_per_iaa"
>  elif [ -f /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa ];then
>      echo 1 > /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa || handle_error "did not set g_wqs_per_iaa"
>  fi
>  if [ -f /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq ];then
>      echo 1 > /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq || handle_error "did not set g_consec_descs_per_gwq"
>  fi
>  echo ${iaa_crypto_mode} > /sys/bus/dsa/drivers/crypto/sync_mode || handle_error "could not set sync_mode"
>
>
>
>  echo "Configuring ${device_num_per_socket} device(s) out of $num_iaa_per_socket per socket"
>  if [ "${device_num_per_socket}" -le "${num_iaa_per_socket}" ]; then
>      echo "Configuring all devices"
>      start=${first}
>      end=$(( ${step} * ${device_num_per_socket} ))
>  else
>     echo "ERROR: Not enough devices"
>     exit
>  fi
>
>
>  #
>  # enable all iax devices and wqs
>  #
>  for (( socket = 0; socket < ${sockets}; socket += 1 )); do
>  for ((i = ${start}; i < ${end}; i += ${step})); do
>
>      echo "Configuring iaa$i on socket ${socket}"
>
>      for ((j = 0; j < ${iaa_engines}; j += 1)); do
>          cmd="accel-config config-engine iax${i}/engine${i}.${j} --group-id=0"
>          [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>      done
>
>      # Config  WQs
>      for ((j = 0; j < ${iaa_wqs}; j += 1)); do
>          # Config WQ: group=0, size=${wq_size}, priority=10, mode=${mode}, type=${wq_type}, name=iaa_crypto<i><j>, driver=crypto
>          cmd="accel-config config-wq iax${i}/wq${i}.${j} -g 0 -s ${wq_size} -p 10 -m ${mode} -y ${wq_type} -n iaa_crypto${i}${j} -d crypto"
>          [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>       done
>
>      # Enable Device and WQs
>      cmd="accel-config enable-device iax${i}"
>      [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>
>      for ((j = 0; j < ${iaa_wqs}; j += 1)); do
>          cmd="accel-config enable-wq iax${i}/wq${i}.${j}"
>          [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>       done
>
>  done
>      start=$(( start + ${step} * ${num_iaa_per_socket} ))
>      end=$(( start + (${step} * ${device_num_per_socket}) ))
>  done
>
>  # Restore the last compressor
>  echo "$last_comp" > ${COMPRESSOR}
>
>  # Check if the configuration is correct
>  echo "Configured IAA devices:"
>  accel-config list | grep iax
>
>  #---------------------------------<cut here>----------------------------------
>
>
> Changes since v10:
> ==================
> 1) Rebased to mm-unstable as of 7-30-2025, commit 01da54f10fdd.
> 2) Added change logging in patch 0024 on there being no Intel-specific
>    dependencies in the batching framework, as suggested by
>    Andrew Morton. Thanks Andrew!
> 3) Added change logging in patch 0024 on other ongoing work that can use
>    batching, as per Andrew's suggestion. Thanks Andrew!
> 4) Added the IAA configuration script in the cover letter, as suggested
>    by Nhat Pham. Thanks Nhat!
> 5) As suggested by Nhat, dropped patch 0020 from v10, that moves CPU
>    hotplug procedures to pool functions.
> 6) Gathered kernel_compilation 'allmod' config performance data with
>    writeback and zswap shrinker_enabled=Y.
> 7) Changed the pool->batch_size for software compressors to be
>    ZSWAP_MAX_BATCH_SIZE since this gives better performance with the zswap
>    shrinker enabled.
> 8) Was unable to replicate in v11 the issue seen in v10 with higher
>    memcg_swap_fail than in the baseline, with usemem30/zstd.
>
> Changes since v9:
> =================
> 1) Rebased to mm-unstable as of 6-24-2025, commit 23b9c0472ea3.
> 2) iaa_crypto rearchitecting, mainline race condition fix, performance
>    optimizations, code cleanup.
> 3) Addressed Herbert's comments in v9 patch 10, that an array based
>    crypto_acomp interface is not acceptable.
> 4) Optimized the implementation of the batching zswap_compress() and
>    zswap_store_pages() added in v9, to recover performance when
>    integrated with the changes in commit 56e5a103a721 ("zsmalloc: prefer
>    the the original page's node for compressed data").
>
> Changes since v8:
> =================
> 1) Rebased to mm-unstable as of 4-21-2025, commit 2c01d9f3c611.
> 2) Backported commits for reverting request chaining, since these are
>    in cryptodev-2.6 but not yet in mm-unstable: without these backports,
>    deflate-iaa is non-functional in mm-unstable:
>    commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining")
>    commit 5976fe19e240 ("Revert "crypto: testmgr - Add multibuffer acomp
>                          testing"")
>    Backported this hotfix as well:
>    commit 002ba346e3d7 ("crypto: scomp - Fix off-by-one bug when
>    calculating last page").
> 3) crypto_acomp_[de]compress() restored to non-request chained
>    implementations since request chaining has been removed from acomp in
>    commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining").
> 4) New IAA WQ architecture to denote WQ type and whether or not a WQ
>    should be shared among all package cores, or only to the "mapped"
>    ones from an even cores-to-IAA distribution scheme.
> 5) Compress/decompress batching are implemented in iaa_crypto using new
>    crypto_acomp_batch_compress()/crypto_acomp_batch_decompress() API.
> 6) Defines a "void *data" in struct acomp_req, based on Herbert advising
>    against using req->base.data in the driver. This is needed for async
>    submit-poll to work.
> 7) In zswap.c, moved the CPU hotplug callbacks to reside in "pool
>    functions", per Yosry's suggestion to move procedures in a distinct
>    patch before refactoring patches.
> 8) A new "u8 nr_reqs" member is added to "struct zswap_pool" to track
>    the number of requests/buffers associated with the per-cpu acomp_ctx,
>    as per Yosry's suggestion.
> 9) Simplifications to the acomp_ctx resources allocation, deletion,
>    locking, and for these to exist from pool creation to pool deletion,
>    based on v8 code review discussions with Yosry.
> 10) Use IS_ERR_OR_NULL() consistently in zswap_cpu_comp_prepare() and
>     acomp_ctx_dealloc(), as per Yosry's v8 comment.
> 11) zswap_store_folio() is deleted, and instead, the loop over
>     zswap_store_pages() is moved inline in zswap_store(), per Yosry's
>     suggestion.
> 12) Better structure in zswap_compress(), unified procedure that
>     compresses/stores a batch of pages for both, non-batching and
>     batching compressors. Renamed from zswap_batch_compress() to
>     zswap_compress(): Thanks Yosry for these suggestions.
>
>
> Changes since v7:
> =================
> 1) Rebased to mm-unstable as of 3-3-2025, commit 5f089a9aa987.
> 2) Changed the acomp_ctx->nr_reqs to be u8 since ZSWAP_MAX_BATCH_SIZE is
>    defined as 8U, for saving memory in this per-cpu structure.
> 3) Fixed a typo in code comments in acomp_ctx_get_cpu_lock():
>    acomp_ctx->initialized to acomp_ctx->__online.
> 4) Incorporated suggestions from Yosry, Chengming, Nhat and Johannes,
>    thanks to all!
>    a) zswap_batch_compress() replaces zswap_compress(). Thanks Yosry
>       for this suggestion!
>    b) Process the folio in sub-batches of ZSWAP_MAX_BATCH_SIZE, regardless
>       of whether or not the compressor supports batching. This gets rid of
>       the kmalloc(entries), and allows us to allocate an array of
>       ZSWAP_MAX_BATCH_SIZE entries on the stack. This is implemented in
>       zswap_store_pages().
>    c) Use of a common structure and code paths for compressing a folio in
>       batches, either as a request chain (in parallel in IAA hardware) or
>       sequentially. No code duplication since zswap_compress() has been
>       replaced with zswap_batch_compress(), simplifying maintainability.
> 5) A key difference between compressors that support batching and
>    those that do not, is that for the latter, the acomp_ctx mutex is
>    locked/unlocked per ZSWAP_MAX_BATCH_SIZE batch, so that decompressions
>    to handle page-faults can make progress. This fixes the zstd kernel
>    compilation regression seen in v7. For compressors that support
>    batching, for e.g. IAA, the mutex is locked/released once for storing
>    the folio.
> 6) Used likely/unlikely compiler directives and prefetchw to restore
>    performance with the common code paths.
>
> Changes since v6:
> =================
> 1) Rebased to mm-unstable as of 2-27-2025, commit d58172d128ac.
>
> 2) Deleted crypto_acomp_batch_compress() and
>    crypto_acomp_batch_decompress() interfaces, as per Herbert's
>    suggestion. Batching is instead enabled by chaining the requests. For
>    non-batching compressors, there is no request chaining involved. Both,
>    batching and non-batching compressions are accomplished by zswap by
>    calling:
>
>    crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
>
> 3) iaa_crypto implementation of batch compressions/decompressions using
>    request chaining, as per Herbert's suggestions.
> 4) Simplification of the acomp_ctx resource allocation/deletion with
>    respect to CPU hot[un]plug, to address Yosry's suggestions to explore the
>    mutex options in zswap_cpu_comp_prepare(). Yosry, please let me know if
>    the per-cpu memory cost of this proposed change is acceptable (IAA:
>    64.8KB, Software compressors: 8.2KB). On the positive side, I believe
>    restarting reclaim on a CPU after it has been through an offline-online
>    transition, will be much faster by not deleting the acomp_ctx resources
>    when the CPU gets offlined.
> 5) Use of lockdep assertions rather than comments for internal locking
>    rules, as per Yosry's suggestion.
> 6) No specific references to IAA in zswap.c, as suggested by Yosry.
> 7) Explored various solutions other than the v6 zswap_store_folio()
>    implementation, to fix the zstd regression seen in v5, to attempt to
>    unify common code paths, and to allocate smaller arrays for the zswap
>    entries on the stack. All these options were found to cause usemem30
>    latency regression with zstd. The v6 version of zswap_store_folio() is
>    the only implementation that does not cause zstd regression, confirmed
>    by 10 consecutive runs, each giving quite consistent latency
>    numbers. Hence, the v6 implementation is carried forward to v7, with
>    changes for branching for batching vs. sequential compression API
>    calls.
>
>
> Changes since v5:
> =================
> 1) Rebased to mm-unstable as of 2-1-2025, commit 7de6fd8ab650.
>
> Several improvements, regression fixes and bug fixes, based on Yosry's
> v5 comments (Thanks Yosry!):
>
> 2) Fix for zstd performance regression in v5.
> 3) Performance debug and fix for marginal improvements with IAA batching
>    vs. sequential.
> 4) Performance testing data compares IAA with and without batching, instead
>    of IAA batching against zstd.
> 5) Commit logs/zswap comments not mentioning crypto_acomp implementation
>    details.
> 6) Delete the pr_info_once() when batching resources are allocated in
>    zswap_cpu_comp_prepare().
> 7) Use kcalloc_node() for the multiple acomp_ctx buffers/reqs in
>    zswap_cpu_comp_prepare().
> 8) Simplify and consolidate error handling cleanup code in
>    zswap_cpu_comp_prepare().
> 9) Introduce zswap_compress_folio() in a separate patch.
> 10) Bug fix in zswap_store_folio() when xa_store() failure can cause all
>     compressed objects and entries to be freed, and UAF when zswap_store()
>     tries to free the entries that were already added to the xarray prior
>     to the failure.
> 11) Deleting compressed_bytes/bytes. zswap_store_folio() also comprehends
>     the recent fixes in commit bf5eaaaf7941 ("mm/zswap: fix inconsistency
>     when zswap_store_page() fails") by Hyeonggon Yoo.
>
> iaa_crypto improvements/fixes/changes:
>
> 12) Enables asynchronous mode and makes it the default. With commit
>     4ebd9a5ca478 ("crypto: iaa - Fix IAA disabling that occurs when
>     sync_mode is set to 'async'"), async mode was previously just sync. We
>     now have true async support.
> 13) Change idxd descriptor allocations from blocking to non-blocking with
>     timeouts, and mitigations for compress/decompress ops that fail to
>     obtain a descriptor. This is a fix for tasks blocked errors seen in
>     configurations where 30+ cores are running workloads under high memory
>     pressure, and sending comps/decomps to 1 IAA device.
> 14) Fixes a bug with unprotected access of "deflate_generic_tfm" in
>     deflate_generic_decompress(), which can cause data corruption and
>     zswap_decompress() kernel crash.
> 15) zswap uses crypto_acomp_batch_compress() with async polling instead of
>     request chaining for slightly better latency. However, the request
>     chaining framework itself is unchanged, preserved from v5.
>
>
> Changes since v4:
> =================
> 1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6.
> 2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert!
> 3) Implemented IAA compress batching using request chaining.
> 4) zswap_store() batching simplifications suggested by Chengming, Yosry and
>    Nhat, thanks to all!
>    - New zswap_compress_folio() that is called by zswap_store().
>    - Move the loop over folio's pages out of zswap_store() and into a
>      zswap_store_folio() that stores all pages.
>    - Allocate all zswap entries for the folio upfront.
>    - Added zswap_batch_compress().
>    - Branch to call zswap_compress() or zswap_batch_compress() inside
>      zswap_compress_folio().
>    - All iterations over pages kept in same function level.
>    - No helpers other than the newly added zswap_store_folio() and
>      zswap_compress_folio().
>
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
> 2) Major re-write of iaa_crypto driver's mapping of IAA devices to cores,
>    based on packages instead of NUMA nodes.
> 3) Added acomp_has_async_batching() API to crypto acomp, that allows
>    zswap/zram to query if a crypto_acomp has registered batch_compress and
>    batch_decompress interfaces.
> 4) Clear the poll bits on the acomp_reqs passed to
>    iaa_comp_a[de]compress_batch() so that a module like zswap can be
>    confident about the acomp_reqs[0] not having the poll bit set before
>    calling the fully synchronous API crypto_acomp_[de]compress().
>    Herbert, I would appreciate it if you can review changes 2-4; in patches
>    1-8 in v4. I did not want to introduce too many iaa_crypto changes in
>    v4, given that patch 7 is already making a major change. I plan to work
>    on incorporating the request chaining using the ahash interface in v5
>    (I need to understand the basic crypto ahash better). Thanks Herbert!
> 5) Incorporated Johannes' suggestion to not have a sysctl to enable
>    compress batching.
> 6) Incorporated Yosry's suggestion to allocate batching resources in the
>    cpu hotplug onlining code, since there is no longer a sysctl to control
>    batching. Thanks Yosry!
> 7) Incorporated Johannes' suggestions related to making the overall
>    sequence of events between zswap_store() and zswap_batch_store() similar
>    as much as possible for readability and control flow, better naming of
>    procedures, avoiding forward declarations, not inlining error path
>    procedures, deleting zswap internal details from zswap.h, etc. Thanks
>    Johannes, really appreciate the direction!
>    I have tried to explain the minimal future-proofing in terms of the
>    zswap_batch_store() signature and the definition of "struct
>    zswap_batch_store_sub_batch" in the comments for this struct. I hope the
>    new code explains the control flow a bit better.
>
>
> Changes since v2:
> =================
> 1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
> 2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
>    returned by kmalloc_node() for acomp_ctx->buffers and for
>    acomp_ctx->reqs.
> 3) Fixed a bug in zswap_pool_can_batch() for returning true if
>    pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and if
>    the per-cpu acomp_batch_ctx tests true for batching resources having
>    been allocated on this cpu. Also, changed from per_cpu_ptr() to
>    raw_cpu_ptr().
> 4) Incorporated the zswap_store_propagate_errors() compilation warning fix
>    suggested by Dan Carpenter. Thanks Dan!
> 5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
>    zswap.h, with SWAP_CRYPTO_BATCH_SIZE.
>
> Changes since v1:
> =================
> 1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
> 2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
>    async/poll mode, and to encapsulate the polling functionality in the
>    iaa_crypto driver. Thanks Herbert!
> 3) Incorporated Herbert's and Yosry's suggestions to implement the batching
>    API in iaa_crypto and to make its use seamless from zswap's
>    perspective. Thanks Herbert and Yosry!
> 4) Incorporated Yosry's suggestion to make it more convenient for the user
>    to enable compress batching, while minimizing the memory footprint
>    cost. Thanks Yosry!
> 5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
>    reclaim batching patch from this series, since it requires a broader
>    discussion.
>
>
> I would greatly appreciate code review comments for the iaa_crypto driver
> and mm patches included in this series!
>
> Thanks,
> Kanchana
>
>
>
>
> Kanchana P Sridhar (24):
>   crypto: iaa - Reorganize the iaa_crypto driver code.
>   crypto: iaa - New architecture for IAA device WQ comp/decomp usage &
>     core mapping.
>   crypto: iaa - Simplify, consistency of function parameters, minor
>     stats bug fix.
>   crypto: iaa - Descriptor allocation timeouts with mitigations.
>   crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting.
>   crypto: iaa - Simplify the code flow in iaa_compress() and
>     iaa_decompress().
>   crypto: iaa - Refactor hardware descriptor setup into separate
>     procedures.
>   crypto: iaa - Simplified, efficient job submissions for non-irq mode.
>   crypto: iaa - Deprecate exporting add/remove IAA compression modes.
>   crypto: iaa - Rearchitect the iaa_crypto driver to be usable by zswap
>     and zram.
>   crypto: iaa - Enablers for submitting descriptors then polling for
>     completion.
>   crypto: acomp - Add "void *kernel_data" in "struct acomp_req" for
>     kernel users.
>   crypto: iaa - IAA Batching for parallel compressions/decompressions.
>   crypto: iaa - Enable async mode and make it the default.
>   crypto: iaa - Disable iaa_verify_compress by default.
>   crypto: iaa - Submit the two largest source buffers first in
>     decompress batching.
>   crypto: iaa - Add deflate-iaa-dynamic compression mode.
>   crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's
>     batch-size.
>   crypto: iaa - IAA acomp_algs register the get_batch_size() interface.
>   mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to
>     deletion.
>   mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx
>     resources.
>   mm: zswap: Allocate pool batching resources if the compressor supports
>     batching.
>   mm: zswap: zswap_store() will process a large folio in batches.
>   mm: zswap: Batched zswap_compress() with compress batching of large
>     folios.
>
>  .../driver-api/crypto/iaa/iaa-crypto.rst      |  168 +-
>  crypto/acompress.c                            |    1 +
>  crypto/testmgr.c                              |   10 +
>  crypto/testmgr.h                              |   74 +
>  drivers/crypto/intel/iaa/Makefile             |    4 +-
>  drivers/crypto/intel/iaa/iaa_crypto.h         |   59 +-
>  .../intel/iaa/iaa_crypto_comp_dynamic.c       |   22 +
>  drivers/crypto/intel/iaa/iaa_crypto_main.c    | 2902 ++++++++++++-----
>  drivers/crypto/intel/iaa/iaa_crypto_stats.c   |    8 +
>  drivers/crypto/intel/iaa/iaa_crypto_stats.h   |    2 +
>  include/crypto/acompress.h                    |   30 +
>  include/crypto/internal/acompress.h           |    3 +
>  include/linux/iaa_comp.h                      |  159 +
>  mm/swap.h                                     |   23 +
>  mm/zswap.c                                    |  646 ++--
>  15 files changed, 3085 insertions(+), 1026 deletions(-)
>  create mode 100644 drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c
>  create mode 100644 include/linux/iaa_comp.h
>
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver
  2025-08-08 23:51 ` [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Nhat Pham
@ 2025-08-09  0:03   ` Sridhar, Kanchana P
  2025-08-15  5:27   ` Herbert Xu
  1 sibling, 0 replies; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-09  0:03 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	herbert@gondor.apana.org.au, davem@davemloft.net,
	clabbe@baylibre.com, ardb@kernel.org, ebiggers@google.com,
	surenb@google.com, Accardi, Kristen C, Gomes, Vinicius,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Friday, August 8, 2025 4:51 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 00/24] zswap compression batching with optimized
> iaa_crypto driver
> 
> On Thu, Jul 31, 2025 at 9:36 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >

[snip]

> > Many thanks to Nhat for suggesting ideas on using batching with the
> > ongoing kcompressd work, as well as beneficially using decompression
> > batching & block IO batching to improve zswap writeback efficiency.
> 
> My pleasure :)

Thanks Nhat!

[snip]

> I see a lot of good numbers for both IAA and zstd here. Thanks for
> working on it, Kanchana!

Thanks again, Nhat! It has been a most rewarding experience :)
I have learned so much from all the maintainers. Thanks for taking
the time to review and give feedback on the design, code reviews, etc.

Best regards,
Kanchana



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-01  4:36 ` [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
@ 2025-08-14 20:58   ` Nhat Pham
  2025-08-14 22:05     ` Sridhar, Kanchana P
  2025-08-26  3:48   ` Barry Song
  1 sibling, 1 reply; 68+ messages in thread
From: Nhat Pham @ 2025-08-14 20:58 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
	senozhatsky, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, vinicius.gomes, wajdi.k.feghali,
	vinodh.gopal

On Thu, Jul 31, 2025 at 9:36 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch sets up zswap for allocating per-CPU resources optimally for
> non-batching and batching compressors.
>
> A new ZSWAP_MAX_BATCH_SIZE constant is defined as 8U, to set an upper
> limit on the number of pages in large folios that will be batch
> compressed.
>
> As per Herbert's comments in [2] in response to the
> crypto_acomp_batch_compress() and crypto_acomp_batch_decompress() API
> proposed in [1], this series does not create new crypto_acomp batching
> API. Instead, zswap compression batching uses the existing
> crypto_acomp_compress() API in combination with the "void *kernel_data"
> member added to "struct acomp_req" earlier in this series.
>
> It is up to the compressor to manage multiple requests, as needed, to
> accomplish batch parallelism. zswap only needs to allocate the per-CPU
> dst buffers according to the batch size supported by the compressor.
>
> A "u8 compr_batch_size" member is added to "struct zswap_pool", as per
> Yosry's suggestion. pool->compr_batch_size is set as the minimum of the
> compressor's max batch-size and ZSWAP_MAX_BATCH_SIZE. Accordingly, it
> proceeds to allocate the necessary compression dst buffers in the
> per-CPU acomp_ctx.
>
> Another "u8 batch_size" member is added to "struct zswap_pool" to store
> the unit for batching large folio stores: for batching compressors, this
> is the pool->compr_batch_size. For non-batching compressors, this is
> ZSWAP_MAX_BATCH_SIZE.
>
> zswap does not use more than one dst buffer yet. Follow-up patches will
> actually utilize the multiple acomp_ctx buffers for batch
> compression/decompression of multiple pages.
>
> Thus, ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used for
> batching. There is a small extra memory overhead of allocating
> the acomp_ctx->buffers array for compressors that do not support
> batching: On x86_64, the overhead is 1 pointer per-CPU (i.e. 8 bytes).
>
> [1]: https://patchwork.kernel.org/project/linux-mm/patch/20250508194134.28392-11-kanchana.p.sridhar@intel.com/
> [2]: https://patchwork.kernel.org/comment/26382610
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

Mostly LGTM. Just a couple of questions below:

> ---
>  mm/zswap.c | 82 +++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 63 insertions(+), 19 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index efd501a7fe294..63a997b999537 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -80,6 +80,9 @@ static bool zswap_pool_reached_full;
>
>  #define ZSWAP_PARAM_UNSET ""
>
> +/* Limit the batch size to limit per-CPU memory usage for dst buffers. */
> +#define ZSWAP_MAX_BATCH_SIZE 8U
> +
>  static int zswap_setup(void);
>
>  /* Enable/disable zswap */
> @@ -147,7 +150,7 @@ struct crypto_acomp_ctx {
>         struct crypto_acomp *acomp;
>         struct acomp_req *req;
>         struct crypto_wait wait;
> -       u8 *buffer;
> +       u8 **buffers;
>         struct mutex mutex;
>         bool is_sleepable;
>  };
> @@ -166,6 +169,8 @@ struct zswap_pool {
>         struct work_struct release_work;
>         struct hlist_node node;
>         char tfm_name[CRYPTO_MAX_ALG_NAME];
> +       u8 compr_batch_size;
> +       u8 batch_size;

Apologies if this is explained elsewhere, but I'm very confused - why
do we need both of these two fields?

Seems like batch_size is defined below, and never changed:

      pool->batch_size = (pool->compr_batch_size > 1) ?
                            pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;

Can we just determine this in zswap_store() as a local variable?


>  };
>
>  /* Global LRU lists shared by all zswap pools. */
> @@ -258,8 +263,10 @@ static void __zswap_pool_empty(struct percpu_ref *ref);
>   *   zswap_cpu_comp_prepare(), not others.
>   * - Cleanup acomp_ctx resources on all cores in zswap_pool_destroy().
>   */
> -static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
>  {
> +       u8 i;
> +
>         if (IS_ERR_OR_NULL(acomp_ctx))
>                 return;
>
> @@ -269,7 +276,11 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
>         if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
>                 crypto_free_acomp(acomp_ctx->acomp);
>
> -       kfree(acomp_ctx->buffer);
> +       if (acomp_ctx->buffers) {
> +               for (i = 0; i < nr_buffers; ++i)
> +                       kfree(acomp_ctx->buffers[i]);
> +               kfree(acomp_ctx->buffers);
> +       }
>  }
>
>  static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> @@ -290,6 +301,7 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
>                         return NULL;
>         }
>
> +       /* Many things rely on the zero-initialization. */
>         pool = kzalloc(sizeof(*pool), GFP_KERNEL);
>         if (!pool)
>                 return NULL;
> @@ -352,13 +364,28 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
>                 goto ref_fail;
>         INIT_LIST_HEAD(&pool->list);
>
> +       /*
> +        * Set the unit of compress batching for large folios, for quick
> +        * retrieval in the zswap_compress() fast path:
> +        * If the compressor is sequential (@pool->compr_batch_size is 1),
> +        * large folios will be compressed in batches of ZSWAP_MAX_BATCH_SIZE
> +        * pages, where each page in the batch is compressed sequentially.
> +        * We see better performance by processing the folio in batches of
> +        * ZSWAP_MAX_BATCH_SIZE, due to cache locality of working set
> +        * structures.
> +        */
> +       pool->batch_size = (pool->compr_batch_size > 1) ?
> +                               pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
> +
>         zswap_pool_debug("created", pool);
>
>         return pool;
>
>  ref_fail:
>         for_each_possible_cpu(cpu)
> -               acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> +               acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu),
> +                                 pool->compr_batch_size);
> +
>  error:
>         if (pool->acomp_ctx)
>                 free_percpu(pool->acomp_ctx);
> @@ -417,7 +444,8 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
>         zswap_pool_debug("destroying", pool);
>
>         for_each_possible_cpu(cpu)
> -               acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> +               acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu),
> +                                 pool->compr_batch_size);
>
>         free_percpu(pool->acomp_ctx);
>
> @@ -876,6 +904,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
>         struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>         int ret = -ENOMEM;
> +       u8 i;
>
>         /*
>          * The per-CPU pool->acomp_ctx is zero-initialized on allocation.
> @@ -888,10 +917,6 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
>                 return 0;
>
> -       acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> -       if (!acomp_ctx->buffer)
> -               return ret;
> -
>         acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
>         if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
>                 pr_err("could not alloc crypto acomp %s : %ld\n",
> @@ -904,17 +929,36 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
>         if (IS_ERR_OR_NULL(acomp_ctx->req)) {
>                 pr_err("could not alloc crypto acomp_request %s\n",
> -                      pool->tfm_name);
> +                       pool->tfm_name);

Is this intentional? :)

>                 goto fail;
>         }
>
> -       crypto_init_wait(&acomp_ctx->wait);
> +       /*
> +        * Allocate up to ZSWAP_MAX_BATCH_SIZE dst buffers if the
> +        * compressor supports batching.
> +        */
> +       pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
> +                                    crypto_acomp_batch_size(acomp_ctx->acomp));
> +
> +       acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8 *),
> +                                         GFP_KERNEL, cpu_to_node(cpu));
> +       if (!acomp_ctx->buffers)
> +               goto fail;
> +
> +       for (i = 0; i < pool->compr_batch_size; ++i) {
> +               acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
> +                                                    cpu_to_node(cpu));
> +               if (!acomp_ctx->buffers[i])
> +                       goto fail;
> +       }
>
>         /*
>          * if the backend of acomp is async zip, crypto_req_done() will wakeup
>          * crypto_wait_req(); if the backend of acomp is scomp, the callback
>          * won't be called, crypto_wait_req() will return without blocking.
>          */
> +       crypto_init_wait(&acomp_ctx->wait);
> +
>         acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
>                                    crypto_req_done, &acomp_ctx->wait);
>
> @@ -922,7 +966,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         return 0;
>
>  fail:
> -       acomp_ctx_dealloc(acomp_ctx);
> +       acomp_ctx_dealloc(acomp_ctx, pool->compr_batch_size);
>         return ret;
>  }
>
> @@ -942,7 +986,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>
>         mutex_lock(&acomp_ctx->mutex);
>
> -       dst = acomp_ctx->buffer;
> +       dst = acomp_ctx->buffers[0];
>         sg_init_table(&input, 1);
>         sg_set_page(&input, page, PAGE_SIZE, 0);
>
> @@ -1003,19 +1047,19 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>
>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>         mutex_lock(&acomp_ctx->mutex);
> -       obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
> +       obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffers[0]);
>
>         /*
>          * zpool_obj_read_begin() might return a kmap address of highmem when
> -        * acomp_ctx->buffer is not used.  However, sg_init_one() does not
> -        * handle highmem addresses, so copy the object to acomp_ctx->buffer.
> +        * acomp_ctx->buffers[0] is not used.  However, sg_init_one() does not
> +        * handle highmem addresses, so copy the object to acomp_ctx->buffers[0].
>          */
>         if (virt_addr_valid(obj)) {
>                 src = obj;
>         } else {
> -               WARN_ON_ONCE(obj == acomp_ctx->buffer);
> -               memcpy(acomp_ctx->buffer, obj, entry->length);
> -               src = acomp_ctx->buffer;
> +               WARN_ON_ONCE(obj == acomp_ctx->buffers[0]);
> +               memcpy(acomp_ctx->buffers[0], obj, entry->length);
> +               src = acomp_ctx->buffers[0];
>         }
>
>         sg_init_one(&input, src, entry->length);
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 23/24] mm: zswap: zswap_store() will process a large folio in batches.
  2025-08-01  4:36 ` [PATCH v11 23/24] mm: zswap: zswap_store() will process a large folio in batches Kanchana P Sridhar
@ 2025-08-14 21:05   ` Nhat Pham
  2025-08-14 22:10     ` Sridhar, Kanchana P
  2025-08-28 23:59   ` Barry Song
  1 sibling, 1 reply; 68+ messages in thread
From: Nhat Pham @ 2025-08-14 21:05 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
	senozhatsky, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, vinicius.gomes, wajdi.k.feghali,
	vinodh.gopal

On Thu, Jul 31, 2025 at 9:36 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch modifies zswap_store() to store a batch of pages in large
> folios at a time, instead of storing one page at a time. It does this by
> calling a new procedure zswap_store_pages() with a range of
> "pool->batch_size" indices in the folio.
>
> zswap_store_pages() implements all the computation done earlier in
> zswap_store_page() for a single page, for multiple pages in a folio,
> namely the "batch":
>
> 1) It starts by allocating all zswap entries required to store the
>    batch. New procedures, zswap_entries_cache_alloc_batch() and
>    zswap_entries_cache_free_batch() call kmem_cache_[free]alloc_bulk()
>    to optimize the performance of this step.
>
> 2) Next, the entries' fields are written, computation that needs to happen
>    anyway, without modifying the zswap xarray/LRU publishing order. This
>    improves latency by avoiding having to bring the entries into the
>    cache for writing in different code blocks within this procedure.
>
> 3) Next, it calls zswap_compress() to sequentially compress each page in
>    the batch.
>
> 4) Finally, it adds the batch's zswap entries to the xarray and LRU,
>    charges zswap memory and increments zswap stats.
>
> 5) The error handling and cleanup required for all failure scenarios
>    that can occur while storing a batch in zswap are consolidated to a
>    single "store_pages_failed" label in zswap_store_pages(). Here again,
>    we optimize performance by calling kmem_cache_free_bulk().
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 218 ++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 149 insertions(+), 69 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 63a997b999537..8ca69c3f30df2 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -879,6 +879,24 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
>         kmem_cache_free(zswap_entry_cache, entry);
>  }
>
> +/*
> + * Returns 0 if kmem_cache_alloc_bulk() failed and a positive number otherwise.
> + * The code for __kmem_cache_alloc_bulk() indicates that this positive number
> + * will be the @size requested, i.e., @nr_entries.
> + */
> +static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
> +                                                          unsigned int nr_entries,
> +                                                          gfp_t gfp)
> +{
> +       return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries, entries);
> +}
> +
> +static __always_inline void zswap_entries_cache_free_batch(void **entries,
> +                                                          unsigned int nr_entries)
> +{
> +       kmem_cache_free_bulk(zswap_entry_cache, nr_entries, entries);
> +}
> +
>  /*
>   * Carries out the common pattern of freeing and entry's zpool allocation,
>   * freeing the entry itself, and decrementing the number of stored pages.
> @@ -1512,93 +1530,154 @@ static void shrink_worker(struct work_struct *w)
>  * main API
>  **********************************/
>
> -static bool zswap_store_page(struct page *page,
> -                            struct obj_cgroup *objcg,
> -                            struct zswap_pool *pool)
> +/*
> + * Store multiple pages in @folio, starting from the page at index @start up to
> + * the page at index @end-1.
> + */
> +static bool zswap_store_pages(struct folio *folio,
> +                             long start,
> +                             long end,
> +                             struct obj_cgroup *objcg,
> +                             struct zswap_pool *pool,
> +                             int node_id)
>  {
> -       swp_entry_t page_swpentry = page_swap_entry(page);
> -       struct zswap_entry *entry, *old;
> -
> -       /* allocate entry */
> -       entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> -       if (!entry) {
> -               zswap_reject_kmemcache_fail++;
> -               return false;
> +       struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
> +       u8 i, store_fail_idx = 0, nr_pages = end - start;
> +
> +       if (unlikely(!zswap_entries_cache_alloc_batch((void **)&entries[0],
> +                                                     nr_pages, GFP_KERNEL))) {
> +               for (i = 0; i < nr_pages; ++i) {
> +                       entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> +
> +                       if (unlikely(!entries[i])) {
> +                               zswap_reject_kmemcache_fail++;
> +                               /*
> +                                * While handling this error, we only need to
> +                                * call zswap_entries_cache_free_batch() for
> +                                * entries[0 .. i-1].
> +                                */
> +                               nr_pages = i;
> +                               goto store_pages_failed;
> +                       }
> +               }
>         }
>
> -       if (!zswap_compress(page, entry, pool))
> -               goto compress_failed;
> +       /*
> +        * Three sets of initializations are done to minimize bringing
> +        * @entries into the cache for writing at different parts of this
> +        * procedure, since doing so regresses performance:
> +        *
> +        * 1) Do all the writes to each entry in one code block. These
> +        *    writes need to be done anyway upon success which is more likely
> +        *    than not.
> +        *
> +        * 2) Initialize the handle to an error value. This facilitates
> +        *    having a consolidated failure handling
> +        *    'goto store_pages_failed' that can inspect the value of the
> +        *    handle to determine whether zpool memory needs to be
> +        *    de-allocated.
> +        *
> +        * 3) The page_swap_entry() is obtained once and stored in the entry.
> +        *    Subsequent store in xarray gets the entry->swpentry instead of
> +        *    calling page_swap_entry(), minimizing computes.
> +        */
> +       for (i = 0; i < nr_pages; ++i) {
> +               entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> +               entries[i]->pool = pool;
> +               entries[i]->swpentry = page_swap_entry(folio_page(folio, start + i));
> +               entries[i]->objcg = objcg;
> +               entries[i]->referenced = true;
> +               INIT_LIST_HEAD(&entries[i]->lru);
> +       }
>
> -       old = xa_store(swap_zswap_tree(page_swpentry),
> -                      swp_offset(page_swpentry),
> -                      entry, GFP_KERNEL);
> -       if (xa_is_err(old)) {
> -               int err = xa_err(old);
> +       for (i = 0; i < nr_pages; ++i) {
> +               struct page *page = folio_page(folio, start + i);
>
> -               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> -               zswap_reject_alloc_fail++;
> -               goto store_failed;
> +               if (!zswap_compress(page, entries[i], pool))
> +                       goto store_pages_failed;
>         }
>
> -       /*
> -        * We may have had an existing entry that became stale when
> -        * the folio was redirtied and now the new version is being
> -        * swapped out. Get rid of the old.
> -        */
> -       if (old)
> -               zswap_entry_free(old);
> +       for (i = 0; i < nr_pages; ++i) {
> +               struct zswap_entry *old, *entry = entries[i];
>
> -       /*
> -        * The entry is successfully compressed and stored in the tree, there is
> -        * no further possibility of failure. Grab refs to the pool and objcg,
> -        * charge zswap memory, and increment zswap_stored_pages.
> -        * The opposite actions will be performed by zswap_entry_free()
> -        * when the entry is removed from the tree.
> -        */
> -       zswap_pool_get(pool);
> -       if (objcg) {
> -               obj_cgroup_get(objcg);
> -               obj_cgroup_charge_zswap(objcg, entry->length);
> -       }
> -       atomic_long_inc(&zswap_stored_pages);
> +               old = xa_store(swap_zswap_tree(entry->swpentry),
> +                              swp_offset(entry->swpentry),
> +                              entry, GFP_KERNEL);
> +               if (unlikely(xa_is_err(old))) {
> +                       int err = xa_err(old);
>
> -       /*
> -        * We finish initializing the entry while it's already in xarray.
> -        * This is safe because:
> -        *
> -        * 1. Concurrent stores and invalidations are excluded by folio lock.
> -        *
> -        * 2. Writeback is excluded by the entry not being on the LRU yet.
> -        *    The publishing order matters to prevent writeback from seeing
> -        *    an incoherent entry.
> -        */
> -       entry->pool = pool;
> -       entry->swpentry = page_swpentry;
> -       entry->objcg = objcg;
> -       entry->referenced = true;
> -       if (entry->length) {
> -               INIT_LIST_HEAD(&entry->lru);
> -               zswap_lru_add(&zswap_list_lru, entry);
> +                       WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> +                       zswap_reject_alloc_fail++;
> +                       /*
> +                        * Entries up to this point have been stored in the
> +                        * xarray. zswap_store() will erase them from the xarray
> +                        * and call zswap_entry_free(). Local cleanup in
> +                        * 'store_pages_failed' only needs to happen for
> +                        * entries from [@i to @nr_pages).
> +                        */
> +                       store_fail_idx = i;
> +                       goto store_pages_failed;
> +               }
> +
> +               /*
> +                * We may have had an existing entry that became stale when
> +                * the folio was redirtied and now the new version is being
> +                * swapped out. Get rid of the old.
> +                */
> +               if (unlikely(old))
> +                       zswap_entry_free(old);
> +
> +               /*
> +                * The entry is successfully compressed and stored in the tree, there is
> +                * no further possibility of failure. Grab refs to the pool and objcg,
> +                * charge zswap memory, and increment zswap_stored_pages.
> +                * The opposite actions will be performed by zswap_entry_free()
> +                * when the entry is removed from the tree.
> +                */
> +               zswap_pool_get(pool);
> +               if (objcg) {
> +                       obj_cgroup_get(objcg);
> +                       obj_cgroup_charge_zswap(objcg, entry->length);
> +               }
> +               atomic_long_inc(&zswap_stored_pages);
> +
> +               /*
> +                * We finish by adding the entry to the LRU while it's already
> +                * in xarray. This is safe because:
> +                *
> +                * 1. Concurrent stores and invalidations are excluded by folio lock.
> +                *
> +                * 2. Writeback is excluded by the entry not being on the LRU yet.
> +                *    The publishing order matters to prevent writeback from seeing
> +                *    an incoherent entry.
> +                */
> +               if (likely(entry->length))
> +                       zswap_lru_add(&zswap_list_lru, entry);
>         }
>
>         return true;
>
> -store_failed:
> -       zpool_free(pool->zpool, entry->handle);
> -compress_failed:
> -       zswap_entry_cache_free(entry);
> +store_pages_failed:
> +       for (i = store_fail_idx; i < nr_pages; ++i) {
> +               if (!IS_ERR_VALUE(entries[i]->handle))
> +                       zpool_free(pool->zpool, entries[i]->handle);
> +       }
> +       zswap_entries_cache_free_batch((void **)&entries[store_fail_idx],
> +                                      nr_pages - store_fail_idx);
> +
>         return false;
>  }
>
>  bool zswap_store(struct folio *folio)
>  {
>         long nr_pages = folio_nr_pages(folio);
> +       int node_id = folio_nid(folio);
>         swp_entry_t swp = folio->swap;
>         struct obj_cgroup *objcg = NULL;
>         struct mem_cgroup *memcg = NULL;
>         struct zswap_pool *pool;
>         bool ret = false;
> -       long index;
> +       long start, end;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1632,10 +1711,11 @@ bool zswap_store(struct folio *folio)
>                 mem_cgroup_put(memcg);
>         }
>
> -       for (index = 0; index < nr_pages; ++index) {
> -               struct page *page = folio_page(folio, index);
> +       /* Store the folio in batches of @pool->batch_size pages. */
> +       for (start = 0; start < nr_pages; start += pool->batch_size) {
> +               end = min(start + pool->batch_size, nr_pages);
>
> -               if (!zswap_store_page(page, objcg, pool))
> +               if (!zswap_store_pages(folio, start, end, objcg, pool, node_id))
>                         goto put_pool;
>         }
>
> @@ -1665,9 +1745,9 @@ bool zswap_store(struct folio *folio)
>                 struct zswap_entry *entry;
>                 struct xarray *tree;
>
> -               for (index = 0; index < nr_pages; ++index) {
> -                       tree = swap_zswap_tree(swp_entry(type, offset + index));
> -                       entry = xa_erase(tree, offset + index);
> +               for (start = 0; start < nr_pages; ++start) {
> +                       tree = swap_zswap_tree(swp_entry(type, offset + start));
> +                       entry = xa_erase(tree, offset + start);
>                         if (entry)
>                                 zswap_entry_free(entry);
>                 }
> --
> 2.27.0
>

This patch LGTM for the most part. Lemme test the series again (I
tested an old version of this patch series), and I will give my Ack.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios.
  2025-08-01  4:36 ` [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios Kanchana P Sridhar
@ 2025-08-14 21:14   ` Nhat Pham
  2025-08-14 22:17     ` Sridhar, Kanchana P
  2025-08-28 23:54   ` Barry Song
  1 sibling, 1 reply; 68+ messages in thread
From: Nhat Pham @ 2025-08-14 21:14 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
	senozhatsky, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, vinicius.gomes, wajdi.k.feghali,
	vinodh.gopal

On Thu, Jul 31, 2025 at 9:36 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch introduces a new unified implementation of zswap_compress()
> for compressors that do and do not support batching. This eliminates
> code duplication and facilitates maintainability of the code with the
> introduction of compress batching.
>
> The earlier approach of calling zswap_compress() sequentially, one page
> at a time from zswap_store_pages(), is replaced with this new version of
> zswap_compress() that accepts multiple pages to compress as a batch.
>
> If the compressor does not support batching, each page in the batch is
> compressed and stored sequentially.
>
> If the compressor supports batching, for e.g., 'deflate-iaa', the Intel
> IAA hardware accelerator, the batch is compressed in parallel in
> hardware by setting the acomp_ctx->req->kernel_data to contain the
> necessary batching data before calling crypto_acomp_compress(). If all
> requests in the batch are compressed without errors, the compressed
> buffers are then stored in zpool.
>
> Another important change this patch makes is with the acomp_ctx mutex
> locking in zswap_compress(). Earlier, the mutex was held per page's
> compression. With the new code, [un]locking the mutex per page caused
> regressions for software compressors when testing with usemem
> (30 processes) and also kernel compilation with 'allmod' config. The
> regressions were more egregious when PMD folios were stored. The
> implementation in this commit locks/unlocks the mutex once per batch,
> which resolves the regression.
>
> The use of prefetchw() for zswap entries and likely()/unlikely()
> annotations prevent regressions with software compressors like zstd, and
> generally improve non-batching compressors' performance with the
> batching code by ~3%.
>
> Architectural considerations for the zswap batching framework:
> ==============================================================
> We have designed the zswap batching framework to be
> hardware-agnostic. It has no dependencies on Intel-specific features and
> can be leveraged by any hardware accelerator or software-based
> compressor. In other words, the framework is open and inclusive by
> design.
>
> Other ongoing work that can use batching:
> =========================================
> This patch-series demonstrates the performance benefits of compress
> batching when used in zswap_store() of large folios. shrink_folio_list()
> "reclaim batching" of any-order folios is the major next work that uses
> the zswap compress batching framework: our testing of kernel_compilation
> with writeback and the zswap shrinker indicates 10X fewer pages get
> written back when we reclaim 32 folios as a batch, as compared to one
> folio at a time: this is with deflate-iaa and with zstd. We expect to
> submit a patch-series with this data and the resulting performance
> improvements shortly. Reclaim batching relieves memory pressure faster
> than reclaiming one folio at a time, hence alleviates the need to scan
> slab memory for writeback.
>
> Nhat has given ideas on using batching with the ongoing kcompressd work,
> as well as beneficially using decompression batching & block IO batching
> to improve zswap writeback efficiency.
>
> Experiments that combine zswap compress batching, reclaim batching,
> swapin_readahead() decompression batching of prefetched pages, and
> writeback batching show that 0 pages are written back with deflate-iaa
> and zstd. For comparison, the baselines for these compressors see
> 200K-800K pages written to disk (kernel compilation 'allmod' config).
>
> To summarize, these are future clients of the batching framework:
>
>    - shrink_folio_list() reclaim batching of multiple folios:
>        Implemented, will submit patch-series.
>    - zswap writeback with decompress batching:
>        Implemented, will submit patch-series.
>    - zram:
>        Implemented, will submit patch-series.
>    - kcompressd:
>        Not yet implemented.
>    - file systems:
>        Not yet implemented.
>    - swapin_readahead() decompression batching of prefetched pages:
>        Implemented, will submit patch-series.
>
> Additionally, any place we have folios that need to be compressed, can
> potentially be parallelized.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/swap.h  |  23 ++++++
>  mm/zswap.c | 201 ++++++++++++++++++++++++++++++++++++++---------------
>  2 files changed, 168 insertions(+), 56 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index 911ad5ff0f89f..2afbf00f59fea 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -11,6 +11,29 @@ extern int page_cluster;
>  #include <linux/swapops.h> /* for swp_offset */
>  #include <linux/blk_types.h> /* for bio_end_io_t */
>
> +/* linux/mm/zswap.c */
> +/*
> + * A compression algorithm that wants to batch compressions/decompressions
> + * must define its own internal data structures that exactly mirror
> + * @struct swap_batch_comp_data and @struct swap_batch_decomp_data.
> + */
> +struct swap_batch_comp_data {
> +       struct page **pages;
> +       u8 **dsts;
> +       unsigned int *dlens;
> +       int *errors;
> +       u8 nr_comps;
> +};
> +
> +struct swap_batch_decomp_data {
> +       u8 **srcs;
> +       struct page **pages;
> +       unsigned int *slens;
> +       unsigned int *dlens;
> +       int *errors;
> +       u8 nr_decomps;
> +};

This struct is not being used yet right? I assume this is used for
batch zswap load and writeback etc.

Can we introduce them when those series are sent out? Just to limit
the amount of reviewing here :)

> +
>  /* linux/mm/page_io.c */
>  int sio_pool_init(void);
>  struct swap_iocb;
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 8ca69c3f30df2..c30c1f325f573 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -35,6 +35,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/workqueue.h>
>  #include <linux/list_lru.h>
> +#include <linux/prefetch.h>
>
>  #include "swap.h"
>  #include "internal.h"
> @@ -988,71 +989,163 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         return ret;
>  }
>
> -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> -                          struct zswap_pool *pool)
> +/*
> + * Unified code path for compressors that do and do not support batching. This
> + * procedure will compress multiple @nr_pages in @folio starting from the
> + * @start index.
> + *
> + * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE. zswap_store() makes
> + * sure of this by design.

Maybe add a VM_WARN_ON_ONCE(nr_pages > ZSWAP_MAX_BATCH_SIZE); in
zswap_store_pages() to codify this design choice?

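A minimal sketch of what that could look like (assuming the check goes at the
top of zswap_store_pages(); not taken from the patch):

	static bool zswap_store_pages(struct folio *folio, long start, long end,
				      struct obj_cgroup *objcg,
				      struct zswap_pool *pool, int node_id)
	{
		u8 nr_pages = end - start;

		/* Codify the design assumption that a batch never exceeds the max. */
		VM_WARN_ON_ONCE(nr_pages > ZSWAP_MAX_BATCH_SIZE);

		/* ... existing body ... */
	}
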
> + *
> + * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the compressor does not
> + * support batching.
> + *
> + * If @pool->compr_batch_size is 1, each page is processed sequentially.
> + *
> + * If @pool->compr_batch_size is > 1, compression batching is invoked, except if
> + * @nr_pages is 1: if so, we call the fully synchronous non-batching
> + * crypto_acomp API.
> + *
> + * In both cases, if all compressions are successful, the compressed buffers
> + * are stored in zpool.
> + *
> + * A few important changes made to not regress and in fact improve
> + * compression performance with non-batching software compressors, using this
> + * new/batching code:
> + *
> + * 1) acomp_ctx mutex locking:
> + *    Earlier, the mutex was held per page compression. With the new code,
> + *    [un]locking the mutex per page caused regressions for software
> + *    compressors. We now lock the mutex once per batch, which resolves the
> + *    regression.

Makes sense, yeah.

> + *
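For readers skimming the diff, the locking change boils down to something like
this simplified sketch (compress_one() and compress_chunk() are placeholder
names, not functions in the patch):

	/* Old: lock and unlock once per page. */
	for (i = 0; i < nr_pages; ++i) {
		mutex_lock(&acomp_ctx->mutex);
		compress_one(folio_page(folio, start + i));
		mutex_unlock(&acomp_ctx->mutex);
	}

	/* New: lock once, compress the whole batch, then unlock. */
	mutex_lock(&acomp_ctx->mutex);
	for (i = 0; i < nr_pages; i += nr_comps)
		compress_chunk(folio, start + i, nr_comps);
	mutex_unlock(&acomp_ctx->mutex);
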
> + * 2) The prefetchw() and likely()/unlikely() annotations prevent
> + *    regressions with software compressors like zstd, and generally improve
> + *    non-batching compressors' performance with the batching code by ~3%.
> + */
> +static bool zswap_compress(struct folio *folio, long start, unsigned int nr_pages,
> +                          struct zswap_entry *entries[], struct zswap_pool *pool,
> +                          int node_id)
>  {
>         struct crypto_acomp_ctx *acomp_ctx;
>         struct scatterlist input, output;
> -       int comp_ret = 0, alloc_ret = 0;
> -       unsigned int dlen = PAGE_SIZE;
> -       unsigned long handle;
> -       struct zpool *zpool;
> +       struct zpool *zpool = pool->zpool;
> +
> +       unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
> +       int errors[ZSWAP_MAX_BATCH_SIZE];
> +
> +       unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> +       unsigned int i, j;
> +       int err;
>         gfp_t gfp;
> -       u8 *dst;
> +
> +       gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
>
>         acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
>
>         mutex_lock(&acomp_ctx->mutex);
>
> -       dst = acomp_ctx->buffers[0];
> -       sg_init_table(&input, 1);
> -       sg_set_page(&input, page, PAGE_SIZE, 0);
> -
>         /*
> -        * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> -        * and hardware-accelerators may won't check the dst buffer size, so
> -        * giving the dst buffer with enough length to avoid buffer overflow.
> +        * Note:
> +        * [i] refers to the incoming batch space and is used to
> +        *     index into the folio pages, @entries and @errors.
>          */
> -       sg_init_one(&output, dst, PAGE_SIZE * 2);
> -       acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> +       for (i = 0; i < nr_pages; i += nr_comps) {
> +               if (nr_comps == 1) {
> +                       sg_init_table(&input, 1);
> +                       sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE, 0);
>
> -       /*
> -        * it maybe looks a little bit silly that we send an asynchronous request,
> -        * then wait for its completion synchronously. This makes the process look
> -        * synchronous in fact.
> -        * Theoretically, acomp supports users send multiple acomp requests in one
> -        * acomp instance, then get those requests done simultaneously. but in this
> -        * case, zswap actually does store and load page by page, there is no
> -        * existing method to send the second page before the first page is done
> -        * in one thread doing zwap.
> -        * but in different threads running on different cpu, we have different
> -        * acomp instance, so multiple threads can do (de)compression in parallel.
> -        */
> -       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> -       dlen = acomp_ctx->req->dlen;
> -       if (comp_ret)
> -               goto unlock;
> +                       /*
> +                        * We need PAGE_SIZE * 2 here since there may be an over-compression
> +                        * case, and hardware accelerators may not check the dst buffer size,
> +                        * so give the dst buffer enough length to avoid buffer overflow.
> +                        */
> +                       sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
> +                       acomp_request_set_params(acomp_ctx->req, &input,
> +                                                &output, PAGE_SIZE, PAGE_SIZE);
> +
> +                       errors[i] = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> +                                                   &acomp_ctx->wait);
> +                       if (unlikely(errors[i]))
> +                               goto compress_error;
> +
> +                       dlens[i] = acomp_ctx->req->dlen;
> +               } else {
> +                       struct page *pages[ZSWAP_MAX_BATCH_SIZE];
> +                       unsigned int k;
> +
> +                       for (k = 0; k < nr_pages; ++k)
> +                               pages[k] = folio_page(folio, start + k);
> +
> +                       struct swap_batch_comp_data batch_comp_data = {
> +                               .pages = pages,
> +                               .dsts = acomp_ctx->buffers,
> +                               .dlens = dlens,
> +                               .errors = errors,
> +                               .nr_comps = nr_pages,
> +                       };
> +
> +                       acomp_ctx->req->kernel_data = &batch_comp_data;
> +
> +                       if (unlikely(crypto_acomp_compress(acomp_ctx->req)))
> +                               goto compress_error;

I assume this is a new crypto API?

I'll let Herbert decide whether this makes sense :)

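As a rough, hypothetical sketch of the contract (not the iaa_crypto code, which
lives in the driver patches earlier in the series; compress_single_page() is a
placeholder), a batching backend would interpret req->kernel_data roughly like:

	static int batching_backend_compress(struct acomp_req *req)
	{
		struct swap_batch_comp_data *b = req->kernel_data;
		u8 i;

		if (!b)			/* ordinary single-page request */
			return compress_single_page(req);

		for (i = 0; i < b->nr_comps; i++) {
			/*
			 * Submit b->pages[i], compressed into b->dsts[i];
			 * record the result in b->dlens[i] and b->errors[i].
			 */
		}

		/* Wait for all submissions, then report overall status. */
		return 0;
	}
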
> +               }
>
> -       zpool = pool->zpool;
> -       gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> -       alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page));
> -       if (alloc_ret)
> -               goto unlock;
> -
> -       zpool_obj_write(zpool, handle, dst, dlen);
> -       entry->handle = handle;
> -       entry->length = dlen;
> -
> -unlock:
> -       if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> -               zswap_reject_compress_poor++;
> -       else if (comp_ret)
> -               zswap_reject_compress_fail++;
> -       else if (alloc_ret)
> -               zswap_reject_alloc_fail++;
> +               /*
> +                * All @nr_comps pages were successfully compressed.
> +                * Store the pages in zpool.
> +                *
> +                * Note:
> +                * [j] refers to the incoming batch space and is used to
> +                *     index into the folio pages, @entries, @dlens and @errors.
> +                * [k] refers to the @acomp_ctx space, as determined by
> +                *     @pool->compr_batch_size, and is used to index into
> +                *     @acomp_ctx->buffers.
> +                */
> +               for (j = i; j < i + nr_comps; ++j) {
> +                       unsigned int k = j - i;
> +                       unsigned long handle;
> +
> +                       /*
> +                        * prefetchw() minimizes cache-miss latency by
> +                        * moving the zswap entry to the cache before it
> +                        * is written to; reducing sys time by ~1.5% for
> +                        * non-batching software compressors.
> +                        */
> +                       prefetchw(entries[j]);
> +                       err = zpool_malloc(zpool, dlens[j], gfp, &handle, node_id);
> +
> +                       if (unlikely(err)) {
> +                               if (err == -ENOSPC)
> +                                       zswap_reject_compress_poor++;
> +                               else
> +                                       zswap_reject_alloc_fail++;
> +
> +                               goto err_unlock;
> +                       }
> +
> +                       zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], dlens[j]);
> +                       entries[j]->handle = handle;
> +                       entries[j]->length = dlens[j];
> +               }
> +       } /* finished compress and store nr_pages. */
>
>         mutex_unlock(&acomp_ctx->mutex);
> -       return comp_ret == 0 && alloc_ret == 0;
> +       return true;
> +
> +compress_error:
> +       for (j = i; j < i + nr_comps; ++j) {
> +               if (errors[j]) {
> +                       if (errors[j] == -ENOSPC)
> +                               zswap_reject_compress_poor++;
> +                       else
> +                               zswap_reject_compress_fail++;
> +               }
> +       }
> +
> +err_unlock:
> +       mutex_unlock(&acomp_ctx->mutex);
> +       return false;
>  }
>
>  static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> @@ -1590,12 +1683,8 @@ static bool zswap_store_pages(struct folio *folio,
>                 INIT_LIST_HEAD(&entries[i]->lru);
>         }
>
> -       for (i = 0; i < nr_pages; ++i) {
> -               struct page *page = folio_page(folio, start + i);
> -
> -               if (!zswap_compress(page, entries[i], pool))
> -                       goto store_pages_failed;
> -       }
> +       if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool, node_id)))
> +               goto store_pages_failed;
>
>         for (i = 0; i < nr_pages; ++i) {
>                 struct zswap_entry *old, *entry = entries[i];
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-14 20:58   ` Nhat Pham
@ 2025-08-14 22:05     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-14 22:05 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	herbert@gondor.apana.org.au, davem@davemloft.net,
	clabbe@baylibre.com, ardb@kernel.org, ebiggers@google.com,
	surenb@google.com, Accardi, Kristen C, Gomes, Vinicius,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Thursday, August 14, 2025 1:58 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> if the compressor supports batching.
> 
> On Thu, Jul 31, 2025 at 9:36 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch sets up zswap for allocating per-CPU resources optimally for
> > non-batching and batching compressors.
> >
> > A new ZSWAP_MAX_BATCH_SIZE constant is defined as 8U, to set an upper
> > limit on the number of pages in large folios that will be batch
> > compressed.
> >
> > As per Herbert's comments in [2] in response to the
> > crypto_acomp_batch_compress() and crypto_acomp_batch_decompress()
> API
> > proposed in [1], this series does not create new crypto_acomp batching
> > API. Instead, zswap compression batching uses the existing
> > crypto_acomp_compress() API in combination with the "void *kernel_data"
> > member added to "struct acomp_req" earlier in this series.
> >
> > It is up to the compressor to manage multiple requests, as needed, to
> > accomplish batch parallelism. zswap only needs to allocate the per-CPU
> > dst buffers according to the batch size supported by the compressor.
> >
> > A "u8 compr_batch_size" member is added to "struct zswap_pool", as per
> > Yosry's suggestion. pool->compr_batch_size is set as the minimum of the
> > compressor's max batch-size and ZSWAP_MAX_BATCH_SIZE. Accordingly, it
> > proceeds to allocate the necessary compression dst buffers in the
> > per-CPU acomp_ctx.
> >
> > Another "u8 batch_size" member is added to "struct zswap_pool" to store
> > the unit for batching large folio stores: for batching compressors, this
> > is the pool->compr_batch_size. For non-batching compressors, this is
> > ZSWAP_MAX_BATCH_SIZE.
> >
> > zswap does not use more than one dst buffer yet. Follow-up patches will
> > actually utilize the multiple acomp_ctx buffers for batch
> > compression/decompression of multiple pages.
> >
> > Thus, ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used
> for
> > batching. There is a small extra memory overhead of allocating
> > the acomp_ctx->buffers array for compressors that do not support
> > batching: On x86_64, the overhead is 1 pointer per-CPU (i.e. 8 bytes).
> >
> > [1]: https://patchwork.kernel.org/project/linux-
> mm/patch/20250508194134.28392-11-kanchana.p.sridhar@intel.com/
> > [2]: https://patchwork.kernel.org/comment/26382610
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> 
> Mostly LGTM. Just a couple of questions below:

Hi Nhat,

Thanks for taking the time to review the patches! Sure, these are
great questions, responses are inline.

> 
> > ---
> >  mm/zswap.c | 82 +++++++++++++++++++++++++++++++++++++++++-------
> ------
> >  1 file changed, 63 insertions(+), 19 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index efd501a7fe294..63a997b999537 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -80,6 +80,9 @@ static bool zswap_pool_reached_full;
> >
> >  #define ZSWAP_PARAM_UNSET ""
> >
> > +/* Limit the batch size to limit per-CPU memory usage for dst buffers. */
> > +#define ZSWAP_MAX_BATCH_SIZE 8U
> > +
> >  static int zswap_setup(void);
> >
> >  /* Enable/disable zswap */
> > @@ -147,7 +150,7 @@ struct crypto_acomp_ctx {
> >         struct crypto_acomp *acomp;
> >         struct acomp_req *req;
> >         struct crypto_wait wait;
> > -       u8 *buffer;
> > +       u8 **buffers;
> >         struct mutex mutex;
> >         bool is_sleepable;
> >  };
> > @@ -166,6 +169,8 @@ struct zswap_pool {
> >         struct work_struct release_work;
> >         struct hlist_node node;
> >         char tfm_name[CRYPTO_MAX_ALG_NAME];
> > +       u8 compr_batch_size;
> > +       u8 batch_size;
> 
> Apologies if this is explained elsewhere, but I'm very confused - why
> do we need both of these two fields?

No worries. This was my thinking in keeping these separate:

  "compr_batch_size" is indicative of the number of batching resources
  allocated per-CPU. Hence, zswap_compress() uses this to determine if
  we need to compress one page at a time in the input batch of pages.

  "batch_size" represents the number of pages that will be sent to
  zswap_compress() as a batch.

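In other words, a simplified sketch (not the patch code) of how the two fields
get used:

	/* zswap_store(): the folio is walked in units of pool->batch_size. */
	for (start = 0; start < nr_pages; start += pool->batch_size)
		zswap_store_pages(folio, start,
				  min(start + pool->batch_size, nr_pages),
				  objcg, pool, node_id);

	/*
	 * zswap_compress(): within one such batch, pages are compressed in
	 * chunks of pool->compr_batch_size (1 means one page at a time).
	 */
	nr_comps = min(nr_pages, pool->compr_batch_size);
	for (i = 0; i < nr_pages; i += nr_comps)
		/* compress pages [i, i + nr_comps) */;
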
> 
> Seems like batch_size is defined below, and never changed:
> 
>       pool->batch_size = (pool->compr_batch_size > 1) ?
>                             pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
> 
> Can we just determine this in zswap_store() as a local variable?

I figured since the number of zswap_pools at any given time is less than
or equal to 2 (IIRC), it should be a good compromise to add these two
u8 members for latency reasons, so that this doesn't have to be
computed per call to zswap_store(). 

> 
> 
> >  };
> >
> >  /* Global LRU lists shared by all zswap pools. */
> > @@ -258,8 +263,10 @@ static void __zswap_pool_empty(struct
> percpu_ref *ref);
> >   *   zswap_cpu_comp_prepare(), not others.
> >   * - Cleanup acomp_ctx resources on all cores in zswap_pool_destroy().
> >   */
> > -static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> > +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8
> nr_buffers)
> >  {
> > +       u8 i;
> > +
> >         if (IS_ERR_OR_NULL(acomp_ctx))
> >                 return;
> >
> > @@ -269,7 +276,11 @@ static void acomp_ctx_dealloc(struct
> crypto_acomp_ctx *acomp_ctx)
> >         if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> >                 crypto_free_acomp(acomp_ctx->acomp);
> >
> > -       kfree(acomp_ctx->buffer);
> > +       if (acomp_ctx->buffers) {
> > +               for (i = 0; i < nr_buffers; ++i)
> > +                       kfree(acomp_ctx->buffers[i]);
> > +               kfree(acomp_ctx->buffers);
> > +       }
> >  }
> >
> >  static struct zswap_pool *zswap_pool_create(char *type, char
> *compressor)
> > @@ -290,6 +301,7 @@ static struct zswap_pool *zswap_pool_create(char
> *type, char *compressor)
> >                         return NULL;
> >         }
> >
> > +       /* Many things rely on the zero-initialization. */
> >         pool = kzalloc(sizeof(*pool), GFP_KERNEL);
> >         if (!pool)
> >                 return NULL;
> > @@ -352,13 +364,28 @@ static struct zswap_pool
> *zswap_pool_create(char *type, char *compressor)
> >                 goto ref_fail;
> >         INIT_LIST_HEAD(&pool->list);
> >
> > +       /*
> > +        * Set the unit of compress batching for large folios, for quick
> > +        * retrieval in the zswap_compress() fast path:
> > +        * If the compressor is sequential (@pool->compr_batch_size is 1),
> > +        * large folios will be compressed in batches of
> ZSWAP_MAX_BATCH_SIZE
> > +        * pages, where each page in the batch is compressed sequentially.
> > +        * We see better performance by processing the folio in batches of
> > +        * ZSWAP_MAX_BATCH_SIZE, due to cache locality of working set
> > +        * structures.
> > +        */
> > +       pool->batch_size = (pool->compr_batch_size > 1) ?
> > +                               pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
> > +
> >         zswap_pool_debug("created", pool);
> >
> >         return pool;
> >
> >  ref_fail:
> >         for_each_possible_cpu(cpu)
> > -               acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> > +               acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu),
> > +                                 pool->compr_batch_size);
> > +
> >  error:
> >         if (pool->acomp_ctx)
> >                 free_percpu(pool->acomp_ctx);
> > @@ -417,7 +444,8 @@ static void zswap_pool_destroy(struct zswap_pool
> *pool)
> >         zswap_pool_debug("destroying", pool);
> >
> >         for_each_possible_cpu(cpu)
> > -               acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> > +               acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu),
> > +                                 pool->compr_batch_size);
> >
> >         free_percpu(pool->acomp_ctx);
> >
> > @@ -876,6 +904,7 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >         struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> >         struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx,
> cpu);
> >         int ret = -ENOMEM;
> > +       u8 i;
> >
> >         /*
> >          * The per-CPU pool->acomp_ctx is zero-initialized on allocation.
> > @@ -888,10 +917,6 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >         if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> >                 return 0;
> >
> > -       acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
> cpu_to_node(cpu));
> > -       if (!acomp_ctx->buffer)
> > -               return ret;
> > -
> >         acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
> cpu_to_node(cpu));
> >         if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
> >                 pr_err("could not alloc crypto acomp %s : %ld\n",
> > @@ -904,17 +929,36 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >         acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> >         if (IS_ERR_OR_NULL(acomp_ctx->req)) {
> >                 pr_err("could not alloc crypto acomp_request %s\n",
> > -                      pool->tfm_name);
> > +                       pool->tfm_name);
> 
> Is this intentional? :)

Yes, it is indeed :). No problem if you'd prefer I revert it.

Thanks,
Kanchana

> 
> >                 goto fail;
> >         }
> >
> > -       crypto_init_wait(&acomp_ctx->wait);
> > +       /*
> > +        * Allocate up to ZSWAP_MAX_BATCH_SIZE dst buffers if the
> > +        * compressor supports batching.
> > +        */
> > +       pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
> > +                                    crypto_acomp_batch_size(acomp_ctx->acomp));
> > +
> > +       acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8
> *),
> > +                                         GFP_KERNEL, cpu_to_node(cpu));
> > +       if (!acomp_ctx->buffers)
> > +               goto fail;
> > +
> > +       for (i = 0; i < pool->compr_batch_size; ++i) {
> > +               acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
> GFP_KERNEL,
> > +                                                    cpu_to_node(cpu));
> > +               if (!acomp_ctx->buffers[i])
> > +                       goto fail;
> > +       }
> >
> >         /*
> >          * if the backend of acomp is async zip, crypto_req_done() will wakeup
> >          * crypto_wait_req(); if the backend of acomp is scomp, the callback
> >          * won't be called, crypto_wait_req() will return without blocking.
> >          */
> > +       crypto_init_wait(&acomp_ctx->wait);
> > +
> >         acomp_request_set_callback(acomp_ctx->req,
> CRYPTO_TFM_REQ_MAY_BACKLOG,
> >                                    crypto_req_done, &acomp_ctx->wait);
> >
> > @@ -922,7 +966,7 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >         return 0;
> >
> >  fail:
> > -       acomp_ctx_dealloc(acomp_ctx);
> > +       acomp_ctx_dealloc(acomp_ctx, pool->compr_batch_size);
> >         return ret;
> >  }
> >
> > @@ -942,7 +986,7 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry,
> >
> >         mutex_lock(&acomp_ctx->mutex);
> >
> > -       dst = acomp_ctx->buffer;
> > +       dst = acomp_ctx->buffers[0];
> >         sg_init_table(&input, 1);
> >         sg_set_page(&input, page, PAGE_SIZE, 0);
> >
> > @@ -1003,19 +1047,19 @@ static bool zswap_decompress(struct
> zswap_entry *entry, struct folio *folio)
> >
> >         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >         mutex_lock(&acomp_ctx->mutex);
> > -       obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
> > +       obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx-
> >buffers[0]);
> >
> >         /*
> >          * zpool_obj_read_begin() might return a kmap address of highmem
> when
> > -        * acomp_ctx->buffer is not used.  However, sg_init_one() does not
> > -        * handle highmem addresses, so copy the object to acomp_ctx-
> >buffer.
> > +        * acomp_ctx->buffers[0] is not used.  However, sg_init_one() does not
> > +        * handle highmem addresses, so copy the object to acomp_ctx-
> >buffers[0].
> >          */
> >         if (virt_addr_valid(obj)) {
> >                 src = obj;
> >         } else {
> > -               WARN_ON_ONCE(obj == acomp_ctx->buffer);
> > -               memcpy(acomp_ctx->buffer, obj, entry->length);
> > -               src = acomp_ctx->buffer;
> > +               WARN_ON_ONCE(obj == acomp_ctx->buffers[0]);
> > +               memcpy(acomp_ctx->buffers[0], obj, entry->length);
> > +               src = acomp_ctx->buffers[0];
> >         }
> >
> >         sg_init_one(&input, src, entry->length);
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 23/24] mm: zswap: zswap_store() will process a large folio in batches.
  2025-08-14 21:05   ` Nhat Pham
@ 2025-08-14 22:10     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-14 22:10 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	herbert@gondor.apana.org.au, davem@davemloft.net,
	clabbe@baylibre.com, ardb@kernel.org, ebiggers@google.com,
	surenb@google.com, Accardi, Kristen C, Gomes, Vinicius,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Thursday, August 14, 2025 2:05 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 23/24] mm: zswap: zswap_store() will process a
> large folio in batches.
> 
> On Thu, Jul 31, 2025 at 9:36 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch modifies zswap_store() to store a batch of pages in large
> > folios at a time, instead of storing one page at a time. It does this by
> > calling a new procedure zswap_store_pages() with a range of
> > "pool->batch_size" indices in the folio.
> >
> > zswap_store_pages() implements all the computes done earlier in
> > zswap_store_page() for a single-page, for multiple pages in a folio,
> > namely the "batch":
> >
> > 1) It starts by allocating all zswap entries required to store the
> >    batch. New procedures, zswap_entries_cache_alloc_batch() and
> >    zswap_entries_cache_free_batch() call kmem_cache_[free]alloc_bulk()
> >    to optimize the performance of this step.
> >
> > 2) Next, the entries fields are written, computes that need to be happen
> >    anyway, without modifying the zswap xarray/LRU publishing order. This
> >    improves latency by avoiding having the bring the entries into the
> >    cache for writing in different code blocks within this procedure.
> >
> > 3) Next, it calls zswap_compress() to sequentially compress each page in
> >    the batch.
> >
> > 4) Finally, it adds the batch's zswap entries to the xarray and LRU,
> >    charges zswap memory and increments zswap stats.
> >
> > 5) The error handling and cleanup required for all failure scenarios
> >    that can occur while storing a batch in zswap are consolidated to a
> >    single "store_pages_failed" label in zswap_store_pages(). Here again,
> >    we optimize performance by calling kmem_cache_free_bulk().
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 218 ++++++++++++++++++++++++++++++++++++-------------
> ----
> >  1 file changed, 149 insertions(+), 69 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 63a997b999537..8ca69c3f30df2 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -879,6 +879,24 @@ static void zswap_entry_cache_free(struct
> zswap_entry *entry)
> >         kmem_cache_free(zswap_entry_cache, entry);
> >  }
> >
> > +/*
> > + * Returns 0 if kmem_cache_alloc_bulk() failed and a positive number
> otherwise.
> > + * The code for __kmem_cache_alloc_bulk() indicates that this positive
> number
> > + * will be the @size requested, i.e., @nr_entries.
> > + */
> > +static __always_inline int zswap_entries_cache_alloc_batch(void
> **entries,
> > +                                                          unsigned int nr_entries,
> > +                                                          gfp_t gfp)
> > +{
> > +       return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries,
> entries);
> > +}
> > +
> > +static __always_inline void zswap_entries_cache_free_batch(void
> **entries,
> > +                                                          unsigned int nr_entries)
> > +{
> > +       kmem_cache_free_bulk(zswap_entry_cache, nr_entries, entries);
> > +}
> > +
> >  /*
> >   * Carries out the common pattern of freeing and entry's zpool allocation,
> >   * freeing the entry itself, and decrementing the number of stored pages.
> > @@ -1512,93 +1530,154 @@ static void shrink_worker(struct work_struct
> *w)
> >  * main API
> >  **********************************/
> >
> > -static bool zswap_store_page(struct page *page,
> > -                            struct obj_cgroup *objcg,
> > -                            struct zswap_pool *pool)
> > +/*
> > + * Store multiple pages in @folio, starting from the page at index @start up
> to
> > + * the page at index @end-1.
> > + */
> > +static bool zswap_store_pages(struct folio *folio,
> > +                             long start,
> > +                             long end,
> > +                             struct obj_cgroup *objcg,
> > +                             struct zswap_pool *pool,
> > +                             int node_id)
> >  {
> > -       swp_entry_t page_swpentry = page_swap_entry(page);
> > -       struct zswap_entry *entry, *old;
> > -
> > -       /* allocate entry */
> > -       entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> > -       if (!entry) {
> > -               zswap_reject_kmemcache_fail++;
> > -               return false;
> > +       struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
> > +       u8 i, store_fail_idx = 0, nr_pages = end - start;
> > +
> > +       if (unlikely(!zswap_entries_cache_alloc_batch((void **)&entries[0],
> > +                                                     nr_pages, GFP_KERNEL))) {
> > +               for (i = 0; i < nr_pages; ++i) {
> > +                       entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> > +
> > +                       if (unlikely(!entries[i])) {
> > +                               zswap_reject_kmemcache_fail++;
> > +                               /*
> > +                                * While handling this error, we only need to
> > +                                * call zswap_entries_cache_free_batch() for
> > +                                * entries[0 .. i-1].
> > +                                */
> > +                               nr_pages = i;
> > +                               goto store_pages_failed;
> > +                       }
> > +               }
> >         }
> >
> > -       if (!zswap_compress(page, entry, pool))
> > -               goto compress_failed;
> > +       /*
> > +        * Three sets of initializations are done to minimize bringing
> > +        * @entries into the cache for writing at different parts of this
> > +        * procedure, since doing so regresses performance:
> > +        *
> > +        * 1) Do all the writes to each entry in one code block. These
> > +        *    writes need to be done anyway upon success which is more likely
> > +        *    than not.
> > +        *
> > +        * 2) Initialize the handle to an error value. This facilitates
> > +        *    having a consolidated failure handling
> > +        *    'goto store_pages_failed' that can inspect the value of the
> > +        *    handle to determine whether zpool memory needs to be
> > +        *    de-allocated.
> > +        *
> > +        * 3) The page_swap_entry() is obtained once and stored in the entry.
> > +        *    Subsequent store in xarray gets the entry->swpentry instead of
> > +        *    calling page_swap_entry(), minimizing computes.
> > +        */
> > +       for (i = 0; i < nr_pages; ++i) {
> > +               entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> > +               entries[i]->pool = pool;
> > +               entries[i]->swpentry = page_swap_entry(folio_page(folio, start +
> i));
> > +               entries[i]->objcg = objcg;
> > +               entries[i]->referenced = true;
> > +               INIT_LIST_HEAD(&entries[i]->lru);
> > +       }
> >
> > -       old = xa_store(swap_zswap_tree(page_swpentry),
> > -                      swp_offset(page_swpentry),
> > -                      entry, GFP_KERNEL);
> > -       if (xa_is_err(old)) {
> > -               int err = xa_err(old);
> > +       for (i = 0; i < nr_pages; ++i) {
> > +               struct page *page = folio_page(folio, start + i);
> >
> > -               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n",
> err);
> > -               zswap_reject_alloc_fail++;
> > -               goto store_failed;
> > +               if (!zswap_compress(page, entries[i], pool))
> > +                       goto store_pages_failed;
> >         }
> >
> > -       /*
> > -        * We may have had an existing entry that became stale when
> > -        * the folio was redirtied and now the new version is being
> > -        * swapped out. Get rid of the old.
> > -        */
> > -       if (old)
> > -               zswap_entry_free(old);
> > +       for (i = 0; i < nr_pages; ++i) {
> > +               struct zswap_entry *old, *entry = entries[i];
> >
> > -       /*
> > -        * The entry is successfully compressed and stored in the tree, there is
> > -        * no further possibility of failure. Grab refs to the pool and objcg,
> > -        * charge zswap memory, and increment zswap_stored_pages.
> > -        * The opposite actions will be performed by zswap_entry_free()
> > -        * when the entry is removed from the tree.
> > -        */
> > -       zswap_pool_get(pool);
> > -       if (objcg) {
> > -               obj_cgroup_get(objcg);
> > -               obj_cgroup_charge_zswap(objcg, entry->length);
> > -       }
> > -       atomic_long_inc(&zswap_stored_pages);
> > +               old = xa_store(swap_zswap_tree(entry->swpentry),
> > +                              swp_offset(entry->swpentry),
> > +                              entry, GFP_KERNEL);
> > +               if (unlikely(xa_is_err(old))) {
> > +                       int err = xa_err(old);
> >
> > -       /*
> > -        * We finish initializing the entry while it's already in xarray.
> > -        * This is safe because:
> > -        *
> > -        * 1. Concurrent stores and invalidations are excluded by folio lock.
> > -        *
> > -        * 2. Writeback is excluded by the entry not being on the LRU yet.
> > -        *    The publishing order matters to prevent writeback from seeing
> > -        *    an incoherent entry.
> > -        */
> > -       entry->pool = pool;
> > -       entry->swpentry = page_swpentry;
> > -       entry->objcg = objcg;
> > -       entry->referenced = true;
> > -       if (entry->length) {
> > -               INIT_LIST_HEAD(&entry->lru);
> > -               zswap_lru_add(&zswap_list_lru, entry);
> > +                       WARN_ONCE(err != -ENOMEM, "unexpected xarray error:
> %d\n", err);
> > +                       zswap_reject_alloc_fail++;
> > +                       /*
> > +                        * Entries up to this point have been stored in the
> > +                        * xarray. zswap_store() will erase them from the xarray
> > +                        * and call zswap_entry_free(). Local cleanup in
> > +                        * 'store_pages_failed' only needs to happen for
> > +                        * entries from [@i to @nr_pages).
> > +                        */
> > +                       store_fail_idx = i;
> > +                       goto store_pages_failed;
> > +               }
> > +
> > +               /*
> > +                * We may have had an existing entry that became stale when
> > +                * the folio was redirtied and now the new version is being
> > +                * swapped out. Get rid of the old.
> > +                */
> > +               if (unlikely(old))
> > +                       zswap_entry_free(old);
> > +
> > +               /*
> > +                * The entry is successfully compressed and stored in the tree,
> there is
> > +                * no further possibility of failure. Grab refs to the pool and objcg,
> > +                * charge zswap memory, and increment zswap_stored_pages.
> > +                * The opposite actions will be performed by zswap_entry_free()
> > +                * when the entry is removed from the tree.
> > +                */
> > +               zswap_pool_get(pool);
> > +               if (objcg) {
> > +                       obj_cgroup_get(objcg);
> > +                       obj_cgroup_charge_zswap(objcg, entry->length);
> > +               }
> > +               atomic_long_inc(&zswap_stored_pages);
> > +
> > +               /*
> > +                * We finish by adding the entry to the LRU while it's already
> > +                * in xarray. This is safe because:
> > +                *
> > +                * 1. Concurrent stores and invalidations are excluded by folio
> lock.
> > +                *
> > +                * 2. Writeback is excluded by the entry not being on the LRU yet.
> > +                *    The publishing order matters to prevent writeback from seeing
> > +                *    an incoherent entry.
> > +                */
> > +               if (likely(entry->length))
> > +                       zswap_lru_add(&zswap_list_lru, entry);
> >         }
> >
> >         return true;
> >
> > -store_failed:
> > -       zpool_free(pool->zpool, entry->handle);
> > -compress_failed:
> > -       zswap_entry_cache_free(entry);
> > +store_pages_failed:
> > +       for (i = store_fail_idx; i < nr_pages; ++i) {
> > +               if (!IS_ERR_VALUE(entries[i]->handle))
> > +                       zpool_free(pool->zpool, entries[i]->handle);
> > +       }
> > +       zswap_entries_cache_free_batch((void **)&entries[store_fail_idx],
> > +                                      nr_pages - store_fail_idx);
> > +
> >         return false;
> >  }
> >
> >  bool zswap_store(struct folio *folio)
> >  {
> >         long nr_pages = folio_nr_pages(folio);
> > +       int node_id = folio_nid(folio);
> >         swp_entry_t swp = folio->swap;
> >         struct obj_cgroup *objcg = NULL;
> >         struct mem_cgroup *memcg = NULL;
> >         struct zswap_pool *pool;
> >         bool ret = false;
> > -       long index;
> > +       long start, end;
> >
> >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > @@ -1632,10 +1711,11 @@ bool zswap_store(struct folio *folio)
> >                 mem_cgroup_put(memcg);
> >         }
> >
> > -       for (index = 0; index < nr_pages; ++index) {
> > -               struct page *page = folio_page(folio, index);
> > +       /* Store the folio in batches of @pool->batch_size pages. */
> > +       for (start = 0; start < nr_pages; start += pool->batch_size) {
> > +               end = min(start + pool->batch_size, nr_pages);
> >
> > -               if (!zswap_store_page(page, objcg, pool))
> > +               if (!zswap_store_pages(folio, start, end, objcg, pool, node_id))
> >                         goto put_pool;
> >         }
> >
> > @@ -1665,9 +1745,9 @@ bool zswap_store(struct folio *folio)
> >                 struct zswap_entry *entry;
> >                 struct xarray *tree;
> >
> > -               for (index = 0; index < nr_pages; ++index) {
> > -                       tree = swap_zswap_tree(swp_entry(type, offset + index));
> > -                       entry = xa_erase(tree, offset + index);
> > +               for (start = 0; start < nr_pages; ++start) {
> > +                       tree = swap_zswap_tree(swp_entry(type, offset + start));
> > +                       entry = xa_erase(tree, offset + start);
> >                         if (entry)
> >                                 zswap_entry_free(entry);
> >                 }
> > --
> > 2.27.0
> >
> 
> This patch LGTM for the most part. Lemme test the series again (I
> tested an old version of this patch series), and I will give my Ack.

Sounds great. Thank you, Nhat!

Best regards,
Kanchana

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios.
  2025-08-14 21:14   ` Nhat Pham
@ 2025-08-14 22:17     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-14 22:17 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	herbert@gondor.apana.org.au, davem@davemloft.net,
	clabbe@baylibre.com, ardb@kernel.org, ebiggers@google.com,
	surenb@google.com, Accardi, Kristen C, Gomes, Vinicius,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Thursday, August 14, 2025 2:15 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Thu, Jul 31, 2025 at 9:36 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch introduces a new unified implementation of zswap_compress()
> > for compressors that do and do not support batching. This eliminates
> > code duplication and facilitates maintainability of the code with the
> > introduction of compress batching.
> >
> > The vectorized implementation of calling the earlier zswap_compress()
> > sequentially, one page at a time in zswap_store_pages(), is replaced
> > with this new version of zswap_compress() that accepts multiple pages to
> > compress as a batch.
> >
> > If the compressor does not support batching, each page in the batch is
> > compressed and stored sequentially.
> >
> > If the compressor supports batching, for e.g., 'deflate-iaa', the Intel
> > IAA hardware accelerator, the batch is compressed in parallel in
> > hardware by setting the acomp_ctx->req->kernel_data to contain the
> > necessary batching data before calling crypto_acomp_compress(). If all
> > requests in the batch are compressed without errors, the compressed
> > buffers are then stored in zpool.
> >
> > Another important change this patch makes is with the acomp_ctx mutex
> > locking in zswap_compress(). Earlier, the mutex was held per page's
> > compression. With the new code, [un]locking the mutex per page caused
> > regressions for software compressors when testing with usemem
> > (30 processes) and also kernel compilation with 'allmod' config. The
> > regressions were more egregious when PMD folios were stored. The
> > implementation in this commit locks/unlocks the mutex once per batch,
> > which resolves the regression.
> >
> > The use of prefetchw() for zswap entries and likely()/unlikely()
> > annotations prevent regressions with software compressors like zstd, and
> > generally improve non-batching compressors' performance with the
> > batching code by ~3%.
> >
> > Architectural considerations for the zswap batching framework:
> >
> ==============================================================
> > We have designed the zswap batching framework to be
> > hardware-agnostic. It has no dependencies on Intel-specific features and
> > can be leveraged by any hardware accelerator or software-based
> > compressor. In other words, the framework is open and inclusive by
> > design.
> >
> > Other ongoing work that can use batching:
> > =========================================
> > This patch-series demonstrates the performance benefits of compress
> > batching when used in zswap_store() of large folios. shrink_folio_list()
> > "reclaim batching" of any-order folios is the major next work that uses
> > the zswap compress batching framework: our testing of kernel_compilation
> > with writeback and the zswap shrinker indicates 10X fewer pages get
> > written back when we reclaim 32 folios as a batch, as compared to one
> > folio at a time: this is with deflate-iaa and with zstd. We expect to
> > submit a patch-series with this data and the resulting performance
> > improvements shortly. Reclaim batching relieves memory pressure faster
> > than reclaiming one folio at a time, hence alleviates the need to scan
> > slab memory for writeback.
> >
> > Nhat has given ideas on using batching with the ongoing kcompressd work,
> > as well as beneficially using decompression batching & block IO batching
> > to improve zswap writeback efficiency.
> >
> > Experiments that combine zswap compress batching, reclaim batching,
> > swapin_readahead() decompression batching of prefetched pages, and
> > writeback batching show that 0 pages are written back with deflate-iaa
> > and zstd. For comparison, the baselines for these compressors see
> > 200K-800K pages written to disk (kernel compilation 'allmod' config).
> >
> > To summarize, these are future clients of the batching framework:
> >
> >    - shrink_folio_list() reclaim batching of multiple folios:
> >        Implemented, will submit patch-series.
> >    - zswap writeback with decompress batching:
> >        Implemented, will submit patch-series.
> >    - zram:
> >        Implemented, will submit patch-series.
> >    - kcompressd:
> >        Not yet implemented.
> >    - file systems:
> >        Not yet implemented.
> >    - swapin_readahead() decompression batching of prefetched pages:
> >        Implemented, will submit patch-series.
> >
> > Additionally, any place where we have folios that need to be compressed
> > can potentially be parallelized.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/swap.h  |  23 ++++++
> >  mm/zswap.c | 201 ++++++++++++++++++++++++++++++++++++++----------
> -----
> >  2 files changed, 168 insertions(+), 56 deletions(-)
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 911ad5ff0f89f..2afbf00f59fea 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -11,6 +11,29 @@ extern int page_cluster;
> >  #include <linux/swapops.h> /* for swp_offset */
> >  #include <linux/blk_types.h> /* for bio_end_io_t */
> >
> > +/* linux/mm/zswap.c */
> > +/*
> > + * A compression algorithm that wants to batch
> compressions/decompressions
> > + * must define its own internal data structures that exactly mirror
> > + * @struct swap_batch_comp_data and @struct
> swap_batch_decomp_data.
> > + */
> > +struct swap_batch_comp_data {
> > +       struct page **pages;
> > +       u8 **dsts;
> > +       unsigned int *dlens;
> > +       int *errors;
> > +       u8 nr_comps;
> > +};
> > +
> > +struct swap_batch_decomp_data {
> > +       u8 **srcs;
> > +       struct page **pages;
> > +       unsigned int *slens;
> > +       unsigned int *dlens;
> > +       int *errors;
> > +       u8 nr_decomps;
> > +};
> 
> This struct is not being used yet right? I assume this is used for
> batch zswap load and writeback etc.

Yes, you are right :)

> 
> Can we introduce them when those series are sent out? Just to limit
> the amount of reviewing here :)

Sure,  I can make this change in v12.

> 
> > +
> >  /* linux/mm/page_io.c */
> >  int sio_pool_init(void);
> >  struct swap_iocb;
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 8ca69c3f30df2..c30c1f325f573 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -35,6 +35,7 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/workqueue.h>
> >  #include <linux/list_lru.h>
> > +#include <linux/prefetch.h>
> >
> >  #include "swap.h"
> >  #include "internal.h"
> > @@ -988,71 +989,163 @@ static int zswap_cpu_comp_prepare(unsigned
> int cpu, struct hlist_node *node)
> >         return ret;
> >  }
> >
> > -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > -                          struct zswap_pool *pool)
> > +/*
> > + * Unified code path for compressors that do and do not support batching.
> This
> > + * procedure will compress multiple @nr_pages in @folio starting from the
> > + * @start index.
> > + *
> > + * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE.
> zswap_store() makes
> > + * sure of this by design.
> 
> Maybe add a VM_WARN_ON_ONCE(nr_pages > ZSWAP_MAX_BATCH_SIZE); in
> zswap_store_pages() to codify this design choice?
> 
> > + *
> > + * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the
> compressor does not
> > + * support batching.
> > + *
> > + * If @pool->compr_batch_size is 1, each page is processed sequentially.
> > + *
> > + * If @pool->compr_batch_size is > 1, compression batching is invoked,
> except if
> > + * @nr_pages is 1: if so, we call the fully synchronous non-batching
> > + * crypto_acomp API.
> > + *
> > + * In both cases, if all compressions are successful, the compressed buffers
> > + * are stored in zpool.
> > + *
> > + * A few important changes made to not regress and in fact improve
> > + * compression performance with non-batching software compressors,
> using this
> > + * new/batching code:
> > + *
> > + * 1) acomp_ctx mutex locking:
> > + *    Earlier, the mutex was held per page compression. With the new code,
> > + *    [un]locking the mutex per page caused regressions for software
> > + *    compressors. We now lock the mutex once per batch, which resolves
> the
> > + *    regression.
> 
> Makes sense, yeah.

Thanks!

> 
> > + *
> > + * 2) The prefetchw() and likely()/unlikely() annotations prevent
> > + *    regressions with software compressors like zstd, and generally improve
> > + *    non-batching compressors' performance with the batching code by
> ~3%.
> > + */
> > +static bool zswap_compress(struct folio *folio, long start, unsigned int
> nr_pages,
> > +                          struct zswap_entry *entries[], struct zswap_pool *pool,
> > +                          int node_id)
> >  {
> >         struct crypto_acomp_ctx *acomp_ctx;
> >         struct scatterlist input, output;
> > -       int comp_ret = 0, alloc_ret = 0;
> > -       unsigned int dlen = PAGE_SIZE;
> > -       unsigned long handle;
> > -       struct zpool *zpool;
> > +       struct zpool *zpool = pool->zpool;
> > +
> > +       unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
> > +       int errors[ZSWAP_MAX_BATCH_SIZE];
> > +
> > +       unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> > +       unsigned int i, j;
> > +       int err;
> >         gfp_t gfp;
> > -       u8 *dst;
> > +
> > +       gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> __GFP_MOVABLE;
> >
> >         acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> >
> >         mutex_lock(&acomp_ctx->mutex);
> >
> > -       dst = acomp_ctx->buffers[0];
> > -       sg_init_table(&input, 1);
> > -       sg_set_page(&input, page, PAGE_SIZE, 0);
> > -
> >         /*
> > -        * We need PAGE_SIZE * 2 here since there maybe over-compression
> case,
> > -        * and hardware-accelerators may won't check the dst buffer size, so
> > -        * giving the dst buffer with enough length to avoid buffer overflow.
> > +        * Note:
> > +        * [i] refers to the incoming batch space and is used to
> > +        *     index into the folio pages, @entries and @errors.
> >          */
> > -       sg_init_one(&output, dst, PAGE_SIZE * 2);
> > -       acomp_request_set_params(acomp_ctx->req, &input, &output,
> PAGE_SIZE, dlen);
> > +       for (i = 0; i < nr_pages; i += nr_comps) {
> > +               if (nr_comps == 1) {
> > +                       sg_init_table(&input, 1);
> > +                       sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE, 0);
> >
> > -       /*
> > -        * it maybe looks a little bit silly that we send an asynchronous request,
> > -        * then wait for its completion synchronously. This makes the process
> look
> > -        * synchronous in fact.
> > -        * Theoretically, acomp supports users send multiple acomp requests in
> one
> > -        * acomp instance, then get those requests done simultaneously. but in
> this
> > -        * case, zswap actually does store and load page by page, there is no
> > -        * existing method to send the second page before the first page is
> done
> > -        * in one thread doing zwap.
> > -        * but in different threads running on different cpu, we have different
> > -        * acomp instance, so multiple threads can do (de)compression in
> parallel.
> > -        */
> > -       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> >req), &acomp_ctx->wait);
> > -       dlen = acomp_ctx->req->dlen;
> > -       if (comp_ret)
> > -               goto unlock;
> > +                       /*
> > +                        * We need PAGE_SIZE * 2 here since there maybe over-
> compression case,
> > +                        * and hardware-accelerators may won't check the dst buffer
> size, so
> > +                        * giving the dst buffer with enough length to avoid buffer
> overflow.
> > +                        */
> > +                       sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
> > +                       acomp_request_set_params(acomp_ctx->req, &input,
> > +                                                &output, PAGE_SIZE, PAGE_SIZE);
> > +
> > +                       errors[i] =
> crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> > +                                                   &acomp_ctx->wait);
> > +                       if (unlikely(errors[i]))
> > +                               goto compress_error;
> > +
> > +                       dlens[i] = acomp_ctx->req->dlen;
> > +               } else {
> > +                       struct page *pages[ZSWAP_MAX_BATCH_SIZE];
> > +                       unsigned int k;
> > +
> > +                       for (k = 0; k < nr_pages; ++k)
> > +                               pages[k] = folio_page(folio, start + k);
> > +
> > +                       struct swap_batch_comp_data batch_comp_data = {
> > +                               .pages = pages,
> > +                               .dsts = acomp_ctx->buffers,
> > +                               .dlens = dlens,
> > +                               .errors = errors,
> > +                               .nr_comps = nr_pages,
> > +                       };
> > +
> > +                       acomp_ctx->req->kernel_data = &batch_comp_data;
> > +
> > +                       if (unlikely(crypto_acomp_compress(acomp_ctx->req)))
> > +                               goto compress_error;
> 
> I assume this is a new crypto API?

Not exactly a new API, rather a new "void *kernel_data" member added
to the existing "struct acomp_req".
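
For reference, the addition in this series is roughly the following (a sketch
of the idea, not the exact hunk; placement within the struct is illustrative):

 /* include/crypto/acompress.h */
 struct acomp_req {
 	...
+	/* Opaque per-request batching data interpreted by the driver. */
+	void *kernel_data;
 };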

> 
> I'll let Herbert decide whether this makes sense :)

Definitely.

Thanks,
Kanchana

> 
> > +               }
> >
> > -       zpool = pool->zpool;
> > -       gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> __GFP_MOVABLE;
> > -       alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page));
> > -       if (alloc_ret)
> > -               goto unlock;
> > -
> > -       zpool_obj_write(zpool, handle, dst, dlen);
> > -       entry->handle = handle;
> > -       entry->length = dlen;
> > -
> > -unlock:
> > -       if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > -               zswap_reject_compress_poor++;
> > -       else if (comp_ret)
> > -               zswap_reject_compress_fail++;
> > -       else if (alloc_ret)
> > -               zswap_reject_alloc_fail++;
> > +               /*
> > +                * All @nr_comps pages were successfully compressed.
> > +                * Store the pages in zpool.
> > +                *
> > +                * Note:
> > +                * [j] refers to the incoming batch space and is used to
> > +                *     index into the folio pages, @entries, @dlens and @errors.
> > +                * [k] refers to the @acomp_ctx space, as determined by
> > +                *     @pool->compr_batch_size, and is used to index into
> > +                *     @acomp_ctx->buffers.
> > +                */
> > +               for (j = i; j < i + nr_comps; ++j) {
> > +                       unsigned int k = j - i;
> > +                       unsigned long handle;
> > +
> > +                       /*
> > +                        * prefetchw() minimizes cache-miss latency by
> > +                        * moving the zswap entry to the cache before it
> > +                        * is written to; reducing sys time by ~1.5% for
> > +                        * non-batching software compressors.
> > +                        */
> > +                       prefetchw(entries[j]);
> > +                       err = zpool_malloc(zpool, dlens[j], gfp, &handle, node_id);
> > +
> > +                       if (unlikely(err)) {
> > +                               if (err == -ENOSPC)
> > +                                       zswap_reject_compress_poor++;
> > +                               else
> > +                                       zswap_reject_alloc_fail++;
> > +
> > +                               goto err_unlock;
> > +                       }
> > +
> > +                       zpool_obj_write(zpool, handle, acomp_ctx->buffers[k],
> dlens[j]);
> > +                       entries[j]->handle = handle;
> > +                       entries[j]->length = dlens[j];
> > +               }
> > +       } /* finished compress and store nr_pages. */
> >
> >         mutex_unlock(&acomp_ctx->mutex);
> > -       return comp_ret == 0 && alloc_ret == 0;
> > +       return true;
> > +
> > +compress_error:
> > +       for (j = i; j < i + nr_comps; ++j) {
> > +               if (errors[j]) {
> > +                       if (errors[j] == -ENOSPC)
> > +                               zswap_reject_compress_poor++;
> > +                       else
> > +                               zswap_reject_compress_fail++;
> > +               }
> > +       }
> > +
> > +err_unlock:
> > +       mutex_unlock(&acomp_ctx->mutex);
> > +       return false;
> >  }
> >
> >  static bool zswap_decompress(struct zswap_entry *entry, struct folio
> *folio)
> > @@ -1590,12 +1683,8 @@ static bool zswap_store_pages(struct folio
> *folio,
> >                 INIT_LIST_HEAD(&entries[i]->lru);
> >         }
> >
> > -       for (i = 0; i < nr_pages; ++i) {
> > -               struct page *page = folio_page(folio, start + i);
> > -
> > -               if (!zswap_compress(page, entries[i], pool))
> > -                       goto store_pages_failed;
> > -       }
> > +       if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
> node_id)))
> > +               goto store_pages_failed;
> >
> >         for (i = 0; i < nr_pages; ++i) {
> >                 struct zswap_entry *old, *entry = entries[i];
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver
  2025-08-08 23:51 ` [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Nhat Pham
  2025-08-09  0:03   ` Sridhar, Kanchana P
@ 2025-08-15  5:27   ` Herbert Xu
  2025-08-22 19:26     ` Sridhar, Kanchana P
  1 sibling, 1 reply; 68+ messages in thread
From: Herbert Xu @ 2025-08-15  5:27 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosry.ahmed,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, vinicius.gomes, wajdi.k.feghali,
	vinodh.gopal

On Fri, Aug 08, 2025 at 04:51:14PM -0700, Nhat Pham wrote:
> 
> Can we get some comments from crypto tree maintainers as well? I feel
> like this patch series is more crypto patch than zswap patch, at this
> point.
> 
> Can we land any zswap parts without the crypto API change? Grasping at
> straws here, in case we can parallelize the reviewing and merging
> process.

My preference is for a unified interface that caters to both
software compression as well as parallel hardware compression.

The reason is that there is a clear advantage in passing a large
batch of pages to the Crypto API even for software compression:
at the least, we could pack the compressed results together
and avoid the unnecessary copying of the compressed output that
is currently done in zswap.

However, since you guys are both happy with this patch-set,
I'm not going to stand in the way.

But I do want some changes made to the proposed Crypto API interface
so that it can be reused for IPComp.

In particular, instead of passing an opaque pointer (kernel_data)
to magically turn on batching, please add a new helper that enables
batching.

I don't think we need any extra fields in struct acomp_req apart
from a new field called unit_size.  This would be 4096 for zswap,
it could be the MTU for IPsec.

So add something like this and document that it must be called
after acomp_request_set_callback() (which should set unit_size to 0):

static inline void acomp_request_set_unit_size(struct acomp_req *req,
					       unsigned int du)
{
	req->unit = du;
}

static inline void acomp_request_set_callback(struct acomp_req *req, ...)
{
	...
+	req->unit = 0;
}

For the source, nothing needs to be done because the folio could
be passed in as is.

For the destination, construct an SG list for them and pass that in.
The rule should be that the SG list must contain a sufficient number
of pages for the compression output based on the given unit size.

For the output lengths, just set the lengths in the destination
SG list after compression.  If a page is incompressible (including
an error), just set the length to a negative value (-ENOSPC could
be used for incompressible input, as we already do).  Even though
struct scatterlist->length is unsigned, there should be no issue
with storing a negative value there.
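
To make this concrete, a minimal caller-side sketch could look like the
following.  It assumes the proposed unit_size helper and the negative-length
convention above; the helper name and the pr_debug() handling are purely
illustrative:

#include <crypto/acomp.h>
#include <linux/scatterlist.h>
#include <linux/printk.h>

/*
 * Sketch only: compress @nr_units PAGE_SIZE units described by @src into
 * the per-unit destination buffers described by @dst.
 */
static int example_compress_batch(struct acomp_req *req,
				  struct crypto_wait *wait,
				  struct scatterlist *src,
				  struct scatterlist *dst,
				  unsigned int nr_units)
{
	struct scatterlist *sg;
	unsigned int i;
	int err;

	/* Proposed helper; must be called after acomp_request_set_callback(). */
	acomp_request_set_unit_size(req, PAGE_SIZE);

	acomp_request_set_params(req, src, dst,
				 nr_units * PAGE_SIZE, nr_units * PAGE_SIZE);

	err = crypto_wait_req(crypto_acomp_compress(req), wait);
	if (err)
		return err;

	/* Per-unit results come back as the destination SG list lengths. */
	for_each_sg(dst, sg, nr_units, i) {
		int dlen = (int)sg->length;

		if (dlen < 0) {
			/* -ENOSPC would mean incompressible input. */
			pr_debug("unit %u not compressed: %d\n", i, dlen);
			continue;
		}
		/* sg_virt(sg) .. sg_virt(sg) + dlen holds the compressed unit. */
	}
	return 0;
}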

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 18/24] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size.
  2025-08-01  4:36 ` [PATCH v11 18/24] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size Kanchana P Sridhar
@ 2025-08-15  5:28   ` Herbert Xu
  2025-08-22 19:31     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Herbert Xu @ 2025-08-15  5:28 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, vinicius.gomes, wajdi.k.feghali,
	vinodh.gopal

On Thu, Jul 31, 2025 at 09:36:36PM -0700, Kanchana P Sridhar wrote:
>
> diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
> index ffffd88bbbad3..2325ee18e7a10 100644
> --- a/include/crypto/internal/acompress.h
> +++ b/include/crypto/internal/acompress.h
> @@ -28,6 +28,8 @@
>   *
>   * @compress:	Function performs a compress operation
>   * @decompress:	Function performs a de-compress operation
> + * @get_batch_size:	Maximum batch-size for batching compress/decompress
> + *			operations.
>   * @init:	Initialize the cryptographic transformation object.
>   *		This function is used to initialize the cryptographic
>   *		transformation object. This function is called only once at
> @@ -46,6 +48,7 @@
>  struct acomp_alg {
>  	int (*compress)(struct acomp_req *req);
>  	int (*decompress)(struct acomp_req *req);
> +	unsigned int (*get_batch_size)(void);

I can't imagine a situation where this needs to be dynamic.
Please just make it a static value rather than a callback function.
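
For instance, something along these lines (sketch only; the field name and
the driver-side constant are illustrative):

 /* include/crypto/internal/acompress.h */
 struct acomp_alg {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
-	unsigned int (*get_batch_size)(void);
+	/* Max units per request; 1 if the algorithm does not batch. */
+	unsigned int batch_size;

and then the driver sets it at registration time:

 static struct acomp_alg iaa_acomp_fixed_deflate = {
 	.compress		= iaa_comp_acompress,
 	.decompress		= iaa_comp_adecompress,
+	.batch_size		= IAA_CRYPTO_MAX_BATCH_SIZE,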

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver
  2025-08-15  5:27   ` Herbert Xu
@ 2025-08-22 19:26     ` Sridhar, Kanchana P
  2025-08-25  5:38       ` Herbert Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-22 19:26 UTC (permalink / raw)
  To: Herbert Xu, Nhat Pham
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Thursday, August 14, 2025 10:28 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 00/24] zswap compression batching with optimized
> iaa_crypto driver
> 
> On Fri, Aug 08, 2025 at 04:51:14PM -0700, Nhat Pham wrote:
> >
> > Can we get some comments from crypto tree maintainers as well? I feel
> > like this patch series is more crypto patch than zswap patch, at this
> > point.
> >
> > Can we land any zswap parts without the crypto API change? Grasping at
> > straws here, in case we can parallelize the reviewing and merging
> > process.
> 
> My preference is for a unified interface that caters to both
> software compression as well as parallel hardware compression.
> 
> The reason is that there is clear advantage in passing a large
> batch of pages to the Crypto API even for software compression,
> the least we could do is to pack the compressed result together
> and avoid the unnecessary copying of the compressed output that
> is currently done in zswap.
> 
> However, since you guys are both happy with this patch-set,
> I'm not going stand in the way.
> 
> But I do want some changes made to the proposed Crypto API interface
> so that it can be reused for IPComp.
> 
> In particular, instead of passing an opaque pointer (kernel_data)
> to magically turn on batching, please add a new helper that enables
> batching.
> 
> I don't think we need any extra fields in struct acomp_req apart
> from a new field called unit_size.  This would be 4096 for zswap,
> it could be the MTU for IPsec.
> 
> So add something like this and document that it must be called
> after acmop_request_set_callback (which should set unit_size to 0):
> 
> static inline void acomp_request_set_unit_size(struct acomp_req *req,
> 					       unsigned int du)
> {
> 	req->unit = du;
> }
> 
> static inline void acomp_request_set_callback(struct acomp_req *req, ...)
> {
> 	...
> +	req->unit = 0;
> }
> 
> For the source, nothing needs to be done because the folio could
> be passed in as is.
> 
> For the destination, construct an SG list for them and pass that in.
> The rule should be that the SG list must contain a sufficient number
> of pages for the compression output based on the given unit size.
> 
> For the output lengths, just set the lengths in the destination
> SG list after compression.  If a page is incompressible (including
> an error), just set the length to a negative value (-ENOSPC could
> be used for incompressible input, as we already do).  Even though
> struct scatterlist->length is unsigned, there should be no issue
> with storing a negative value there.

Hi Herbert, Nhat,

Thanks Herbert for these suggestions! I have implemented the new crypto API
and the SG list suggestion. While doing so, I was also able to consolidate the
new scatterlist based zswap_compress() implementation for software and hardware
(i.e. batching) compressors, within the constraints of not changing anything
below the crypto layer for software compressors.

I wanted to provide some additional details so that you can review the overall
approach and let me know if things look Ok. I will rebase the code to the latest
mm-unstable and start working on v12 in the meantime.

1) The zswap per-CPU acomp_ctx has two sg_tables added, one each for
   inputs/outputs, with nents set to pool->compr_batch_size (1 for software
   compressors). This per-CPU data incurs additional memory overhead; however,
   this is memory that would otherwise be allocated on the stack in
   zswap_compress(), and it is less overhead than the latter because we know
   exactly how many sg_table scatterlists to allocate for the given pool
   (assuming we don't kmalloc in zswap_compress()). I will make sure to
   quantify the overhead in v12's commit logs.

2) I added new sg_alloc_table_node() and sg_kmalloc_node() to facilitate this.

3) I added the acomp_request_set_unit_size() helper for
   batching; initialized the unit_size to 0 in
   acomp_request_set_callback(). zswap_cpu_comp_prepare() will set the unit_size
   to PAGE_SIZE after the call to acomp_request_set_callback().

4) Unified code in zswap_compress() for software and hardware compressors to
   use the per-CPU SG lists. Some unavoidable changes were needed for the
   software path to use acomp_req->dlen instead of the SG list output length.

5) A trade-off I had to make in the iaa_crypto driver to adhere to the new
   SG-list-based batching architecture:

   Currently, all calls to dma_map_sg() in iaa_crypto_main.c use
   sg_nents(req->src[dst]). This requires that the sg_init_marker() be set
   correctly based on the number of pages in the batch. This in turn requires
   sg_unmark_end() to be called to clear the termination marker before
   returning. All of this adds latency to zswap_compress() (i.e., per batch
   compress call) with the new approach and causes a regression w.r.t. v11.

   To make the new approach functional and performant, I have changed all
   the calls to dma_map_sg() to use an nents of 1 (a minimal sketch of the
   difference follows this list). This should not be a concern, since it
   eliminates redundant computes to scan an SG list with only one scatterlist
   for existing kernel users, i.e., zswap with the zswap_compress()
   modifications described in (4). This will continue to hold true with the
   zram IAA batching support I am developing. There are no kernel use cases
   for the iaa_crypto driver that will break this assumption.

6) "For the source, nothing needs to be done because the folio could be passed
   in as is.". As far as I know, this cannot be accomplished without
   modifications to the crypto API for software compressors, because compressed
   buffers need to be stored in the zswap/zram zs_pools at PAGE_SIZE
   granularity.
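
As referenced in (5), the dma_map_sg() change amounts to roughly the
following (an illustrative helper, not the actual iaa_crypto code):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * Sketch only: when each scatterlist describes exactly one buffer, the
 * mapping can pass nents = 1 instead of sg_nents(), avoiding the need to
 * keep SG end markers (sg_init_marker()/sg_unmark_end()) correct for every
 * batch.
 */
static int example_map_one_buffer(struct device *dev, struct scatterlist *sg)
{
	/* Before (conceptually): dma_map_sg(dev, sg, sg_nents(sg), DMA_TO_DEVICE); */
	return dma_map_sg(dev, sg, 1, DMA_TO_DEVICE);
}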
   
I have validated the above v12 changes applied over the v11 patch-series,
on Sapphire Rapids:

  1) usemem30: Both zstd and deflate-iaa have comparable performance to v11.

  2) kernel_compilation test: Mostly better performance than baseline, but worse
     than v11. Slightly worse sys time than baseline for zstd/PMD.


  usemem30 with 64K folios:
  =========================
  
     zswap shrinker_enabled = N.
  
     -----------------------------------------------------------------------
                     mm-unstable-7-30-2025             v11             v12
     -----------------------------------------------------------------------
     zswap compressor          deflate-iaa     deflate-iaa     deflate-iaa
     -----------------------------------------------------------------------
     Total throughput (KB/s)     7,153,359      10,856,388      10,714,236
     Avg throughput (KB/s)         238,445         361,879         357,141          
     elapsed time (sec)              92.61           70.50           68.87
     sys time (sec)               2,193.59        1,675.32        1,613.11
     -----------------------------------------------------------------------
    
     -----------------------------------------------------------------------
                     mm-unstable-7-30-2025             v11             v12
     -----------------------------------------------------------------------
     zswap compressor                 zstd            zstd            zstd
     -----------------------------------------------------------------------
     Total throughput (KB/s)     6,866,411       6,874,244       6,922,818
     Avg throughput (KB/s)         228,880         229,141         230,760
     elapsed time (sec)              96.45           89.05           87.75
     sys time (sec)               2,410.72        2,150.63        2,090.86     
     -----------------------------------------------------------------------


  usemem30 with 2M folios:
  ========================
  
     zswap shrinker_enabled = N.
    
     -----------------------------------------------------------------------
                     mm-unstable-7-30-2025             v11             v12
     -----------------------------------------------------------------------
     zswap compressor          deflate-iaa     deflate-iaa     deflate-iaa
     -----------------------------------------------------------------------
     Total throughput (KB/s)     7,268,929      11,312,195      10,943,491
     Avg throughput (KB/s)         242,297         377,073         364,783
     elapsed time (sec)              80.40           68.73           69.19
     sys time (sec)               1,856.54        1,599.25        1,618.08
     -----------------------------------------------------------------------
  
     -----------------------------------------------------------------------
                     mm-unstable-7-30-2025             v11             v12
     -----------------------------------------------------------------------
     zswap compressor                 zstd            zstd            zstd           
     -----------------------------------------------------------------------
     Total throughput (KB/s)     7,560,441       7,627,155       7,600,588
     Avg throughput (KB/s)         252,014         254,238         253,352
     elapsed time (sec)              88.89           83.22           87.55
     sys time (sec)               2,132.05        1,952.98        2,079.26
     -----------------------------------------------------------------------


  kernel_compilation with 64K folios:
  ===================================

     zswap shrinker_enabled = Y.
  
     --------------------------------------------------------------------------
                        mm-unstable-7-30-2025             v11             v12
     --------------------------------------------------------------------------
     zswap compressor             deflate-iaa     deflate-iaa     deflate-iaa
     --------------------------------------------------------------------------
     real_sec                          901.81          840.60          895.94
     sys_sec                         2,672.93        2,171.17        2,584.04
     zswpout                       34,700,692      24,076,095      37,725,671
     zswap_written_back_pages       2,612,474       1,451,961       3,050,557
     --------------------------------------------------------------------------

     --------------------------------------------------------------------------
                        mm-unstable-7-30-2025             v11             v12
     --------------------------------------------------------------------------
     zswap compressor                    zstd            zstd            zstd
     --------------------------------------------------------------------------
     real_sec                          882.67          837.21          872.98  
     sys_sec                         3,573.31        2,593.94        3,301.67
     zswpout                       42,768,967      22,660,215      36,810,396
     zswap_written_back_pages       2,109,739         727,919       1,475,480
     --------------------------------------------------------------------------


  kernel_compilation with PMD folios:
  ===================================

     zswap shrinker_enabled = Y.

     --------------------------------------------------------------------------
                        mm-unstable-7-30-2025             v11             v12
     --------------------------------------------------------------------------
     zswap compressor             deflate-iaa     deflate-iaa     deflate-iaa
     --------------------------------------------------------------------------
     real_sec                          838.76          804.83          826.05
     sys_sec                         3,173.57        2,422.63        3,128.11
     zswpout                       59,544,198      38,093,995      60,072,119
     zswap_written_back_pages       2,726,367         929,614       2,324,707
     --------------------------------------------------------------------------
 
 
     --------------------------------------------------------------------------
                        mm-unstable-7-30-2025             v11             v12
     --------------------------------------------------------------------------
     zswap compressor                    zstd            zstd            zstd 
     --------------------------------------------------------------------------
     real_sec                          831.09          813.40          827.84
     sys_sec                         4,251.11        3,053.95        4,406.65
     zswpout                       59,452,638      35,832,407      63,459,471
     zswap_written_back_pages       1,041,721         423,334       1,162,913
     --------------------------------------------------------------------------


I am still in the process of verifying if modifying zswap_decompress() to use
the per-CPU SG lists improves kernel_compilation, but thought this would be a
good sync point to get your thoughts.

I would greatly appreciate your comments on the approach and trade-offs, and
guidance on how to proceed.


"v12" zswap.c diff wrt v11:
===========================

diff --git a/mm/zswap.c b/mm/zswap.c
index c30c1f325f57..58ad257e87e8 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -152,6 +152,8 @@ struct crypto_acomp_ctx {
 	struct acomp_req *req;
 	struct crypto_wait wait;
 	u8 **buffers;
+	struct sg_table *sg_inputs;
+	struct sg_table *sg_outputs;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -282,6 +284,16 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
 			kfree(acomp_ctx->buffers[i]);
 		kfree(acomp_ctx->buffers);
 	}
+
+	if (acomp_ctx->sg_inputs) {
+		sg_free_table(acomp_ctx->sg_inputs);
+		acomp_ctx->sg_inputs = NULL;
+	}
+
+	if (acomp_ctx->sg_outputs) {
+		sg_free_table(acomp_ctx->sg_outputs);
+		acomp_ctx->sg_outputs = NULL;
+	}
 }
 
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
@@ -922,6 +934,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 {
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+	int cpu_node = cpu_to_node(cpu);
 	int ret = -ENOMEM;
 	u8 i;
 
@@ -936,7 +949,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 		return 0;
 
-	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_node);
 	if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
 				pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
@@ -960,13 +973,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 				     crypto_acomp_batch_size(acomp_ctx->acomp));
 
 	acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8 *),
-					  GFP_KERNEL, cpu_to_node(cpu));
+					  GFP_KERNEL, cpu_node);
 	if (!acomp_ctx->buffers)
 		goto fail;
 
 	for (i = 0; i < pool->compr_batch_size; ++i) {
 		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
-						     cpu_to_node(cpu));
+						     cpu_node);
 		if (!acomp_ctx->buffers[i])
 			goto fail;
 	}
@@ -981,6 +994,26 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
+	acomp_request_set_unit_size(acomp_ctx->req, PAGE_SIZE);
+
+	acomp_ctx->sg_inputs = kmalloc_node(sizeof(*acomp_ctx->sg_inputs),
+					    GFP_KERNEL, cpu_node);
+	if (!acomp_ctx->sg_inputs)
+		goto fail;
+
+	if (sg_alloc_table_node(&acomp_ctx->sg_inputs, pool->compr_batch_size,
+				GFP_KERNEL, cpu_node))
+		goto fail;
+
+	acomp_ctx->sg_outputs = kmalloc_node(sizeof(*acomp_ctx->sg_outputs),
+					     GFP_KERNEL, cpu_node);
+	if (!acomp_ctx->sg_outputs)
+		goto fail;
+
+	if (sg_alloc_table_node(&acomp_ctx->sg_outputs, pool->compr_batch_size,
+				GFP_KERNEL, cpu_node))
+		goto fail;
+
 	mutex_init(&acomp_ctx->mutex);
 	return 0;
 
@@ -1027,17 +1060,14 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 			   struct zswap_entry *entries[], struct zswap_pool *pool,
 			   int node_id)
 {
+	unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
+	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
 	struct crypto_acomp_ctx *acomp_ctx;
-	struct scatterlist input, output;
 	struct zpool *zpool = pool->zpool;
-
-	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
-	int errors[ZSWAP_MAX_BATCH_SIZE];
-
-	unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
-	unsigned int i, j;
-	int err;
+	struct scatterlist *sg;
+	unsigned int i, j, k;
 	gfp_t gfp;
+	int err;
 
 	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
 
@@ -1045,59 +1075,58 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 
 	mutex_lock(&acomp_ctx->mutex);
 
+	prefetchw(acomp_ctx->sg_inputs->sgl);
+	prefetchw(acomp_ctx->sg_outputs->sgl);
+
 	/*
 	 * Note:
 	 * [i] refers to the incoming batch space and is used to
-	 *     index into the folio pages, @entries and @errors.
+	 *     index into the folio pages and @entries.
+	 *
+	 * [k] refers to the @acomp_ctx space, as determined by
+	 *     @pool->compr_batch_size, and is used to index into
+	 *     @acomp_ctx->buffers and @dlens.
 	 */
 	for (i = 0; i < nr_pages; i += nr_comps) {
-		if (nr_comps == 1) {
-			sg_init_table(&input, 1);
-			sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE, 0);
+		for_each_sg(acomp_ctx->sg_inputs->sgl, sg, nr_comps, k)
+			sg_set_page(sg, folio_page(folio, start + k + i), PAGE_SIZE, 0);
 
-			/*
-			 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
-			 * and hardware-accelerators may won't check the dst buffer size, so
-			 * giving the dst buffer with enough length to avoid buffer overflow.
-			 */
-			sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
-			acomp_request_set_params(acomp_ctx->req, &input,
-						 &output, PAGE_SIZE, PAGE_SIZE);
-
-			errors[i] = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
-						    &acomp_ctx->wait);
-			if (unlikely(errors[i]))
-				goto compress_error;
-
-			dlens[i] = acomp_ctx->req->dlen;
-		} else {
-			struct page *pages[ZSWAP_MAX_BATCH_SIZE];
-			unsigned int k;
-
-			for (k = 0; k < nr_pages; ++k)
-				pages[k] = folio_page(folio, start + k);
-
-			struct swap_batch_comp_data batch_comp_data = {
-				.pages = pages,
-				.dsts = acomp_ctx->buffers,
-				.dlens = dlens,
-				.errors = errors,
-				.nr_comps = nr_pages,
-			};
-
-			acomp_ctx->req->kernel_data = &batch_comp_data;
-
-			if (unlikely(crypto_acomp_compress(acomp_ctx->req)))
-				goto compress_error;
+		/*
+		 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
+		 * and hardware-accelerators may won't check the dst buffer size, so
+		 * giving the dst buffer with enough length to avoid buffer overflow.
+		 */
+		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
+			sg_set_buf(sg, acomp_ctx->buffers[k], PAGE_SIZE * 2);
+
+		acomp_request_set_params(acomp_ctx->req,
+					 acomp_ctx->sg_inputs->sgl,
+					 acomp_ctx->sg_outputs->sgl,
+					 nr_comps * PAGE_SIZE,
+					 nr_comps * PAGE_SIZE);
+
+		err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
+				      &acomp_ctx->wait);
+
+		if (unlikely(err)) {
+			if (nr_comps == 1)
+				dlens[0] = err;
+			goto compress_error;
 		}
 
+		if (nr_comps == 1)
+			dlens[0] = acomp_ctx->req->dlen;
+		else
+			for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
+				dlens[k] = sg->length;
+
 		/*
 		 * All @nr_comps pages were successfully compressed.
 		 * Store the pages in zpool.
 		 *
 		 * Note:
 		 * [j] refers to the incoming batch space and is used to
-		 *     index into the folio pages, @entries, @dlens and @errors.
+		 *     index into the folio pages and @entries.
 		 * [k] refers to the @acomp_ctx space, as determined by
 		 *     @pool->compr_batch_size, and is used to index into
 		 *     @acomp_ctx->buffers.
@@ -1113,7 +1142,7 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 			 * non-batching software compressors.
 			 */
 			prefetchw(entries[j]);
-			err = zpool_malloc(zpool, dlens[j], gfp, &handle, node_id);
+			err = zpool_malloc(zpool, dlens[k], gfp, &handle, node_id);
 
 			if (unlikely(err)) {
 				if (err == -ENOSPC)
@@ -1124,9 +1153,9 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 				goto err_unlock;
 			}
 
-			zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], dlens[j]);
+			zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], dlens[k]);
 			entries[j]->handle = handle;
-			entries[j]->length = dlens[j];
+			entries[j]->length = dlens[k];
 		}
 	} /* finished compress and store nr_pages. */
 
@@ -1134,9 +1163,9 @@ static bool zswap_compress(struct folio *folio, long start, unsigned int nr_page
 	return true;
 
 compress_error:
-	for (j = i; j < i + nr_comps; ++j) {
-		if (errors[j]) {
-			if (errors[j] == -ENOSPC)
+	for (k = 0; k < nr_comps; ++k) {
+		if (dlens[k] < 0) {
+			if (dlens[k] == -ENOSPC)
 				zswap_reject_compress_poor++;
 			else
 				zswap_reject_compress_fail++;
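
A quick worked example of the two index spaces in the loop above (a sketch;
nr_comps = min(nr_pages, pool->compr_batch_size) per the function's
declarations, and the batch sizes below are illustrative):

	/*
	 * nr_pages = 8, pool->compr_batch_size = 1 (software compressor):
	 *   nr_comps = 1, the outer loop runs 8 times, [k] is always 0, so
	 *   only acomp_ctx->buffers[0] and dlens[0] are used per iteration.
	 *
	 * nr_pages = 8, pool->compr_batch_size = 8 (batching compressor):
	 *   nr_comps = 8, the outer loop runs once, and [k] walks 0..7 over
	 *   the destination SG list, acomp_ctx->buffers[] and dlens[].
	 */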



Thanks,
Kanchana

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 18/24] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size.
  2025-08-15  5:28   ` Herbert Xu
@ 2025-08-22 19:31     ` Sridhar, Kanchana P
  2025-08-22 21:48       ` Nhat Pham
  0 siblings, 1 reply; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-22 19:31 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Thursday, August 14, 2025 10:29 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 18/24] crypto: acomp - Add
> crypto_acomp_batch_size() to get an algorithm's batch-size.
> 
> On Thu, Jul 31, 2025 at 09:36:36PM -0700, Kanchana P Sridhar wrote:
> >
> > diff --git a/include/crypto/internal/acompress.h
> b/include/crypto/internal/acompress.h
> > index ffffd88bbbad3..2325ee18e7a10 100644
> > --- a/include/crypto/internal/acompress.h
> > +++ b/include/crypto/internal/acompress.h
> > @@ -28,6 +28,8 @@
> >   *
> >   * @compress:	Function performs a compress operation
> >   * @decompress:	Function performs a de-compress operation
> > + * @get_batch_size:	Maximum batch-size for batching
> compress/decompress
> > + *			operations.
> >   * @init:	Initialize the cryptographic transformation object.
> >   *		This function is used to initialize the cryptographic
> >   *		transformation object. This function is called only once at
> > @@ -46,6 +48,7 @@
> >  struct acomp_alg {
> >  	int (*compress)(struct acomp_req *req);
> >  	int (*decompress)(struct acomp_req *req);
> > +	unsigned int (*get_batch_size)(void);
> 
> I can't imagine a situation where this needs to be dynamic.
> Please just make it a static value rather than a callback function.

Hi Herbert,

I am not sure I understand. Kernel users such as zswap/zram need to query
the algorithm to get the maximum supported batch-size so they can allocate
resources for dst buffers. The get_batch_size() callback and associated
crypto_acomp_batch_size() wrapper help accomplish this.
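
For reference, this is roughly how that query is used on the zswap side to
size the per-CPU dst buffers (a sketch mirroring the zswap patch later in
this series; error handling elided):

	/* In zswap_cpu_comp_prepare(), per CPU: */
	pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
				     crypto_acomp_batch_size(acomp_ctx->acomp));

	acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8 *),
					  GFP_KERNEL, cpu_to_node(cpu));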

Can you please clarify what you mean by "static value"?

Thanks,
Kanchana

> 
> Thanks,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 18/24] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size.
  2025-08-22 19:31     ` Sridhar, Kanchana P
@ 2025-08-22 21:48       ` Nhat Pham
  2025-08-22 21:58         ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Nhat Pham @ 2025-08-22 21:48 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Herbert Xu, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh

On Fri, Aug 22, 2025 at 12:31 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> > -----Original Message-----
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Sent: Thursday, August 14, 2025 10:29 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v11 18/24] crypto: acomp - Add
> > crypto_acomp_batch_size() to get an algorithm's batch-size.
> >
> > On Thu, Jul 31, 2025 at 09:36:36PM -0700, Kanchana P Sridhar wrote:
> > >
> > > diff --git a/include/crypto/internal/acompress.h
> > b/include/crypto/internal/acompress.h
> > > index ffffd88bbbad3..2325ee18e7a10 100644
> > > --- a/include/crypto/internal/acompress.h
> > > +++ b/include/crypto/internal/acompress.h
> > > @@ -28,6 +28,8 @@
> > >   *
> > >   * @compress:      Function performs a compress operation
> > >   * @decompress:    Function performs a de-compress operation
> > > + * @get_batch_size:        Maximum batch-size for batching
> > compress/decompress
> > > + *                 operations.
> > >   * @init:  Initialize the cryptographic transformation object.
> > >   *         This function is used to initialize the cryptographic
> > >   *         transformation object. This function is called only once at
> > > @@ -46,6 +48,7 @@
> > >  struct acomp_alg {
> > >     int (*compress)(struct acomp_req *req);
> > >     int (*decompress)(struct acomp_req *req);
> > > +   unsigned int (*get_batch_size)(void);
> >
> > I can't imagine a situation where this needs to be dynamic.
> > Please just make it a static value rather than a callback function.
>
> Hi Herbert,
>
> I am not sure I understand.. Kernel users such as zswap/zram need to query
> the algorithm to get the maximum supported batch-size so they can allocate
> resources for dst buffers. The get_batch_size() callback and associated
> crypto_acomp_batch_size() wrapper help accomplish this.

I think he meant storing it as a static unsigned int field, rather than
as a function pointer (i.e. dynamic) like this.

Does batch size ever change at runtime?
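
For illustration, the suggested change might look roughly like this (a
minimal sketch; the field name and default handling are assumptions, not
the actual v12 change):

 struct acomp_alg {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
-	unsigned int (*get_batch_size)(void);
+	unsigned int batch_size;	/* assumed name; treated as 1 if unset */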


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 18/24] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size.
  2025-08-22 21:48       ` Nhat Pham
@ 2025-08-22 21:58         ` Sridhar, Kanchana P
  2025-08-22 22:00           ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-22 21:58 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Herbert Xu, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Friday, August 22, 2025 2:48 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 18/24] crypto: acomp - Add
> crypto_acomp_batch_size() to get an algorithm's batch-size.
> 
> On Fri, Aug 22, 2025 at 12:31 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Herbert Xu <herbert@gondor.apana.org.au>
> > > Sent: Thursday, August 14, 2025 10:29 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> > > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > > ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>;
> > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v11 18/24] crypto: acomp - Add
> > > crypto_acomp_batch_size() to get an algorithm's batch-size.
> > >
> > > On Thu, Jul 31, 2025 at 09:36:36PM -0700, Kanchana P Sridhar wrote:
> > > >
> > > > diff --git a/include/crypto/internal/acompress.h
> > > b/include/crypto/internal/acompress.h
> > > > index ffffd88bbbad3..2325ee18e7a10 100644
> > > > --- a/include/crypto/internal/acompress.h
> > > > +++ b/include/crypto/internal/acompress.h
> > > > @@ -28,6 +28,8 @@
> > > >   *
> > > >   * @compress:      Function performs a compress operation
> > > >   * @decompress:    Function performs a de-compress operation
> > > > + * @get_batch_size:        Maximum batch-size for batching
> > > compress/decompress
> > > > + *                 operations.
> > > >   * @init:  Initialize the cryptographic transformation object.
> > > >   *         This function is used to initialize the cryptographic
> > > >   *         transformation object. This function is called only once at
> > > > @@ -46,6 +48,7 @@
> > > >  struct acomp_alg {
> > > >     int (*compress)(struct acomp_req *req);
> > > >     int (*decompress)(struct acomp_req *req);
> > > > +   unsigned int (*get_batch_size)(void);
> > >
> > > I can't imagine a situation where this needs to be dynamic.
> > > Please just make it a static value rather than a callback function.
> >
> > Hi Herbert,
> >
> > I am not sure I understand.. Kernel users such as zswap/zram need to query
> > the algorithm to get the maximum supported batch-size so they can allocate
> > resources for dst buffers. The get_batch_size() callback and associated
> > crypto_acomp_batch_size() wrapper help accomplish this.
> 
> I think he meant stored it as a static unsigned int field, rather than
> a function pointer (i.e dynamic) like this.

I see. Got it! Sure, I will make this change in v12. Thanks Nhat!

Best regards,
Kanchana

> 
> Does batch size ever change at runtime?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 18/24] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size.
  2025-08-22 21:58         ` Sridhar, Kanchana P
@ 2025-08-22 22:00           ` Sridhar, Kanchana P
  0 siblings, 0 replies; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-22 22:00 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Herbert Xu, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Friday, August 22, 2025 2:58 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v11 18/24] crypto: acomp - Add
> crypto_acomp_batch_size() to get an algorithm's batch-size.
> 
> 
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Friday, August 22, 2025 2:48 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: Herbert Xu <herbert@gondor.apana.org.au>; linux-
> > kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> > yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v11 18/24] crypto: acomp - Add
> > crypto_acomp_batch_size() to get an algorithm's batch-size.
> >
> > On Fri, Aug 22, 2025 at 12:31 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Herbert Xu <herbert@gondor.apana.org.au>
> > > > Sent: Thursday, August 14, 2025 10:29 PM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> > > > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > > > ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > > senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> > > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > > <kristen.c.accardi@intel.com>; Gomes, Vinicius
> > <vinicius.gomes@intel.com>;
> > > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > > <vinodh.gopal@intel.com>
> > > > Subject: Re: [PATCH v11 18/24] crypto: acomp - Add
> > > > crypto_acomp_batch_size() to get an algorithm's batch-size.
> > > >
> > > > On Thu, Jul 31, 2025 at 09:36:36PM -0700, Kanchana P Sridhar wrote:
> > > > >
> > > > > diff --git a/include/crypto/internal/acompress.h
> > > > b/include/crypto/internal/acompress.h
> > > > > index ffffd88bbbad3..2325ee18e7a10 100644
> > > > > --- a/include/crypto/internal/acompress.h
> > > > > +++ b/include/crypto/internal/acompress.h
> > > > > @@ -28,6 +28,8 @@
> > > > >   *
> > > > >   * @compress:      Function performs a compress operation
> > > > >   * @decompress:    Function performs a de-compress operation
> > > > > + * @get_batch_size:        Maximum batch-size for batching
> > > > compress/decompress
> > > > > + *                 operations.
> > > > >   * @init:  Initialize the cryptographic transformation object.
> > > > >   *         This function is used to initialize the cryptographic
> > > > >   *         transformation object. This function is called only once at
> > > > > @@ -46,6 +48,7 @@
> > > > >  struct acomp_alg {
> > > > >     int (*compress)(struct acomp_req *req);
> > > > >     int (*decompress)(struct acomp_req *req);
> > > > > +   unsigned int (*get_batch_size)(void);
> > > >
> > > > I can't imagine a situation where this needs to be dynamic.
> > > > Please just make it a static value rather than a callback function.
> > >
> > > Hi Herbert,
> > >
> > > I am not sure I understand.. Kernel users such as zswap/zram need to
> query
> > > the algorithm to get the maximum supported batch-size so they can
> allocate
> > > resources for dst buffers. The get_batch_size() callback and associated
> > > crypto_acomp_batch_size() wrapper help accomplish this.
> >
> > I think he meant stored it as a static unsigned int field, rather than
> > a function pointer (i.e dynamic) like this.
> 
> I see. Got it! Sure, I will make this change in v12. Thanks Nhat!
> 
> Best regards,
> Kanchana
> 
> >
> > Does batch size ever change at runtime?

No, batch size doesn't change at runtime.

Thanks,
Kanchana

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver
  2025-08-22 19:26     ` Sridhar, Kanchana P
@ 2025-08-25  5:38       ` Herbert Xu
  2025-08-25 18:12         ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Herbert Xu @ 2025-08-25  5:38 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Nhat Pham, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh

On Fri, Aug 22, 2025 at 07:26:34PM +0000, Sridhar, Kanchana P wrote:
>
> 1) The zswap per-CPU acomp_ctx has two sg_tables added, one each for
>    inputs/outputs, with nents set to the pool->compr_batch_size (1 for software
>    compressors). This per-CPU data incurs additional memory overhead per-CPU,
>    however this is memory that will anyway be allocated on the stack in
>    zswap_compress(); and less memory overhead than the latter because we know
>    exactly how many sg_table scatterlists to allocate for the given pool
>    (assuming we don't kmalloc in zswap_compress()). I will make sure to quantify
>    the overhead in v12's commit logs.

There is no need for any SG lists for the source.  The folio should
be submitted as the source.

So only the destination requires an SG list.
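
A minimal sketch of what a destination-only setup could look like, reusing
the per-CPU dst buffers and field names from earlier in this thread (this is
an assumption for illustration, not the actual patch; allocation of the
sg_table struct itself is elided):

	if (sg_alloc_table(acomp_ctx->sg_outputs, pool->compr_batch_size,
			   GFP_KERNEL))
		goto fail;

	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, pool->compr_batch_size, k)
		sg_set_buf(sg, acomp_ctx->buffers[k], PAGE_SIZE * 2);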

> 6) "For the source, nothing needs to be done because the folio could be passed
>    in as is.". As far as I know, this cannot be accomplished without
>    modifications to the crypto API for software compressors, because compressed
>    buffers need to be stored in the zswap/zram zs_pools at PAGE_SIZE
>    granularity.

Sure.  But all it needs is one central fallback path in the acompress
API.  I can do this for you.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver
  2025-08-25  5:38       ` Herbert Xu
@ 2025-08-25 18:12         ` Sridhar, Kanchana P
  2025-08-26  1:13           ` Herbert Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-25 18:12 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Nhat Pham, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Sunday, August 24, 2025 10:39 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; yosry.ahmed@linux.dev;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 00/24] zswap compression batching with optimized
> iaa_crypto driver
> 
> On Fri, Aug 22, 2025 at 07:26:34PM +0000, Sridhar, Kanchana P wrote:
> >
> > 1) The zswap per-CPU acomp_ctx has two sg_tables added, one each for
> >    inputs/outputs, with nents set to the pool->compr_batch_size (1 for
> software
> >    compressors). This per-CPU data incurs additional memory overhead per-
> CPU,
> >    however this is memory that will anyway be allocated on the stack in
> >    zswap_compress(); and less memory overhead than the latter because we
> know
> >    exactly how many sg_table scatterlists to allocate for the given pool
> >    (assuming we don't kmalloc in zswap_compress()). I will make sure to
> quantify
> >    the overhead in v12's commit logs.
> 
> There is no need for any SG lists for the source.  The folio should
> be submitted as the source.
> 
> So only the destination requires an SG list.
> 
> > 6) "For the source, nothing needs to be done because the folio could be
> passed
> >    in as is.". As far as I know, this cannot be accomplished without
> >    modifications to the crypto API for software compressors, because
> compressed
> >    buffers need to be stored in the zswap/zram zs_pools at PAGE_SIZE
> >    granularity.
> 
> Sure.  But all it needs is one central fallback path in the acompress
> API.  I can do this for you.

Thanks Herbert, for reviewing the approach. IIUC, we should follow
these constraints:

1) The folio should be submitted as the source.

2) For the destination, construct an SG list for them and pass that in.
    The rule should be that the SG list must contain a sufficient number
    of pages for the compression output based on the given unit size
    (PAGE_SIZE for zswap).

For PMD folios, there would be 512 compression outputs. In this case,
would we need to pass in an SG list that can contain 512 compression
outputs after calling the acompress API once?

If so, this might not be feasible for zswap since there are only "batch_size"
pre-allocated per-CPU output buffers, where "batch_size" is the max number
of pages that can be compressed in one call to the algorithm (1 for software
compressors). Hence, gathering all 512 compression outputs may not be
possible in a single invocation of crypto_acomp_compress().

Is the suggestion to allocate 512 per-CPU output buffers to overcome
this? This could be memory-wise very expensive. Please let me know if
I am missing something.

Thanks for offering to make the necessary changes to the acompress API.
Hoping we can sync on the approach!

Best regards,
Kanchana

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver
  2025-08-25 18:12         ` Sridhar, Kanchana P
@ 2025-08-26  1:13           ` Herbert Xu
  2025-08-26  4:09             ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Herbert Xu @ 2025-08-26  1:13 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Nhat Pham, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh

On Mon, Aug 25, 2025 at 06:12:19PM +0000, Sridhar, Kanchana P wrote:
>
> Thanks Herbert, for reviewing the approach. IIUC, we should follow
> these constraints:
> 
> 1) The folio should be submitted as the source.
> 
> 2) For the destination, construct an SG list for them and pass that in.
>     The rule should be that the SG list must contain a sufficient number
>     of pages for the compression output based on the given unit size
>     (PAGE_SIZE for zswap).
> 
> For PMD folios, there would be 512 compression outputs. In this case,
> would we need to pass in an SG list that can contain 512 compression
> outputs after calling the acompress API once?

Eventually yes :)

But for now we're just replicating your current patch-set, so
the folio should come with an offset and a length restriction,
and correspondingly the destination SG list should contain the
same number of pages as there are in your current patch-set.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-01  4:36 ` [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
  2025-08-14 20:58   ` Nhat Pham
@ 2025-08-26  3:48   ` Barry Song
  2025-08-26  4:27     ` Sridhar, Kanchana P
  1 sibling, 1 reply; 68+ messages in thread
From: Barry Song @ 2025-08-26  3:48 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, akpm,
	senozhatsky, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, vinicius.gomes, wajdi.k.feghali,
	vinodh.gopal

Hi Kanchana,


[...]
>
> +       /*
> +        * Set the unit of compress batching for large folios, for quick
> +        * retrieval in the zswap_compress() fast path:
> +        * If the compressor is sequential (@pool->compr_batch_size is 1),
> +        * large folios will be compressed in batches of ZSWAP_MAX_BATCH_SIZE
> +        * pages, where each page in the batch is compressed sequentially.
> +        * We see better performance by processing the folio in batches of
> +        * ZSWAP_MAX_BATCH_SIZE, due to cache locality of working set
> +        * structures.
> +        */
> +       pool->batch_size = (pool->compr_batch_size > 1) ?
> +                               pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
> +
>         zswap_pool_debug("created", pool);
>
>         return pool;
>

It’s hard to follow — you add batch_size and compr_batch_size in this
patch, but only use them in another. Could we merge the related changes
into one patch instead of splitting them into several that don’t work
independently?

> -
>         acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
>         if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
>                 pr_err("could not alloc crypto acomp %s : %ld\n",
> @@ -904,17 +929,36 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
>         if (IS_ERR_OR_NULL(acomp_ctx->req)) {
>                 pr_err("could not alloc crypto acomp_request %s\n",
> -                      pool->tfm_name);
> +                       pool->tfm_name);
>                 goto fail;
>         }
>
> -       crypto_init_wait(&acomp_ctx->wait);
> +       /*
> +        * Allocate up to ZSWAP_MAX_BATCH_SIZE dst buffers if the
> +        * compressor supports batching.
> +        */
> +       pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
> +                                    crypto_acomp_batch_size(acomp_ctx->acomp));
> +
> +       acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8 *),
> +                                         GFP_KERNEL, cpu_to_node(cpu));
> +       if (!acomp_ctx->buffers)
> +               goto fail;
> +
> +       for (i = 0; i < pool->compr_batch_size; ++i) {
> +               acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
> +                                                    cpu_to_node(cpu));
> +               if (!acomp_ctx->buffers[i])
> +                       goto fail;
> +       }

It’s hard to follow — memory is allocated here but only used in another
patch. Could we merge the related changes into a single patch instead of
splitting them into several that don’t work independently?

>
>         /*
>          * if the backend of acomp is async zip, crypto_req_done() will wakeup
>          * crypto_wait_req(); if the backend of acomp is scomp, the callback
>          * won't be called, crypto_wait_req() will return without blocking.
>          */
> +       crypto_init_wait(&acomp_ctx->wait);
> +
>         acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
>                                    crypto_req_done, &acomp_ctx->wait);
>
> @@ -922,7 +966,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         return 0;
>
>  fail:
> -       acomp_ctx_dealloc(acomp_ctx);
> +       acomp_ctx_dealloc(acomp_ctx, pool->compr_batch_size);
>         return ret;
>  }
>
> @@ -942,7 +986,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>
>         mutex_lock(&acomp_ctx->mutex);
>
> -       dst = acomp_ctx->buffer;
> +       dst = acomp_ctx->buffers[0];
>         sg_init_table(&input, 1);
>         sg_set_page(&input, page, PAGE_SIZE, 0);
>
> @@ -1003,19 +1047,19 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>
>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>         mutex_lock(&acomp_ctx->mutex);
> -       obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
> +       obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffers[0]);
>
>         /*
>          * zpool_obj_read_begin() might return a kmap address of highmem when
> -        * acomp_ctx->buffer is not used.  However, sg_init_one() does not
> -        * handle highmem addresses, so copy the object to acomp_ctx->buffer.
> +        * acomp_ctx->buffers[0] is not used.  However, sg_init_one() does not
> +        * handle highmem addresses, so copy the object to acomp_ctx->buffers[0].
>          */
>         if (virt_addr_valid(obj)) {
>                 src = obj;
>         } else {
> -               WARN_ON_ONCE(obj == acomp_ctx->buffer);
> -               memcpy(acomp_ctx->buffer, obj, entry->length);
> -               src = acomp_ctx->buffer;
> +               WARN_ON_ONCE(obj == acomp_ctx->buffers[0]);
> +               memcpy(acomp_ctx->buffers[0], obj, entry->length);
> +               src = acomp_ctx->buffers[0];

Hard to understand what is going on if related changes are not kept in
one self-contained patch.

Thanks
Barry


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver
  2025-08-26  1:13           ` Herbert Xu
@ 2025-08-26  4:09             ` Sridhar, Kanchana P
  2025-08-26  4:14               ` Herbert Xu
  0 siblings, 1 reply; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-26  4:09 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Nhat Pham, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Monday, August 25, 2025 6:13 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; yosry.ahmed@linux.dev;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 00/24] zswap compression batching with optimized
> iaa_crypto driver
> 
> On Mon, Aug 25, 2025 at 06:12:19PM +0000, Sridhar, Kanchana P wrote:
> >
> > Thanks Herbert, for reviewing the approach. IIUC, we should follow
> > these constraints:
> >
> > 1) The folio should be submitted as the source.
> >
> > 2) For the destination, construct an SG list for them and pass that in.
> >     The rule should be that the SG list must contain a sufficient number
> >     of pages for the compression output based on the given unit size
> >     (PAGE_SIZE for zswap).
> >
> > For PMD folios, there would be 512 compression outputs. In this case,
> > would we need to pass in an SG list that can contain 512 compression
> > outputs after calling the acompress API once?
> 
> Eventually yes :)
> 
> But for now we're just replicating your current patch-set, so
> the folio should come with an offset and a length restriction,
> and correspondingly the destination SG list should contain the
> same number of pages as there are in your current patch-set.

Thanks Herbert. Just want to make sure I understand this. Are you
referring to replacing sg_set_page() for the input with sg_set_folio()?
We have to pass in a scatterlist for the acomp_req->src.

This is how the converged zswap_compress() code would look for
batch compression of "nr_pages" in "folio", starting at index "start".
The input SG list will contain "nr_comps" pages: nr_comps is
1 for software and 8 for IAA.

The destination SG list will contain an equivalent number of
buffers (each is PAGE_SIZE * 2).

Based on your suggestions, I was able to come up with a unified
implementation for software and hardware compressors: the SG list
for the input is a key aspect of this (lines 24-25 from the start of the
procedure):

static bool zswap_compress(struct folio *folio, long start, unsigned int nr_pages,
                           struct zswap_entry *entries[], struct zswap_pool *pool,
                           int node_id)
{
        unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
        unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
        struct crypto_acomp_ctx *acomp_ctx;
        struct zpool *zpool = pool->zpool;
        struct scatterlist *sg;
        unsigned int i, j, k;
        gfp_t gfp;
        int err;

        gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;

        acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);

        mutex_lock(&acomp_ctx->mutex);

        prefetchw(acomp_ctx->sg_inputs->sgl);
        prefetchw(acomp_ctx->sg_outputs->sgl);

        /*                                                                                                                                
         * Note:                                                                                                                          
         * [i] refers to the incoming batch space and is used to                                                                          
         *     index into the folio pages and @entries.                                                                                   
         *                                                                                                                                
         * [k] refers to the @acomp_ctx space, as determined by                                                                           
         *     @pool->compr_batch_size, and is used to index into                                                                         
         *     @acomp_ctx->buffers and @dlens.                                                                                            
         */
        for (i = 0; i < nr_pages; i += nr_comps) {
                for_each_sg(acomp_ctx->sg_inputs->sgl, sg, nr_comps, k)
                        sg_set_folio(sg, folio, PAGE_SIZE, (start + k + i) * PAGE_SIZE);

                /*
                 * We need PAGE_SIZE * 2 here since there may be an over-compression
                 * case, and hardware accelerators may not check the dst buffer size,
                 * so give the dst buffer enough length to avoid buffer overflow.
                 */
                for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
                        sg_set_buf(sg, acomp_ctx->buffers[k], PAGE_SIZE * 2);

                acomp_request_set_params(acomp_ctx->req,
                                         acomp_ctx->sg_inputs->sgl,
                                         acomp_ctx->sg_outputs->sgl,
                                         nr_comps * PAGE_SIZE,
                                         nr_comps * PAGE_SIZE);

                err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
                                      &acomp_ctx->wait);

                if (unlikely(err)) {
                        if (nr_comps == 1)
                                dlens[0] = err;
                        goto compress_error;
                }

                if (nr_comps == 1)
                        dlens[0] = acomp_ctx->req->dlen;
                else
                        for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
                                dlens[k] = sg->length;

[ store each compressed page in zpool]

I quickly tested this with usemem (30 processes). Switching from sg_set_page()
to sg_set_folio() causes a 15% throughput regression for IAA and a 2%
regression for zstd:

usemem30/64K folios/deflate-iaa/Avg throughput (KB/s):
sg_set_page(): 357,141
sg_set_folio(): 304,696

usemem30/64K folios/zstd/Avg throughput (KB/s):
sg_set_page(): 230,760
sg_set_folio(): 226,246

In my experience, zswap_compress() is highly performance-critical code, and
even small compute additions can significantly impact workload performance
and sys time.

Given the code simplification and unification that your SG list suggestions
have enabled, could you help me understand why sg_set_folio() is preferred?
Again, my apologies if I have misunderstood your suggestion, but I think
it is worth getting this clarified so we are all in agreement.

Thanks and best regards,
Kanchana


> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver
  2025-08-26  4:09             ` Sridhar, Kanchana P
@ 2025-08-26  4:14               ` Herbert Xu
  2025-08-26  4:42                 ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Herbert Xu @ 2025-08-26  4:14 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Nhat Pham, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh

On Tue, Aug 26, 2025 at 04:09:45AM +0000, Sridhar, Kanchana P wrote:
> 
> Thanks Herbert. Just want to make sure I understand this. Are you
> referring to replacing sg_set_page() for the input with sg_set_folio()?
> We have to pass in a scatterlist for the acomp_req->src..

I'm talking about acomp_request_set_src_folio.  You can pass just
a portion of a folio by specifying an offset and a length.

>         for (i = 0; i < nr_pages; i += nr_comps) {
>                 for_each_sg(acomp_ctx->sg_inputs->sgl, sg, nr_comps, k)
>                         sg_set_folio(sg, folio, PAGE_SIZE, (start + k + i) * PAGE_SIZE);
> 
>                 /*                                                                                                                        
>                  * We need PAGE_SIZE * 2 here since there maybe over-compression case,                                                    
>                  * and hardware-accelerators may won't check the dst buffer size, so                                                      
>                  * giving the dst buffer with enough length to avoid buffer overflow.                                                     
>                  */
>                 for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
>                         sg_set_buf(sg, acomp_ctx->buffers[k], PAGE_SIZE * 2);
>
>                 acomp_request_set_params(acomp_ctx->req,
>                                          acomp_ctx->sg_inputs->sgl,
>                                          acomp_ctx->sg_outputs->sgl,
>                                          nr_comps * PAGE_SIZE,
>                                          nr_comps * PAGE_SIZE);

I meant something more like:

		acomp_request_set_src_folio(req, folio, start_offset,
					    nr_comps * PAGE_SIZE);
		acomp_request_set_dst_sg(req, acomp_ctx_sg_outputs->sgl,
					 nr_comps * PAGE_SIZE);
		acomp_request_set_unit_size(req, PAGE_SIZE);
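
(For the zswap_compress() loop quoted above, start_offset would presumably be
derived from the folio indices already in use, e.g.

		start_offset = (start + i) * PAGE_SIZE;

so that each pass covers the next nr_comps pages of the folio. This is an
assumption about the caller side, not part of the suggested API.)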

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-26  3:48   ` Barry Song
@ 2025-08-26  4:27     ` Sridhar, Kanchana P
  2025-08-26  4:42       ` Barry Song
  0 siblings, 1 reply; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-26  4:27 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P



> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Monday, August 25, 2025 8:48 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> if the compressor supports batching.
> 
> Hi Kanchana,
> 
> 
> [...]
> >
> > +       /*
> > +        * Set the unit of compress batching for large folios, for quick
> > +        * retrieval in the zswap_compress() fast path:
> > +        * If the compressor is sequential (@pool->compr_batch_size is 1),
> > +        * large folios will be compressed in batches of
> ZSWAP_MAX_BATCH_SIZE
> > +        * pages, where each page in the batch is compressed sequentially.
> > +        * We see better performance by processing the folio in batches of
> > +        * ZSWAP_MAX_BATCH_SIZE, due to cache locality of working set
> > +        * structures.
> > +        */
> > +       pool->batch_size = (pool->compr_batch_size > 1) ?
> > +                               pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
> > +
> >         zswap_pool_debug("created", pool);
> >
> >         return pool;
> >
> 
> It’s hard to follow — you add batch_size and compr_batch_size in this
> patch, but only use them in another. Could we merge the related changes
> into one patch instead of splitting them into several that don’t work
> independently?

Hi Barry,

Thanks for reviewing the code and for your comments! Sure, I can merge
this patch with the next one. I was trying to keep the changes modularized
by a) zswap_cpu_comp_prepare(), b) zswap_store() and c) zswap_compress(),
so that the changes are broken into smaller parts, but I can see how this
can make the changes appear disjointed.

One thing though: the commit logs for each of the patches will
also probably need to be merged, since I have tried to explain the
changes in detail.

Thanks,
Kanchana




> 
> > -
> >         acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
> cpu_to_node(cpu));
> >         if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
> >                 pr_err("could not alloc crypto acomp %s : %ld\n",
> > @@ -904,17 +929,36 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >         acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> >         if (IS_ERR_OR_NULL(acomp_ctx->req)) {
> >                 pr_err("could not alloc crypto acomp_request %s\n",
> > -                      pool->tfm_name);
> > +                       pool->tfm_name);
> >                 goto fail;
> >         }
> >
> > -       crypto_init_wait(&acomp_ctx->wait);
> > +       /*
> > +        * Allocate up to ZSWAP_MAX_BATCH_SIZE dst buffers if the
> > +        * compressor supports batching.
> > +        */
> > +       pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
> > +                                    crypto_acomp_batch_size(acomp_ctx->acomp));
> > +
> > +       acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8
> *),
> > +                                         GFP_KERNEL, cpu_to_node(cpu));
> > +       if (!acomp_ctx->buffers)
> > +               goto fail;
> > +
> > +       for (i = 0; i < pool->compr_batch_size; ++i) {
> > +               acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
> GFP_KERNEL,
> > +                                                    cpu_to_node(cpu));
> > +               if (!acomp_ctx->buffers[i])
> > +                       goto fail;
> > +       }
> 
> It’s hard to follow — memory is allocated here but only used in another
> patch. Could we merge the related changes into a single patch instead of
> splitting them into several that don’t work independently?
> 
> >
> >         /*
> >          * if the backend of acomp is async zip, crypto_req_done() will wakeup
> >          * crypto_wait_req(); if the backend of acomp is scomp, the callback
> >          * won't be called, crypto_wait_req() will return without blocking.
> >          */
> > +       crypto_init_wait(&acomp_ctx->wait);
> > +
> >         acomp_request_set_callback(acomp_ctx->req,
> CRYPTO_TFM_REQ_MAY_BACKLOG,
> >                                    crypto_req_done, &acomp_ctx->wait);
> >
> > @@ -922,7 +966,7 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >         return 0;
> >
> >  fail:
> > -       acomp_ctx_dealloc(acomp_ctx);
> > +       acomp_ctx_dealloc(acomp_ctx, pool->compr_batch_size);
> >         return ret;
> >  }
> >
> > @@ -942,7 +986,7 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry,
> >
> >         mutex_lock(&acomp_ctx->mutex);
> >
> > -       dst = acomp_ctx->buffer;
> > +       dst = acomp_ctx->buffers[0];
> >         sg_init_table(&input, 1);
> >         sg_set_page(&input, page, PAGE_SIZE, 0);
> >
> > @@ -1003,19 +1047,19 @@ static bool zswap_decompress(struct
> zswap_entry *entry, struct folio *folio)
> >
> >         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >         mutex_lock(&acomp_ctx->mutex);
> > -       obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
> > +       obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx-
> >buffers[0]);
> >
> >         /*
> >          * zpool_obj_read_begin() might return a kmap address of highmem
> when
> > -        * acomp_ctx->buffer is not used.  However, sg_init_one() does not
> > -        * handle highmem addresses, so copy the object to acomp_ctx-
> >buffer.
> > +        * acomp_ctx->buffers[0] is not used.  However, sg_init_one() does not
> > +        * handle highmem addresses, so copy the object to acomp_ctx-
> >buffers[0].
> >          */
> >         if (virt_addr_valid(obj)) {
> >                 src = obj;
> >         } else {
> > -               WARN_ON_ONCE(obj == acomp_ctx->buffer);
> > -               memcpy(acomp_ctx->buffer, obj, entry->length);
> > -               src = acomp_ctx->buffer;
> > +               WARN_ON_ONCE(obj == acomp_ctx->buffers[0]);
> > +               memcpy(acomp_ctx->buffers[0], obj, entry->length);
> > +               src = acomp_ctx->buffers[0];
> 
> Hard to understand what is going on if related changes are not kept in
> one self-contained patch.
> 
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver
  2025-08-26  4:14               ` Herbert Xu
@ 2025-08-26  4:42                 ` Sridhar, Kanchana P
  0 siblings, 0 replies; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-26  4:42 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Nhat Pham, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, linux-crypto@vger.kernel.org,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Monday, August 25, 2025 9:15 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; yosry.ahmed@linux.dev;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 00/24] zswap compression batching with optimized
> iaa_crypto driver
> 
> On Tue, Aug 26, 2025 at 04:09:45AM +0000, Sridhar, Kanchana P wrote:
> >
> > Thanks Herbert. Just want to make sure I understand this. Are you
> > referring to replacing sg_set_page() for the input with sg_set_folio()?
> > We have to pass in a scatterlist for the acomp_req->src..
> 
> I'm talking about acomp_request_set_src_folio.  You can pass just
> a portion of a folio by specifying an offset and a length.
> 
> >         for (i = 0; i < nr_pages; i += nr_comps) {
> >                 for_each_sg(acomp_ctx->sg_inputs->sgl, sg, nr_comps, k)
> >                         sg_set_folio(sg, folio, PAGE_SIZE, (start + k + i) * PAGE_SIZE);
> >
> >                 /*
> >                  * We need PAGE_SIZE * 2 here since there maybe over-
> compression case,
> >                  * and hardware-accelerators may won't check the dst buffer size,
> so
> >                  * giving the dst buffer with enough length to avoid buffer overflow.
> >                  */
> >                 for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k)
> >                         sg_set_buf(sg, acomp_ctx->buffers[k], PAGE_SIZE * 2);
> >
> >                 acomp_request_set_params(acomp_ctx->req,
> >                                          acomp_ctx->sg_inputs->sgl,
> >                                          acomp_ctx->sg_outputs->sgl,
> >                                          nr_comps * PAGE_SIZE,
> >                                          nr_comps * PAGE_SIZE);
> 
> I meant something more like:
> 
> 		acomp_request_set_src_folio(req, folio, start_offset,
> 					    nr_comps * PAGE_SIZE);
> 		acomp_request_set_dst_sg(req, acomp_ctx_sg_outputs->sgl,
> 					 nr_comps * PAGE_SIZE);
> 		acomp_request_set_unit_size(req, PAGE_SIZE);

Ok, I get it now :) Thanks. I will try this out, and pending any issues
that may arise from testing, I might be all set for putting together v12.

Thanks again Herbert, I appreciate it.

Best regards,
Kanchana

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-26  4:27     ` Sridhar, Kanchana P
@ 2025-08-26  4:42       ` Barry Song
  2025-08-26  4:56         ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Barry Song @ 2025-08-26  4:42 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh

On Tue, Aug 26, 2025 at 12:27 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Barry Song <21cnbao@gmail.com>
> > Sent: Monday, August 25, 2025 8:48 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> > foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> > herbert@gondor.apana.org.au; davem@davemloft.net;
> > clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> > surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> > Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> > if the compressor supports batching.
> >
> > Hi Kanchana,
> >
> >
> > [...]
> > >
> > > +       /*
> > > +        * Set the unit of compress batching for large folios, for quick
> > > +        * retrieval in the zswap_compress() fast path:
> > > +        * If the compressor is sequential (@pool->compr_batch_size is 1),
> > > +        * large folios will be compressed in batches of
> > ZSWAP_MAX_BATCH_SIZE
> > > +        * pages, where each page in the batch is compressed sequentially.
> > > +        * We see better performance by processing the folio in batches of
> > > +        * ZSWAP_MAX_BATCH_SIZE, due to cache locality of working set
> > > +        * structures.
> > > +        */
> > > +       pool->batch_size = (pool->compr_batch_size > 1) ?
> > > +                               pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
> > > +
> > >         zswap_pool_debug("created", pool);
> > >
> > >         return pool;
> > >
> >
> > It’s hard to follow — you add batch_size and compr_batch_size in this
> > patch, but only use them in another. Could we merge the related changes
> > into one patch instead of splitting them into several that don’t work
> > independently?
>
> Hi Barry,
>
> Thanks for reviewing the code and for your comments! Sure, I can merge
> this patch with the next one. I was trying to keep the changes modularized
> to a) zswap_cpu_comp_prepare(), b) zswap_store() and c) zswap_compress()
> so the changes are broken into smaller parts, but I can see how this can
> make the changes appear disjointed.
>
> One thing though: the commit logs for each of the patches will
> also probably need to be merged, since I have tried to explain the
> changes in detail.

It’s fine to merge the changelog and present the story as a whole. Do we
really need both pool->batch_size and pool->compr_batch_size? I assume
pool->batch_size = pool->compr_batch_size if HW supports batch; otherwise
pool->compr_batch_size = 1. It seems pool->compr_batch_size should either
be a bool or be dropped. If we drop it, you can still check whether HW
supports batch via crypto_acomp_batch_size() when doing compression:

if (crypto_acomp_batch_size() > 1)
    compress in steps of batch_size;
else
    compress in steps of PAGE_SIZE;

no ?

Thanks
Barry


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-26  4:42       ` Barry Song
@ 2025-08-26  4:56         ` Sridhar, Kanchana P
  2025-08-26  5:17           ` Barry Song
  0 siblings, 1 reply; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-26  4:56 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh


> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Monday, August 25, 2025 9:42 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> if the compressor supports batching.
> 
> On Tue, Aug 26, 2025 at 12:27 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Barry Song <21cnbao@gmail.com>
> > > Sent: Monday, August 25, 2025 8:48 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> > > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > > ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> > > foundation.org; senozhatsky@chromium.org; linux-
> crypto@vger.kernel.org;
> > > herbert@gondor.apana.org.au; davem@davemloft.net;
> > > clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> > > surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> > > Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching
> resources
> > > if the compressor supports batching.
> > >
> > > Hi Kanchana,
> > >
> > >
> > > [...]
> > > >
> > > > +       /*
> > > > +        * Set the unit of compress batching for large folios, for quick
> > > > +        * retrieval in the zswap_compress() fast path:
> > > > +        * If the compressor is sequential (@pool->compr_batch_size is 1),
> > > > +        * large folios will be compressed in batches of
> > > ZSWAP_MAX_BATCH_SIZE
> > > > +        * pages, where each page in the batch is compressed sequentially.
> > > > +        * We see better performance by processing the folio in batches of
> > > > +        * ZSWAP_MAX_BATCH_SIZE, due to cache locality of working set
> > > > +        * structures.
> > > > +        */
> > > > +       pool->batch_size = (pool->compr_batch_size > 1) ?
> > > > +                               pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
> > > > +
> > > >         zswap_pool_debug("created", pool);
> > > >
> > > >         return pool;
> > > >
> > >
> > > It’s hard to follow — you add batch_size and compr_batch_size in this
> > > patch, but only use them in another. Could we merge the related changes
> > > into one patch instead of splitting them into several that don’t work
> > > independently?
> >
> > Hi Barry,
> >
> > Thanks for reviewing the code and for your comments! Sure, I can merge
> > this patch with the next one. I was trying to keep the changes modularized
> > to a) zswap_cpu_comp_prepare(), b) zswap_store() and c)
> zswap_compress()
> > so the changes are broken into smaller parts, but I can see how this can
> > make the changes appear disjointed.
> >
> > One thing though: the commit logs for each of the patches will
> > also probably need to be merged, since I have tried to explain the
> > changes in detail.
> 
> It’s fine to merge the changelog and present the story as a whole. Do we

Sure.

> really need both pool->batch_size and pool->compr_batch_size? I assume
> pool->batch_size = pool->compr_batch_size if HW supports batch; otherwise
> pool->compr_batch_size = 1.

Actually not exactly. We have found value in compressing in batches of
ZSWAP_MAX_BATCH_SIZE even for software compressors. Latency benefits
from cache locality of working-set data. Hence the approach that we have
settled on is pool->batch_size = ZSWAP_MAX_BATCH_SIZE if
the compressor does not support batching (i.e., if pool->compr_batch_size is 1).
If it does, then pool->batch_size = pool->compr_batch_size.

Besides this, pool->compr_batch_size gives the number of per-CPU acomp_ctx
resources available to zswap_compress(), which is distinct from the batch size.
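
Roughly, these are the two struct zswap_pool members in question (a
simplified view, not the literal patch):

	struct zswap_pool {
		...
		/* Compression step size reported by the algorithm (1 if none). */
		u8 compr_batch_size;
		/* Unit in which zswap_store() carves up a large folio. */
		u8 batch_size;
		...
	};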

> It seems pool->compr_batch_size should either
> be a bool or be dropped. If we drop it, you can still check whether HW
> supports batch via crypto_acomp_batch_size() when doing compression:
> 
> if (crypto_acomp_batch_size() > 1)
>     compress in steps of batch_size;
> else
>     compress in steps of PAGE_SIZE;
> 
> no ?

I could do this, but it will impact latency. As I was mentioning in an earlier
response to Nhat, keeping compr_batch_size and batch_size distinct adds only
a small memory overhead per zswap_pool (one more u8 data member), given that
there are very few zswap pools. For this trade-off, we are able to minimize
computation in zswap_compress().

Thanks,
Kanchana

> 
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-26  4:56         ` Sridhar, Kanchana P
@ 2025-08-26  5:17           ` Barry Song
  2025-08-27  0:06             ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Barry Song @ 2025-08-26  5:17 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh

> > > > [...]
> > > > >
> > > > > +       /*
> > > > > +        * Set the unit of compress batching for large folios, for quick
> > > > > +        * retrieval in the zswap_compress() fast path:
> > > > > +        * If the compressor is sequential (@pool->compr_batch_size is 1),
> > > > > +        * large folios will be compressed in batches of
> > > > ZSWAP_MAX_BATCH_SIZE
> > > > > +        * pages, where each page in the batch is compressed sequentially.
> > > > > +        * We see better performance by processing the folio in batches of
> > > > > +        * ZSWAP_MAX_BATCH_SIZE, due to cache locality of working set
> > > > > +        * structures.
> > > > > +        */
> > > > > +       pool->batch_size = (pool->compr_batch_size > 1) ?
> > > > > +                               pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
> > > > > +
> > > > >         zswap_pool_debug("created", pool);
> > > > >
> > > > >         return pool;
> > > > >
> > > >
> > > > It’s hard to follow — you add batch_size and compr_batch_size in this
> > > > patch, but only use them in another. Could we merge the related changes
> > > > into one patch instead of splitting them into several that don’t work
> > > > independently?
> > >
> > > Hi Barry,
> > >
> > > Thanks for reviewing the code and for your comments! Sure, I can merge
> > > this patch with the next one. I was trying to keep the changes modularized
> > > to a) zswap_cpu_comp_prepare(), b) zswap_store() and c)
> > zswap_compress()
> > > so the changes are broken into smaller parts, but I can see how this can
> > > make the changes appear disjointed.
> > >
> > > One thing though: the commit logs for each of the patches will
> > > also probably need to be merged, since I have tried to explain the
> > > changes in detail.
> >
> > It’s fine to merge the changelog and present the story as a whole. Do we
>
> Sure.
>
> > really need both pool->batch_size and pool->compr_batch_size? I assume
> > pool->batch_size = pool->compr_batch_size if HW supports batch; otherwise
> > pool->compr_batch_size = 1.
>
> Actually not exactly. We have found value in compressing in batches of
> ZSWAP_MAX_BATCH_SIZE even for software compressors. Latency benefits
> from cache locality of working-set data. Hence the approach that we have
> settled on is pool->batch_size = ZSWAP_MAX_BATCH_SIZE if
> the compressor does not support batching (i.e., if pool->compr_batch_size is 1).
> If it does, then pool->batch_size = pool->compr_batch_size.

I understand that even without a hardware batch, you can still have some
software batching for everything other than the compression step itself.

However, based on the code below, it looks like
pool->compr_batch_size is almost always either equal to
pool->batch_size or 1:

+       pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
+                                    crypto_acomp_batch_size(acomp_ctx->acomp));

+       pool->batch_size = (pool->compr_batch_size > 1) ?
+                               pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;


It seems one of these two variables may be redundant.
For instance, no matter if pool->compr_batch_size > 1, could we always treat
batch_size as ZSWAP_MAX_BATCH_SIZE?  if we remove
pool->batch_size, could we just use pool->compr_batch_size as the
step size for compression no matter if pool->compr_batch_size > 1.

for example:
       pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
                                    crypto_acomp_batch_size(acomp_ctx->acomp));

Then batch the zswap store using ZSWAP_MAX_BATCH_SIZE, but set the
compression step to pool->compr_batch_size.

Thanks
Barry


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-26  5:17           ` Barry Song
@ 2025-08-27  0:06             ` Sridhar, Kanchana P
  2025-08-28 21:39               ` Barry Song
  0 siblings, 1 reply; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-27  0:06 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Monday, August 25, 2025 10:17 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> if the compressor supports batching.
> 
> > > > > [...]
> > > > > >
> > > > > > +       /*
> > > > > > +        * Set the unit of compress batching for large folios, for quick
> > > > > > +        * retrieval in the zswap_compress() fast path:
> > > > > > +        * If the compressor is sequential (@pool->compr_batch_size is
> 1),
> > > > > > +        * large folios will be compressed in batches of
> > > > > ZSWAP_MAX_BATCH_SIZE
> > > > > > +        * pages, where each page in the batch is compressed
> sequentially.
> > > > > > +        * We see better performance by processing the folio in batches
> of
> > > > > > +        * ZSWAP_MAX_BATCH_SIZE, due to cache locality of working
> set
> > > > > > +        * structures.
> > > > > > +        */
> > > > > > +       pool->batch_size = (pool->compr_batch_size > 1) ?
> > > > > > +                               pool->compr_batch_size :
> ZSWAP_MAX_BATCH_SIZE;
> > > > > > +
> > > > > >         zswap_pool_debug("created", pool);
> > > > > >
> > > > > >         return pool;
> > > > > >
> > > > >
> > > > > It’s hard to follow — you add batch_size and compr_batch_size in this
> > > > > patch, but only use them in another. Could we merge the related
> changes
> > > > > into one patch instead of splitting them into several that don’t work
> > > > > independently?
> > > >
> > > > Hi Barry,
> > > >
> > > > Thanks for reviewing the code and for your comments! Sure, I can merge
> > > > this patch with the next one. I was trying to keep the changes
> modularized
> > > > to a) zswap_cpu_comp_prepare(), b) zswap_store() and c)
> > > zswap_compress()
> > > > so the changes are broken into smaller parts, but I can see how this can
> > > > make the changes appear disjointed.
> > > >
> > > > One thing though: the commit logs for each of the patches will
> > > > also probably need to be merged, since I have tried to explain the
> > > > changes in detail.
> > >
> > > It’s fine to merge the changelog and present the story as a whole. Do we
> >
> > Sure.
> >
> > > really need both pool->batch_size and pool->compr_batch_size? I assume
> > > pool->batch_size = pool->compr_batch_size if HW supports batch;
> otherwise
> > > pool->compr_batch_size = 1.
> >
> > Actually not exactly. We have found value in compressing in batches of
> > ZSWAP_MAX_BATCH_SIZE even for software compressors. Latency benefits
> > from cache locality of working-set data. Hence the approach that we have
> > settled on is pool->batch_size = ZSWAP_MAX_BATCH_SIZE if
> > the compressor does not support batching (i.e., if pool->compr_batch_size is
> 1).
> > If it does, then pool->batch_size = pool->compr_batch_size.
> 
> I understand that even without a hardware batch, you can still
> have some software batching that excludes compression.
> 
> However, based on the code below, it looks like
> pool->compr_batch_size is almost always either equal to
> pool->batch_size or 1:
> 
> +       pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
> +                                    crypto_acomp_batch_size(acomp_ctx->acomp));

I would like to explain some of the considerations in coming up with this
approach:

1) The compression algorithm gets to decide an optimal batch-size.
    For a hardware accelerator such as IAA, this value could be different
    than ZSWAP_MAX_BATCH_SIZE.

2) ZSWAP_MAX_BATCH_SIZE acts as a limiting factor to the # of acomp_ctx
    per-CPU resources that will be allocated in zswap_cpu_comp_prepare();
    as per Yosry's suggestion. This helps limit the memory overhead for
    batching algorithms.

3) If a batching algorithm works with a batch size "X", where
     1 < X < ZSWAP_MAX_BATCH_SIZE, two things need to happen:
     a) We want to allocate only "X" per-CPU resources.
     b) We want to process the folio in batches of "X", not ZSWAP_MAX_BATCH_SIZE,
          to get the benefits of hardware parallelism. This is the compression
          step size you also mention.
          In particular, we cannot treat batch_size as ZSWAP_MAX_BATCH_SIZE
          and send a batch of ZSWAP_MAX_BATCH_SIZE pages to zswap_compress()
          in this case. For example, what if the compress step-size is 6, but the
          new zswap_store_pages() introduced in patch 23 sends 8 pages to
          zswap_compress() because ZSWAP_MAX_BATCH_SIZE is set to 8?
          The code in zswap_compress() could get quite messy, which would
          impact latency.
        

> 
> +       pool->batch_size = (pool->compr_batch_size > 1) ?
> +                               pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
> 
> 
> It seems one of these two variables may be redundant.
> For instance, no matter if pool->compr_batch_size > 1, could we always treat
> batch_size as ZSWAP_MAX_BATCH_SIZE?  if we remove
> pool->batch_size, could we just use pool->compr_batch_size as the
> step size for compression no matter if pool->compr_batch_size > 1.

To further explain the rationale for keeping these two distinct, we
statically compute the compress step-size by querying the algorithm
in zswap_cpu_comp_prepare() after the acomp_ctx->acomp has been
created. We store it in pool->compr_batch_size.

Next, in zswap_pool_create(), we do a one-time computation to determine
whether pool->batch_size should align with a batching acomp's
compr_batch_size (i.e., when it is greater than 1), or, for a non-batching
compressor, be set to ZSWAP_MAX_BATCH_SIZE.

This enables further code simplification/unification in zswap_compress(),
and quick retrieval of the number of available acomp_ctx batching resources
when setting up the SG lists for each call to crypto_acomp_compress(). IOW,
memoization for latency gains and unified code paths.
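
Condensed from patches 22-24 (so not the literal code), the one-time setup
and the two consumers look like this:

	/* zswap_cpu_comp_prepare(): step size the algorithm can batch. */
	pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
				     crypto_acomp_batch_size(acomp_ctx->acomp));

	/* zswap_pool_create(): unit in which zswap_store() carves folios. */
	pool->batch_size = (pool->compr_batch_size > 1) ?
				pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;

	/* zswap_compress(): requests are issued compr_batch_size pages at a time. */
	nr_comps = min(nr_pages, pool->compr_batch_size);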

I hope this clarifies things further. Thanks again, these are all good questions.

Best regards,
Kanchana

> 
> for example:
>        pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
>                                     crypto_acomp_batch_size(acomp_ctx->acomp));
> 
> Then batch the zswap store using ZSWAP_MAX_BATCH_SIZE, but set the
> compression step to pool->compr_batch_size.
> 
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-27  0:06             ` Sridhar, Kanchana P
@ 2025-08-28 21:39               ` Barry Song
  2025-08-28 22:47                 ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Barry Song @ 2025-08-28 21:39 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh

On Wed, Aug 27, 2025 at 12:06 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> > -----Original Message-----
> > From: Barry Song <21cnbao@gmail.com>
> > Sent: Monday, August 25, 2025 10:17 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> > foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> > herbert@gondor.apana.org.au; davem@davemloft.net;
> > clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> > surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> > Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> > if the compressor supports batching.
> >
> > > > > > [...]
> > > > > > >
> > > > > > > +       /*
> > > > > > > +        * Set the unit of compress batching for large folios, for quick
> > > > > > > +        * retrieval in the zswap_compress() fast path:
> > > > > > > +        * If the compressor is sequential (@pool->compr_batch_size is
> > 1),
> > > > > > > +        * large folios will be compressed in batches of
> > > > > > ZSWAP_MAX_BATCH_SIZE
> > > > > > > +        * pages, where each page in the batch is compressed
> > sequentially.
> > > > > > > +        * We see better performance by processing the folio in batches
> > of
> > > > > > > +        * ZSWAP_MAX_BATCH_SIZE, due to cache locality of working
> > set
> > > > > > > +        * structures.
> > > > > > > +        */
> > > > > > > +       pool->batch_size = (pool->compr_batch_size > 1) ?
> > > > > > > +                               pool->compr_batch_size :
> > ZSWAP_MAX_BATCH_SIZE;
> > > > > > > +
> > > > > > >         zswap_pool_debug("created", pool);
> > > > > > >
> > > > > > >         return pool;
> > > > > > >
> > > > > >
> > > > > > It’s hard to follow — you add batch_size and compr_batch_size in this
> > > > > > patch, but only use them in another. Could we merge the related
> > changes
> > > > > > into one patch instead of splitting them into several that don’t work
> > > > > > independently?
> > > > >
> > > > > Hi Barry,
> > > > >
> > > > > Thanks for reviewing the code and for your comments! Sure, I can merge
> > > > > this patch with the next one. I was trying to keep the changes
> > modularized
> > > > > to a) zswap_cpu_comp_prepare(), b) zswap_store() and c)
> > > > zswap_compress()
> > > > > so the changes are broken into smaller parts, but I can see how this can
> > > > > make the changes appear disjointed.
> > > > >
> > > > > One thing though: the commit logs for each of the patches will
> > > > > also probably need to be merged, since I have tried to explain the
> > > > > changes in detail.
> > > >
> > > > It’s fine to merge the changelog and present the story as a whole. Do we
> > >
> > > Sure.
> > >
> > > > really need both pool->batch_size and pool->compr_batch_size? I assume
> > > > pool->batch_size = pool->compr_batch_size if HW supports batch;
> > otherwise
> > > > pool->compr_batch_size = 1.
> > >
> > > Actually not exactly. We have found value in compressing in batches of
> > > ZSWAP_MAX_BATCH_SIZE even for software compressors. Latency benefits
> > > from cache locality of working-set data. Hence the approach that we have
> > > settled on is pool->batch_size = ZSWAP_MAX_BATCH_SIZE if
> > > the compressor does not support batching (i.e., if pool->compr_batch_size is
> > 1).
> > > If it does, then pool->batch_size = pool->compr_batch_size.
> >
> > I understand that even without a hardware batch, you can still
> > have some software batching that excludes compression.
> >
> > However, based on the code below, it looks like
> > pool->compr_batch_size is almost always either equal to
> > pool->batch_size or 1:
> >
> > +       pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
> > +                                    crypto_acomp_batch_size(acomp_ctx->acomp));
>
> I would like to explain some of the considerations in coming up with this
> approach:
>
> 1) The compression algorithm gets to decide an optimal batch-size.
>     For a hardware accelerator such as IAA, this value could be different
>     than ZSWAP_MAX_BATCH_SIZE.
>
> 2) ZSWAP_MAX_BATCH_SIZE acts as a limiting factor to the # of acomp_ctx
>     per-CPU resources that will be allocated in zswap_cpu_comp_prepare();
>     as per Yosry's suggestion. This helps limit the memory overhead for
>     batching algorithms.
>
> 3) If a batching algorithm works with a batch size "X" , where
>      1 < X < ZSWAP_MAX_BATCH_SIZE, two things need to happen:
>      a) We want to only allocate "X" per-CPU resources.
>      b) We want to process the folio in batches of "X", not ZSWAP_MAX_BATCH_SIZE
>           to avail of the benefits of hardware parallelism. This is the compression
>           step size you also mention.
>           In particular, we cannot treat batch_size as ZSWAP_MAX_BATCH_SIZE,
>           and send a batch of ZSWAP_MAX_BATCH_SIZE pages to zswap_compress()
>           in this case. For e.g., what if the compress step-size is 6, but the new
>           zswap_store_pages() introduced in patch 23 sends 8 pages to
>           zswap_compress() because ZSWAP_MAX_BATCH_SIZE is set to 8?
>           The code in zswap_compress() could get quite messy, which will impact
>           latency.

If ZSWAP_MAX_BATCH_SIZE is set to 8 and there is no hardware batching,
compression is done with a step size of 1. If the hardware step size is 4,
compression occurs in two steps. If the hardware step size is 6, the first
compression uses a step size of 6, and the second uses a step size of 2.
Do you think this will work?

I don’t quite understand why you want to save
ZSWAP_MAX_BATCH_SIZE - X resources, since even without hardware batching
you are still allocating all ZSWAP_MAX_BATCH_SIZE resources. This is the
case for all platforms except yours.

Thanks
Barry


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-28 21:39               ` Barry Song
@ 2025-08-28 22:47                 ` Sridhar, Kanchana P
  2025-08-28 23:28                   ` Barry Song
  0 siblings, 1 reply; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-28 22:47 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Thursday, August 28, 2025 2:40 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> if the compressor supports batching.
> 
> On Wed, Aug 27, 2025 at 12:06 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Barry Song <21cnbao@gmail.com>
> > > Sent: Monday, August 25, 2025 10:17 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> > > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > > ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> > > foundation.org; senozhatsky@chromium.org; linux-
> crypto@vger.kernel.org;
> > > herbert@gondor.apana.org.au; davem@davemloft.net;
> > > clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> > > surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> > > Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching
> resources
> > > if the compressor supports batching.
> > >
> > > > > > > [...]
> > > > > > > >
> > > > > > > > +       /*
> > > > > > > > +        * Set the unit of compress batching for large folios, for quick
> > > > > > > > +        * retrieval in the zswap_compress() fast path:
> > > > > > > > +        * If the compressor is sequential (@pool-
> >compr_batch_size is
> > > 1),
> > > > > > > > +        * large folios will be compressed in batches of
> > > > > > > ZSWAP_MAX_BATCH_SIZE
> > > > > > > > +        * pages, where each page in the batch is compressed
> > > sequentially.
> > > > > > > > +        * We see better performance by processing the folio in
> batches
> > > of
> > > > > > > > +        * ZSWAP_MAX_BATCH_SIZE, due to cache locality of
> working
> > > set
> > > > > > > > +        * structures.
> > > > > > > > +        */
> > > > > > > > +       pool->batch_size = (pool->compr_batch_size > 1) ?
> > > > > > > > +                               pool->compr_batch_size :
> > > ZSWAP_MAX_BATCH_SIZE;
> > > > > > > > +
> > > > > > > >         zswap_pool_debug("created", pool);
> > > > > > > >
> > > > > > > >         return pool;
> > > > > > > >
> > > > > > >
> > > > > > > It’s hard to follow — you add batch_size and compr_batch_size in
> this
> > > > > > > patch, but only use them in another. Could we merge the related
> > > changes
> > > > > > > into one patch instead of splitting them into several that don’t work
> > > > > > > independently?
> > > > > >
> > > > > > Hi Barry,
> > > > > >
> > > > > > Thanks for reviewing the code and for your comments! Sure, I can
> merge
> > > > > > this patch with the next one. I was trying to keep the changes
> > > modularized
> > > > > > to a) zswap_cpu_comp_prepare(), b) zswap_store() and c)
> > > > > zswap_compress()
> > > > > > so the changes are broken into smaller parts, but I can see how this
> can
> > > > > > make the changes appear disjointed.
> > > > > >
> > > > > > One thing though: the commit logs for each of the patches will
> > > > > > also probably need to be merged, since I have tried to explain the
> > > > > > changes in detail.
> > > > >
> > > > > It’s fine to merge the changelog and present the story as a whole. Do
> we
> > > >
> > > > Sure.
> > > >
> > > > > really need both pool->batch_size and pool->compr_batch_size? I
> assume
> > > > > pool->batch_size = pool->compr_batch_size if HW supports batch;
> > > otherwise
> > > > > pool->compr_batch_size = 1.
> > > >
> > > > Actually not exactly. We have found value in compressing in batches of
> > > > ZSWAP_MAX_BATCH_SIZE even for software compressors. Latency
> benefits
> > > > from cache locality of working-set data. Hence the approach that we
> have
> > > > settled on is pool->batch_size = ZSWAP_MAX_BATCH_SIZE if
> > > > the compressor does not support batching (i.e., if pool-
> >compr_batch_size is
> > > 1).
> > > > If it does, then pool->batch_size = pool->compr_batch_size.
> > >
> > > I understand that even without a hardware batch, you can still
> > > have some software batching that excludes compression.
> > >
> > > However, based on the code below, it looks like
> > > pool->compr_batch_size is almost always either equal to
> > > pool->batch_size or 1:
> > >
> > > +       pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
> > > +                                    crypto_acomp_batch_size(acomp_ctx->acomp));
> >
> > I would like to explain some of the considerations in coming up with this
> > approach:
> >
> > 1) The compression algorithm gets to decide an optimal batch-size.
> >     For a hardware accelerator such as IAA, this value could be different
> >     than ZSWAP_MAX_BATCH_SIZE.
> >
> > 2) ZSWAP_MAX_BATCH_SIZE acts as a limiting factor to the # of acomp_ctx
> >     per-CPU resources that will be allocated in zswap_cpu_comp_prepare();
> >     as per Yosry's suggestion. This helps limit the memory overhead for
> >     batching algorithms.
> >
> > 3) If a batching algorithm works with a batch size "X" , where
> >      1 < X < ZSWAP_MAX_BATCH_SIZE, two things need to happen:
> >      a) We want to only allocate "X" per-CPU resources.
> >      b) We want to process the folio in batches of "X", not
> ZSWAP_MAX_BATCH_SIZE
> >           to avail of the benefits of hardware parallelism. This is the
> compression
> >           step size you also mention.
> >           In particular, we cannot treat batch_size as ZSWAP_MAX_BATCH_SIZE,
> >           and send a batch of ZSWAP_MAX_BATCH_SIZE pages to
> zswap_compress()
> >           in this case. For e.g., what if the compress step-size is 6, but the new
> >           zswap_store_pages() introduced in patch 23 sends 8 pages to
> >           zswap_compress() because ZSWAP_MAX_BATCH_SIZE is set to 8?
> >           The code in zswap_compress() could get quite messy, which will
> impact
> >           latency.
> 
> If ZSWAP_MAX_BATCH_SIZE is set to 8 and there is no hardware batching,
> compression is done with a step size of 1. If the hardware step size is 4,
> compression occurs in two steps. If the hardware step size is 6, the first
> compression uses a step size of 6, and the second uses a step size of 2.
> Do you think this will work?

Hi Barry,

This would be non-optimal from code simplicity and latency perspectives.
One of the benefits of using the hardware accelerator's "batch parallelism"
is cost amortization across the batch. We might lose this benefit if we make
multiple calls to zswap_compress() to ask the hardware accelerator to
batch compress in smaller batches. Compression throughput would also
be sub-optimal.

In my patch-series, the rule is simple: if an algorithm has specified a
batch-size, carve out the folio by that "batch-size" # of pages to be
compressed as a batch in zswap_compress(). This custom batch-size
is capped at ZSWAP_MAX_BATCH_SIZE.

If an algorithm has not specified a batch-size, the default batch-size
is 1. In this case, carve out the folio by ZSWAP_MAX_BATCH_SIZE
# of pages to be compressed as a batch in zswap_compress().
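
To illustrate with hypothetical numbers (ZSWAP_MAX_BATCH_SIZE == 8):

	/*
	 * deflate-iaa reporting a batch-size of 8:
	 *   compr_batch_size = 8, batch_size = 8
	 *   -> each zswap_store_pages() call covers 8 pages and issues one
	 *      batched crypto_acomp_compress() call.
	 *
	 * zstd (no batching support):
	 *   compr_batch_size = 1, batch_size = ZSWAP_MAX_BATCH_SIZE = 8
	 *   -> each zswap_store_pages() call still covers 8 pages; the pages
	 *      are compressed one at a time, but entry allocation and the
	 *      xarray/LRU updates are done for the whole batch.
	 */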

> 
> I don’t quite understand why you want to save
> ZSWAP_MAX_BATCH_SIZE - X resources, since even without hardware
> batching
> you are still allocating all ZSWAP_MAX_BATCH_SIZE resources. This is the
> case for all platforms except yours.

Not sure I understand. Just to clarify, this is not done to save on resources,
but rather for the reasons stated above.

We are already saving on resources by allocating only
"pool->compr_batch_size" number of resources
(*not* ZSWAP_MAX_BATCH_SIZE resources):

	pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
				     crypto_acomp_batch_size(acomp_ctx->acomp));

For non-Intel platforms, this means only 1 dst buffer is allocated, as
explained in the commit log for this patch.

" A "u8 compr_batch_size" member is added to "struct zswap_pool", as per
Yosry's suggestion. pool->compr_batch_size is set as the minimum of the
compressor's max batch-size and ZSWAP_MAX_BATCH_SIZE. Accordingly, it
proceeds to allocate the necessary compression dst buffers in the
per-CPU acomp_ctx."

Thanks,
Kanchana

> 
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-28 22:47                 ` Sridhar, Kanchana P
@ 2025-08-28 23:28                   ` Barry Song
  2025-08-29  2:56                     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Barry Song @ 2025-08-28 23:28 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh

> >
> > If ZSWAP_MAX_BATCH_SIZE is set to 8 and there is no hardware batching,
> > compression is done with a step size of 1. If the hardware step size is 4,
> > compression occurs in two steps. If the hardware step size is 6, the first
> > compression uses a step size of 6, and the second uses a step size of 2.
> > Do you think this will work?
>
> Hi Barry,
>
> This would be non-optimal from code simplicity and latency perspectives.
> One of the benefits of using the hardware accelerator's "batch parallelism"
> is cost amortization across the batch. We might lose this benefit if we make
> multiple calls to zswap_compress() to ask the hardware accelerator to
> batch compress in smaller batches. Compression throughput would also
> be sub-optimal.

I guess it wouldn’t be an issue if both ZSWAP_MAX_BATCH_SIZE and
pool->compr_batch_size are powers of two. As you mentioned, we still
gain improvement with ZSWAP_MAX_BATCH_SIZE batching even when
pool->compr_batch_size == 1, by compressing pages one by one but
batching other work such as zswap_entries_cache_alloc_batch() ?

>
> In my patch-series, the rule is simple: if an algorithm has specified a
> batch-size, carve out the folio by that "batch-size" # of pages to be
> compressed as a batch in zswap_compress(). This custom batch-size
> is capped at ZSWAP_MAX_BATCH_SIZE.
>
> If an algorithm has not specified a batch-size, the default batch-size
> is 1. In this case, carve out the folio by ZSWAP_MAX_BATCH_SIZE
> # of pages to be compressed as a batch in zswap_compress().

Yes, I understand your rule. However, having two global variables is still
somewhat confusing. It might be clearer to use a single variable with a
comment, since one variable can clearly determine the value of the other.

Can we get the batch_size at runtime based on pool->compr_batch_size?

/*
 * If hardware compression supports batching, we use the hardware step size.
 * Otherwise, we use ZSWAP_MAX_BATCH_SIZE for batching, but still compress
 * one page at a time.
 */
batch_size = pool->compr_batch_size > 1 ? pool->compr_batch_size :
             ZSWAP_MAX_BATCH_SIZE;

We probably don’t need this if both pool->compr_batch_size and
ZSWAP_MAX_BATCH_SIZE are powers of two?

>
> >
> > I don’t quite understand why you want to save
> > ZSWAP_MAX_BATCH_SIZE - X resources, since even without hardware
> > batching
> > you are still allocating all ZSWAP_MAX_BATCH_SIZE resources. This is the
> > case for all platforms except yours.
>
> Not sure I understand.. Just to clarify, this is not done to save on resources,
> rather for the reasons stated above.
>
> We are already saving on resources by only allocating only
> "pool->compr_batch_size" number of resources
> (*not* ZSWAP_MAX_BATCH_SIZE resources):
>
>         pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
>                                      crypto_acomp_batch_size(acomp_ctx->acomp));
>
> For non-Intel platforms, this means only 1 dst buffer is allocated, as
> explained in the commit log for this patch.

you are right. I misunderstood your code :-)

>
> " A "u8 compr_batch_size" member is added to "struct zswap_pool", as per
> Yosry's suggestion. pool->compr_batch_size is set as the minimum of the
> compressor's max batch-size and ZSWAP_MAX_BATCH_SIZE. Accordingly, it
> proceeds to allocate the necessary compression dst buffers in the
> per-CPU acomp_ctx."

This is fine, but it still doesn’t provide a strong reason for having
two global variables when one can fully determine the value of the other.

Thanks
Barry


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios.
  2025-08-01  4:36 ` [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios Kanchana P Sridhar
  2025-08-14 21:14   ` Nhat Pham
@ 2025-08-28 23:54   ` Barry Song
  2025-08-29  3:04     ` Sridhar, Kanchana P
  1 sibling, 1 reply; 68+ messages in thread
From: Barry Song @ 2025-08-28 23:54 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, akpm,
	senozhatsky, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, vinicius.gomes, wajdi.k.feghali,
	vinodh.gopal

> +static bool zswap_compress(struct folio *folio, long start, unsigned int nr_pages,
> +                          struct zswap_entry *entries[], struct zswap_pool *pool,
> +                          int node_id)
>  {
>         struct crypto_acomp_ctx *acomp_ctx;
>         struct scatterlist input, output;
> -       int comp_ret = 0, alloc_ret = 0;
> -       unsigned int dlen = PAGE_SIZE;
> -       unsigned long handle;
> -       struct zpool *zpool;
> +       struct zpool *zpool = pool->zpool;
> +
> +       unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
> +       int errors[ZSWAP_MAX_BATCH_SIZE];
> +
> +       unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> +       unsigned int i, j;
> +       int err;
>         gfp_t gfp;
> -       u8 *dst;
> +
> +       gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
>
>         acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
>
>         mutex_lock(&acomp_ctx->mutex);
>
> -       dst = acomp_ctx->buffers[0];
> -       sg_init_table(&input, 1);
> -       sg_set_page(&input, page, PAGE_SIZE, 0);
> -
>         /*
> -        * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> -        * and hardware-accelerators may won't check the dst buffer size, so
> -        * giving the dst buffer with enough length to avoid buffer overflow.
> +        * Note:
> +        * [i] refers to the incoming batch space and is used to
> +        *     index into the folio pages, @entries and @errors.
>          */
> -       sg_init_one(&output, dst, PAGE_SIZE * 2);
> -       acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> +       for (i = 0; i < nr_pages; i += nr_comps) {
> +               if (nr_comps == 1) {
> +                       sg_init_table(&input, 1);
> +                       sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE, 0);
>
> -       /*
> -        * it maybe looks a little bit silly that we send an asynchronous request,
> -        * then wait for its completion synchronously. This makes the process look
> -        * synchronous in fact.
> -        * Theoretically, acomp supports users send multiple acomp requests in one
> -        * acomp instance, then get those requests done simultaneously. but in this
> -        * case, zswap actually does store and load page by page, there is no
> -        * existing method to send the second page before the first page is done
> -        * in one thread doing zwap.
> -        * but in different threads running on different cpu, we have different
> -        * acomp instance, so multiple threads can do (de)compression in parallel.
> -        */
> -       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> -       dlen = acomp_ctx->req->dlen;
> -       if (comp_ret)
> -               goto unlock;
> +                       /*
> +                        * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> +                        * and hardware-accelerators may won't check the dst buffer size, so
> +                        * giving the dst buffer with enough length to avoid buffer overflow.
> +                        */
> +                       sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
> +                       acomp_request_set_params(acomp_ctx->req, &input,
> +                                                &output, PAGE_SIZE, PAGE_SIZE);
> +
> +                       errors[i] = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> +                                                   &acomp_ctx->wait);
> +                       if (unlikely(errors[i]))
> +                               goto compress_error;
> +
> +                       dlens[i] = acomp_ctx->req->dlen;
> +               } else {
> +                       struct page *pages[ZSWAP_MAX_BATCH_SIZE];
> +                       unsigned int k;
> +
> +                       for (k = 0; k < nr_pages; ++k)
> +                               pages[k] = folio_page(folio, start + k);
> +
> +                       struct swap_batch_comp_data batch_comp_data = {
> +                               .pages = pages,
> +                               .dsts = acomp_ctx->buffers,
> +                               .dlens = dlens,
> +                               .errors = errors,
> +                               .nr_comps = nr_pages,
> +                       };

Why would this work given that nr_pages might be larger than
pool->compr_batch_size?

unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);

So this actually doesn’t happen unless pool->compr_batch_size == 1,
but the code is confusing, right?

> +
> +                       acomp_ctx->req->kernel_data = &batch_comp_data;

Can you actually pass a request larger than pool->compr_batch_size
to the crypto driver?

By the way, swap_batch_comp_data seems like a poor name. Why should
crypto drivers know anything about swap_? kernel_data isn’t ideal either;
maybe batch_data would be better ?

> +
> +                       if (unlikely(crypto_acomp_compress(acomp_ctx->req)))
> +                               goto compress_error;
> +               }

Thanks
Barry


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 23/24] mm: zswap: zswap_store() will process a large folio in batches.
  2025-08-01  4:36 ` [PATCH v11 23/24] mm: zswap: zswap_store() will process a large folio in batches Kanchana P Sridhar
  2025-08-14 21:05   ` Nhat Pham
@ 2025-08-28 23:59   ` Barry Song
  2025-08-29  3:06     ` Sridhar, Kanchana P
  1 sibling, 1 reply; 68+ messages in thread
From: Barry Song @ 2025-08-28 23:59 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, akpm,
	senozhatsky, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, vinicius.gomes, wajdi.k.feghali,
	vinodh.gopal

On Fri, Aug 1, 2025 at 4:36 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch modifies zswap_store() to store a batch of pages in large
> folios at a time, instead of storing one page at a time. It does this by
> calling a new procedure zswap_store_pages() with a range of
> "pool->batch_size" indices in the folio.
>
> zswap_store_pages() implements all the computes done earlier in
> zswap_store_page() for a single-page, for multiple pages in a folio,
> namely the "batch":
>
> 1) It starts by allocating all zswap entries required to store the
>    batch. New procedures, zswap_entries_cache_alloc_batch() and
>    zswap_entries_cache_free_batch() call kmem_cache_[free]alloc_bulk()
>    to optimize the performance of this step.
>
> 2) Next, the entries fields are written, computations that need to happen
>    anyway, without modifying the zswap xarray/LRU publishing order. This
>    improves latency by avoiding having to bring the entries into the
>    cache for writing in different code blocks within this procedure.
>
> 3) Next, it calls zswap_compress() to sequentially compress each page in
>    the batch.
>
> 4) Finally, it adds the batch's zswap entries to the xarray and LRU,
>    charges zswap memory and increments zswap stats.
>
> 5) The error handling and cleanup required for all failure scenarios
>    that can occur while storing a batch in zswap are consolidated to a
>    single "store_pages_failed" label in zswap_store_pages(). Here again,
>    we optimize performance by calling kmem_cache_free_bulk().
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 218 ++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 149 insertions(+), 69 deletions(-)

This seems fine overall. However, could we pull some data from the
cover letter into the commit log? For example, even with hardware batching,
we are still improving performance. Since your cover letter is very long,
readers might fail to connect this data with the patches.

Thanks
Barry


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 19/24] crypto: iaa - IAA acomp_algs register the get_batch_size() interface.
  2025-08-01  4:36 ` [PATCH v11 19/24] crypto: iaa - IAA acomp_algs register the get_batch_size() interface Kanchana P Sridhar
@ 2025-08-29  0:16   ` Barry Song
  2025-08-29  3:12     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Barry Song @ 2025-08-29  0:16 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, akpm,
	senozhatsky, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, vinicius.gomes, wajdi.k.feghali,
	vinodh.gopal

On Fri, Aug 1, 2025 at 4:36 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> The Fixed ("deflate-iaa") and Dynamic ("deflate-iaa-dynamic") IAA
> acomp_algs register an implementation for get_batch_size(). zswap can
> query crypto_acomp_batch_size() to get the maximum number of requests
> that can be batch [de]compressed. zswap can use the minimum of this, and
> any zswap-specific upper limits for batch-size to allocate batching
> resources.
>
> This enables zswap to compress/decompress pages in parallel in the IAA
> hardware accelerator to improve swapout/swapin performance and memory
> savings.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
> index 480e12c1d77a5..b7c6fc334dae7 100644
> --- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
> +++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
> @@ -2785,6 +2785,7 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
>         .init                   = iaa_comp_init_fixed,
>         .compress               = iaa_comp_acompress_main,
>         .decompress             = iaa_comp_adecompress_main,
> +       .get_batch_size         = iaa_comp_get_max_batch_size,
>         .base                   = {
>                 .cra_name               = "deflate",
>                 .cra_driver_name        = "deflate-iaa",
> @@ -2810,6 +2811,7 @@ static struct acomp_alg iaa_acomp_dynamic_deflate = {
>         .init                   = iaa_comp_init_dynamic,
>         .compress               = iaa_comp_acompress_main,
>         .decompress             = iaa_comp_adecompress_main,
> +       .get_batch_size         = iaa_comp_get_max_batch_size,

I feel the patches are being split too finely and are not fully
self-contained. You added iaa_comp_get_max_batch_size in the previous
patch, but the callback appears in this one. Why not combine them?
Anyway, since you are moving to a static field, this patch
will be removed automatically. Could you reconsider organizing v12
to make it easier for everyone to follow? :-)

Thanks
Barry


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-28 23:28                   ` Barry Song
@ 2025-08-29  2:56                     ` Sridhar, Kanchana P
  2025-08-29  3:42                       ` Barry Song
  0 siblings, 1 reply; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-29  2:56 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Thursday, August 28, 2025 4:29 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> if the compressor supports batching.
> 
> > >
> > > If ZSWAP_MAX_BATCH_SIZE is set to 8 and there is no hardware batching,
> > > compression is done with a step size of 1. If the hardware step size is 4,
> > > compression occurs in two steps. If the hardware step size is 6, the first
> > > compression uses a step size of 6, and the second uses a step size of 2.
> > > Do you think this will work?
> >
> > Hi Barry,
> >
> > This would be non-optimal from code simplicity and latency perspectives.
> > One of the benefits of using the hardware accelerator's "batch parallelism"
> > is cost amortization across the batch. We might lose this benefit if we make
> > multiple calls to zswap_compress() to ask the hardware accelerator to
> > batch compress in smaller batches. Compression throughput would also
> > be sub-optimal.
> 
> I guess it wouldn’t be an issue if both ZSWAP_MAX_BATCH_SIZE and
> pool->compr_batch_size are powers of two. As you mentioned, we still
> gain improvement with ZSWAP_MAX_BATCH_SIZE batching even when
> pool->compr_batch_size == 1, by compressing pages one by one but
> batching other work such as zswap_entries_cache_alloc_batch() ?
> 
> >
> > In my patch-series, the rule is simple: if an algorithm has specified a
> > batch-size, carve out the folio by that "batch-size" # of pages to be
> > compressed as a batch in zswap_compress(). This custom batch-size
> > is capped at ZSWAP_MAX_BATCH_SIZE.
> >
> > If an algorithm has not specified a batch-size, the default batch-size
> > is 1. In this case, carve out the folio by ZSWAP_MAX_BATCH_SIZE
> > # of pages to be compressed as a batch in zswap_compress().
> 
> Yes, I understand your rule. However, having two global variables is still
> somewhat confusing. It might be clearer to use a single variable with a
> comment, since one variable can clearly determine the value of the other.
> 
> Can we get the batch_size at runtime based on pool->compr_batch_size?
> 
> /*
>  * If hardware compression supports batching, we use the hardware step size.
>  * Otherwise, we use ZSWAP_MAX_BATCH_SIZE for batching, but still
> compress
>  * one page at a time.
>  */
> batch_size = pool->compr_batch_size > 1 ? pool->compr_batch_size :
>              ZSWAP_MAX_BATCH_SIZE;
> 
> We probably don’t need this if both pool->compr_batch_size and
> ZSWAP_MAX_BATCH_SIZE are powers of two?

I am not sure I understand this rationale, but I do want to reiterate
that the patch-set implements a simple set of rules/design choices
to provide a batching framework for software and hardware compressors,
that has shown good performance improvements with both, while
unifying zswap_store()/zswap_compress() code paths for both.

As explained before, keeping the two variables as distinct u8 members
of struct zswap_pool is a design choice with these benefits:

1) Saves computation by avoiding this calculation in the performance-critical
    zswap_store() code (see the sketch below). I have verified that dynamically
    computing the batch_size based on pool->compr_batch_size impacts latency.

2) The memory overhead is minimal: there is at most one zswap_pool
    active at any given time, other than at compressor transitions. The
    additional overhead is one u8, i.e., 1 byte for the single runtime struct.
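
For reference, a minimal sketch of how the two fields relate, using the
pool-creation logic quoted elsewhere in this thread (illustrative only,
not the actual patch code):

    /* Both values are fixed once at pool creation, not in zswap_store(). */
    pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
                                 crypto_acomp_batch_size(acomp_ctx->acomp));

    /*
     * Store-side batch size: the compressor's own batch size if it
     * batches, otherwise ZSWAP_MAX_BATCH_SIZE (pages are then compressed
     * one at a time within the store batch).
     */
    pool->batch_size = (pool->compr_batch_size > 1) ?
                       pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;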

> 
> >
> > >
> > > I don’t quite understand why you want to save
> > > ZSWAP_MAX_BATCH_SIZE - X resources, since even without hardware
> > > batching
> > > you are still allocating all ZSWAP_MAX_BATCH_SIZE resources. This is the
> > > case for all platforms except yours.
> >
> > Not sure I understand.. Just to clarify, this is not done to save on resources,
> > rather for the reasons stated above.
> >
> > We are already saving on resources by only allocating only
> > "pool->compr_batch_size" number of resources
> > (*not* ZSWAP_MAX_BATCH_SIZE resources):
> >
> >         pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
> >                                      crypto_acomp_batch_size(acomp_ctx->acomp));
> >
> > For non-Intel platforms, this means only 1 dst buffer is allocated, as
> > explained in the commit log for this patch.
> 
> you are right. I misunderstood your code :-)
> 
> >
> > " A "u8 compr_batch_size" member is added to "struct zswap_pool", as per
> > Yosry's suggestion. pool->compr_batch_size is set as the minimum of the
> > compressor's max batch-size and ZSWAP_MAX_BATCH_SIZE. Accordingly, it
> > proceeds to allocate the necessary compression dst buffers in the
> > per-CPU acomp_ctx."
> 
> This is fine, but it still doesn’t provide a strong reason for having
> two global variables when one can fully determine the value of the other.

Hopefully the above provides clarity.

Thanks,
Kanchana

> 
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios.
  2025-08-28 23:54   ` Barry Song
@ 2025-08-29  3:04     ` Sridhar, Kanchana P
  2025-08-29  3:31       ` Barry Song
  0 siblings, 1 reply; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-29  3:04 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Thursday, August 28, 2025 4:54 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> > +static bool zswap_compress(struct folio *folio, long start, unsigned int
> nr_pages,
> > +                          struct zswap_entry *entries[], struct zswap_pool *pool,
> > +                          int node_id)
> >  {
> >         struct crypto_acomp_ctx *acomp_ctx;
> >         struct scatterlist input, output;
> > -       int comp_ret = 0, alloc_ret = 0;
> > -       unsigned int dlen = PAGE_SIZE;
> > -       unsigned long handle;
> > -       struct zpool *zpool;
> > +       struct zpool *zpool = pool->zpool;
> > +
> > +       unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
> > +       int errors[ZSWAP_MAX_BATCH_SIZE];
> > +
> > +       unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> > +       unsigned int i, j;
> > +       int err;
> >         gfp_t gfp;
> > -       u8 *dst;
> > +
> > +       gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> __GFP_MOVABLE;
> >
> >         acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> >
> >         mutex_lock(&acomp_ctx->mutex);
> >
> > -       dst = acomp_ctx->buffers[0];
> > -       sg_init_table(&input, 1);
> > -       sg_set_page(&input, page, PAGE_SIZE, 0);
> > -
> >         /*
> > -        * We need PAGE_SIZE * 2 here since there maybe over-compression
> case,
> > -        * and hardware-accelerators may won't check the dst buffer size, so
> > -        * giving the dst buffer with enough length to avoid buffer overflow.
> > +        * Note:
> > +        * [i] refers to the incoming batch space and is used to
> > +        *     index into the folio pages, @entries and @errors.
> >          */
> > -       sg_init_one(&output, dst, PAGE_SIZE * 2);
> > -       acomp_request_set_params(acomp_ctx->req, &input, &output,
> PAGE_SIZE, dlen);
> > +       for (i = 0; i < nr_pages; i += nr_comps) {
> > +               if (nr_comps == 1) {
> > +                       sg_init_table(&input, 1);
> > +                       sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE, 0);
> >
> > -       /*
> > -        * it maybe looks a little bit silly that we send an asynchronous request,
> > -        * then wait for its completion synchronously. This makes the process
> look
> > -        * synchronous in fact.
> > -        * Theoretically, acomp supports users send multiple acomp requests in
> one
> > -        * acomp instance, then get those requests done simultaneously. but in
> this
> > -        * case, zswap actually does store and load page by page, there is no
> > -        * existing method to send the second page before the first page is
> done
> > -        * in one thread doing zwap.
> > -        * but in different threads running on different cpu, we have different
> > -        * acomp instance, so multiple threads can do (de)compression in
> parallel.
> > -        */
> > -       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> >req), &acomp_ctx->wait);
> > -       dlen = acomp_ctx->req->dlen;
> > -       if (comp_ret)
> > -               goto unlock;
> > +                       /*
> > +                        * We need PAGE_SIZE * 2 here since there maybe over-
> compression case,
> > +                        * and hardware-accelerators may won't check the dst buffer
> size, so
> > +                        * giving the dst buffer with enough length to avoid buffer
> overflow.
> > +                        */
> > +                       sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
> > +                       acomp_request_set_params(acomp_ctx->req, &input,
> > +                                                &output, PAGE_SIZE, PAGE_SIZE);
> > +
> > +                       errors[i] =
> crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> > +                                                   &acomp_ctx->wait);
> > +                       if (unlikely(errors[i]))
> > +                               goto compress_error;
> > +
> > +                       dlens[i] = acomp_ctx->req->dlen;
> > +               } else {
> > +                       struct page *pages[ZSWAP_MAX_BATCH_SIZE];
> > +                       unsigned int k;
> > +
> > +                       for (k = 0; k < nr_pages; ++k)
> > +                               pages[k] = folio_page(folio, start + k);
> > +
> > +                       struct swap_batch_comp_data batch_comp_data = {
> > +                               .pages = pages,
> > +                               .dsts = acomp_ctx->buffers,
> > +                               .dlens = dlens,
> > +                               .errors = errors,
> > +                               .nr_comps = nr_pages,
> > +                       };
> 
> Why would this work given that nr_pages might be larger than
> pool->compr_batch_size?

You mean the batching call? For batching compressors, nr_pages is always
<= pool->batch_size, and pool->batch_size is the same as
pool->compr_batch_size.

> 
> unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> 
> So this actually doesn’t happen unless pool->compr_batch_size == 1,
> but the code is confusing, right?
> 
> > +
> > +                       acomp_ctx->req->kernel_data = &batch_comp_data;
> 
> Can you actually pass a request larger than pool->compr_batch_size
> to the crypto driver?

Clarification above..

> 
> By the way, swap_batch_comp_data seems like a poor name. Why should
> crypto drivers know anything about swap_? kernel_data isn’t ideal either;
> maybe batch_data would be better ?

This will be changing in v12 to use an SG list based on Herbert's suggestions.

Thanks,
Kanchana

> 
> > +
> > +                       if (unlikely(crypto_acomp_compress(acomp_ctx->req)))
> > +                               goto compress_error;
> > +               }
> 
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 23/24] mm: zswap: zswap_store() will process a large folio in batches.
  2025-08-28 23:59   ` Barry Song
@ 2025-08-29  3:06     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-29  3:06 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Thursday, August 28, 2025 5:00 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 23/24] mm: zswap: zswap_store() will process a
> large folio in batches.
> 
> On Fri, Aug 1, 2025 at 4:36 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch modifies zswap_store() to store a batch of pages in large
> > folios at a time, instead of storing one page at a time. It does this by
> > calling a new procedure zswap_store_pages() with a range of
> > "pool->batch_size" indices in the folio.
> >
> > zswap_store_pages() implements all the computation done earlier in
> > zswap_store_page() for a single page, for multiple pages in a folio,
> > namely the "batch":
> >
> > 1) It starts by allocating all zswap entries required to store the
> >    batch. New procedures, zswap_entries_cache_alloc_batch() and
> >    zswap_entries_cache_free_batch() call kmem_cache_[free]alloc_bulk()
> >    to optimize the performance of this step.
> >
> > 2) Next, the entries fields are written, computes that need to happen
> >    anyway, without modifying the zswap xarray/LRU publishing order. This
> >    improves latency by avoiding having to bring the entries into the
> >    cache for writing in different code blocks within this procedure.
> >
> > 3) Next, it calls zswap_compress() to sequentially compress each page in
> >    the batch.
> >
> > 4) Finally, it adds the batch's zswap entries to the xarray and LRU,
> >    charges zswap memory and increments zswap stats.
> >
> > 5) The error handling and cleanup required for all failure scenarios
> >    that can occur while storing a batch in zswap are consolidated to a
> >    single "store_pages_failed" label in zswap_store_pages(). Here again,
> >    we optimize performance by calling kmem_cache_free_bulk().
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 218 ++++++++++++++++++++++++++++++++++++-------------
> ----
> >  1 file changed, 149 insertions(+), 69 deletions(-)
> 
> This seems fine overall. However, could we pull some data from the
> cover letter? For example, even with hardware batching, we are still
> improving performance. Since your cover letter is very long, readers
> might fail to connect this data with the patches.

Sure, will add the data in the commit log.

Thanks,
Kanchana

> 
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 19/24] crypto: iaa - IAA acomp_algs register the get_batch_size() interface.
  2025-08-29  0:16   ` Barry Song
@ 2025-08-29  3:12     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-29  3:12 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Thursday, August 28, 2025 5:17 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 19/24] crypto: iaa - IAA acomp_algs register the
> get_batch_size() interface.
> 
> On Fri, Aug 1, 2025 at 4:36 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > The Fixed ("deflate-iaa") and Dynamic ("deflate-iaa-dynamic") IAA
> > acomp_algs register an implementation for get_batch_size(). zswap can
> > query crypto_acomp_batch_size() to get the maximum number of requests
> > that can be batch [de]compressed. zswap can use the minimum of this, and
> > any zswap-specific upper limits for batch-size to allocate batching
> > resources.
> >
> > This enables zswap to compress/decompress pages in parallel in the IAA
> > hardware accelerator to improve swapout/swapin performance and
> memory
> > savings.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c
> b/drivers/crypto/intel/iaa/iaa_crypto_main.c
> > index 480e12c1d77a5..b7c6fc334dae7 100644
> > --- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
> > +++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
> > @@ -2785,6 +2785,7 @@ static struct acomp_alg iaa_acomp_fixed_deflate
> = {
> >         .init                   = iaa_comp_init_fixed,
> >         .compress               = iaa_comp_acompress_main,
> >         .decompress             = iaa_comp_adecompress_main,
> > +       .get_batch_size         = iaa_comp_get_max_batch_size,
> >         .base                   = {
> >                 .cra_name               = "deflate",
> >                 .cra_driver_name        = "deflate-iaa",
> > @@ -2810,6 +2811,7 @@ static struct acomp_alg
> iaa_acomp_dynamic_deflate = {
> >         .init                   = iaa_comp_init_dynamic,
> >         .compress               = iaa_comp_acompress_main,
> >         .decompress             = iaa_comp_adecompress_main,
> > +       .get_batch_size         = iaa_comp_get_max_batch_size,
> 
> I feel the patches are being split too finely and are not fully
> self-contained. You added iaa_comp_get_max_batch_size in the previous
> patch, but the callback appears in this one. Why not combine them
> together? Anyway, since you are moving to a static field, this patch
> will be removed automatically.

Yes, based on your earlier suggestion I have made a note to bundle
related patches :)

Thanks,
Kanchana



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios.
  2025-08-29  3:04     ` Sridhar, Kanchana P
@ 2025-08-29  3:31       ` Barry Song
  2025-08-29  3:39         ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Barry Song @ 2025-08-29  3:31 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh

On Fri, Aug 29, 2025 at 11:05 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> > -----Original Message-----
> > From: Barry Song <21cnbao@gmail.com>
> > Sent: Thursday, August 28, 2025 4:54 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> > foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> > herbert@gondor.apana.org.au; davem@davemloft.net;
> > clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> > surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> > Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with
> > compress batching of large folios.
> >
> > > +static bool zswap_compress(struct folio *folio, long start, unsigned int
> > nr_pages,
> > > +                          struct zswap_entry *entries[], struct zswap_pool *pool,
> > > +                          int node_id)
> > >  {
> > >         struct crypto_acomp_ctx *acomp_ctx;
> > >         struct scatterlist input, output;
> > > -       int comp_ret = 0, alloc_ret = 0;
> > > -       unsigned int dlen = PAGE_SIZE;
> > > -       unsigned long handle;
> > > -       struct zpool *zpool;
> > > +       struct zpool *zpool = pool->zpool;
> > > +
> > > +       unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
> > > +       int errors[ZSWAP_MAX_BATCH_SIZE];
> > > +
> > > +       unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> > > +       unsigned int i, j;
> > > +       int err;
> > >         gfp_t gfp;
> > > -       u8 *dst;
> > > +
> > > +       gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> > __GFP_MOVABLE;
> > >
> > >         acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > >
> > >         mutex_lock(&acomp_ctx->mutex);
> > >
> > > -       dst = acomp_ctx->buffers[0];
> > > -       sg_init_table(&input, 1);
> > > -       sg_set_page(&input, page, PAGE_SIZE, 0);
> > > -
> > >         /*
> > > -        * We need PAGE_SIZE * 2 here since there maybe over-compression
> > case,
> > > -        * and hardware-accelerators may won't check the dst buffer size, so
> > > -        * giving the dst buffer with enough length to avoid buffer overflow.
> > > +        * Note:
> > > +        * [i] refers to the incoming batch space and is used to
> > > +        *     index into the folio pages, @entries and @errors.
> > >          */
> > > -       sg_init_one(&output, dst, PAGE_SIZE * 2);
> > > -       acomp_request_set_params(acomp_ctx->req, &input, &output,
> > PAGE_SIZE, dlen);
> > > +       for (i = 0; i < nr_pages; i += nr_comps) {
> > > +               if (nr_comps == 1) {
> > > +                       sg_init_table(&input, 1);
> > > +                       sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE, 0);
> > >
> > > -       /*
> > > -        * it maybe looks a little bit silly that we send an asynchronous request,
> > > -        * then wait for its completion synchronously. This makes the process
> > look
> > > -        * synchronous in fact.
> > > -        * Theoretically, acomp supports users send multiple acomp requests in
> > one
> > > -        * acomp instance, then get those requests done simultaneously. but in
> > this
> > > -        * case, zswap actually does store and load page by page, there is no
> > > -        * existing method to send the second page before the first page is
> > done
> > > -        * in one thread doing zwap.
> > > -        * but in different threads running on different cpu, we have different
> > > -        * acomp instance, so multiple threads can do (de)compression in
> > parallel.
> > > -        */
> > > -       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> > >req), &acomp_ctx->wait);
> > > -       dlen = acomp_ctx->req->dlen;
> > > -       if (comp_ret)
> > > -               goto unlock;
> > > +                       /*
> > > +                        * We need PAGE_SIZE * 2 here since there maybe over-
> > compression case,
> > > +                        * and hardware-accelerators may won't check the dst buffer
> > size, so
> > > +                        * giving the dst buffer with enough length to avoid buffer
> > overflow.
> > > +                        */
> > > +                       sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
> > > +                       acomp_request_set_params(acomp_ctx->req, &input,
> > > +                                                &output, PAGE_SIZE, PAGE_SIZE);
> > > +
> > > +                       errors[i] =
> > crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> > > +                                                   &acomp_ctx->wait);
> > > +                       if (unlikely(errors[i]))
> > > +                               goto compress_error;
> > > +
> > > +                       dlens[i] = acomp_ctx->req->dlen;
> > > +               } else {
> > > +                       struct page *pages[ZSWAP_MAX_BATCH_SIZE];
> > > +                       unsigned int k;
> > > +
> > > +                       for (k = 0; k < nr_pages; ++k)
> > > +                               pages[k] = folio_page(folio, start + k);
> > > +
> > > +                       struct swap_batch_comp_data batch_comp_data = {
> > > +                               .pages = pages,
> > > +                               .dsts = acomp_ctx->buffers,
> > > +                               .dlens = dlens,
> > > +                               .errors = errors,
> > > +                               .nr_comps = nr_pages,
> > > +                       };
> >
> > Why would this work given that nr_pages might be larger than
> > pool->compr_batch_size?
>
> You mean the batching call? For batching compressors, nr_pages is always
> <= pool->batch_size, and pool->batch_size is the same as
> pool->compr_batch_size.

I’m actually confused that this feels inconsistent with the earlier

    unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);

So why not just use nr_comps instead?

>
> >
> > unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> >
> > So this actually doesn’t happen unless pool->compr_batch_size == 1,
> > but the code is confusing, right?
> >
> > > +
> > > +                       acomp_ctx->req->kernel_data = &batch_comp_data;
> >
> > Can you actually pass a request larger than pool->compr_batch_size
> > to the crypto driver?
>
> Clarification above..
>
> >
> > By the way, swap_batch_comp_data seems like a poor name. Why should
> > crypto drivers know anything about swap_? kernel_data isn’t ideal either;
> > maybe batch_data would be better ?
>
> This will be changing in v12 to use an SG list based on Herbert's suggestions.
>

Cool. Thanks!

Thanks
Barry


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios.
  2025-08-29  3:31       ` Barry Song
@ 2025-08-29  3:39         ` Sridhar, Kanchana P
  0 siblings, 0 replies; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-29  3:39 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Thursday, August 28, 2025 8:31 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
> 
> On Fri, Aug 29, 2025 at 11:05 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Barry Song <21cnbao@gmail.com>
> > > Sent: Thursday, August 28, 2025 4:54 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> > > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > > ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> > > foundation.org; senozhatsky@chromium.org; linux-
> crypto@vger.kernel.org;
> > > herbert@gondor.apana.org.au; davem@davemloft.net;
> > > clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> > > surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> > > Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v11 24/24] mm: zswap: Batched zswap_compress()
> with
> > > compress batching of large folios.
> > >
> > > > +static bool zswap_compress(struct folio *folio, long start, unsigned int
> > > nr_pages,
> > > > +                          struct zswap_entry *entries[], struct zswap_pool *pool,
> > > > +                          int node_id)
> > > >  {
> > > >         struct crypto_acomp_ctx *acomp_ctx;
> > > >         struct scatterlist input, output;
> > > > -       int comp_ret = 0, alloc_ret = 0;
> > > > -       unsigned int dlen = PAGE_SIZE;
> > > > -       unsigned long handle;
> > > > -       struct zpool *zpool;
> > > > +       struct zpool *zpool = pool->zpool;
> > > > +
> > > > +       unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
> > > > +       int errors[ZSWAP_MAX_BATCH_SIZE];
> > > > +
> > > > +       unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> > > > +       unsigned int i, j;
> > > > +       int err;
> > > >         gfp_t gfp;
> > > > -       u8 *dst;
> > > > +
> > > > +       gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> > > __GFP_MOVABLE;
> > > >
> > > >         acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > > >
> > > >         mutex_lock(&acomp_ctx->mutex);
> > > >
> > > > -       dst = acomp_ctx->buffers[0];
> > > > -       sg_init_table(&input, 1);
> > > > -       sg_set_page(&input, page, PAGE_SIZE, 0);
> > > > -
> > > >         /*
> > > > -        * We need PAGE_SIZE * 2 here since there maybe over-
> compression
> > > case,
> > > > -        * and hardware-accelerators may won't check the dst buffer size,
> so
> > > > -        * giving the dst buffer with enough length to avoid buffer overflow.
> > > > +        * Note:
> > > > +        * [i] refers to the incoming batch space and is used to
> > > > +        *     index into the folio pages, @entries and @errors.
> > > >          */
> > > > -       sg_init_one(&output, dst, PAGE_SIZE * 2);
> > > > -       acomp_request_set_params(acomp_ctx->req, &input, &output,
> > > PAGE_SIZE, dlen);
> > > > +       for (i = 0; i < nr_pages; i += nr_comps) {
> > > > +               if (nr_comps == 1) {
> > > > +                       sg_init_table(&input, 1);
> > > > +                       sg_set_page(&input, folio_page(folio, start + i), PAGE_SIZE,
> 0);
> > > >
> > > > -       /*
> > > > -        * it maybe looks a little bit silly that we send an asynchronous
> request,
> > > > -        * then wait for its completion synchronously. This makes the
> process
> > > look
> > > > -        * synchronous in fact.
> > > > -        * Theoretically, acomp supports users send multiple acomp
> requests in
> > > one
> > > > -        * acomp instance, then get those requests done simultaneously.
> but in
> > > this
> > > > -        * case, zswap actually does store and load page by page, there is
> no
> > > > -        * existing method to send the second page before the first page is
> > > done
> > > > -        * in one thread doing zwap.
> > > > -        * but in different threads running on different cpu, we have
> different
> > > > -        * acomp instance, so multiple threads can do (de)compression in
> > > parallel.
> > > > -        */
> > > > -       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> > > >req), &acomp_ctx->wait);
> > > > -       dlen = acomp_ctx->req->dlen;
> > > > -       if (comp_ret)
> > > > -               goto unlock;
> > > > +                       /*
> > > > +                        * We need PAGE_SIZE * 2 here since there maybe over-
> > > compression case,
> > > > +                        * and hardware-accelerators may won't check the dst
> buffer
> > > size, so
> > > > +                        * giving the dst buffer with enough length to avoid buffer
> > > overflow.
> > > > +                        */
> > > > +                       sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE *
> 2);
> > > > +                       acomp_request_set_params(acomp_ctx->req, &input,
> > > > +                                                &output, PAGE_SIZE, PAGE_SIZE);
> > > > +
> > > > +                       errors[i] =
> > > crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> > > > +                                                   &acomp_ctx->wait);
> > > > +                       if (unlikely(errors[i]))
> > > > +                               goto compress_error;
> > > > +
> > > > +                       dlens[i] = acomp_ctx->req->dlen;
> > > > +               } else {
> > > > +                       struct page *pages[ZSWAP_MAX_BATCH_SIZE];
> > > > +                       unsigned int k;
> > > > +
> > > > +                       for (k = 0; k < nr_pages; ++k)
> > > > +                               pages[k] = folio_page(folio, start + k);
> > > > +
> > > > +                       struct swap_batch_comp_data batch_comp_data = {
> > > > +                               .pages = pages,
> > > > +                               .dsts = acomp_ctx->buffers,
> > > > +                               .dlens = dlens,
> > > > +                               .errors = errors,
> > > > +                               .nr_comps = nr_pages,
> > > > +                       };
> > >
> > > Why would this work given that nr_pages might be larger than
> > > pool->compr_batch_size?
> >
> > You mean the batching call? For batching compressors, nr_pages is always
> > <= pool->batch_size, and pool->batch_size is the same as
> > pool->compr_batch_size.
> 
> I’m actually confused that this feels inconsistent with the earlier
> 
>     unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> 
> So why not just use nr_comps instead?

Good observation. Yes, I realized this too, and I have been using nr_comps
in the code snippets I've been sharing that prototype Herbert's SG list
suggestions.
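
For illustration, a minimal sketch of that change in the batching arm of
the loop quoted above (only to show the nr_comps usage; the
swap_batch_comp_data/kernel_data interface itself is being replaced by an
SG list in v12):

        } else {
                struct page *pages[ZSWAP_MAX_BATCH_SIZE];
                unsigned int k;

                /* In this arm, nr_comps == pool->compr_batch_size. */
                for (k = 0; k < nr_comps; ++k)
                        pages[k] = folio_page(folio, start + k);

                struct swap_batch_comp_data batch_comp_data = {
                        .pages = pages,
                        .dsts = acomp_ctx->buffers,
                        .dlens = dlens,
                        .errors = errors,
                        .nr_comps = nr_comps,
                };

                acomp_ctx->req->kernel_data = &batch_comp_data;

                if (unlikely(crypto_acomp_compress(acomp_ctx->req)))
                        goto compress_error;
        }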

Thanks,
Kanchana

> 
> >
> > >
> > > unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> > >
> > > So this actually doesn’t happen unless pool->compr_batch_size == 1,
> > > but the code is confusing, right?
> > >
> > > > +
> > > > +                       acomp_ctx->req->kernel_data = &batch_comp_data;
> > >
> > > Can you actually pass a request larger than pool->compr_batch_size
> > > to the crypto driver?
> >
> > Clarification above..
> >
> > >
> > > By the way, swap_batch_comp_data seems like a poor name. Why should
> > > crypto drivers know anything about swap_? kernel_data isn’t ideal either;
> > > maybe batch_data would be better ?
> >
> > This will be changing in v12 to use an SG list based on Herbert's suggestions.
> >
> 
> Cool. Thanks!
> 
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-29  2:56                     ` Sridhar, Kanchana P
@ 2025-08-29  3:42                       ` Barry Song
  2025-08-29 18:39                         ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Barry Song @ 2025-08-29  3:42 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh

On Fri, Aug 29, 2025 at 10:57 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> > -----Original Message-----
> > From: Barry Song <21cnbao@gmail.com>
> > Sent: Thursday, August 28, 2025 4:29 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> > foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> > herbert@gondor.apana.org.au; davem@davemloft.net;
> > clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> > surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> > Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> > if the compressor supports batching.
> >
> > > >
> > > > If ZSWAP_MAX_BATCH_SIZE is set to 8 and there is no hardware batching,
> > > > compression is done with a step size of 1. If the hardware step size is 4,
> > > > compression occurs in two steps. If the hardware step size is 6, the first
> > > > compression uses a step size of 6, and the second uses a step size of 2.
> > > > Do you think this will work?
> > >
> > > Hi Barry,
> > >
> > > This would be non-optimal from code simplicity and latency perspectives.
> > > One of the benefits of using the hardware accelerator's "batch parallelism"
> > > is cost amortization across the batch. We might lose this benefit if we make
> > > multiple calls to zswap_compress() to ask the hardware accelerator to
> > > batch compress in smaller batches. Compression throughput would also
> > > be sub-optimal.
> >
> > I guess it wouldn’t be an issue if both ZSWAP_MAX_BATCH_SIZE and
> > pool->compr_batch_size are powers of two. As you mentioned, we still
> > gain improvement with ZSWAP_MAX_BATCH_SIZE batching even when
> > pool->compr_batch_size == 1, by compressing pages one by one but
> > batching other work such as zswap_entries_cache_alloc_batch() ?
> >
> > >
> > > In my patch-series, the rule is simple: if an algorithm has specified a
> > > batch-size, carve out the folio by that "batch-size" # of pages to be
> > > compressed as a batch in zswap_compress(). This custom batch-size
> > > is capped at ZSWAP_MAX_BATCH_SIZE.
> > >
> > > If an algorithm has not specified a batch-size, the default batch-size
> > > is 1. In this case, carve out the folio by ZSWAP_MAX_BATCH_SIZE
> > > # of pages to be compressed as a batch in zswap_compress().
> >
> > Yes, I understand your rule. However, having two global variables is still
> > somewhat confusing. It might be clearer to use a single variable with a
> > comment, since one variable can clearly determine the value of the other.
> >
> > Can we get the batch_size at runtime based on pool->compr_batch_size?
> >
> > /*
> >  * If hardware compression supports batching, we use the hardware step size.
> >  * Otherwise, we use ZSWAP_MAX_BATCH_SIZE for batching, but still
> > compress
> >  * one page at a time.
> >  */
> > batch_size = pool->compr_batch_size > 1 ? pool->compr_batch_size :
> >              ZSWAP_MAX_BATCH_SIZE;
> >
> > We probably don’t need this if both pool->compr_batch_size and
> > ZSWAP_MAX_BATCH_SIZE are powers of two?
>
> I am not sure I understand this rationale, but I do want to reiterate
> that the patch-set implements a simple set of rules/design choices
> to provide a batching framework for software and hardware compressors,
> that has shown good performance improvements with both, while
> unifying zswap_store()/zswap_compress() code paths for both.

I’m really curious: if ZSWAP_MAX_BATCH_SIZE = 8 and
compr_batch_size = 4, why wouldn’t batch_size = 8 and
compr_batch_size = 4 perform better than batch_size = 4 and
compr_batch_size = 4?

In short, I’d like the case of compr_batch_size == 1 to be treated the same
as compr_batch_size == 2, 4, etc., since you can still see performance
improvements when ZSWAP_MAX_BATCH_SIZE = 8 and compr_batch_size == 1,
as batching occurs even outside compression.

Therefore, I would expect batch_size == 8 and compr_batch_size == 2 to
perform better than when both are 2.

The only thing preventing this from happening is that compr_batch_size
might be 5, 6, or 7, which are not powers of two?
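
For concreteness, a rough sketch of the splitting I have in mind (the
variable names here are only illustrative):

    unsigned int i, step;

    /*
     * Store in batch_size chunks, compress in compr_batch_size steps
     * within each chunk. With batch_size == 8: compr_batch_size == 1
     * gives eight steps of 1, == 4 gives two steps of 4, and == 6 gives
     * one step of 6 followed by one step of 2.
     */
    for (i = 0; i < batch_size; i += step) {
            step = min(compr_batch_size, batch_size - i);
            /* compress pages [i, i + step) as one zswap_compress() call */
    }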

>
> As explained before, keeping the two variables as distinct u8 members
> of struct zswap_pool is a design choice with these benefits:
>
> 1) Saves computes by avoiding computing this in performance-critical
>     zswap_store() code. I have verified that dynamically computing the
>     batch_size based on pool->compr_batch_size impacts latency.

Ok, I’m a bit surprised, since this small computation wouldn’t
cause a branch misprediction at all.

In any case, if you want to keep both variables, that’s fine.
But can we at least rename one of them? For example, use
pool->store_batch_size and pool->compr_batch_size instead of
pool->batch_size and pool->compr_batch_size, since pool->batch_size
generally has a broader semantic scope than compr_batch_size.

Thanks
Barry


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-29  3:42                       ` Barry Song
@ 2025-08-29 18:39                         ` Sridhar, Kanchana P
  2025-08-30  8:40                           ` Barry Song
  0 siblings, 1 reply; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-08-29 18:39 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Thursday, August 28, 2025 8:42 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> if the compressor supports batching.
> 
> On Fri, Aug 29, 2025 at 10:57 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Barry Song <21cnbao@gmail.com>
> > > Sent: Thursday, August 28, 2025 4:29 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> > > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > > ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> > > foundation.org; senozhatsky@chromium.org; linux-
> crypto@vger.kernel.org;
> > > herbert@gondor.apana.org.au; davem@davemloft.net;
> > > clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> > > surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> > > Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching
> resources
> > > if the compressor supports batching.
> > >
> > > > >
> > > > > If ZSWAP_MAX_BATCH_SIZE is set to 8 and there is no hardware
> batching,
> > > > > compression is done with a step size of 1. If the hardware step size is
> 4,
> > > > > compression occurs in two steps. If the hardware step size is 6, the
> first
> > > > > compression uses a step size of 6, and the second uses a step size of 2.
> > > > > Do you think this will work?
> > > >
> > > > Hi Barry,
> > > >
> > > > This would be non-optimal from code simplicity and latency
> perspectives.
> > > > One of the benefits of using the hardware accelerator's "batch
> parallelism"
> > > > is cost amortization across the batch. We might lose this benefit if we
> make
> > > > multiple calls to zswap_compress() to ask the hardware accelerator to
> > > > batch compress in smaller batches. Compression throughput would also
> > > > be sub-optimal.
> > >
> > > I guess it wouldn’t be an issue if both ZSWAP_MAX_BATCH_SIZE and
> > > pool->compr_batch_size are powers of two. As you mentioned, we still
> > > gain improvement with ZSWAP_MAX_BATCH_SIZE batching even when
> > > pool->compr_batch_size == 1, by compressing pages one by one but
> > > batching other work such as zswap_entries_cache_alloc_batch() ?
> > >
> > > >
> > > > In my patch-series, the rule is simple: if an algorithm has specified a
> > > > batch-size, carve out the folio by that "batch-size" # of pages to be
> > > > compressed as a batch in zswap_compress(). This custom batch-size
> > > > is capped at ZSWAP_MAX_BATCH_SIZE.
> > > >
> > > > If an algorithm has not specified a batch-size, the default batch-size
> > > > is 1. In this case, carve out the folio by ZSWAP_MAX_BATCH_SIZE
> > > > # of pages to be compressed as a batch in zswap_compress().
> > >
> > > Yes, I understand your rule. However, having two global variables is still
> > > somewhat confusing. It might be clearer to use a single variable with a
> > > comment, since one variable can clearly determine the value of the other.
> > >
> > > Can we get the batch_size at runtime based on pool->compr_batch_size?
> > >
> > > /*
> > >  * If hardware compression supports batching, we use the hardware step
> size.
> > >  * Otherwise, we use ZSWAP_MAX_BATCH_SIZE for batching, but still
> > > compress
> > >  * one page at a time.
> > >  */
> > > batch_size = pool->compr_batch_size > 1 ? pool->compr_batch_size :
> > >              ZSWAP_MAX_BATCH_SIZE;
> > >
> > > We probably don’t need this if both pool->compr_batch_size and
> > > ZSWAP_MAX_BATCH_SIZE are powers of two?
> >
> > I am not sure I understand this rationale, but I do want to reiterate
> > that the patch-set implements a simple set of rules/design choices
> > to provide a batching framework for software and hardware compressors,
> > that has shown good performance improvements with both, while
> > unifying zswap_store()/zswap_compress() code paths for both.
> 
> I’m really curious: if ZSWAP_MAX_BATCH_SIZE = 8 and
> compr_batch_size = 4, why wouldn’t batch_size = 8 and
> compr_batch_size = 4 perform better than batch_size = 4 and
> compr_batch_size = 4?
> 
> In short, I’d like the case of compr_batch_size == 1 to be treated the same
> as compr_batch_size == 2, 4, etc., since you can still see performance
> improvements when ZSWAP_MAX_BATCH_SIZE = 8 and compr_batch_size ==
> 1,
> as batching occurs even outside compression.
> 
> Therefore, I would expect batch_size == 8 and compr_batch_size == 2 to
> perform better than when both are 2.
> 
> The only thing preventing this from happening is that compr_batch_size
> might be 5, 6, or 7, which are not powers of two?

It would be interesting to see if generalizing pool->compr_batch_size to be
a factor "N" (where N > 1) of ZSWAP_MAX_BATCH_SIZE yields better performance
than the current set of rules. However, as you mention, we would still need
to handle the case where it is not a factor, which might still necessitate a
distinct pool->batch_size so that we avoid re-calculating it dynamically when
this information doesn't change after pool creation.

The current implementation gives preference to the algorithm to determine
not just the batch compression step-size, but also the working-set size for
other zswap processing for the batch, i.e., bulk allocation of entries,
zpool writes, etc. The algorithm's batch-size is what zswap uses for the latter
(the zswap_store_pages() in my patch-set). This has been shown to work
well.
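
For reference, a minimal sketch of that store-side carving (the call into
zswap_store_pages() is only indicated in a comment, since its exact
signature is not reproduced here):

    long start, nr, folio_pages = folio_nr_pages(folio);

    for (start = 0; start < folio_pages; start += pool->batch_size) {
            nr = min((long)pool->batch_size, folio_pages - start);
            /*
             * Store pages [start, start + nr) as one batch, i.e. one
             * zswap_store_pages() call; within it, zswap_compress() works
             * on pool->compr_batch_size pages at a time.
             */
    }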

To change this design to be driven instead by ZSWAP_MAX_BATCH_SIZE
always (while handling non-factor pool->compr_batch_size) requires more
data gathering. I am inclined to keep the existing implementation, and
we can continue to improve upon this if it's Ok with you.

> 
> >
> > As explained before, keeping the two variables as distinct u8 members
> > of struct zswap_pool is a design choice with these benefits:
> >
> > 1) Saves computes by avoiding computing this in performance-critical
> >     zswap_store() code. I have verified that dynamically computing the
> >     batch_size based on pool->compr_batch_size impacts latency.
> 
> Ok, I’m a bit surprised, since this small computation wouldn’t
> cause a branch misprediction at all.
> 
> In any case, if you want to keep both variables, that’s fine.
> But can we at least rename one of them? For example, use
> pool->store_batch_size and pool->compr_batch_size instead of
> pool->batch_size and pool->compr_batch_size, since pool->batch_size
> generally has a broader semantic scope than compr_batch_size.

Sure. I will change pool->batch_size to be pool->store_batch_size.

Thanks,
Kanchana

> 
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-29 18:39                         ` Sridhar, Kanchana P
@ 2025-08-30  8:40                           ` Barry Song
  2025-09-03 18:00                             ` Sridhar, Kanchana P
  0 siblings, 1 reply; 68+ messages in thread
From: Barry Song @ 2025-08-30  8:40 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh

> > >
> > > I am not sure I understand this rationale, but I do want to reiterate
> > > that the patch-set implements a simple set of rules/design choices
> > > to provide a batching framework for software and hardware compressors,
> > > that has shown good performance improvements with both, while
> > > unifying zswap_store()/zswap_compress() code paths for both.
> >
> > I’m really curious: if ZSWAP_MAX_BATCH_SIZE = 8 and
> > compr_batch_size = 4, why wouldn’t batch_size = 8 and
> > compr_batch_size = 4 perform better than batch_size = 4 and
> > compr_batch_size = 4?
> >
> > In short, I’d like the case of compr_batch_size == 1 to be treated the same
> > as compr_batch_size == 2, 4, etc., since you can still see performance
> > improvements when ZSWAP_MAX_BATCH_SIZE = 8 and compr_batch_size ==
> > 1,
> > as batching occurs even outside compression.
> >
> > Therefore, I would expect batch_size == 8 and compr_batch_size == 2 to
> > perform better than when both are 2.
> >
> > The only thing preventing this from happening is that compr_batch_size
> > might be 5, 6, or 7, which are not powers of two?
>
> It would be interesting to see if a generalization of pool->compr_batch_size
> being a factor "N" (where N > 1) of ZSWAP_MAX_BATCH_SIZE yields better
> performance than the current set of rules. However, we would still need to
> handle the case where it is not, as you mention, which might still necessitate
> the use of a distinct pool->batch_size to avoid re-calculating this dynamically,
> when this information doesn't change after pool creation.
>
> The current implementation gives preference to the algorithm to determine
> not just the batch compression step-size, but also the working-set size for
> other zswap processing for the batch, i.e., bulk allocation of entries,
> zpool writes, etc. The algorithm's batch-size is what zswap uses for the latter
> (the zswap_store_pages() in my patch-set). This has been shown to work
> well.
>
> To change this design to be driven instead by ZSWAP_MAX_BATCH_SIZE
> always (while handling non-factor pool->compr_batch_size) requires more
> data gathering. I am inclined to keep the existing implementation and
> we can continue to improve upon this if it's OK with you.

Right, I have no objection at this stage. I’m just curious—since some hardware
now supports HW compression with only one queue, and in the future may
increase to two or four queues but not many overall—whether batch_size ==
compr_batch_size is always the best rule.

BTW, is HW compression always better than software? For example, when
kswapd, proactive reclamation, and direct reclamation all run simultaneously,
the CPU-based approach can leverage multiple CPUs to perform compression
in parallel. But if the hardware only provides a limited number of queues,
software might actually perform better. An extreme case is when multiple
threads are running MADV_PAGEOUT at the same time.

I’m not opposing your current patchset, just sharing some side thoughts :-)

Thanks
Barry


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-08-30  8:40                           ` Barry Song
@ 2025-09-03 18:00                             ` Sridhar, Kanchana P
  0 siblings, 0 replies; 68+ messages in thread
From: Sridhar, Kanchana P @ 2025-09-03 18:00 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@linux.alibaba.com,
	akpm@linux-foundation.org, senozhatsky@chromium.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Barry Song <21cnbao@gmail.com>
> Sent: Saturday, August 30, 2025 1:41 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@linux.alibaba.com; akpm@linux-
> foundation.org; senozhatsky@chromium.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Gomes, Vinicius <vinicius.gomes@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v11 22/24] mm: zswap: Allocate pool batching resources
> if the compressor supports batching.
> 
> > > >
> > > > I am not sure I understand this rationale, but I do want to reiterate
> > > > that the patch-set implements a simple set of rules/design choices
> > > > to provide a batching framework for software and hardware
> compressors,
> > > > that has shown good performance improvements with both, while
> > > > unifying zswap_store()/zswap_compress() code paths for both.
> > >
> > > I’m really curious: if ZSWAP_MAX_BATCH_SIZE = 8 and
> > > compr_batch_size = 4, why wouldn’t batch_size = 8 and
> > > compr_batch_size = 4 perform better than batch_size = 4 and
> > > compr_batch_size = 4?
> > >
> > > In short, I’d like the case of compr_batch_size == 1 to be treated the same
> > > as compr_batch_size == 2, 4, etc., since you can still see performance
> > > improvements when ZSWAP_MAX_BATCH_SIZE = 8 and compr_batch_size == 1,
> > > as batching occurs even outside compression.
> > >
> > > Therefore, I would expect batch_size == 8 and compr_batch_size == 2 to
> > > perform better than when both are 2.
> > >
> > > The only thing preventing this from happening is that compr_batch_size
> > > might be 5, 6, or 7, which are not powers of two?
> >
> > It would be interesting to see if a generalization of pool->compr_batch_size
> > being a factor "N" (where N > 1) of ZSWAP_MAX_BATCH_SIZE yields better
> > performance than the current set of rules. However, we would still need to
> > handle the case where it is not, as you mention, which might still necessitate
> > the use of a distinct pool->batch_size to avoid re-calculating this dynamically,
> > when this information doesn't change after pool creation.
> >
> > The current implementation gives preference to the algorithm to determine
> > not just the batch compression step-size, but also the working-set size for
> > other zswap processing for the batch, i.e., bulk allocation of entries,
> > zpool writes, etc. The algorithm's batch-size is what zswap uses for the latter
> > (the zswap_store_pages() in my patch-set). This has been shown to work
> > well.
> >
> > To change this design to be driven instead by ZSWAP_MAX_BATCH_SIZE
> > always (while handling non-factor pool->compr_batch_size) requires more
> > data gathering. I am inclined to keep the existing implementation and
> > we can continue to improve upon this if it's OK with you.
> 
> Right, I have no objection at this stage. I’m just curious—since some hardware
> now supports HW compression with only one queue, and in the future may
> increase to two or four queues but not many overall—whether batch_size ==
> compr_batch_size is always the best rule.
> 
> BTW, is HW compression always better than software? For example, when
> kswapd, proactive reclamation, and direct reclamation all run simultaneously,
> the CPU-based approach can leverage multiple CPUs to perform compression
> in parallel. But if the hardware only provides a limited number of queues,
> software might actually perform better. An extreme case is when multiple
> threads are running MADV_PAGEOUT at the same time.

These are great questions; we'll need to run more experiments to answer them
and understand the trade-offs. The good thing is that the zswap architecture
proposed in this patch-set is flexible enough to let us do so with minor changes
to how we set up the two zswap_pool data members (pool->compr_batch_size and
pool->store_batch_size).
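
As an illustration of that flexibility, decoupling the store batch size from
the compressor batch size would be a one-line change at pool creation. This is
a hypothetical experiment, not something this patch-set does:

	/*
	 * Hypothetical experiment (not part of this patch-set): always process
	 * stores in ZSWAP_MAX_BATCH_SIZE chunks, regardless of the compressor's
	 * internal batch size, to compare against the current rule where a
	 * batching compressor's batch size drives both.
	 */
	pool->store_batch_size = ZSWAP_MAX_BATCH_SIZE;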

One of the next steps we plan to explore is integrating batching for hardware
parallelism with the kcompressd work, as per Nhat's suggestion. I believe the
hardware can be shared by compression threads from all of these sources. There
are many details to work out, but I think this can be done by striking the
right balance between cost amortization, hardware parallelism, overlapping
computation between the CPU and the accelerator, etc.

Another important point is that our hardware roadmap continues to evolve,
consistently improving compression ratios, lowering both compression and
decompression latency, and boosting overall throughput.

To summarize: with hardware compression acceleration, and with batching to take
advantage of parallel hardware compress/decompress, we can improve reclaim
latency for a given CPU thread. Combined with compression-ratio improvements,
we save more memory for equivalent performance, and/or need to reclaim less via
proactive or direct reclaim (reclaim-latency improvements result in less memory
pressure), improving workload performance. This can make a significant impact
on contended systems, where CPU threads for reclaim come at a cost that hardware
compression can help offset. Hence, I can only see upside, but we'll need to
prove this out :).

Thanks,
Kanchana

> 
> I’m not opposing your current patchset, just sharing some side thoughts :-)
> 
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2025-09-03 18:00 UTC | newest]

Thread overview: 68+ messages
2025-08-01  4:36 [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 01/24] crypto: iaa - Reorganize the iaa_crypto driver code Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 02/24] crypto: iaa - New architecture for IAA device WQ comp/decomp usage & core mapping Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 03/24] crypto: iaa - Simplify, consistency of function parameters, minor stats bug fix Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 04/24] crypto: iaa - Descriptor allocation timeouts with mitigations Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 05/24] crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 06/24] crypto: iaa - Simplify the code flow in iaa_compress() and iaa_decompress() Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 07/24] crypto: iaa - Refactor hardware descriptor setup into separate procedures Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 08/24] crypto: iaa - Simplified, efficient job submissions for non-irq mode Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 09/24] crypto: iaa - Deprecate exporting add/remove IAA compression modes Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 10/24] crypto: iaa - Rearchitect the iaa_crypto driver to be usable by zswap and zram Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 11/24] crypto: iaa - Enablers for submitting descriptors then polling for completion Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 12/24] crypto: acomp - Add "void *kernel_data" in "struct acomp_req" for kernel users Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 13/24] crypto: iaa - IAA Batching for parallel compressions/decompressions Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 14/24] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 15/24] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 16/24] crypto: iaa - Submit the two largest source buffers first in decompress batching Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 17/24] crypto: iaa - Add deflate-iaa-dynamic compression mode Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 18/24] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size Kanchana P Sridhar
2025-08-15  5:28   ` Herbert Xu
2025-08-22 19:31     ` Sridhar, Kanchana P
2025-08-22 21:48       ` Nhat Pham
2025-08-22 21:58         ` Sridhar, Kanchana P
2025-08-22 22:00           ` Sridhar, Kanchana P
2025-08-01  4:36 ` [PATCH v11 19/24] crypto: iaa - IAA acomp_algs register the get_batch_size() interface Kanchana P Sridhar
2025-08-29  0:16   ` Barry Song
2025-08-29  3:12     ` Sridhar, Kanchana P
2025-08-01  4:36 ` [PATCH v11 20/24] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 21/24] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources Kanchana P Sridhar
2025-08-01  4:36 ` [PATCH v11 22/24] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
2025-08-14 20:58   ` Nhat Pham
2025-08-14 22:05     ` Sridhar, Kanchana P
2025-08-26  3:48   ` Barry Song
2025-08-26  4:27     ` Sridhar, Kanchana P
2025-08-26  4:42       ` Barry Song
2025-08-26  4:56         ` Sridhar, Kanchana P
2025-08-26  5:17           ` Barry Song
2025-08-27  0:06             ` Sridhar, Kanchana P
2025-08-28 21:39               ` Barry Song
2025-08-28 22:47                 ` Sridhar, Kanchana P
2025-08-28 23:28                   ` Barry Song
2025-08-29  2:56                     ` Sridhar, Kanchana P
2025-08-29  3:42                       ` Barry Song
2025-08-29 18:39                         ` Sridhar, Kanchana P
2025-08-30  8:40                           ` Barry Song
2025-09-03 18:00                             ` Sridhar, Kanchana P
2025-08-01  4:36 ` [PATCH v11 23/24] mm: zswap: zswap_store() will process a large folio in batches Kanchana P Sridhar
2025-08-14 21:05   ` Nhat Pham
2025-08-14 22:10     ` Sridhar, Kanchana P
2025-08-28 23:59   ` Barry Song
2025-08-29  3:06     ` Sridhar, Kanchana P
2025-08-01  4:36 ` [PATCH v11 24/24] mm: zswap: Batched zswap_compress() with compress batching of large folios Kanchana P Sridhar
2025-08-14 21:14   ` Nhat Pham
2025-08-14 22:17     ` Sridhar, Kanchana P
2025-08-28 23:54   ` Barry Song
2025-08-29  3:04     ` Sridhar, Kanchana P
2025-08-29  3:31       ` Barry Song
2025-08-29  3:39         ` Sridhar, Kanchana P
2025-08-08 23:51 ` [PATCH v11 00/24] zswap compression batching with optimized iaa_crypto driver Nhat Pham
2025-08-09  0:03   ` Sridhar, Kanchana P
2025-08-15  5:27   ` Herbert Xu
2025-08-22 19:26     ` Sridhar, Kanchana P
2025-08-25  5:38       ` Herbert Xu
2025-08-25 18:12         ` Sridhar, Kanchana P
2025-08-26  1:13           ` Herbert Xu
2025-08-26  4:09             ` Sridhar, Kanchana P
2025-08-26  4:14               ` Herbert Xu
2025-08-26  4:42                 ` Sridhar, Kanchana P
