[PATCH v9 00/19] zswap compression batching

* [PATCH v9 00/19] zswap compression batching
@ 2025-04-30 20:52 Kanchana P Sridhar
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Compression Batching:
=====================

This patch-series introduces batch compression of pages in large folios to
improve zswap swapout latency. It preserves the existing zswap protocols
for non-batching software compressors by calling crypto_acomp sequentially
per page in the batch. Additionally, in support of hardware accelerators
that can process a batch as an integral unit, the patch-series creates
generic batching interfaces in crypto_acomp, and calls the
crypto_acomp_batch_compress() interface in zswap_compress() for compressors
that intrinsically support batching.

The patch series provides a proof point by using the Intel Analytics
Accelerator (IAA) for implementing the compress/decompress batching API
using hardware parallelism in the iaa_crypto driver and another proof point
with a sequential software compressor, zstd.

SUMMARY:
========

  The first proof point is to test with IAA using a sequential call (fully
  synchronous, compress one page at a time) vs. a batching call (fully
  asynchronous, submit a batch to IAA for parallel compression, then poll for
  completion statuses).
  
    Performance testing with 30 usemem processes, and with a kernel
    compilation test using 32 threads, shows 67%-77% throughput gains and
    28%-32% sys time reduction (usemem30), and 2-3% sys time reduction
    (kernel compilation), with zswap_store() of large folios using IAA
    compress batching as compared to IAA sequential.

  The second proof point is to make sure that software algorithms such as
  zstd do not regress. The data indicates that for sequential software
  algorithms a performance gain is achieved. 
  
    With the performance optimizations implemented in patches 18 and 19 of
    v9, zstd usemem30 throughput increases by 1%, along with a 6%-8% sys time
    reduction. With kernel compilation using zstd, we get a 0.4%-3.2%
    reduction in sys time. These optimizations pertain to common code
    paths, removing redundant branches/computes, using prefetchw() of the
    zswap entry before it is written, and selectively annotating branches
    with likely()/unlikely() compiler directives to minimize branch
    mis-prediction penalty. Additionally, using the batching code for
    non-batching compressors to sequentially compress/store batches of up
    to ZSWAP_MAX_BATCH_SIZE (8) pages seems to help, most likely due to
    cache locality of working set structures such as the array of
    zswap_entry-s for the batch.
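
As a rough userspace illustration of the two micro-optimizations named
above (this is not code from the series), the sketch below uses the
GCC/Clang builtins that sit behind the kernel's prefetchw() and
likely()/unlikely(). The demo_entry struct and demo_fill_entries() are
hypothetical stand-ins, not the real zswap_entry or zswap code.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel macros (assumption: GCC or Clang). */
#define likely(x)    __builtin_expect(!!(x), 1)
#define unlikely(x)  __builtin_expect(!!(x), 0)
#define prefetchw(p) __builtin_prefetch((p), 1) /* 1 = prefetch for write */

/* Hypothetical, simplified entry; not the real struct zswap_entry. */
struct demo_entry {
	size_t length;
	unsigned long handle;
};

/*
 * Fill up to nr entries from per-page compressed lengths. A length of 0
 * is treated as a failed page (a convention made up for this sketch).
 * Returns the number of entries written before the first failure.
 */
static size_t demo_fill_entries(struct demo_entry *entries,
				const size_t *dlens, size_t nr)
{
	size_t i;

	assert(entries != NULL && dlens != NULL);

	for (i = 0; i < nr; i++) {
		/* Pull the next entry into cache before it is written. */
		if (likely(i + 1 < nr))
			prefetchw(&entries[i + 1]);

		/* Failure is the rare case, so hint the branch that way. */
		if (unlikely(dlens[i] == 0))
			break;

		entries[i].length = dlens[i];
		entries[i].handle = (unsigned long)i;
	}
	return i;
}
```

Both builtins are hints only: they do not change what the loop computes,
which is why this kind of change needs to be validated by measurement.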
  
    Our internal validation of zstd with the batching interface vs. IAA with
    the batching interface on Emerald Rapids has shown that IAA
    compress/decompress batching gives 21.3% more memory savings as compared
    to zstd, for 5% performance loss as compared to the baseline without any
    memory pressure. IAA batching demonstrates more than 2X the memory
    savings obtained by zstd at this 95% performance KPI.
    The compression ratio with IAA is 2.23, and with zstd 2.96. Even with
    this compression ratio deficit for IAA, batching is extremely
    beneficial. As we improve the compression ratio of the IAA accelerator,
    we expect to see even better memory savings with IAA as compared to
    software compressors.
    

  Batching Roadmap:
  =================

  1) Compression batching within large folios (this series).
  
  2) Reclaim batching of hybrid folios:
  
     We can expect to see even more significant performance and throughput
     improvements if we use the parallelism offered by IAA to do reclaim
     batching of 4K/large folios (really any-order folios), and using the
     zswap_store() high throughput compression pipeline to batch-compress
     pages comprising these folios, not just batching within large
     folios. This is the reclaim batching patch 13 in v1, which we expect
     to submit in a separate patch-series.
  
  3) Decompression batching:

     We have developed a zswap load batching interface for IAA to be used
     for parallel decompression batching, using swapin_readahead().
  
  These capabilities are architected so as to be useful to zswap and
  zram. We are actively working on integrating these components with zram.
  
  v9 Performance Summary:
  =======================

  This is a performance testing summary of results with usemem30
  (30 usemem processes running in a cgroup limited at 150G, each trying to
  allocate 10G).
  
  usemem30 with 64K folios:
  =========================
  
     -----------------------------------------------------------------------
                    mm-unstable-4-21-2025              v9
     -----------------------------------------------------------------------
     zswap compressor         deflate-iaa     deflate-iaa    IAA Batching
                                                                 vs.
                                                             IAA Sequential
     -----------------------------------------------------------------------
     Total throughput (KB/s)    6,091,607      10,174,344         67%
     Avg throughput (KB/s)        203,053         339,144                    
     elapsed time (sec)            100.46           69.70        -31%      
     sys time (sec)              2,416.97        1,648.37        -32%
     -----------------------------------------------------------------------
    
     -----------------------------------------------------------------------
                    mm-unstable-4-21-2025              v9    
     -----------------------------------------------------------------------
     zswap compressor                zstd            zstd    v9 zstd    
                                                             improvement  
     -----------------------------------------------------------------------
     Total throughput (KB/s)    6,574,380       6,632,230          1%
     Avg throughput (KB/s)        219,146         221,074
     elapsed time (sec)             96.58           90.60         -6%
     sys time (sec)              2,416.52        2,224.78         -8%          
     -----------------------------------------------------------------------
    
  usemem30 with 2M folios:
  ========================
  
     ----------------------------------------------------------------------
                     mm-unstable-4-21-2025             v9
     ----------------------------------------------------------------------
     zswap compressor          deflate-iaa    deflate-iaa    IAA Batching
                                                                 vs.
                                                             IAA Sequential
     ----------------------------------------------------------------------
     Total throughput (KB/s)     6,371,048     11,282,935         77%     
     Avg throughput (KB/s)         212,368        376,097              
     elapsed time (sec)              87.15          63.04        -28%     
     sys time (sec)               2,011.56       1,450.45        -28%     
     ----------------------------------------------------------------------
  
     ----------------------------------------------------------------------
                     mm-unstable-4-21-2025             v9      
     ----------------------------------------------------------------------
     zswap compressor                 zstd           zstd    v9 zstd           
                                                             improvement
     ----------------------------------------------------------------------
     Total throughput (KB/s)     7,320,278      7,428,055          1%         
     Avg throughput (KB/s)         244,009        247,601            
     elapsed time (sec)              83.30          81.60         -2%         
     sys time (sec)               1,970.89       1,857.70         -6%         
     ----------------------------------------------------------------------
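
The percentage columns in the tables above can be reproduced from the
two data columns as (v9 - baseline) / baseline * 100, rounded to the
nearest integer. The helpers below are a small sketch of that
arithmetic; the function names are made up for illustration.

```c
#include <assert.h>

/* Percent change from the baseline value to the new value. */
static double demo_pct_change(double baseline, double value)
{
	assert(baseline != 0.0);
	return (value - baseline) / baseline * 100.0;
}

/* Round half away from zero, without needing libm. */
static long demo_round(double x)
{
	return (long)(x >= 0.0 ? x + 0.5 : x - 0.5);
}
```

For example, the 64K deflate-iaa total throughput goes from 6,091,607
to 10,174,344 KB/s, a change of about 67.0%, and sys time goes from
2,416.97 to 1,648.37 sec, a change of about -31.8%.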



DETAILS:
========

(A) From zswap's perspective, the most significant changes are:
===============================================================

1) A unified zswap_compress() API is added to compress multiple
   pages:

   - If the compressor has multiple acomp requests, i.e., internally
     supports batching, crypto_acomp_batch_compress() is called. If all
     pages are successfully compressed, the batch is stored in zpool.
   
   - If the compressor can only compress one page at a time, each page
     is compressed and stored sequentially.

   Many thanks to Yosry for this suggestion, because it is an essential
   component of unifying common code paths between sequential/batching
   compressions.

   prefetchw() is used in zswap_compress() to minimize cache-miss
   latency by moving the zswap entry to the cache before it is written
   to, reducing sys time by ~1.5% for zstd (non-batching software
   compression). In other words, this optimization helps both batching and
   software compressors.

   Overall, the prefetchw() and likely()/unlikely() annotations prevent
   regressions with software compressors like zstd, and generally improve
   non-batching compressors' performance with the batching code by ~8%.
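
The dispatch described in 1) can be sketched in plain userspace C as
follows. This is only an illustration of the control flow, under the
assumption that a compressor is described by the hypothetical
struct demo_comp below; it does not use the crypto_acomp API and is not
the series' zswap_compress().

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_BATCH 8 /* mirrors ZSWAP_MAX_BATCH_SIZE (8) */

/* Hypothetical compressor description; not the crypto_acomp API. */
struct demo_comp {
	unsigned int nr_reqs; /* > 1: compressor batches internally */
	int (*compress_one)(const void *page, size_t *dlen);
	int (*compress_batch)(const void *const *pages, size_t *dlens,
			      unsigned int nr);
};

static unsigned int demo_one_calls;   /* counts sequential calls */
static unsigned int demo_batch_calls; /* counts batch calls */

/* Fake software compressor: one page per call. */
static int demo_sw_compress_one(const void *page, size_t *dlen)
{
	(void)page;
	demo_one_calls++;
	*dlen = 2048;
	return 0;
}

/* Fake hardware compressor: the whole batch in one call. */
static int demo_hw_compress_batch(const void *const *pages, size_t *dlens,
				  unsigned int nr)
{
	unsigned int i;

	(void)pages;
	demo_batch_calls++;
	for (i = 0; i < nr; i++)
		dlens[i] = 1024;
	return 0;
}

/*
 * Unified entry point: one batch call if the compressor supports it,
 * otherwise one call per page. Returns 0 only if every page succeeded.
 */
static int demo_compress(const struct demo_comp *c,
			 const void *const *pages, size_t *dlens,
			 unsigned int nr)
{
	unsigned int i;

	assert(nr >= 1 && nr <= DEMO_MAX_BATCH);

	if (c->nr_reqs > 1 && c->compress_batch)
		return c->compress_batch(pages, dlens, nr);

	for (i = 0; i < nr; i++) {
		int err = c->compress_one(pages[i], &dlens[i]);

		if (err)
			return err;
	}
	return 0;
}
```

The caller sees a single interface either way, which is the point of
unifying the sequential and batching paths.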

2) A new zswap_store_pages() is added, that stores multiple pages in a
   folio in a range of indices. This is an extension of the earlier
   zswap_store_page(), except it operates on a batch of pages.

3) zswap_store() is modified to store the folio's pages in batches
   by calling zswap_store_pages(). If the compressor supports batching,
   i.e., has multiple acomp requests, the folio will be compressed in
   batches of "pool->nr_reqs". If the compressor has only one acomp
   request, the folio will be compressed in batches of
   ZSWAP_MAX_BATCH_SIZE pages, where each page in the batch is
   compressed sequentially. We see better performance by processing
   the folio in batches of ZSWAP_MAX_BATCH_SIZE, due to cache locality
   of working set structures such as the array of zswap_entry-s for the
   batch.

   Many thanks to Yosry and Johannes for steering towards a common
   design and code paths for sequential and batched compressions (i.e.,
   for software compressors and hardware accelerators such as IAA). As per
   Yosry's suggestion in v8, the nr_reqs is an attribute of the
   compressor/pool, and hence is stored in struct zswap_pool instead of in
   struct crypto_acomp_ctx.
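
To make the batch-size rule in 3) concrete, here is a minimal sketch of
how a folio's page indices could be split into batches. It assumes 4K
base pages (so a 64K folio has 16 pages and a 2M folio has 512), and
the names demo_batch_size(), demo_split_folio() and struct demo_range
are hypothetical; the real zswap_store() loop may differ.

```c
#include <assert.h>

#define DEMO_MAX_BATCH 8 /* mirrors ZSWAP_MAX_BATCH_SIZE (8) */

/* Half-open range of page indices within a folio: [start, end). */
struct demo_range {
	unsigned int start;
	unsigned int end;
};

/*
 * Batch size rule from the text: a compressor with multiple requests
 * is fed nr_reqs pages at a time; a single-request compressor is fed
 * DEMO_MAX_BATCH pages at a time and compresses them one by one.
 */
static unsigned int demo_batch_size(unsigned int nr_reqs)
{
	return nr_reqs > 1 ? nr_reqs : DEMO_MAX_BATCH;
}

/*
 * Split a folio of nr_pages into batches. Up to cap ranges are written
 * to out. Returns the total number of batches.
 */
static unsigned int demo_split_folio(unsigned int nr_pages,
				     unsigned int nr_reqs,
				     struct demo_range *out,
				     unsigned int cap)
{
	unsigned int batch = demo_batch_size(nr_reqs);
	unsigned int start, n = 0;

	for (start = 0; start < nr_pages; start += batch) {
		unsigned int end = start + batch;

		if (end > nr_pages)
			end = nr_pages; /* last batch may be short */
		if (n < cap) {
			out[n].start = start;
			out[n].end = end;
		}
		n++;
	}
	return n;
}
```

With these assumptions, a 64K folio and a single-request compressor
yield two batches, [0, 8) and [8, 16).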

4) Simplifications to the acomp_ctx resources allocation/deletion
   vis-a-vis CPU hot[un]plug. This further improves upon v8 of this
   patch-series based on the discussion with Yosry, and formalizes the
   lifetime of these resources from pool creation to pool
   deletion. zswap does not register a CPU hotplug teardown
   callback. The acomp_ctx resources will persist through CPU
   online/offline transitions. The main changes made to avoid UAF/race
   conditions, and correctly handle process migration, are:

   a) No acomp_ctx mutex locking in zswap_cpu_comp_prepare().
   b) No CPU hotplug teardown callback, no acomp_ctx resources deleted.
   c) New acomp_ctx_dealloc() procedure that cleans up the acomp_ctx
      resources, and is shared by zswap_cpu_comp_prepare() error
      handling and zswap_pool_destroy().
   d) The zswap_pool node list instance is removed right after the node
      list add function in zswap_pool_create().
   e) We directly call mutex_[un]lock(&acomp_ctx->mutex) in
      zswap_[de]compress(). acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock()
      are deleted.

   The commit log of patch 0015 has a more detailed analysis.


(B) Main changes in crypto_acomp and iaa_crypto:
================================================

1) A new architecture is introduced for IAA device WQs' usage as:
   - compress only
   - decompress only
   - generic, i.e., both compress/decompress.

   Further, IAA devices/wqs are assigned to cores based on packages
   instead of NUMA nodes.

   The WQ rebalancing algorithm that is invoked as WQs are
   discovered/deleted has been made very general and flexible so that
   the user can control exactly how IAA WQs are used. In addition to the
   user being able to specify a WQ type as comp/decomp/generic, the user
   can also configure if WQs need to be shared among all same-package
   cores, or, whether the cores should be divided up amongst the
   available IAA devices.

   If distribute_[de]comps is enabled, from a given core's perspective,
   the iaa_crypto driver will distribute comp/decomp jobs among all
   devices' WQs in round-robin manner. This improves batching latency
   and can improve compression/decompression throughput for workloads
   that see a lot of swap activity.
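
   The round-robin distribution can be pictured with the small sketch
   below. It is a simplified model, not the iaa_crypto implementation:
   struct demo_wq_table and demo_next_wq() are hypothetical names, and
   the table is assumed to be private to one core so no locking is
   shown.

```c
#include <assert.h>

/* Hypothetical per-core view of the WQs a core may submit to. */
struct demo_wq_table {
	unsigned int n_wqs; /* number of WQs visible to this core */
	unsigned int next;  /* index of the WQ to use for the next job */
};

/* Return the WQ index for the next job and advance round-robin. */
static unsigned int demo_next_wq(struct demo_wq_table *t)
{
	unsigned int idx;

	assert(t->n_wqs > 0);

	idx = t->next;
	t->next = (idx + 1) % t->n_wqs;
	return idx;
}
```

   With 4 WQs, successive jobs from one core go to WQ 0, 1, 2, 3 and
   then wrap back to 0, so a batch is spread across devices.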

   The commit log of patch 0006 provides more details on new iaa_crypto
   driver parameters added, along with recommended settings.

2) Compress/decompress batching are implemented using
   crypto_acomp_batch_[de]compress(), along the lines of v6 since
   request chaining is no longer the recommended approach.


(C) The patch-series is organized as follows:
=============================================

 1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
    patches are tagged with "crypto:" in the subject:

    Patches 1-4) Backport some of the crypto patches that revert request
                 chaining that are in the cryptodev-2.6 git tree and are
                 yet to be included in mm-unstable. I have also
                 backported the fix to the scomp off-by-one bug. Further, the
                 non-request-chaining implementations of
                 crypto_acomp_[de]compress() are reinstated. Without
                 patches 1/2/3, the crypto/testmgr issues errors that
                 prevent deflate-iaa from being used as zswap's
                 compressor. Once mm-unstable is updated with the
                 request chaining reverts, patches 1/3/4 can be deleted
                 from this patch-series.

    Patch 5) Reorganizes the iaa_crypto driver code into logically related
             sections and avoids forward declarations, in order to facilitate
             subsequent iaa_crypto patches. This patch makes no
             functional changes.

    Patch 6) Makes an infrastructure change in the iaa_crypto driver
             to map IAA devices/work-queues to cores based on packages
             instead of NUMA nodes. This doesn't impact performance on
             the Sapphire Rapids system used for performance
             testing. However, this change fixes functional problems we
             found on Granite Rapids during internal validation, where the
             number of NUMA nodes is greater than the number of packages,
             which was resulting in over-utilization of some IAA devices
             and non-usage of other IAA devices as per the current NUMA
             based mapping infrastructure.

             This patch also develops a new architecture that
             generalizes how IAA device WQs are used. It enables
             designating IAA device WQs as either compress-only or
             decompress-only or generic. Once IAA device WQ types are
             thus defined, it also allows the configuration of whether
             device WQs will be shared by all cores on the package, or
             used only by "mapped cores" obtained by a simple allocation
             of available IAAs to cores on the package.

             As a result of the overhaul of wq_table definition,
             allocation and rebalancing, this patch eliminates
             duplication of device WQs in per-cpu wq_tables, thereby
             saving 140MiB on a 384-core dual-socket Granite Rapids server
             with 8 IAAs.

             Regardless of how the user has configured the WQs' usage,
             the next WQ to use is obtained through a direct look-up in
             per-cpu "cpu_comp_wqs" and "cpu_decomp_wqs" structures so
             as to minimize latency in the critical path driver compress
             and decompress routines.

    Patch 7) Defines a "void *data" in struct acomp_req, in response to
             Herbert's comments in v8 about avoiding use of
             req->base.data. iaa_crypto requires the req->data to
             store the idxd_desc allocated in the core
             iaa_[de]compress() functions, for later retrieval in the
             iaa_comp_poll() function to check for the descriptor's
             completion status. This async submit-poll is essential for
             batching.
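
             The submit-then-poll pattern that a per-request data
             pointer enables can be modelled in userspace as below.
             This is a toy simulation only: the demo_* types, the
             "polls_left" latency model and the function names are all
             hypothetical and do not correspond to idxd or iaa_crypto
             structures.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_DONE = 0, DEMO_PENDING = 1 };

/* Hypothetical descriptor: completes after polls_left more polls. */
struct demo_desc {
	int polls_left;
};

/* Hypothetical request with an opaque per-request data pointer. */
struct demo_req {
	void *data; /* descriptor stashed at submit time */
	int done;
};

/* Submit: stash the descriptor in the request and return at once. */
static void demo_submit(struct demo_req *req, struct demo_desc *desc,
			int latency)
{
	desc->polls_left = latency;
	req->data = desc;
	req->done = 0;
}

/* Poll: retrieve the stashed descriptor and check for completion. */
static int demo_poll(struct demo_req *req)
{
	struct demo_desc *desc = req->data;

	assert(desc != NULL);

	if (desc->polls_left > 0) {
		desc->polls_left--;
		return DEMO_PENDING;
	}
	req->data = NULL;
	req->done = 1;
	return DEMO_DONE;
}

/*
 * Submit all requests first, then poll until every one has completed.
 * Returns the number of polling passes over the batch.
 */
static unsigned int demo_batch_submit_poll(struct demo_req *reqs,
					   struct demo_desc *descs,
					   const int *latencies,
					   unsigned int nr)
{
	unsigned int i, pending = nr, passes = 0;

	for (i = 0; i < nr; i++)
		demo_submit(&reqs[i], &descs[i], latencies[i]);

	while (pending > 0) {
		passes++;
		for (i = 0; i < nr; i++) {
			if (reqs[i].done)
				continue;
			if (demo_poll(&reqs[i]) == DEMO_DONE)
				pending--;
		}
	}
	return passes;
}
```

             Because all submissions are issued before the first poll,
             the total wait in this model tracks the slowest request
             rather than the sum of all requests.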

    Patch 8) Makes a change to iaa_crypto driver's descriptor allocation,
             from blocking to non-blocking with retries/timeouts and
             mitigations in case of timeouts during compress/decompress
             ops. This prevents tasks getting blocked indefinitely, which
             was observed when testing 30 cores running workloads, with
             only 1 IAA enabled on Sapphire Rapids (out of 4). These
             timeouts are typically encountered, and the associated
             mitigations exercised, only in configurations with 1 IAA
             device shared by 30+ cores.
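
             A bounded, non-blocking allocation loop of the general
             kind described here might look like the sketch below. It
             is an assumption-laden model: the pool, the attempt limit
             and the DEMO_ETIMEDOUT value are invented for
             illustration, and the driver's actual retry policy and
             timeout mitigations are not shown.

```c
#include <assert.h>

#define DEMO_ETIMEDOUT 110 /* hypothetical error value for this sketch */

/* Hypothetical descriptor pool shared by many submitters. */
struct demo_desc_pool {
	unsigned int free_descs;    /* descriptors currently available */
	unsigned int busy_attempts; /* simulated contention: next N tries fail */
};

/* Non-blocking attempt: returns 0 on success, -1 if nothing is free. */
static int demo_try_alloc(struct demo_desc_pool *pool)
{
	if (pool->busy_attempts > 0) {
		pool->busy_attempts--;
		return -1;
	}
	if (pool->free_descs == 0)
		return -1;
	pool->free_descs--;
	return 0;
}

/*
 * Try up to max_attempts times instead of sleeping until a descriptor
 * frees up. Returns 0 on success or -DEMO_ETIMEDOUT, and reports the
 * number of attempts made. A real driver would pause briefly between
 * attempts; that is omitted here.
 */
static int demo_alloc_with_retries(struct demo_desc_pool *pool,
				   unsigned int max_attempts,
				   unsigned int *attempts)
{
	unsigned int tries;

	assert(attempts != 0);
	*attempts = 0;

	for (tries = 0; tries < max_attempts; tries++) {
		*attempts = tries + 1;
		if (demo_try_alloc(pool) == 0)
			return 0;
	}
	return -DEMO_ETIMEDOUT;
}
```

             The key property is that the caller always gets control
             back after a bounded number of attempts and can then
             decide how to handle the failure.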

    Patch 9) New CRYPTO_ACOMP_REQ_POLL acomp_req flag to act as a gate for
             async poll mode in iaa_crypto.

   Patch 10) Adds acomp_alg/crypto_acomp interfaces for get_batch_size(),
             batch_compress() and batch_decompress() along with the
             corresponding crypto_acomp_batch_size(),
             crypto_acomp_batch_compress() and
             crypto_acomp_batch_decompress() API for use in zswap.

   Patch 11) iaa-crypto driver implementations for the newly added batching
             interfaces. iaa_crypto implements the crypto_acomp
             get_batch_size() interface that returns an iaa_driver specific
             constant, IAA_CRYPTO_MAX_BATCH_SIZE (set to 8U currently).

             This patch also provides the iaa_crypto driver implementations
             for the batch_compress() and batch_decompress() crypto_acomp
             interfaces.

   Patch 12) Modifies the default iaa_crypto driver mode to async, now that
             iaa_crypto provides a truly async mode that gives
             significantly better latency than sync mode for the batching
             use case.

   Patch 13) Disables verify_compress by default, so that users can
             easily run IAA for comparison with software compressors.


 2) zswap modifications to enable compress batching in zswap_store()
    of large folios (including pmd-mappable folios):

   Patch 14) Moves the zswap CPU hotplug procedures under "pool functions",
             because they are invoked upon pool creation/deletion.

   Patch 15) Simplifies the zswap_pool's per-CPU acomp_ctx resource
             management and lifetime to be from pool creation to pool
             deletion.

   Patch 16) Uses IS_ERR_OR_NULL() in zswap_cpu_comp_prepare() to check for
             valid acomp/req, thereby making it consistent with the resource
             de-allocation code.

   Patch 17) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently set
             as 8U) to denote the maximum number of acomp_ctx batching
             resources to allocate, thus limiting the amount of extra
             memory used for batching. Further, the "struct
             crypto_acomp_ctx" is modified to contain multiple acomp_reqs
             and buffers. A new "u8 nr_reqs" member is added to "struct
             zswap_pool" to track the number of requests/buffers associated
             with the compressor.

   Patch 18) Modifies zswap_store() to store the folio in batches of
             pool->nr_reqs by calling a new zswap_store_pages() that takes
             a range of indices in the folio to be stored.
             zswap_store_pages() pre-allocates zswap entries for the batch,
             calls zswap_compress() for each page in this range, and stores
             the entries in xarray/LRU.

   Patch 19) Introduces a new unified implementation of zswap_compress()
             for compressors that do and do not support batching. This
             eliminates code duplication and facilitates maintainability of
             the code with the introduction of compress batching. Further,
             there are many optimizations to this common code that result
             in workload throughput and performance improvements with
             software compressors and hardware accelerators such as IAA.

             zstd performance is better or on par with mm-unstable. We
             see impressive throughput/performance improvements with IAA
             batching vs. no-batching.


With v9 of this patch series, the IAA compress batching feature will be
enabled seamlessly on Intel platforms that have IAA by selecting
'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async'
sync_mode driver attribute (the default).


System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 4-21-2025,
commit 2c01d9f3c611, without and with this patch-series. Data was
gathered on an Intel Sapphire Rapids (SPR) server, dual-socket 56 cores
per socket, 4 IAA devices per socket, 503 GiB RAM and 525G SSD disk
partition swap. Core frequency was fixed at 2500MHz.

Other kernel configuration parameters:

    zswap compressor  : zstd, deflate-iaa
    zswap allocator   : zsmalloc
    vm.page-cluster   : 0

IAA "compression verification" is disabled and IAA is run in the async
mode (the defaults with this series).

I ran experiments with these workloads:

1) usemem 30 processes with these large folios enabled to "always":
   - 64k
   - 2048k

   IAA WQ Configuration:

   Since usemem sees practically no swapin activity, we set up 1 WQ per
   IAA device, so that all 128 entries are available for compress
   jobs. All IAAs' WQs are available to all package cores to send
   compress/decompress jobs in a round-robin manner.

     4 IAA devices
     1 WQ per device
     echo 0 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
     echo 1 > /sys/bus/dsa/drivers/crypto/distribute_comps
     echo 1 > /sys/bus/dsa/drivers/crypto/distribute_decomps

2) Kernel compilation allmodconfig with 2G max memory, 32 threads, with
   these large folios enabled to "always":
   - 64k

   IAA WQ Configuration:

   Since kernel compilation sees considerable swapin activity, we set up
   2 WQs per IAA device, each containing 64 entries. The driver sends
   decompresses to wqX.0 and compresses to wqX.1. All IAAs' wqX.0 are
   available to all package cores to send decompress jobs in a
   round-robin manner. Likewise, all IAAs' wqX.1 are available to all
   package cores to send compress jobs in a round-robin manner.

     4 IAA devices
     2 WQs per device
     echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
     echo 1 > /sys/bus/dsa/drivers/crypto/distribute_comps
     echo 1 > /sys/bus/dsa/drivers/crypto/distribute_decomps


Performance testing (usemem30):
===============================
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and sleeping
for 10 sec before exiting:

usemem --init-time -w -O -b 1 -s 10 -n 30 10g


 64K folios: usemem30: deflate-iaa:
 ==================================

 -------------------------------------------------------------------------------
                 mm-unstable-4-21-2025             v9
 -------------------------------------------------------------------------------
 zswap compressor         deflate-iaa     deflate-iaa    IAA Batching
                                                             vs.
                                                         IAA Sequential
 -------------------------------------------------------------------------------
 Total throughput (KB/s)    6,091,607      10,174,344         67%
 Avg throughput (KB/s)        203,053         339,144                    
 elapsed time (sec)            100.46           69.70        -31%      
 sys time (sec)              2,416.97        1,648.37        -32%      
                                                                         
 -------------------------------------------------------------------------------
 memcg_high                 1,262,996       1,403,680                    
 memcg_swap_fail                2,712           2,105                    
 zswpout                   58,146,954      64,508,450                    
 zswpin                            91             256                    
 pswpout                            0               0                    
 pswpin                             0               0                    
 thp_swpout                         0               0                    
 thp_swpout_fallback                0               0                    
 64kB_swpout_fallback           2,712           2,105                    
 pgmajfault                     2,858           3,032                    
 ZSWPOUT-64kB               3,631,559       4,029,802   
 SWPOUT-64kB                        0               0   
 -------------------------------------------------------------------------------


 2M folios: usemem30: deflate-iaa:
 =================================

 -------------------------------------------------------------------------------
                 mm-unstable-4-21-2025              v9
 -------------------------------------------------------------------------------
 zswap compressor          deflate-iaa     deflate-iaa     IAA Batching
                                                               vs.
                                                           IAA Sequential
 -------------------------------------------------------------------------------
 Total throughput (KB/s)     6,371,048      11,282,935       77%     
 Avg throughput (KB/s)         212,368         376,097              
 elapsed time (sec)              87.15           63.04      -28%     
 sys time (sec)               2,011.56        1,450.45      -28%     
                                                                    
 -------------------------------------------------------------------------------
 memcg_high                    116,156         125,138             
 memcg_swap_fail                   348             248             
 zswpout                    59,815,486      64,509,928             
 zswpin                            442             422             
 pswpout                             0               0             
 pswpin                              0               0             
 thp_swpout                          0               0             
 thp_swpout_fallback               348             248             
 pgmajfault                      3,575           3,272             
 ZSWPOUT-2048kB                116,480         125,759        
 SWPOUT-2048kB                       0               0        
 -------------------------------------------------------------------------------


 64K folios: usemem30: zstd:
 ===========================

 -------------------------------------------------------------------------------
               mm-unstable-4-21-2025            v9       
 -------------------------------------------------------------------------------
 zswap compressor               zstd          zstd       v9 zstd    
                                                         improvement  
 -------------------------------------------------------------------------------
 Total throughput (KB/s)   6,574,380     6,632,230        1%
 Avg throughput (KB/s)       219,146       221,074
 elapsed time (sec)            96.58         90.60       -6%
 sys time (sec)             2,416.52      2,224.78       -8%          
                                                  
 -------------------------------------------------------------------------------
 memcg_high                1,117,577     1,110,504
 memcg_swap_fail                  65         2,217
 zswpout                  48,771,672    48,806,988
 zswpin                          137           429
 pswpout                           0             0
 pswpin                            0             0
 thp_swpout                        0             0
 thp_swpout_fallback               0             0
 64kB_swpout_fallback             65         2,217
 pgmajfault                    3,286         3,224
 ZSWPOUT-64kB              3,048,122     3,048,198
 SWPOUT-64kB                       0             0
 -------------------------------------------------------------------------------


 2M folios: usemem30: zstd:
 ==========================

 -------------------------------------------------------------------------------
               mm-unstable-4-21-2025            v9      
 -------------------------------------------------------------------------------
 zswap compressor               zstd          zstd      v9 zstd           
                                                        improvement
 -------------------------------------------------------------------------------
 Total throughput (KB/s)   7,320,278     7,428,055       1%         
 Avg throughput (KB/s)       244,009       247,601                 
 elapsed time (sec)            83.30         81.60      -2%         
 sys time (sec)             1,970.89      1,857.70      -6%         
                                                   
 -------------------------------------------------------------------------------
 memcg_high                   92,970        92,708 
 memcg_swap_fail                  59           172 
 zswpout                  48,043,615    47,896,223
 zswpin                           77           416 
 pswpout                           0             0 
 pswpin                            0             0 
 thp_swpout                        0             0 
 thp_swpout_fallback              59           172 
 pgmajfault                    2,815         3,170 
 ZSWPOUT-2048kB               93,776        93,381 
 SWPOUT-2048kB                     0             0 
 -------------------------------------------------------------------------------



Performance testing (Kernel compilation, allmodconfig):
=======================================================

The kernel compilation experiments use 32 threads and build
"allmodconfig", which takes ~14 minutes and has considerable
swapout/swapin activity. The cgroup's memory.max is set to 2G.


 64K folios: Kernel compilation/allmodconfig:
 ============================================

 -------------------------------------------------------------------------------
                       mm-unstable               v9    mm-unstable            v9
 -------------------------------------------------------------------------------
 zswap compressor      deflate-iaa      deflate-iaa           zstd          zstd   
 -------------------------------------------------------------------------------
 real_sec                   835.31           837.75         858.73        852.22
 user_sec                15,649.58        15,660.48      15,682.66     15,649.91
 sys_sec                  3,705.03         3,642.59       4,858.46      4,703.58
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB     1,874,524        1,872,200      1,871,248     1,870,972
 -------------------------------------------------------------------------------
 memcg_high                      0                0              0             0
 memcg_swap_fail                 0                0              0             0
 zswpout                89,767,776       91,376,740     76,444,847    73,771,346
 zswpin                 26,362,204       27,700,717     22,138,662    21,287,433
 pswpout                       360              574             52           154
 pswpin                        275              551             19            63
 thp_swpout                      0                0              0             0
 thp_swpout_fallback             0                0              0             0
 64kB_swpout_fallback            0            1,523              0             0
 pgmajfault             27,938,009       29,559,339     23,339,818    22,458,108
 ZSWPOUT-64kB            2,958,806        2,992,126      2,444,259     2,382,986
 SWPOUT-64kB                    21               30              3             8
 -------------------------------------------------------------------------------


 2M folios: Kernel compilation/allmodconfig:
 ===========================================

 -------------------------------------------------------------------------------
                       mm-unstable               v9    mm-unstable            v9
 -------------------------------------------------------------------------------
 zswap compressor      deflate-iaa      deflate-iaa           zstd          zstd   
 -------------------------------------------------------------------------------
 real_sec                   790.66           789.01         818.46        819.08
 user_sec                15,757.60        15,759.57      15,785.34     15,777.70
 sys_sec                  4,307.92         4,184.09       5,602.95      5,582.45
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB     1,871,100        1,872,892      1,872,892     1,872,888
 -------------------------------------------------------------------------------
 memcg_high                      0                0              0             0
 memcg_swap_fail                 0                0              0             0
 zswpout               107,349,845      101,481,140     90,083,661    90,818,923
 zswpin                 37,486,883       35,081,184     29,823,462    29,597,292
 pswpout                     3,664            1,191          1,066         1,617
 pswpin                      1,594              138             37         1,594
 thp_swpout                      7                2              2             3
 thp_swpout_fallback         9,434            8,100          6,354         5,809
 pgmajfault             38,781,821       36,235,171     30,677,937    30,442,685
 ZSWPOUT-2048kB              8,810            7,772          7,857         8,515
 -------------------------------------------------------------------------------
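
The sys time deltas in the two kernel compilation tables above can be
recomputed from the sys_sec rows. The helper below is a minimal user-space
sketch; pct_change() is a hypothetical name used only for this illustration
and is not part of the series.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch series): relative change of
 * "after" with respect to "before", expressed in percent. A negative
 * result means the value went down, e.g. less sys time with v9.
 */
double pct_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

Applied to the sys_sec rows, pct_change(3705.03, 3642.59) is about -1.7%
(64K folios, deflate-iaa) and pct_change(4858.46, 4703.58) is about -3.2%
(64K folios, zstd). For 2M folios, the same calculation gives about -2.9%
for deflate-iaa and about -0.4% for zstd.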


With the iaa_crypto driver changes for non-blocking descriptor allocations,
no timeouts-with-mitigations were seen in compress/decompress jobs, for all
of the above experiments.



Changes since v8:
=================
1) Rebased to mm-unstable as of 4-21-2025, commit 2c01d9f3c611.
2) Backported commits for reverting request chaining, since these are
   in cryptodev-2.6 but not yet in mm-unstable. Without these backports,
   deflate-iaa is non-functional in mm-unstable:
   commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining")
   commit 5976fe19e240 ("Revert "crypto: testmgr - Add multibuffer acomp
                         testing"")
   Backported this hotfix as well:
   commit 002ba346e3d7 ("crypto: scomp - Fix off-by-one bug when
   calculating last page").
3) crypto_acomp_[de]compress() restored to non-request chained
   implementations since request chaining has been removed from acomp in
   commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining").
4) New IAA WQ architecture to denote WQ type and whether or not a WQ
   should be shared among all package cores, or only among the "mapped"
   ones from an even cores-to-IAA distribution scheme.
5) Compress/decompress batching are implemented in iaa_crypto using new
   crypto_acomp_batch_compress()/crypto_acomp_batch_decompress() API.
6) Defines a "void *data" in struct acomp_req, based on Herbert advising
   against using req->base.data in the driver. This is needed for async
   submit-poll to work.
7) In zswap.c, moved the CPU hotplug callbacks to reside in "pool
   functions", per Yosry's suggestion to move procedures in a distinct
   patch before refactoring patches.
8) A new "u8 nr_reqs" member is added to "struct zswap_pool" to track
   the number of requests/buffers associated with the per-cpu acomp_ctx,
   as per Yosry's suggestion.
9) Simplifications to the acomp_ctx resources allocation, deletion,
   locking, and for these to exist from pool creation to pool deletion,
   based on v8 code review discussions with Yosry.
10) Use IS_ERR_OR_NULL() consistently in zswap_cpu_comp_prepare() and
    acomp_ctx_dealloc(), as per Yosry's v8 comment.
11) zswap_store_folio() is deleted, and instead, the loop over
    zswap_store_pages() is moved inline in zswap_store(), per Yosry's
    suggestion.
12) Better structure in zswap_compress(): a unified procedure that
    compresses/stores a batch of pages for both non-batching and
    batching compressors. Renamed from zswap_batch_compress() to
    zswap_compress(). Thanks Yosry for these suggestions.


Changes since v7:
=================
1) Rebased to mm-unstable as of 3-3-2025, commit 5f089a9aa987.
2) Changed the acomp_ctx->nr_reqs to be u8 since ZSWAP_MAX_BATCH_SIZE is
   defined as 8U, for saving memory in this per-cpu structure.
3) Fixed a typo in code comments in acomp_ctx_get_cpu_lock():
   acomp_ctx->initialized to acomp_ctx->__online.
4) Incorporated suggestions from Yosry, Chengming, Nhat and Johannes,
   thanks to all!
   a) zswap_batch_compress() replaces zswap_compress(). Thanks Yosry
      for this suggestion!
   b) Process the folio in sub-batches of ZSWAP_MAX_BATCH_SIZE, regardless
      of whether or not the compressor supports batching. This gets rid of
      the kmalloc(entries), and allows us to allocate an array of
      ZSWAP_MAX_BATCH_SIZE entries on the stack. This is implemented in
      zswap_store_pages().
   c) Use of a common structure and code paths for compressing a folio in
      batches, either as a request chain (in parallel in IAA hardware) or
      sequentially. No code duplication since zswap_compress() has been
      replaced with zswap_batch_compress(), simplifying maintainability.
5) A key difference between compressors that support batching and
   those that do not is that for the latter, the acomp_ctx mutex is
   locked/unlocked per ZSWAP_MAX_BATCH_SIZE batch, so that decompressions
   to handle page-faults can make progress. This fixes the zstd kernel
   compilation regression seen in v7. For compressors that support
   batching, e.g. IAA, the mutex is locked/released once for storing
   the folio.
6) Used likely/unlikely compiler directives and prefetchw to restore
   performance with the common code paths.
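
The sub-batching and locking pattern described in items 4b and 5 above can
be modelled in user space as follows. This is only a sketch of the
description in this changelog, not the mm/zswap.c code: struct store_stats
and model_store_folio() are hypothetical names, and the lock is represented
by a counter. ZSWAP_MAX_BATCH_SIZE is taken from item 2 above (8U).

```c
#include <assert.h>
#include <stdbool.h>

/* From the changelog above: ZSWAP_MAX_BATCH_SIZE is defined as 8U. */
#define ZSWAP_MAX_BATCH_SIZE 8U

/* Hypothetical counters; the lock stands in for the acomp_ctx mutex. */
struct store_stats {
	unsigned int lock_acquisitions;
	unsigned int batches;
	unsigned int pages_stored;
};

/*
 * Hypothetical model of storing one folio of nr_pages pages in
 * sub-batches of at most ZSWAP_MAX_BATCH_SIZE pages.
 *
 * compressor_batches == true:  lock taken once for the whole folio.
 * compressor_batches == false: lock taken and released per sub-batch, so
 *                              that decompressions can make progress in
 *                              between.
 */
struct store_stats model_store_folio(unsigned int nr_pages,
				     bool compressor_batches)
{
	struct store_stats st = { 0, 0, 0 };
	unsigned int start;

	assert(nr_pages > 0);

	if (compressor_batches)
		st.lock_acquisitions++;

	for (start = 0; start < nr_pages; start += ZSWAP_MAX_BATCH_SIZE) {
		unsigned int left = nr_pages - start;
		unsigned int n = left < ZSWAP_MAX_BATCH_SIZE ?
				 left : ZSWAP_MAX_BATCH_SIZE;

		if (!compressor_batches)
			st.lock_acquisitions++;

		/* The real code would compress and store n pages here. */
		st.pages_stored += n;
		st.batches++;
	}

	return st;
}
```

Assuming 4 KiB pages, a 2M folio has 512 pages, so the model yields 64
sub-batches: 64 lock acquisitions for a non-batching compressor and 1 for a
batching compressor. A 64K folio (16 pages) yields 2 sub-batches.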

Changes since v6:
=================
1) Rebased to mm-unstable as of 2-27-2025, commit d58172d128ac.

2) Deleted crypto_acomp_batch_compress() and
   crypto_acomp_batch_decompress() interfaces, as per Herbert's
   suggestion. Batching is instead enabled by chaining the requests. For
   non-batching compressors, there is no request chaining involved. Both
   batching and non-batching compressions are accomplished by zswap by
   calling:

   crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);

3) iaa_crypto implementation of batch compressions/decompressions using
   request chaining, as per Herbert's suggestions.
4) Simplification of the acomp_ctx resource allocation/deletion with
   respect to CPU hot[un]plug, to address Yosry's suggestions to explore the
   mutex options in zswap_cpu_comp_prepare(). Yosry, please let me know if
   the per-cpu memory cost of this proposed change is acceptable (IAA:
   64.8KB, Software compressors: 8.2KB). On the positive side, I believe
   restarting reclaim on a CPU after it has been through an offline-online
   transition, will be much faster by not deleting the acomp_ctx resources
   when the CPU gets offlined.
5) Use of lockdep assertions rather than comments for internal locking
   rules, as per Yosry's suggestion.
6) No specific references to IAA in zswap.c, as suggested by Yosry.
7) Explored various solutions other than the v6 zswap_store_folio()
   implementation, to fix the zstd regression seen in v5, to attempt to
   unify common code paths, and to allocate smaller arrays for the zswap
   entries on the stack. All these options were found to cause usemem30
   latency regression with zstd. The v6 version of zswap_store_folio() is
   the only implementation that does not cause zstd regression, confirmed
   by 10 consecutive runs, each giving quite consistent latency
   numbers. Hence, the v6 implementation is carried forward to v7, with
   changes for branching for batching vs. sequential compression API
   calls.


Changes since v5:
=================
1) Rebased to mm-unstable as of 2-1-2025, commit 7de6fd8ab650.

Several improvements, regression fixes and bug fixes, based on Yosry's
v5 comments (Thanks Yosry!):

2) Fix for zstd performance regression in v5.
3) Performance debug and fix for marginal improvements with IAA batching
   vs. sequential.
4) Performance testing data compares IAA with and without batching, instead
   of IAA batching against zstd.
5) Commit logs/zswap comments not mentioning crypto_acomp implementation
   details.
6) Delete the pr_info_once() when batching resources are allocated in
   zswap_cpu_comp_prepare().
7) Use kcalloc_node() for the multiple acomp_ctx buffers/reqs in
   zswap_cpu_comp_prepare().
8) Simplify and consolidate error handling cleanup code in
   zswap_cpu_comp_prepare().
9) Introduce zswap_compress_folio() in a separate patch.
10) Bug fix in zswap_store_folio(): an xa_store() failure could cause all
    compressed objects and entries to be freed, followed by a UAF when
    zswap_store() tries to free the entries that were already added to
    the xarray prior to the failure.
11) Deleting compressed_bytes/bytes. zswap_store_folio() also comprehends
    the recent fixes in commit bf5eaaaf7941 ("mm/zswap: fix inconsistency
    when zswap_store_page() fails") by Hyeonggon Yoo.

iaa_crypto improvements/fixes/changes:

12) Enables asynchronous mode and makes it the default. With commit
    4ebd9a5ca478 ("crypto: iaa - Fix IAA disabling that occurs when
    sync_mode is set to 'async'"), async mode was previously just sync. We
    now have true async support.
13) Change idxd descriptor allocations from blocking to non-blocking with
    timeouts, and mitigations for compress/decompress ops that fail to
    obtain a descriptor. This is a fix for tasks blocked errors seen in
    configurations where 30+ cores are running workloads under high memory
    pressure, and sending comps/decomps to 1 IAA device.
14) Fixes a bug with unprotected access of "deflate_generic_tfm" in
    deflate_generic_decompress(), which can cause data corruption and
    zswap_decompress() kernel crash.
15) zswap uses crypto_acomp_batch_compress() with async polling instead of
    request chaining for slightly better latency. However, the request
    chaining framework itself is unchanged, preserved from v5.


Changes since v4:
=================
1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6.
2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert!
3) Implemented IAA compress batching using request chaining.
4) zswap_store() batching simplifications suggested by Chengming, Yosry and
   Nhat, thanks to all!
   - New zswap_compress_folio() that is called by zswap_store().
   - Move the loop over folio's pages out of zswap_store() and into a
     zswap_store_folio() that stores all pages.
   - Allocate all zswap entries for the folio upfront.
   - Added zswap_batch_compress().
   - Branch to call zswap_compress() or zswap_batch_compress() inside
     zswap_compress_folio().
   - All iterations over pages kept in same function level.
   - No helpers other than the newly added zswap_store_folio() and
     zswap_compress_folio().


Changes since v3:
=================
1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
2) Major re-write of iaa_crypto driver's mapping of IAA devices to cores,
   based on packages instead of NUMA nodes.
3) Added acomp_has_async_batching() API to crypto acomp, that allows
   zswap/zram to query if a crypto_acomp has registered batch_compress and
   batch_decompress interfaces.
4) Clear the poll bits on the acomp_reqs passed to
   iaa_comp_a[de]compress_batch() so that a module like zswap can be
   confident about the acomp_reqs[0] not having the poll bit set before
   calling the fully synchronous API crypto_acomp_[de]compress().
   Herbert, I would appreciate it if you can review changes 2-4, in patches
   1-8 in v4. I did not want to introduce too many iaa_crypto changes in
   v4, given that patch 7 is already making a major change. I plan to work
   on incorporating the request chaining using the ahash interface in v5
   (I need to understand the basic crypto ahash better). Thanks Herbert!
5) Incorporated Johannes' suggestion to not have a sysctl to enable
   compress batching.
6) Incorporated Yosry's suggestion to allocate batching resources in the
   cpu hotplug onlining code, since there is no longer a sysctl to control
   batching. Thanks Yosry!
7) Incorporated Johannes' suggestions related to making the overall
   sequence of events between zswap_store() and zswap_batch_store() as
   similar as possible for readability and control flow, better naming of
   procedures, avoiding forward declarations, not inlining error path
   procedures, deleting zswap internal details from zswap.h, etc. Thanks
   Johannes, really appreciate the direction!
   I have tried to explain the minimal future-proofing in terms of the
   zswap_batch_store() signature and the definition of "struct
   zswap_batch_store_sub_batch" in the comments for this struct. I hope the
   new code explains the control flow a bit better.


Changes since v2:
=================
1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
   returned by kmalloc_node() for acomp_ctx->buffers and for
   acomp_ctx->reqs.
3) Fixed a bug in zswap_pool_can_batch() for returning true if
   pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and if
   the per-cpu acomp_batch_ctx tests true for batching resources having
   been allocated on this cpu. Also, changed from per_cpu_ptr() to
   raw_cpu_ptr().
4) Incorporated the zswap_store_propagate_errors() compilation warning fix
   suggested by Dan Carpenter. Thanks Dan!
5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
   zswap.h, with SWAP_CRYPTO_BATCH_SIZE.

Changes since v1:
=================
1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
   async/poll mode, and to encapsulate the polling functionality in the
   iaa_crypto driver. Thanks Herbert!
3) Incorporated Herbert's and Yosry's suggestions to implement the batching
   API in iaa_crypto and to make its use seamless from zswap's
   perspective. Thanks Herbert and Yosry!
4) Incorporated Yosry's suggestion to make it more convenient for the user
   to enable compress batching, while minimizing the memory footprint
   cost. Thanks Yosry!
5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
   reclaim batching patch from this series, since it requires a broader
   discussion.


I would greatly appreciate code review comments for the iaa_crypto driver
and mm patches included in this series!

Thanks,
Kanchana



Kanchana P Sridhar (19):
  crypto: acomp - Remove request chaining
  crypto: acomp - Reinstate non-chained crypto_acomp_[de]compress().
  Revert "crypto: testmgr - Add multibuffer acomp testing"
  crypto: scomp - Fix off-by-one bug when calculating last page
  crypto: iaa - Re-organize the iaa_crypto driver code.
  crypto: iaa - New architecture for IAA device WQ comp/decomp usage &
    core mapping.
  crypto: iaa - Define and use req->data instead of req->base.data.
  crypto: iaa - Descriptor allocation timeouts with mitigations in
    iaa_crypto.
  crypto: iaa - CRYPTO_ACOMP_REQ_POLL acomp_req flag for sequential vs.
    parallel.
  crypto: acomp - New interfaces to facilitate batching support in acomp
    & drivers.
  crypto: iaa - Implement crypto_acomp batching interfaces for Intel
    IAA.
  crypto: iaa - Enable async mode and make it the default.
  crypto: iaa - Disable iaa_verify_compress by default.
  mm: zswap: Move the CPU hotplug procedures under "pool functions".
  mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to
    deletion.
  mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx
    resources.
  mm: zswap: Allocate pool batching resources if the compressor supports
    batching.
  mm: zswap: zswap_store() will process a folio in batches.
  mm: zswap: Batched zswap_compress() with compress batching of large
    folios.

 .../driver-api/crypto/iaa/iaa-crypto.rst      |  145 +-
 crypto/acompress.c                            |  112 +-
 crypto/scompress.c                            |   28 +-
 crypto/testmgr.c                              |  147 +-
 drivers/crypto/intel/iaa/iaa_crypto.h         |   30 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c    | 1934 ++++++++++++-----
 include/crypto/acompress.h                    |  129 +-
 include/crypto/internal/acompress.h           |   25 +-
 mm/zswap.c                                    |  684 +++---
 9 files changed, 2199 insertions(+), 1035 deletions(-)


base-commit: 2c01d9f3c61101355afde90dc5c0b39d9a772ef3
-- 
2.27.0



* [PATCH v9 01/19] crypto: acomp - Remove request chaining
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-05-08 19:30   ` Sridhar, Kanchana P
  2025-04-30 20:52 ` [PATCH v9 02/19] crypto: acomp - Reinstate non-chained crypto_acomp_[de]compress() Kanchana P Sridhar
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Request chaining requires the user to do too much book keeping.
Remove it from acomp.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
 crypto/acompress.c                  | 117 ++++++++--------------------
 crypto/scompress.c                  |  18 +----
 include/crypto/acompress.h          |  14 ----
 include/crypto/internal/acompress.h |   5 --
 4 files changed, 35 insertions(+), 119 deletions(-)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index f7a3fbe5447e..82fb3c04e68f 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -161,7 +161,6 @@ static void acomp_save_req(struct acomp_req *req, crypto_completion_t cplt)
 	state->data = req->base.data;
 	req->base.complete = cplt;
 	req->base.data = state;
-	state->req0 = req;
 }
 
 static void acomp_restore_req(struct acomp_req *req)
@@ -172,23 +171,20 @@ static void acomp_restore_req(struct acomp_req *req)
 	req->base.data = state->data;
 }
 
-static void acomp_reqchain_virt(struct acomp_req_chain *state, int err)
+static void acomp_reqchain_virt(struct acomp_req *req)
 {
-	struct acomp_req *req = state->cur;
+	struct acomp_req_chain *state = &req->chain;
 	unsigned int slen = req->slen;
 	unsigned int dlen = req->dlen;
 
-	req->base.err = err;
-	state = &req->chain;
-
 	if (state->flags & CRYPTO_ACOMP_REQ_SRC_VIRT)
 		acomp_request_set_src_dma(req, state->src, slen);
 	else if (state->flags & CRYPTO_ACOMP_REQ_SRC_FOLIO)
-		acomp_request_set_src_folio(req, state->sfolio, state->soff, slen);
+		acomp_request_set_src_folio(req, state->sfolio, req->soff, slen);
 	if (state->flags & CRYPTO_ACOMP_REQ_DST_VIRT)
 		acomp_request_set_dst_dma(req, state->dst, dlen);
 	else if (state->flags & CRYPTO_ACOMP_REQ_DST_FOLIO)
-		acomp_request_set_dst_folio(req, state->dfolio, state->doff, dlen);
+		acomp_request_set_dst_folio(req, state->dfolio, req->doff, dlen);
 }
 
 static void acomp_virt_to_sg(struct acomp_req *req)
@@ -213,7 +209,6 @@ static void acomp_virt_to_sg(struct acomp_req *req)
 		size_t off = req->soff;
 
 		state->sfolio = folio;
-		state->soff = off;
 		sg_init_table(&state->ssg, 1);
 		sg_set_page(&state->ssg, folio_page(folio, off / PAGE_SIZE),
 			    slen, off % PAGE_SIZE);
@@ -233,7 +228,6 @@ static void acomp_virt_to_sg(struct acomp_req *req)
 		size_t off = req->doff;
 
 		state->dfolio = folio;
-		state->doff = off;
 		sg_init_table(&state->dsg, 1);
 		sg_set_page(&state->dsg, folio_page(folio, off / PAGE_SIZE),
 			    dlen, off % PAGE_SIZE);
@@ -241,8 +235,7 @@ static void acomp_virt_to_sg(struct acomp_req *req)
 	}
 }
 
-static int acomp_do_nondma(struct acomp_req_chain *state,
-			   struct acomp_req *req)
+static int acomp_do_nondma(struct acomp_req *req, bool comp)
 {
 	u32 keep = CRYPTO_ACOMP_REQ_SRC_VIRT |
 		   CRYPTO_ACOMP_REQ_SRC_NONDMA |
@@ -259,7 +252,7 @@ static int acomp_do_nondma(struct acomp_req_chain *state,
 	fbreq->slen = req->slen;
 	fbreq->dlen = req->dlen;
 
-	if (state->op == crypto_acomp_reqtfm(req)->compress)
+	if (comp)
 		err = crypto_acomp_compress(fbreq);
 	else
 		err = crypto_acomp_decompress(fbreq);
@@ -268,114 +261,70 @@ static int acomp_do_nondma(struct acomp_req_chain *state,
 	return err;
 }
 
-static int acomp_do_one_req(struct acomp_req_chain *state,
-			    struct acomp_req *req)
+static int acomp_do_one_req(struct acomp_req *req, bool comp)
 {
-	state->cur = req;
-
 	if (acomp_request_isnondma(req))
-		return acomp_do_nondma(state, req);
+		return acomp_do_nondma(req, comp);
 
 	acomp_virt_to_sg(req);
-	return state->op(req);
+	return comp ? crypto_acomp_reqtfm(req)->compress(req) :
+		      crypto_acomp_reqtfm(req)->decompress(req);
 }
 
-static int acomp_reqchain_finish(struct acomp_req *req0, int err, u32 mask)
+static int acomp_reqchain_finish(struct acomp_req *req, int err)
 {
-	struct acomp_req_chain *state = req0->base.data;
-	struct acomp_req *req = state->cur;
-	struct acomp_req *n;
-
-	acomp_reqchain_virt(state, err);
-
-	if (req != req0)
-		list_add_tail(&req->base.list, &req0->base.list);
-
-	list_for_each_entry_safe(req, n, &state->head, base.list) {
-		list_del_init(&req->base.list);
-
-		req->base.flags &= mask;
-		req->base.complete = acomp_reqchain_done;
-		req->base.data = state;
-
-		err = acomp_do_one_req(state, req);
-
-		if (err == -EINPROGRESS) {
-			if (!list_empty(&state->head))
-				err = -EBUSY;
-			goto out;
-		}
-
-		if (err == -EBUSY)
-			goto out;
-
-		acomp_reqchain_virt(state, err);
-		list_add_tail(&req->base.list, &req0->base.list);
-	}
-
-	acomp_restore_req(req0);
-
-out:
+	acomp_reqchain_virt(req);
+	acomp_restore_req(req);
 	return err;
 }
 
 static void acomp_reqchain_done(void *data, int err)
 {
-	struct acomp_req_chain *state = data;
-	crypto_completion_t compl = state->compl;
+	struct acomp_req *req = data;
+	crypto_completion_t compl;
 
-	data = state->data;
+	compl = req->chain.compl;
+	data = req->chain.data;
 
-	if (err == -EINPROGRESS) {
-		if (!list_empty(&state->head))
-			return;
+	if (err == -EINPROGRESS)
 		goto notify;
-	}
 
-	err = acomp_reqchain_finish(state->req0, err,
-				    CRYPTO_TFM_REQ_MAY_BACKLOG);
-	if (err == -EBUSY)
-		return;
+	err = acomp_reqchain_finish(req, err);
 
 notify:
 	compl(data, err);
 }
 
-static int acomp_do_req_chain(struct acomp_req *req,
-			      int (*op)(struct acomp_req *req))
+static int acomp_do_req_chain(struct acomp_req *req, bool comp)
 {
-	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
-	struct acomp_req_chain *state;
 	int err;
 
-	if (crypto_acomp_req_chain(tfm) ||
-	    (!acomp_request_chained(req) && acomp_request_issg(req)))
-		return op(req);
-
 	acomp_save_req(req, acomp_reqchain_done);
-	state = req->base.data;
 
-	state->op = op;
-	state->src = NULL;
-	INIT_LIST_HEAD(&state->head);
-	list_splice_init(&req->base.list, &state->head);
-
-	err = acomp_do_one_req(state, req);
+	err = acomp_do_one_req(req, comp);
 	if (err == -EBUSY || err == -EINPROGRESS)
-		return -EBUSY;
+		return err;
 
-	return acomp_reqchain_finish(req, err, ~0);
+	return acomp_reqchain_finish(req, err);
 }
 
 int crypto_acomp_compress(struct acomp_req *req)
 {
-	return acomp_do_req_chain(req, crypto_acomp_reqtfm(req)->compress);
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+
+	if (crypto_acomp_req_chain(tfm) || acomp_request_issg(req))
+		crypto_acomp_reqtfm(req)->compress(req);
+	return acomp_do_req_chain(req, true);
 }
 EXPORT_SYMBOL_GPL(crypto_acomp_compress);
 
 int crypto_acomp_decompress(struct acomp_req *req)
 {
-	return acomp_do_req_chain(req, crypto_acomp_reqtfm(req)->decompress);
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+
+	if (crypto_acomp_req_chain(tfm) || acomp_request_issg(req))
+		crypto_acomp_reqtfm(req)->decompress(req);
+	return acomp_do_req_chain(req, false);
 }
 EXPORT_SYMBOL_GPL(crypto_acomp_decompress);
 
diff --git a/crypto/scompress.c b/crypto/scompress.c
index 5762fcc63b51..c1ce12564299 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -293,28 +293,14 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 	return ret;
 }
 
-static int scomp_acomp_chain(struct acomp_req *req, int dir)
-{
-	struct acomp_req *r2;
-	int err;
-
-	err = scomp_acomp_comp_decomp(req, dir);
-	req->base.err = err;
-
-	list_for_each_entry(r2, &req->base.list, base.list)
-		r2->base.err = scomp_acomp_comp_decomp(r2, dir);
-
-	return err;
-}
-
 static int scomp_acomp_compress(struct acomp_req *req)
 {
-	return scomp_acomp_chain(req, 1);
+	return scomp_acomp_comp_decomp(req, 1);
 }
 
 static int scomp_acomp_decompress(struct acomp_req *req)
 {
-	return scomp_acomp_chain(req, 0);
+	return scomp_acomp_comp_decomp(req, 0);
 }
 
 static void crypto_exit_scomp_ops_async(struct crypto_tfm *tfm)
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index c497c73baf13..267d557daeb1 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -52,10 +52,6 @@ struct acomp_req;
 struct folio;
 
 struct acomp_req_chain {
-	struct list_head head;
-	struct acomp_req *req0;
-	struct acomp_req *cur;
-	int (*op)(struct acomp_req *req);
 	crypto_completion_t compl;
 	void *data;
 	struct scatterlist ssg;
@@ -68,8 +64,6 @@ struct acomp_req_chain {
 		u8 *dst;
 		struct folio *dfolio;
 	};
-	size_t soff;
-	size_t doff;
 	u32 flags;
 };
 
@@ -349,8 +343,6 @@ static inline void acomp_request_set_callback(struct acomp_req *req,
 	req->base.data = data;
 	req->base.flags &= keep;
 	req->base.flags |= flgs & ~keep;
-
-	crypto_reqchain_init(&req->base);
 }
 
 /**
@@ -558,12 +550,6 @@ static inline void acomp_request_set_dst_folio(struct acomp_req *req,
 	req->base.flags |= CRYPTO_ACOMP_REQ_DST_FOLIO;
 }
 
-static inline void acomp_request_chain(struct acomp_req *req,
-				       struct acomp_req *head)
-{
-	crypto_request_chain(&req->base, &head->base);
-}
-
 /**
  * crypto_acomp_compress() -- Invoke asynchronous compress operation
  *
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index aaf59f3236fa..b69d818d7e68 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -98,11 +98,6 @@ void crypto_unregister_acomp(struct acomp_alg *alg);
 int crypto_register_acomps(struct acomp_alg *algs, int count);
 void crypto_unregister_acomps(struct acomp_alg *algs, int count);
 
-static inline bool acomp_request_chained(struct acomp_req *req)
-{
-	return crypto_request_chained(&req->base);
-}
-
 static inline bool acomp_request_issg(struct acomp_req *req)
 {
 	return !(req->base.flags & (CRYPTO_ACOMP_REQ_SRC_VIRT |
-- 
2.27.0



* [PATCH v9 02/19] crypto: acomp - Reinstate non-chained crypto_acomp_[de]compress().
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
  2025-04-30 20:52 ` [PATCH v9 01/19] crypto: acomp - Remove request chaining Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-05-01  0:29   ` kernel test robot
  2025-04-30 20:52 ` [PATCH v9 03/19] Revert "crypto: testmgr - Add multibuffer acomp testing" Kanchana P Sridhar
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This reverts the request chaining implementations of
crypto_acomp_[de]compress() introduced in commit b67a02600372
("crypto: acomp - Add request chaining and virtual addresses") since
request chaining has been removed from acomp subsequently in commit
64929fe8c0a4 ("crypto: acomp - Remove request chaining").

This patch restores the implementations of crypto_acomp_[de]compress()
from prior to commit b67a02600372.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 crypto/acompress.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index 82fb3c04e68f..d08e0fe8cd9e 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -310,21 +310,13 @@ static int acomp_do_req_chain(struct acomp_req *req, bool comp)
 
 int crypto_acomp_compress(struct acomp_req *req)
 {
-	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
-
-	if (crypto_acomp_req_chain(tfm) || acomp_request_issg(req))
-		crypto_acomp_reqtfm(req)->compress(req);
-	return acomp_do_req_chain(req, true);
+	return crypto_acomp_reqtfm(req)->compress(req);
 }
 EXPORT_SYMBOL_GPL(crypto_acomp_compress);
 
 int crypto_acomp_decompress(struct acomp_req *req)
 {
-	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
-
-	if (crypto_acomp_req_chain(tfm) || acomp_request_issg(req))
-		crypto_acomp_reqtfm(req)->decompress(req);
-	return acomp_do_req_chain(req, false);
+	return crypto_acomp_reqtfm(req)->decompress(req);
 }
 EXPORT_SYMBOL_GPL(crypto_acomp_decompress);
 
-- 
2.27.0



* [PATCH v9 03/19] Revert "crypto: testmgr - Add multibuffer acomp testing"
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
  2025-04-30 20:52 ` [PATCH v9 01/19] crypto: acomp - Remove request chaining Kanchana P Sridhar
  2025-04-30 20:52 ` [PATCH v9 02/19] crypto: acomp - Reinstate non-chained crypto_acomp_[de]compress() Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-04-30 20:52 ` [PATCH v9 04/19] crypto: scomp - Fix off-by-one bug when calculating last page Kanchana P Sridhar
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This reverts commit 99585c2192cb1ce212876e82ef01d1c98c7f4699.

Remove the acomp multibuffer tests so that the interface can be
redesigned.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
 crypto/testmgr.c | 147 +++++++++++++++++++++--------------------------
 1 file changed, 64 insertions(+), 83 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index abd609d4c8ef..82977ea25db3 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -58,9 +58,6 @@ module_param(fuzz_iterations, uint, 0644);
 MODULE_PARM_DESC(fuzz_iterations, "number of fuzz test iterations");
 #endif
 
-/* Multibuffer is unlimited.  Set arbitrary limit for testing. */
-#define MAX_MB_MSGS	16
-
 #ifdef CONFIG_CRYPTO_MANAGER_DISABLE_TESTS
 
 /* a perfect nop */
@@ -3329,48 +3326,27 @@ static int test_acomp(struct crypto_acomp *tfm,
 		      int ctcount, int dtcount)
 {
 	const char *algo = crypto_tfm_alg_driver_name(crypto_acomp_tfm(tfm));
-	struct scatterlist *src = NULL, *dst = NULL;
-	struct acomp_req *reqs[MAX_MB_MSGS] = {};
-	char *decomp_out[MAX_MB_MSGS] = {};
-	char *output[MAX_MB_MSGS] = {};
-	struct crypto_wait wait;
-	struct acomp_req *req;
-	int ret = -ENOMEM;
 	unsigned int i;
+	char *output, *decomp_out;
+	int ret;
+	struct scatterlist src, dst;
+	struct acomp_req *req;
+	struct crypto_wait wait;
 
-	src = kmalloc_array(MAX_MB_MSGS, sizeof(*src), GFP_KERNEL);
-	if (!src)
-		goto out;
-	dst = kmalloc_array(MAX_MB_MSGS, sizeof(*dst), GFP_KERNEL);
-	if (!dst)
-		goto out;
-
-	for (i = 0; i < MAX_MB_MSGS; i++) {
-		reqs[i] = acomp_request_alloc(tfm);
-		if (!reqs[i])
-			goto out;
-
-		acomp_request_set_callback(reqs[i],
-					   CRYPTO_TFM_REQ_MAY_SLEEP |
-					   CRYPTO_TFM_REQ_MAY_BACKLOG,
-					   crypto_req_done, &wait);
-		if (i)
-			acomp_request_chain(reqs[i], reqs[0]);
-
-		output[i] = kmalloc(COMP_BUF_SIZE, GFP_KERNEL);
-		if (!output[i])
-			goto out;
+	output = kmalloc(COMP_BUF_SIZE, GFP_KERNEL);
+	if (!output)
+		return -ENOMEM;
 
-		decomp_out[i] = kmalloc(COMP_BUF_SIZE, GFP_KERNEL);
-		if (!decomp_out[i])
-			goto out;
+	decomp_out = kmalloc(COMP_BUF_SIZE, GFP_KERNEL);
+	if (!decomp_out) {
+		kfree(output);
+		return -ENOMEM;
 	}
 
 	for (i = 0; i < ctcount; i++) {
 		unsigned int dlen = COMP_BUF_SIZE;
 		int ilen = ctemplate[i].inlen;
 		void *input_vec;
-		int j;
 
 		input_vec = kmemdup(ctemplate[i].input, ilen, GFP_KERNEL);
 		if (!input_vec) {
@@ -3378,61 +3354,70 @@ static int test_acomp(struct crypto_acomp *tfm,
 			goto out;
 		}
 
+		memset(output, 0, dlen);
 		crypto_init_wait(&wait);
-		sg_init_one(src, input_vec, ilen);
+		sg_init_one(&src, input_vec, ilen);
+		sg_init_one(&dst, output, dlen);
 
-		for (j = 0; j < MAX_MB_MSGS; j++) {
-			sg_init_one(dst + j, output[j], dlen);
-			acomp_request_set_params(reqs[j], src, dst + j, ilen, dlen);
+		req = acomp_request_alloc(tfm);
+		if (!req) {
+			pr_err("alg: acomp: request alloc failed for %s\n",
+			       algo);
+			kfree(input_vec);
+			ret = -ENOMEM;
+			goto out;
 		}
 
-		req = reqs[0];
+		acomp_request_set_params(req, &src, &dst, ilen, dlen);
+		acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+					   crypto_req_done, &wait);
+
 		ret = crypto_wait_req(crypto_acomp_compress(req), &wait);
 		if (ret) {
 			pr_err("alg: acomp: compression failed on test %d for %s: ret=%d\n",
 			       i + 1, algo, -ret);
 			kfree(input_vec);
+			acomp_request_free(req);
 			goto out;
 		}
 
 		ilen = req->dlen;
 		dlen = COMP_BUF_SIZE;
+		sg_init_one(&src, output, ilen);
+		sg_init_one(&dst, decomp_out, dlen);
 		crypto_init_wait(&wait);
-		for (j = 0; j < MAX_MB_MSGS; j++) {
-			sg_init_one(src + j, output[j], ilen);
-			sg_init_one(dst + j, decomp_out[j], dlen);
-			acomp_request_set_params(reqs[j], src + j, dst + j, ilen, dlen);
-		}
-
-		crypto_wait_req(crypto_acomp_decompress(req), &wait);
-		for (j = 0; j < MAX_MB_MSGS; j++) {
-			ret = reqs[j]->base.err;
-			if (ret) {
-				pr_err("alg: acomp: compression failed on test %d (%d) for %s: ret=%d\n",
-				       i + 1, j, algo, -ret);
-				kfree(input_vec);
-				goto out;
-			}
+		acomp_request_set_params(req, &src, &dst, ilen, dlen);
 
-			if (reqs[j]->dlen != ctemplate[i].inlen) {
-				pr_err("alg: acomp: Compression test %d (%d) failed for %s: output len = %d\n",
-				       i + 1, j, algo, reqs[j]->dlen);
-				ret = -EINVAL;
-				kfree(input_vec);
-				goto out;
-			}
+		ret = crypto_wait_req(crypto_acomp_decompress(req), &wait);
+		if (ret) {
+			pr_err("alg: acomp: compression failed on test %d for %s: ret=%d\n",
+			       i + 1, algo, -ret);
+			kfree(input_vec);
+			acomp_request_free(req);
+			goto out;
+		}
 
-			if (memcmp(input_vec, decomp_out[j], reqs[j]->dlen)) {
-				pr_err("alg: acomp: Compression test %d (%d) failed for %s\n",
-				       i + 1, j, algo);
-				hexdump(output[j], reqs[j]->dlen);
-				ret = -EINVAL;
-				kfree(input_vec);
-				goto out;
-			}
+		if (req->dlen != ctemplate[i].inlen) {
+			pr_err("alg: acomp: Compression test %d failed for %s: output len = %d\n",
+			       i + 1, algo, req->dlen);
+			ret = -EINVAL;
+			kfree(input_vec);
+			acomp_request_free(req);
+			goto out;
+		}
+
+		if (memcmp(input_vec, decomp_out, req->dlen)) {
+			pr_err("alg: acomp: Compression test %d failed for %s\n",
+			       i + 1, algo);
+			hexdump(output, req->dlen);
+			ret = -EINVAL;
+			kfree(input_vec);
+			acomp_request_free(req);
+			goto out;
 		}
 
 		kfree(input_vec);
+		acomp_request_free(req);
 	}
 
 	for (i = 0; i < dtcount; i++) {
@@ -3446,9 +3431,10 @@ static int test_acomp(struct crypto_acomp *tfm,
 			goto out;
 		}
 
+		memset(output, 0, dlen);
 		crypto_init_wait(&wait);
-		sg_init_one(src, input_vec, ilen);
-		sg_init_one(dst, output[0], dlen);
+		sg_init_one(&src, input_vec, ilen);
+		sg_init_one(&dst, output, dlen);
 
 		req = acomp_request_alloc(tfm);
 		if (!req) {
@@ -3459,7 +3445,7 @@ static int test_acomp(struct crypto_acomp *tfm,
 			goto out;
 		}
 
-		acomp_request_set_params(req, src, dst, ilen, dlen);
+		acomp_request_set_params(req, &src, &dst, ilen, dlen);
 		acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
 					   crypto_req_done, &wait);
 
@@ -3481,10 +3467,10 @@ static int test_acomp(struct crypto_acomp *tfm,
 			goto out;
 		}
 
-		if (memcmp(output[0], dtemplate[i].output, req->dlen)) {
+		if (memcmp(output, dtemplate[i].output, req->dlen)) {
 			pr_err("alg: acomp: Decompression test %d failed for %s\n",
 			       i + 1, algo);
-			hexdump(output[0], req->dlen);
+			hexdump(output, req->dlen);
 			ret = -EINVAL;
 			kfree(input_vec);
 			acomp_request_free(req);
@@ -3498,13 +3484,8 @@ static int test_acomp(struct crypto_acomp *tfm,
 	ret = 0;
 
 out:
-	acomp_request_free(reqs[0]);
-	for (i = 0; i < MAX_MB_MSGS; i++) {
-		kfree(output[i]);
-		kfree(decomp_out[i]);
-	}
-	kfree(dst);
-	kfree(src);
+	kfree(decomp_out);
+	kfree(output);
 	return ret;
 }
 
-- 
2.27.0



* [PATCH v9 04/19] crypto: scomp - Fix off-by-one bug when calculating last page
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (2 preceding siblings ...)
  2025-04-30 20:52 ` [PATCH v9 03/19] Revert "crypto: testmgr - Add multibuffer acomp testing" Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-04-30 20:52 ` [PATCH v9 05/19] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

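Fix an off-by-one bug in the calculation of the last page index for src
and dst. The old code subtracted one only after taking offset_in_page()
of the length, so for a buffer that starts at page offset 0 with a
length that is a multiple of PAGE_SIZE it did not yield the index of the
page holding the last byte. Compute the index from (len - 1) instead,
and use nth_page() for the dst lookup, as is already done for src.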
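
To make the arithmetic concrete, here is a small userspace sketch (not
kernel code) that compares the old and new expressions from the diff
below against a straightforward reference, (off + len - 1) / PAGE_SIZE.
The 4096-byte page size, the unsigned long types, and all demo_* and
last_page_* names are assumptions made for illustration; the kernel's
actual variable types may differ, so the exact wrong value produced by
the old expression may differ as well.

```c
#include <assert.h>

/* Assumed page size for this illustration; the kernel's PAGE_SIZE is per-arch. */
#define DEMO_PAGE_SIZE 4096UL

/* Userspace stand-in for the kernel's offset_in_page(). */
static unsigned long demo_offset_in_page(unsigned long v)
{
	return v & (DEMO_PAGE_SIZE - 1);
}

/* Reference: index of the page holding the last byte of [off, off + len). */
static unsigned long last_page_ref(unsigned long off, unsigned long len)
{
	return (off + len - 1) / DEMO_PAGE_SIZE;
}

/* Expression removed by the patch (types here are an assumption). */
static unsigned long last_page_old(unsigned long off, unsigned long len)
{
	unsigned long n = len / DEMO_PAGE_SIZE;

	n += (demo_offset_in_page(len) + off - 1) / DEMO_PAGE_SIZE;
	return n;
}

/* Expression added by the patch. */
static unsigned long last_page_new(unsigned long off, unsigned long len)
{
	unsigned long n = (len - 1) / DEMO_PAGE_SIZE;

	n += (demo_offset_in_page(len - 1) + off) / DEMO_PAGE_SIZE;
	return n;
}

/* Compare against the reference for in-page offsets and len >= 1. */
static void last_page_selfcheck(void)
{
	unsigned long off, len;

	for (off = 0; off < DEMO_PAGE_SIZE; off += 511)
		for (len = 1; len <= 3 * DEMO_PAGE_SIZE; len++)
			assert(last_page_new(off, len) == last_page_ref(off, len));

	/* One full page at offset 0 touches only page 0 ... */
	assert(last_page_ref(0, DEMO_PAGE_SIZE) == 0);
	/* ... but the old expression does not return 0 for it. */
	assert(last_page_old(0, DEMO_PAGE_SIZE) != 0);
}
```

Calling last_page_selfcheck() from a test harness exercises both
expressions. The failing input for the old expression, a page-aligned
buffer of exactly one page, is chosen here only because it is the
simplest case in which offset_in_page(len) + off is zero.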

Reported-by: Nhat Pham <nphamcs@gmail.com>
Fixes: 2d3553ecb4e3 ("crypto: scomp - Remove support for some non-trivial SG lists")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
 crypto/scompress.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/crypto/scompress.c b/crypto/scompress.c
index c1ce12564299..1ed52b9740c5 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -215,8 +215,8 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 			spage = nth_page(spage, soff / PAGE_SIZE);
 			soff = offset_in_page(soff);
 
-			n = slen / PAGE_SIZE;
-			n += (offset_in_page(slen) + soff - 1) / PAGE_SIZE;
+			n = (slen - 1) / PAGE_SIZE;
+			n += (offset_in_page(slen - 1) + soff) / PAGE_SIZE;
 			if (PageHighMem(nth_page(spage, n)) &&
 			    size_add(soff, slen) > PAGE_SIZE)
 				break;
@@ -243,9 +243,9 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 			dpage = nth_page(dpage, doff / PAGE_SIZE);
 			doff = offset_in_page(doff);
 
-			n = dlen / PAGE_SIZE;
-			n += (offset_in_page(dlen) + doff - 1) / PAGE_SIZE;
-			if (PageHighMem(dpage + n) &&
+			n = (dlen - 1) / PAGE_SIZE;
+			n += (offset_in_page(dlen - 1) + doff) / PAGE_SIZE;
+			if (PageHighMem(nth_page(dpage, n)) &&
 			    size_add(doff, dlen) > PAGE_SIZE)
 				break;
 			dst = kmap_local_page(dpage) + doff;
-- 
2.27.0



* [PATCH v9 05/19] crypto: iaa - Re-organize the iaa_crypto driver code.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (3 preceding siblings ...)
  2025-04-30 20:52 ` [PATCH v9 04/19] crypto: scomp - Fix off-by-one bug when calculating last page Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-04-30 20:52 ` [PATCH v9 06/19] crypto: iaa - New architecture for IAA device WQ comp/decomp usage & core mapping Kanchana P Sridhar
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch only reorganizes the code in iaa_crypto_main.c: functions are
grouped into logically related sections and ordered so that forward
declarations are no longer needed.

This should make the code easier to maintain, and make it easier to
replace functional layers or add new features.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 678 +++++++++++----------
 1 file changed, 348 insertions(+), 330 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 09d9589f2d68..4900f9c72600 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -24,6 +24,10 @@
 
 #define IAA_ALG_PRIORITY               300
 
+/**************************************
+ * Driver internal global variables.
+ **************************************/
+
 /* number of iaa instances probed */
 static unsigned int nr_iaa;
 static unsigned int nr_cpus;
@@ -36,54 +40,6 @@ static unsigned int cpus_per_iaa;
 /* Per-cpu lookup table for balanced wqs */
 static struct wq_table_entry __percpu *wq_table;
 
-static struct idxd_wq *wq_table_next_wq(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	if (++entry->cur_wq >= entry->n_wqs)
-		entry->cur_wq = 0;
-
-	if (!entry->wqs[entry->cur_wq])
-		return NULL;
-
-	pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
-		 entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
-		 entry->wqs[entry->cur_wq]->id, cpu);
-
-	return entry->wqs[entry->cur_wq];
-}
-
-static void wq_table_add(int cpu, struct idxd_wq *wq)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	if (WARN_ON(entry->n_wqs == entry->max_wqs))
-		return;
-
-	entry->wqs[entry->n_wqs++] = wq;
-
-	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
-		 entry->wqs[entry->n_wqs - 1]->idxd->id,
-		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
-}
-
-static void wq_table_free_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	kfree(entry->wqs);
-	memset(entry, 0, sizeof(*entry));
-}
-
-static void wq_table_clear_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	entry->n_wqs = 0;
-	entry->cur_wq = 0;
-	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
-}
-
 LIST_HEAD(iaa_devices);
 DEFINE_MUTEX(iaa_devices_lock);
 
@@ -91,36 +47,11 @@ DEFINE_MUTEX(iaa_devices_lock);
 static bool iaa_crypto_enabled;
 static bool iaa_crypto_registered;
 
+static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
+
 /* Verify results of IAA compress or not */
 static bool iaa_verify_compress = true;
 
-static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
-{
-	return sprintf(buf, "%d\n", iaa_verify_compress);
-}
-
-static ssize_t verify_compress_store(struct device_driver *driver,
-				     const char *buf, size_t count)
-{
-	int ret = -EBUSY;
-
-	mutex_lock(&iaa_devices_lock);
-
-	if (iaa_crypto_enabled)
-		goto out;
-
-	ret = kstrtobool(buf, &iaa_verify_compress);
-	if (ret)
-		goto out;
-
-	ret = count;
-out:
-	mutex_unlock(&iaa_devices_lock);
-
-	return ret;
-}
-static DRIVER_ATTR_RW(verify_compress);
-
 /*
  * The iaa crypto driver supports three 'sync' methods determining how
  * compressions and decompressions are performed:
@@ -155,6 +86,37 @@ static bool async_mode;
 /* Use interrupts */
 static bool use_irq;
 
+/**************************************************
+ * Driver attributes along with get/set functions.
+ **************************************************/
+
+static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", iaa_verify_compress);
+}
+
+static ssize_t verify_compress_store(struct device_driver *driver,
+				     const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtobool(buf, &iaa_verify_compress);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(verify_compress);
+
 /**
  * set_iaa_sync_mode - Set IAA sync mode
  * @name: The name of the sync mode
@@ -217,7 +179,9 @@ static ssize_t sync_mode_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(sync_mode);
 
-static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
+/****************************
+ * Driver compression modes.
+ ****************************/
 
 static int find_empty_iaa_compression_mode(void)
 {
@@ -409,11 +373,6 @@ static void free_device_compression_mode(struct iaa_device *iaa_device,
 						IDXD_OP_FLAG_WR_SRC2_AECS_COMP | \
 						IDXD_OP_FLAG_AECS_RW_TGLS)
 
-static int check_completion(struct device *dev,
-			    struct iax_completion_record *comp,
-			    bool compress,
-			    bool only_once);
-
 static int init_device_compression_mode(struct iaa_device *iaa_device,
 					struct iaa_compression_mode *mode,
 					int idx, struct idxd_wq *wq)
@@ -500,6 +459,11 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
 	}
 }
 
+/***********************************************************
+ * Functions for use in crypto probe and remove interfaces:
+ * allocate/init/query/deallocate devices/wqs.
+ ***********************************************************/
+
 static struct iaa_device *iaa_device_alloc(void)
 {
 	struct iaa_device *iaa_device;
@@ -513,18 +477,6 @@ static struct iaa_device *iaa_device_alloc(void)
 	return iaa_device;
 }
 
-static bool iaa_has_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
-{
-	struct iaa_wq *iaa_wq;
-
-	list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
-		if (iaa_wq->wq == wq)
-			return true;
-	}
-
-	return false;
-}
-
 static struct iaa_device *add_iaa_device(struct idxd_device *idxd)
 {
 	struct iaa_device *iaa_device;
@@ -560,6 +512,27 @@ static void del_iaa_device(struct iaa_device *iaa_device)
 	nr_iaa--;
 }
 
+static void free_iaa_device(struct iaa_device *iaa_device)
+{
+	if (!iaa_device)
+		return;
+
+	remove_device_compression_modes(iaa_device);
+	kfree(iaa_device);
+}
+
+static bool iaa_has_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
+{
+	struct iaa_wq *iaa_wq;
+
+	list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+		if (iaa_wq->wq == wq)
+			return true;
+	}
+
+	return false;
+}
+
 static int add_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq,
 		      struct iaa_wq **new_wq)
 {
@@ -612,23 +585,23 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
 	}
 }
 
-static void clear_wq_table(void)
+static void remove_iaa_wq(struct idxd_wq *wq)
 {
-	int cpu;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_clear_entry(cpu);
-
-	pr_debug("cleared wq table\n");
-}
+	struct iaa_device *iaa_device;
 
-static void free_iaa_device(struct iaa_device *iaa_device)
-{
-	if (!iaa_device)
-		return;
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		if (iaa_has_wq(iaa_device, wq)) {
+			del_iaa_wq(iaa_device, wq);
+			break;
+		}
+	}
 
-	remove_device_compression_modes(iaa_device);
-	kfree(iaa_device);
+	if (nr_iaa) {
+		cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+		if (!cpus_per_iaa)
+			cpus_per_iaa = 1;
+	} else
+		cpus_per_iaa = 1;
 }
 
 static void __free_iaa_wq(struct iaa_wq *iaa_wq)
@@ -655,6 +628,75 @@ static void free_iaa_wq(struct iaa_wq *iaa_wq)
 	idxd_wq_set_private(wq, NULL);
 }
 
+static int save_iaa_wq(struct idxd_wq *wq)
+{
+	struct iaa_device *iaa_device, *found = NULL;
+	struct idxd_device *idxd;
+	struct pci_dev *pdev;
+	struct device *dev;
+	int ret = 0;
+
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		if (iaa_device->idxd == wq->idxd) {
+			idxd = iaa_device->idxd;
+			pdev = idxd->pdev;
+			dev = &pdev->dev;
+			/*
+			 * Check to see that we don't already have this wq.
+			 * Shouldn't happen but we don't control probing.
+			 */
+			if (iaa_has_wq(iaa_device, wq)) {
+				dev_dbg(dev, "same wq probed multiple times for iaa_device %p\n",
+					iaa_device);
+				goto out;
+			}
+
+			found = iaa_device;
+
+			ret = add_iaa_wq(iaa_device, wq, NULL);
+			if (ret)
+				goto out;
+
+			break;
+		}
+	}
+
+	if (!found) {
+		struct iaa_device *new_device;
+		struct iaa_wq *new_wq;
+
+		new_device = add_iaa_device(wq->idxd);
+		if (!new_device) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		ret = add_iaa_wq(new_device, wq, &new_wq);
+		if (ret) {
+			del_iaa_device(new_device);
+			free_iaa_device(new_device);
+			goto out;
+		}
+
+		ret = init_iaa_device(new_device, new_wq);
+		if (ret) {
+			del_iaa_wq(new_device, new_wq->wq);
+			del_iaa_device(new_device);
+			free_iaa_wq(new_wq);
+			goto out;
+		}
+	}
+
+	if (WARN_ON(nr_iaa == 0))
+		return -EINVAL;
+
+	cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+	if (!cpus_per_iaa)
+		cpus_per_iaa = 1;
+out:
+	return 0;
+}
+
 static int iaa_wq_get(struct idxd_wq *wq)
 {
 	struct idxd_device *idxd = wq->idxd;
@@ -702,6 +744,37 @@ static int iaa_wq_put(struct idxd_wq *wq)
 	return ret;
 }
 
+/***************************************************************
+ * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
+ ***************************************************************/
+
+static void wq_table_free_entry(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	kfree(entry->wqs);
+	memset(entry, 0, sizeof(*entry));
+}
+
+static void wq_table_clear_entry(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	entry->n_wqs = 0;
+	entry->cur_wq = 0;
+	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+}
+
+static void clear_wq_table(void)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++)
+		wq_table_clear_entry(cpu);
+
+	pr_debug("cleared wq table\n");
+}
+
 static void free_wq_table(void)
 {
 	int cpu;
@@ -739,92 +812,18 @@ static int alloc_wq_table(int max_wqs)
 	return 0;
 }
 
-static int save_iaa_wq(struct idxd_wq *wq)
+static void wq_table_add(int cpu, struct idxd_wq *wq)
 {
-	struct iaa_device *iaa_device, *found = NULL;
-	struct idxd_device *idxd;
-	struct pci_dev *pdev;
-	struct device *dev;
-	int ret = 0;
-
-	list_for_each_entry(iaa_device, &iaa_devices, list) {
-		if (iaa_device->idxd == wq->idxd) {
-			idxd = iaa_device->idxd;
-			pdev = idxd->pdev;
-			dev = &pdev->dev;
-			/*
-			 * Check to see that we don't already have this wq.
-			 * Shouldn't happen but we don't control probing.
-			 */
-			if (iaa_has_wq(iaa_device, wq)) {
-				dev_dbg(dev, "same wq probed multiple times for iaa_device %p\n",
-					iaa_device);
-				goto out;
-			}
-
-			found = iaa_device;
-
-			ret = add_iaa_wq(iaa_device, wq, NULL);
-			if (ret)
-				goto out;
-
-			break;
-		}
-	}
-
-	if (!found) {
-		struct iaa_device *new_device;
-		struct iaa_wq *new_wq;
-
-		new_device = add_iaa_device(wq->idxd);
-		if (!new_device) {
-			ret = -ENOMEM;
-			goto out;
-		}
-
-		ret = add_iaa_wq(new_device, wq, &new_wq);
-		if (ret) {
-			del_iaa_device(new_device);
-			free_iaa_device(new_device);
-			goto out;
-		}
-
-		ret = init_iaa_device(new_device, new_wq);
-		if (ret) {
-			del_iaa_wq(new_device, new_wq->wq);
-			del_iaa_device(new_device);
-			free_iaa_wq(new_wq);
-			goto out;
-		}
-	}
-
-	if (WARN_ON(nr_iaa == 0))
-		return -EINVAL;
-
-	cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
-	if (!cpus_per_iaa)
-		cpus_per_iaa = 1;
-out:
-	return 0;
-}
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
 
-static void remove_iaa_wq(struct idxd_wq *wq)
-{
-	struct iaa_device *iaa_device;
+	if (WARN_ON(entry->n_wqs == entry->max_wqs))
+		return;
 
-	list_for_each_entry(iaa_device, &iaa_devices, list) {
-		if (iaa_has_wq(iaa_device, wq)) {
-			del_iaa_wq(iaa_device, wq);
-			break;
-		}
-	}
+	entry->wqs[entry->n_wqs++] = wq;
 
-	if (nr_iaa) {
-		cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
-		if (!cpus_per_iaa)
-			cpus_per_iaa = 1;
-	} else
-		cpus_per_iaa = 1;
+	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
+		 entry->wqs[entry->n_wqs - 1]->idxd->id,
+		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
 }
 
 static int wq_table_add_wqs(int iaa, int cpu)
@@ -937,6 +936,47 @@ static void rebalance_wq_table(void)
 	}
 }
 
+/***************************************************************
+ * Assign work-queues for driver ops using per-cpu wq_tables.
+ ***************************************************************/
+
+static struct idxd_wq *wq_table_next_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	if (++entry->cur_wq >= entry->n_wqs)
+		entry->cur_wq = 0;
+
+	if (!entry->wqs[entry->cur_wq])
+		return NULL;
+
+	pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
+		 entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
+		 entry->wqs[entry->cur_wq]->id, cpu);
+
+	return entry->wqs[entry->cur_wq];
+}
+
+/*************************************************
+ * Core iaa_crypto compress/decompress functions.
+ *************************************************/
+
+static int deflate_generic_decompress(struct acomp_req *req)
+{
+	ACOMP_REQUEST_ON_STACK(fbreq, crypto_acomp_reqtfm(req));
+	int ret;
+
+	acomp_request_set_callback(fbreq, 0, NULL, NULL);
+	acomp_request_set_params(fbreq, req->src, req->dst, req->slen,
+				 req->dlen);
+	ret = crypto_acomp_decompress(fbreq);
+	req->dlen = fbreq->dlen;
+
+	update_total_sw_decomp_calls();
+
+	return ret;
+}
+
 static inline int check_completion(struct device *dev,
 				   struct iax_completion_record *comp,
 				   bool compress,
@@ -997,31 +1037,132 @@ static inline int check_completion(struct device *dev,
 	return ret;
 }
 
-static int deflate_generic_decompress(struct acomp_req *req)
+static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
+				struct acomp_req *req,
+				dma_addr_t *src_addr, dma_addr_t *dst_addr)
 {
-	ACOMP_REQUEST_ON_STACK(fbreq, crypto_acomp_reqtfm(req));
-	int ret;
+	int ret = 0;
+	int nr_sgs;
 
-	acomp_request_set_callback(fbreq, 0, NULL, NULL);
-	acomp_request_set_params(fbreq, req->src, req->dst, req->slen,
-				 req->dlen);
-	ret = crypto_acomp_decompress(fbreq);
-	req->dlen = fbreq->dlen;
+	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
 
-	update_total_sw_decomp_calls();
+	nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+	if (nr_sgs <= 0 || nr_sgs > 1) {
+		dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
+			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+			iaa_wq->wq->id, ret);
+		ret = -EIO;
+		goto out;
+	}
+	*src_addr = sg_dma_address(req->src);
+	dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
+		" req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
+		req->src, req->slen, sg_dma_len(req->src));
 
+	nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
+	if (nr_sgs <= 0 || nr_sgs > 1) {
+		dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
+			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+			iaa_wq->wq->id, ret);
+		ret = -EIO;
+		dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+		goto out;
+	}
+	*dst_addr = sg_dma_address(req->dst);
+	dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
+		" req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
+		req->dst, req->dlen, sg_dma_len(req->dst));
+out:
 	return ret;
 }
 
-static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
-				struct acomp_req *req,
-				dma_addr_t *src_addr, dma_addr_t *dst_addr);
-
 static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 			       struct idxd_wq *wq,
 			       dma_addr_t src_addr, unsigned int slen,
 			       dma_addr_t dst_addr, unsigned int *dlen,
-			       u32 compression_crc);
+			       u32 compression_crc)
+{
+	struct iaa_device_compression_mode *active_compression_mode;
+	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	struct iaa_device *iaa_device;
+	struct idxd_desc *idxd_desc;
+	struct iax_hw_desc *desc;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	int ret = 0;
+
+	iaa_wq = idxd_wq_get_private(wq);
+	iaa_device = iaa_wq->iaa_device;
+	idxd = iaa_device->idxd;
+	pdev = idxd->pdev;
+	dev = &pdev->dev;
+
+	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
+
+	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	if (IS_ERR(idxd_desc)) {
+		dev_dbg(dev, "idxd descriptor allocation failed\n");
+		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
+			PTR_ERR(idxd_desc));
+		return PTR_ERR(idxd_desc);
+	}
+	desc = idxd_desc->iax_hw;
+
+	/* Verify (optional) - decompress and check crc, suppress dest write */
+
+	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
+	desc->opcode = IAX_OPCODE_DECOMPRESS;
+	desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
+	desc->priv = 0;
+
+	desc->src1_addr = (u64)dst_addr;
+	desc->src1_size = *dlen;
+	desc->dst_addr = (u64)src_addr;
+	desc->max_dst_size = slen;
+	desc->completion_addr = idxd_desc->compl_dma;
+
+	dev_dbg(dev, "(verify) compression mode %s,"
+		" desc->src1_addr %llx, desc->src1_size %d,"
+		" desc->dst_addr %llx, desc->max_dst_size %d,"
+		" desc->src2_addr %llx, desc->src2_size %d\n",
+		active_compression_mode->name,
+		desc->src1_addr, desc->src1_size, desc->dst_addr,
+		desc->max_dst_size, desc->src2_addr, desc->src2_size);
+
+	ret = idxd_submit_desc(wq, idxd_desc);
+	if (ret) {
+		dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
+		goto err;
+	}
+
+	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+	if (ret) {
+		dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
+		goto err;
+	}
+
+	if (compression_crc != idxd_desc->iax_completion->crc) {
+		ret = -EINVAL;
+		dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
+			" comp=0x%x, decomp=0x%x\n", compression_crc,
+			idxd_desc->iax_completion->crc);
+		print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
+			       8, 1, idxd_desc->iax_completion, 64, 0);
+		goto err;
+	}
+
+	idxd_free_desc(wq, idxd_desc);
+out:
+	return ret;
+err:
+	idxd_free_desc(wq, idxd_desc);
+	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
+
+	goto out;
+}
 
 static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 			      enum idxd_complete_type comp_type,
@@ -1239,133 +1380,6 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 	goto out;
 }
 
-static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
-				struct acomp_req *req,
-				dma_addr_t *src_addr, dma_addr_t *dst_addr)
-{
-	int ret = 0;
-	int nr_sgs;
-
-	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
-	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
-
-	nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
-	if (nr_sgs <= 0 || nr_sgs > 1) {
-		dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
-			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
-			iaa_wq->wq->id, ret);
-		ret = -EIO;
-		goto out;
-	}
-	*src_addr = sg_dma_address(req->src);
-	dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
-		" req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
-		req->src, req->slen, sg_dma_len(req->src));
-
-	nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
-	if (nr_sgs <= 0 || nr_sgs > 1) {
-		dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
-			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
-			iaa_wq->wq->id, ret);
-		ret = -EIO;
-		dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
-		goto out;
-	}
-	*dst_addr = sg_dma_address(req->dst);
-	dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
-		" req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
-		req->dst, req->dlen, sg_dma_len(req->dst));
-out:
-	return ret;
-}
-
-static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
-			       struct idxd_wq *wq,
-			       dma_addr_t src_addr, unsigned int slen,
-			       dma_addr_t dst_addr, unsigned int *dlen,
-			       u32 compression_crc)
-{
-	struct iaa_device_compression_mode *active_compression_mode;
-	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
-	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
-	struct iax_hw_desc *desc;
-	struct idxd_device *idxd;
-	struct iaa_wq *iaa_wq;
-	struct pci_dev *pdev;
-	struct device *dev;
-	int ret = 0;
-
-	iaa_wq = idxd_wq_get_private(wq);
-	iaa_device = iaa_wq->iaa_device;
-	idxd = iaa_device->idxd;
-	pdev = idxd->pdev;
-	dev = &pdev->dev;
-
-	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
-	if (IS_ERR(idxd_desc)) {
-		dev_dbg(dev, "idxd descriptor allocation failed\n");
-		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
-			PTR_ERR(idxd_desc));
-		return PTR_ERR(idxd_desc);
-	}
-	desc = idxd_desc->iax_hw;
-
-	/* Verify (optional) - decompress and check crc, suppress dest write */
-
-	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
-	desc->opcode = IAX_OPCODE_DECOMPRESS;
-	desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
-	desc->priv = 0;
-
-	desc->src1_addr = (u64)dst_addr;
-	desc->src1_size = *dlen;
-	desc->dst_addr = (u64)src_addr;
-	desc->max_dst_size = slen;
-	desc->completion_addr = idxd_desc->compl_dma;
-
-	dev_dbg(dev, "(verify) compression mode %s,"
-		" desc->src1_addr %llx, desc->src1_size %d,"
-		" desc->dst_addr %llx, desc->max_dst_size %d,"
-		" desc->src2_addr %llx, desc->src2_size %d\n",
-		active_compression_mode->name,
-		desc->src1_addr, desc->src1_size, desc->dst_addr,
-		desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
-	ret = idxd_submit_desc(wq, idxd_desc);
-	if (ret) {
-		dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
-		goto err;
-	}
-
-	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
-	if (ret) {
-		dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
-		goto err;
-	}
-
-	if (compression_crc != idxd_desc->iax_completion->crc) {
-		ret = -EINVAL;
-		dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
-			" comp=0x%x, decomp=0x%x\n", compression_crc,
-			idxd_desc->iax_completion->crc);
-		print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
-			       8, 1, idxd_desc->iax_completion, 64, 0);
-		goto err;
-	}
-
-	idxd_free_desc(wq, idxd_desc);
-out:
-	return ret;
-err:
-	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
-
-	goto out;
-}
-
 static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 			  struct idxd_wq *wq,
 			  dma_addr_t src_addr, unsigned int slen,
@@ -1678,6 +1692,10 @@ static void compression_ctx_init(struct iaa_compression_ctx *ctx)
 	ctx->use_irq = use_irq;
 }
 
+/*********************************************
+ * Interfaces to crypto_alg and crypto_acomp.
+ *********************************************/
+
 static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
-- 
2.27.0



* [PATCH v9 06/19] crypto: iaa - New architecture for IAA device WQ comp/decomp usage & core mapping.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (4 preceding siblings ...)
  2025-04-30 20:52 ` [PATCH v9 05/19] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-04-30 20:52 ` [PATCH v9 07/19] crypto: iaa - Define and use req->data instead of req->base.data Kanchana P Sridhar
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch re-architects the iaa_crypto driver in two aspects:

A) Map IAA devices/wqs to cores based on packages instead of NUMA.

B) The WQ rebalancing algorithm that is invoked as WQs are
   discovered/deleted has been made very general and flexible so that
   the user can control exactly how IAA WQs are used.

Description/motivation for (A):
===============================
This patch modifies the algorithm for mapping available IAA devices and
WQs to cores based on packages instead of NUMA nodes. This leads to a
more realistic mapping of IAA devices as compression/decompression
resources for a package, rather than for a NUMA node. This also resolves
problems that were observed during internal validation on Intel Granite
Rapids platforms with many more NUMA nodes than packages: for such
cases, the earlier NUMA based allocation caused some IAAs to be
over-subscribed and some to not be utilized at all.

As a result of this change from NUMA to packages, some of the core
functions used by the iaa_crypto driver's "probe" and "remove" API
have been re-written. The new infrastructure maintains a static mapping
of wqs per IAA device, in the "struct iaa_device" itself. The earlier
implementation would allocate memory per-cpu for this data, which never
changes once the IAA devices/wqs have been initialized.

Two main outcomes from this new iaa_crypto driver infrastructure are:

 1) Resolves "task blocked for more than x seconds" errors observed during
    internal validation on Intel systems with the earlier NUMA node based
    mappings, which were root-caused to the non-optimal IAA-to-core
    mappings described earlier.

 2) Reduces the memory footprint of initializing IAA devices/wqs by a
    factor of NUM_THREADS, by eliminating the per-cpu copies of each IAA
    device's wqs. On a 384-core Intel Granite Rapids server with
    8 IAA devices, this saves 140MiB.

An auxiliary change included in this patch is that the driver's "nr_iaa",
"nr_iaa_per_package" and "cpus_per_iaa" global variables are made
atomic, because iaa_crypto_probe() and iaa_crypto_remove() change the
values of these variables asynchronously and concurrently as wqs get
added/deleted and rebalance_wq_table() is called. This change allows the
rebalance_wq_table() code to see consistent values of the number of IAA
devices.

Description/motivation for (B):
===============================
This builds upon the package-based driver infrastructure, to provide
more flexibility in using particular WQs for compress-only or
decompress-only jobs. It also introduces the notion of using all the IAA
devices on a package as resources that are shared by all cores on the
package: this significantly improves the latency of batching (to be
added in subsequent patches) and compress/decompress throughput. sysfs
driver parameters make these features configurable.

Two main concepts are introduced as part of the rebalancing changes:

 1) An IAA WQ can be restricted to specific ops. This determines the WQ
    "type" that the iaa_crypto driver uses when submitting
    compress/decompress jobs:

    - compress only
    - decompress only
    - generic, i.e., for both compresses and decompresses

    The WQ type is decided based on the number of WQs configured for a
    given IAA device, and the new "g_comp_wqs_per_iaa" driver parameter.

 2) An IAA WQ can be mapped to cores using either of the following
    balancing techniques:

    a) Shared by all cores on a package. The iaa_crypto driver will
       dispatch compress/decompress jobs to all WQs of the same type,
       across all IAA devices on the package:
       - IAA compress jobs will be distributed to all same-package IAA
         compress-only/generic WQs.
       - IAA decompress jobs will be distributed to all same-package IAA
         decompress-only/generic WQs.

    b) Handles compress/decompress jobs only from "mapped cores", i.e.,
       the cores assigned to the WQ's IAA by evenly dividing the cores on
       a package among the IAAs on that package.

Server setups that are moderately to highly contended can benefit from
(2.a). When the mix of workloads running on a system needs high compress
throughput, and has relatively lower decompress activity, (2.b) might
be a better fit.

These approaches can be accomplished with the following new iaa_crypto
driver parameters. These parameters are global settings and will apply
to all IAAs on a package, interpreted in the context of the number of
WQs configured per IAA device.

 g_comp_wqs_per_iaa:
 ===================
   Number of compress-only WQs (default is 0).

   If the IAA device has more than "g_comp_wqs_per_iaa" WQs configured,
   the last "g_comp_wqs_per_iaa" number of WQs will be considered as
   "compress only". The remaining WQs will be considered as "decomp only".

   If the device has fewer WQs than "g_comp_wqs_per_iaa", all the
   device's WQs will be considered "generic", i.e., the driver will
   submit compress and decompress jobs to all the WQs configured for the
   device.

   For example, if an IAA "X" has 2 WQs, this will set up 1 decompress WQ and
   1 compress WQ:

     echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa

     wqX.0: decompress jobs only.
     wqX.1: compress jobs only.

   This setting would typically benefit workloads that see a high
   level of compress and decompress activity.

   If an IAA has 1 WQ, that WQ will be considered "generic": the driver
   will submit compress and decompress jobs to the same WQ (this is
   independent of the "g_comp_wqs_per_iaa" setting):

     wqX.0: compress and decompress jobs.

   This would typically benefit workloads that see significant cold
   memory being reclaimed, and consequently, high swapout and low swapin
   activity.
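
   The WQ typing rule above can be sketched as a small standalone C
   function. This is an illustration only, not the driver's code:
   wq_type_for_index() and enum wq_type are hypothetical names, and the
   handling of the cases the description leaves open (a setting of 0, or
   a WQ count equal to the setting) is an assumption chosen to agree with
   the 1-WQ example.

```c
#include <assert.h>

enum wq_type { WQ_GENERIC, WQ_DECOMP_ONLY, WQ_COMP_ONLY };

/*
 * Hypothetical helper: classify WQ number `idx` (0-based) of a device that
 * has `n_wqs` WQs configured, given the "g_comp_wqs_per_iaa" setting.
 */
static enum wq_type wq_type_for_index(int idx, int n_wqs, int comp_wqs_per_iaa)
{
	assert(n_wqs > 0 && idx >= 0 && idx < n_wqs);

	/*
	 * Assumption: with no dedicated compress WQs requested, or with too
	 * few WQs to split, every WQ takes both kinds of jobs.
	 */
	if (comp_wqs_per_iaa <= 0 || n_wqs <= comp_wqs_per_iaa)
		return WQ_GENERIC;

	/* The last comp_wqs_per_iaa WQs are compress-only. */
	if (idx >= n_wqs - comp_wqs_per_iaa)
		return WQ_COMP_ONLY;

	/* The remaining WQs are decompress-only. */
	return WQ_DECOMP_ONLY;
}
```

   With 2 WQs and a setting of 1, this sketch yields decompress-only for
   wqX.0 and compress-only for wqX.1, as in the example above.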

 distribute_comps:
 =================
   Distribute compressions to all IAAs on package (default is Y).

   Assuming the WQ type has been established as
   compress-only/decompress-only/generic, this setting will determine if
   the driver will distribute compress jobs to all IAAs on a package
   (default behavior) or not.

   If this is turned off, the driver will dispatch compress jobs to a
   given IAA "compression enabled" WQ only from cores that are mapped to
   that IAA using an algorithm that evenly distributes IAAs per package
   to cores per package. For example, on a Sapphire Rapids server with
   56 physical cores and 4 IAAs per package, with Hyperthreading, 28
   logical cores will be assigned to each IAA. With the
   "distribute_comps" driver parameter turned off, the driver will send
   each core's compress jobs only to its assigned IAA device.

   Enabling "distribute_comps" would typically benefit workloads in
   terms of batch compress latency and throughput.

 distribute_decomps:
 ===================
   Distribute decompressions to all IAAs on package (default is Y).

   Assuming the WQ type has been established as
   compress-only/decompress-only/generic, this setting will determine if
   the driver will distribute decompress jobs to all IAAs on a package
   (default behavior) or not.

   Enabling "distribute_decomps" would typically benefit workloads that
   see a high level of compress and decompress activity, especially in
   terms of p99 decompress latency.

Recommended settings for best compress/decompress latency, throughput
and hence memory savings for a moderately contended server:

   2 WQs per IAA
   g_comp_wqs_per_iaa = 1 (separate WQ for comps/decomps per IAA)
   distribute_decomps = Y
   distribute_comps = Y

For a system that has only 1 IAA device enabled on a given package,
the recommended settings are:

   1 WQ per IAA
   g_comp_wqs_per_iaa = 0 (same WQ for comps/decomps)
   distribute_decomps = N
   distribute_comps = N

 Examples:
 =========
   Consider a Sapphire Rapids server with 2 packages, 56 cores and 4 IAAs
   per package, where each IAA has 2 WQs and these settings are in effect:

     echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
     echo 1 > /sys/bus/dsa/drivers/crypto/distribute_comps
     echo 0 > /sys/bus/dsa/drivers/crypto/distribute_decomps

     wqX.0: decompress jobs only.
     wqX.1: compress jobs only.

   Compress jobs from all cores on package-0 will be distributed in
   round-robin manner to [iax1, iax3, iax5, iax7]'s wqX.1, to improve
   compression throughput, latency and memory savings:

     wq1.1
     wq3.1
     wq5.1
     wq7.1

   Likewise, compress jobs from all cores on package-1 will be
   distributed in round-robin manner to [iax9, iax11, iax13, iax15]'s
   wqX.1, to improve compression throughput, latency and memory savings
   for workloads running on package-1:

     wq9.1
     wq11.1
     wq13.1
     wq15.1
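
   The round-robin distribution described above can be illustrated with
   the following standalone sketch. It is not taken from the driver:
   struct wq_table and next_wq() are hypothetical stand-ins that only
   mirror the n_wqs/cur_wq fields of "struct wq_table_entry", and WQs
   are represented by their names rather than by "struct idxd_wq"
   pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for the driver's struct wq_table_entry. */
struct wq_table {
	const char **wqs;	/* WQ names; the driver stores struct idxd_wq pointers */
	unsigned int n_wqs;	/* number of valid entries in wqs[] */
	unsigned int cur_wq;	/* index of the WQ to hand out next */
};

/* Return the next WQ in round-robin order, or NULL if the table is empty. */
static const char *next_wq(struct wq_table *t)
{
	const char *wq;

	assert(t != NULL);

	if (!t->n_wqs)
		return NULL;

	wq = t->wqs[t->cur_wq];
	/* Advance and wrap so that successive calls cycle through all WQs. */
	t->cur_wq = (t->cur_wq + 1) % t->n_wqs;

	return wq;
}
```

   With a package-0 table holding wq1.1, wq3.1, wq5.1 and wq7.1,
   successive calls return those WQs in order and then wrap around to
   wq1.1. The sketch is single-threaded; it does not address concurrent
   submitters.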

   Decompress jobs will be submitted from mapped logical cores only, as
   follows:

     package-0:

       CPU   0-13,112-125   14-27,126-139  28-41,140-153  42-55,154-167
       IAA:  iax1           iax3           iax5           iax7
       WQ:   wq1.0          wq3.0          wq5.0          wq7.0

     package-1:

       CPU   56-69,168-181  70-83,182-195  84-97,196-209   98-111,210-223
       IAA:  iax9           iax11          iax13           iax15
       WQ:   wq9.0          wq11.0         wq13.0          wq15.0
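
   The table above can be reproduced with the arithmetic below. This is
   a sketch under stated assumptions, not the driver's cpu_to_iaa(): the
   CPU numbering (0-111 are the first hyperthreads, 112-223 their
   siblings), the constants, and the iax(2i + 1) device numbering are
   read off this example, and all function names are hypothetical.

```c
#include <assert.h>

#define NR_PACKAGES		2
#define CORES_PER_PACKAGE	56	/* physical cores */
#define IAAS_PER_PACKAGE	4
#define CORES_PER_IAA		(CORES_PER_PACKAGE / IAAS_PER_PACKAGE)	/* 14 */

/* Assumed numbering: CPUs 0-55 and 112-167 are on package 0, the rest on package 1. */
static int example_package_of(int cpu)
{
	return (cpu / CORES_PER_PACKAGE) % NR_PACKAGES;
}

/* Index (0..7) of the IAA that a logical CPU is mapped to. */
static int example_cpu_to_iaa(int cpu)
{
	assert(cpu >= 0 && cpu < 2 * NR_PACKAGES * CORES_PER_PACKAGE);

	/* Both hyperthreads of a physical core land on the same IAA. */
	return example_package_of(cpu) * IAAS_PER_PACKAGE +
	       (cpu % CORES_PER_PACKAGE) / CORES_PER_IAA;
}

/* In this example, IAA index i corresponds to device iax(2i + 1). */
static int example_iax_id(int iaa_index)
{
	return 2 * iaa_index + 1;
}
```

   Each IAA ends up serving 14 physical cores, i.e. 28 logical cores,
   which matches the per-IAA core count quoted for "distribute_comps".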

IAA WQs can be configured using higher level scripts as described in
Documentation/driver-api/crypto/iaa/iaa-crypto.rst. This documentation
has been updated for the above new parameters.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 .../driver-api/crypto/iaa/iaa-crypto.rst      | 134 +++
 drivers/crypto/intel/iaa/iaa_crypto.h         |  18 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c    | 817 ++++++++++++++----
 3 files changed, 805 insertions(+), 164 deletions(-)

diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index 8e50b900d51c..949bfa1ef624 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -290,6 +290,140 @@ The available attributes are:
     'sync' mode. This is to ensure correct iaa_crypto behavior until true
     async polling without interrupts is enabled in iaa_crypto.
 
+  - g_comp_wqs_per_iaa
+
+    Number of compress-only WQs (default is 0).
+
+    If the IAA device has more than "g_comp_wqs_per_iaa" WQs configured,
+    the last "g_comp_wqs_per_iaa" number of WQs will be considered as
+    "compress only". The remaining WQs will be considered as "decomp only".
+
+    If the device has fewer WQs than "g_comp_wqs_per_iaa", all the
+    device's WQs will be considered "generic", i.e., the driver will
+    submit compress and decompress jobs to all the WQs configured for the
+    device.
+
+    For example, if an IAA "X" has 2 WQs, this will set up 1 decompress WQ and
+    1 compress WQ::
+
+      echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
+
+     wqX.0: decompress jobs only.
+     wqX.1: compress jobs only.
+
+    This setting would typically benefit workloads that see a high
+    level of compress and decompress activity.
+
+    If an IAA has 1 WQ, that WQ will be considered "generic": the driver
+    will submit compress and decompress jobs to the same WQ (this is
+    independent of the "g_comp_wqs_per_iaa" setting):
+
+     wqX.0: compress and decompress jobs.
+
+    This would typically benefit workloads that see significant cold
+    memory being reclaimed, and consequently, high swapout and low swapin
+    activity.
+
+  - distribute_comps
+
+    Distribute compressions to all IAAs on package (default is Y).
+
+    Assuming the WQ type has been established as
+    compress-only/decompress-only/generic, this setting will determine if
+    the driver will distribute compress jobs to all IAAs on a package
+    (default behavior) or not.
+
+    If this is turned off, the driver will dispatch compress jobs to a
+    given IAA "compression enabled" WQ only from cores that are mapped to
+    that IAA using an algorithm that evenly distributes IAAs per package
+    to cores per package. For example, on a Sapphire Rapids server with
+    56 physical cores and 4 IAAs per package, with Hyperthreading, 28
+    logical cores will be assigned to each IAA. With the
+    "distribute_comps" driver parameter turned off, the driver will send
+    each core's compress jobs only to its assigned IAA device.
+
+    Enabling "distribute_comps" would typically benefit workloads in
+    terms of batch compress latency and throughput.
+
+  - distribute_decomps
+
+    Distribute decompressions to all IAAs on package (default is Y).
+
+    Assuming the WQ type has been established as
+    compress-only/decompress-only/generic, this setting will determine if
+    the driver will distribute decompress jobs to all IAAs on a package
+    (default behavior) or not.
+
+    Enabling "distribute_decomps" would typically benefit workloads that
+    see a high level of compress and decompress activity, especially in
+    terms of p99 decompress latency.
+
+    Recommended settings for best compress/decompress latency, throughput
+    and hence memory savings for a moderately contended server that
+    has more than 1 IAA device enabled on a given package:
+
+      2 WQs per IAA
+      g_comp_wqs_per_iaa = 1 (separate WQ for comps/decomps per IAA)
+      distribute_decomps = Y
+      distribute_comps = Y
+
+    For a system that has only 1 IAA device enabled on a given package,
+    the recommended settings are:
+
+      1 WQ per IAA
+      g_comp_wqs_per_iaa = 0 (same WQ for comps/decomps)
+      distribute_decomps = N
+      distribute_comps = N
+
+    Examples:
+
+    Consider a Sapphire Rapids server with 2 packages, 56 cores and 4 IAAs
+    per package, where each IAA has 2 WQs and these settings are in effect::
+
+      echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
+      echo 1 > /sys/bus/dsa/drivers/crypto/distribute_comps
+      echo 0 > /sys/bus/dsa/drivers/crypto/distribute_decomps
+
+    This enables the following behavior:
+
+      wqX.0: decompress jobs only.
+      wqX.1: compress jobs only.
+
+    Compress jobs from all cores on package-0 will be distributed in
+    round-robin manner to [iax1, iax3, iax5, iax7]'s wqX.1, to maximize
+    compression throughput/latency/memory savings:
+
+      wq1.1
+      wq3.1
+      wq5.1
+      wq7.1
+
+    Likewise, compress jobs from all cores on package-1 will be
+    distributed in round-robin manner to [iax9, iax11, iax13, iax15]'s
+    wqX.1, to maximize compression throughput/latency/memory savings for
+    workloads running on package-1:
+
+      wq9.1
+      wq11.1
+      wq13.1
+      wq15.1
+
+    Decompress jobs will be submitted from mapped logical cores only, as
+    follows:
+
+      package-0:
+
+        CPU   0-13,112-125   14-27,126-139  28-41,140-153  42-55,154-167
+        IAA:  iax1           iax3           iax5           iax7
+        WQ:   wq1.0          wq3.0          wq5.0          wq7.0
+
+      package-1:
+
+        CPU   56-69,168-181  70-83,182-195  84-97,196-209   98-111,210-223
+        IAA:  iax9           iax11          iax13           iax15
+        WQ:   wq9.0          wq11.0         wq13.0          wq15.0
+
+
 .. _iaa_default_config:
 
 IAA Default Configuration
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 56985e395263..549ac98a9366 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -46,6 +46,7 @@ struct iaa_wq {
 	struct idxd_wq		*wq;
 	int			ref;
 	bool			remove;
+	bool			mapped;
 
 	struct iaa_device	*iaa_device;
 
@@ -63,6 +64,13 @@ struct iaa_device_compression_mode {
 	dma_addr_t			aecs_comp_table_dma_addr;
 };
 
+struct wq_table_entry {
+	struct idxd_wq	**wqs;
+	unsigned int	max_wqs;
+	unsigned int	n_wqs;
+	unsigned int	cur_wq;
+};
+
 /* Representation of IAA device with wqs, populated by probe */
 struct iaa_device {
 	struct list_head		list;
@@ -73,19 +81,15 @@ struct iaa_device {
 	int				n_wq;
 	struct list_head		wqs;
 
+	struct wq_table_entry		*generic_wq_table;
+	struct wq_table_entry		*comp_wq_table;
+
 	atomic64_t			comp_calls;
 	atomic64_t			comp_bytes;
 	atomic64_t			decomp_calls;
 	atomic64_t			decomp_bytes;
 };
 
-struct wq_table_entry {
-	struct idxd_wq **wqs;
-	int	max_wqs;
-	int	n_wqs;
-	int	cur_wq;
-};
-
 #define IAA_AECS_ALIGN			32
 
 /*
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 4900f9c72600..2f2dc6987cc6 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -23,32 +23,65 @@
 #define pr_fmt(fmt)			"idxd: " IDXD_SUBDRIVER_NAME ": " fmt
 
 #define IAA_ALG_PRIORITY               300
+#define MAX_PKG_IAA   8
+#define MAX_IAA_WQ    8
 
 /**************************************
  * Driver internal global variables.
  **************************************/
 
 /* number of iaa instances probed */
-static unsigned int nr_iaa;
+static atomic_t nr_iaa = ATOMIC_INIT(0);
 static unsigned int nr_cpus;
-static unsigned int nr_nodes;
-static unsigned int nr_cpus_per_node;
+static unsigned int nr_packages;
+static unsigned int nr_cpus_per_package;
+static atomic_t nr_iaa_per_package = ATOMIC_INIT(0);
 
 /* Number of physical cpus sharing each iaa instance */
-static unsigned int cpus_per_iaa;
+static atomic_t cpus_per_iaa = ATOMIC_INIT(0);
 
-/* Per-cpu lookup table for balanced wqs */
-static struct wq_table_entry __percpu *wq_table;
+/* Per-cpu lookup table for decomp wqs. */
+static struct wq_table_entry __percpu *cpu_decomp_wqs = NULL;
+
+/* Per-cpu lookup table for comp wqs. */
+static struct wq_table_entry __percpu *cpu_comp_wqs = NULL;
+
+/* All decomp wqs from IAAs on a package. */
+static struct wq_table_entry **pkg_global_decomp_wqs = NULL;
+/* All comp wqs from IAAs on a package. */
+static struct wq_table_entry **pkg_global_comp_wqs = NULL;
+
+static struct idxd_wq *first_wq_found = NULL;
 
 LIST_HEAD(iaa_devices);
 DEFINE_MUTEX(iaa_devices_lock);
+DEFINE_MUTEX(first_wq_found_lock);
 
 /* If enabled, IAA hw crypto algos are registered, unavailable otherwise */
 static bool iaa_crypto_enabled;
 static bool iaa_crypto_registered;
+/*
+ * We use the atomic iaa_device_registration_done to know if the
+ * crypto testmgr has been started, and the device has been
+ * registered. Until this is done, the first WQ probed will be
+ * assigned to the per-CPU comp/decomp wq tables.
+ * With the new dynamic package-level rebalancing of WQs being
+ * discovered asynchronously and concurrently with tests
+ * triggered from device registration, this is needed to
+ * determine when it is safe for the rebalancing of decomp/comp
+ * WQs to de-allocate the per-package WQs and re-allocate them
+ * based on the latest number of IAA devices and WQs.
+ */
+static atomic_t iaa_device_registration_done = ATOMIC_INIT(0);
 
 static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
 
+/* Distribute decompressions across all IAAs on the package. */
+static bool iaa_distribute_decomps = true;
+
+/* Distribute compressions across all IAAs on the package. */
+static bool iaa_distribute_comps = true;
+
 /* Verify results of IAA compress or not */
 static bool iaa_verify_compress = true;
 
@@ -86,6 +119,9 @@ static bool async_mode;
 /* Use interrupts */
 static bool use_irq;
 
+/* Number of compress-only wqs per iaa */
+static int g_comp_wqs_per_iaa = 0;
+
 /**************************************************
  * Driver attributes along with get/set functions.
  **************************************************/
@@ -179,6 +215,87 @@ static ssize_t sync_mode_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(sync_mode);
 
+static ssize_t g_comp_wqs_per_iaa_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", g_comp_wqs_per_iaa);
+}
+
+static ssize_t g_comp_wqs_per_iaa_store(struct device_driver *driver,
+				   const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtoint(buf, 10, &g_comp_wqs_per_iaa);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(g_comp_wqs_per_iaa);
+
+static ssize_t distribute_decomps_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", iaa_distribute_decomps);
+}
+
+static ssize_t distribute_decomps_store(struct device_driver *driver,
+					const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtobool(buf, &iaa_distribute_decomps);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(distribute_decomps);
+
+static ssize_t distribute_comps_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", iaa_distribute_comps);
+}
+
+static ssize_t distribute_comps_store(struct device_driver *driver,
+				      const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtobool(buf, &iaa_distribute_comps);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(distribute_comps);
+
 /****************************
  * Driver compression modes.
  ****************************/
@@ -464,32 +581,81 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
  * allocate/init/query/deallocate devices/wqs.
  ***********************************************************/
 
-static struct iaa_device *iaa_device_alloc(void)
+static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 {
 	struct iaa_device *iaa_device;
+	struct wq_table_entry *wqt;
 
 	iaa_device = kzalloc(sizeof(*iaa_device), GFP_KERNEL);
 	if (!iaa_device)
-		return NULL;
+		goto err;
+
+	iaa_device->idxd = idxd;
+
+	/* IAA device's generic/decomp wqs. */
+	iaa_device->generic_wq_table = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+	if (!iaa_device->generic_wq_table)
+		goto err;
+
+	wqt = iaa_device->generic_wq_table;
+
+	wqt->wqs = kzalloc(iaa_device->idxd->max_wqs * sizeof(struct idxd_wq *), GFP_KERNEL);
+	if (!wqt->wqs)
+		goto err;
+
+	wqt->max_wqs = iaa_device->idxd->max_wqs;
+	wqt->n_wqs = 0;
+
+	/*
+	 * IAA device's comp wqs (optional). If the device has more than
+	 * "g_comp_wqs_per_iaa" WQs configured, the last "g_comp_wqs_per_iaa"
+	 * number of WQs will be considered as "comp only". The remaining
+	 * WQs will be considered as "decomp only".
+	 * If the device has fewer WQs than "g_comp_wqs_per_iaa", all the
+	 * device's WQs will be considered "generic", i.e., cores can submit
+	 * comp and decomp jobs to all the WQs configured for the device.
+	 */
+	iaa_device->comp_wq_table = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+	if (!iaa_device->comp_wq_table)
+		goto err;
+
+	wqt = iaa_device->comp_wq_table;
+
+	wqt->wqs = kzalloc(iaa_device->idxd->max_wqs * sizeof(struct idxd_wq *), GFP_KERNEL);
+	if (!wqt->wqs)
+		goto err;
+
+	wqt->max_wqs = iaa_device->idxd->max_wqs;
+	wqt->n_wqs = 0;
 
 	INIT_LIST_HEAD(&iaa_device->wqs);
 
 	return iaa_device;
+
+err:
+	if (iaa_device) {
+		if (iaa_device->generic_wq_table) {
+			kfree(iaa_device->generic_wq_table->wqs);
+			kfree(iaa_device->generic_wq_table);
+		}
+		kfree(iaa_device->comp_wq_table);
+		kfree(iaa_device);
+	}
+
+	return NULL;
 }
 
 static struct iaa_device *add_iaa_device(struct idxd_device *idxd)
 {
 	struct iaa_device *iaa_device;
 
-	iaa_device = iaa_device_alloc();
+	iaa_device = iaa_device_alloc(idxd);
 	if (!iaa_device)
 		return NULL;
 
-	iaa_device->idxd = idxd;
-
 	list_add_tail(&iaa_device->list, &iaa_devices);
 
-	nr_iaa++;
+	atomic_inc(&nr_iaa);
 
 	return iaa_device;
 }
@@ -509,7 +675,7 @@ static void del_iaa_device(struct iaa_device *iaa_device)
 {
 	list_del(&iaa_device->list);
 
-	nr_iaa--;
+	atomic_dec(&nr_iaa);
 }
 
 static void free_iaa_device(struct iaa_device *iaa_device)
@@ -518,6 +684,17 @@ static void free_iaa_device(struct iaa_device *iaa_device)
 		return;
 
 	remove_device_compression_modes(iaa_device);
+
+	if (iaa_device->generic_wq_table) {
+		kfree(iaa_device->generic_wq_table->wqs);
+		kfree(iaa_device->generic_wq_table);
+	}
+
+	if (iaa_device->comp_wq_table) {
+		kfree(iaa_device->comp_wq_table->wqs);
+		kfree(iaa_device->comp_wq_table);
+	}
+
 	kfree(iaa_device);
 }
 
@@ -548,6 +725,7 @@ static int add_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq,
 	iaa_wq->wq = wq;
 	iaa_wq->iaa_device = iaa_device;
 	idxd_wq_set_private(wq, iaa_wq);
+	iaa_wq->mapped = false;
 
 	list_add_tail(&iaa_wq->list, &iaa_device->wqs);
 
@@ -576,7 +754,7 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
 
 			dev_dbg(dev, "removed wq %d from iaa_device %d, n_wq %d, nr_iaa %d\n",
 				wq->id, iaa_device->idxd->id,
-				iaa_device->n_wq, nr_iaa);
+				iaa_device->n_wq, atomic_read(&nr_iaa));
 
 			if (iaa_device->n_wq == 0)
 				del_iaa_device(iaa_device);
@@ -588,6 +766,7 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
 static void remove_iaa_wq(struct idxd_wq *wq)
 {
 	struct iaa_device *iaa_device;
+	unsigned int num_pkg_iaa = 0;
 
 	list_for_each_entry(iaa_device, &iaa_devices, list) {
 		if (iaa_has_wq(iaa_device, wq)) {
@@ -596,12 +775,20 @@ static void remove_iaa_wq(struct idxd_wq *wq)
 		}
 	}
 
-	if (nr_iaa) {
-		cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
-		if (!cpus_per_iaa)
-			cpus_per_iaa = 1;
-	} else
-		cpus_per_iaa = 1;
+	if (atomic_read(&nr_iaa)) {
+		atomic_set(&cpus_per_iaa, (nr_packages * nr_cpus_per_package) / atomic_read(&nr_iaa));
+		if (!atomic_read(&cpus_per_iaa))
+			atomic_set(&cpus_per_iaa, 1);
+
+		num_pkg_iaa = atomic_read(&nr_iaa) / nr_packages;
+		if (!num_pkg_iaa)
+			num_pkg_iaa = 1;
+	} else {
+		atomic_set(&cpus_per_iaa, 1);
+		num_pkg_iaa = 1;
+	}
+
+	atomic_set(&nr_iaa_per_package, num_pkg_iaa);
 }
 
 static void __free_iaa_wq(struct iaa_wq *iaa_wq)
@@ -635,6 +822,7 @@ static int save_iaa_wq(struct idxd_wq *wq)
 	struct pci_dev *pdev;
 	struct device *dev;
 	int ret = 0;
+	unsigned int num_pkg_iaa = 0;
 
 	list_for_each_entry(iaa_device, &iaa_devices, list) {
 		if (iaa_device->idxd == wq->idxd) {
@@ -687,12 +875,19 @@ static int save_iaa_wq(struct idxd_wq *wq)
 		}
 	}
 
-	if (WARN_ON(nr_iaa == 0))
+	if (WARN_ON(atomic_read(&nr_iaa) == 0))
 		return -EINVAL;
 
-	cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
-	if (!cpus_per_iaa)
-		cpus_per_iaa = 1;
+	atomic_set(&cpus_per_iaa, (nr_packages * nr_cpus_per_package) / atomic_read(&nr_iaa));
+	if (!atomic_read(&cpus_per_iaa))
+		atomic_set(&cpus_per_iaa, 1);
+
+	num_pkg_iaa = atomic_read(&nr_iaa) / nr_packages;
+	if (!num_pkg_iaa)
+		num_pkg_iaa = 1;
+
+	atomic_set(&nr_iaa_per_package, num_pkg_iaa);
+
 out:
 	return 0;
 }
@@ -748,105 +943,284 @@ static int iaa_wq_put(struct idxd_wq *wq)
  * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
  ***************************************************************/
 
-static void wq_table_free_entry(int cpu)
+/*
+ * Given a cpu, find the closest IAA instance.
+ */
+static inline int cpu_to_iaa(int cpu)
 {
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+	int package_id, base_iaa, iaa = 0;
+
+	if (!nr_packages || !atomic_read(&nr_iaa_per_package) || !atomic_read(&nr_iaa))
+		return -1;
+
+	package_id = topology_logical_package_id(cpu);
+	base_iaa = package_id * atomic_read(&nr_iaa_per_package);
+	iaa = base_iaa + ((cpu % nr_cpus_per_package) / atomic_read(&cpus_per_iaa));
 
-	kfree(entry->wqs);
-	memset(entry, 0, sizeof(*entry));
+	pr_debug("cpu = %d, package_id = %d, base_iaa = %d, iaa = %d\n",
+		 cpu, package_id, base_iaa, iaa);
+
+	if (iaa >= 0 && iaa < atomic_read(&nr_iaa))
+		return iaa;
+
+	return (atomic_read(&nr_iaa) - 1);
 }
 
-static void wq_table_clear_entry(int cpu)
+static void free_wq_tables(void)
 {
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+	if (cpu_decomp_wqs) {
+		free_percpu(cpu_decomp_wqs);
+		cpu_decomp_wqs = NULL;
+	}
 
-	entry->n_wqs = 0;
-	entry->cur_wq = 0;
-	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+	if (cpu_comp_wqs) {
+		free_percpu(cpu_comp_wqs);
+		cpu_comp_wqs = NULL;
+	}
+
+	pr_debug("freed comp/decomp wq tables\n");
 }
 
-static void clear_wq_table(void)
+static void pkg_global_wqs_dealloc(void)
 {
-	int cpu;
+	int i;
 
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_clear_entry(cpu);
+	if (pkg_global_decomp_wqs) {
+		for (i = 0; i < nr_packages; ++i) {
+			kfree(pkg_global_decomp_wqs[i]->wqs);
+			kfree(pkg_global_decomp_wqs[i]);
+		}
+		kfree(pkg_global_decomp_wqs);
+		pkg_global_decomp_wqs = NULL;
+	}
 
-	pr_debug("cleared wq table\n");
+	if (pkg_global_comp_wqs) {
+		for (i = 0; i < nr_packages; ++i) {
+			kfree(pkg_global_comp_wqs[i]->wqs);
+			kfree(pkg_global_comp_wqs[i]);
+		}
+		kfree(pkg_global_comp_wqs);
+		pkg_global_comp_wqs = NULL;
+	}
 }
 
-static void free_wq_table(void)
+static bool pkg_global_wqs_alloc(void)
 {
-	int cpu;
+	int i;
+
+	pkg_global_decomp_wqs = kcalloc(nr_packages, sizeof(*pkg_global_decomp_wqs), GFP_KERNEL);
+	if (!pkg_global_decomp_wqs)
+		return false;
+
+	for (i = 0; i < nr_packages; ++i) {
+		pkg_global_decomp_wqs[i] = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+		if (!pkg_global_decomp_wqs[i])
+			goto err;
+
+		pkg_global_decomp_wqs[i]->wqs = kcalloc(MAX_PKG_IAA * MAX_IAA_WQ, sizeof(struct idxd_wq *), GFP_KERNEL);
+		if (!pkg_global_decomp_wqs[i]->wqs)
+			goto err;
+
+		pkg_global_decomp_wqs[i]->max_wqs = MAX_PKG_IAA * MAX_IAA_WQ;
+	}
+
+	pkg_global_comp_wqs = kcalloc(nr_packages, sizeof(*pkg_global_comp_wqs), GFP_KERNEL);
+	if (!pkg_global_comp_wqs)
+		goto err;
+
+	for (i = 0; i < nr_packages; ++i) {
+		pkg_global_comp_wqs[i] = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+		if (!pkg_global_comp_wqs[i])
+			goto err;
+
+		pkg_global_comp_wqs[i]->wqs = kcalloc(MAX_PKG_IAA * MAX_IAA_WQ, sizeof(struct idxd_wq *), GFP_KERNEL);
+		if (!pkg_global_comp_wqs[i]->wqs)
+			goto err;
 
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_free_entry(cpu);
+		pkg_global_comp_wqs[i]->max_wqs = MAX_PKG_IAA * MAX_IAA_WQ;
+	}
 
-	free_percpu(wq_table);
+	return true;
 
-	pr_debug("freed wq table\n");
+err:
+	pkg_global_wqs_dealloc();
+	return false;
 }
 
 static int alloc_wq_table(int max_wqs)
 {
-	struct wq_table_entry *entry;
-	int cpu;
-
-	wq_table = alloc_percpu(struct wq_table_entry);
-	if (!wq_table)
+	cpu_decomp_wqs = alloc_percpu_gfp(struct wq_table_entry, GFP_KERNEL | __GFP_ZERO);
+	if (!cpu_decomp_wqs)
 		return -ENOMEM;
 
-	for (cpu = 0; cpu < nr_cpus; cpu++) {
-		entry = per_cpu_ptr(wq_table, cpu);
-		entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
-		if (!entry->wqs) {
-			free_wq_table();
-			return -ENOMEM;
-		}
+	cpu_comp_wqs = alloc_percpu_gfp(struct wq_table_entry, GFP_KERNEL | __GFP_ZERO);
+	if (!cpu_comp_wqs)
+		goto err;
 
-		entry->max_wqs = max_wqs;
-	}
+	if (!pkg_global_wqs_alloc())
+		goto err;
 
 	pr_debug("initialized wq table\n");
 
 	return 0;
+
+err:
+	free_wq_tables();
+	return -ENOMEM;
 }
 
-static void wq_table_add(int cpu, struct idxd_wq *wq)
+/*
+ * The caller should have established that device_iaa_wqs is not empty,
+ * i.e., every IAA device in "iaa_devices" has at least one WQ.
+ */
+static void add_device_wqs_to_wq_table(struct wq_table_entry *dst_wq_table,
+				       struct wq_table_entry *device_wq_table)
 {
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+	int i;
+
+	for (i = 0; i < device_wq_table->n_wqs; ++i)
+		dst_wq_table->wqs[dst_wq_table->n_wqs++] = device_wq_table->wqs[i];
+}
 
-	if (WARN_ON(entry->n_wqs == entry->max_wqs))
+static bool reinit_pkg_global_wqs(bool comp)
+{
+	int cur_iaa = 0, pkg = 0;
+	struct iaa_device *iaa_device;
+	struct wq_table_entry **pkg_wqs = comp ? pkg_global_comp_wqs : pkg_global_decomp_wqs;
+
+	for (pkg = 0; pkg < nr_packages; ++pkg)
+		pkg_wqs[pkg]->n_wqs = 0;
+
+	pkg = 0;
+
+one_iaa_special_case:
+	/* Re-initialize per-package wqs. */
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		struct wq_table_entry *device_wq_table = comp ?
+			((iaa_device->comp_wq_table->n_wqs > 0) ?
+				iaa_device->comp_wq_table : iaa_device->generic_wq_table) :
+			iaa_device->generic_wq_table;
+
+		if (pkg_wqs[pkg]->n_wqs + device_wq_table->n_wqs > pkg_wqs[pkg]->max_wqs) {
+			pkg_wqs[pkg]->wqs = krealloc(pkg_wqs[pkg]->wqs,
+						     ksize(pkg_wqs[pkg]->wqs) +
+						     max((MAX_PKG_IAA * MAX_IAA_WQ), iaa_device->n_wq) * sizeof(struct idxd_wq *),
+						     GFP_KERNEL | __GFP_ZERO);
+			if (!pkg_wqs[pkg]->wqs)
+				return false;
+
+			pkg_wqs[pkg]->max_wqs = ksize(pkg_wqs[pkg]->wqs)/sizeof(struct idxd_wq *);
+		}
+
+		add_device_wqs_to_wq_table(pkg_wqs[pkg], device_wq_table);
+
+		pr_info("pkg_global_%s_wqs[%d] has %u n_wqs %u max_wqs\n",
+			(comp ? "comp" : "decomp"), pkg, pkg_wqs[pkg]->n_wqs, pkg_wqs[pkg]->max_wqs);
+
+		if (++cur_iaa == atomic_read(&nr_iaa_per_package)) {
+			if (++pkg == nr_packages)
+				break;
+			cur_iaa = 0;
+			if (atomic_read(&nr_iaa) == 1)
+				goto one_iaa_special_case;
+		}
+	}
+
+	return true;
+}
+
+static void create_cpu_wq_table(int cpu, struct wq_table_entry *wq_table, bool comp)
+{
+	struct wq_table_entry *entry = comp ?
+		per_cpu_ptr(cpu_comp_wqs, cpu) :
+		per_cpu_ptr(cpu_decomp_wqs, cpu);
+
+	if (!iaa_crypto_enabled || !atomic_read(&iaa_device_registration_done)) {
+		mutex_lock(&first_wq_found_lock);
+
+		BUG_ON(!first_wq_found && !wq_table->n_wqs);
+
+		if (!first_wq_found)
+			first_wq_found = wq_table->wqs[0];
+
+		mutex_unlock(&first_wq_found_lock);
+
+		entry->wqs = &first_wq_found;
+		entry->max_wqs = 1;
+		entry->n_wqs = 1;
+		entry->cur_wq = 0;
+		pr_info("%s: cpu %d: added %u first_wq_found for %s wqs up to wq %d.%d\n", __func__,
+			 cpu, entry->n_wqs, comp ? "comp":"decomp",
+			 entry->wqs[entry->n_wqs - 1]->idxd->id,
+			 entry->wqs[entry->n_wqs - 1]->id);
 		return;
+	}
+
+	entry->wqs = wq_table->wqs;
+	entry->max_wqs = wq_table->max_wqs;
+	entry->n_wqs = wq_table->n_wqs;
+	entry->cur_wq = 0;
+
+	if (entry->n_wqs)
+		pr_info("%s: cpu %d: added %u iaa %s wqs up to wq %d.%d: entry->max_wqs = %u\n", __func__,
+			 cpu, entry->n_wqs, comp ? "comp":"decomp",
+			 entry->wqs[entry->n_wqs - 1]->idxd->id, entry->wqs[entry->n_wqs - 1]->id,
+			 entry->max_wqs);
+}
+
+static void set_cpu_wq_table_start_wq(int cpu, bool comp)
+{
+	struct wq_table_entry *entry = comp ?
+		per_cpu_ptr(cpu_comp_wqs, cpu) :
+		per_cpu_ptr(cpu_decomp_wqs, cpu);
+	unsigned int num_pkg_iaa = atomic_read(&nr_iaa_per_package);
 
-	entry->wqs[entry->n_wqs++] = wq;
+	int start_wq = (entry->n_wqs / num_pkg_iaa) * (cpu_to_iaa(cpu) % num_pkg_iaa);
 
-	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
-		 entry->wqs[entry->n_wqs - 1]->idxd->id,
-		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+	if ((start_wq >= 0) && (start_wq < entry->n_wqs))
+		entry->cur_wq = start_wq;
 }
 
-static int wq_table_add_wqs(int iaa, int cpu)
+static void create_cpu_wq_table_from_pkg_wqs(bool comp)
+{
+	int cpu;
+
+	/*
+	 * All CPUs on the same package share the same "package global"
+	 * [de]comp_wqs.
+	 */
+	for (cpu = 0; cpu < nr_cpus; cpu += nr_cpus_per_package) {
+		int package_id = topology_logical_package_id(cpu);
+		struct wq_table_entry *pkg_wq_table = comp ?
+			((pkg_global_comp_wqs[package_id]->n_wqs > 0) ?
+				pkg_global_comp_wqs[package_id] : pkg_global_decomp_wqs[package_id])
+			: pkg_global_decomp_wqs[package_id];
+		int pkg_cpu;
+
+		for (pkg_cpu = cpu; pkg_cpu < cpu + nr_cpus_per_package; ++pkg_cpu) {
+			/* Initialize decomp/comp wq_table for CPU. */
+			create_cpu_wq_table(pkg_cpu, pkg_wq_table, comp);
+			/* Stagger the starting WQ in the package WQ table, for each CPU. */
+			set_cpu_wq_table_start_wq(pkg_cpu, comp);
+		}
+	}
+}
+
+static int add_mapped_device_wq_table_for_cpu(int iaa, int cpu, bool comp)
 {
 	struct iaa_device *iaa_device, *found_device = NULL;
-	int ret = 0, cur_iaa = 0, n_wqs_added = 0;
-	struct idxd_device *idxd;
-	struct iaa_wq *iaa_wq;
-	struct pci_dev *pdev;
-	struct device *dev;
+	struct wq_table_entry *device_wq_table;
+	int ret = 0, cur_iaa = 0;
 
 	list_for_each_entry(iaa_device, &iaa_devices, list) {
-		idxd = iaa_device->idxd;
-		pdev = idxd->pdev;
-		dev = &pdev->dev;
-
 		if (cur_iaa != iaa) {
 			cur_iaa++;
 			continue;
 		}
 
 		found_device = iaa_device;
-		dev_dbg(dev, "getting wq from iaa_device %d, cur_iaa %d\n",
+		dev_dbg(&found_device->idxd->pdev->dev,
+			"getting wq from iaa_device %d, cur_iaa %d\n",
 			found_device->idxd->id, cur_iaa);
 		break;
 	}
@@ -861,100 +1235,176 @@ static int wq_table_add_wqs(int iaa, int cpu)
 		}
 		cur_iaa = 0;
 
-		idxd = found_device->idxd;
-		pdev = idxd->pdev;
-		dev = &pdev->dev;
-		dev_dbg(dev, "getting wq from only iaa_device %d, cur_iaa %d\n",
+		dev_dbg(&found_device->idxd->pdev->dev,
+			"getting wq from only iaa_device %d, cur_iaa %d\n",
 			found_device->idxd->id, cur_iaa);
 	}
 
-	list_for_each_entry(iaa_wq, &found_device->wqs, list) {
-		wq_table_add(cpu, iaa_wq->wq);
-		pr_debug("rebalance: added wq for cpu=%d: iaa wq %d.%d\n",
-			 cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
-		n_wqs_added++;
-	}
+	device_wq_table = comp ?
+		((found_device->comp_wq_table->n_wqs > 0) ?
+			found_device->comp_wq_table : found_device->generic_wq_table) :
+		found_device->generic_wq_table;
+
+	create_cpu_wq_table(cpu, device_wq_table, comp);
 
-	if (!n_wqs_added) {
-		pr_debug("couldn't find any iaa wqs!\n");
-		ret = -EINVAL;
-		goto out;
-	}
 out:
 	return ret;
 }
 
-/*
- * Rebalance the wq table so that given a cpu, it's easy to find the
- * closest IAA instance.  The idea is to try to choose the most
- * appropriate IAA instance for a caller and spread available
- * workqueues around to clients.
- */
-static void rebalance_wq_table(void)
+static void create_cpu_wq_table_from_mapped_device(bool comp)
 {
-	const struct cpumask *node_cpus;
-	int node, cpu, iaa = -1;
+	int cpu, iaa;
 
-	if (nr_iaa == 0)
-		return;
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		iaa = cpu_to_iaa(cpu);
+		pr_debug("rebalance: cpu=%d iaa=%d\n", cpu, iaa);
 
-	pr_debug("rebalance: nr_nodes=%d, nr_cpus %d, nr_iaa %d, cpus_per_iaa %d\n",
-		 nr_nodes, nr_cpus, nr_iaa, cpus_per_iaa);
+		if (WARN_ON(iaa == -1)) {
+			pr_debug("rebalance (cpu_to_iaa(%d)) failed!\n", cpu);
+			return;
+		}
 
-	clear_wq_table();
+		if (WARN_ON(add_mapped_device_wq_table_for_cpu(iaa, cpu, comp))) {
+			pr_debug("could not add any wqs of iaa %d to cpu %d!\n", iaa, cpu);
+			return;
+		}
+	}
+}
 
-	if (nr_iaa == 1) {
-		for (cpu = 0; cpu < nr_cpus; cpu++) {
-			if (WARN_ON(wq_table_add_wqs(0, cpu))) {
-				pr_debug("could not add any wqs for iaa 0 to cpu %d!\n", cpu);
-				return;
-			}
+static int map_iaa_device_wqs(struct iaa_device *iaa_device)
+{
+	struct wq_table_entry *generic, *for_comps;
+	int ret = 0, n_wqs_added = 0;
+	struct iaa_wq *iaa_wq;
+
+	generic = iaa_device->generic_wq_table;
+	for_comps = iaa_device->comp_wq_table;
+
+	list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+		if (iaa_wq->mapped && ++n_wqs_added)
+			continue;
+
+		pr_debug("iaa_device %px: processing wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+
+		if ((!n_wqs_added || ((n_wqs_added + g_comp_wqs_per_iaa) < iaa_device->n_wq)) &&
+			(generic->n_wqs < generic->max_wqs)) {
+
+			generic->wqs[generic->n_wqs++] = iaa_wq->wq;
+			pr_debug("iaa_device %px: added decomp wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+		} else {
+			if (WARN_ON(for_comps->n_wqs == for_comps->max_wqs))
+				break;
+
+			for_comps->wqs[for_comps->n_wqs++] = iaa_wq->wq;
+			pr_debug("iaa_device %px: added comp wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
 		}
 
-		return;
+		iaa_wq->mapped = true;
+		++n_wqs_added;
 	}
 
-	for_each_node_with_cpus(node) {
-		node_cpus = cpumask_of_node(node);
+	if (!n_wqs_added && !iaa_device->n_wq) {
+		pr_debug("iaa_device %d: couldn't find any iaa wqs!\n", iaa_device->idxd->id);
+		ret = -EINVAL;
+	}
 
-		for (cpu = 0; cpu <  cpumask_weight(node_cpus); cpu++) {
-			int node_cpu = cpumask_nth(cpu, node_cpus);
+	return ret;
+}
 
-			if (WARN_ON(node_cpu >= nr_cpu_ids)) {
-				pr_debug("node_cpu %d doesn't exist!\n", node_cpu);
-				return;
-			}
+static void map_iaa_devices(void)
+{
+	struct iaa_device *iaa_device;
 
-			if ((cpu % cpus_per_iaa) == 0)
-				iaa++;
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		BUG_ON(map_iaa_device_wqs(iaa_device));
+	}
+}
 
-			if (WARN_ON(wq_table_add_wqs(iaa, node_cpu))) {
-				pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
-				return;
-			}
-		}
+/*
+ * Rebalance the per-cpu wq table based on available IAA devices/WQs.
+ * Three driver parameters control how this algorithm works:
+ *
+ * - g_comp_wqs_per_iaa:
+ *
+ *   If multiple WQs are configured for a given device, this setting determines
+ *   the number of WQs to be used as "compress only" WQs. The remaining WQs will
+ *   be used as "decompress only" WQs.
+ *   Note that the comp WQ can be the same as the decomp WQ, e.g., if
+ *   g_comp_wqs_per_iaa is 0 (regardless of the # of available WQs per device), or,
+ *   if there is only 1 WQ configured for a device (regardless of
+ *   g_comp_wqs_per_iaa).
+ *
+ * - distribute_decomps, distribute_comps:
+ *
+ *   If this is enabled, all [de]comp WQs found from the IAA devices on a
+ *   package, will be aggregated into pkg_global_[de]comp_wqs, then assigned to
+ *   each CPU on the package.
+ */
+static bool rebalance_wq_table(void)
+{
+	if (atomic_read(&nr_iaa) == 0)
+		return true;
+
+	map_iaa_devices();
+
+	pr_info("rebalance: nr_packages=%d, nr_cpus %d, nr_iaa %d, nr_iaa_per_package %d, cpus_per_iaa %d\n",
+		nr_packages, nr_cpus, atomic_read(&nr_iaa),
+		atomic_read(&nr_iaa_per_package), atomic_read(&cpus_per_iaa));
+
+	if (iaa_distribute_decomps) {
+		/* Each CPU uses all IAA devices on package for decomps. */
+		if (!reinit_pkg_global_wqs(false))
+			return false;
+		create_cpu_wq_table_from_pkg_wqs(false);
+	} else {
+		/*
+		 * Each CPU uses the decomp WQ on the mapped IAA device using
+		 * a balanced mapping of cores to IAA.
+		 */
+		create_cpu_wq_table_from_mapped_device(false);
+	}
+
+	if (iaa_distribute_comps) {
+		/* Each CPU uses all IAA devices on package for comps. */
+		if (!reinit_pkg_global_wqs(true))
+			return false;
+		create_cpu_wq_table_from_pkg_wqs(true);
+	} else {
+		/*
+		 * Each CPU uses the comp WQ on the mapped IAA device using
+		 * a balanced mapping of cores to IAA.
+		 */
+		create_cpu_wq_table_from_mapped_device(true);
 	}
+
+	pr_debug("Finished rebalance decomp/comp wqs.\n");
+	return true;
 }
 
 /***************************************************************
  * Assign work-queues for driver ops using per-cpu wq_tables.
  ***************************************************************/
 
-static struct idxd_wq *wq_table_next_wq(int cpu)
+static struct idxd_wq *decomp_wq_table_next_wq(int cpu)
 {
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+	struct wq_table_entry *entry = per_cpu_ptr(cpu_decomp_wqs, cpu);
+	struct idxd_wq *wq = entry->wqs[entry->cur_wq];
 
-	if (++entry->cur_wq >= entry->n_wqs)
+	if (++entry->cur_wq == entry->n_wqs)
 		entry->cur_wq = 0;
 
-	if (!entry->wqs[entry->cur_wq])
-		return NULL;
+	return wq;
+}
 
-	pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
-		 entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
-		 entry->wqs[entry->cur_wq]->id, cpu);
+static struct idxd_wq *comp_wq_table_next_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(cpu_comp_wqs, cpu);
+	struct idxd_wq *wq = entry->wqs[entry->cur_wq];
+
+	if (++entry->cur_wq == entry->n_wqs)
+		entry->cur_wq = 0;
 
-	return entry->wqs[entry->cur_wq];
+	return wq;
 }
 
 /*************************************************
@@ -1527,7 +1977,7 @@ static int iaa_comp_acompress(struct acomp_req *req)
 	}
 
 	cpu = get_cpu();
-	wq = wq_table_next_wq(cpu);
+	wq = comp_wq_table_next_wq(cpu);
 	put_cpu();
 	if (!wq) {
 		pr_debug("no wq configured for cpu=%d\n", cpu);
@@ -1625,7 +2075,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 	}
 
 	cpu = get_cpu();
-	wq = wq_table_next_wq(cpu);
+	wq = decomp_wq_table_next_wq(cpu);
 	put_cpu();
 	if (!wq) {
 		pr_debug("no wq configured for cpu=%d\n", cpu);
@@ -1728,17 +2178,20 @@ static int iaa_register_compression_device(void)
 
 	ret = crypto_register_acomp(&iaa_acomp_fixed_deflate);
 	if (ret) {
+		atomic_set(&iaa_device_registration_done, 0);
 		pr_err("deflate algorithm acomp fixed registration failed (%d)\n", ret);
 		goto out;
 	}
 
 	iaa_crypto_registered = true;
+	atomic_set(&iaa_device_registration_done, 1);
 out:
 	return ret;
 }
 
 static int iaa_unregister_compression_device(void)
 {
+	atomic_set(&iaa_device_registration_done, 0);
 	if (iaa_crypto_registered)
 		crypto_unregister_acomp(&iaa_acomp_fixed_deflate);
 
@@ -1760,10 +2213,13 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 	if (data->type != IDXD_TYPE_IAX)
 		return -ENODEV;
 
+	mutex_lock(&iaa_devices_lock);
+
 	mutex_lock(&wq->wq_lock);
 
 	if (idxd_wq_get_private(wq)) {
 		mutex_unlock(&wq->wq_lock);
+		mutex_unlock(&iaa_devices_lock);
 		return -EBUSY;
 	}
 
@@ -1785,8 +2241,6 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 		goto err;
 	}
 
-	mutex_lock(&iaa_devices_lock);
-
 	if (list_empty(&iaa_devices)) {
 		ret = alloc_wq_table(wq->idxd->max_wqs);
 		if (ret)
@@ -1798,7 +2252,10 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 	if (ret)
 		goto err_save;
 
-	rebalance_wq_table();
+	if (!rebalance_wq_table()) {
+		dev_dbg(dev, "iaa_crypto_probe: IAA rebalancing device wq tables failed\n");
+		goto err_register;
+	}
 
 	if (first_wq) {
 		iaa_crypto_enabled = true;
@@ -1808,14 +2265,22 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 			dev_dbg(dev, "IAA compression device registration failed\n");
 			goto err_register;
 		}
+
+		BUG_ON(!atomic_read(&iaa_device_registration_done));
+		if (!rebalance_wq_table()) {
+			iaa_crypto_enabled = false;
+			dev_dbg(dev, "iaa_crypto_probe: Rerun after registration: IAA rebalancing device wq tables failed\n");
+			goto err_register;
+		}
+
 		try_module_get(THIS_MODULE);
 
 		pr_info("iaa_crypto now ENABLED\n");
 	}
 
-	mutex_unlock(&iaa_devices_lock);
 out:
 	mutex_unlock(&wq->wq_lock);
+	mutex_unlock(&iaa_devices_lock);
 
 	return ret;
 
@@ -1824,9 +2289,8 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 	free_iaa_wq(idxd_wq_get_private(wq));
 err_save:
 	if (first_wq)
-		free_wq_table();
+		free_wq_tables();
 err_alloc:
-	mutex_unlock(&iaa_devices_lock);
 	idxd_drv_disable_wq(wq);
 err:
 	wq->type = IDXD_WQT_NONE;
@@ -1843,8 +2307,8 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 
 	idxd_wq_quiesce(wq);
 
-	mutex_lock(&wq->wq_lock);
 	mutex_lock(&iaa_devices_lock);
+	mutex_lock(&wq->wq_lock);
 
 	remove_iaa_wq(wq);
 
@@ -1870,18 +2334,26 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 	}
 
 	idxd_drv_disable_wq(wq);
-	rebalance_wq_table();
 
-	if (nr_iaa == 0) {
+	if (!rebalance_wq_table()) {
+		pr_debug("iaa_crypto_remove: IAA rebalancing device wq tables failed\n");
+		iaa_crypto_enabled = false;
+	}
+
+	if (atomic_read(&nr_iaa) == 0) {
 		iaa_crypto_enabled = false;
-		free_wq_table();
+		atomic_set(&iaa_device_registration_done, 0);
+		pkg_global_wqs_dealloc();
+		free_wq_tables();
+		BUG_ON(!list_empty(&iaa_devices));
+		INIT_LIST_HEAD(&iaa_devices);
 		module_put(THIS_MODULE);
 
 		pr_info("iaa_crypto now DISABLED\n");
 	}
 out:
-	mutex_unlock(&iaa_devices_lock);
 	mutex_unlock(&wq->wq_lock);
+	mutex_unlock(&iaa_devices_lock);
 }
 
 static enum idxd_dev_type dev_types[] = {
@@ -1900,16 +2372,11 @@ static struct idxd_device_driver iaa_crypto_driver = {
 static int __init iaa_crypto_init_module(void)
 {
 	int ret = 0;
-	int node;
+	INIT_LIST_HEAD(&iaa_devices);
 
 	nr_cpus = num_possible_cpus();
-	for_each_node_with_cpus(node)
-		nr_nodes++;
-	if (!nr_nodes) {
-		pr_err("IAA couldn't find any nodes with cpus\n");
-		return -ENODEV;
-	}
-	nr_cpus_per_node = nr_cpus / nr_nodes;
+	nr_cpus_per_package = topology_num_cores_per_package();
+	nr_packages = topology_max_packages();
 
 	ret = iaa_aecs_init_fixed();
 	if (ret < 0) {
@@ -1923,6 +2390,27 @@ static int __init iaa_crypto_init_module(void)
 		goto err_driver_reg;
 	}
 
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				&driver_attr_g_comp_wqs_per_iaa);
+	if (ret) {
+		pr_debug("IAA g_comp_wqs_per_iaa attr creation failed\n");
+		goto err_g_comp_wqs_per_iaa_attr_create;
+	}
+
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				 &driver_attr_distribute_decomps);
+	if (ret) {
+		pr_debug("IAA distribute_decomps attr creation failed\n");
+		goto err_distribute_decomps_attr_create;
+	}
+
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				 &driver_attr_distribute_comps);
+	if (ret) {
+		pr_debug("IAA distribute_comps attr creation failed\n");
+		goto err_distribute_comps_attr_create;
+	}
+
 	ret = driver_create_file(&iaa_crypto_driver.drv,
 				 &driver_attr_verify_compress);
 	if (ret) {
@@ -1948,6 +2436,15 @@ static int __init iaa_crypto_init_module(void)
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
 err_verify_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_distribute_comps);
+err_distribute_comps_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_distribute_decomps);
+err_distribute_decomps_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_comp_wqs_per_iaa);
+err_g_comp_wqs_per_iaa_attr_create:
 	idxd_driver_unregister(&iaa_crypto_driver);
 err_driver_reg:
 	iaa_aecs_cleanup_fixed();
@@ -1966,6 +2463,12 @@ static void __exit iaa_crypto_cleanup_module(void)
 			   &driver_attr_sync_mode);
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_distribute_comps);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_distribute_decomps);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_comp_wqs_per_iaa);
 	idxd_driver_unregister(&iaa_crypto_driver);
 	iaa_aecs_cleanup_fixed();
 
-- 
2.27.0



* [PATCH v9 07/19] crypto: iaa - Define and use req->data instead of req->base.data.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (5 preceding siblings ...)
  2025-04-30 20:52 ` [PATCH v9 06/19] crypto: iaa - New architecture for IAA device WQ comp/decomp usage & core mapping Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-04-30 20:52 ` [PATCH v9 08/19] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto Kanchana P Sridhar
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Since req->base.data is for the user and not for the driver, we define a
"void *data" in struct acomp_req for use by driver code.

At present, iaa_crypto saves the "struct idxd_desc *idxd_desc" that is
allocated in iaa_[de]compress(), in req->data. When batching is
introduced in subsequent patches, we will need to support an async
"submit-poll" mechanism to achieve parallelism using IAA hardware. To
accomplish this, we will submit the descriptors for each request in the
batch in iaa_[de]compress(), and return -EINPROGRESS. The polling
function will retrieve the descriptor from req->data to check the
request's completion status.
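
The submit-poll flow described above can be sketched in user space as
follows. This is only an illustration under stated assumptions: the
mock_* types and functions are hypothetical stand-ins, not the kernel's
struct acomp_req, struct idxd_desc, or any iaa_crypto function. The only
property mirrored from this patch is that the driver-private descriptor
is stashed in a "void *data" member at submit time and recovered from
it at poll time.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a hardware descriptor. */
struct mock_desc {
	int polls_left;	/* polls remaining until the mock hardware completes */
	int status;	/* completion status, 0 on success */
};

/* Hypothetical stand-in for a request with a driver-private pointer. */
struct mock_req {
	void *data;	/* plays the role of the new acomp_req->data */
};

/* Submit: save the descriptor in req->data and report "in progress". */
static int mock_submit(struct mock_req *req, struct mock_desc *desc, int latency)
{
	desc->polls_left = latency;
	desc->status = 0;
	req->data = desc;
	return -EINPROGRESS;
}

/* Poll: recover the descriptor from req->data and check for completion. */
static int mock_poll(struct mock_req *req)
{
	struct mock_desc *desc = req->data;

	if (!desc)
		return -EINVAL;
	if (desc->polls_left > 0) {
		desc->polls_left--;
		return -EAGAIN;
	}
	req->data = NULL;
	return desc->status;
}

/* Submit every request in the batch, then poll until all have completed. */
int mock_batch(struct mock_req *reqs, struct mock_desc *descs, int n)
{
	int done = 0, pending = n, i;

	for (i = 0; i < n; i++) {
		int ret = mock_submit(&reqs[i], &descs[i], i + 1);

		assert(ret == -EINPROGRESS);
	}

	while (pending) {
		for (i = 0; i < n; i++) {
			int ret;

			if (!reqs[i].data)
				continue;	/* already completed */
			ret = mock_poll(&reqs[i]);
			if (ret == -EAGAIN)
				continue;	/* still in flight */
			pending--;
			if (!ret)
				done++;
		}
	}
	return done;
}
```

In this sketch, all submissions are issued before the first poll, which
is what would let a hardware accelerator work on the batch in parallel.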

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 12 +++++++-----
 include/crypto/acompress.h                 |  2 ++
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 2f2dc6987cc6..0b821b8b4264 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1718,7 +1718,7 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 	iaa_wq_put(idxd_desc->wq);
 }
 
-static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
+static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 			struct idxd_wq *wq,
 			dma_addr_t src_addr, unsigned int slen,
 			dma_addr_t dst_addr, unsigned int *dlen,
@@ -1778,8 +1778,9 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 			" src_addr %llx, dst_addr %llx\n", __func__,
 			active_compression_mode->name,
 			src_addr, dst_addr);
-	} else if (ctx->async_mode)
-		req->base.data = idxd_desc;
+	} else if (ctx->async_mode) {
+		req->data = idxd_desc;
+	}
 
 	dev_dbg(dev, "%s: compression mode %s,"
 		" desc->src1_addr %llx, desc->src1_size %d,"
@@ -1889,8 +1890,9 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 			" src_addr %llx, dst_addr %llx\n", __func__,
 			active_compression_mode->name,
 			src_addr, dst_addr);
-	} else if (ctx->async_mode && !disable_async)
-		req->base.data = idxd_desc;
+	} else if (ctx->async_mode && !disable_async) {
+		req->data = idxd_desc;
+	}
 
 	dev_dbg(dev, "%s: decompression mode %s,"
 		" desc->src1_addr %llx, desc->src1_size %d,"
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 267d557daeb1..01389fd7055f 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -81,6 +81,7 @@ struct acomp_req_chain {
  * @doff:	Destination folio offset
  * @slen:	Size of the input buffer
  * @dlen:	Size of the output buffer and number of bytes produced
+ * @data:	Private API code data, do not use
  * @chain:	Private API code data, do not use
  * @__ctx:	Start of private context data
  */
@@ -101,6 +102,7 @@ struct acomp_req {
 	unsigned int slen;
 	unsigned int dlen;
 
+	void *data;
 	struct acomp_req_chain chain;
 
 	void *__ctx[] CRYPTO_MINALIGN_ATTR;
-- 
2.27.0



* [PATCH v9 08/19] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (6 preceding siblings ...)
  2025-04-30 20:52 ` [PATCH v9 07/19] crypto: iaa - Define and use req->data instead of req->base.data Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-04-30 20:52 ` [PATCH v9 09/19] crypto: iaa - CRYPTO_ACOMP_REQ_POLL acomp_req flag for sequential vs. parallel Kanchana P Sridhar
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch modifies the descriptor allocation from blocking to non-blocking
with bounded retries or "timeouts".

This is necessary to prevent task blocked errors in high contention
scenarios, for instance, when the platform has only 1 IAA device
enabled. With 1 IAA device enabled per package on a dual-package
Sapphire Rapids with 56 cores/package, there are 112 logical cores
mapped to this single IAA device. In this scenario, the task blocked
errors can occur because idxd_alloc_desc() is called with
IDXD_OP_BLOCK. With batching, multiple descriptors will need to be
allocated per batch. Any process that is able to do so can cause
contention for allocating descriptors for all other processes that share
the use of the same sbitmap_queue. Under IDXD_OP_BLOCK, this can cause
compress/decompress jobs to stall in stress test scenarios
(e.g. zswap_store() of 2M folios).

To make the iaa_crypto driver more fail-safe, this commit implements the
following:

1) Change compress/decompress descriptor allocations to be non-blocking
   with retries ("timeouts").
2) Return compress error to zswap if descriptor allocation with timeouts
   fails during compress ops. zswap_store() will return an error and the
   folio gets stored in the backing swap device.
3) Fall back to software decompress if descriptor allocation with timeouts
   fails during decompress ops.
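
The three points above can be sketched in user space as follows. This
is a minimal illustration under stated assumptions, not the driver
code: the mock_* names and the descriptor pool are hypothetical, NULL
stands in for ERR_PTR(-EAGAIN), and the retry bounds simply reuse the
values of IAA_ALLOC_DESC_COMP_TIMEOUT and IAA_ALLOC_DESC_DECOMP_TIMEOUT
from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MOCK_COMP_RETRIES	1000	/* same value as IAA_ALLOC_DESC_COMP_TIMEOUT */
#define MOCK_DECOMP_RETRIES	 500	/* same value as IAA_ALLOC_DESC_DECOMP_TIMEOUT */

/* Hypothetical pool: the next busy_calls allocation attempts fail. */
struct mock_pool {
	int busy_calls;
	int slot;
};

enum mock_path { MOCK_HW, MOCK_SW_FALLBACK };

/* Non-blocking attempt: returns NULL instead of sleeping when busy. */
static int *mock_alloc_nonblock(struct mock_pool *pool)
{
	assert(pool);
	if (pool->busy_calls > 0) {
		pool->busy_calls--;
		return NULL;
	}
	return &pool->slot;
}

/* 1) Retry the non-blocking allocation at most max_retries times. */
static int *mock_alloc_bounded(struct mock_pool *pool, unsigned int max_retries)
{
	unsigned int retries = 0;
	int *desc = NULL;

	while (!desc && retries++ < max_retries)
		desc = mock_alloc_nonblock(pool);

	return desc;
}

/* 2) Compress: report an error to the caller if no descriptor is found. */
int mock_compress(struct mock_pool *pool)
{
	return mock_alloc_bounded(pool, MOCK_COMP_RETRIES) ? 0 : -EAGAIN;
}

/* 3) Decompress: take the software path if no descriptor is found. */
enum mock_path mock_decompress(struct mock_pool *pool)
{
	return mock_alloc_bounded(pool, MOCK_DECOMP_RETRIES) ?
		MOCK_HW : MOCK_SW_FALLBACK;
}
```

The asymmetry is the point of the sketch: a failed compress can be
reported to the caller, whereas a decompress has to produce the data
and therefore takes the software path.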

With these fixes, no task blocked errors are seen under stress testing
conditions, and no performance degradation is observed.

This patch also simplifies the success/error return paths in
iaa_[de]compress() and iaa_compress_verify().

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |  3 +
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 84 ++++++++++++----------
 2 files changed, 48 insertions(+), 39 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 549ac98a9366..b4a94da2c315 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -21,6 +21,9 @@
 
 #define IAA_COMPLETION_TIMEOUT		1000000
 
+#define IAA_ALLOC_DESC_COMP_TIMEOUT	   1000
+#define IAA_ALLOC_DESC_DECOMP_TIMEOUT	    500
+
 #define IAA_ANALYTICS_ERROR		0x0a
 #define IAA_ERROR_DECOMP_BUF_OVERFLOW	0x0b
 #define IAA_ERROR_COMP_BUF_OVERFLOW	0x19
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 0b821b8b4264..7dab340c4a34 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1416,6 +1416,7 @@ static int deflate_generic_decompress(struct acomp_req *req)
 	ACOMP_REQUEST_ON_STACK(fbreq, crypto_acomp_reqtfm(req));
 	int ret;
 
+	req->dlen = PAGE_SIZE;
 	acomp_request_set_callback(fbreq, 0, NULL, NULL);
 	acomp_request_set_params(fbreq, req->src, req->dst, req->slen,
 				 req->dlen);
@@ -1536,7 +1537,8 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 	struct iaa_device_compression_mode *active_compression_mode;
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
+	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+	u16 alloc_desc_retries = 0;
 	struct iax_hw_desc *desc;
 	struct idxd_device *idxd;
 	struct iaa_wq *iaa_wq;
@@ -1552,7 +1554,11 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
 
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < IAA_ALLOC_DESC_DECOMP_TIMEOUT)) {
+		idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+		cpu_relax();
+	}
+
 	if (IS_ERR(idxd_desc)) {
 		dev_dbg(dev, "idxd descriptor allocation failed\n");
 		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
@@ -1604,14 +1610,10 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 		goto err;
 	}
 
-	idxd_free_desc(wq, idxd_desc);
-out:
-	return ret;
 err:
 	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
 
-	goto out;
+	return ret;
 }
 
 static void iaa_desc_complete(struct idxd_desc *idxd_desc,
@@ -1727,7 +1729,8 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 	struct iaa_device_compression_mode *active_compression_mode;
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
+	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+	u16 alloc_desc_retries = 0;
 	struct iax_hw_desc *desc;
 	struct idxd_device *idxd;
 	struct iaa_wq *iaa_wq;
@@ -1743,7 +1746,11 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
 
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < IAA_ALLOC_DESC_COMP_TIMEOUT)) {
+		idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+		cpu_relax();
+	}
+
 	if (IS_ERR(idxd_desc)) {
 		dev_dbg(dev, "idxd descriptor allocation failed\n");
 		dev_dbg(dev, "iaa compress failed: ret=%ld\n", PTR_ERR(idxd_desc));
@@ -1820,15 +1827,10 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	*compression_crc = idxd_desc->iax_completion->crc;
 
-	if (!ctx->async_mode)
-		idxd_free_desc(wq, idxd_desc);
-out:
-	return ret;
 err:
 	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
-
-	goto out;
+out:
+	return ret;
 }
 
 static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
@@ -1840,7 +1842,8 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	struct iaa_device_compression_mode *active_compression_mode;
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
+	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+	u16 alloc_desc_retries = 0;
 	struct iax_hw_desc *desc;
 	struct idxd_device *idxd;
 	struct iaa_wq *iaa_wq;
@@ -1856,12 +1859,18 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
 
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < IAA_ALLOC_DESC_DECOMP_TIMEOUT)) {
+		idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+		cpu_relax();
+	}
+
 	if (IS_ERR(idxd_desc)) {
 		dev_dbg(dev, "idxd descriptor allocation failed\n");
 		dev_dbg(dev, "iaa decompress failed: ret=%ld\n",
 			PTR_ERR(idxd_desc));
-		return PTR_ERR(idxd_desc);
+		ret = PTR_ERR(idxd_desc);
+		idxd_desc = NULL;
+		goto fallback_software_decomp;
 	}
 	desc = idxd_desc->iax_hw;
 
@@ -1905,7 +1914,7 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	ret = idxd_submit_desc(wq, idxd_desc);
 	if (ret) {
 		dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
-		goto err;
+		goto fallback_software_decomp;
 	}
 
 	/* Update stats */
@@ -1919,40 +1928,37 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	}
 
 	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+
+fallback_software_decomp:
 	if (ret) {
-		dev_dbg(dev, "%s: check_completion failed ret=%d\n", __func__, ret);
-		if (idxd_desc->iax_completion->status == IAA_ANALYTICS_ERROR) {
+		dev_dbg(dev, "%s: desc allocation/submission/check_completion failed ret=%d\n", __func__, ret);
+		if (idxd_desc && idxd_desc->iax_completion->status == IAA_ANALYTICS_ERROR) {
 			pr_warn("%s: falling back to deflate-generic decompress, "
 				"analytics error code %x\n", __func__,
 				idxd_desc->iax_completion->error_code);
-			ret = deflate_generic_decompress(req);
-			if (ret) {
-				dev_dbg(dev, "%s: deflate-generic failed ret=%d\n",
-					__func__, ret);
-				goto err;
-			}
-		} else {
+		}
+
+		ret = deflate_generic_decompress(req);
+
+		if (ret) {
+			pr_err("%s: iaa decompress failed: deflate-generic fallback to software decompress error ret=%d\n", __func__, ret);
 			goto err;
 		}
 	} else {
 		req->dlen = idxd_desc->iax_completion->output_size;
+
+		/* Update stats */
+		update_total_decomp_bytes_in(slen);
+		update_wq_decomp_bytes(wq, slen);
 	}
 
 	*dlen = req->dlen;
 
-	if (!ctx->async_mode || disable_async)
+err:
+	if (idxd_desc)
 		idxd_free_desc(wq, idxd_desc);
-
-	/* Update stats */
-	update_total_decomp_bytes_in(slen);
-	update_wq_decomp_bytes(wq, slen);
 out:
 	return ret;
-err:
-	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa decompress failed: ret=%d\n", ret);
-
-	goto out;
 }
 
 static int iaa_comp_acompress(struct acomp_req *req)
-- 
2.27.0



* [PATCH v9 09/19] crypto: iaa - CRYPTO_ACOMP_REQ_POLL acomp_req flag for sequential vs. parallel.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (7 preceding siblings ...)
  2025-04-30 20:52 ` [PATCH v9 08/19] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-04-30 20:52 ` [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers Kanchana P Sridhar
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This commit allows kernel users of iaa_crypto, such as zswap, to invoke
the crypto_acomp_compress() API in fully synchronous mode for
non-batching use cases (i.e. today's status quo), where zswap calls
crypto_wait_req(crypto_acomp_compress(req), wait), and to
non-intrusively invoke the fully asynchronous batch compress/decompress
API that will be introduced in subsequent patches. Both use cases need
to reuse the same code paths in the driver to interface with hardware:
the CRYPTO_ACOMP_REQ_POLL flag allows this shared code to determine
whether an acomp_req must be processed synchronously or asynchronously.
The idea is to simplify the crypto_acomp sequential/batching interfaces
for use by zswap.

Thus, regardless of the iaa_crypto driver's 'sync_mode' setting, it
can still be forced to use synchronous mode by turning
off the CRYPTO_ACOMP_REQ_POLL flag in req->base.flags (the default to
support sequential use cases in zswap today).

IAA batching functionality will be implemented in subsequent patches,
which will set the CRYPTO_ACOMP_REQ_POLL flag for the acomp_reqs in a
batch. This enables the iaa_crypto driver to implement true
async "submit-polling" for parallel compressions and decompressions in
the IAA hardware accelerator.

In other words, all three of the following need to be true for a request
to be processed in fully async submit-poll mode:

 1) async_mode should be "true"
 2) use_irq should be "false"
 3) req->base.flags & CRYPTO_ACOMP_REQ_POLL should be "true"

Subsequent patches will:
  - Set (1) and (2) as iaa_crypto defaults once async submit-poll is
    implemented.
  - Enable (3) for iaa_crypto batching, and clear the
    CRYPTO_ACOMP_REQ_POLL flags before exiting from the batching
    routines since the assumption is that the acomp_reqs are
    created/managed by a higher level kernel user such as zswap, and are
    reused for both sequential and batching use cases from zswap's
    perspective.
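
The three conditions listed above can be condensed into a single
predicate. The following is a minimal standalone sketch, not driver
code: struct sketch_ctx, sketch_submit_poll() and SKETCH_REQ_POLL are
hypothetical stand-ins for iaa_compression_ctx, the driver's inline
checks and CRYPTO_ACOMP_REQ_POLL (only the flag value 0x00000001 is
taken from this patch).

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for CRYPTO_ACOMP_REQ_POLL; value as defined in this patch. */
#define SKETCH_REQ_POLL 0x00000001u

/* Hypothetical subset of the per-tfm settings the driver consults. */
struct sketch_ctx {
	bool async_mode;
	bool use_irq;
};

/*
 * Returns true only when all three conditions hold, i.e. the request
 * would be submitted and polled later instead of being waited on inline.
 */
static bool sketch_submit_poll(const struct sketch_ctx *ctx, uint32_t req_flags)
{
	return ctx->async_mode && !ctx->use_irq &&
	       (req_flags & SKETCH_REQ_POLL) != 0;
}
```

With the flag cleared (the default for sequential callers), the
predicate is false regardless of the driver's sync_mode setting, which
is the behavior described above.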

This patch also removes "disable_async" from iaa_decompress().

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 15 +++++++--------
 include/crypto/acompress.h                 |  6 ++++++
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 7dab340c4a34..52fe68606f4d 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1785,7 +1785,7 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 			" src_addr %llx, dst_addr %llx\n", __func__,
 			active_compression_mode->name,
 			src_addr, dst_addr);
-	} else if (ctx->async_mode) {
+	} else if (ctx->async_mode && (req->base.flags & CRYPTO_ACOMP_REQ_POLL)) {
 		req->data = idxd_desc;
 	}
 
@@ -1807,7 +1807,7 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 	update_total_comp_calls();
 	update_wq_comp_calls(wq);
 
-	if (ctx->async_mode) {
+	if (ctx->async_mode && (req->base.flags & CRYPTO_ACOMP_REQ_POLL)) {
 		ret = -EINPROGRESS;
 		dev_dbg(dev, "%s: returning -EINPROGRESS\n", __func__);
 		goto out;
@@ -1836,8 +1836,7 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
 static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 			  struct idxd_wq *wq,
 			  dma_addr_t src_addr, unsigned int slen,
-			  dma_addr_t dst_addr, unsigned int *dlen,
-			  bool disable_async)
+			  dma_addr_t dst_addr, unsigned int *dlen)
 {
 	struct iaa_device_compression_mode *active_compression_mode;
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
@@ -1886,7 +1885,7 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	desc->src1_size = slen;
 	desc->completion_addr = idxd_desc->compl_dma;
 
-	if (ctx->use_irq && !disable_async) {
+	if (ctx->use_irq) {
 		desc->flags |= IDXD_OP_FLAG_RCI;
 
 		idxd_desc->crypto.req = req;
@@ -1899,7 +1898,7 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 			" src_addr %llx, dst_addr %llx\n", __func__,
 			active_compression_mode->name,
 			src_addr, dst_addr);
-	} else if (ctx->async_mode && !disable_async) {
+	} else if (ctx->async_mode && (req->base.flags & CRYPTO_ACOMP_REQ_POLL)) {
 		req->data = idxd_desc;
 	}
 
@@ -1921,7 +1920,7 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	update_total_decomp_calls();
 	update_wq_decomp_calls(wq);
 
-	if (ctx->async_mode && !disable_async) {
+	if (ctx->async_mode && (req->base.flags & CRYPTO_ACOMP_REQ_POLL)) {
 		ret = -EINPROGRESS;
 		dev_dbg(dev, "%s: returning -EINPROGRESS\n", __func__);
 		goto out;
@@ -2127,7 +2126,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 		req->dst, req->dlen, sg_dma_len(req->dst));
 
 	ret = iaa_decompress(tfm, req, wq, src_addr, req->slen,
-			     dst_addr, &req->dlen, false);
+			     dst_addr, &req->dlen);
 	if (ret == -EINPROGRESS)
 		return ret;
 
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 01389fd7055f..939e51d122b0 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -20,6 +20,12 @@
 #include <linux/spinlock_types.h>
 #include <linux/types.h>
 
+/*
+ * If set, the driver must have a way to submit the req, then
+ * poll its completion status for success/error.
+ */
+#define CRYPTO_ACOMP_REQ_POLL		0x00000001
+
 /* Set this bit if source is virtual address instead of SG list. */
 #define CRYPTO_ACOMP_REQ_SRC_VIRT	0x00000002
 
-- 
2.27.0



* [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (8 preceding siblings ...)
  2025-04-30 20:52 ` [PATCH v9 09/19] crypto: iaa - CRYPTO_ACOMP_REQ_POLL acomp_req flag for sequential vs. parallel Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-05-01  1:40   ` Herbert Xu
  2025-04-30 20:52 ` [PATCH v9 11/19] crypto: iaa - Implement crypto_acomp batching interfaces for Intel IAA Kanchana P Sridhar
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This commit adds get_batch_size(), batch_compress() and batch_decompress()
interfaces to:

  struct acomp_alg
  struct crypto_acomp

A crypto_acomp compression algorithm that supports batching of compressions
and decompressions must provide implementations for these interfaces, so
that a higher level module based on crypto_acomp, such as zswap, can
allocate resources for submitting multiple compress/decompress jobs that
can be batched, and invoke batching of [de]compressions.

A new helper function acomp_has_async_batching() can be invoked to query if
a crypto_acomp has registered these batching interfaces.

Further, zswap can invoke the newly added "crypto_acomp_batch_size()"
API to query the maximum number of requests that can be batch
[de]compressed. crypto_acomp_batch_size() returns 1 if the acomp has not
provided an implementation for get_batch_size(). Based on this, zswap
can use the minimum of any zswap-specific upper limits for batch-size
and the compressor's max batch-size, to allocate batching resources.
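
The sizing rule described above can be sketched as follows. This is a
standalone illustration under stated assumptions: SKETCH_ZSWAP_MAX_BATCH
and both helper names are hypothetical, and the cap of 8 is an arbitrary
example value rather than anything zswap defines in this patch.

```c
#include <stdbool.h>

/* Hypothetical zswap-side cap; the name and value are illustrative only. */
#define SKETCH_ZSWAP_MAX_BATCH 8u

/*
 * Mirrors the described fallback: a compressor without batching support
 * reports a batch size of 1.
 */
static unsigned int sketch_alg_batch_size(bool has_batching,
					  unsigned int alg_max)
{
	return has_batching ? alg_max : 1u;
}

/* Number of per-batch resources (requests, buffers) to allocate. */
static unsigned int sketch_nr_batch_resources(bool has_batching,
					      unsigned int alg_max)
{
	unsigned int n = sketch_alg_batch_size(has_batching, alg_max);

	return n < SKETCH_ZSWAP_MAX_BATCH ? n : SKETCH_ZSWAP_MAX_BATCH;
}
```

A non-batching software compressor therefore always ends up with a
single set of resources, which keeps the sequential path unchanged.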

This allows the iaa_crypto Intel IAA driver to register implementations for
the get_batch_size(), batch_compress() and batch_decompress() acomp_alg
interfaces, that can subsequently be invoked from zswap to
compress/decompress pages in parallel in the IAA hardware accelerator to
improve swapout/swapin performance, through these newly added
corresponding crypto_acomp API:

  crypto_acomp_batch_size()
  crypto_acomp_batch_compress()
  crypto_acomp_batch_decompress()

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 crypto/acompress.c                  |   3 +
 include/crypto/acompress.h          | 107 ++++++++++++++++++++++++++++
 include/crypto/internal/acompress.h |  20 ++++++
 3 files changed, 130 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index d08e0fe8cd9e..c7cca5596acf 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -95,6 +95,9 @@ static int crypto_acomp_init_tfm(struct crypto_tfm *tfm)
 
 	acomp->compress = alg->compress;
 	acomp->decompress = alg->decompress;
+	acomp->get_batch_size = alg->get_batch_size;
+	acomp->batch_compress = alg->batch_compress;
+	acomp->batch_decompress = alg->batch_decompress;
 	acomp->reqsize = alg->reqsize;
 
 	acomp->base.exit = crypto_acomp_exit_tfm;
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 939e51d122b0..e50f3e71ba58 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -120,6 +120,10 @@ struct acomp_req {
  *
  * @compress:		Function performs a compress operation
  * @decompress:		Function performs a de-compress operation
+ * @get_batch_size:	Maximum batch-size for batching compress/decompress
+ *			operations.
+ * @batch_compress:	Function performs a batch compress operation.
+ * @batch_decompress:	Function performs a batch decompress operation.
  * @reqsize:		Context size for (de)compression requests
  * @fb:			Synchronous fallback tfm
  * @base:		Common crypto API algorithm data structure
@@ -127,6 +131,22 @@ struct acomp_req {
 struct crypto_acomp {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	unsigned int (*get_batch_size)(void);
+	bool (*batch_compress)(
+		struct acomp_req *reqs[],
+		struct page *pages[],
+		u8 *dsts[],
+		unsigned int dlens[],
+		int errors[],
+		int nr_reqs);
+	bool (*batch_decompress)(
+		struct acomp_req *reqs[],
+		u8 *srcs[],
+		struct page *pages[],
+		unsigned int slens[],
+		unsigned int dlens[],
+		int errors[],
+		int nr_reqs);
 	unsigned int reqsize;
 	struct crypto_acomp *fb;
 	struct crypto_tfm base;
@@ -224,6 +244,13 @@ static inline bool acomp_is_async(struct crypto_acomp *tfm)
 	       CRYPTO_ALG_ASYNC;
 }
 
+static inline bool acomp_has_async_batching(struct crypto_acomp *tfm)
+{
+	return (acomp_is_async(tfm) &&
+		(crypto_comp_alg_common(tfm)->base.cra_flags & CRYPTO_ALG_TYPE_ACOMPRESS) &&
+		tfm->get_batch_size && tfm->batch_compress && tfm->batch_decompress);
+}
+
 static inline struct crypto_acomp *crypto_acomp_reqtfm(struct acomp_req *req)
 {
 	return __crypto_acomp_tfm(req->base.tfm);
@@ -595,4 +622,84 @@ static inline struct acomp_req *acomp_request_on_stack_init(
 	return req;
 }
 
+/**
+ * crypto_acomp_batch_size() -- Get the algorithm's batch size
+ *
+ * Function returns the algorithm's batch size for batching operations
+ *
+ * @tfm:	ACOMPRESS tfm handle allocated with crypto_alloc_acomp()
+ *
+ * Return:	crypto_acomp's batch size.
+ */
+static inline unsigned int crypto_acomp_batch_size(struct crypto_acomp *tfm)
+{
+	if (acomp_has_async_batching(tfm))
+		return tfm->get_batch_size();
+
+	return 1;
+}
+
+/**
+ * crypto_acomp_batch_compress() -- Invoke asynchronous compress of a batch
+ * of requests.
+ *
+ * @reqs: @nr_reqs asynchronous compress requests.
+ * @pages: Pages to be compressed by IAA.
+ * @dsts: Pre-allocated destination buffers to store results of compression.
+ *        Each element of @dsts must be of size "PAGE_SIZE * 2".
+ * @dlens: Will contain the compressed lengths for @pages.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_reqs: The number of requests in @reqs, up to IAA_CRYPTO_MAX_BATCH_SIZE,
+ *           to be compressed.
+ *
+ * Returns true if all compress requests in the batch complete successfully,
+ * false otherwise.
+ */
+static inline bool crypto_acomp_batch_compress(
+	struct acomp_req *reqs[],
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	return tfm->batch_compress(reqs, pages, dsts, dlens, errors, nr_reqs);
+}
+
+/**
+ * crypto_acomp_batch_decompress() -- Invoke asynchronous decompress of a batch
+ * of requests.
+ *
+ * @reqs: @nr_reqs asynchronous decompress requests.
+ * @srcs: Source buffers to be decompressed.
+ * @pages: Destination pages corresponding to @srcs.
+ * @slens: Buffer lengths for @srcs.
+ * @dlens: Will contain the decompressed lengths for @srcs.
+ *	   For batch decompressions, the caller must enforce additional
+ *	   semantics such as, BUG_ON(dlens[i] != PAGE_SIZE) assertions.
+ * @errors: zero on successful decompression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_reqs: The number of requests in @reqs, up to IAA_CRYPTO_MAX_BATCH_SIZE,
+ *           to be decompressed.
+ *
+ * Returns true if all decompress requests in the batch complete successfully,
+ * false otherwise.
+ */
+static inline bool crypto_acomp_batch_decompress(
+	struct acomp_req *reqs[],
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	return tfm->batch_decompress(reqs, srcs, pages, slens, dlens, errors, nr_reqs);
+}
+
 #endif
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index b69d818d7e68..891e40831af8 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -23,6 +23,10 @@
  *
  * @compress:	Function performs a compress operation
  * @decompress:	Function performs a de-compress operation
+ * @get_batch_size:	Maximum batch-size for batching compress/decompress
+ *			operations.
+ * @batch_compress:	Function performs a batch compress operation.
+ * @batch_decompress:	Function performs a batch decompress operation.
  * @init:	Initialize the cryptographic transformation object.
  *		This function is used to initialize the cryptographic
  *		transformation object. This function is called only once at
@@ -43,6 +47,22 @@
 struct acomp_alg {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	unsigned int (*get_batch_size)(void);
+	bool (*batch_compress)(
+		struct acomp_req *reqs[],
+		struct page *pages[],
+		u8 *dsts[],
+		unsigned int dlens[],
+		int errors[],
+		int nr_reqs);
+	bool (*batch_decompress)(
+		struct acomp_req *reqs[],
+		u8 *srcs[],
+		struct page *pages[],
+		unsigned int slens[],
+		unsigned int dlens[],
+		int errors[],
+		int nr_reqs);
 	int (*init)(struct crypto_acomp *tfm);
 	void (*exit)(struct crypto_acomp *tfm);
 
-- 
2.27.0



* [PATCH v9 11/19] crypto: iaa - Implement crypto_acomp batching interfaces for Intel IAA.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (9 preceding siblings ...)
  2025-04-30 20:52 ` [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-04-30 20:52 ` [PATCH v9 12/19] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

iaa_crypto implements the new crypto_acomp get_batch_size() interface
that returns an iaa_driver specific constant, IAA_CRYPTO_MAX_BATCH_SIZE
(set to 8U currently).

This patch also provides the iaa_crypto driver implementations for the
newly added crypto_acomp batch_compress() and batch_decompress()
interfaces.

This allows swap modules such as zswap to allocate required batching
resources and then invoke fully asynchronous batch parallel
compression/decompression of pages on systems with Intel IAA, by
invoking these crypto API, respectively:

   crypto_acomp_batch_size(...);
   crypto_acomp_batch_compress(...);
   crypto_acomp_batch_decompress(...);

This enables zswap compress batching code to be developed in
a manner similar to the current single-page synchronous calls to:

   crypto_acomp_compress(...);

thereby facilitating an encapsulated and modular hand-off between the
kernel mm/zswap code and the crypto_acomp layer.
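
The submit-then-poll structure used by the batch routines in this patch
can be modeled in plain C. The sketch below is a userspace simulation
only: struct sketch_job, sketch_submit() and sketch_poll() are
hypothetical mocks standing in for the hardware and for
iaa_comp_acompress()/iaa_comp_poll(); only the errors[] bookkeeping
follows the patch.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_MAX_BATCH 8

/*
 * Mock job: "completes" after polls_left polls with final_status
 * (0 or a negative errno other than -EAGAIN).
 */
struct sketch_job {
	int polls_left;
	int final_status;
	bool submit_fails;	/* simulate a failed submission */
};

static int sketch_submit(struct sketch_job *job)
{
	return job->submit_fails ? -ENODEV : -EINPROGRESS;
}

static int sketch_poll(struct sketch_job *job)
{
	if (job->polls_left > 0) {
		job->polls_left--;
		return -EAGAIN;
	}
	return job->final_status;
}

/* Returns true if every job in the batch completed successfully. */
static bool sketch_batch(struct sketch_job jobs[], int errors[], int nr)
{
	bool done = false;
	int i, err = 0;

	if (nr > SKETCH_MAX_BATCH)
		return false;

	/* Phase 1: submit everything; mark in-flight jobs with -EAGAIN. */
	for (i = 0; i < nr; ++i) {
		if (sketch_submit(&jobs[i]) != -EINPROGRESS) {
			errors[i] = -EINVAL;
			err = -EINVAL;
		} else {
			errors[i] = -EAGAIN;
		}
	}

	/* Phase 2: keep polling until no job still reports -EAGAIN. */
	while (!done) {
		done = true;
		for (i = 0; i < nr; ++i) {
			if (errors[i] != -EAGAIN)
				continue;
			errors[i] = sketch_poll(&jobs[i]);
			if (errors[i] == -EAGAIN)
				done = false;
			else if (errors[i])
				err = -EINVAL;
		}
	}

	return !err;
}
```

Because all jobs are submitted before the first poll, the slowest job
bounds the total wait, whereas a sequential caller would pay the sum of
the individual waits.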

Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |   9 +
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 288 +++++++++++++++++++++
 2 files changed, 297 insertions(+)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index b4a94da2c315..90ce336879f1 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -42,6 +42,15 @@
 					 IAA_DECOMP_CHECK_FOR_EOB | \
 					 IAA_DECOMP_STOP_ON_EOB)
 
+/*
+ * The maximum compress/decompress batch size for IAA's implementation of
+ * the crypto_acomp batch_compress() and batch_decompress() interfaces.
+ * The IAA compression algorithms should provide the crypto_acomp
+ * get_batch_size() interface through a function that returns this
+ * constant.
+ */
+#define IAA_CRYPTO_MAX_BATCH_SIZE 8U
+
 /* Representation of IAA workqueue */
 struct iaa_wq {
 	struct list_head	list;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 52fe68606f4d..d577f555d6ab 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -2149,6 +2149,291 @@ static void compression_ctx_init(struct iaa_compression_ctx *ctx)
 	ctx->use_irq = use_irq;
 }
 
+static __always_inline unsigned int iaa_comp_get_batch_size(void)
+{
+	return IAA_CRYPTO_MAX_BATCH_SIZE;
+}
+
+static int iaa_comp_poll(struct acomp_req *req)
+{
+	struct idxd_desc *idxd_desc;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	struct idxd_wq *wq;
+	bool compress_op;
+	int ret;
+
+	idxd_desc = req->data;
+	if (!idxd_desc)
+		return -EAGAIN;
+
+	compress_op = (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS);
+	wq = idxd_desc->wq;
+	iaa_wq = idxd_wq_get_private(wq);
+	idxd = iaa_wq->iaa_device->idxd;
+	pdev = idxd->pdev;
+	dev = &pdev->dev;
+
+	ret = check_completion(dev, idxd_desc->iax_completion, compress_op, true);
+	if (ret == -EAGAIN)
+		return ret;
+	if (ret)
+		goto out;
+
+	req->dlen = idxd_desc->iax_completion->output_size;
+
+	/* Update stats */
+	if (compress_op) {
+		update_total_comp_bytes_out(req->dlen);
+		update_wq_comp_bytes(wq, req->dlen);
+	} else {
+		update_total_decomp_bytes_in(req->slen);
+		update_wq_decomp_bytes(wq, req->slen);
+	}
+
+	if (iaa_verify_compress && (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS)) {
+		struct crypto_tfm *tfm = req->base.tfm;
+		dma_addr_t src_addr, dst_addr;
+		u32 compression_crc;
+
+		compression_crc = idxd_desc->iax_completion->crc;
+
+		dma_sync_sg_for_device(dev, req->dst, 1, DMA_FROM_DEVICE);
+		dma_sync_sg_for_device(dev, req->src, 1, DMA_TO_DEVICE);
+
+		src_addr = sg_dma_address(req->src);
+		dst_addr = sg_dma_address(req->dst);
+
+		ret = iaa_compress_verify(tfm, req, wq, src_addr, req->slen,
+					  dst_addr, &req->dlen, compression_crc);
+	}
+out:
+	/* caller doesn't call crypto_wait_req, so no acomp_request_complete() */
+
+	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+
+	idxd_free_desc(idxd_desc->wq, idxd_desc);
+
+	dev_dbg(dev, "%s: returning ret=%d\n", __func__, ret);
+
+	return ret;
+}
+
+static __always_inline void iaa_set_req_poll(
+	struct acomp_req *reqs[],
+	int nr_reqs,
+	bool set_flag)
+{
+	int i;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		set_flag ? (reqs[i]->base.flags |= CRYPTO_ACOMP_REQ_POLL) :
+			   (reqs[i]->base.flags &= ~CRYPTO_ACOMP_REQ_POLL);
+	}
+}
+
+/**
+ * This API provides IAA compress batching functionality for use by swap
+ * modules.
+ *
+ * @reqs: @nr_reqs asynchronous compress requests.
+ * @pages: Pages to be compressed by IAA.
+ * @dsts: Pre-allocated destination buffers to store results of IAA
+ *        compression. Each element of @dsts must be of size "PAGE_SIZE * 2".
+ * @dlens: Will contain the compressed lengths.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_reqs: The number of requests, up to IAA_CRYPTO_MAX_BATCH_SIZE,
+ *           to be compressed.
+ *
+ * Returns true if all compress requests in the batch complete successfully,
+ * false otherwise.
+ */
+static bool iaa_comp_acompress_batch(
+	struct acomp_req *reqs[],
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs)
+{
+	struct scatterlist inputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	struct scatterlist outputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	bool compressions_done = false;
+	int i, err = 0;
+
+	BUG_ON(nr_reqs > IAA_CRYPTO_MAX_BATCH_SIZE);
+
+	iaa_set_req_poll(reqs, nr_reqs, true);
+
+	/*
+	 * Prepare and submit the batch of acomp_reqs to IAA. IAA will process
+	 * these compress jobs in parallel.
+	 */
+	for (i = 0; i < nr_reqs; ++i) {
+		sg_init_table(&inputs[i], 1);
+		sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
+
+		/*
+		 * We need PAGE_SIZE * 2 here since over-compression may occur and
+		 * hardware accelerators may not check the dst buffer size, so give
+		 * the dst buffer enough length to avoid a buffer overflow.
+		 */
+		sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
+		acomp_request_set_params(reqs[i], &inputs[i],
+					 &outputs[i], PAGE_SIZE, PAGE_SIZE);
+
+		errors[i] = iaa_comp_acompress(reqs[i]);
+
+		if (errors[i] != -EINPROGRESS) {
+			errors[i] = -EINVAL;
+			err = -EINVAL;
+		} else {
+			errors[i] = -EAGAIN;
+		}
+	}
+
+	/*
+	 * Asynchronously poll for and process IAA compress job completions.
+	 */
+	while (!compressions_done) {
+		compressions_done = true;
+
+		for (i = 0; i < nr_reqs; ++i) {
+			/*
+			 * Skip, if the compression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = iaa_comp_poll(reqs[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					compressions_done = false;
+				else
+					err = -EINVAL;
+			} else {
+				dlens[i] = reqs[i]->dlen;
+			}
+		}
+	}
+
+	/*
+	 * For the same 'reqs[]' to be usable by
+	 * iaa_comp_acompress()/iaa_comp_adecompress(),
+	 * clear the CRYPTO_ACOMP_REQ_POLL bit on all acomp_reqs.
+	 */
+	iaa_set_req_poll(reqs, nr_reqs, false);
+
+	return !err;
+}
+
+/**
+ * This API provides IAA decompress batching functionality for use by swap
+ * modules.
+ *
+ * @reqs: @nr_reqs asynchronous decompress requests.
+ * @srcs: The src buffers to be decompressed by IAA.
+ * @pages: The pages to store the decompressed buffers.
+ * @slens: Compressed lengths of @srcs.
+ * @dlens: Will contain the decompressed lengths.
+ * @errors: zero on successful decompression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_reqs: The number of pages, up to IAA_CRYPTO_MAX_BATCH_SIZE,
+ *            to be decompressed.
+ *
+ * Returns true if all decompress requests complete successfully,
+ * false otherwise.
+ */
+static bool iaa_comp_adecompress_batch(
+	struct acomp_req *reqs[],
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_reqs)
+{
+	struct scatterlist inputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	struct scatterlist outputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	bool decompressions_done = false;
+	int i, err = 0;
+
+	BUG_ON(nr_reqs > IAA_CRYPTO_MAX_BATCH_SIZE);
+
+	iaa_set_req_poll(reqs, nr_reqs, true);
+
+	/*
+	 * Prepare and submit the batch of acomp_reqs to IAA. IAA will process
+	 * these decompress jobs in parallel.
+	 */
+	for (i = 0; i < nr_reqs; ++i) {
+		sg_init_one(&inputs[i], srcs[i], slens[i]);
+		sg_init_table(&outputs[i], 1);
+		sg_set_page(&outputs[i], pages[i], PAGE_SIZE, 0);
+		acomp_request_set_params(reqs[i], &inputs[i],
+					&outputs[i], slens[i], PAGE_SIZE);
+
+		errors[i] = iaa_comp_adecompress(reqs[i]);
+
+		/*
+		 * If it failed desc allocation/submission, errors[i] can
+		 * be 0 or error value from software decompress.
+		 */
+		if (errors[i] != -EINPROGRESS) {
+			errors[i] = -EINVAL;
+			err = -EINVAL;
+		} else {
+			errors[i] = -EAGAIN;
+		}
+	}
+
+	/*
+	 * Asynchronously poll for and process IAA decompress job completions.
+	 */
+	while (!decompressions_done) {
+		decompressions_done = true;
+
+		for (i = 0; i < nr_reqs; ++i) {
+			/*
+			 * Skip, if the decompression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = iaa_comp_poll(reqs[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					decompressions_done = false;
+				else
+					err = -EINVAL;
+			} else {
+				/*
+				 * For batch decompressions, the caller should
+				 * check @errors and handle dlens[i] != PAGE_SIZE.
+				 */
+				dlens[i] = reqs[i]->dlen;
+			}
+		}
+	}
+
+	/*
+	 * For the same 'reqs[]' to be usable by
+	 * iaa_comp_acompress()/iaa_comp_adecompress(),
+	 * clear the CRYPTO_ACOMP_REQ_POLL bit on all acomp_reqs.
+	 */
+	iaa_set_req_poll(reqs, nr_reqs, false);
+
+	return !err;
+}
+
 /*********************************************
  * Interfaces to crypto_alg and crypto_acomp.
  *********************************************/
@@ -2169,6 +2454,9 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
 	.init			= iaa_comp_init_fixed,
 	.compress		= iaa_comp_acompress,
 	.decompress		= iaa_comp_adecompress,
+	.get_batch_size		= iaa_comp_get_batch_size,
+	.batch_compress		= iaa_comp_acompress_batch,
+	.batch_decompress	= iaa_comp_adecompress_batch,
 	.base			= {
 		.cra_name		= "deflate",
 		.cra_driver_name	= "deflate-iaa",
-- 
2.27.0



* [PATCH v9 12/19] crypto: iaa - Enable async mode and make it the default.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (10 preceding siblings ...)
  2025-04-30 20:52 ` [PATCH v9 11/19] crypto: iaa - Implement crypto_acomp batching interfaces for Intel IAA Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-04-30 20:52 ` [PATCH v9 13/19] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch enables the 'async' sync_mode in the driver. Further, it sets
the default sync_mode to 'async', which makes it easier for IAA hardware
acceleration in the iaa_crypto driver to be loaded by default in the most
efficient/recommended 'async' mode for parallel
compressions/decompressions, namely, asynchronous submission of
descriptors, followed by polling for job completions. Earlier, the
"sync" mode used to be the default.

The iaa_crypto driver documentation has been updated with these
changes.

This way, anyone who wants to use IAA for zswap/zram can do so after
building the kernel, and without having to go through these steps to use
async mode:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo async > /sys/bus/dsa/drivers/crypto/sync_mode
  5) re-run initialization of the IAA devices and wqs

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 Documentation/driver-api/crypto/iaa/iaa-crypto.rst | 11 ++---------
 drivers/crypto/intel/iaa/iaa_crypto_main.c         |  4 ++--
 2 files changed, 4 insertions(+), 11 deletions(-)
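
For readers who want to see the mode-to-flag mapping in isolation, here is
a minimal user-space sketch of what set_iaa_sync_mode() selects after this
patch. The struct and function names are hypothetical, strcmp() stands in
for sysfs_streq(), and the 'async_irq' case setting use_irq is an
assumption based on the variable's "Use interrupts" comment rather than on
lines visible in this diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical user-space stand-in for the driver's two mode flags. */
struct sync_mode_flags {
	bool async_mode;	/* submit without waiting for completion */
	bool use_irq;		/* completion is signalled by interrupt */
};

/*
 * Map a sync_mode name to its flag pair. Returns 0 on success, -1 for an
 * unknown name. Unlike sysfs_streq(), strcmp() does not tolerate a
 * trailing newline.
 */
static int parse_sync_mode(const char *name, struct sync_mode_flags *out)
{
	assert(name && out);

	if (!strcmp(name, "sync")) {
		out->async_mode = false;
		out->use_irq = false;
	} else if (!strcmp(name, "async")) {
		/* Before this patch, "async" internally selected sync. */
		out->async_mode = true;
		out->use_irq = false;
	} else if (!strcmp(name, "async_irq")) {
		out->async_mode = true;
		out->use_irq = true;
	} else {
		return -1;
	}
	return 0;
}
```

With this mapping, 'async' is the only mode in which the caller submits
asynchronously and must poll for completion itself.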

diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index 949bfa1ef624..8e0e98d50972 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -272,7 +272,7 @@ The available attributes are:
       echo async_irq > /sys/bus/dsa/drivers/crypto/sync_mode
 
     Async mode without interrupts (caller must poll) can be enabled by
-    writing 'async' to it (please see Caveat)::
+    writing 'async' to it::
 
       echo async > /sys/bus/dsa/drivers/crypto/sync_mode
 
@@ -281,14 +281,7 @@ The available attributes are:
 
       echo sync > /sys/bus/dsa/drivers/crypto/sync_mode
 
-    The default mode is 'sync'.
-
-    Caveat: since the only mechanism that iaa_crypto currently implements
-    for async polling without interrupts is via the 'sync' mode as
-    described earlier, writing 'async' to
-    '/sys/bus/dsa/drivers/crypto/sync_mode' will internally enable the
-    'sync' mode. This is to ensure correct iaa_crypto behavior until true
-    async polling without interrupts is enabled in iaa_crypto.
+    The default mode is 'async'.
 
   - g_comp_wqs_per_iaa
 
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index d577f555d6ab..cfd4f5ead67b 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -115,7 +115,7 @@ static bool iaa_verify_compress = true;
  */
 
 /* Use async mode */
-static bool async_mode;
+static bool async_mode = true;
 /* Use interrupts */
 static bool use_irq;
 
@@ -169,7 +169,7 @@ static int set_iaa_sync_mode(const char *name)
 		async_mode = false;
 		use_irq = false;
 	} else if (sysfs_streq(name, "async")) {
-		async_mode = false;
+		async_mode = true;
 		use_irq = false;
 	} else if (sysfs_streq(name, "async_irq")) {
 		async_mode = true;
-- 
2.27.0



* [PATCH v9 13/19] crypto: iaa - Disable iaa_verify_compress by default.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (11 preceding siblings ...)
  2025-04-30 20:52 ` [PATCH v9 12/19] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
@ 2025-04-30 20:52 ` Kanchana P Sridhar
  2025-04-30 20:53 ` [PATCH v9 14/19] mm: zswap: Move the CPU hotplug procedures under "pool functions" Kanchana P Sridhar
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:52 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch disables "iaa_verify_compress" by default in the iaa_crypto
driver, to facilitate performance comparisons with software compressors
(which also do not run compress verification by default). Earlier,
iaa_crypto compress verification was enabled by default.

The iaa_crypto driver documentation has been updated with this change.

With this patch, if users want to enable compress verification, they can do
so with these steps:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo 1 > /sys/bus/dsa/drivers/crypto/verify_compress
  5) re-run initialization of the IAA devices and wqs

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 Documentation/driver-api/crypto/iaa/iaa-crypto.rst | 2 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c         | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index 8e0e98d50972..bc5912f22ae1 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -239,7 +239,7 @@ The available attributes are:
 
       echo 0 > /sys/bus/dsa/drivers/crypto/verify_compress
 
-    The default setting is '1' - verify all compresses.
+    The default setting is '0' - to not verify compresses.
 
   - sync_mode
 
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index cfd4f5ead67b..815b5d718625 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -83,7 +83,7 @@ static bool iaa_distribute_decomps = true;
 static bool iaa_distribute_comps = true;
 
 /* Verify results of IAA compress or not */
-static bool iaa_verify_compress = true;
+static bool iaa_verify_compress = false;
 
 /*
  * The iaa crypto driver supports three 'sync' methods determining how
-- 
2.27.0



* [PATCH v9 14/19] mm: zswap: Move the CPU hotplug procedures under "pool functions".
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (12 preceding siblings ...)
  2025-04-30 20:52 ` [PATCH v9 13/19] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
@ 2025-04-30 20:53 ` Kanchana P Sridhar
  2025-04-30 20:53 ` [PATCH v9 15/19] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion Kanchana P Sridhar
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch merely moves zswap_cpu_comp_prepare() and
zswap_cpu_comp_dead() to be in the "pool functions" section because
these functions are invoked upon pool creation/deletion.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 188 ++++++++++++++++++++++++++---------------------------
 1 file changed, 94 insertions(+), 94 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 455e9425c5f5..358dad3e612a 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -248,6 +248,100 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 **********************************/
 static void __zswap_pool_empty(struct percpu_ref *ref);
 
+static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
+{
+	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
+	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+	struct crypto_acomp *acomp = NULL;
+	struct acomp_req *req = NULL;
+	u8 *buffer = NULL;
+	int ret;
+
+	buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+	if (!buffer) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+	if (IS_ERR(acomp)) {
+		pr_err("could not alloc crypto acomp %s : %ld\n",
+				pool->tfm_name, PTR_ERR(acomp));
+		ret = PTR_ERR(acomp);
+		goto fail;
+	}
+
+	req = acomp_request_alloc(acomp);
+	if (!req) {
+		pr_err("could not alloc crypto acomp_request %s\n",
+		       pool->tfm_name);
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	/*
+	 * Only hold the mutex after completing allocations, otherwise we may
+	 * recurse into zswap through reclaim and attempt to hold the mutex
+	 * again resulting in a deadlock.
+	 */
+	mutex_lock(&acomp_ctx->mutex);
+	crypto_init_wait(&acomp_ctx->wait);
+
+	/*
+	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
+	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
+	 * won't be called, crypto_wait_req() will return without blocking.
+	 */
+	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   crypto_req_done, &acomp_ctx->wait);
+
+	acomp_ctx->buffer = buffer;
+	acomp_ctx->acomp = acomp;
+	acomp_ctx->is_sleepable = acomp_is_async(acomp);
+	acomp_ctx->req = req;
+	mutex_unlock(&acomp_ctx->mutex);
+	return 0;
+
+fail:
+	if (acomp)
+		crypto_free_acomp(acomp);
+	kfree(buffer);
+	return ret;
+}
+
+static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
+{
+	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
+	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+	struct acomp_req *req;
+	struct crypto_acomp *acomp;
+	u8 *buffer;
+
+	if (IS_ERR_OR_NULL(acomp_ctx))
+		return 0;
+
+	mutex_lock(&acomp_ctx->mutex);
+	req = acomp_ctx->req;
+	acomp = acomp_ctx->acomp;
+	buffer = acomp_ctx->buffer;
+	acomp_ctx->req = NULL;
+	acomp_ctx->acomp = NULL;
+	acomp_ctx->buffer = NULL;
+	mutex_unlock(&acomp_ctx->mutex);
+
+	/*
+	 * Do the actual freeing after releasing the mutex to avoid subtle
+	 * locking dependencies causing deadlocks.
+	 */
+	if (!IS_ERR_OR_NULL(req))
+		acomp_request_free(req);
+	if (!IS_ERR_OR_NULL(acomp))
+		crypto_free_acomp(acomp);
+	kfree(buffer);
+
+	return 0;
+}
+
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 {
 	struct zswap_pool *pool;
@@ -818,100 +912,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
 /*********************************
 * compressed storage functions
 **********************************/
-static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
-{
-	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
-	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
-	struct crypto_acomp *acomp = NULL;
-	struct acomp_req *req = NULL;
-	u8 *buffer = NULL;
-	int ret;
-
-	buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-	if (!buffer) {
-		ret = -ENOMEM;
-		goto fail;
-	}
-
-	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
-	if (IS_ERR(acomp)) {
-		pr_err("could not alloc crypto acomp %s : %ld\n",
-				pool->tfm_name, PTR_ERR(acomp));
-		ret = PTR_ERR(acomp);
-		goto fail;
-	}
-
-	req = acomp_request_alloc(acomp);
-	if (!req) {
-		pr_err("could not alloc crypto acomp_request %s\n",
-		       pool->tfm_name);
-		ret = -ENOMEM;
-		goto fail;
-	}
-
-	/*
-	 * Only hold the mutex after completing allocations, otherwise we may
-	 * recurse into zswap through reclaim and attempt to hold the mutex
-	 * again resulting in a deadlock.
-	 */
-	mutex_lock(&acomp_ctx->mutex);
-	crypto_init_wait(&acomp_ctx->wait);
-
-	/*
-	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
-	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
-	 * won't be called, crypto_wait_req() will return without blocking.
-	 */
-	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
-				   crypto_req_done, &acomp_ctx->wait);
-
-	acomp_ctx->buffer = buffer;
-	acomp_ctx->acomp = acomp;
-	acomp_ctx->is_sleepable = acomp_is_async(acomp);
-	acomp_ctx->req = req;
-	mutex_unlock(&acomp_ctx->mutex);
-	return 0;
-
-fail:
-	if (acomp)
-		crypto_free_acomp(acomp);
-	kfree(buffer);
-	return ret;
-}
-
-static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
-{
-	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
-	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
-	struct acomp_req *req;
-	struct crypto_acomp *acomp;
-	u8 *buffer;
-
-	if (IS_ERR_OR_NULL(acomp_ctx))
-		return 0;
-
-	mutex_lock(&acomp_ctx->mutex);
-	req = acomp_ctx->req;
-	acomp = acomp_ctx->acomp;
-	buffer = acomp_ctx->buffer;
-	acomp_ctx->req = NULL;
-	acomp_ctx->acomp = NULL;
-	acomp_ctx->buffer = NULL;
-	mutex_unlock(&acomp_ctx->mutex);
-
-	/*
-	 * Do the actual freeing after releasing the mutex to avoid subtle
-	 * locking dependencies causing deadlocks.
-	 */
-	if (!IS_ERR_OR_NULL(req))
-		acomp_request_free(req);
-	if (!IS_ERR_OR_NULL(acomp))
-		crypto_free_acomp(acomp);
-	kfree(buffer);
-
-	return 0;
-}
-
 static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
 {
 	struct crypto_acomp_ctx *acomp_ctx;
-- 
2.27.0



* [PATCH v9 15/19] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (13 preceding siblings ...)
  2025-04-30 20:53 ` [PATCH v9 14/19] mm: zswap: Move the CPU hotplug procedures under "pool functions" Kanchana P Sridhar
@ 2025-04-30 20:53 ` Kanchana P Sridhar
  2025-04-30 20:53 ` [PATCH v9 16/19] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources Kanchana P Sridhar
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch simplifies the zswap_pool's per-CPU acomp_ctx resource
management. Like the per-CPU acomp_ctx itself, the lifetime of its
resources (acomp, req, buffer) will now be from pool creation to pool
deletion. These resources will persist through CPU
hotplug operations. The zswap_cpu_comp_dead() teardown callback has been
deleted from the call to
cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE). As a result, CPU
offline hotplug operations will be no-ops as far as the acomp_ctx
resources are concerned.

The main benefit of using the CPU hotplug multi state instance startup
callback to allocate the acomp_ctx resources is that it prevents the
cores from being offlined until the multi state instance addition call
returns.

  From Documentation/core-api/cpu_hotplug.rst:

    "The node list add/remove operations and the callback invocations are
     serialized against CPU hotplug operations."

Furthermore, zswap_[de]compress() cannot contend with
zswap_cpu_comp_prepare() because:

  - During pool creation/deletion, the pool is not in the zswap_pools
    list.

  - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed
    out. zswap_cpu_comp_prepare() will be executed on a control CPU,
    since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of "enum
    cpuhp_state". Thanks Yosry for sharing this observation!

  In both these cases, any recursions into zswap reclaim from
  zswap_cpu_comp_prepare() will be handled by the old pool.

The above two observations enable the following simplifications:

 1) zswap_cpu_comp_prepare(): CPU cannot be offlined. Reclaim cannot use
    the pool. Considerations for mutex init/locking and handling
    subsequent CPU hotplug online-offlines:

    Should we lock the mutex of current CPU's acomp_ctx from start to
    end? It doesn't seem like this is required. The multi state instance
    add/remove operations acquire "cpuhp_state_mutex" before proceeding,
    hence they are serialized against CPU hotplug operations.

    If the process gets migrated while zswap_cpu_comp_prepare() is
    running, it will complete on the new CPU. In case of failures, we
    pass the acomp_ctx pointer obtained at the start of
    zswap_cpu_comp_prepare() to acomp_ctx_dealloc(), which again, can
    only undergo migration. There appear to be no contention scenarios
    that might cause inconsistent values of acomp_ctx's members. Hence,
    it seems there is no need for mutex_lock(&acomp_ctx->mutex) in
    zswap_cpu_comp_prepare().

    Since the pool is not yet on zswap_pools list, we don't need to
    initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This
    has been restored to occur in zswap_cpu_comp_prepare().

    zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is
    valid. If so, it returns success. This should handle any CPU
    hotplug online-offline transitions after pool creation is done.

 2) CPU offline vis-a-vis zswap ops: Let's suppose the process is
    migrated to another CPU before the current CPU is offlined. If
    zswap_[de]compress() holds the acomp_ctx->mutex lock of the offlined
    CPU, that mutex will be released once it completes on the new
    CPU. Since there is no teardown callback, there is no possibility of
    UAF.

 3) Pool creation/deletion and process migration to another CPU:

    - During pool creation/deletion, the pool is not in the zswap_pools
      list. Hence it cannot contend with zswap ops on that CPU. However,
      the process can get migrated.

      Pool creation --> zswap_cpu_comp_prepare()
                                --> process migrated:
                                    * CPU offline: no-op.
                                    * zswap_cpu_comp_prepare() continues
                                      to run on the new CPU to finish
                                      allocating acomp_ctx resources for
                                      the offlined CPU.

      Pool deletion --> acomp_ctx_dealloc()
                                --> process migrated:
                                    * CPU offline: no-op.
                                    * acomp_ctx_dealloc() continues
                                      to run on the new CPU to finish
                                      de-allocating acomp_ctx resources
                                      for the offlined CPU.

 4) Pool deletion vis-a-vis CPU onlining:
    To prevent the possibility of race conditions between
    acomp_ctx_dealloc() freeing the acomp_ctx resources and the initial
    check for a valid acomp_ctx->acomp in zswap_cpu_comp_prepare(), we
    need to delete the multi state instance right after it is added, in
    zswap_pool_create().

 Summary of changes based on the above:
 --------------------------------------
 1) Zero-initialization of pool->acomp_ctx in zswap_pool_create() to
    simplify and share common code for different error handling/cleanup
    related to the acomp_ctx.

 2) Remove the node list instance right after node list add function
    call in zswap_pool_create(). This prevents race conditions between
    CPU onlining after initial pool creation, and acomp_ctx_dealloc()
    freeing the acomp_ctx resources.

 3) zswap_pool_destroy() will call acomp_ctx_dealloc() to de-allocate
    the per-CPU acomp_ctx resources.

 4) Changes to zswap_cpu_comp_prepare():

    a) Check if acomp_ctx->acomp is valid at the beginning and return,
       because the acomp_ctx is already initialized.
    b) Move the mutex_init to happen in this procedure, before it
       returns.
    c) All error conditions handled by calling acomp_ctx_dealloc().

 5) New procedure acomp_ctx_dealloc() for common error/cleanup code.

 6) No more multi state instance teardown callback. CPU offlining is a
    no-op as far as acomp_ctx resources are concerned.

 7) Delete acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock(). Directly
    call mutex_lock(&acomp_ctx->mutex)/mutex_unlock(&acomp_ctx->mutex)
    in zswap_[de]compress().
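
 The "already initialized, so return" check in 4a) and the shared cleanup
 in 5) can be illustrated with a small user-space sketch. All names here
 are hypothetical, malloc()/free() stand in for the kernel allocators, and
 the opaque pointers stand in for the acomp, req and buffer members. Note
 that this sketch resets the pointers to NULL after freeing so that
 calling the cleanup twice is harmless; that is a choice made for the
 sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct crypto_acomp_ctx; members are opaque. */
struct ctx {
	void *buffer;
	void *acomp;
	void *req;
};

static int nr_prepares;	/* how many times resources were really allocated */

/* Free whatever is present and reset the pointers; safe on a zeroed ctx. */
static void ctx_dealloc(struct ctx *c)
{
	if (!c)
		return;
	free(c->req);		/* free(NULL) is a no-op */
	free(c->acomp);
	free(c->buffer);
	c->req = c->acomp = c->buffer = NULL;
}

/* Startup-callback analogue: a no-op if the ctx is already set up. */
static int ctx_prepare(struct ctx *c)
{
	assert(c);

	if (c->acomp)
		return 0;

	c->buffer = malloc(2 * 4096);
	if (!c->buffer)
		return -1;
	c->acomp = malloc(1);
	if (!c->acomp)
		goto fail;
	c->req = malloc(1);
	if (!c->req)
		goto fail;

	nr_prepares++;
	return 0;

fail:
	ctx_dealloc(c);
	return -1;
}
```

 Calling ctx_prepare() a second time on the same context, as a CPU
 offline-online transition would, returns success without allocating
 again.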

The per-CPU memory cost of not deleting the acomp_ctx resources upon CPU
offlining, and only deleting them when the pool is destroyed, is as
follows, on x86_64:

    IAA with batching: 64.8 KB
    Software compressors: 8.2 KB

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 193 +++++++++++++++++++++++++----------------------------
 1 file changed, 92 insertions(+), 101 deletions(-)
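
As a reading aid for the zswap_pool_create() error path, here is a
user-space sketch of the flow described in the commit message: run the
startup callback for each CPU, and if any of them fails, clean up every
possible CPU because no teardown callback exists to roll back. The names,
the fixed CPU count and the fail_cpu knob are all hypothetical, and a
plain loop stands in for cpuhp_state_add_instance().

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4	/* arbitrary CPU count for the sketch */

/* Hypothetical per-CPU context holding a single resource. */
struct pcpu_ctx {
	void *buffer;
};

static void pcpu_ctx_dealloc(struct pcpu_ctx *c)
{
	free(c->buffer);	/* free(NULL) is a no-op */
	c->buffer = NULL;
}

/* Startup-callback analogue; fails on CPU 'fail_cpu' (-1: never fail). */
static int pcpu_ctx_prepare(struct pcpu_ctx *c, int cpu, int fail_cpu)
{
	if (cpu == fail_cpu)
		return -1;
	c->buffer = malloc(64);
	return c->buffer ? 0 : -1;
}

/*
 * Pool-creation analogue. Returns the zero-initialized per-CPU array on
 * success, or NULL after releasing everything on failure.
 */
static struct pcpu_ctx *pool_create(int fail_cpu)
{
	struct pcpu_ctx *ctxs = calloc(NR_CPUS, sizeof(*ctxs));
	int cpu, ret = 0;

	if (!ctxs)
		return NULL;

	/* "Add instance": run the startup callback for each CPU in turn. */
	for (cpu = 0; cpu < NR_CPUS && !ret; cpu++)
		ret = pcpu_ctx_prepare(&ctxs[cpu], cpu, fail_cpu);

	if (ret) {
		/* No teardown callback: clean up every CPU, touched or not. */
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			pcpu_ctx_dealloc(&ctxs[cpu]);
		free(ctxs);
		return NULL;
	}
	return ctxs;
}

/* Pool-destruction analogue: release the resources of every CPU. */
static void pool_destroy(struct pcpu_ctx *ctxs)
{
	int cpu;

	assert(ctxs);
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		pcpu_ctx_dealloc(&ctxs[cpu]);
	free(ctxs);
}
```

Because the array is zero-initialized, the cleanup loop can visit CPUs
whose callback never ran without any special casing.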

diff --git a/mm/zswap.c b/mm/zswap.c
index 358dad3e612a..238f92e63a22 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -248,43 +248,65 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 **********************************/
 static void __zswap_pool_empty(struct percpu_ref *ref);
 
+/*
+ * The per-cpu pool->acomp_ctx is zero-initialized on allocation. This makes
+ * it easy for different error conditions/cleanup related to the acomp_ctx
+ * to be handled by acomp_ctx_dealloc():
+ * - Errors during zswap_cpu_comp_prepare().
+ * - Partial success/error of cpuhp_state_add_instance() call in
+ *   zswap_pool_create(). Only some cores could have executed
+ *   zswap_cpu_comp_prepare(), not others.
+ * - Cleanup acomp_ctx resources on all cores in zswap_pool_destroy().
+ */
+static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
+{
+	if (IS_ERR_OR_NULL(acomp_ctx))
+		return;
+
+	if (!IS_ERR_OR_NULL(acomp_ctx->req))
+		acomp_request_free(acomp_ctx->req);
+	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+		crypto_free_acomp(acomp_ctx->acomp);
+	kfree(acomp_ctx->buffer);
+}
+
 static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 {
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
-	struct crypto_acomp *acomp = NULL;
-	struct acomp_req *req = NULL;
-	u8 *buffer = NULL;
-	int ret;
+	int ret = -ENOMEM;
 
-	buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-	if (!buffer) {
-		ret = -ENOMEM;
-		goto fail;
-	}
+	/*
+	 * The per-CPU pool->acomp_ctx is zero-initialized on allocation.
+	 * Even though we delete the multi state instance right after successful
+	 * addition of the instance in zswap_pool_create(), we cannot eliminate
+	 * the possibility of the CPU going through offline-online transitions.
+	 * If this does happen, we check if the acomp_ctx has already been
+	 * initialized, and return.
+	 */
+	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+		return 0;
 
-	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
-	if (IS_ERR(acomp)) {
+	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->buffer)
+		return ret;
+
+	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+	if (IS_ERR(acomp_ctx->acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
-				pool->tfm_name, PTR_ERR(acomp));
-		ret = PTR_ERR(acomp);
+				pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
+		ret = PTR_ERR(acomp_ctx->acomp);
 		goto fail;
 	}
+	acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
 
-	req = acomp_request_alloc(acomp);
-	if (!req) {
+	acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
+	if (!acomp_ctx->req) {
 		pr_err("could not alloc crypto acomp_request %s\n",
 		       pool->tfm_name);
-		ret = -ENOMEM;
 		goto fail;
 	}
 
-	/*
-	 * Only hold the mutex after completing allocations, otherwise we may
-	 * recurse into zswap through reclaim and attempt to hold the mutex
-	 * again resulting in a deadlock.
-	 */
-	mutex_lock(&acomp_ctx->mutex);
 	crypto_init_wait(&acomp_ctx->wait);
 
 	/*
@@ -292,56 +314,17 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
 	 * won't be called, crypto_wait_req() will return without blocking.
 	 */
-	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+	acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
-	acomp_ctx->buffer = buffer;
-	acomp_ctx->acomp = acomp;
-	acomp_ctx->is_sleepable = acomp_is_async(acomp);
-	acomp_ctx->req = req;
-	mutex_unlock(&acomp_ctx->mutex);
+	mutex_init(&acomp_ctx->mutex);
 	return 0;
 
 fail:
-	if (acomp)
-		crypto_free_acomp(acomp);
-	kfree(buffer);
+	acomp_ctx_dealloc(acomp_ctx);
 	return ret;
 }
 
-static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
-{
-	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
-	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
-	struct acomp_req *req;
-	struct crypto_acomp *acomp;
-	u8 *buffer;
-
-	if (IS_ERR_OR_NULL(acomp_ctx))
-		return 0;
-
-	mutex_lock(&acomp_ctx->mutex);
-	req = acomp_ctx->req;
-	acomp = acomp_ctx->acomp;
-	buffer = acomp_ctx->buffer;
-	acomp_ctx->req = NULL;
-	acomp_ctx->acomp = NULL;
-	acomp_ctx->buffer = NULL;
-	mutex_unlock(&acomp_ctx->mutex);
-
-	/*
-	 * Do the actual freeing after releasing the mutex to avoid subtle
-	 * locking dependencies causing deadlocks.
-	 */
-	if (!IS_ERR_OR_NULL(req))
-		acomp_request_free(req);
-	if (!IS_ERR_OR_NULL(acomp))
-		crypto_free_acomp(acomp);
-	kfree(buffer);
-
-	return 0;
-}
-
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 {
 	struct zswap_pool *pool;
@@ -375,19 +358,43 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 
 	strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
 
-	pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
+	/* Many things rely on the zero-initialization. */
+	pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
+					   GFP_KERNEL | __GFP_ZERO);
 	if (!pool->acomp_ctx) {
 		pr_err("percpu alloc failed\n");
 		goto error;
 	}
 
-	for_each_possible_cpu(cpu)
-		mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
-
+	/*
+	 * This is serialized against CPU hotplug operations. Hence, cores
+	 * cannot be offlined until this finishes.
+	 * In case of errors, we need to goto "ref_fail" instead of "error"
+	 * because there is no teardown callback registered anymore, for
+	 * cpuhp_state_add_instance() to de-allocate resources as it rolls back
+	 * state on cores before the CPU on which error was encountered.
+	 */
 	ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
 				       &pool->node);
+
+	/*
+	 * We only needed the multi state instance add operation to invoke the
+	 * startup callback for all cores without cores getting offlined. Since
+	 * the acomp_ctx resources will now only be de-allocated when the pool
+	 * is destroyed, we can safely remove the multi state instance. This
+	 * minimizes (but does not eliminate) the possibility of
+	 * zswap_cpu_comp_prepare() being invoked again due to a CPU
+	 * offline-online transition. Removing the instance also prevents race
+	 * conditions between CPU onlining after initial pool creation, and
+	 * acomp_ctx_dealloc() freeing the acomp_ctx resources.
+	 * Note that we delete the instance before checking the error status of
+	 * the node list add operation because we want the instance removal even
+	 * in case of errors in the former.
+	 */
+	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+
 	if (ret)
-		goto error;
+		goto ref_fail;
 
 	/* being the current pool takes 1 ref; this func expects the
 	 * caller to always add the new pool as the current pool
@@ -403,7 +410,8 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 	return pool;
 
 ref_fail:
-	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+	for_each_possible_cpu(cpu)
+		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
 error:
 	if (pool->acomp_ctx)
 		free_percpu(pool->acomp_ctx);
@@ -457,9 +465,13 @@ static struct zswap_pool *__zswap_pool_create_fallback(void)
 
 static void zswap_pool_destroy(struct zswap_pool *pool)
 {
+	int cpu;
+
 	zswap_pool_debug("destroying", pool);
 
-	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+	for_each_possible_cpu(cpu)
+		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
+
 	free_percpu(pool->acomp_ctx);
 
 	zpool_destroy_pool(pool->zpool);
@@ -912,31 +924,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
 /*********************************
 * compressed storage functions
 **********************************/
-static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
-{
-	struct crypto_acomp_ctx *acomp_ctx;
-
-	for (;;) {
-		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
-		mutex_lock(&acomp_ctx->mutex);
-		if (likely(acomp_ctx->req))
-			return acomp_ctx;
-		/*
-		 * It is possible that we were migrated to a different CPU after
-		 * getting the per-CPU ctx but before the mutex was acquired. If
-		 * the old CPU got offlined, zswap_cpu_comp_dead() could have
-		 * already freed ctx->req (among other things) and set it to
-		 * NULL. Just try again on the new CPU that we ended up on.
-		 */
-		mutex_unlock(&acomp_ctx->mutex);
-	}
-}
-
-static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
-{
-	mutex_unlock(&acomp_ctx->mutex);
-}
-
 static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 			   struct zswap_pool *pool)
 {
@@ -949,7 +936,10 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	gfp_t gfp;
 	u8 *dst;
 
-	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+
+	mutex_lock(&acomp_ctx->mutex);
+
 	dst = acomp_ctx->buffer;
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
@@ -997,7 +987,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	else if (alloc_ret)
 		zswap_reject_alloc_fail++;
 
-	acomp_ctx_put_unlock(acomp_ctx);
+	mutex_unlock(&acomp_ctx->mutex);
 	return comp_ret == 0 && alloc_ret == 0;
 }
 
@@ -1009,7 +999,8 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	int decomp_ret, dlen;
 	u8 *src, *obj;
 
-	acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
+	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+	mutex_lock(&acomp_ctx->mutex);
 	obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
 
 	/*
@@ -1033,7 +1024,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	dlen = acomp_ctx->req->dlen;
 
 	zpool_obj_read_end(zpool, entry->handle, obj);
-	acomp_ctx_put_unlock(acomp_ctx);
+	mutex_unlock(&acomp_ctx->mutex);
 
 	if (!decomp_ret && dlen == PAGE_SIZE)
 		return true;
@@ -1849,7 +1840,7 @@ static int zswap_setup(void)
 	ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
 				      "mm/zswap_pool:prepare",
 				      zswap_cpu_comp_prepare,
-				      zswap_cpu_comp_dead);
+				      NULL);
 	if (ret)
 		goto hp_fail;
 
-- 
2.27.0



* [PATCH v9 16/19] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (14 preceding siblings ...)
  2025-04-30 20:53 ` [PATCH v9 15/19] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion Kanchana P Sridhar
@ 2025-04-30 20:53 ` Kanchana P Sridhar
  2025-04-30 20:53 ` [PATCH v9 17/19] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch uses IS_ERR_OR_NULL() in zswap_cpu_comp_prepare() to check
for valid acomp/req, thereby making it consistent with acomp_ctx_dealloc().

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
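
For reference, the difference between the two checks can be shown with a
user-space re-creation of the kernel's error-pointer encoding. The
lowercase helper names are hypothetical stand-ins for ERR_PTR(), PTR_ERR(),
IS_ERR() and IS_ERR_OR_NULL(); the sketch assumes the usual convention
that the top 4095 addresses encode negative errno values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode a pointer back into a long; yields 0 for NULL. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for pointers in the top MAX_ERRNO addresses. */
static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* True for error pointers and for NULL. */
static inline bool is_err_or_null(const void *ptr)
{
	return !ptr || is_err(ptr);
}

/* Self-check of the properties the sketch is meant to show. */
static void errptr_selfcheck(void)
{
	assert(is_err(err_ptr(-12)));	/* -12 is -ENOMEM on Linux */
	assert(!is_err(NULL));
	assert(is_err_or_null(NULL));
	assert(ptr_err(NULL) == 0);
}
```

The last assertion is the point to keep in mind when a NULL result takes
an error branch: decoding NULL gives 0, not a negative errno.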

diff --git a/mm/zswap.c b/mm/zswap.c
index 238f92e63a22..717835f214b2 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -292,7 +292,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 		return ret;
 
 	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
-	if (IS_ERR(acomp_ctx->acomp)) {
+	if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
 				pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
 		ret = PTR_ERR(acomp_ctx->acomp);
@@ -301,7 +301,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
 
 	acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
-	if (!acomp_ctx->req) {
+	if (IS_ERR_OR_NULL(acomp_ctx->req)) {
 		pr_err("could not alloc crypto acomp_request %s\n",
 		       pool->tfm_name);
 		goto fail;
-- 
2.27.0



* [PATCH v9 17/19] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (15 preceding siblings ...)
  2025-04-30 20:53 ` [PATCH v9 16/19] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources Kanchana P Sridhar
@ 2025-04-30 20:53 ` Kanchana P Sridhar
  2025-04-30 20:53 ` [PATCH v9 18/19] mm: zswap: zswap_store() will process a folio in batches Kanchana P Sridhar
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch adds support for the per-CPU acomp_ctx to track multiple
compression/decompression requests and multiple compression destination
buffers. zswap_cpu_comp_prepare() queries the maximum batch size the
compressor supports and allocates the corresponding batching resources
(reqs/buffers), capped at ZSWAP_MAX_BATCH_SIZE, which this patch defines
as 8U.

This patch also adds a "u8 nr_reqs" member to "struct
zswap_pool". Thanks Yosry for this suggestion. Once the pool's per-CPU
acomp_ctx resources have been successfully allocated, pool->nr_reqs is
set to the minimum of ZSWAP_MAX_BATCH_SIZE and
crypto_acomp_batch_size(acomp_ctx->acomp).

However, zswap does not use more than one request yet. Follow-up patches
will actually utilize the multiple acomp_ctx requests/buffers for batch
compression/decompression of multiple pages.

The newly added ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory
used for batching. There is a small extra memory overhead of allocating
the "reqs" and "buffers" arrays for compressors that do not support
batching: On x86_64, the overhead is two pointers per-CPU (i.e. 16 bytes).
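
The allocation scheme described in this message can be sketched in userspace C as follows. This is only an illustration under stated assumptions, not the zswap code: struct ctx, clamp_nr_reqs(), ctx_alloc() and ctx_dealloc() are hypothetical names, malloc()/calloc() stand in for the kernel allocators, a 64-byte blob stands in for an acomp request, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_BATCH_SIZE 8U	/* mirrors ZSWAP_MAX_BATCH_SIZE */
#define BUF_SIZE (2 * 4096)	/* mirrors PAGE_SIZE * 2, assuming 4 KiB pages */
#define REQ_SIZE 64		/* hypothetical size of a placeholder request */

struct ctx {
	void **reqs;		/* stand-in for struct acomp_req **reqs */
	unsigned char **buffers;
};

/* Cap the compressor's advertised batch size at MAX_BATCH_SIZE. */
static unsigned int clamp_nr_reqs(unsigned int compressor_batch_size)
{
	return compressor_batch_size < MAX_BATCH_SIZE ?
	       compressor_batch_size : MAX_BATCH_SIZE;
}

/* Safe on partially built contexts: arrays are zeroed, free(NULL) is a no-op. */
static void ctx_dealloc(struct ctx *c, unsigned int nr_reqs)
{
	unsigned int i;

	if (c->reqs) {
		for (i = 0; i < nr_reqs; ++i)
			free(c->reqs[i]);
		free(c->reqs);
		c->reqs = NULL;
	}
	if (c->buffers) {
		for (i = 0; i < nr_reqs; ++i)
			free(c->buffers[i]);
		free(c->buffers);
		c->buffers = NULL;
	}
}

/* Returns the number of request/buffer slots allocated, or -1 on failure. */
static int ctx_alloc(struct ctx *c, unsigned int compressor_batch_size)
{
	unsigned int i, nr_reqs = clamp_nr_reqs(compressor_batch_size);

	assert(nr_reqs >= 1);	/* assumption: a compressor reports at least 1 */
	c->reqs = NULL;
	c->buffers = calloc(nr_reqs, sizeof(*c->buffers));
	if (!c->buffers)
		goto fail;
	for (i = 0; i < nr_reqs; ++i) {
		c->buffers[i] = malloc(BUF_SIZE);
		if (!c->buffers[i])
			goto fail;
	}
	c->reqs = calloc(nr_reqs, sizeof(*c->reqs));
	if (!c->reqs)
		goto fail;
	for (i = 0; i < nr_reqs; ++i) {
		c->reqs[i] = malloc(REQ_SIZE);
		if (!c->reqs[i])
			goto fail;
	}
	return (int)nr_reqs;

fail:
	/* Single cleanup path, whichever allocation failed. */
	ctx_dealloc(c, nr_reqs);
	return -1;
}
```

With a batch size of 1, each of the two arrays holds a single pointer, which corresponds to the two-pointer per-CPU overhead mentioned above for non-batching compressors.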

Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 113 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 81 insertions(+), 32 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 717835f214b2..2273dbfd460f 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -80,6 +80,9 @@ static bool zswap_pool_reached_full;
 
 #define ZSWAP_PARAM_UNSET ""
 
+/* Limit the batch size to limit per-CPU memory usage for reqs and buffers. */
+#define ZSWAP_MAX_BATCH_SIZE 8U
+
 static int zswap_setup(void);
 
 /* Enable/disable zswap */
@@ -145,9 +148,9 @@ bool zswap_never_enabled(void)
 
 struct crypto_acomp_ctx {
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
+	struct acomp_req **reqs;
+	u8 **buffers;
 	struct crypto_wait wait;
-	u8 *buffer;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -166,6 +169,7 @@ struct zswap_pool {
 	struct work_struct release_work;
 	struct hlist_node node;
 	char tfm_name[CRYPTO_MAX_ALG_NAME];
+	u8 nr_reqs;
 };
 
 /* Global LRU lists shared by all zswap pools. */
@@ -258,16 +262,29 @@ static void __zswap_pool_empty(struct percpu_ref *ref);
  *   zswap_cpu_comp_prepare(), not others.
  * - Cleanup acomp_ctx resources on all cores in zswap_pool_destroy().
  */
-static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
+static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_reqs)
 {
+	u8 i;
+
 	if (IS_ERR_OR_NULL(acomp_ctx))
 		return;
 
-	if (!IS_ERR_OR_NULL(acomp_ctx->req))
-		acomp_request_free(acomp_ctx->req);
+	if (acomp_ctx->reqs) {
+		for (i = 0; i < nr_reqs; ++i) {
+			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
+				acomp_request_free(acomp_ctx->reqs[i]);
+		}
+		kfree(acomp_ctx->reqs);
+	}
+
 	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 		crypto_free_acomp(acomp_ctx->acomp);
-	kfree(acomp_ctx->buffer);
+
+	if (acomp_ctx->buffers) {
+		for (i = 0; i < nr_reqs; ++i)
+			kfree(acomp_ctx->buffers[i]);
+		kfree(acomp_ctx->buffers);
+	}
 }
 
 static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
@@ -275,6 +292,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	int ret = -ENOMEM;
+	u8 i, nr_reqs = 0;
 
 	/*
 	 * The per-CPU pool->acomp_ctx is zero-initialized on allocation.
@@ -287,10 +305,6 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 		return 0;
 
-	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-	if (!acomp_ctx->buffer)
-		return ret;
-
 	acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
 	if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
@@ -300,33 +314,58 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	}
 	acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
 
-	acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
-	if (IS_ERR_OR_NULL(acomp_ctx->req)) {
-		pr_err("could not alloc crypto acomp_request %s\n",
-		       pool->tfm_name);
+	nr_reqs = min(ZSWAP_MAX_BATCH_SIZE,
+		      crypto_acomp_batch_size(acomp_ctx->acomp));
+
+	acomp_ctx->buffers = kcalloc_node(nr_reqs, sizeof(u8 *),
+					  GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->buffers)
 		goto fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
+						     cpu_to_node(cpu));
+		if (!acomp_ctx->buffers[i])
+			goto fail;
 	}
 
-	crypto_init_wait(&acomp_ctx->wait);
+	acomp_ctx->reqs = kcalloc_node(nr_reqs, sizeof(struct acomp_req *),
+				       GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->reqs)
+		goto fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
+		if (IS_ERR_OR_NULL(acomp_ctx->reqs[i])) {
+			pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
+				i, pool->tfm_name);
+			goto fail;
+		}
+	}
 
 	/*
+	 * All calls to crypto_acomp_[de]compress() from zswap will use
+	 * acomp_ctx->reqs[0] with acomp_ctx->wait.
 	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
 	 * won't be called, crypto_wait_req() will return without blocking.
 	 */
-	acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+	crypto_init_wait(&acomp_ctx->wait);
+
+	acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
 	mutex_init(&acomp_ctx->mutex);
 	return 0;
 
 fail:
-	acomp_ctx_dealloc(acomp_ctx);
+	acomp_ctx_dealloc(acomp_ctx, nr_reqs);
 	return ret;
 }
 
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 {
+	struct crypto_acomp_ctx *acomp_ctx;
 	struct zswap_pool *pool;
 	char name[38]; /* 'zswap' + 32 char (max) num + \0 */
 	gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
@@ -343,6 +382,7 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 			return NULL;
 	}
 
+	/* Many things rely on the zero-initialization. */
 	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
 	if (!pool)
 		return NULL;
@@ -393,9 +433,18 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 	 */
 	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
 
+	/*
+	 * If there was an error in adding the multi instance state, the
+	 * zero-initialized pool->nr_reqs value will be accurate for passing to
+	 * acomp_ctx_dealloc().
+	 */
 	if (ret)
 		goto ref_fail;
 
+	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+	pool->nr_reqs = min(ZSWAP_MAX_BATCH_SIZE,
+			    crypto_acomp_batch_size(acomp_ctx->acomp));
+
 	/* being the current pool takes 1 ref; this func expects the
 	 * caller to always add the new pool as the current pool
 	 */
@@ -411,7 +460,7 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 
 ref_fail:
 	for_each_possible_cpu(cpu)
-		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
+		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu), pool->nr_reqs);
 error:
 	if (pool->acomp_ctx)
 		free_percpu(pool->acomp_ctx);
@@ -470,7 +519,7 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
 	zswap_pool_debug("destroying", pool);
 
 	for_each_possible_cpu(cpu)
-		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
+		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu), pool->nr_reqs);
 
 	free_percpu(pool->acomp_ctx);
 
@@ -940,7 +989,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 
 	mutex_lock(&acomp_ctx->mutex);
 
-	dst = acomp_ctx->buffer;
+	dst = acomp_ctx->buffers[0];
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
@@ -950,7 +999,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * giving the dst buffer with enough length to avoid buffer overflow.
 	 */
 	sg_init_one(&output, dst, PAGE_SIZE * 2);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
 
 	/*
 	 * it maybe looks a little bit silly that we send an asynchronous request,
@@ -964,8 +1013,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * but in different threads running on different cpu, we have different
 	 * acomp instance, so multiple threads can do (de)compression in parallel.
 	 */
-	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
+	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
+	dlen = acomp_ctx->reqs[0]->dlen;
 	if (comp_ret)
 		goto unlock;
 
@@ -1001,27 +1050,27 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 
 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
 	mutex_lock(&acomp_ctx->mutex);
-	obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
+	obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffers[0]);
 
 	/*
 	 * zpool_obj_read_begin() might return a kmap address of highmem when
-	 * acomp_ctx->buffer is not used.  However, sg_init_one() does not
-	 * handle highmem addresses, so copy the object to acomp_ctx->buffer.
+	 * acomp_ctx->buffers[0] is not used.  However, sg_init_one() does not
+	 * handle highmem addresses, so copy the object to acomp_ctx->buffers[0].
 	 */
 	if (virt_addr_valid(obj)) {
 		src = obj;
 	} else {
-		WARN_ON_ONCE(obj == acomp_ctx->buffer);
-		memcpy(acomp_ctx->buffer, obj, entry->length);
-		src = acomp_ctx->buffer;
+		WARN_ON_ONCE(obj == acomp_ctx->buffers[0]);
+		memcpy(acomp_ctx->buffers[0], obj, entry->length);
+		src = acomp_ctx->buffers[0];
 	}
 
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_folio(&output, folio, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
-	decomp_ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
+	decomp_ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
+	dlen = acomp_ctx->reqs[0]->dlen;
 
 	zpool_obj_read_end(zpool, entry->handle, obj);
 	mutex_unlock(&acomp_ctx->mutex);
-- 
2.27.0



* [PATCH v9 18/19] mm: zswap: zswap_store() will process a folio in batches.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (16 preceding siblings ...)
  2025-04-30 20:53 ` [PATCH v9 17/19] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
@ 2025-04-30 20:53 ` Kanchana P Sridhar
  2025-05-01  5:09   ` kernel test robot
  2025-04-30 20:53 ` [PATCH v9 19/19] mm: zswap: Batched zswap_compress() with compress batching of large folios Kanchana P Sridhar
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch modifies zswap_store() to store a batch of pages at a time,
instead of storing one page at a time. It does this by calling a new
procedure zswap_store_pages() with "batch_size" pages. If the folio is
of order-0, the batch_size is 1. If zswap_store() is processing a large
folio:

 - If the compressor supports batching, the batch_size will be the
   pool->nr_reqs.

 - If the compressor does not support batching, the batch_size will be
   ZSWAP_MAX_BATCH_SIZE.
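
The batch-size rule above, and the way a folio is then walked in [start, end) ranges, can be sketched as plain C. This is a simplified illustration rather than the zswap implementation: pick_batch_size() and count_batches() are hypothetical helper names, and MAX_BATCH_SIZE mirrors ZSWAP_MAX_BATCH_SIZE.

```c
#define MAX_BATCH_SIZE 8U	/* mirrors ZSWAP_MAX_BATCH_SIZE */

/*
 * Order-0 folio: one page per batch. Large folio: the compressor's
 * request count if it batches (nr_reqs > 1), else MAX_BATCH_SIZE.
 */
static unsigned int pick_batch_size(long nr_pages, unsigned int nr_reqs)
{
	if (nr_pages <= 1)
		return 1;
	return nr_reqs > 1 ? nr_reqs : MAX_BATCH_SIZE;
}

/*
 * Walk the folio in [start, end) ranges the way the store loop would.
 * Returns the number of batches; *pages_covered receives the total
 * number of pages visited across all batches.
 */
static long count_batches(long nr_pages, unsigned int nr_reqs,
			  long *pages_covered)
{
	long batch_size = pick_batch_size(nr_pages, nr_reqs);
	long start, end, batches = 0;

	*pages_covered = 0;
	for (start = 0; start < nr_pages; start += batch_size) {
		/* The final batch may be shorter than batch_size. */
		end = start + batch_size < nr_pages ?
		      start + batch_size : nr_pages;
		*pages_covered += end - start;
		batches++;
	}
	return batches;
}
```

For example, under these assumptions a 512-page folio with a non-batching compressor is processed as 64 batches of 8 pages, while a 10-page folio with nr_reqs of 8 is processed as one batch of 8 followed by one batch of 2.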

zswap_store_pages() performs, for a batch of pages in a folio, all the
work that zswap_store_page() previously did for a single page.
zswap_store_pages() starts by allocating all zswap entries required to
store the batch. Next, it calls zswap_compress() to sequentially
compress each page in the batch. Finally, it adds the batch's zswap
entries to the xarray and LRU, charges zswap memory and increments zswap
stats.

The error handling and cleanup required for all failure scenarios that can
occur while storing a batch in zswap are consolidated to a single
"store_pages_failed" label in zswap_store_pages().

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
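
The consolidated failure handling described in the commit message relies on pre-setting each entry's handle to an error value, so that one cleanup loop can tell which entries own backing storage. A minimal userspace sketch of that pattern follows. It is not the zswap code: store_batch(), struct entry and the two counters are hypothetical, malloc() stands in for the entry cache, and a counter stands in for zpool_free().

```c
#include <stdbool.h>
#include <stdlib.h>

#define BATCH 8
#define MAX_ERRNO 4095
/* Plays the role of (unsigned long)ERR_PTR(-EINVAL); EINVAL is 22. */
#define HANDLE_INVALID ((unsigned long)-22)

struct entry {
	unsigned long handle;
};

static int backing_frees;	/* simulated zpool_free() calls */
static int entry_frees;		/* entries released on the failure path */

/* Same idea as IS_ERR_VALUE(): error values sit in the top MAX_ERRNO range. */
static bool is_err_value(unsigned long x)
{
	return x >= (unsigned long)-MAX_ERRNO;
}

/*
 * Store @nr entries (nr <= BATCH). "Compression" fails at index @fail_at,
 * or never fails if @fail_at is negative.
 */
static bool store_batch(int nr, int fail_at)
{
	struct entry *entries[BATCH];
	int i;

	for (i = 0; i < nr; ++i) {
		entries[i] = malloc(sizeof(*entries[i]));
		if (!entries[i]) {
			nr = i;	/* only entries[0 .. i-1] need cleanup */
			goto failed;
		}
		entries[i]->handle = HANDLE_INVALID;
	}

	for (i = 0; i < nr; ++i) {
		if (i == fail_at)
			goto failed;
		/* Pretend backing storage was allocated for this entry. */
		entries[i]->handle = 0x1000UL + (unsigned long)i;
	}

	/* Success: the real code hands entries to the xarray; free them here. */
	for (i = 0; i < nr; ++i)
		free(entries[i]);
	return true;

failed:
	for (i = 0; i < nr; ++i) {
		/* Only entries with a valid handle own backing storage. */
		if (!is_err_value(entries[i]->handle))
			backing_frees++;
		free(entries[i]);
		entry_frees++;
	}
	return false;
}
```

If compression fails at index 2 of a 4-entry batch, the single failure path releases backing storage for the two entries that had been written and frees all four entries.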
 mm/zswap.c | 199 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 130 insertions(+), 69 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 2273dbfd460f..1d6795704350 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1518,81 +1518,125 @@ static void shrink_worker(struct work_struct *w)
 * main API
 **********************************/
 
-static bool zswap_store_page(struct page *page,
-			     struct obj_cgroup *objcg,
-			     struct zswap_pool *pool)
+/*
+ * Store multiple pages in @folio, starting from the page at index @start up to
+ * the page at index @end-1.
+ */
+static bool zswap_store_pages(struct folio *folio,
+			      long start,
+			      long end,
+			      struct obj_cgroup *objcg,
+			      struct zswap_pool *pool)
 {
-	swp_entry_t page_swpentry = page_swap_entry(page);
-	struct zswap_entry *entry, *old;
-
-	/* allocate entry */
-	entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
-	if (!entry) {
-		zswap_reject_kmemcache_fail++;
-		return false;
-	}
+	struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
+	int node_id = folio_nid(folio);
+	u8 i, store_fail_idx = 0, nr_pages = end - start;
 
-	if (!zswap_compress(page, entry, pool))
-		goto compress_failed;
+	for (i = 0; i < nr_pages; ++i) {
+		entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
 
-	old = xa_store(swap_zswap_tree(page_swpentry),
-		       swp_offset(page_swpentry),
-		       entry, GFP_KERNEL);
-	if (xa_is_err(old)) {
-		int err = xa_err(old);
+		if (unlikely(!entries[i])) {
+			zswap_reject_kmemcache_fail++;
+			/*
+			 * While handling this error, we only need to call
+			 * zswap_entry_cache_free() for entries[0 .. i-1].
+			 */
+			nr_pages = i;
+			goto store_pages_failed;
+		}
 
-		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
-		zswap_reject_alloc_fail++;
-		goto store_failed;
+		/*
+		 * Initialize the handle to an error value. This facilitates
+		 * having a consolidated failure handling
+		 * 'goto store_pages_failed' that can inspect the value of the
+		 * handle to determine whether zpool memory needs to be
+		 * de-allocated.
+		 */
+		entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
 	}
 
-	/*
-	 * We may have had an existing entry that became stale when
-	 * the folio was redirtied and now the new version is being
-	 * swapped out. Get rid of the old.
-	 */
-	if (old)
-		zswap_entry_free(old);
+	for (i = 0; i < nr_pages; ++i) {
+		struct page *page = folio_page(folio, start + i);
 
-	/*
-	 * The entry is successfully compressed and stored in the tree, there is
-	 * no further possibility of failure. Grab refs to the pool and objcg,
-	 * charge zswap memory, and increment zswap_stored_pages.
-	 * The opposite actions will be performed by zswap_entry_free()
-	 * when the entry is removed from the tree.
-	 */
-	zswap_pool_get(pool);
-	if (objcg) {
-		obj_cgroup_get(objcg);
-		obj_cgroup_charge_zswap(objcg, entry->length);
+		if (!zswap_compress(page, entries[i], pool))
+			goto store_pages_failed;
 	}
-	atomic_long_inc(&zswap_stored_pages);
 
-	/*
-	 * We finish initializing the entry while it's already in xarray.
-	 * This is safe because:
-	 *
-	 * 1. Concurrent stores and invalidations are excluded by folio lock.
-	 *
-	 * 2. Writeback is excluded by the entry not being on the LRU yet.
-	 *    The publishing order matters to prevent writeback from seeing
-	 *    an incoherent entry.
-	 */
-	entry->pool = pool;
-	entry->swpentry = page_swpentry;
-	entry->objcg = objcg;
-	entry->referenced = true;
-	if (entry->length) {
-		INIT_LIST_HEAD(&entry->lru);
-		zswap_lru_add(&zswap_list_lru, entry);
+	for (i = 0; i < nr_pages; ++i) {
+		swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, start + i));
+		struct zswap_entry *old, *entry = entries[i];
+
+		old = xa_store(swap_zswap_tree(page_swpentry),
+			       swp_offset(page_swpentry),
+			       entry, GFP_KERNEL);
+		if (unlikely(xa_is_err(old))) {
+			int err = xa_err(old);
+
+			WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+			zswap_reject_alloc_fail++;
+			/*
+			 * Entries up to this point have been stored in the
+			 * xarray. zswap_store() will erase them from the xarray
+			 * and call zswap_entry_free(). Local cleanup in
+			 * 'store_pages_failed' only needs to happen for
+			 * entries from [@i to @nr_pages).
+			 */
+			store_fail_idx = i;
+			goto store_pages_failed;
+		}
+
+		/*
+		 * We may have had an existing entry that became stale when
+		 * the folio was redirtied and now the new version is being
+		 * swapped out. Get rid of the old.
+		 */
+		if (unlikely(old))
+			zswap_entry_free(old);
+
+		/*
+		 * The entry is successfully compressed and stored in the tree, there is
+		 * no further possibility of failure. Grab refs to the pool and objcg,
+		 * charge zswap memory, and increment zswap_stored_pages.
+		 * The opposite actions will be performed by zswap_entry_free()
+		 * when the entry is removed from the tree.
+		 */
+		zswap_pool_get(pool);
+		if (objcg) {
+			obj_cgroup_get(objcg);
+			obj_cgroup_charge_zswap(objcg, entry->length);
+		}
+		atomic_long_inc(&zswap_stored_pages);
+
+		/*
+		 * We finish initializing the entry while it's already in xarray.
+		 * This is safe because:
+		 *
+		 * 1. Concurrent stores and invalidations are excluded by folio lock.
+		 *
+		 * 2. Writeback is excluded by the entry not being on the LRU yet.
+		 *    The publishing order matters to prevent writeback from seeing
+		 *    an incoherent entry.
+		 */
+		entry->pool = pool;
+		entry->swpentry = page_swpentry;
+		entry->objcg = objcg;
+		entry->referenced = true;
+		if (likely(entry->length)) {
+			INIT_LIST_HEAD(&entry->lru);
+			zswap_lru_add(&zswap_list_lru, entry);
+		}
 	}
 
 	return true;
 
-store_failed:
-	zpool_free(pool->zpool, entry->handle);
-compress_failed:
-	zswap_entry_cache_free(entry);
+store_pages_failed:
+	for (i = store_fail_idx; i < nr_pages; ++i) {
+		if (!IS_ERR_VALUE(entries[i]->handle))
+			zpool_free(pool->zpool, entries[i]->handle);
+
+		zswap_entry_cache_free(entries[i]);
+	}
+
 	return false;
 }
 
@@ -1603,8 +1647,9 @@ bool zswap_store(struct folio *folio)
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
 	struct zswap_pool *pool;
+	unsigned int batch_size;
 	bool ret = false;
-	long index;
+	long start, end;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1638,10 +1683,26 @@ bool zswap_store(struct folio *folio)
 		mem_cgroup_put(memcg);
 	}
 
-	for (index = 0; index < nr_pages; ++index) {
-		struct page *page = folio_page(folio, index);
+	/*
+	 * If a large folio is being swapped out and the zswap compressor
+	 * supports batching, i.e., has multiple acomp requests, the folio will
+	 * be compressed in batches of @pool->nr_reqs. If the compressor has
+	 * only one acomp request, the folio will be compressed in batches of
+	 * ZSWAP_MAX_BATCH_SIZE pages, where each page in the batch is
+	 * compressed sequentially. We see better performance by processing the
+	 * folio in batches of ZSWAP_MAX_BATCH_SIZE, due to cache locality of
+	 * working set structures such as the array of zswap_entry's for the
+	 * batch.
+	 */
+	batch_size = (nr_pages > 1) ? ((pool->nr_reqs > 1) ?
+					pool->nr_reqs : ZSWAP_MAX_BATCH_SIZE)
+				    : 1;
+
+	/* Store the folio in batches of "batch_size" pages. */
+	for (start = 0; start < nr_pages; start += batch_size) {
+		end = min(start + batch_size, nr_pages);
 
-		if (!zswap_store_page(page, objcg, pool))
+		if (!zswap_store_pages(folio, start, end, objcg, pool))
 			goto put_pool;
 	}
 
@@ -1671,9 +1732,9 @@ bool zswap_store(struct folio *folio)
 		struct zswap_entry *entry;
 		struct xarray *tree;
 
-		for (index = 0; index < nr_pages; ++index) {
-			tree = swap_zswap_tree(swp_entry(type, offset + index));
-			entry = xa_erase(tree, offset + index);
+		for (start = 0; start < nr_pages; ++start) {
+			tree = swap_zswap_tree(swp_entry(type, offset + start));
+			entry = xa_erase(tree, offset + start);
 			if (entry)
 				zswap_entry_free(entry);
 		}
-- 
2.27.0



* [PATCH v9 19/19] mm: zswap: Batched zswap_compress() with compress batching of large folios.
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (17 preceding siblings ...)
  2025-04-30 20:53 ` [PATCH v9 18/19] mm: zswap: zswap_store() will process a folio in batches Kanchana P Sridhar
@ 2025-04-30 20:53 ` Kanchana P Sridhar
  2025-05-08 19:25 ` [PATCH v9 00/19] zswap compression batching Sridhar, Kanchana P
  2025-05-11 20:52 ` Nhat Pham
  20 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-04-30 20:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch introduces a new unified implementation of zswap_compress()
for compressors that do and do not support batching. This eliminates
code duplication and facilitates maintainability of the code with the
introduction of compress batching.

The loop in zswap_store_pages() that called the earlier zswap_compress()
sequentially, one page at a time, is replaced with this new version of
zswap_compress(), which accepts multiple pages to compress as a batch.

If the compressor does not support batching, each page in the batch is
compressed and stored sequentially.

If the zswap compressor supports batching, e.g., 'deflate-iaa' on
the Intel IAA hardware accelerator, the batch is compressed in parallel
in hardware by calling crypto_acomp_batch_compress(), the new batch
compression API added earlier in this series. If all requests in the
batch are compressed without errors, the compressed buffers are then
stored in zpool.

Another important change in this patch concerns the acomp_ctx mutex
locking in zswap_compress(). Earlier, the mutex was held only during
compression. With the new code, [un]locking the mutex per page caused
regressions for software compressors when testing with usemem
(30 processes) and with kernel compilation using the 'allmod' config. The
regressions were more egregious when PMD folios were stored. The
implementation in this commit locks/unlocks the mutex once per batch,
which resolves the regression.
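
The dispatch and locking behaviour described above can be modelled with a small userspace sketch. It only counts events and makes no claim about the real code beyond what this message states: simulate_compress() and struct batch_stats are hypothetical names, and no actual mutex or crypto call is involved.

```c
#include <assert.h>

struct batch_stats {
	unsigned int lock_acquisitions;	/* times the ctx mutex would be taken */
	unsigned int sequential_calls;	/* one-page compress calls */
	unsigned int batch_calls;	/* multi-page batch submissions */
};

/*
 * Model one call of the unified compress routine for @nr_pages pages on a
 * compressor with @nr_reqs request slots.
 */
static struct batch_stats simulate_compress(unsigned int nr_pages,
					    unsigned int nr_reqs)
{
	struct batch_stats s = { 0, 0, 0 };
	unsigned int nr_comps = nr_pages < nr_reqs ? nr_pages : nr_reqs;
	unsigned int i;

	assert(nr_comps >= 1);	/* assumption: both inputs are at least 1 */

	s.lock_acquisitions++;	/* lock once for the whole batch */
	for (i = 0; i < nr_pages; i += nr_comps) {
		if (nr_comps == 1)
			s.sequential_calls++;	/* page i compressed alone */
		else
			s.batch_calls++;	/* nr_comps pages submitted together */
	}
	/* The unlock would happen here, again once per batch. */
	return s;
}
```

In this model, 8 pages on a non-batching compressor produce one lock acquisition and 8 sequential compress calls, whereas per-page locking would have taken the mutex 8 times. The same 8 pages on a compressor with 8 request slots produce one lock acquisition and a single batch submission.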

The use of prefetchw() for zswap entries, together with the
likely()/unlikely() annotations, prevents regressions with software
compressors like zstd, and generally improves non-batching compressors'
performance with the batching code by ~8%.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 187 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 132 insertions(+), 55 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 1d6795704350..561096f29c58 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -35,6 +35,7 @@
 #include <linux/pagemap.h>
 #include <linux/workqueue.h>
 #include <linux/list_lru.h>
+#include <linux/prefetch.h>
 
 #include "swap.h"
 #include "internal.h"
@@ -973,71 +974,147 @@ static void zswap_entry_free(struct zswap_entry *entry)
 /*********************************
 * compressed storage functions
 **********************************/
-static bool zswap_compress(struct page *page, struct zswap_entry *entry,
-			   struct zswap_pool *pool)
+/*
+ * Unified code path for compressors that do and do not support batching. This
+ * procedure will compress multiple @nr_pages passed in as @pages.
+ *
+ * @nr_pages can be ZSWAP_MAX_BATCH_SIZE even if the compressor does not support
+ * batching.
+ *
+ * If @pool->nr_reqs is 1, each page is processed sequentially.
+ *
+ * If @pool->nr_reqs is > 1, compression batching is invoked, except if
+ * @nr_pages is 1: if so, we call the fully synchronous non-batching
+ * crypto_acomp API.
+ *
+ * It is assumed that @nr_pages <= @pool->nr_reqs. We could
+ * check this, but don't, for performance reasons. zswap_store() makes
+ * sure of this by design.
+ *
+ * In both cases, if all compressions are successful, the compressed buffers
+ * are stored in zpool.
+ *
+ * A few important changes made to not regress and in fact improve
+ * compression performance with non-batching software compressors, using this
+ * new/batching code:
+ *
+ * 1) acomp_ctx mutex locking:
+ *    Earlier, the mutex was only held during compression. With the new code,
+ *    [un]locking the mutex per page caused regressions for software
+ *    compressors. We now lock the mutex once per batch, which resolved the
+ *    regression.
+ *
+ * 2) The prefetchw() and likely()/unlikely() annotations prevent
+ *    regressions with software compressors like zstd, and generally improve
+ *    non-batching compressors' performance with the batching code by ~7.3%.
+ */
+static bool zswap_compress(struct page *pages[], struct zswap_entry *entries[],
+			   unsigned int nr_pages, struct zswap_pool *pool)
 {
 	struct crypto_acomp_ctx *acomp_ctx;
 	struct scatterlist input, output;
-	int comp_ret = 0, alloc_ret = 0;
-	unsigned int dlen = PAGE_SIZE;
-	unsigned long handle;
-	struct zpool *zpool;
+	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
+	int errors[ZSWAP_MAX_BATCH_SIZE];
+	struct zpool *zpool = pool->zpool;
+	unsigned int i, j, nr_comps = min(nr_pages, pool->nr_reqs);
+	int err;
 	gfp_t gfp;
-	u8 *dst;
+
+	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
 
 	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
 
 	mutex_lock(&acomp_ctx->mutex);
 
-	dst = acomp_ctx->buffers[0];
-	sg_init_table(&input, 1);
-	sg_set_page(&input, page, PAGE_SIZE, 0);
-
 	/*
-	 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
-	 * and hardware-accelerators may won't check the dst buffer size, so
-	 * giving the dst buffer with enough length to avoid buffer overflow.
+	 * Note:
+	 * [i] refers to the incoming batch space and is used to
+	 *     index into @pages, @entries and @errors.
 	 */
-	sg_init_one(&output, dst, PAGE_SIZE * 2);
-	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
+	for (i = 0; i < nr_pages; i += nr_comps) {
 
-	/*
-	 * it maybe looks a little bit silly that we send an asynchronous request,
-	 * then wait for its completion synchronously. This makes the process look
-	 * synchronous in fact.
-	 * Theoretically, acomp supports users send multiple acomp requests in one
-	 * acomp instance, then get those requests done simultaneously. but in this
-	 * case, zswap actually does store and load page by page, there is no
-	 * existing method to send the second page before the first page is done
-	 * in one thread doing zwap.
-	 * but in different threads running on different cpu, we have different
-	 * acomp instance, so multiple threads can do (de)compression in parallel.
-	 */
-	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
-	dlen = acomp_ctx->reqs[0]->dlen;
-	if (comp_ret)
-		goto unlock;
+		if (likely(nr_comps == 1)) {
+			sg_init_table(&input, 1);
+			sg_set_page(&input, pages[i], PAGE_SIZE, 0);
 
-	zpool = pool->zpool;
-	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
-	alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page));
-	if (alloc_ret)
-		goto unlock;
-
-	zpool_obj_write(zpool, handle, dst, dlen);
-	entry->handle = handle;
-	entry->length = dlen;
-
-unlock:
-	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
-		zswap_reject_compress_poor++;
-	else if (comp_ret)
-		zswap_reject_compress_fail++;
-	else if (alloc_ret)
-		zswap_reject_alloc_fail++;
+			/*
+			 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
+			 * and hardware-accelerators may won't check the dst buffer size, so
+			 * giving the dst buffer with enough length to avoid buffer overflow.
+			 */
+			sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
+			acomp_request_set_params(acomp_ctx->reqs[0], &input,
+						 &output, PAGE_SIZE, PAGE_SIZE);
+
+			errors[i] = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
+						    &acomp_ctx->wait);
+			if (unlikely(errors[i]))
+				goto compress_error;
+		} else if (!crypto_acomp_batch_compress(acomp_ctx->reqs,
+							pages,
+							acomp_ctx->buffers,
+							dlens,
+							errors,
+							nr_pages)) {
+				goto compress_error;
+		}
+
+		/*
+		 * All @nr_comps pages were successfully compressed.
+		 * Store the pages in zpool.
+		 *
+		 * Note:
+		 * [j] refers to the incoming batch space and is used to
+		 *     index into @pages, @entries and @errors.
+		 * [k] refers to the @acomp_ctx space, as determined by
+		 *     @pool->nr_reqs, and is used to index into
+		 *     @acomp_ctx->reqs and @acomp_ctx->buffers.
+		 */
+		for (j = i; j < i + nr_comps; ++j) {
+			unsigned int k = j - i;
+			unsigned long handle;
+
+			/*
+			 * prefetchw() minimizes cache-miss latency by
+			 * moving the zswap entry to the cache before it
+			 * is written to; reducing sys time by ~1.5% for
+			 * non-batching software compressors.
+			 */
+			prefetchw(entries[j]);
+			err = zpool_malloc(zpool, acomp_ctx->reqs[k]->dlen, gfp, &handle,
+					   page_to_nid(pages[j]));
+
+			if (unlikely(err)) {
+				if (err == -ENOSPC)
+					zswap_reject_compress_poor++;
+				else
+					zswap_reject_alloc_fail++;
+
+				goto err_unlock;
+			}
+
+			zpool_obj_write(zpool, handle, acomp_ctx->buffers[k], acomp_ctx->reqs[k]->dlen);
+			entries[j]->handle = handle;
+			entries[j]->length = acomp_ctx->reqs[k]->dlen;
+		}
+	} /* finished compress and store nr_pages. */
 
 	mutex_unlock(&acomp_ctx->mutex);
-	return comp_ret == 0 && alloc_ret == 0;
+	return true;
+
+compress_error:
+	for (j = i; j < i + nr_comps; ++j) {
+		if (errors[j]) {
+			if (errors[j] == -ENOSPC)
+				zswap_reject_compress_poor++;
+			else
+				zswap_reject_compress_fail++;
+		}
+	}
+
+err_unlock:
+	mutex_unlock(&acomp_ctx->mutex);
+	return false;
 }
 
 static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
@@ -1529,6 +1606,7 @@ static bool zswap_store_pages(struct folio *folio,
 			      struct zswap_pool *pool)
 {
 	struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
+	struct page *pages[ZSWAP_MAX_BATCH_SIZE];
 	int node_id = folio_nid(folio);
 	u8 i, store_fail_idx = 0, nr_pages = end - start;
 
@@ -1555,12 +1633,11 @@ static bool zswap_store_pages(struct folio *folio,
 		entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
 	}
 
-	for (i = 0; i < nr_pages; ++i) {
-		struct page *page = folio_page(folio, start + i);
+	for (i = 0; i < nr_pages; ++i)
+		pages[i] = folio_page(folio, start + i);
 
-		if (!zswap_compress(page, entries[i], pool))
-			goto store_pages_failed;
-	}
+	if (!zswap_compress(pages, entries, nr_pages, pool))
+		goto store_pages_failed;
 
 	for (i = 0; i < nr_pages; ++i) {
 		swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, start + i));
-- 
2.27.0
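
The [j]/[k] index convention described in the patch comment can be sketched in user space as follows. This is only an illustration under assumptions: the chunking loop, the function name map_batch_to_ctx() and the ctx_of[] array are hypothetical, and nr_reqs stands in for @pool->nr_reqs (it must be non-zero).

```c
#include <stddef.h>

/*
 * Hypothetical helper: walk nr_pages in chunks of at most nr_reqs and
 * record, for each batch-space index j, the context-space index k = j - i
 * that would select the request/buffer slot for that page.
 * Returns the number of chunks visited. nr_reqs must be > 0.
 */
static size_t map_batch_to_ctx(size_t nr_pages, size_t nr_reqs, size_t *ctx_of)
{
	size_t chunks = 0;

	for (size_t i = 0; i < nr_pages; i += nr_reqs) {
		/* Pages compressed in this pass: a full chunk or the tail. */
		size_t nr_comps = nr_pages - i < nr_reqs ? nr_pages - i : nr_reqs;

		for (size_t j = i; j < i + nr_comps; ++j)
			ctx_of[j] = j - i;	/* k restarts at 0 for every chunk */

		chunks++;
	}

	return chunks;
}
```

With 10 pages and 4 request slots, the slots are reused across three passes, so a page's position in the batch and its slot in the compression context differ from the second pass onward.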


* Re: [PATCH v9 02/19] crypto: acomp - Reinstate non-chained crypto_acomp_[de]compress().
  2025-04-30 20:52 ` [PATCH v9 02/19] crypto: acomp - Reinstate non-chained crypto_acomp_[de]compress() Kanchana P Sridhar
@ 2025-05-01  0:29   ` kernel test robot
  0 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2025-05-01  0:29 UTC (permalink / raw)
  To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosry.ahmed,
	nphamcs, chengming.zhou, usamaarif642, ryan.roberts, 21cnbao,
	ying.huang, akpm, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi
  Cc: oe-kbuild-all, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Hi Kanchana,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 2c01d9f3c61101355afde90dc5c0b39d9a772ef3]

url:    https://github.com/intel-lab-lkp/linux/commits/Kanchana-P-Sridhar/crypto-acomp-Remove-request-chaining/20250501-045602
base:   2c01d9f3c61101355afde90dc5c0b39d9a772ef3
patch link:    https://lore.kernel.org/r/20250430205305.22844-3-kanchana.p.sridhar%40intel.com
patch subject: [PATCH v9 02/19] crypto: acomp - Reinstate non-chained crypto_acomp_[de]compress().
config: arc-randconfig-001-20250501 (https://download.01.org/0day-ci/archive/20250501/202505010736.dy4ElGuu-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 13.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250501/202505010736.dy4ElGuu-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505010736.dy4ElGuu-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> crypto/acompress.c:298:12: warning: 'acomp_do_req_chain' defined but not used [-Wunused-function]
     298 | static int acomp_do_req_chain(struct acomp_req *req, bool comp)
         |            ^~~~~~~~~~~~~~~~~~

vim +/acomp_do_req_chain +298 crypto/acompress.c

b67a026003725a5 Herbert Xu         2025-03-09  297  
1a66016d1faca1e Kanchana P Sridhar 2025-04-30 @298  static int acomp_do_req_chain(struct acomp_req *req, bool comp)
b67a026003725a5 Herbert Xu         2025-03-09  299  {
b67a026003725a5 Herbert Xu         2025-03-09  300  	int err;
b67a026003725a5 Herbert Xu         2025-03-09  301  
b67a026003725a5 Herbert Xu         2025-03-09  302  	acomp_save_req(req, acomp_reqchain_done);
b67a026003725a5 Herbert Xu         2025-03-09  303  
1a66016d1faca1e Kanchana P Sridhar 2025-04-30  304  	err = acomp_do_one_req(req, comp);
b67a026003725a5 Herbert Xu         2025-03-09  305  	if (err == -EBUSY || err == -EINPROGRESS)
1a66016d1faca1e Kanchana P Sridhar 2025-04-30  306  		return err;
b67a026003725a5 Herbert Xu         2025-03-09  307  
1a66016d1faca1e Kanchana P Sridhar 2025-04-30  308  	return acomp_reqchain_finish(req, err);
b67a026003725a5 Herbert Xu         2025-03-09  309  }
b67a026003725a5 Herbert Xu         2025-03-09  310  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers.
  2025-04-30 20:52 ` [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers Kanchana P Sridhar
@ 2025-05-01  1:40   ` Herbert Xu
  2025-05-02 15:53     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 30+ messages in thread
From: Herbert Xu @ 2025-05-01  1:40 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, wajdi.k.feghali, vinodh.gopal

On Wed, Apr 30, 2025 at 01:52:56PM -0700, Kanchana P Sridhar wrote:
>
> @@ -127,6 +131,22 @@ struct acomp_req {
>  struct crypto_acomp {
>  	int (*compress)(struct acomp_req *req);
>  	int (*decompress)(struct acomp_req *req);
> +	unsigned int (*get_batch_size)(void);
> +	bool (*batch_compress)(
> +		struct acomp_req *reqs[],
> +		struct page *pages[],
> +		u8 *dsts[],
> +		unsigned int dlens[],
> +		int errors[],
> +		int nr_reqs);
> +	bool (*batch_decompress)(
> +		struct acomp_req *reqs[],
> +		u8 *srcs[],
> +		struct page *pages[],
> +		unsigned int slens[],
> +		unsigned int dlens[],
> +		int errors[],
> +		int nr_reqs);

I shelved request chaining because allocating one request per page
is actively harmful to performance.  So we should not add any
interface that is based on one request per page.

My plan is to supply a whole folio through acomp_request_set_src_folio
and mark it as a batch request with a data unit size of 4K, e.g.:

	acomp_request_set_src_folio(req, folio, 0, len);
	acomp_request_set_data_unit(req, 4096);

Then the algorithm can dice it up in whatever way it sees fit.  For
algorithms that don't support batching, the acompress API should dice
it up and feed it to the algorithm piece-meal.

IOW the folio loop in zswap_store would be moved into the Crypto API.
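
As a rough user-space illustration of the "dice it up" fallback for algorithms without native batching (not the POC mentioned in this message), the sketch below walks a source buffer one data unit at a time. Everything in it is an assumption made for illustration: unit_compress_fn, compress_by_units() and toy_compress() are hypothetical names, and acomp_request_set_data_unit() is only a call proposed in this message.

```c
#include <stddef.h>
#include <string.h>

#define UNIT_SIZE 4096u	/* data unit size from the proposal (4K) */

/* Hypothetical per-unit compressor: returns output length, or 0 on failure. */
typedef size_t (*unit_compress_fn)(const unsigned char *src, size_t slen,
				   unsigned char *dst, size_t dcap);

/*
 * Feed the source to the compressor piece-meal, one data unit per call.
 * Each unit gets its own dcap_per_unit-sized slot in dst, and its output
 * length is recorded in dlens[]. Returns the number of units compressed
 * before the first failure (nr_units if none failed).
 */
static size_t compress_by_units(const unsigned char *src, size_t len,
				size_t unit, unit_compress_fn fn,
				unsigned char *dst, size_t dcap_per_unit,
				size_t *dlens)
{
	size_t nr_units = len / unit;
	size_t i;

	for (i = 0; i < nr_units; i++) {
		dlens[i] = fn(src + i * unit, unit,
			      dst + i * dcap_per_unit, dcap_per_unit);
		if (dlens[i] == 0)
			break;
	}

	return i;
}

/* Toy compressor: a uniform unit becomes one byte, anything else is copied. */
static size_t toy_compress(const unsigned char *src, size_t slen,
			   unsigned char *dst, size_t dcap)
{
	size_t i;

	for (i = 1; i < slen; i++)
		if (src[i] != src[0])
			break;

	if (i == slen && dcap >= 1) {
		dst[0] = src[0];
		return 1;
	}
	if (dcap < slen)
		return 0;

	memcpy(dst, src, slen);
	return slen;
}
```

The caller sees one output length per unit, which is the shape the folio loop in zswap would consume if the dicing were done below the API boundary.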

This is contingent on one API change, bringing back NULL dst support
to acompress.  This way zswap does not need to worry about allocating
memory that might not even be needed (when pages compress well).

This won't look like the useless NULL dst we had before, which simply
pre-allocated memory rather than allocating it on demand.

What acompress should do is allocate one dst page at a time, once that
is filled up, then allocate one more.  They should be chained up in an
SG list.  Pages that do not compress can be marked as a zero-length
entry in the SG list.

If the allocation fails at any point in time, simply stop the
batching at that point and return the SG list of what has been
compressed so far.  After processing the returned pages, zswap
can then call acompress again with an offset into the folio to
continue compression.

To prevent pathological cases of zero progress, zswap can provide
one pre-allocated page to seed the process.  For iaa, it should
just allocate as many pages as it needs for batching, and if that
fails, simply fall back to no batching and do things one page at
a time (or however many pages you manage to allocate).
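
The stop-and-resume behaviour described above can be modelled in user space roughly as follows. This is a simplified sketch under stated assumptions, not the proposed kernel implementation: it allocates one destination buffer per unit instead of packing output into shared pages, the "compressor" only recognises uniform units, and struct out_seg, struct budget_alloc, budget_get() and compress_some() are all hypothetical names. The pre-allocated seed page is not modelled.

```c
#include <stddef.h>
#include <stdlib.h>

#define UNIT 4096u

/* One output descriptor per source unit; len == 0 marks "did not compress". */
struct out_seg {
	unsigned char *buf;
	size_t len;
};

/* Allocator with a fixed budget, standing in for page allocation that can fail. */
struct budget_alloc {
	size_t remaining;
};

static unsigned char *budget_get(struct budget_alloc *a, size_t n)
{
	if (a->remaining == 0)
		return NULL;
	a->remaining--;
	return malloc(n);
}

/*
 * Compress units [start, nr_units), allocating destination memory on demand.
 * Incompressible units get a zero-length entry and need no allocation.
 * On allocation failure, stop and return the index of the first unit that
 * was not processed, so the caller can resume from that offset later.
 */
static size_t compress_some(const unsigned char *src, size_t nr_units,
			    size_t start, struct budget_alloc *a,
			    struct out_seg *segs)
{
	size_t i;

	for (i = start; i < nr_units; i++) {
		const unsigned char *u = src + i * UNIT;
		size_t j;

		for (j = 1; j < UNIT; j++)
			if (u[j] != u[0])
				break;

		if (j < UNIT) {
			/* Toy rule: non-uniform data "does not compress". */
			segs[i].buf = NULL;
			segs[i].len = 0;
			continue;
		}

		segs[i].buf = budget_get(a, 1);
		if (!segs[i].buf)
			break;	/* stop batching at this point */

		segs[i].buf[0] = u[0];
		segs[i].len = 1;
	}

	return i;
}
```

A caller would process the entries returned so far, replenish memory, and call compress_some() again with the returned index as the new start.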

I'll whip up a quick POC and we can work on top of it.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: [PATCH v9 18/19] mm: zswap: zswap_store() will process a folio in batches.
  2025-04-30 20:53 ` [PATCH v9 18/19] mm: zswap: zswap_store() will process a folio in batches Kanchana P Sridhar
@ 2025-05-01  5:09   ` kernel test robot
  0 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2025-05-01  5:09 UTC (permalink / raw)
  To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosry.ahmed,
	nphamcs, chengming.zhou, usamaarif642, ryan.roberts, 21cnbao,
	ying.huang, akpm, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi
  Cc: oe-kbuild-all, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Hi Kanchana,

kernel test robot noticed the following build errors:

[auto build test ERROR on 2c01d9f3c61101355afde90dc5c0b39d9a772ef3]

url:    https://github.com/intel-lab-lkp/linux/commits/Kanchana-P-Sridhar/crypto-acomp-Remove-request-chaining/20250501-045602
base:   2c01d9f3c61101355afde90dc5c0b39d9a772ef3
patch link:    https://lore.kernel.org/r/20250430205305.22844-19-kanchana.p.sridhar%40intel.com
patch subject: [PATCH v9 18/19] mm: zswap: zswap_store() will process a folio in batches.
config: arc-randconfig-001-20250501 (https://download.01.org/0day-ci/archive/20250501/202505011147.h3jgtfmE-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 13.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250501/202505011147.h3jgtfmE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505011147.h3jgtfmE-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from <command-line>:
   mm/zswap.c: In function 'zswap_store':
>> include/linux/compiler_types.h:557:45: error: call to '__compiletime_assert_394' declared with attribute error: min(start + batch_size, nr_pages) signedness error
     557 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:538:25: note: in definition of macro '__compiletime_assert'
     538 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:557:9: note: in expansion of macro '_compiletime_assert'
     557 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:93:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      93 |         BUILD_BUG_ON_MSG(!__types_ok(ux, uy),           \
         |         ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:98:9: note: in expansion of macro '__careful_cmp_once'
      98 |         __careful_cmp_once(op, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))
         |         ^~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:105:25: note: in expansion of macro '__careful_cmp'
     105 | #define min(x, y)       __careful_cmp(min, x, y)
         |                         ^~~~~~~~~~~~~
   mm/zswap.c:1703:23: note: in expansion of macro 'min'
    1703 |                 end = min(start + batch_size, nr_pages);
         |                       ^~~


vim +/__compiletime_assert_394 +557 include/linux/compiler_types.h

eb5c2d4b45e3d2 Will Deacon 2020-07-21  543  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  544  #define _compiletime_assert(condition, msg, prefix, suffix) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21  545  	__compiletime_assert(condition, msg, prefix, suffix)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  546  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  547  /**
eb5c2d4b45e3d2 Will Deacon 2020-07-21  548   * compiletime_assert - break build and emit msg if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  549   * @condition: a compile-time constant condition to check
eb5c2d4b45e3d2 Will Deacon 2020-07-21  550   * @msg:       a message to emit if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  551   *
eb5c2d4b45e3d2 Will Deacon 2020-07-21  552   * In tradition of POSIX assert, this macro will break the build if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  553   * supplied condition is *false*, emitting the supplied error message if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  554   * compiler has support to do so.
eb5c2d4b45e3d2 Will Deacon 2020-07-21  555   */
eb5c2d4b45e3d2 Will Deacon 2020-07-21  556  #define compiletime_assert(condition, msg) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 @557  	_compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  558  
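
For readers unfamiliar with this class of error: the kernel's min() refuses to compare operands of mismatched signedness, and the usual remedy is to compare in one explicitly chosen type, as min_t() does. The user-space sketch below shows that idea only. The MIN_T macro and batch_end() are hypothetical stand-ins, and the parameter types are assumptions, since the declarations of start, batch_size and nr_pages are not shown in this report.

```c
/*
 * User-space stand-in for a typed minimum: cast both operands to one
 * explicit type before comparing. Note that this simple version evaluates
 * its arguments more than once, so avoid side effects in them.
 */
#define MIN_T(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Hypothetical helper: exclusive end index of the batch starting at `start`. */
static long batch_end(long start, unsigned char batch_size, long nr_pages)
{
	return MIN_T(long, start + batch_size, nr_pages);
}
```

The last batch of a folio is clamped to nr_pages, which is what the flagged expression computes.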

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* RE: [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers.
  2025-05-01  1:40   ` Herbert Xu
@ 2025-05-02 15:53     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 30+ messages in thread
From: Sridhar, Kanchana P @ 2025-05-02 15:53 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	linux-crypto@vger.kernel.org, davem@davemloft.net,
	clabbe@baylibre.com, ardb@kernel.org, ebiggers@google.com,
	surenb@google.com, Accardi, Kristen C, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Wednesday, April 30, 2025 6:41 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate
> batching support in acomp & drivers.
> 
> On Wed, Apr 30, 2025 at 01:52:56PM -0700, Kanchana P Sridhar wrote:
> >
> > @@ -127,6 +131,22 @@ struct acomp_req {
> >  struct crypto_acomp {
> >  	int (*compress)(struct acomp_req *req);
> >  	int (*decompress)(struct acomp_req *req);
> > +	unsigned int (*get_batch_size)(void);
> > +	bool (*batch_compress)(
> > +		struct acomp_req *reqs[],
> > +		struct page *pages[],
> > +		u8 *dsts[],
> > +		unsigned int dlens[],
> > +		int errors[],
> > +		int nr_reqs);
> > +	bool (*batch_decompress)(
> > +		struct acomp_req *reqs[],
> > +		u8 *srcs[],
> > +		struct page *pages[],
> > +		unsigned int slens[],
> > +		unsigned int dlens[],
> > +		int errors[],
> > +		int nr_reqs);
> 
> I shelved request chaining because allocating one request per page
> is actively harmful to performance.  So we should not add any
> interface that is based on one request per page.

Hi Herbert,

My cover letter presents data that I've gathered that indicates significant
performance improvements with the crypto_acomp_batch_compress()
interface that allocates one request per page.

In addition, I would also like to share the p50/p99 latency of just the calls
to crypto_acomp_compress() and crypto_acomp_batch_compress() that
I gathered using the silesia.tar dataset (http://wanos.co/assets/silesia.tar)
and a simple madvise test that reads the dataset into memory, then
swaps out all pages and swaps them back in.

This data is on Sapphire Rapids, core frequency fixed at 2500 MHz.
I have enabled 64K folios.
The "count" refers to the number of calls to the crypto_acomp API.
As expected, the deflate-iaa "count" values in v9 are much lower
because zswap_compress() in v9 uses compression batching, i.e.,
invokes crypto_acomp_batch_compress() with batches of 8 pages,
while storing the 64K folios.

     -------------------------------------------------------------------------
     64K folios:    Normalized Per-Page Latency of crypto_acomp
                           calls in zswap_compress() (ns)
     ------------+------------------------------+----------------------------
                 |     mm-unstable-4-21-2025    |              v9
     ------------+------------------------------+----------------------------
                 |   count       p50       p99  |   count      p50       p99
     ------------+------------------------------+----------------------------
     deflate-iaa |  50,459     3,396     3,663  |   6,379      717       774
                 |                              |
     zstd        |  50,631    27,610    33,006  |  50,631   27,253    32,516
     ------------+------------------------------+----------------------------

Nothing in this data indicates that sending one acomp request per page
harms performance, with either deflate-iaa or zstd. The p50 per-page
latency with IAA drops from 3,396 ns to 717 ns, a 4.7X speedup, using the
crypto_acomp_batch_compress() interface in question.
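
The ratios behind these statements can be recomputed from the table with a few lines of C. The struct and function names below are hypothetical; the figures are copied from the table as posted.

```c
/* One table row: number of crypto_acomp calls and per-page latency in ns. */
struct lat_row {
	unsigned long count;
	double p50_ns;
	double p99_ns;
};

/* Figures copied from the table above. */
static const struct lat_row iaa_base  = { 50459,  3396.0,  3663.0 };
static const struct lat_row iaa_v9    = {  6379,   717.0,   774.0 };
static const struct lat_row zstd_base = { 50631, 27610.0, 33006.0 };
static const struct lat_row zstd_v9   = { 50631, 27253.0, 32516.0 };

/* Baseline latency divided by new latency: values above 1 are speedups. */
static double p50_speedup(const struct lat_row *base, const struct lat_row *next)
{
	return base->p50_ns / next->p50_ns;
}

static double p99_speedup(const struct lat_row *base, const struct lat_row *next)
{
	return base->p99_ns / next->p99_ns;
}

/* How many baseline calls correspond to one call in the new series. */
static double call_ratio(const struct lat_row *base, const struct lat_row *next)
{
	return (double)base->count / (double)next->count;
}
```

For deflate-iaa both the p50 and p99 ratios come out at roughly 4.7, and the call ratio is a little under 8, which is consistent with batches of 8 pages per 64K folio. For zstd the call count is unchanged and the latency ratio is only slightly above 1.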

> 
> My plan is to supply a whole folio through acomp_request_set_src_folio
> and mark it as a batch request with a data unit size of 4K, e.g.:
> 
> 	acomp_request_set_src_folio(req, folio, 0, len);
> 	acomp_request_set_data_unit(req, 4096);
> 
> Then the algorithm can dice it up in whatever way it sees fit.  For
> algorithms that don't support batching, the acompress API should dice
> it up and feed it to the algorithm piece-meal.
> 
> IOW the folio loop in zswap_store would be moved into the Crypto API.
> 
> This is contingent on one API change, bringing back NULL dst support
> to acompress.  This way zswap does not need to worry about allocating
> memory that might not even be needed (when pages compress well).
> 
> This won't look like the useless NULL dst we had before, which simply
> pre-allocated memory rather than allocating it on demand.
> 
> What acompress should do is allocate one dst page at a time, once that
> is filled up, then allocate one more.  They should be chained up in an
> SG list.  Pages that do not compress can be marked as a zero-length
> entry in the SG list.
> 
> If the allocation fails at any point in time, simply stop the
> batching at that point and return the SG list of what has been
> compressed so far.  After processing the returned pages, zswap
> can then call acompress again with an offset into the folio to
> continue compression.

I am not sure if this would be feasible, because zswap_store()
incrementally does other book-keeping necessary for mm/swap
consistency as pages get compressed, such as allocating entries,
storing compressed buffers in zpool, updating the xarray of swap
offsets stored in zswap, adding the entry to the zswap memcg LRU
list, etc.

As soon as an error is encountered in zswap_compress(),
zswap_store() has to clean up any prior zpool stores for the batch,
and delete any swap offsets for the folio from the xarray.
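
The all-or-nothing requirement can be illustrated with a toy user-space model. It is a sketch only: struct toy_store stands in for the zpool and xarray state, and toy_store_one(), toy_undo_one() and store_batch_all_or_nothing() are hypothetical names, not zswap functions.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BATCH 8

/* Toy backing store standing in for zpool plus the swap-offset xarray. */
struct toy_store {
	bool used[MAX_BATCH];
	size_t nr_used;
};

/* Store one page; fails if its compression is flagged as failed. */
static bool toy_store_one(struct toy_store *s, size_t idx, bool compress_ok)
{
	if (!compress_ok)
		return false;
	s->used[idx] = true;
	s->nr_used++;
	return true;
}

/* Undo a previous store for this index, if there was one. */
static void toy_undo_one(struct toy_store *s, size_t idx)
{
	if (s->used[idx]) {
		s->used[idx] = false;
		s->nr_used--;
	}
}

/*
 * Store nr pages (nr <= MAX_BATCH). On the first failure, unwind every
 * store already made for this batch so that no partial state remains.
 */
static bool store_batch_all_or_nothing(struct toy_store *s,
				       const bool *compress_ok, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!toy_store_one(s, i, compress_ok[i]))
			goto rollback;

	return true;

rollback:
	while (i-- > 0)
		toy_undo_one(s, i);
	return false;
}
```

The point of the model is that the unwind logic needs to know exactly which per-page side effects have already happened, which is easiest when the loop and the bookkeeping live in the same place.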

Imo, handing over the folio store loop to crypto might make this
"maintaining consistency of mm/swap incrementally with each
page compressed/stored" somewhat messy. However, I would like
to request the zswap maintainers to weigh in with their insights
on pros/cons of what you are proposing.

> 
> To prevent pathological cases of zero progress, zswap can provide
> one pre-allocated page to seed the process.  For iaa, it should
> just allocate as many pages as it needs for batching, and if that
> fails, simply fall back to no batching and do things one page at
> a time (or however many pages you manage to allocate).

I'm somewhat concerned about a) allocating memory and b) adding
computation in the zswap_store() path. It is not clear what the
motivating factor for doing so is, especially because the solution and
data presented in v9 indicate only performance upside with
the crypto_acomp_batch_compress() API and its implementation
in iaa_crypto, and modest performance gains with zstd using the
existing crypto_acomp_compress() API to compress one page at a
time in a large folio. Please let me know if I am missing something.

Thanks,
Kanchana

> 
> I'll whip up a quick POC and we can work on top of it.
> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* RE: [PATCH v9 00/19] zswap compression batching
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (18 preceding siblings ...)
  2025-04-30 20:53 ` [PATCH v9 19/19] mm: zswap: Batched zswap_compress() with compress batching of large folios Kanchana P Sridhar
@ 2025-05-08 19:25 ` Sridhar, Kanchana P
  2025-05-11 20:52 ` Nhat Pham
  20 siblings, 0 replies; 30+ messages in thread
From: Sridhar, Kanchana P @ 2025-05-08 19:25 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius
  Cc: Feghali, Wajdi K, Gopal, Vinodh, senozhatsky@chromium.org,
	Sridhar, Kanchana P

Adding Sergey.


* RE: [PATCH v9 01/19] crypto: acomp - Remove request chaining
  2025-04-30 20:52 ` [PATCH v9 01/19] crypto: acomp - Remove request chaining Kanchana P Sridhar
@ 2025-05-08 19:30   ` Sridhar, Kanchana P
  0 siblings, 0 replies; 30+ messages in thread
From: Sridhar, Kanchana P @ 2025-05-08 19:30 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Gomes, Vinicius
  Cc: Feghali, Wajdi K, Gopal, Vinodh, senozhatsky@chromium.org,
	Sridhar, Kanchana P

Adding Sergey.


* [PATCH v9 02/19] crypto: acomp - Reinstate non-chained crypto_acomp_[de]compress().
  2025-05-08 19:41 [RESEND PATCH " Kanchana P Sridhar
@ 2025-05-08 19:41 ` Kanchana P Sridhar
  2025-05-13  8:01   ` Herbert Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-05-08 19:41 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi, vinicius.gomes
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This reverts the request chaining implementations of
crypto_acomp_[de]compress() introduced in commit b67a02600372
("crypto: acomp - Add request chaining and virtual addresses") since
request chaining has been removed from acomp subsequently in commit
64929fe8c0a4 ("crypto: acomp - Remove request chaining").

This patch restores the implementations of crypto_acomp_[de]compress()
from prior to commit b67a02600372.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 crypto/acompress.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index 82fb3c04e68f..d08e0fe8cd9e 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -310,21 +310,13 @@ static int acomp_do_req_chain(struct acomp_req *req, bool comp)
 
 int crypto_acomp_compress(struct acomp_req *req)
 {
-	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
-
-	if (crypto_acomp_req_chain(tfm) || acomp_request_issg(req))
-		crypto_acomp_reqtfm(req)->compress(req);
-	return acomp_do_req_chain(req, true);
+	return crypto_acomp_reqtfm(req)->compress(req);
 }
 EXPORT_SYMBOL_GPL(crypto_acomp_compress);
 
 int crypto_acomp_decompress(struct acomp_req *req)
 {
-	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
-
-	if (crypto_acomp_req_chain(tfm) || acomp_request_issg(req))
-		crypto_acomp_reqtfm(req)->decompress(req);
-	return acomp_do_req_chain(req, false);
+	return crypto_acomp_reqtfm(req)->decompress(req);
 }
 EXPORT_SYMBOL_GPL(crypto_acomp_decompress);
 
-- 
2.27.0


* Re: [PATCH v9 00/19] zswap compression batching
  2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
                   ` (19 preceding siblings ...)
  2025-05-08 19:25 ` [PATCH v9 00/19] zswap compression batching Sridhar, Kanchana P
@ 2025-05-11 20:52 ` Nhat Pham
  2025-05-11 22:21   ` Sridhar, Kanchana P
  20 siblings, 1 reply; 30+ messages in thread
From: Nhat Pham @ 2025-05-11 20:52 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, wajdi.k.feghali, vinodh.gopal

On Wed, Apr 30, 2025 at 4:53 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Changes since v8:
> =================
> 1) Rebased to mm-unstable as of 4-21-2025, commit 2c01d9f3c611.
> 2) Backported commits for reverting request chaining, since these are
>    in cryptodev-2.6 but not yet in mm-unstable: without these backports,
>    deflate-iaa is non-functional in mm-unstable:
>    commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining")
>    commit 5976fe19e240 ("Revert "crypto: testmgr - Add multibuffer acomp
>                          testing"")
>    Backported this hotfix as well:
>    commit 002ba346e3d7 ("crypto: scomp - Fix off-by-one bug when
>    calculating last page").
> 3) crypto_acomp_[de]compress() restored to non-request chained
>    implementations since request chaining has been removed from acomp in
>    commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining").

I'm a bit confused. Which patches on top of mm-unstable do I need from
the crypto tree as pre-requisite for this patch series? And in which
order?

* RE: [PATCH v9 00/19] zswap compression batching
  2025-05-11 20:52 ` Nhat Pham
@ 2025-05-11 22:21   ` Sridhar, Kanchana P
  0 siblings, 0 replies; 30+ messages in thread
From: Sridhar, Kanchana P @ 2025-05-11 22:21 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosry.ahmed@linux.dev,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Sunday, May 11, 2025 1:53 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v9 00/19] zswap compression batching
> 
> On Wed, Apr 30, 2025 at 4:53 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Changes since v8:
> > =================
> > 1) Rebased to mm-unstable as of 4-21-2025, commit 2c01d9f3c611.
> > 2) Backported commits for reverting request chaining, since these are
> >    in cryptodev-2.6 but not yet in mm-unstable: without these backports,
> >    deflate-iaa is non-functional in mm-unstable:
> >    commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining")
> >    commit 5976fe19e240 ("Revert "crypto: testmgr - Add multibuffer acomp
> >                          testing"")
> >    Backported this hotfix as well:
> >    commit 002ba346e3d7 ("crypto: scomp - Fix off-by-one bug when
> >    calculating last page").
> > 3) crypto_acomp_[de]compress() restored to non-request chained
> >    implementations since request chaining has been removed from acomp in
> >    commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining").
> 
> I'm a bit confused. Which patches on top of mm-unstable do I need from
> the crypto tree as pre-requisite for this patch series? And in which
> order?

Hi Nhat,

As of today's mm-unstable (commit "c68cfbc5048e"), you can apply
patches in this order:

1) commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining")
     from the cryptodev-2.6 tree. Or, Patch 0001 in this series that does
     the same.
2) Patch 0002 in this series.
3) Skip Patches 0003 and 0004 in this series, since they were not in
     mm-unstable when I created this patch series. They are now in
     mm-unstable.
4) Apply Patches 0005-0019 in this series.

Thanks,
Kanchana


* Re: [PATCH v9 02/19] crypto: acomp - Reinstate non-chained crypto_acomp_[de]compress().
  2025-05-08 19:41 ` [PATCH v9 02/19] crypto: acomp - Reinstate non-chained crypto_acomp_[de]compress() Kanchana P Sridhar
@ 2025-05-13  8:01   ` Herbert Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Herbert Xu @ 2025-05-13  8:01 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
	akpm, senozhatsky, linux-crypto, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, vinicius.gomes, wajdi.k.feghali,
	vinodh.gopal

On Thu, May 08, 2025 at 12:41:17PM -0700, Kanchana P Sridhar wrote:
>
> diff --git a/crypto/acompress.c b/crypto/acompress.c
> index 82fb3c04e68f..d08e0fe8cd9e 100644
> --- a/crypto/acompress.c
> +++ b/crypto/acompress.c
> @@ -310,21 +310,13 @@ static int acomp_do_req_chain(struct acomp_req *req, bool comp)
>  
>  int crypto_acomp_compress(struct acomp_req *req)
>  {
> -	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
> -
> -	if (crypto_acomp_req_chain(tfm) || acomp_request_issg(req))
> -		crypto_acomp_reqtfm(req)->compress(req);
> -	return acomp_do_req_chain(req, true);
> +	return crypto_acomp_reqtfm(req)->compress(req);

That's not right.  Request chaining has already been removed.
What remains is linear address support which you've just removed
with this patch.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
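
A minimal userspace sketch of one reading of the objection above, not the kernel code itself: once chaining is gone, the branch that the hunk deletes still routes requests that carry a plain linear address through a fallback, and calling the driver callback unconditionally drops that path. Every name below (toy_req, toy_driver_compress, toy_linear_fallback, toy_compress, toy_compress_direct, toy_demo) is hypothetical and exists only for illustration. The real logic lives in crypto/acompress.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy request: the source is either in scatterlist form or a linear address. */
struct toy_req {
	bool is_sg;      /* true when src is already in scatterlist form */
	const void *src;
	size_t len;
	bool wrapped;    /* set by the fallback path, for illustration only */
};

/* Hypothetical driver callback that only accepts scatterlist requests. */
static int toy_driver_compress(struct toy_req *req)
{
	return req->is_sg ? 0 : -1;
}

/* Hypothetical fallback: present the linear buffer in a form the driver accepts. */
static int toy_linear_fallback(struct toy_req *req)
{
	req->is_sg = true;
	req->wrapped = true;
	return toy_driver_compress(req);
}

/* Dispatch that keeps linear-address support. */
static int toy_compress(struct toy_req *req)
{
	if (req->is_sg)
		return toy_driver_compress(req);
	return toy_linear_fallback(req);
}

/* Dispatch shaped like the quoted hunk: always call the driver directly. */
static int toy_compress_direct(struct toy_req *req)
{
	return toy_driver_compress(req);
}

/* Small demonstration: the direct call rejects what the full dispatch accepts. */
int toy_demo(void)
{
	struct toy_req linear = { .is_sg = false, .src = "abc", .len = 3 };

	assert(toy_compress_direct(&linear) != 0);
	assert(toy_compress(&linear) == 0);
	return 0;
}
```

In this toy model a scatterlist request behaves identically under both dispatchers. Only the linear-address request shows the difference, which is the case the review comment appears to be pointing at.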

Thread overview: 30+ messages
2025-04-30 20:52 [PATCH v9 00/19] zswap compression batching Kanchana P Sridhar
2025-04-30 20:52 ` [PATCH v9 01/19] crypto: acomp - Remove request chaining Kanchana P Sridhar
2025-05-08 19:30   ` Sridhar, Kanchana P
2025-04-30 20:52 ` [PATCH v9 02/19] crypto: acomp - Reinstate non-chained crypto_acomp_[de]compress() Kanchana P Sridhar
2025-05-01  0:29   ` kernel test robot
2025-04-30 20:52 ` [PATCH v9 03/19] Revert "crypto: testmgr - Add multibuffer acomp testing" Kanchana P Sridhar
2025-04-30 20:52 ` [PATCH v9 04/19] crypto: scomp - Fix off-by-one bug when calculating last page Kanchana P Sridhar
2025-04-30 20:52 ` [PATCH v9 05/19] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
2025-04-30 20:52 ` [PATCH v9 06/19] crypto: iaa - New architecture for IAA device WQ comp/decomp usage & core mapping Kanchana P Sridhar
2025-04-30 20:52 ` [PATCH v9 07/19] crypto: iaa - Define and use req->data instead of req->base.data Kanchana P Sridhar
2025-04-30 20:52 ` [PATCH v9 08/19] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto Kanchana P Sridhar
2025-04-30 20:52 ` [PATCH v9 09/19] crypto: iaa - CRYPTO_ACOMP_REQ_POLL acomp_req flag for sequential vs. parallel Kanchana P Sridhar
2025-04-30 20:52 ` [PATCH v9 10/19] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers Kanchana P Sridhar
2025-05-01  1:40   ` Herbert Xu
2025-05-02 15:53     ` Sridhar, Kanchana P
2025-04-30 20:52 ` [PATCH v9 11/19] crypto: iaa - Implement crypto_acomp batching interfaces for Intel IAA Kanchana P Sridhar
2025-04-30 20:52 ` [PATCH v9 12/19] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
2025-04-30 20:52 ` [PATCH v9 13/19] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
2025-04-30 20:53 ` [PATCH v9 14/19] mm: zswap: Move the CPU hotplug procedures under "pool functions" Kanchana P Sridhar
2025-04-30 20:53 ` [PATCH v9 15/19] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion Kanchana P Sridhar
2025-04-30 20:53 ` [PATCH v9 16/19] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources Kanchana P Sridhar
2025-04-30 20:53 ` [PATCH v9 17/19] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
2025-04-30 20:53 ` [PATCH v9 18/19] mm: zswap: zswap_store() will process a folio in batches Kanchana P Sridhar
2025-05-01  5:09   ` kernel test robot
2025-04-30 20:53 ` [PATCH v9 19/19] mm: zswap: Batched zswap_compress() with compress batching of large folios Kanchana P Sridhar
2025-05-08 19:25 ` [PATCH v9 00/19] zswap compression batching Sridhar, Kanchana P
2025-05-11 20:52 ` Nhat Pham
2025-05-11 22:21   ` Sridhar, Kanchana P
  -- strict thread matches above, loose matches on Subject: below --
2025-05-08 19:41 [RESEND PATCH " Kanchana P Sridhar
2025-05-08 19:41 ` [PATCH v9 02/19] crypto: acomp - Reinstate non-chained crypto_acomp_[de]compress() Kanchana P Sridhar
2025-05-13  8:01   ` Herbert Xu
