public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
@ 2025-11-28 14:25 ` Giovanni Cabiddu
  2025-11-28 16:13   ` Chris Mason
  2025-11-28 19:04 ` [RFC PATCH 01/16] crypto: zstd - fix double-free in per-CPU stream cleanup Giovanni Cabiddu
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 14:25 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky

Apologies, I just realized that there was a typo in one of the email
addresses in the To list (Chris Mason's).
Let me know if I should resend the series.

-- 
Giovanni

On Fri, Nov 28, 2025 at 07:04:48PM +0000, Giovanni Cabiddu wrote:
> This patch series applies to:
>   https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
> 
> This series adds support for hardware-accelerated compression and decompression
> in BTRFS using the acomp APIs in the Crypto Framework.  While the
> integration is generic and should work with any acomp-capable provider,
> the initial enablement targets Intel QAT devices.
> 
> Supported operations:
>   - zlib: compression and decompression (all QAT generations)
>   - zstd: compression only (GEN4 and GEN6)
> 
> This is a rework of the earlier RFC series [1].
> 
> Changes in this series:
>   1. Re-enable zlib-deflate in the Crypto API and QAT driver. These were
>      removed in [2] due to lack of in-kernel users. This series reverts
>      those commits to restore the functionality.
> 
>   2. Add compression level support to the acomp framework. The core
>      implementation is from Herbert Xu [3]; I've rebased it and
>      addressed checkpatch style issues.
>      Compression levels are enabled in deflate, zstd, and the QAT driver.
> 
>   3. Add compression offload to QAT accelerators in BTRFS using the acomp
>      layer, with runtime control via sysfs.
>      This feature is gated behind CONFIG_BTRFS_EXPERIMENTAL.
> 
> Note that I also included in this set the bug fix `crypto: zstd - fix
> double-free in per-CPU stream cleanup`, even though it is already merged
> in crypto-2.6, in case someone wants to test the series.
> 
> Feedback Requested:
>   1. General approach for integrating acomp with BTRFS.
>   2. Folio-to-scatterlist conversion.
>      @Herbert, any thoughts on this? Would it make sense to do it in the
>      acomp layer instead?
>   3. Compression level changes.
>   4. Offload threshold strategy. Should acomp implementations report
>      optimal data size thresholds, possibly per compression level and
>      direction?
>   5. Optimizations on the LZ4s to sequences algorithm used for GEN4.
>      @Yann, any suggestions on how to improve it?
>   6. What benchmarks are required for acceptance?
> 
> Performance results: I'm including the results from the
> **previous implementation** to give an idea of the expected performance;
> I still need to formally benchmark the new implementation.
> 
> Benchmarked on a dual-socket Intel Xeon Platinum 8470N system:
>   - 512GB RAM (16x32GB DDR5 4800 MT/s)
>   - 4 NVMe drives (349.3GB Intel SSDPE21K375GA each)
>   - 2 QAT 4xxx devices (one per socket, compression-only configuration)
> 
> Test: 4 parallel processes writing 50GB each (Silesia corpus) to separate
> drives
> 
> +---------------------------+---------+---------+---------+---------+
> |                           | QAT-L9  | ZSTD-L3 | ZLIB-L3 | LZO-L1  |
> +---------------------------+---------+---------+---------+---------+
> | Disk Write TPUT (GiB/s)   | 6.5     | 5.2     | 2.2     | 6.5     |
> +---------------------------+---------+---------+---------+---------+
> | CPU utils %age @208 cores | 4.56%   | 15.67%  | 12.79%  | 19.85%  |
> +---------------------------+---------+---------+---------+---------+
> | Compression Ratio         | 34%     | 35%     | 37%     | 58%     |
> +---------------------------+---------+---------+---------+---------+
> 
> Results: QAT zlib-deflate L9 matches the best disk write throughput
> (tied with LZO) at significantly lower CPU utilization, and provides a
> better compression ratio than software zstd-l3, zlib-l3, and lzo.
> 
> Changes since v1:
>   - Addressed review comments from previous series.
>   - Refactored from zlib-specific to generic acomp implementation.
>   - Reworked to support folios instead of pages.
>   - Added support for zstd compression.
>   - Added runtime enable/disable via sysfs (/sys/fs/btrfs/$UUID/offload_compress).
>   - Moved buffer allocations from the data path to workspace initialization.
>   - Added compression level support.
> 
> [1] https://lore.kernel.org/all/20240426110941.5456-1-giovanni.cabiddu@intel.com/
> [2] https://lore.kernel.org/all/ZO8ULhlJSrJ0Mcsx@gondor.apana.org.au/
> [3] https://lore.kernel.org/all/cover.1716202860.git.herbert@gondor.apana.org.au/
> 
> Giovanni Cabiddu (12):
>   crypto: zstd - fix double-free in per-CPU stream cleanup
>   Revert "crypto: qat - remove unused macros in qat_comp_alg.c"
>   Revert "crypto: qat - Remove zlib-deflate"
>   crypto: qat - use memcpy_*_sglist() in zlib deflate
>   Revert "crypto: testmgr - Remove zlib-deflate"
>   crypto: deflate - add support for deflate rfc1950 (zlib)
>   crypto: acomp - add NUMA-aware stream allocation
>   crypto: deflate - add support for compression levels
>   crypto: qat - increase number of preallocated sgl descriptors
>   crypto: qat - add support for zstd
>   crypto: qat - add support for compression levels
>   btrfs: add compression hw-accelerated offload
> 
> Herbert Xu (3):
>   crypto: scomp - Add setparam interface
>   crypto: acomp - Add setparam interface
>   crypto: acomp - Add comp_params helpers
> 
> Suman Kumar Chakraborty (1):
>   crypto: zstd - add support for compression levels
> 
>  crypto/842.c                                  |   4 +-
>  crypto/acompress.c                            | 133 +++-
>  crypto/compress.h                             |   9 +-
>  crypto/deflate.c                              | 118 ++-
>  crypto/lz4.c                                  |   4 +-
>  crypto/lz4hc.c                                |   4 +-
>  crypto/lzo-rle.c                              |   4 +-
>  crypto/lzo.c                                  |   4 +-
>  crypto/scompress.c                            |  43 +-
>  crypto/testmgr.c                              |  10 +
>  crypto/testmgr.h                              |  75 ++
>  crypto/zstd.c                                 |  48 +-
>  drivers/crypto/intel/qat/Kconfig              |   1 +
>  .../intel/qat/qat_420xx/adf_420xx_hw_data.c   |   1 +
>  .../intel/qat/qat_4xxx/adf_4xxx_hw_data.c     |   1 +
>  .../intel/qat/qat_6xxx/adf_6xxx_hw_data.c     |  19 +-
>  drivers/crypto/intel/qat/qat_common/Makefile  |   1 +
>  .../intel/qat/qat_common/adf_accel_devices.h  |   8 +-
>  .../intel/qat/qat_common/adf_common_drv.h     |   6 +-
>  drivers/crypto/intel/qat/qat_common/adf_dc.c  |   5 +-
>  drivers/crypto/intel/qat/qat_common/adf_dc.h  |   3 +-
>  .../intel/qat/qat_common/adf_gen2_hw_data.c   |  16 +-
>  .../intel/qat/qat_common/adf_gen4_hw_data.c   |  29 +-
>  .../crypto/intel/qat/qat_common/adf_init.c    |   6 +-
>  .../crypto/intel/qat/qat_common/icp_qat_fw.h  |   7 +
>  .../intel/qat/qat_common/icp_qat_fw_comp.h    |   2 +
>  .../crypto/intel/qat/qat_common/icp_qat_hw.h  |   3 +-
>  drivers/crypto/intel/qat/qat_common/qat_bl.h  |   2 +-
>  .../intel/qat/qat_common/qat_comp_algs.c      | 712 +++++++++++++++++-
>  .../intel/qat/qat_common/qat_comp_req.h       |  11 +
>  .../qat/qat_common/qat_comp_zstd_utils.c      | 120 +++
>  .../qat/qat_common/qat_comp_zstd_utils.h      |  13 +
>  .../intel/qat/qat_common/qat_compression.c    |  23 +-
>  fs/btrfs/Makefile                             |   2 +-
>  fs/btrfs/acomp.c                              | 470 ++++++++++++
>  fs/btrfs/acomp_workspace.h                    |  61 ++
>  fs/btrfs/compression.c                        |  66 ++
>  fs/btrfs/compression.h                        |  30 +
>  fs/btrfs/disk-io.c                            |   6 +
>  fs/btrfs/fs.h                                 |   8 +
>  fs/btrfs/sysfs.c                              |  29 +
>  fs/btrfs/zlib.c                               |  81 ++
>  fs/btrfs/zstd.c                               |  64 ++
>  include/crypto/acompress.h                    |  27 +
>  include/crypto/internal/acompress.h           |  15 +-
>  include/crypto/internal/scompress.h           |  27 +
>  46 files changed, 2253 insertions(+), 78 deletions(-)
>  create mode 100644 drivers/crypto/intel/qat/qat_common/qat_comp_zstd_utils.c
>  create mode 100644 drivers/crypto/intel/qat/qat_common/qat_comp_zstd_utils.h
>  create mode 100644 fs/btrfs/acomp.c
>  create mode 100644 fs/btrfs/acomp_workspace.h
> 
> 
> base-commit: ebbdf6466b30e3b37f3b360826efd21f0633fb9e
> -- 
> 2.51.1
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators
  2025-11-28 14:25 ` Giovanni Cabiddu
@ 2025-11-28 16:13   ` Chris Mason
  0 siblings, 0 replies; 42+ messages in thread
From: Chris Mason @ 2025-11-28 16:13 UTC (permalink / raw)
  To: Giovanni Cabiddu, clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky

On 11/28/25 9:25 AM, Giovanni Cabiddu wrote:
> Apologies, I just realized that there was a typo in one of the email
> addresses in the To list (Chris Mason's).
> Let me know if I should resend the series.
> 

Thanks, no need to resend just for my email.

-chris

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators
@ 2025-11-28 19:04 Giovanni Cabiddu
  2025-11-28 14:25 ` Giovanni Cabiddu
                   ` (17 more replies)
  0 siblings, 18 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:04 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

This patch series applies to:
  https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git

This series adds support for hardware-accelerated compression and decompression
in BTRFS using the acomp APIs in the Crypto Framework.  While the
implementation is generic and should work with any acomp-compatible
implementation, the initial enablement targets Intel QAT devices.

Supported operations:
  - zlib: compression and decompression (all QAT generations)
  - zstd: compression only (GEN4 and GEN6)

This is a rework of the earlier RFC series [1].

Changes in this series:
  1. Re-enable zlib-deflate in the Crypto API and QAT driver. These were
     removed in [2] due to lack of in-kernel users. This series reverts
     those commits to restore the functionality.

  2. Add compression level support to the acomp framework. The core
     implementation is from Herbert Xu [3]; I've rebased it and
     addressed checkpatch style issues.
     Compression levels are enabled in deflate, zstd, and the QAT driver.

  3. Add compression offload to QAT accelerators in BTRFS using the acomp
     layer, with runtime control via sysfs.
     This feature is gated behind CONFIG_BTRFS_EXPERIMENTAL.

Note that I also included in this set the bug fix `crypto: zstd - fix
double-free in per-CPU stream cleanup`, even though it is already merged
in crypto-2.6, in case someone wants to test the series.

Feedback Requested:
  1. General approach for integrating acomp with BTRFS.
  2. Folio-to-scatterlist conversion.
     @Herbert, any thoughts on this? Would it make sense to do it in the
     acomp layer instead?
  3. Compression level changes.
  4. Offload threshold strategy. Should acomp implementations report
     optimal data size thresholds, possibly per compression level and
     direction?
  5. Optimizations on the LZ4s to sequences algorithm used for GEN4.
     @Yann, any suggestions on how to improve it?
  6. What benchmarks are required for acceptance?

Performance results: I'm including the results from the
**previous implementation** to give an idea of the expected performance;
I still need to formally benchmark the new implementation.

Benchmarked on a dual-socket Intel Xeon Platinum 8470N system:
  - 512GB RAM (16x32GB DDR5 4800 MT/s)
  - 4 NVMe drives (349.3GB Intel SSDPE21K375GA each)
  - 2 QAT 4xxx devices (one per socket, compression-only configuration)

Test: 4 parallel processes writing 50GB each (Silesia corpus) to separate
drives

+---------------------------+---------+---------+---------+---------+
|                           | QAT-L9  | ZSTD-L3 | ZLIB-L3 | LZO-L1  |
+---------------------------+---------+---------+---------+---------+
| Disk Write TPUT (GiB/s)   | 6.5     | 5.2     | 2.2     | 6.5     |
+---------------------------+---------+---------+---------+---------+
| CPU utils %age @208 cores | 4.56%   | 15.67%  | 12.79%  | 19.85%  |
+---------------------------+---------+---------+---------+---------+
| Compression Ratio         | 34%     | 35%     | 37%     | 58%     |
+---------------------------+---------+---------+---------+---------+

Results: QAT zlib-deflate L9 matches the best disk write throughput
(tied with LZO) at significantly lower CPU utilization, and provides a
better compression ratio than software zstd-l3, zlib-l3, and lzo.

Changes since v1:
  - Addressed review comments from previous series.
  - Refactored from zlib-specific to generic acomp implementation.
  - Reworked to support folios instead of pages.
  - Added support for zstd compression.
  - Added runtime enable/disable via sysfs (/sys/fs/btrfs/$UUID/offload_compress).
  - Moved buffer allocations from the data path to workspace initialization.
  - Added compression level support.

[1] https://lore.kernel.org/all/20240426110941.5456-1-giovanni.cabiddu@intel.com/
[2] https://lore.kernel.org/all/ZO8ULhlJSrJ0Mcsx@gondor.apana.org.au/
[3] https://lore.kernel.org/all/cover.1716202860.git.herbert@gondor.apana.org.au/

Giovanni Cabiddu (12):
  crypto: zstd - fix double-free in per-CPU stream cleanup
  Revert "crypto: qat - remove unused macros in qat_comp_alg.c"
  Revert "crypto: qat - Remove zlib-deflate"
  crypto: qat - use memcpy_*_sglist() in zlib deflate
  Revert "crypto: testmgr - Remove zlib-deflate"
  crypto: deflate - add support for deflate rfc1950 (zlib)
  crypto: acomp - add NUMA-aware stream allocation
  crypto: deflate - add support for compression levels
  crypto: qat - increase number of preallocated sgl descriptors
  crypto: qat - add support for zstd
  crypto: qat - add support for compression levels
  btrfs: add compression hw-accelerated offload

Herbert Xu (3):
  crypto: scomp - Add setparam interface
  crypto: acomp - Add setparam interface
  crypto: acomp - Add comp_params helpers

Suman Kumar Chakraborty (1):
  crypto: zstd - add support for compression levels

 crypto/842.c                                  |   4 +-
 crypto/acompress.c                            | 133 +++-
 crypto/compress.h                             |   9 +-
 crypto/deflate.c                              | 118 ++-
 crypto/lz4.c                                  |   4 +-
 crypto/lz4hc.c                                |   4 +-
 crypto/lzo-rle.c                              |   4 +-
 crypto/lzo.c                                  |   4 +-
 crypto/scompress.c                            |  43 +-
 crypto/testmgr.c                              |  10 +
 crypto/testmgr.h                              |  75 ++
 crypto/zstd.c                                 |  48 +-
 drivers/crypto/intel/qat/Kconfig              |   1 +
 .../intel/qat/qat_420xx/adf_420xx_hw_data.c   |   1 +
 .../intel/qat/qat_4xxx/adf_4xxx_hw_data.c     |   1 +
 .../intel/qat/qat_6xxx/adf_6xxx_hw_data.c     |  19 +-
 drivers/crypto/intel/qat/qat_common/Makefile  |   1 +
 .../intel/qat/qat_common/adf_accel_devices.h  |   8 +-
 .../intel/qat/qat_common/adf_common_drv.h     |   6 +-
 drivers/crypto/intel/qat/qat_common/adf_dc.c  |   5 +-
 drivers/crypto/intel/qat/qat_common/adf_dc.h  |   3 +-
 .../intel/qat/qat_common/adf_gen2_hw_data.c   |  16 +-
 .../intel/qat/qat_common/adf_gen4_hw_data.c   |  29 +-
 .../crypto/intel/qat/qat_common/adf_init.c    |   6 +-
 .../crypto/intel/qat/qat_common/icp_qat_fw.h  |   7 +
 .../intel/qat/qat_common/icp_qat_fw_comp.h    |   2 +
 .../crypto/intel/qat/qat_common/icp_qat_hw.h  |   3 +-
 drivers/crypto/intel/qat/qat_common/qat_bl.h  |   2 +-
 .../intel/qat/qat_common/qat_comp_algs.c      | 712 +++++++++++++++++-
 .../intel/qat/qat_common/qat_comp_req.h       |  11 +
 .../qat/qat_common/qat_comp_zstd_utils.c      | 120 +++
 .../qat/qat_common/qat_comp_zstd_utils.h      |  13 +
 .../intel/qat/qat_common/qat_compression.c    |  23 +-
 fs/btrfs/Makefile                             |   2 +-
 fs/btrfs/acomp.c                              | 470 ++++++++++++
 fs/btrfs/acomp_workspace.h                    |  61 ++
 fs/btrfs/compression.c                        |  66 ++
 fs/btrfs/compression.h                        |  30 +
 fs/btrfs/disk-io.c                            |   6 +
 fs/btrfs/fs.h                                 |   8 +
 fs/btrfs/sysfs.c                              |  29 +
 fs/btrfs/zlib.c                               |  81 ++
 fs/btrfs/zstd.c                               |  64 ++
 include/crypto/acompress.h                    |  27 +
 include/crypto/internal/acompress.h           |  15 +-
 include/crypto/internal/scompress.h           |  27 +
 46 files changed, 2253 insertions(+), 78 deletions(-)
 create mode 100644 drivers/crypto/intel/qat/qat_common/qat_comp_zstd_utils.c
 create mode 100644 drivers/crypto/intel/qat/qat_common/qat_comp_zstd_utils.h
 create mode 100644 fs/btrfs/acomp.c
 create mode 100644 fs/btrfs/acomp_workspace.h


base-commit: ebbdf6466b30e3b37f3b360826efd21f0633fb9e
-- 
2.51.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 01/16] crypto: zstd - fix double-free in per-CPU stream cleanup
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
  2025-11-28 14:25 ` Giovanni Cabiddu
@ 2025-11-28 19:04 ` Giovanni Cabiddu
  2025-11-28 19:04 ` [RFC PATCH 02/16] Revert "crypto: qat - remove unused macros in qat_comp_alg.c" Giovanni Cabiddu
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:04 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu,
	Suman Kumar Chakraborty

The crypto/zstd module has a double-free bug that occurs when multiple
tfms are allocated and freed.

The issue happens because zstd_streams (per-CPU contexts) are freed in
zstd_exit() during every tfm destruction, rather than being managed at
the module level.  When multiple tfms exist, each tfm exit attempts to
free the same shared per-CPU streams, resulting in a double-free.

This leads to a stack trace similar to:

  BUG: Bad page state in process kworker/u16:1  pfn:106fd93
  page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x106fd93
  flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
  page_type: 0xffffffff()
  raw: 0017ffffc0000000 dead000000000100 dead000000000122 0000000000000000
  raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: nonzero entire_mapcount
  Modules linked in: ...
  CPU: 3 UID: 0 PID: 2506 Comm: kworker/u16:1 Kdump: loaded Tainted: G    B
  Hardware name: ...
  Workqueue: btrfs-delalloc btrfs_work_helper
  Call Trace:
   <TASK>
   dump_stack_lvl+0x5d/0x80
   bad_page+0x71/0xd0
   free_unref_page_prepare+0x24e/0x490
   free_unref_page+0x60/0x170
   crypto_acomp_free_streams+0x5d/0xc0
   crypto_acomp_exit_tfm+0x23/0x50
   crypto_destroy_tfm+0x60/0xc0
   ...

Change the lifecycle management of zstd_streams to free the streams only
once during module cleanup.

Fixes: f5ad93ffb541 ("crypto: zstd - convert to acomp")
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Reviewed-by: Suman Kumar Chakraborty <suman.kumar.chakraborty@intel.com>
---
 crypto/zstd.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/crypto/zstd.c b/crypto/zstd.c
index dc5b36141ff8..cbbd0413751a 100644
--- a/crypto/zstd.c
+++ b/crypto/zstd.c
@@ -75,11 +75,6 @@ static int zstd_init(struct crypto_acomp *acomp_tfm)
 	return ret;
 }
 
-static void zstd_exit(struct crypto_acomp *acomp_tfm)
-{
-	crypto_acomp_free_streams(&zstd_streams);
-}
-
 static int zstd_compress_one(struct acomp_req *req, struct zstd_ctx *ctx,
 			     const void *src, void *dst, unsigned int *dlen)
 {
@@ -297,7 +292,6 @@ static struct acomp_alg zstd_acomp = {
 		.cra_module = THIS_MODULE,
 	},
 	.init = zstd_init,
-	.exit = zstd_exit,
 	.compress = zstd_compress,
 	.decompress = zstd_decompress,
 };
@@ -310,6 +304,7 @@ static int __init zstd_mod_init(void)
 static void __exit zstd_mod_fini(void)
 {
 	crypto_unregister_acomp(&zstd_acomp);
+	crypto_acomp_free_streams(&zstd_streams);
 }
 
 module_init(zstd_mod_init);
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 02/16] Revert "crypto: qat - remove unused macros in qat_comp_alg.c"
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
  2025-11-28 14:25 ` Giovanni Cabiddu
  2025-11-28 19:04 ` [RFC PATCH 01/16] crypto: zstd - fix double-free in per-CPU stream cleanup Giovanni Cabiddu
@ 2025-11-28 19:04 ` Giovanni Cabiddu
  2025-11-28 19:04 ` [RFC PATCH 03/16] Revert "crypto: qat - Remove zlib-deflate" Giovanni Cabiddu
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:04 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

Reintroduce macros related to zlib-deflate in the QAT driver.

This is in preparation for the reintroduction of rfc1950 (zlib) in the
QAT driver.

This reverts commit dfff0e35fa5dd84ae75052ba129b0219d83e46dc.

Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 drivers/crypto/intel/qat/qat_common/qat_comp_algs.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
index 8b123472b71c..a13f5fcf6bb7 100644
--- a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
+++ b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
@@ -14,6 +14,15 @@
 #include "qat_compression.h"
 #include "qat_algs_send.h"
 
+#define QAT_RFC_1950_HDR_SIZE 2
+#define QAT_RFC_1950_FOOTER_SIZE 4
+#define QAT_RFC_1950_CM_DEFLATE 8
+#define QAT_RFC_1950_CM_DEFLATE_CINFO_32K 7
+#define QAT_RFC_1950_CM_MASK 0x0f
+#define QAT_RFC_1950_CM_OFFSET 4
+#define QAT_RFC_1950_DICT_MASK 0x20
+#define QAT_RFC_1950_COMP_HDR 0x785e
+
 static DEFINE_MUTEX(algs_lock);
 static unsigned int active_devs;
 
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 03/16] Revert "crypto: qat - Remove zlib-deflate"
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (2 preceding siblings ...)
  2025-11-28 19:04 ` [RFC PATCH 02/16] Revert "crypto: qat - remove unused macros in qat_comp_alg.c" Giovanni Cabiddu
@ 2025-11-28 19:04 ` Giovanni Cabiddu
  2025-11-28 19:04 ` [RFC PATCH 04/16] crypto: qat - use memcpy_*_sglist() in zlib deflate Giovanni Cabiddu
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:04 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

Reintroduce the zlib-deflate implementation in the QAT driver through
the acomp API. This makes it possible to offload the deflate compression
algorithm from the BTRFS filesystem to QAT devices.

This reverts commit e9dd20e0e5f62d01d9404db2cf9824d1faebcf71.

Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 .../intel/qat/qat_common/qat_comp_algs.c      | 128 +++++++++++++++++-
 1 file changed, 127 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
index a13f5fcf6bb7..26eb8dcd2e53 100644
--- a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
+++ b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
@@ -63,6 +63,69 @@ static int qat_alg_send_dc_message(struct qat_compression_req *qat_req,
 	return qat_alg_send_message(alg_req);
 }
 
+static int parse_zlib_header(u16 zlib_h)
+{
+	int ret = -EINVAL;
+	__be16 header;
+	u8 *header_p;
+	u8 cmf, flg;
+
+	header = cpu_to_be16(zlib_h);
+	header_p = (u8 *)&header;
+
+	flg = header_p[0];
+	cmf = header_p[1];
+
+	if (cmf >> QAT_RFC_1950_CM_OFFSET > QAT_RFC_1950_CM_DEFLATE_CINFO_32K)
+		return ret;
+
+	if ((cmf & QAT_RFC_1950_CM_MASK) != QAT_RFC_1950_CM_DEFLATE)
+		return ret;
+
+	if (flg & QAT_RFC_1950_DICT_MASK)
+		return ret;
+
+	return 0;
+}
+
+static int qat_comp_rfc1950_callback(struct qat_compression_req *qat_req,
+				     void *resp)
+{
+	struct acomp_req *areq = qat_req->acompress_req;
+	enum direction dir = qat_req->dir;
+	__be32 qat_produced_adler;
+
+	qat_produced_adler = cpu_to_be32(qat_comp_get_produced_adler32(resp));
+
+	if (dir == COMPRESSION) {
+		__be16 zlib_header;
+
+		zlib_header = cpu_to_be16(QAT_RFC_1950_COMP_HDR);
+		scatterwalk_map_and_copy(&zlib_header, areq->dst, 0, QAT_RFC_1950_HDR_SIZE, 1);
+		areq->dlen += QAT_RFC_1950_HDR_SIZE;
+
+		scatterwalk_map_and_copy(&qat_produced_adler, areq->dst, areq->dlen,
+					 QAT_RFC_1950_FOOTER_SIZE, 1);
+		areq->dlen += QAT_RFC_1950_FOOTER_SIZE;
+	} else {
+		__be32 decomp_adler;
+		int footer_offset;
+		int consumed;
+
+		consumed = qat_comp_get_consumed_ctr(resp);
+		footer_offset = consumed + QAT_RFC_1950_HDR_SIZE;
+		if (footer_offset + QAT_RFC_1950_FOOTER_SIZE > areq->slen)
+			return -EBADMSG;
+
+		scatterwalk_map_and_copy(&decomp_adler, areq->src, footer_offset,
+					 QAT_RFC_1950_FOOTER_SIZE, 0);
+
+		if (qat_produced_adler != decomp_adler)
+			return -EBADMSG;
+	}
+	return 0;
+}
+
 static void qat_comp_generic_callback(struct qat_compression_req *qat_req,
 				      void *resp)
 {
@@ -167,6 +230,18 @@ static void qat_comp_alg_exit_tfm(struct crypto_acomp *acomp_tfm)
 	memset(ctx, 0, sizeof(*ctx));
 }
 
+static int qat_comp_alg_rfc1950_init_tfm(struct crypto_acomp *acomp_tfm)
+{
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
+	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	int ret;
+
+	ret = qat_comp_alg_init_tfm(acomp_tfm);
+	ctx->qat_comp_callback = &qat_comp_rfc1950_callback;
+
+	return ret;
+}
+
 static int qat_comp_alg_compress_decompress(struct acomp_req *areq, enum direction dir,
 					    unsigned int shdr, unsigned int sftr,
 					    unsigned int dhdr, unsigned int dftr)
@@ -242,6 +317,43 @@ static int qat_comp_alg_decompress(struct acomp_req *req)
 	return qat_comp_alg_compress_decompress(req, DECOMPRESSION, 0, 0, 0, 0);
 }
 
+static int qat_comp_alg_rfc1950_compress(struct acomp_req *req)
+{
+	if (!req->dst && req->dlen != 0)
+		return -EINVAL;
+
+	if (req->dst && req->dlen <= QAT_RFC_1950_HDR_SIZE + QAT_RFC_1950_FOOTER_SIZE)
+		return -EINVAL;
+
+	return qat_comp_alg_compress_decompress(req, COMPRESSION, 0, 0,
+						QAT_RFC_1950_HDR_SIZE,
+						QAT_RFC_1950_FOOTER_SIZE);
+}
+
+static int qat_comp_alg_rfc1950_decompress(struct acomp_req *req)
+{
+	struct crypto_acomp *acomp_tfm = crypto_acomp_reqtfm(req);
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
+	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	struct adf_accel_dev *accel_dev = ctx->inst->accel_dev;
+	u16 zlib_header;
+	int ret;
+
+	if (req->slen <= QAT_RFC_1950_HDR_SIZE + QAT_RFC_1950_FOOTER_SIZE)
+		return -EBADMSG;
+
+	scatterwalk_map_and_copy(&zlib_header, req->src, 0, QAT_RFC_1950_HDR_SIZE, 0);
+
+	ret = parse_zlib_header(zlib_header);
+	if (ret) {
+		dev_dbg(&GET_DEV(accel_dev), "Error parsing zlib header\n");
+		return ret;
+	}
+
+	return qat_comp_alg_compress_decompress(req, DECOMPRESSION, QAT_RFC_1950_HDR_SIZE,
+						QAT_RFC_1950_FOOTER_SIZE, 0, 0);
+}
+
 static struct acomp_alg qat_acomp[] = { {
 	.base = {
 		.cra_name = "deflate",
@@ -256,7 +368,21 @@ static struct acomp_alg qat_acomp[] = { {
 	.exit = qat_comp_alg_exit_tfm,
 	.compress = qat_comp_alg_compress,
 	.decompress = qat_comp_alg_decompress,
-}};
+}, {
+	.base = {
+		.cra_name = "zlib-deflate",
+		.cra_driver_name = "qat_zlib_deflate",
+		.cra_priority = 4001,
+		.cra_flags = CRYPTO_ALG_ASYNC,
+		.cra_ctxsize = sizeof(struct qat_compression_ctx),
+		.cra_reqsize = sizeof(struct qat_compression_req),
+		.cra_module = THIS_MODULE,
+	},
+	.init = qat_comp_alg_rfc1950_init_tfm,
+	.exit = qat_comp_alg_exit_tfm,
+	.compress = qat_comp_alg_rfc1950_compress,
+	.decompress = qat_comp_alg_rfc1950_decompress,
+} };
 
 int qat_comp_algs_register(void)
 {
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 04/16] crypto: qat - use memcpy_*_sglist() in zlib deflate
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (3 preceding siblings ...)
  2025-11-28 19:04 ` [RFC PATCH 03/16] Revert "crypto: qat - Remove zlib-deflate" Giovanni Cabiddu
@ 2025-11-28 19:04 ` Giovanni Cabiddu
  2025-11-28 19:04 ` [RFC PATCH 05/16] Revert "crypto: testmgr - Remove zlib-deflate" Giovanni Cabiddu
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:04 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

Replace instances of scatterwalk_map_and_copy() with memcpy_to_sglist()
and memcpy_from_sglist() to improve readability.

Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 drivers/crypto/intel/qat/qat_common/qat_comp_algs.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
index 26eb8dcd2e53..23a1ed4f6b40 100644
--- a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
+++ b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
@@ -101,11 +101,11 @@ static int qat_comp_rfc1950_callback(struct qat_compression_req *qat_req,
 		__be16 zlib_header;
 
 		zlib_header = cpu_to_be16(QAT_RFC_1950_COMP_HDR);
-		scatterwalk_map_and_copy(&zlib_header, areq->dst, 0, QAT_RFC_1950_HDR_SIZE, 1);
+		memcpy_to_sglist(areq->dst, 0, &zlib_header, QAT_RFC_1950_HDR_SIZE);
 		areq->dlen += QAT_RFC_1950_HDR_SIZE;
 
-		scatterwalk_map_and_copy(&qat_produced_adler, areq->dst, areq->dlen,
-					 QAT_RFC_1950_FOOTER_SIZE, 1);
+		memcpy_to_sglist(areq->dst, areq->dlen, &qat_produced_adler,
+				 QAT_RFC_1950_FOOTER_SIZE);
 		areq->dlen += QAT_RFC_1950_FOOTER_SIZE;
 	} else {
 		__be32 decomp_adler;
@@ -117,8 +117,8 @@ static int qat_comp_rfc1950_callback(struct qat_compression_req *qat_req,
 		if (footer_offset + QAT_RFC_1950_FOOTER_SIZE > areq->slen)
 			return -EBADMSG;
 
-		scatterwalk_map_and_copy(&decomp_adler, areq->src, footer_offset,
-					 QAT_RFC_1950_FOOTER_SIZE, 0);
+		memcpy_from_sglist(&decomp_adler, areq->src, footer_offset,
+				   QAT_RFC_1950_FOOTER_SIZE);
 
 		if (qat_produced_adler != decomp_adler)
 			return -EBADMSG;
@@ -342,7 +342,7 @@ static int qat_comp_alg_rfc1950_decompress(struct acomp_req *req)
 	if (req->slen <= QAT_RFC_1950_HDR_SIZE + QAT_RFC_1950_FOOTER_SIZE)
 		return -EBADMSG;
 
-	scatterwalk_map_and_copy(&zlib_header, req->src, 0, QAT_RFC_1950_HDR_SIZE, 0);
+	memcpy_from_sglist(&zlib_header, req->src, 0, QAT_RFC_1950_HDR_SIZE);
 
 	ret = parse_zlib_header(zlib_header);
 	if (ret) {
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 05/16] Revert "crypto: testmgr - Remove zlib-deflate"
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (4 preceding siblings ...)
  2025-11-28 19:04 ` [RFC PATCH 04/16] crypto: qat - use memcpy_*_sglist() in zlib deflate Giovanni Cabiddu
@ 2025-11-28 19:04 ` Giovanni Cabiddu
  2025-11-28 19:04 ` [RFC PATCH 06/16] crypto: deflate - add support for deflate rfc1950 (zlib) Giovanni Cabiddu
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:04 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

Reintroduce the zlib-deflate test cases for the acomp API.

This reverts commit 30febae71c6182e0762dc7744737012b4f8e6a6d.

Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 crypto/testmgr.c | 10 +++++++
 crypto/testmgr.h | 75 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index dc22b4f28633..9ca309b84b92 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -5580,6 +5580,16 @@ static const struct alg_test_desc alg_test_descs[] = {
 		.suite = {
 			.hash = __VECS(xxhash64_tv_template)
 		}
+	}, {
+		.alg = "zlib-deflate",
+		.test = alg_test_comp,
+		.fips_allowed = 1,
+		.suite = {
+			.comp = {
+				.comp = __VECS(zlib_deflate_comp_tv_template),
+				.decomp = __VECS(zlib_deflate_decomp_tv_template)
+			}
+		}
 	}, {
 		.alg = "zstd",
 		.test = alg_test_comp,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index bdd9e71fee0f..523268679bf4 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -33364,6 +33364,81 @@ static const struct comp_testvec deflate_decomp_tv_template[] = {
 	},
 };
 
+static const struct comp_testvec zlib_deflate_comp_tv_template[] = {
+	{
+		.inlen	= 70,
+		.outlen	= 44,
+		.input	= "Join us now and share the software "
+			"Join us now and share the software ",
+		.output	= "\x78\x5e\xf3\xca\xcf\xcc\x53\x28"
+			  "\x2d\x56\xc8\xcb\x2f\x57\x48\xcc"
+			  "\x4b\x51\x28\xce\x48\x2c\x4a\x55"
+			  "\x28\xc9\x48\x55\x28\xce\x4f\x2b"
+			  "\x29\x07\x71\xbc\x08\x2b\x01\x00"
+			  "\x7c\x65\x19\x3d",
+	}, {
+		.inlen	= 191,
+		.outlen	= 129,
+		.input	= "This document describes a compression method based on the DEFLATE"
+			"compression algorithm.  This document defines the application of "
+			"the DEFLATE algorithm to the IP Payload Compression Protocol.",
+		.output	= "\x78\x5e\x5d\xce\x41\x0a\xc3\x30"
+			  "\x0c\x04\xc0\xaf\xec\x0b\xf2\x87"
+			  "\xd2\xa6\x50\xe8\xc1\x07\x7f\x40"
+			  "\xb1\x95\x5a\x60\x5b\xc6\x56\x0f"
+			  "\xfd\x7d\x93\x1e\x42\xe8\x51\xec"
+			  "\xee\x20\x9f\x64\x20\x6a\x78\x17"
+			  "\xae\x86\xc8\x23\x74\x59\x78\x80"
+			  "\x10\xb4\xb4\xce\x63\x88\x56\x14"
+			  "\xb6\xa4\x11\x0b\x0d\x8e\xd8\x6e"
+			  "\x4b\x8c\xdb\x7c\x7f\x5e\xfc\x7c"
+			  "\xae\x51\x7e\x69\x17\x4b\x65\x02"
+			  "\xfc\x1f\xbc\x4a\xdd\xd8\x7d\x48"
+			  "\xad\x65\x09\x64\x3b\xac\xeb\xd9"
+			  "\xc2\x01\xc0\xf4\x17\x3c\x1c\x1c"
+			  "\x7d\xb2\x52\xc4\xf5\xf4\x8f\xeb"
+			  "\x6a\x1a\x34\x4f\x5f\x2e\x32\x45"
+			  "\x4e",
+	},
+};
+
+static const struct comp_testvec zlib_deflate_decomp_tv_template[] = {
+	{
+		.inlen	= 128,
+		.outlen	= 191,
+		.input	= "\x78\x9c\x5d\x8d\x31\x0e\xc2\x30"
+			  "\x10\x04\xbf\xb2\x2f\xc8\x1f\x10"
+			  "\x04\x09\x89\xc2\x85\x3f\x70\xb1"
+			  "\x2f\xf8\x24\xdb\x67\xd9\x47\xc1"
+			  "\xef\x49\x68\x12\x51\xae\x76\x67"
+			  "\xd6\x27\x19\x88\x1a\xde\x85\xab"
+			  "\x21\xf2\x08\x5d\x16\x1e\x20\x04"
+			  "\x2d\xad\xf3\x18\xa2\x15\x85\x2d"
+			  "\x69\xc4\x42\x83\x23\xb6\x6c\x89"
+			  "\x71\x9b\xef\xcf\x8b\x9f\xcf\x33"
+			  "\xca\x2f\xed\x62\xa9\x4c\x80\xff"
+			  "\x13\xaf\x52\x37\xed\x0e\x52\x6b"
+			  "\x59\x02\xd9\x4e\xe8\x7a\x76\x1d"
+			  "\x02\x98\xfe\x8a\x87\x83\xa3\x4f"
+			  "\x56\x8a\xb8\x9e\x8e\x5c\x57\xd3"
+			  "\xa0\x79\xfa\x02\x2e\x32\x45\x4e",
+		.output	= "This document describes a compression method based on the DEFLATE"
+			"compression algorithm.  This document defines the application of "
+			"the DEFLATE algorithm to the IP Payload Compression Protocol.",
+	}, {
+		.inlen	= 44,
+		.outlen	= 70,
+		.input	= "\x78\x9c\xf3\xca\xcf\xcc\x53\x28"
+			  "\x2d\x56\xc8\xcb\x2f\x57\x48\xcc"
+			  "\x4b\x51\x28\xce\x48\x2c\x4a\x55"
+			  "\x28\xc9\x48\x55\x28\xce\x4f\x2b"
+			  "\x29\x07\x71\xbc\x08\x2b\x01\x00"
+			  "\x7c\x65\x19\x3d",
+		.output	= "Join us now and share the software "
+			"Join us now and share the software ",
+	},
+};
+
 /*
  * LZO test vectors (null-terminated strings).
  */
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 06/16] crypto: deflate - add support for deflate rfc1950 (zlib)
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (5 preceding siblings ...)
  2025-11-28 19:04 ` [RFC PATCH 05/16] Revert "crypto: testmgr - Remove zlib-deflate" Giovanni Cabiddu
@ 2025-11-28 19:04 ` Giovanni Cabiddu
  2025-11-28 19:04 ` [RFC PATCH 07/16] crypto: scomp - Add setparam interface Giovanni Cabiddu
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:04 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

Add an acomp backend for the zlib-deflate compression algorithm.

This backend outputs data in the format defined by RFC 1950: raw deflate
data wrapped with a zlib header and footer.

Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 crypto/deflate.c | 74 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 65 insertions(+), 9 deletions(-)

diff --git a/crypto/deflate.c b/crypto/deflate.c
index a3e1fff55661..26b4617f6196 100644
--- a/crypto/deflate.c
+++ b/crypto/deflate.c
@@ -36,7 +36,7 @@ static DEFINE_MUTEX(deflate_stream_lock);
 static void *deflate_alloc_stream(void)
 {
 	size_t size = max(zlib_inflate_workspacesize(),
-			  zlib_deflate_workspacesize(-DEFLATE_DEF_WINBITS,
+			  zlib_deflate_workspacesize(MAX_WBITS,
 						     DEFLATE_DEF_MEMLEVEL));
 	struct deflate_stream *ctx;
 
@@ -113,17 +113,34 @@ static int deflate_compress_one(struct acomp_req *req,
 	return 0;
 }
 
-static int deflate_compress(struct acomp_req *req)
+enum algo {
+	ALGO_DEFLATE,
+	ALGO_ZLIB_DEFLATE,
+};
+
+static int _deflate_compress(struct acomp_req *req, enum algo algo)
 {
 	struct crypto_acomp_stream *s;
 	struct deflate_stream *ds;
+	int window_bits;
 	int err;
 
+	switch (algo) {
+	case ALGO_DEFLATE:
+		window_bits = -DEFLATE_DEF_WINBITS;
+		break;
+	case ALGO_ZLIB_DEFLATE:
+		window_bits = DEFLATE_DEF_WINBITS;
+		break;
+	default:
+		return -EINVAL;
+	}
+
 	s = crypto_acomp_lock_stream_bh(&deflate_streams);
 	ds = s->ctx;
 
 	err = zlib_deflateInit2(&ds->stream, DEFLATE_DEF_LEVEL, Z_DEFLATED,
-				-DEFLATE_DEF_WINBITS, DEFLATE_DEF_MEMLEVEL,
+				window_bits, DEFLATE_DEF_MEMLEVEL,
 				Z_DEFAULT_STRATEGY);
 	if (err != Z_OK) {
 		err = -EINVAL;
@@ -138,6 +155,16 @@ static int deflate_compress(struct acomp_req *req)
 	return err;
 }
 
+static int deflate_compress(struct acomp_req *req)
+{
+	return _deflate_compress(req, ALGO_DEFLATE);
+}
+
+static int zlib_deflate_compress(struct acomp_req *req)
+{
+	return _deflate_compress(req, ALGO_ZLIB_DEFLATE);
+}
+
 static int deflate_decompress_one(struct acomp_req *req,
 				  struct deflate_stream *ds)
 {
@@ -194,7 +221,7 @@ static int deflate_decompress_one(struct acomp_req *req,
 	return 0;
 }
 
-static int deflate_decompress(struct acomp_req *req)
+static int _deflate_decompress(struct acomp_req *req, enum algo algo)
 {
 	struct crypto_acomp_stream *s;
 	struct deflate_stream *ds;
@@ -203,7 +230,18 @@ static int deflate_decompress(struct acomp_req *req)
 	s = crypto_acomp_lock_stream_bh(&deflate_streams);
 	ds = s->ctx;
 
-	err = zlib_inflateInit2(&ds->stream, -DEFLATE_DEF_WINBITS);
+	switch (algo) {
+	case ALGO_DEFLATE:
+		err = zlib_inflateInit2(&ds->stream, -DEFLATE_DEF_WINBITS);
+		break;
+	case ALGO_ZLIB_DEFLATE:
+		err = zlib_inflateInit(&ds->stream);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
 	if (err != Z_OK) {
 		err = -EINVAL;
 		goto out;
@@ -217,6 +255,16 @@ static int deflate_decompress(struct acomp_req *req)
 	return err;
 }
 
+static int deflate_decompress(struct acomp_req *req)
+{
+	return _deflate_decompress(req, ALGO_DEFLATE);
+}
+
+static int zlib_deflate_decompress(struct acomp_req *req)
+{
+	return _deflate_decompress(req, ALGO_ZLIB_DEFLATE);
+}
+
 static int deflate_init(struct crypto_acomp *tfm)
 {
 	int ret;
@@ -228,7 +276,7 @@ static int deflate_init(struct crypto_acomp *tfm)
 	return ret;
 }
 
-static struct acomp_alg acomp = {
+static struct acomp_alg acomps[] = { {
 	.compress		= deflate_compress,
 	.decompress		= deflate_decompress,
 	.init			= deflate_init,
@@ -236,16 +284,24 @@ static struct acomp_alg acomp = {
 	.base.cra_driver_name	= "deflate-generic",
 	.base.cra_flags		= CRYPTO_ALG_REQ_VIRT,
 	.base.cra_module	= THIS_MODULE,
-};
+}, {
+	.compress		= zlib_deflate_compress,
+	.decompress		= zlib_deflate_decompress,
+	.init			= deflate_init,
+	.base.cra_name		= "zlib-deflate",
+	.base.cra_driver_name	= "zlib-deflate-generic",
+	.base.cra_flags		= CRYPTO_ALG_REQ_VIRT,
+	.base.cra_module	= THIS_MODULE,
+} };
 
 static int __init deflate_mod_init(void)
 {
-	return crypto_register_acomp(&acomp);
+	return crypto_register_acomps(acomps, ARRAY_SIZE(acomps));
 }
 
 static void __exit deflate_mod_fini(void)
 {
-	crypto_unregister_acomp(&acomp);
+	crypto_unregister_acomps(acomps, ARRAY_SIZE(acomps));
 	crypto_acomp_free_streams(&deflate_streams);
 }
 
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 07/16] crypto: scomp - Add setparam interface
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (6 preceding siblings ...)
  2025-11-28 19:04 ` [RFC PATCH 06/16] crypto: deflate - add support for deflate rfc1950 (zlib) Giovanni Cabiddu
@ 2025-11-28 19:04 ` Giovanni Cabiddu
  2025-11-28 19:04 ` [RFC PATCH 08/16] crypto: acomp " Giovanni Cabiddu
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:04 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

From: Herbert Xu <herbert@gondor.apana.org.au>

Add the scompress plumbing for setparam.  This is modelled after
setkey for shash.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 crypto/scompress.c                  | 48 ++++++++++++++++++++++++++++-
 include/crypto/internal/scompress.h | 27 ++++++++++++++++
 2 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/crypto/scompress.c b/crypto/scompress.c
index 1a7ed8ae65b0..900796f3035a 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -45,6 +45,46 @@ static cpumask_t scomp_scratch_want;
 static void scomp_scratch_workfn(struct work_struct *work);
 static DECLARE_WORK(scomp_scratch_work, scomp_scratch_workfn);
 
+static int scomp_no_setparam(struct crypto_scomp *tfm, const u8 *param,
+			     unsigned int len)
+{
+	return -ENOSYS;
+}
+
+static bool crypto_scomp_alg_has_setparam(struct scomp_alg *alg)
+{
+	return alg->setparam != scomp_no_setparam;
+}
+
+static bool crypto_scomp_alg_needs_param(struct scomp_alg *alg)
+{
+	return crypto_scomp_alg_has_setparam(alg) &&
+	       !(alg->base.cra_flags & CRYPTO_ALG_OPTIONAL_KEY);
+}
+
+static void scomp_set_need_param(struct crypto_scomp *tfm,
+				 struct scomp_alg *alg)
+{
+	if (crypto_scomp_alg_needs_param(alg))
+		crypto_scomp_set_flags(tfm, CRYPTO_TFM_NEED_KEY);
+}
+
+int crypto_scomp_setparam(struct crypto_scomp *tfm, const u8 *param,
+			  unsigned int len)
+{
+	struct scomp_alg *scomp = crypto_scomp_alg(tfm);
+	int err;
+
+	err = scomp->setparam(tfm, param, len);
+	if (unlikely(err)) {
+		scomp_set_need_param(tfm, scomp);
+		return err;
+	}
+
+	crypto_scomp_clear_flags(tfm, CRYPTO_TFM_NEED_KEY);
+	return 0;
+}
+
 static int __maybe_unused crypto_scomp_report(
 	struct sk_buff *skb, struct crypto_alg *alg)
 {
@@ -121,9 +161,12 @@ static int crypto_scomp_alloc_scratches(void)
 
 static int crypto_scomp_init_tfm(struct crypto_tfm *tfm)
 {
-	struct scomp_alg *alg = crypto_scomp_alg(__crypto_scomp_tfm(tfm));
+	struct crypto_scomp *comp = __crypto_scomp_tfm(tfm);
+	struct scomp_alg *alg = crypto_scomp_alg(comp);
 	int ret = 0;
 
+	scomp_set_need_param(comp, alg);
+
 	mutex_lock(&scomp_lock);
 	ret = crypto_acomp_alloc_streams(&alg->streams);
 	if (ret)
@@ -356,6 +399,9 @@ static void scomp_prepare_alg(struct scomp_alg *alg)
 	comp_prepare_alg(&alg->calg);
 
 	base->cra_flags |= CRYPTO_ALG_REQ_VIRT;
+
+	if (!alg->setparam)
+		alg->setparam = scomp_no_setparam;
 }
 
 int crypto_register_scomp(struct scomp_alg *alg)
diff --git a/include/crypto/internal/scompress.h b/include/crypto/internal/scompress.h
index 6a2c5f2e90f9..cfeff1009e2f 100644
--- a/include/crypto/internal/scompress.h
+++ b/include/crypto/internal/scompress.h
@@ -20,6 +20,7 @@ struct crypto_scomp {
  *
  * @compress:	Function performs a compress operation
  * @decompress:	Function performs a de-compress operation
+ * @setparam:	Set parameters of the algorithm (e.g., compression level)
  * @streams:	Per-cpu memory for algorithm
  * @calg:	Cmonn algorithm data structure shared with acomp
  */
@@ -30,6 +31,8 @@ struct scomp_alg {
 	int (*decompress)(struct crypto_scomp *tfm, const u8 *src,
 			  unsigned int slen, u8 *dst, unsigned int *dlen,
 			  void *ctx);
+	int (*setparam)(struct crypto_scomp *tfm, const u8 *param,
+			unsigned int len);
 
 	struct crypto_acomp_streams streams;
 
@@ -64,10 +67,31 @@ static inline struct scomp_alg *crypto_scomp_alg(struct crypto_scomp *tfm)
 	return __crypto_scomp_alg(crypto_scomp_tfm(tfm)->__crt_alg);
 }
 
+static inline u32 crypto_scomp_get_flags(struct crypto_scomp *tfm)
+{
+	return crypto_tfm_get_flags(crypto_scomp_tfm(tfm));
+}
+
+static inline void crypto_scomp_set_flags(struct crypto_scomp *tfm, u32 flags)
+{
+	crypto_tfm_set_flags(crypto_scomp_tfm(tfm), flags);
+}
+
+static inline void crypto_scomp_clear_flags(struct crypto_scomp *tfm, u32 flags)
+{
+	crypto_tfm_clear_flags(crypto_scomp_tfm(tfm), flags);
+}
+
+int crypto_scomp_setparam(struct crypto_scomp *tfm, const u8 *param,
+			  unsigned int len);
+
 static inline int crypto_scomp_compress(struct crypto_scomp *tfm,
 					const u8 *src, unsigned int slen,
 					u8 *dst, unsigned int *dlen, void *ctx)
 {
+	if (crypto_scomp_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
+		return -ENOKEY;
+
 	return crypto_scomp_alg(tfm)->compress(tfm, src, slen, dst, dlen, ctx);
 }
 
@@ -76,6 +100,9 @@ static inline int crypto_scomp_decompress(struct crypto_scomp *tfm,
 					  u8 *dst, unsigned int *dlen,
 					  void *ctx)
 {
+	if (crypto_scomp_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
+		return -ENOKEY;
+
 	return crypto_scomp_alg(tfm)->decompress(tfm, src, slen, dst, dlen,
 						 ctx);
 }
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 08/16] crypto: acomp - Add setparam interface
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (7 preceding siblings ...)
  2025-11-28 19:04 ` [RFC PATCH 07/16] crypto: scomp - Add setparam interface Giovanni Cabiddu
@ 2025-11-28 19:04 ` Giovanni Cabiddu
  2025-11-28 19:04 ` [RFC PATCH 09/16] crypto: acomp - Add comp_params helpers Giovanni Cabiddu
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:04 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

From: Herbert Xu <herbert@gondor.apana.org.au>

Add the acompress plumbing for setparam.  This is modelled after
setkey for ahash.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 crypto/acompress.c                  | 73 +++++++++++++++++++++++++++--
 crypto/compress.h                   |  9 +++-
 crypto/scompress.c                  |  9 +---
 include/crypto/acompress.h          | 18 +++++++
 include/crypto/internal/acompress.h |  3 ++
 5 files changed, 101 insertions(+), 11 deletions(-)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index be28cbfd22e3..12bae6ee5925 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -48,6 +48,56 @@ static inline struct acomp_alg *crypto_acomp_alg(struct crypto_acomp *tfm)
 	return __crypto_acomp_alg(crypto_acomp_tfm(tfm)->__crt_alg);
 }
 
+static int acomp_no_setparam(struct crypto_acomp *tfm, const u8 *param,
+			    unsigned int len)
+{
+	return -ENOSYS;
+}
+
+static int acomp_set_need_param(struct crypto_acomp *tfm,
+				struct acomp_alg *alg)
+{
+	if (alg->calg.base.cra_type != &crypto_acomp_type) {
+		struct crypto_scomp **ctx = acomp_tfm_ctx(tfm);
+		struct crypto_scomp *scomp = *ctx;
+
+		if (!crypto_scomp_alg_has_setparam(crypto_scomp_alg(scomp)))
+			return 0;
+	} else if (alg->setparam == acomp_no_setparam) {
+		return 0;
+	}
+
+	if ((alg->base.cra_flags & CRYPTO_ALG_OPTIONAL_KEY))
+		crypto_acomp_set_flags(tfm, CRYPTO_TFM_NEED_KEY);
+
+	return 0;
+}
+
+int crypto_acomp_setparam(struct crypto_acomp *tfm, const u8 *param,
+			  unsigned int len)
+{
+	struct acomp_alg *alg = crypto_acomp_alg(tfm);
+	int err;
+
+	if (alg->calg.base.cra_type == &crypto_acomp_type) {
+		err = alg->setparam(tfm, param, len);
+	} else {
+		struct crypto_scomp **ctx = acomp_tfm_ctx(tfm);
+		struct crypto_scomp *scomp = *ctx;
+
+		err = crypto_scomp_setparam(scomp, param, len);
+	}
+
+	if (unlikely(err)) {
+		acomp_set_need_param(tfm, alg);
+		return err;
+	}
+
+	crypto_acomp_clear_flags(tfm, CRYPTO_TFM_NEED_KEY);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(crypto_acomp_setparam);
+
 static int __maybe_unused crypto_acomp_report(
 	struct sk_buff *skb, struct crypto_alg *alg)
 {
@@ -87,8 +137,9 @@ static int crypto_acomp_init_tfm(struct crypto_tfm *tfm)
 	struct crypto_acomp *fb = NULL;
 	int err;
 
-	if (tfm->__crt_alg->cra_type != &crypto_acomp_type)
-		return crypto_init_scomp_ops_async(tfm);
+	if (alg->calg.base.cra_type != &crypto_acomp_type)
+		return crypto_init_scomp_ops_async(tfm) ?:
+		       acomp_set_need_param(acomp, alg);
 
 	if (acomp_is_async(acomp)) {
 		fb = crypto_alloc_acomp(crypto_acomp_alg_name(acomp), 0,
@@ -116,6 +167,10 @@ static int crypto_acomp_init_tfm(struct crypto_tfm *tfm)
 	if (err)
 		goto out_free_fb;
 
+	err = acomp_set_need_param(acomp, alg);
+	if (err)
+		goto out_free_fb;
+
 	return 0;
 
 out_free_fb:
@@ -285,6 +340,8 @@ int crypto_acomp_compress(struct acomp_req *req)
 {
 	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
 
+	if (crypto_acomp_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
+		return -ENOKEY;
 	if (acomp_req_on_stack(req) && acomp_is_async(tfm))
 		return -EAGAIN;
 	if (crypto_acomp_req_virt(tfm) || acomp_request_issg(req))
@@ -297,6 +354,8 @@ int crypto_acomp_decompress(struct acomp_req *req)
 {
 	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
 
+	if (crypto_acomp_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
+		return -ENOKEY;
 	if (acomp_req_on_stack(req) && acomp_is_async(tfm))
 		return -EAGAIN;
 	if (crypto_acomp_req_virt(tfm) || acomp_request_issg(req))
@@ -312,11 +371,19 @@ void comp_prepare_alg(struct comp_alg_common *alg)
 	base->cra_flags &= ~CRYPTO_ALG_TYPE_MASK;
 }
 
+static void acomp_prepare_alg(struct acomp_alg *alg)
+{
+	comp_prepare_alg(&alg->calg);
+
+	if (!alg->setparam)
+		alg->setparam = acomp_no_setparam;
+}
+
 int crypto_register_acomp(struct acomp_alg *alg)
 {
 	struct crypto_alg *base = &alg->calg.base;
 
-	comp_prepare_alg(&alg->calg);
+	acomp_prepare_alg(alg);
 
 	base->cra_type = &crypto_acomp_type;
 	base->cra_flags |= CRYPTO_ALG_TYPE_ACOMPRESS;
diff --git a/crypto/compress.h b/crypto/compress.h
index f7737a1fcbbd..55f6bd137bdc 100644
--- a/crypto/compress.h
+++ b/crypto/compress.h
@@ -9,13 +9,20 @@
 #ifndef _LOCAL_CRYPTO_COMPRESS_H
 #define _LOCAL_CRYPTO_COMPRESS_H
 
+#include <crypto/internal/scompress.h>
 #include "internal.h"
 
 struct acomp_req;
-struct comp_alg_common;
 
 int crypto_init_scomp_ops_async(struct crypto_tfm *tfm);
+int scomp_no_setparam(struct crypto_scomp *tfm, const u8 *param,
+		      unsigned int len);
 
 void comp_prepare_alg(struct comp_alg_common *alg);
 
+static inline bool crypto_scomp_alg_has_setparam(struct scomp_alg *alg)
+{
+	return alg->setparam != scomp_no_setparam;
+}
+
 #endif	/* _LOCAL_CRYPTO_COMPRESS_H */
diff --git a/crypto/scompress.c b/crypto/scompress.c
index 900796f3035a..67da9ef9b9cc 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -45,17 +45,12 @@ static cpumask_t scomp_scratch_want;
 static void scomp_scratch_workfn(struct work_struct *work);
 static DECLARE_WORK(scomp_scratch_work, scomp_scratch_workfn);
 
-static int scomp_no_setparam(struct crypto_scomp *tfm, const u8 *param,
-			     unsigned int len)
+int scomp_no_setparam(struct crypto_scomp *tfm, const u8 *param,
+		      unsigned int len)
 {
 	return -ENOSYS;
 }
 
-static bool crypto_scomp_alg_has_setparam(struct scomp_alg *alg)
-{
-	return alg->setparam != scomp_no_setparam;
-}
-
 static bool crypto_scomp_alg_needs_param(struct scomp_alg *alg)
 {
 	return crypto_scomp_alg_has_setparam(alg) &&
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 9eacb9fa375d..3e735171271e 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -186,6 +186,21 @@ static inline struct comp_alg_common *crypto_comp_alg_common(
 	return __crypto_comp_alg_common(crypto_acomp_tfm(tfm)->__crt_alg);
 }
 
+static inline u32 crypto_acomp_get_flags(struct crypto_acomp *tfm)
+{
+	return crypto_tfm_get_flags(crypto_acomp_tfm(tfm));
+}
+
+static inline void crypto_acomp_set_flags(struct crypto_acomp *tfm, u32 flags)
+{
+	crypto_tfm_set_flags(crypto_acomp_tfm(tfm), flags);
+}
+
+static inline void crypto_acomp_clear_flags(struct crypto_acomp *tfm, u32 flags)
+{
+	crypto_tfm_clear_flags(crypto_acomp_tfm(tfm), flags);
+}
+
 static inline unsigned int crypto_acomp_reqsize(struct crypto_acomp *tfm)
 {
 	return tfm->reqsize;
@@ -554,4 +569,7 @@ static inline struct acomp_req *acomp_request_on_stack_init(
 struct acomp_req *acomp_request_clone(struct acomp_req *req,
 				      size_t total, gfp_t gfp);
 
+int crypto_acomp_setparam(struct crypto_acomp *tfm,
+			  const u8 *param, unsigned int len);
+
 #endif
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 2d97440028ff..4cdc98a64418 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -28,6 +28,7 @@
  *
  * @compress:	Function performs a compress operation
  * @decompress:	Function performs a de-compress operation
+ * @setparam:	Set parameters of the algorithm (e.g., compression level)
  * @init:	Initialize the cryptographic transformation object.
  *		This function is used to initialize the cryptographic
  *		transformation object. This function is called only once at
@@ -46,6 +47,8 @@
 struct acomp_alg {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	int (*setparam)(struct crypto_acomp *tfm, const u8 *param,
+			unsigned int len);
 	int (*init)(struct crypto_acomp *tfm);
 	void (*exit)(struct crypto_acomp *tfm);
 
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 09/16] crypto: acomp - Add comp_params helpers
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (8 preceding siblings ...)
  2025-11-28 19:04 ` [RFC PATCH 08/16] crypto: acomp " Giovanni Cabiddu
@ 2025-11-28 19:04 ` Giovanni Cabiddu
  2025-11-28 19:04 ` [RFC PATCH 10/16] crypto: acomp - add NUMA-aware stream allocation Giovanni Cabiddu
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:04 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

From: Herbert Xu <herbert@gondor.apana.org.au>

Add helpers to get compression parameters, including the level
and an optional dictionary.

Note that algorithms do not have to use these helpers and may
come up with their own set of parameters.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 crypto/acompress.c                  | 49 +++++++++++++++++++++++++++++
 include/crypto/acompress.h          |  9 ++++++
 include/crypto/internal/acompress.h | 10 ++++++
 3 files changed, 68 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index 12bae6ee5925..a9d77056bd43 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -15,6 +15,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/percpu.h>
+#include <linux/rtnetlink.h>
 #include <linux/scatterlist.h>
 #include <linux/sched.h>
 #include <linux/seq_file.h>
@@ -651,5 +652,53 @@ struct acomp_req *acomp_request_clone(struct acomp_req *req,
 }
 EXPORT_SYMBOL_GPL(acomp_request_clone);
 
+int crypto_acomp_getparams(struct crypto_acomp_params *params, const u8 *raw,
+			   unsigned int len)
+{
+	struct rtattr *rta = (struct rtattr *)raw;
+	void *dict;
+
+	crypto_acomp_putparams(params);
+	params->level = CRYPTO_COMP_NO_LEVEL;
+
+	for (;; rta = RTA_NEXT(rta, len)) {
+		if (!RTA_OK(rta, len))
+			return -EINVAL;
+
+		if (rta->rta_type == CRYPTO_COMP_PARAM_LAST)
+			break;
+
+		switch (rta->rta_type) {
+		case CRYPTO_COMP_PARAM_LEVEL:
+			if (RTA_PAYLOAD(rta) != 4)
+				return -EINVAL;
+			memcpy(&params->level, RTA_DATA(rta), 4);
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
+	dict = RTA_NEXT(rta, len);
+	if (!len)
+		return 0;
+
+	params->dict = kvmemdup(dict, len, GFP_KERNEL);
+	if (!params->dict)
+		return -ENOMEM;
+	params->dict_sz = len;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(crypto_acomp_getparams);
+
+void crypto_acomp_putparams(struct crypto_acomp_params *params)
+{
+	kvfree(params->dict);
+	params->dict = NULL;
+	params->dict_sz = 0;
+}
+EXPORT_SYMBOL_GPL(crypto_acomp_putparams);
+
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Asynchronous compression type");
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 3e735171271e..98a1fd5ed0f8 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -15,6 +15,7 @@
 #include <linux/container_of.h>
 #include <linux/crypto.h>
 #include <linux/err.h>
+#include <linux/limits.h>
 #include <linux/scatterlist.h>
 #include <linux/slab.h>
 #include <linux/spinlock_types.h>
@@ -69,6 +70,14 @@ struct acomp_req_chain {
 	u32 flags;
 };
 
+#define CRYPTO_COMP_NO_LEVEL		INT_MIN
+
+enum {
+	CRYPTO_COMP_PARAM_UNSPEC,
+	CRYPTO_COMP_PARAM_LEVEL,
+	CRYPTO_COMP_PARAM_LAST,
+};
+
 /**
  * struct acomp_req - asynchronous (de)compression request
  *
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 4cdc98a64418..89f742190091 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -104,6 +104,12 @@ struct acomp_walk {
 	int flags;
 };
 
+struct crypto_acomp_params {
+	int level;
+	unsigned int dict_sz;
+	void *dict;
+};
+
 /*
  * Transform internal helpers.
  */
@@ -244,4 +250,8 @@ static inline struct acomp_req *acomp_fbreq_on_stack_init(
 	return req;
 }
 
+int crypto_acomp_getparams(struct crypto_acomp_params *params, const u8 *raw,
+			   unsigned int len);
+void crypto_acomp_putparams(struct crypto_acomp_params *params);
+
 #endif
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 10/16] crypto: acomp - add NUMA-aware stream allocation
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (9 preceding siblings ...)
  2025-11-28 19:04 ` [RFC PATCH 09/16] crypto: acomp - Add comp_params helpers Giovanni Cabiddu
@ 2025-11-28 19:04 ` Giovanni Cabiddu
  2025-11-28 19:04 ` [RFC PATCH 11/16] crypto: deflate - add support for compression levels Giovanni Cabiddu
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:04 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

Add NUMA node awareness to compression stream allocation to improve
performance on multi-socket systems by allocating memory local to the
CPU that will use it.

Add an `int node` parameter to alloc_ctx() in the structure
crypto_acomp_stream and update crypto_acomp_alloc_streams() and
acomp_stream_workfn() to pass the node ID to the allocators.

Update all compression implementations to accept and use the new node
parameter.

Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 crypto/842.c                        |  4 ++--
 crypto/acompress.c                  | 11 ++++++++---
 crypto/deflate.c                    |  4 ++--
 crypto/lz4.c                        |  4 ++--
 crypto/lz4hc.c                      |  4 ++--
 crypto/lzo-rle.c                    |  4 ++--
 crypto/lzo.c                        |  4 ++--
 crypto/zstd.c                       |  4 ++--
 include/crypto/internal/acompress.h |  2 +-
 9 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/crypto/842.c b/crypto/842.c
index 4007e87bed80..b2d786efcd99 100644
--- a/crypto/842.c
+++ b/crypto/842.c
@@ -23,11 +23,11 @@
 #include <linux/module.h>
 #include <linux/sw842.h>
 
-static void *crypto842_alloc_ctx(void)
+static void *crypto842_alloc_ctx(int node)
 {
 	void *ctx;
 
-	ctx = kmalloc(SW842_MEM_COMPRESS, GFP_KERNEL);
+	ctx = kmalloc_node(SW842_MEM_COMPRESS, GFP_KERNEL, node);
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/crypto/acompress.c b/crypto/acompress.c
index a9d77056bd43..394ce1a266e7 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -438,12 +438,14 @@ static void acomp_stream_workfn(struct work_struct *work)
 	for_each_cpu(cpu, &s->stream_want) {
 		struct crypto_acomp_stream *ps;
 		void *ctx;
+		int node;
 
 		ps = per_cpu_ptr(streams, cpu);
 		if (ps->ctx)
 			continue;
 
-		ctx = s->alloc_ctx();
+		node = cpu_to_node(cpu);
+		ctx = s->alloc_ctx(node);
 		if (IS_ERR(ctx))
 			break;
 
@@ -487,6 +489,7 @@ int crypto_acomp_alloc_streams(struct crypto_acomp_streams *s)
 	struct crypto_acomp_stream *ps;
 	unsigned int i;
 	void *ctx;
+	int node;
 
 	if (s->streams)
 		return 0;
@@ -495,13 +498,15 @@ int crypto_acomp_alloc_streams(struct crypto_acomp_streams *s)
 	if (!streams)
 		return -ENOMEM;
 
-	ctx = s->alloc_ctx();
+	i = cpumask_first(cpu_possible_mask);
+	node = cpu_to_node(i);
+
+	ctx = s->alloc_ctx(node);
 	if (IS_ERR(ctx)) {
 		free_percpu(streams);
 		return PTR_ERR(ctx);
 	}
 
-	i = cpumask_first(cpu_possible_mask);
 	ps = per_cpu_ptr(streams, i);
 	ps->ctx = ctx;
 
diff --git a/crypto/deflate.c b/crypto/deflate.c
index 26b4617f6196..d75c2951dfa9 100644
--- a/crypto/deflate.c
+++ b/crypto/deflate.c
@@ -33,14 +33,14 @@ struct deflate_stream {
 
 static DEFINE_MUTEX(deflate_stream_lock);
 
-static void *deflate_alloc_stream(void)
+static void *deflate_alloc_stream(int node)
 {
 	size_t size = max(zlib_inflate_workspacesize(),
 			  zlib_deflate_workspacesize(MAX_WBITS,
 						     DEFLATE_DEF_MEMLEVEL));
 	struct deflate_stream *ctx;
 
-	ctx = kvmalloc(struct_size(ctx, workspace, size), GFP_KERNEL);
+	ctx = kvmalloc_node(struct_size(ctx, workspace, size), GFP_KERNEL, node);
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/crypto/lz4.c b/crypto/lz4.c
index 57b713516aef..ce125ecf889d 100644
--- a/crypto/lz4.c
+++ b/crypto/lz4.c
@@ -12,11 +12,11 @@
 #include <linux/lz4.h>
 #include <crypto/internal/scompress.h>
 
-static void *lz4_alloc_ctx(void)
+static void *lz4_alloc_ctx(int node)
 {
 	void *ctx;
 
-	ctx = vmalloc(LZ4_MEM_COMPRESS);
+	ctx = vmalloc_node(LZ4_MEM_COMPRESS, node);
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/crypto/lz4hc.c b/crypto/lz4hc.c
index bb84f8a68cb5..c815ce2e0b67 100644
--- a/crypto/lz4hc.c
+++ b/crypto/lz4hc.c
@@ -10,11 +10,11 @@
 #include <linux/vmalloc.h>
 #include <linux/lz4.h>
 
-static void *lz4hc_alloc_ctx(void)
+static void *lz4hc_alloc_ctx(int node)
 {
 	void *ctx;
 
-	ctx = vmalloc(LZ4HC_MEM_COMPRESS);
+	ctx = vmalloc_node(LZ4HC_MEM_COMPRESS, node);
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/crypto/lzo-rle.c b/crypto/lzo-rle.c
index 794e7ec49536..13144cc9c501 100644
--- a/crypto/lzo-rle.c
+++ b/crypto/lzo-rle.c
@@ -9,11 +9,11 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 
-static void *lzorle_alloc_ctx(void)
+static void *lzorle_alloc_ctx(int node)
 {
 	void *ctx;
 
-	ctx = kvmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL);
+	ctx = kvmalloc_node(LZO1X_MEM_COMPRESS, GFP_KERNEL, node);
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/crypto/lzo.c b/crypto/lzo.c
index d43242b24b4e..ffae9a09599d 100644
--- a/crypto/lzo.c
+++ b/crypto/lzo.c
@@ -9,11 +9,11 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 
-static void *lzo_alloc_ctx(void)
+static void *lzo_alloc_ctx(int node)
 {
 	void *ctx;
 
-	ctx = kvmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL);
+	ctx = kvmalloc_node(LZO1X_MEM_COMPRESS, GFP_KERNEL, node);
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/crypto/zstd.c b/crypto/zstd.c
index cbbd0413751a..fd240130ad4c 100644
--- a/crypto/zstd.c
+++ b/crypto/zstd.c
@@ -31,7 +31,7 @@ struct zstd_ctx {
 
 static DEFINE_MUTEX(zstd_stream_lock);
 
-static void *zstd_alloc_stream(void)
+static void *zstd_alloc_stream(int node)
 {
 	zstd_parameters params;
 	struct zstd_ctx *ctx;
@@ -44,7 +44,7 @@ static void *zstd_alloc_stream(void)
 	if (!wksp_size)
 		return ERR_PTR(-EINVAL);
 
-	ctx = kvmalloc(struct_size(ctx, wksp, wksp_size), GFP_KERNEL);
+	ctx = kvmalloc_node(struct_size(ctx, wksp, wksp_size), GFP_KERNEL, node);
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 89f742190091..11f11a78360d 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -65,7 +65,7 @@ struct crypto_acomp_stream {
 
 struct crypto_acomp_streams {
 	/* These must come first because of struct scomp_alg. */
-	void *(*alloc_ctx)(void);
+	void *(*alloc_ctx)(int node);
 	void (*free_ctx)(void *);
 
 	struct crypto_acomp_stream __percpu *streams;
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 11/16] crypto: deflate - add support for compression levels
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (10 preceding siblings ...)
  2025-11-28 19:04 ` [RFC PATCH 10/16] crypto: acomp - add NUMA-aware stream allocation Giovanni Cabiddu
@ 2025-11-28 19:04 ` Giovanni Cabiddu
  2025-11-28 19:05 ` [RFC PATCH 12/16] crypto: zstd " Giovanni Cabiddu
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:04 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

Add support for configurable compression levels in the deflate and
zlib-deflate algorithms by implementing the setparam() API. This API
allows the acomp interface to adjust compression parameters, providing
users with finer control over compression behavior.

Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 crypto/deflate.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/crypto/deflate.c b/crypto/deflate.c
index d75c2951dfa9..114bd1caddf6 100644
--- a/crypto/deflate.c
+++ b/crypto/deflate.c
@@ -120,6 +120,7 @@ enum algo {
 
 static int _deflate_compress(struct acomp_req *req, enum algo algo)
 {
+	struct crypto_acomp_params *p = acomp_tfm_ctx(crypto_acomp_reqtfm(req));
 	struct crypto_acomp_stream *s;
 	struct deflate_stream *ds;
 	int window_bits;
@@ -139,7 +140,7 @@ static int _deflate_compress(struct acomp_req *req, enum algo algo)
 	s = crypto_acomp_lock_stream_bh(&deflate_streams);
 	ds = s->ctx;
 
-	err = zlib_deflateInit2(&ds->stream, DEFLATE_DEF_LEVEL, Z_DEFLATED,
+	err = zlib_deflateInit2(&ds->stream, p->level, Z_DEFLATED,
 				window_bits, DEFLATE_DEF_MEMLEVEL,
 				Z_DEFAULT_STRATEGY);
 	if (err != Z_OK) {
@@ -267,8 +268,11 @@ static int zlib_deflate_decompress(struct acomp_req *req)
 
 static int deflate_init(struct crypto_acomp *tfm)
 {
+	struct crypto_acomp_params *p = acomp_tfm_ctx(tfm);
 	int ret;
 
+	p->level = DEFLATE_DEF_LEVEL;
+
 	mutex_lock(&deflate_stream_lock);
 	ret = crypto_acomp_alloc_streams(&deflate_streams);
 	mutex_unlock(&deflate_stream_lock);
@@ -276,21 +280,55 @@ static int deflate_init(struct crypto_acomp *tfm)
 	return ret;
 }
 
+static int deflate_setparam(struct crypto_acomp *tfm, const u8 *param,
+			    unsigned int len)
+{
+	struct crypto_acomp_params *p = acomp_tfm_ctx(tfm);
+	int ret;
+
+	ret = crypto_acomp_getparams(p, param, len);
+	if (ret)
+		return ret;
+
+	if (p->level > Z_BEST_COMPRESSION || p->level < Z_DEFAULT_COMPRESSION) {
+		p->level = DEFLATE_DEF_LEVEL;
+		return -EINVAL;
+	}
+
+	if (p->level == CRYPTO_COMP_NO_LEVEL)
+		p->level = DEFLATE_DEF_LEVEL;
+
+	return 0;
+}
+
+static void deflate_exit(struct crypto_acomp *tfm)
+{
+	struct crypto_acomp_params *p = acomp_tfm_ctx(tfm);
+
+	crypto_acomp_putparams(p);
+}
+
 static struct acomp_alg acomps[] = { {
 	.compress		= deflate_compress,
 	.decompress		= deflate_decompress,
+	.setparam		= deflate_setparam,
 	.init			= deflate_init,
+	.exit			= deflate_exit,
 	.base.cra_name		= "deflate",
 	.base.cra_driver_name	= "deflate-generic",
 	.base.cra_flags		= CRYPTO_ALG_REQ_VIRT,
+	.base.cra_ctxsize	= sizeof(struct crypto_acomp_params),
 	.base.cra_module	= THIS_MODULE,
 }, {
 	.compress		= zlib_deflate_compress,
 	.decompress		= zlib_deflate_decompress,
+	.setparam		= deflate_setparam,
 	.init			= deflate_init,
+	.exit			= deflate_exit,
 	.base.cra_name		= "zlib-deflate",
 	.base.cra_driver_name	= "zlib-deflate-generic",
 	.base.cra_flags		= CRYPTO_ALG_REQ_VIRT,
+	.base.cra_ctxsize	= sizeof(struct crypto_acomp_params),
 	.base.cra_module	= THIS_MODULE,
 } };
 
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 12/16] crypto: zstd - add support for compression levels
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (11 preceding siblings ...)
  2025-11-28 19:04 ` [RFC PATCH 11/16] crypto: deflate - add support for compression levels Giovanni Cabiddu
@ 2025-11-28 19:05 ` Giovanni Cabiddu
  2025-11-28 19:05 ` [RFC PATCH 13/16] crypto: qat - increase number of preallocated sgl descriptors Giovanni Cabiddu
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:05 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Suman Kumar Chakraborty

From: Suman Kumar Chakraborty <suman.kumar.chakraborty@intel.com>

Add support for configurable compression levels in the zstd algorithms
by implementing the setparam() API. This API allows the acomp interface
to adjust compression parameters, providing users with finer control
over compression behavior.

The per-stream workspace size has been increased to accommodate the
maximum compression level.
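
The sizing strategy can be sketched generically (plain Python with a made-up workspace_bound(); real sizes come from zstd_cstream_workspace_bound()): once the level can change at runtime via setparam(), each pre-allocated stream workspace must cover the most demanding level up front:

```python
# Hypothetical per-level workspace requirement, illustrative numbers only.
def workspace_bound(level: int) -> int:
    return 1 << (16 + level // 4)  # grows with level

MIN_LEVEL, MAX_LEVEL = 1, 22  # zstd's level range

# Allocate once for the worst case so any later setparam(level) fits
# without reallocating the per-CPU stream.
wksp_size = max(workspace_bound(l) for l in range(MIN_LEVEL, MAX_LEVEL + 1))
assert all(workspace_bound(l) <= wksp_size
           for l in range(MIN_LEVEL, MAX_LEVEL + 1))
```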

Signed-off-by: Suman Kumar Chakraborty <suman.kumar.chakraborty@intel.com>
---
 crypto/zstd.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/crypto/zstd.c b/crypto/zstd.c
index fd240130ad4c..ed59a3ce25ba 100644
--- a/crypto/zstd.c
+++ b/crypto/zstd.c
@@ -37,7 +37,7 @@ static void *zstd_alloc_stream(int node)
 	struct zstd_ctx *ctx;
 	size_t wksp_size;
 
-	params = zstd_get_params(ZSTD_DEF_LEVEL, ZSTD_MAX_SIZE);
+	params = zstd_get_params(zstd_max_clevel(), ZSTD_MAX_SIZE);
 
 	wksp_size = max(zstd_cstream_workspace_bound(&params.cParams),
 			zstd_dstream_workspace_bound(ZSTD_MAX_SIZE));
@@ -66,8 +66,11 @@ static struct crypto_acomp_streams zstd_streams = {
 
 static int zstd_init(struct crypto_acomp *acomp_tfm)
 {
+	struct crypto_acomp_params *p = acomp_tfm_ctx(acomp_tfm);
 	int ret = 0;
 
+	p->level = ZSTD_DEF_LEVEL;
+
 	mutex_lock(&zstd_stream_lock);
 	ret = crypto_acomp_alloc_streams(&zstd_streams);
 	mutex_unlock(&zstd_stream_lock);
@@ -96,6 +99,7 @@ static int zstd_compress_one(struct acomp_req *req, struct zstd_ctx *ctx,
 
 static int zstd_compress(struct acomp_req *req)
 {
+	struct crypto_acomp_params *p = acomp_tfm_ctx(crypto_acomp_reqtfm(req));
 	struct crypto_acomp_stream *s;
 	unsigned int pos, scur, dcur;
 	unsigned int total_out = 0;
@@ -111,6 +115,8 @@ static int zstd_compress(struct acomp_req *req)
 	s = crypto_acomp_lock_stream_bh(&zstd_streams);
 	ctx = s->ctx;
 
+	ctx->params = zstd_get_params(p->level, ZSTD_MAX_SIZE);
+
 	ret = acomp_walk_virt(&walk, req, true);
 	if (ret)
 		goto out;
@@ -284,14 +290,45 @@ static int zstd_decompress(struct acomp_req *req)
 	return ret;
 }
 
+static int zstd_setparam(struct crypto_acomp *tfm, const u8 *param,
+			 unsigned int len)
+{
+	struct crypto_acomp_params *p = acomp_tfm_ctx(tfm);
+	int ret;
+
+	ret = crypto_acomp_getparams(p, param, len);
+	if (ret)
+		return ret;
+
+	if (p->level > zstd_max_clevel() || p->level < zstd_min_clevel()) {
+		p->level = ZSTD_DEF_LEVEL;
+		return -EINVAL;
+	}
+
+	if (p->level == CRYPTO_COMP_NO_LEVEL)
+		p->level = ZSTD_DEF_LEVEL;
+
+	return 0;
+}
+
+static void zstd_exit(struct crypto_acomp *tfm)
+{
+	struct crypto_acomp_params *p = acomp_tfm_ctx(tfm);
+
+	crypto_acomp_putparams(p);
+}
+
 static struct acomp_alg zstd_acomp = {
 	.base = {
 		.cra_name = "zstd",
 		.cra_driver_name = "zstd-generic",
 		.cra_flags = CRYPTO_ALG_REQ_VIRT,
+		.cra_ctxsize = sizeof(struct crypto_acomp_params),
 		.cra_module = THIS_MODULE,
 	},
 	.init = zstd_init,
+	.exit = zstd_exit,
+	.setparam = zstd_setparam,
 	.compress = zstd_compress,
 	.decompress = zstd_decompress,
 };
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 13/16] crypto: qat - increase number of preallocated sgl descriptors
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (12 preceding siblings ...)
  2025-11-28 19:05 ` [RFC PATCH 12/16] crypto: zstd " Giovanni Cabiddu
@ 2025-11-28 19:05 ` Giovanni Cabiddu
  2025-11-28 19:05 ` [RFC PATCH 14/16] crypto: qat - add support for zstd Giovanni Cabiddu
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:05 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

Increase the number of pre-allocated descriptors from 4 to 32 to avoid
allocations in the worst case for the btrfs use case, i.e. a 128 KB
extent sent as individual 4 KB pages.

This increases the size of a request from 752 to 1648 bytes.
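
The worst-case descriptor count follows directly from the extent and page sizes (a quick sanity check in Python):

```python
EXTENT_SIZE = 128 * 1024  # largest btrfs compressed extent
PAGE_SIZE = 4 * 1024      # one sgl descriptor per page in the worst case

# 32 descriptors cover a maximally fragmented extent without a
# runtime allocation, matching the new QAT_MAX_BUFF_DESC value.
assert EXTENT_SIZE // PAGE_SIZE == 32
```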

Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 drivers/crypto/intel/qat/qat_common/qat_bl.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/qat/qat_common/qat_bl.h b/drivers/crypto/intel/qat/qat_common/qat_bl.h
index 2827d5055d3c..b3c2167a8d3b 100644
--- a/drivers/crypto/intel/qat/qat_common/qat_bl.h
+++ b/drivers/crypto/intel/qat/qat_common/qat_bl.h
@@ -6,7 +6,7 @@
 #include <linux/scatterlist.h>
 #include <linux/types.h>
 
-#define QAT_MAX_BUFF_DESC	4
+#define QAT_MAX_BUFF_DESC	32
 
 struct qat_alg_buf {
 	u32 len;
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 14/16] crypto: qat - add support for zstd
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (13 preceding siblings ...)
  2025-11-28 19:05 ` [RFC PATCH 13/16] crypto: qat - increase number of preallocated sgl descriptors Giovanni Cabiddu
@ 2025-11-28 19:05 ` Giovanni Cabiddu
  2025-11-28 19:05 ` [RFC PATCH 15/16] crypto: qat - add support for compression levels Giovanni Cabiddu
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:05 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu,
	Suman Kumar Chakraborty

Add support for the ZSTD algorithm for both QAT GEN4 and QAT GEN6 via
the acomp API.

For GEN4, data is compressed in hardware using the LZ4s algorithm (a
QAT-specific variant of LZ4). The output is then parsed to generate ZSTD
sequences, and the ZSTD library is invoked to produce the final ZSTD
stream using the zstd_compress_sequences_and_literals() API. The maximum
compressed size is limited to 512 KB due to the use of per-CPU scratch
buffers. On GEN4, only compression is supported in hardware;
decompression falls back to software.
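
The sequences-plus-literals model that zstd_compress_sequences_and_literals() consumes can be sketched in plain Python (hypothetical field layout; the real ZSTD_Sequence struct differs): each sequence pairs a run of literal bytes with a back-reference match, and replaying them reconstructs the input.

```python
# Each sequence: copy `lit_len` bytes from the literals buffer, then copy
# `match_len` bytes starting `offset` bytes back in the output so far.
def rebuild(literals: bytes, sequences):
    out = bytearray()
    pos = 0
    for lit_len, match_len, offset in sequences:
        out += literals[pos:pos + lit_len]
        pos += lit_len
        for _ in range(match_len):          # byte-wise copy handles overlaps
            out.append(out[-offset])
    out += literals[pos:]                   # trailing literals, no match
    return bytes(out)

# "abcabcabc": 3 literal bytes, then a 6-byte match 3 bytes back.
assert rebuild(b"abc", [(3, 6, 3)]) == b"abcabcabc"
```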

For GEN6, both compression and decompression are offloaded to the
accelerator, which natively supports the ZSTD algorithm. However, since
GEN6 is limited to a history size of 64 KB, decompression of frames
compressed with a larger history falls back to software.
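
A hedged sketch of that fallback decision (hypothetical helper name; the real code would derive the window size from the zstd frame header):

```python
HW_MAX_WINDOW = 64 * 1024  # GEN6 history buffer limit

def use_hardware(frame_window_size: int) -> bool:
    # Frames whose window exceeds the accelerator's history buffer
    # must be decompressed in software instead.
    return frame_window_size <= HW_MAX_WINDOW

assert use_hardware(32 * 1024)
assert not use_hardware(128 * 1024)
```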

Since GEN2 devices do not support ZSTD or LZ4s, add a mechanism that
prevents selecting GEN2 compression instances for ZSTD or LZ4s when a
GEN2 plug-in card is present on a system with an embedded GEN4 or GEN6
device.

In addition, modify the algorithm registration logic so that the correct
implementation is registered, i.e. the LZ4s-based one for GEN4 or native
ZSTD for GEN6.

Co-developed-by: Suman Kumar Chakraborty <suman.kumar.chakraborty@intel.com>
Signed-off-by: Suman Kumar Chakraborty <suman.kumar.chakraborty@intel.com>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 drivers/crypto/intel/qat/Kconfig              |   1 +
 .../intel/qat/qat_420xx/adf_420xx_hw_data.c   |   1 +
 .../intel/qat/qat_4xxx/adf_4xxx_hw_data.c     |   1 +
 .../intel/qat/qat_6xxx/adf_6xxx_hw_data.c     |   7 +
 drivers/crypto/intel/qat/qat_common/Makefile  |   1 +
 .../intel/qat/qat_common/adf_accel_devices.h  |   6 +
 .../intel/qat/qat_common/adf_common_drv.h     |   6 +-
 .../intel/qat/qat_common/adf_gen4_hw_data.c   |  19 +-
 .../crypto/intel/qat/qat_common/adf_init.c    |   6 +-
 .../crypto/intel/qat/qat_common/icp_qat_fw.h  |   7 +
 .../intel/qat/qat_common/icp_qat_fw_comp.h    |   2 +
 .../crypto/intel/qat/qat_common/icp_qat_hw.h  |   3 +-
 .../intel/qat/qat_common/qat_comp_algs.c      | 516 +++++++++++++++++-
 .../intel/qat/qat_common/qat_comp_req.h       |  11 +
 .../qat/qat_common/qat_comp_zstd_utils.c      | 120 ++++
 .../qat/qat_common/qat_comp_zstd_utils.h      |  13 +
 .../intel/qat/qat_common/qat_compression.c    |  23 +-
 17 files changed, 705 insertions(+), 38 deletions(-)
 create mode 100644 drivers/crypto/intel/qat/qat_common/qat_comp_zstd_utils.c
 create mode 100644 drivers/crypto/intel/qat/qat_common/qat_comp_zstd_utils.h

diff --git a/drivers/crypto/intel/qat/Kconfig b/drivers/crypto/intel/qat/Kconfig
index 4b4861460dd4..9a48bc7c3118 100644
--- a/drivers/crypto/intel/qat/Kconfig
+++ b/drivers/crypto/intel/qat/Kconfig
@@ -11,6 +11,7 @@ config CRYPTO_DEV_QAT
 	select CRYPTO_LIB_SHA1
 	select CRYPTO_LIB_SHA256
 	select CRYPTO_LIB_SHA512
+	select CRYPTO_ZSTD
 	select FW_LOADER
 	select CRC8
 
diff --git a/drivers/crypto/intel/qat/qat_420xx/adf_420xx_hw_data.c b/drivers/crypto/intel/qat/qat_420xx/adf_420xx_hw_data.c
index 53fa91d577ed..37a282bdb2e2 100644
--- a/drivers/crypto/intel/qat/qat_420xx/adf_420xx_hw_data.c
+++ b/drivers/crypto/intel/qat/qat_420xx/adf_420xx_hw_data.c
@@ -469,6 +469,7 @@ void adf_init_hw_data_420xx(struct adf_hw_device_data *hw_data, u32 dev_id)
 	hw_data->clock_frequency = ADF_420XX_AE_FREQ;
 	hw_data->services_supported = adf_gen4_services_supported;
 	hw_data->get_svc_slice_cnt = adf_gen4_get_svc_slice_cnt;
+	hw_data->accel_capabilities_ext_mask = ADF_ACCEL_CAPABILITIES_EXT_ZSTD_LZ4S;
 
 	adf_gen4_set_err_mask(&hw_data->dev_err_mask);
 	adf_gen4_init_hw_csr_ops(&hw_data->csr_ops);
diff --git a/drivers/crypto/intel/qat/qat_4xxx/adf_4xxx_hw_data.c b/drivers/crypto/intel/qat/qat_4xxx/adf_4xxx_hw_data.c
index 740f68a36ac5..9b36520812ba 100644
--- a/drivers/crypto/intel/qat/qat_4xxx/adf_4xxx_hw_data.c
+++ b/drivers/crypto/intel/qat/qat_4xxx/adf_4xxx_hw_data.c
@@ -463,6 +463,7 @@ void adf_init_hw_data_4xxx(struct adf_hw_device_data *hw_data, u32 dev_id)
 	hw_data->clock_frequency = ADF_4XXX_AE_FREQ;
 	hw_data->services_supported = adf_gen4_services_supported;
 	hw_data->get_svc_slice_cnt = adf_gen4_get_svc_slice_cnt;
+	hw_data->accel_capabilities_ext_mask = ADF_ACCEL_CAPABILITIES_EXT_ZSTD_LZ4S;
 
 	adf_gen4_set_err_mask(&hw_data->dev_err_mask);
 	adf_gen4_init_hw_csr_ops(&hw_data->csr_ops);
diff --git a/drivers/crypto/intel/qat/qat_6xxx/adf_6xxx_hw_data.c b/drivers/crypto/intel/qat/qat_6xxx/adf_6xxx_hw_data.c
index bed88d3ce8ca..b04d6b947da2 100644
--- a/drivers/crypto/intel/qat/qat_6xxx/adf_6xxx_hw_data.c
+++ b/drivers/crypto/intel/qat/qat_6xxx/adf_6xxx_hw_data.c
@@ -471,6 +471,9 @@ static int build_comp_block(void *ctx, enum adf_dc_algo algo)
 	case QAT_DEFLATE:
 		header->service_cmd_id = ICP_QAT_FW_COMP_CMD_DYNAMIC;
 	break;
+	case QAT_ZSTD:
+		header->service_cmd_id = ICP_QAT_FW_COMP_CMD_ZSTD_COMPRESS;
+	break;
 	default:
 		return -EINVAL;
 	}
@@ -494,6 +497,9 @@ static int build_decomp_block(void *ctx, enum adf_dc_algo algo)
 	case QAT_DEFLATE:
 		header->service_cmd_id = ICP_QAT_FW_COMP_CMD_DECOMPRESS;
 	break;
+	case QAT_ZSTD:
+		header->service_cmd_id = ICP_QAT_FW_COMP_CMD_ZSTD_DECOMPRESS;
+	break;
 	default:
 		return -EINVAL;
 	}
@@ -933,6 +939,7 @@ void adf_init_hw_data_6xxx(struct adf_hw_device_data *hw_data)
 	hw_data->num_rps = ADF_GEN6_ETR_MAX_BANKS;
 	hw_data->clock_frequency = ADF_6XXX_AE_FREQ;
 	hw_data->get_svc_slice_cnt = adf_gen6_get_svc_slice_cnt;
+	hw_data->accel_capabilities_ext_mask = ADF_ACCEL_CAPABILITIES_EXT_ZSTD;
 
 	adf_gen6_init_hw_csr_ops(&hw_data->csr_ops);
 	adf_gen6_init_pf_pfvf_ops(&hw_data->pfvf_ops);
diff --git a/drivers/crypto/intel/qat/qat_common/Makefile b/drivers/crypto/intel/qat/qat_common/Makefile
index 89845754841b..b56781c6a764 100644
--- a/drivers/crypto/intel/qat/qat_common/Makefile
+++ b/drivers/crypto/intel/qat/qat_common/Makefile
@@ -39,6 +39,7 @@ intel_qat-y := adf_accel_engine.o \
 	qat_bl.o \
 	qat_comp_algs.o \
 	qat_compression.o \
+	qat_comp_zstd_utils.o \
 	qat_crypto.o \
 	qat_hal.o \
 	qat_mig_dev.o \
diff --git a/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h b/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h
index 9fe3239f0114..aea24173efe4 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h
@@ -58,6 +58,11 @@ enum adf_accel_capabilities {
 	ADF_ACCEL_CAPABILITIES_RANDOM_NUMBER = 128
 };
 
+enum adf_accel_capabilities_ext {
+	ADF_ACCEL_CAPABILITIES_EXT_ZSTD_LZ4S = BIT(0),
+	ADF_ACCEL_CAPABILITIES_EXT_ZSTD = BIT(1),
+};
+
 enum adf_fuses {
 	ADF_FUSECTL0,
 	ADF_FUSECTL1,
@@ -334,6 +339,7 @@ struct adf_hw_device_data {
 	u32 fuses[ADF_MAX_FUSES];
 	u32 straps;
 	u32 accel_capabilities_mask;
+	u32 accel_capabilities_ext_mask;
 	u32 extended_dc_capabilities;
 	u16 fw_capabilities;
 	u32 clock_frequency;
diff --git a/drivers/crypto/intel/qat/qat_common/adf_common_drv.h b/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
index 6cf3a95489e8..7b8b295ac459 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_common_drv.h
@@ -111,12 +111,12 @@ void qat_algs_unregister(void);
 int qat_asym_algs_register(void);
 void qat_asym_algs_unregister(void);
 
-struct qat_compression_instance *qat_compression_get_instance_node(int node);
+struct qat_compression_instance *qat_compression_get_instance_node(int node, int alg);
 void qat_compression_put_instance(struct qat_compression_instance *inst);
 int qat_compression_register(void);
 int qat_compression_unregister(void);
-int qat_comp_algs_register(void);
-void qat_comp_algs_unregister(void);
+int qat_comp_algs_register(u32 caps);
+void qat_comp_algs_unregister(u32 caps);
 void qat_comp_alg_callback(void *resp);
 
 int adf_isr_resource_alloc(struct adf_accel_dev *accel_dev);
diff --git a/drivers/crypto/intel/qat/qat_common/adf_gen4_hw_data.c b/drivers/crypto/intel/qat/qat_common/adf_gen4_hw_data.c
index 349fdb323763..faeffe941591 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_gen4_hw_data.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_gen4_hw_data.c
@@ -504,14 +504,21 @@ static int adf_gen4_build_comp_block(void *ctx, enum adf_dc_algo algo)
 	switch (algo) {
 	case QAT_DEFLATE:
 		header->service_cmd_id = ICP_QAT_FW_COMP_CMD_DYNAMIC;
+		hw_comp_lower_csr.algo = ICP_QAT_HW_COMP_20_HW_COMP_FORMAT_ILZ77;
+		hw_comp_lower_csr.lllbd = ICP_QAT_HW_COMP_20_LLLBD_CTRL_LLLBD_ENABLED;
+		hw_comp_lower_csr.skip_ctrl = ICP_QAT_HW_COMP_20_BYTE_SKIP_3BYTE_LITERAL;
+		break;
+	case QAT_LZ4S:
+		header->service_cmd_id = ICP_QAT_FW_COMP_20_CMD_LZ4S_COMPRESS;
+		hw_comp_lower_csr.algo = ICP_QAT_HW_COMP_20_HW_COMP_FORMAT_LZ4S;
+		hw_comp_lower_csr.lllbd = ICP_QAT_HW_COMP_20_LLLBD_CTRL_LLLBD_DISABLED;
+		hw_comp_lower_csr.skip_ctrl = ICP_QAT_HW_COMP_20_BYTE_SKIP_3BYTE_TOKEN;
+		hw_comp_lower_csr.abd = ICP_QAT_HW_COMP_20_ABD_ABD_DISABLED;
 		break;
 	default:
 		return -EINVAL;
 	}
 
-	hw_comp_lower_csr.skip_ctrl = ICP_QAT_HW_COMP_20_BYTE_SKIP_3BYTE_LITERAL;
-	hw_comp_lower_csr.algo = ICP_QAT_HW_COMP_20_HW_COMP_FORMAT_ILZ77;
-	hw_comp_lower_csr.lllbd = ICP_QAT_HW_COMP_20_LLLBD_CTRL_LLLBD_ENABLED;
 	hw_comp_lower_csr.sd = ICP_QAT_HW_COMP_20_SEARCH_DEPTH_LEVEL_1;
 	hw_comp_lower_csr.hash_update = ICP_QAT_HW_COMP_20_SKIP_HASH_UPDATE_DONT_ALLOW;
 	hw_comp_lower_csr.edmm = ICP_QAT_HW_COMP_20_EXTENDED_DELAY_MATCH_MODE_EDMM_ENABLED;
@@ -538,12 +545,16 @@ static int adf_gen4_build_decomp_block(void *ctx, enum adf_dc_algo algo)
 	switch (algo) {
 	case QAT_DEFLATE:
 		header->service_cmd_id = ICP_QAT_FW_COMP_CMD_DECOMPRESS;
+		hw_decomp_lower_csr.algo = ICP_QAT_HW_DECOMP_20_HW_DECOMP_FORMAT_DEFLATE;
+		break;
+	case QAT_LZ4S:
+		header->service_cmd_id = ICP_QAT_FW_COMP_20_CMD_LZ4S_DECOMPRESS;
+		hw_decomp_lower_csr.algo = ICP_QAT_HW_DECOMP_20_HW_DECOMP_FORMAT_LZ4S;
 		break;
 	default:
 		return -EINVAL;
 	}
 
-	hw_decomp_lower_csr.algo = ICP_QAT_HW_DECOMP_20_HW_DECOMP_FORMAT_DEFLATE;
 	lower_val = ICP_QAT_FW_DECOMP_20_BUILD_CONFIG_LOWER(hw_decomp_lower_csr);
 
 	cd_pars->u.sl.comp_slice_cfg_word[0] = lower_val;
diff --git a/drivers/crypto/intel/qat/qat_common/adf_init.c b/drivers/crypto/intel/qat/qat_common/adf_init.c
index 46491048e0bb..8da96ab4f62e 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_init.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_init.c
@@ -179,6 +179,7 @@ static int adf_dev_start(struct adf_accel_dev *accel_dev)
 {
 	struct adf_hw_device_data *hw_data = accel_dev->hw_device;
 	struct service_hndl *service;
+	u32 caps;
 	int ret;
 
 	set_bit(ADF_STATUS_STARTING, &accel_dev->status);
@@ -252,7 +253,8 @@ static int adf_dev_start(struct adf_accel_dev *accel_dev)
 	}
 	set_bit(ADF_STATUS_CRYPTO_ALGS_REGISTERED, &accel_dev->status);
 
-	if (!list_empty(&accel_dev->compression_list) && qat_comp_algs_register()) {
+	caps = hw_data->accel_capabilities_ext_mask;
+	if (!list_empty(&accel_dev->compression_list) && qat_comp_algs_register(caps)) {
 		dev_err(&GET_DEV(accel_dev),
 			"Failed to register compression algs\n");
 		set_bit(ADF_STATUS_STARTING, &accel_dev->status);
@@ -305,7 +307,7 @@ static void adf_dev_stop(struct adf_accel_dev *accel_dev)
 
 	if (!list_empty(&accel_dev->compression_list) &&
 	    test_bit(ADF_STATUS_COMP_ALGS_REGISTERED, &accel_dev->status))
-		qat_comp_algs_unregister();
+		qat_comp_algs_unregister(hw_data->accel_capabilities_ext_mask);
 	clear_bit(ADF_STATUS_COMP_ALGS_REGISTERED, &accel_dev->status);
 
 	list_for_each_entry(service, &service_table, list) {
diff --git a/drivers/crypto/intel/qat/qat_common/icp_qat_fw.h b/drivers/crypto/intel/qat/qat_common/icp_qat_fw.h
index c141160421e1..2fea30a78340 100644
--- a/drivers/crypto/intel/qat/qat_common/icp_qat_fw.h
+++ b/drivers/crypto/intel/qat/qat_common/icp_qat_fw.h
@@ -151,6 +151,13 @@ struct icp_qat_fw_comn_resp {
 	ICP_QAT_FW_COMN_CNV_FLAG_BITPOS, \
 	ICP_QAT_FW_COMN_CNV_FLAG_MASK)
 
+#define ICP_QAT_FW_COMN_ST_BLK_FLAG_BITPOS 4
+#define ICP_QAT_FW_COMN_ST_BLK_FLAG_MASK 0x1
+#define ICP_QAT_FW_COMN_HDR_ST_BLK_FLAG_GET(hdr_flags) \
+	QAT_FIELD_GET(hdr_flags, \
+	ICP_QAT_FW_COMN_ST_BLK_FLAG_BITPOS, \
+	ICP_QAT_FW_COMN_ST_BLK_FLAG_MASK)
+
 #define ICP_QAT_FW_COMN_HDR_CNV_FLAG_SET(hdr_t, val) \
 	QAT_FIELD_SET((hdr_t.hdr_flags), (val), \
 	ICP_QAT_FW_COMN_CNV_FLAG_BITPOS, \
diff --git a/drivers/crypto/intel/qat/qat_common/icp_qat_fw_comp.h b/drivers/crypto/intel/qat/qat_common/icp_qat_fw_comp.h
index 81969c515a17..2526053ee630 100644
--- a/drivers/crypto/intel/qat/qat_common/icp_qat_fw_comp.h
+++ b/drivers/crypto/intel/qat/qat_common/icp_qat_fw_comp.h
@@ -8,6 +8,8 @@ enum icp_qat_fw_comp_cmd_id {
 	ICP_QAT_FW_COMP_CMD_STATIC = 0,
 	ICP_QAT_FW_COMP_CMD_DYNAMIC = 1,
 	ICP_QAT_FW_COMP_CMD_DECOMPRESS = 2,
+	ICP_QAT_FW_COMP_CMD_ZSTD_COMPRESS = 10,
+	ICP_QAT_FW_COMP_CMD_ZSTD_DECOMPRESS = 11,
 	ICP_QAT_FW_COMP_CMD_DELIMITER
 };
 
diff --git a/drivers/crypto/intel/qat/qat_common/icp_qat_hw.h b/drivers/crypto/intel/qat/qat_common/icp_qat_hw.h
index b8f1c4ffb8b5..bbb8edcd09e8 100644
--- a/drivers/crypto/intel/qat/qat_common/icp_qat_hw.h
+++ b/drivers/crypto/intel/qat/qat_common/icp_qat_hw.h
@@ -335,7 +335,8 @@ enum icp_qat_hw_compression_delayed_match {
 enum icp_qat_hw_compression_algo {
 	ICP_QAT_HW_COMPRESSION_ALGO_DEFLATE = 0,
 	ICP_QAT_HW_COMPRESSION_ALGO_LZS = 1,
-	ICP_QAT_HW_COMPRESSION_ALGO_DELIMITER = 2
+	ICP_QAT_HW_COMPRESSION_ALGO_ZSTD = 2,
+	ICP_QAT_HW_COMPRESSION_ALGO_DELIMITER
 };
 
 enum icp_qat_hw_compression_depth {
diff --git a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
index 23a1ed4f6b40..0e237c2d7966 100644
--- a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
+++ b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
@@ -6,6 +6,7 @@
 #include <crypto/scatterwalk.h>
 #include <linux/dma-mapping.h>
 #include <linux/workqueue.h>
+#include <linux/zstd.h>
 #include "adf_accel_devices.h"
 #include "adf_common_drv.h"
 #include "adf_dc.h"
@@ -13,18 +14,104 @@
 #include "qat_comp_req.h"
 #include "qat_compression.h"
 #include "qat_algs_send.h"
+#include "qat_comp_zstd_utils.h"
 
-#define QAT_RFC_1950_HDR_SIZE 2
-#define QAT_RFC_1950_FOOTER_SIZE 4
-#define QAT_RFC_1950_CM_DEFLATE 8
-#define QAT_RFC_1950_CM_DEFLATE_CINFO_32K 7
-#define QAT_RFC_1950_CM_MASK 0x0f
-#define QAT_RFC_1950_CM_OFFSET 4
-#define QAT_RFC_1950_DICT_MASK 0x20
-#define QAT_RFC_1950_COMP_HDR 0x785e
+#define QAT_RFC_1950_HDR_SIZE			2
+#define QAT_RFC_1950_FOOTER_SIZE		4
+#define QAT_RFC_1950_CM_DEFLATE			8
+#define QAT_RFC_1950_CM_DEFLATE_CINFO_32K	7
+#define QAT_RFC_1950_CM_MASK			0x0f
+#define QAT_RFC_1950_CM_OFFSET			4
+#define QAT_RFC_1950_DICT_MASK			0x20
+#define QAT_RFC_1950_COMP_HDR			0x785e
+#define QAT_ZSTD_SCRATCH_SIZE			524288
+#define QAT_ZSTD_MAX_BLOCK_SIZE			65536
+#define QAT_MAX_SEQUENCES			(128 * 1024)
+#define QAT_ZSTD_MAX_CONTENT_SIZE		4096
 
 static DEFINE_MUTEX(algs_lock);
-static unsigned int active_devs;
+static unsigned int active_devs_deflate;
+static unsigned int active_devs_lz4s;
+static unsigned int active_devs_zstd;
+
+struct qat_zstd_scratch {
+	size_t		cctx_buffer_size;
+	void		*lz4s;
+	void		*input_data;
+	void		*out_seqs;
+	void		*workspace;
+	ZSTD_CCtx	*ctx;
+};
+
+static void *qat_zstd_alloc_scratch(int node)
+{
+	struct qat_zstd_scratch *scratch;
+	ZSTD_parameters params;
+	size_t cctx_size;
+	ZSTD_CCtx *ctx;
+
+	scratch = kzalloc_node(sizeof(*scratch), GFP_KERNEL, node);
+	if (!scratch)
+		return ERR_PTR(-ENOMEM);
+
+	scratch->lz4s = kvmalloc_node(QAT_ZSTD_SCRATCH_SIZE, GFP_KERNEL, node);
+	if (!scratch->lz4s)
+		goto error;
+
+	scratch->input_data = kvmalloc_node(QAT_ZSTD_SCRATCH_SIZE, GFP_KERNEL, node);
+	if (!scratch->input_data)
+		goto error;
+
+	scratch->out_seqs = kvcalloc_node(QAT_MAX_SEQUENCES, sizeof(ZSTD_Sequence),
+					  GFP_KERNEL, node);
+	if (!scratch->out_seqs)
+		goto error;
+
+	params = zstd_get_params(zstd_max_clevel(), QAT_ZSTD_SCRATCH_SIZE);
+	cctx_size = zstd_cctx_workspace_bound(&params.cParams);
+
+	scratch->workspace = kvmalloc_node(cctx_size, GFP_KERNEL | __GFP_ZERO, node);
+	if (!scratch->workspace)
+		goto error;
+
+	ctx = zstd_init_cctx(scratch->workspace, cctx_size);
+	if (!ctx)
+		goto error;
+
+	scratch->ctx = ctx;
+	scratch->cctx_buffer_size = cctx_size;
+
+	zstd_cctx_set_param(ctx, ZSTD_c_blockDelimiters, ZSTD_sf_explicitBlockDelimiters);
+
+	return scratch;
+
+error:
+	kvfree(scratch->lz4s);
+	kvfree(scratch->input_data);
+	kvfree(scratch->out_seqs);
+	kvfree(scratch->workspace);
+	kfree(scratch);
+	return ERR_PTR(-ENOMEM);
+}
+
+static void qat_zstd_free_scratch(void *ctx)
+{
+	struct qat_zstd_scratch *scratch = ctx;
+
+	if (!scratch)
+		return;
+
+	kvfree(scratch->lz4s);
+	kvfree(scratch->input_data);
+	kvfree(scratch->out_seqs);
+	kvfree(scratch->workspace);
+	kfree(scratch);
+}
+
+static struct crypto_acomp_streams qat_zstd_streams = {
+	.alloc_ctx = qat_zstd_alloc_scratch,
+	.free_ctx = qat_zstd_free_scratch,
+};
 
 enum direction {
 	DECOMPRESSION = 0,
@@ -33,10 +120,18 @@ enum direction {
 
 struct qat_compression_req;
 
+struct qat_callback_params {
+	unsigned int produced;
+	unsigned int dlen;
+	bool plain;
+};
+
 struct qat_compression_ctx {
 	u8 comp_ctx[QAT_COMP_CTX_SIZE];
 	struct qat_compression_instance *inst;
-	int (*qat_comp_callback)(struct qat_compression_req *qat_req, void *resp);
+	int (*qat_comp_callback)(struct qat_compression_req *qat_req, void *resp,
+				 struct qat_callback_params *params);
+	struct crypto_acomp *ftfm;
 };
 
 struct qat_compression_req {
@@ -89,7 +184,7 @@ static int parse_zlib_header(u16 zlib_h)
 }
 
 static int qat_comp_rfc1950_callback(struct qat_compression_req *qat_req,
-				     void *resp)
+				     void *resp, struct qat_callback_params *params)
 {
 	struct acomp_req *areq = qat_req->acompress_req;
 	enum direction dir = qat_req->dir;
@@ -134,6 +229,7 @@ static void qat_comp_generic_callback(struct qat_compression_req *qat_req,
 	struct adf_accel_dev *accel_dev = ctx->inst->accel_dev;
 	struct crypto_acomp *tfm = crypto_acomp_reqtfm(areq);
 	struct qat_compression_instance *inst = ctx->inst;
+	struct qat_callback_params params = { };
 	int consumed, produced;
 	s8 cmp_err, xlt_err;
 	int res = -EBADMSG;
@@ -148,6 +244,10 @@ static void qat_comp_generic_callback(struct qat_compression_req *qat_req,
 	consumed = qat_comp_get_consumed_ctr(resp);
 	produced = qat_comp_get_produced_ctr(resp);
 
+	/* Cache parameters for algorithm specific callback */
+	params.produced = produced;
+	params.dlen = areq->dlen;
+
 	dev_dbg(&GET_DEV(accel_dev),
 		"[%s][%s][%s] slen = %8d dlen = %8d consumed = %8d produced = %8d cmp_err = %3d xlt_err = %3d",
 		crypto_tfm_alg_driver_name(crypto_acomp_tfm(tfm)),
@@ -155,16 +255,20 @@ static void qat_comp_generic_callback(struct qat_compression_req *qat_req,
 		status ? "ERR" : "OK ",
 		areq->slen, areq->dlen, consumed, produced, cmp_err, xlt_err);
 
-	areq->dlen = 0;
+	if (unlikely(status != ICP_QAT_FW_COMN_STATUS_FLAG_OK)) {
+		if (cmp_err == ERR_CODE_OVERFLOW_ERROR || xlt_err == ERR_CODE_OVERFLOW_ERROR)
+			res = -E2BIG;
 
-	if (unlikely(status != ICP_QAT_FW_COMN_STATUS_FLAG_OK))
+		areq->dlen = 0;
 		goto end;
+	}
 
 	if (qat_req->dir == COMPRESSION) {
 		cnv = qat_comp_get_cmp_cnv_flag(resp);
 		if (unlikely(!cnv)) {
 			dev_err(&GET_DEV(accel_dev),
 				"Verified compression not supported\n");
+			areq->dlen = 0;
 			goto end;
 		}
 
@@ -174,15 +278,20 @@ static void qat_comp_generic_callback(struct qat_compression_req *qat_req,
 			dev_dbg(&GET_DEV(accel_dev),
 				"Actual buffer overflow: produced=%d, dlen=%d\n",
 				produced, qat_req->actual_dlen);
+
+			res = -E2BIG;
+			areq->dlen = 0;
 			goto end;
 		}
+
+		params.plain = !!qat_comp_get_cmp_uncomp_flag(resp);
 	}
 
 	res = 0;
 	areq->dlen = produced;
 
 	if (ctx->qat_comp_callback)
-		res = ctx->qat_comp_callback(qat_req, resp);
+		res = ctx->qat_comp_callback(qat_req, resp, &params);
 
 end:
 	qat_bl_free_bufl(accel_dev, &qat_req->buf);
@@ -200,7 +309,7 @@ void qat_comp_alg_callback(void *resp)
 	qat_alg_send_backlog(backlog);
 }
 
-static int qat_comp_alg_init_tfm(struct crypto_acomp *acomp_tfm)
+static int qat_comp_alg_init_tfm(struct crypto_acomp *acomp_tfm, int alg)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
 	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
@@ -213,12 +322,17 @@ static int qat_comp_alg_init_tfm(struct crypto_acomp *acomp_tfm)
 		node = tfm->node;
 
 	memset(ctx, 0, sizeof(*ctx));
-	inst = qat_compression_get_instance_node(node);
+	inst = qat_compression_get_instance_node(node, alg);
 	if (!inst)
 		return -EINVAL;
 	ctx->inst = inst;
 
-	return qat_comp_build_ctx(inst->accel_dev, ctx->comp_ctx, QAT_DEFLATE);
+	return qat_comp_build_ctx(inst->accel_dev, ctx->comp_ctx, alg);
+}
+
+static int qat_comp_alg_deflate_init_tfm(struct crypto_acomp *acomp_tfm)
+{
+	return qat_comp_alg_init_tfm(acomp_tfm, QAT_DEFLATE);
 }
 
 static void qat_comp_alg_exit_tfm(struct crypto_acomp *acomp_tfm)
@@ -236,7 +350,7 @@ static int qat_comp_alg_rfc1950_init_tfm(struct crypto_acomp *acomp_tfm)
 	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	int ret;
 
-	ret = qat_comp_alg_init_tfm(acomp_tfm);
+	ret = qat_comp_alg_init_tfm(acomp_tfm, QAT_DEFLATE);
 	ctx->qat_comp_callback = &qat_comp_rfc1950_callback;
 
 	return ret;
@@ -317,6 +431,43 @@ static int qat_comp_alg_decompress(struct acomp_req *req)
 	return qat_comp_alg_compress_decompress(req, DECOMPRESSION, 0, 0, 0, 0);
 }
 
+static int qat_comp_alg_zstd_decompress(struct acomp_req *req)
+{
+	struct crypto_acomp *acomp_tfm = crypto_acomp_reqtfm(req);
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
+	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	struct acomp_req *nreq = acomp_request_ctx(req);
+	zstd_frame_header header;
+	void *buffer;
+	int ret;
+
+	buffer = kmap_local_page(sg_page(req->src)) + req->src->offset;
+	ret = zstd_get_frame_header(&header, buffer, req->src->length);
+	kunmap_local(buffer);
+
+	if (ret) {
+		dev_err(&GET_DEV(ctx->inst->accel_dev),
+			"ZSTD-compressed data has an incomplete frame header\n");
+		return ret;
+	}
+
+	if (header.windowSize > QAT_ZSTD_MAX_BLOCK_SIZE ||
+	    header.frameContentSize >= QAT_ZSTD_MAX_CONTENT_SIZE) {
+		dev_dbg(&GET_DEV(ctx->inst->accel_dev),
+			"Window size=0x%llx\n", header.windowSize);
+
+		memcpy(nreq, req, sizeof(*req));
+		acomp_request_set_tfm(nreq, ctx->ftfm);
+
+		ret = crypto_acomp_decompress(nreq);
+		req->dlen = nreq->dlen;
+
+		return ret;
+	}
+
+	return qat_comp_alg_compress_decompress(req, DECOMPRESSION, 0, 0, 0, 0);
+}
+
 static int qat_comp_alg_rfc1950_compress(struct acomp_req *req)
 {
 	if (!req->dst && req->dlen != 0)
@@ -354,7 +505,193 @@ static int qat_comp_alg_rfc1950_decompress(struct acomp_req *req)
 						QAT_RFC_1950_FOOTER_SIZE, 0, 0);
 }
 
-static struct acomp_alg qat_acomp[] = { {
+static int qat_comp_lz4s_zstd_callback(struct qat_compression_req *qat_req, void *resp,
+				       struct qat_callback_params *params)
+{
+	struct acomp_req *areq = qat_req->acompress_req;
+	struct qat_zstd_scratch *scratch;
+	struct crypto_acomp_stream *s;
+	unsigned int lit_len = 0;
+	ZSTD_Sequence *out_seqs;
+	void *lz4s, *zstd;
+	size_t comp_size;
+	size_t seq_count;
+	void *input_data;
+	ZSTD_CCtx *ctx;
+	int ret = 0;
+
+	if (params->produced + QAT_ZSTD_LIT_COPY_LEN > QAT_ZSTD_SCRATCH_SIZE) {
+		pr_debug("[%s]: produced size (%u) + COPY_SIZE > QAT_ZSTD_SCRATCH_SIZE (%u)\n",
+			 __func__, params->produced, QAT_ZSTD_SCRATCH_SIZE);
+		areq->dlen = 0;
+		return -E2BIG;
+	}
+
+	s = crypto_acomp_lock_stream_bh(&qat_zstd_streams);
+	scratch = s->ctx;
+
+	lz4s = scratch->lz4s;
+	zstd = lz4s;  /* Output buffer is same as lz4s */
+	out_seqs = scratch->out_seqs;
+	ctx = scratch->ctx;
+	input_data = scratch->input_data;
+
+	if (likely(!params->plain)) {
+		if (likely(sg_nents(areq->dst) == 1)) {
+			zstd = sg_virt(areq->dst);
+			lz4s = zstd;
+		} else {
+			memcpy_from_sglist(lz4s, areq->dst, 0, params->produced);
+		}
+
+		seq_count = qat_alg_dec_lz4s(out_seqs, QAT_MAX_SEQUENCES, lz4s,
+					     params->produced, input_data, &lit_len);
+	} else {
+		out_seqs[0].litLength = areq->slen;
+		out_seqs[0].offset = 0;
+		out_seqs[0].matchLength = 0;
+
+		seq_count = 1;
+	}
+
+	comp_size = zstd_compress_sequences_and_literals(ctx, zstd, params->dlen,
+							 out_seqs, seq_count,
+							 input_data, lit_len,
+							 QAT_ZSTD_SCRATCH_SIZE,
+							 areq->slen);
+	if (zstd_is_error(comp_size)) {
+		if (comp_size == ZSTD_error_cannotProduce_uncompressedBlock)
+			ret = -E2BIG;
+		else
+			ret = -EINVAL;
+
+		comp_size = 0;
+		goto out;
+	}
+
+	if (comp_size > params->dlen) {
+		pr_debug("[%s]: compressed_size (%u) > output buffer size (%u)\n",
+			 __func__, (unsigned int)comp_size, params->dlen);
+		ret = -EOVERFLOW;
+		goto out;
+	}
+
+	if (unlikely(sg_nents(areq->dst) != 1))
+		memcpy_to_sglist(areq->dst, 0, zstd, comp_size);
+
+out:
+	areq->dlen = comp_size;
+	crypto_acomp_unlock_stream_bh(s);
+
+	return ret;
+}
+
+static int qat_comp_alg_lz4s_zstd_init_tfm(struct crypto_acomp *acomp_tfm)
+{
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
+	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	int reqsize;
+	int ret;
+
+	/* qat_comp_alg_init_tfm() wipes out the ctx */
+	ret = qat_comp_alg_init_tfm(acomp_tfm, QAT_LZ4S);
+	if (ret)
+		return ret;
+
+	ctx->ftfm = crypto_alloc_acomp_node("zstd", 0, CRYPTO_ALG_NEED_FALLBACK,
+					    tfm->node);
+	if (IS_ERR(ctx->ftfm))
+		return PTR_ERR(ctx->ftfm);
+
+	reqsize = max(sizeof(struct qat_compression_req),
+		      sizeof(struct acomp_req) + crypto_acomp_reqsize(ctx->ftfm));
+
+	acomp_tfm->reqsize = reqsize;
+
+	ctx->qat_comp_callback = &qat_comp_lz4s_zstd_callback;
+
+	return 0;
+}
+
+static int qat_comp_alg_zstd_init_tfm(struct crypto_acomp *acomp_tfm)
+{
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
+	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	int reqsize;
+	int ret;
+
+	/* qat_comp_alg_init_tfm() wipes out the ctx */
+	ret = qat_comp_alg_init_tfm(acomp_tfm, QAT_ZSTD);
+	if (ret)
+		return ret;
+
+	ctx->ftfm = crypto_alloc_acomp_node("zstd", 0, CRYPTO_ALG_NEED_FALLBACK,
+					    tfm->node);
+	if (IS_ERR(ctx->ftfm)) {
+		qat_comp_alg_exit_tfm(acomp_tfm);
+		return PTR_ERR(ctx->ftfm);
+	}
+
+	reqsize = max(sizeof(struct qat_compression_req),
+		      sizeof(struct acomp_req) + crypto_acomp_reqsize(ctx->ftfm));
+
+	acomp_tfm->reqsize = reqsize;
+
+	return 0;
+}
+
+static void qat_comp_alg_zstd_exit_tfm(struct crypto_acomp *acomp_tfm)
+{
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
+	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+
+	if (!ctx)
+		return;
+
+	if (ctx->ftfm)
+		crypto_free_acomp(ctx->ftfm);
+
+	qat_comp_alg_exit_tfm(acomp_tfm);
+}
+
+static int qat_comp_alg_lz4s_zstd_compress(struct acomp_req *req)
+{
+	struct crypto_acomp *acomp_tfm = crypto_acomp_reqtfm(req);
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
+	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	struct acomp_req *nreq = acomp_request_ctx(req);
+	int ret;
+
+	if (req->slen <= QAT_ZSTD_SCRATCH_SIZE && req->dlen <= QAT_ZSTD_SCRATCH_SIZE)
+		return qat_comp_alg_compress(req);
+
+	memcpy(nreq, req, sizeof(*req));
+	acomp_request_set_tfm(nreq, ctx->ftfm);
+
+	ret = crypto_acomp_compress(nreq);
+	req->dlen = nreq->dlen;
+
+	return ret;
+}
+
+static int qat_comp_alg_sw_decompress(struct acomp_req *req)
+{
+	struct crypto_acomp *acomp_tfm = crypto_acomp_reqtfm(req);
+	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
+	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	struct acomp_req *nreq = acomp_request_ctx(req);
+	int ret;
+
+	memcpy(nreq, req, sizeof(*req));
+	acomp_request_set_tfm(nreq, ctx->ftfm);
+
+	ret = crypto_acomp_decompress(nreq);
+	req->dlen = nreq->dlen;
+
+	return ret;
+}
+
+static struct acomp_alg qat_acomp_deflate[] = { {
 	.base = {
 		.cra_name = "deflate",
 		.cra_driver_name = "qat_deflate",
@@ -364,7 +701,7 @@ static struct acomp_alg qat_acomp[] = { {
 		.cra_reqsize = sizeof(struct qat_compression_req),
 		.cra_module = THIS_MODULE,
 	},
-	.init = qat_comp_alg_init_tfm,
+	.init = qat_comp_alg_deflate_init_tfm,
 	.exit = qat_comp_alg_exit_tfm,
 	.compress = qat_comp_alg_compress,
 	.decompress = qat_comp_alg_decompress,
@@ -382,23 +719,148 @@ static struct acomp_alg qat_acomp[] = { {
 	.exit = qat_comp_alg_exit_tfm,
 	.compress = qat_comp_alg_rfc1950_compress,
 	.decompress = qat_comp_alg_rfc1950_decompress,
-} };
+}};
 
-int qat_comp_algs_register(void)
+static struct acomp_alg qat_acomp_zstd_lz4s = {
+	.base = {
+		.cra_name = "zstd",
+		.cra_driver_name = "qat_zstd",
+		.cra_priority = 4001,
+		.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY |
+			     CRYPTO_ALG_NEED_FALLBACK,
+		.cra_reqsize = sizeof(struct qat_compression_req),
+		.cra_ctxsize = sizeof(struct qat_compression_ctx),
+		.cra_module = THIS_MODULE,
+	},
+	.init = qat_comp_alg_lz4s_zstd_init_tfm,
+	.exit = qat_comp_alg_zstd_exit_tfm,
+	.compress = qat_comp_alg_lz4s_zstd_compress,
+	.decompress = qat_comp_alg_sw_decompress,
+};
+
+static struct acomp_alg qat_acomp_zstd_native = {
+	.base = {
+		.cra_name = "zstd",
+		.cra_driver_name = "qat_zstd",
+		.cra_priority = 4001,
+		.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY |
+			     CRYPTO_ALG_NEED_FALLBACK,
+		.cra_reqsize = sizeof(struct qat_compression_req),
+		.cra_ctxsize = sizeof(struct qat_compression_ctx),
+		.cra_module = THIS_MODULE,
+	},
+	.init = qat_comp_alg_zstd_init_tfm,
+	.exit = qat_comp_alg_zstd_exit_tfm,
+	.compress = qat_comp_alg_compress,
+	.decompress = qat_comp_alg_zstd_decompress,
+};
+
+static int qat_comp_algs_register_deflate(void)
 {
 	int ret = 0;
 
 	mutex_lock(&algs_lock);
-	if (++active_devs == 1)
-		ret = crypto_register_acomps(qat_acomp, ARRAY_SIZE(qat_acomp));
+	if (++active_devs_deflate == 1) {
+		ret = crypto_register_acomps(qat_acomp_deflate,
+					     ARRAY_SIZE(qat_acomp_deflate));
+	}
 	mutex_unlock(&algs_lock);
+
 	return ret;
 }
 
-void qat_comp_algs_unregister(void)
+static void qat_comp_algs_unregister_deflate(void)
 {
 	mutex_lock(&algs_lock);
-	if (--active_devs == 0)
-		crypto_unregister_acomps(qat_acomp, ARRAY_SIZE(qat_acomp));
+	if (--active_devs_deflate == 0)
+		crypto_unregister_acomps(qat_acomp_deflate, ARRAY_SIZE(qat_acomp_deflate));
 	mutex_unlock(&algs_lock);
 }
+
+static int qat_comp_algs_register_lz4s(void)
+{
+	int ret = 0;
+
+	mutex_lock(&algs_lock);
+	if (++active_devs_lz4s == 1) {
+		ret = crypto_acomp_alloc_streams(&qat_zstd_streams);
+		if (ret) {
+			active_devs_lz4s--;
+			goto unlock;
+		}
+
+		ret = crypto_register_acomp(&qat_acomp_zstd_lz4s);
+		if (ret) {
+			crypto_acomp_free_streams(&qat_zstd_streams);
+			active_devs_lz4s--;
+		}
+	}
+unlock:
+	mutex_unlock(&algs_lock);
+
+	return ret;
+}
+
+static void qat_comp_algs_unregister_lz4s(void)
+{
+	mutex_lock(&algs_lock);
+	if (--active_devs_lz4s == 0) {
+		crypto_unregister_acomp(&qat_acomp_zstd_lz4s);
+		crypto_acomp_free_streams(&qat_zstd_streams);
+	}
+	mutex_unlock(&algs_lock);
+}
+
+static int qat_comp_algs_register_zstd(void)
+{
+	int ret = 0;
+
+	mutex_lock(&algs_lock);
+	if (++active_devs_zstd == 1)
+		ret = crypto_register_acomp(&qat_acomp_zstd_native);
+	mutex_unlock(&algs_lock);
+
+	return ret;
+}
+
+static void qat_comp_algs_unregister_zstd(void)
+{
+	mutex_lock(&algs_lock);
+	if (--active_devs_zstd == 0)
+		crypto_unregister_acomp(&qat_acomp_zstd_native);
+	mutex_unlock(&algs_lock);
+}
+
+int qat_comp_algs_register(u32 caps)
+{
+	int ret = 0;
+
+	ret = qat_comp_algs_register_deflate();
+	if (ret)
+		return ret;
+
+	if (caps & ADF_ACCEL_CAPABILITIES_EXT_ZSTD_LZ4S) {
+		ret = qat_comp_algs_register_lz4s();
+		if (ret)
+			return ret;
+	}
+
+	if (caps & ADF_ACCEL_CAPABILITIES_EXT_ZSTD) {
+		ret = qat_comp_algs_register_zstd();
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+void qat_comp_algs_unregister(u32 caps)
+{
+	qat_comp_algs_unregister_deflate();
+
+	if (caps & ADF_ACCEL_CAPABILITIES_EXT_ZSTD_LZ4S)
+		qat_comp_algs_unregister_lz4s();
+
+	if (caps & ADF_ACCEL_CAPABILITIES_EXT_ZSTD)
+		qat_comp_algs_unregister_zstd();
+}
diff --git a/drivers/crypto/intel/qat/qat_common/qat_comp_req.h b/drivers/crypto/intel/qat/qat_common/qat_comp_req.h
index 18a1f33a6db9..a3e5cd3c72c6 100644
--- a/drivers/crypto/intel/qat/qat_common/qat_comp_req.h
+++ b/drivers/crypto/intel/qat/qat_common/qat_comp_req.h
@@ -7,6 +7,8 @@
 
 #define QAT_COMP_REQ_SIZE (sizeof(struct icp_qat_fw_comp_req))
 #define QAT_COMP_CTX_SIZE (QAT_COMP_REQ_SIZE * 2)
+#define QAT_ASB_RATIO_MODE_VAL 8
+#define QAT_ASB_VALUE(slen) (((slen) >> 4) * (QAT_ASB_RATIO_MODE_VAL + 1))
 
 static inline void qat_comp_create_req(void *ctx, void *req, u64 src, u32 slen,
 				       u64 dst, u32 dlen, u64 opaque)
@@ -23,6 +25,7 @@ static inline void qat_comp_create_req(void *ctx, void *req, u64 src, u32 slen,
 	fw_req->comn_mid.opaque_data = opaque;
 	req_pars->comp_len = slen;
 	req_pars->out_buffer_sz = dlen;
+	fw_req->u3.asb_threshold.asb_value = QAT_ASB_VALUE(slen);
 }
 
 static inline void qat_comp_create_compression_req(void *ctx, void *req,
@@ -110,4 +113,12 @@ static inline u8 qat_comp_get_cmp_cnv_flag(void *resp)
 	return ICP_QAT_FW_COMN_HDR_CNV_FLAG_GET(flags);
 }
 
+static inline u8 qat_comp_get_cmp_uncomp_flag(void *resp)
+{
+	struct icp_qat_fw_comp_resp *qat_resp = resp;
+	u8 flags = qat_resp->comn_resp.hdr_flags;
+
+	return ICP_QAT_FW_COMN_HDR_ST_BLK_FLAG_GET(flags);
+}
+
 #endif
diff --git a/drivers/crypto/intel/qat/qat_common/qat_comp_zstd_utils.c b/drivers/crypto/intel/qat/qat_common/qat_comp_zstd_utils.c
new file mode 100644
index 000000000000..3cf4c3034d5d
--- /dev/null
+++ b/drivers/crypto/intel/qat/qat_common/qat_comp_zstd_utils.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2025 Intel Corporation */
+#include <linux/module.h>
+#include <linux/byteorder/generic.h>
+#include <linux/zstd_lib.h>
+
+#include "qat_comp_zstd_utils.h"
+
+#define ML_BITS		4
+#define ML_MASK		((1U << ML_BITS) - 1)
+#define RUN_BITS	(8 - ML_BITS)
+#define RUN_MASK	((1U << RUN_BITS) - 1)
+#define LZ4MINMATCH	2
+
+/*
+ * Implement the same algorithm as the QAT ZSTD sequence producer plugin
+ * to decode LZ4s-formatted data into the ZSTD_Sequence format.
+ */
+size_t qat_alg_dec_lz4s(ZSTD_Sequence *out_seqs, size_t out_seqs_capacity,
+			unsigned char *lz4s_buff, unsigned int lz4s_buff_size,
+			unsigned char *literals, unsigned int *lit_len)
+{
+	unsigned char *end_ip = lz4s_buff + lz4s_buff_size;
+	unsigned int hist_literal_len = 0;
+	unsigned char *ip = lz4s_buff;
+	size_t seqs_idx = 0;
+
+	*lit_len = 0;
+
+	if (!lz4s_buff_size)
+		return 0;
+
+	while (ip < end_ip) {
+		size_t length = 0;
+		size_t offset = 0;
+		size_t literal_len = 0, match_len = 0;
+
+		/* get literal length */
+		unsigned const token = *ip++;
+
+		length = token >> ML_BITS;
+		if (length == RUN_MASK) {
+			unsigned int s;
+
+			do {
+				s = *ip++;
+				length += s;
+			} while (s == 255);
+		}
+
+		literal_len = length;
+
+		{
+			u8 *start = ip;
+			u8 *dest = literals;
+			u8 *dest_end = literals + length;
+
+			do {
+				__builtin_memcpy(dest, start, QAT_ZSTD_LIT_COPY_LEN);
+				dest += QAT_ZSTD_LIT_COPY_LEN;
+				start += QAT_ZSTD_LIT_COPY_LEN;
+			} while (dest < dest_end);
+		}
+
+		literals += length;
+		*lit_len += length;
+
+		ip += length;
+		if (ip == end_ip) { /* Meet the end of the LZ4 sequence */
+			literal_len += hist_literal_len;
+			out_seqs[seqs_idx].litLength = literal_len;
+			out_seqs[seqs_idx].offset = offset;
+			out_seqs[seqs_idx].matchLength = match_len;
+			break;
+		}
+
+		/* get matchPos */
+		offset = le16_to_cpu(*(__le16 *)ip);
+		ip += 2;
+
+		/* get match length */
+		length = token & ML_MASK;
+		if (length == ML_MASK) {
+			unsigned int s;
+
+			do {
+				s = *ip++;
+				length += s;
+			} while (s == 255);
+		}
+		if (length != 0) {
+			length += LZ4MINMATCH;
+			match_len = (unsigned short)length;
+			literal_len += hist_literal_len;
+
+			/* update ZSTD_Sequence */
+			out_seqs[seqs_idx].offset = offset;
+			out_seqs[seqs_idx].litLength = literal_len;
+			out_seqs[seqs_idx].matchLength = match_len;
+			hist_literal_len = 0;
+			++seqs_idx;
+			if (seqs_idx >= (out_seqs_capacity - 1)) {
+				pr_debug("[%s]: qat zstd sequence overflow (seqs_idx:%zu, out_seqs_capacity:%zu, lz4s_buff_size:%u)\n",
+					 __func__, seqs_idx, out_seqs_capacity, lz4s_buff_size);
+				return -1;
+			}
+		} else {
+			if (literal_len > 0) {
+				/*
+				 * When the match length is 0, literal_len must be
+				 * carried over and processed together with the next
+				 * data block.
+				 */
+				hist_literal_len += literal_len;
+			}
+		}
+	}
+
+	return ++seqs_idx;
+}
diff --git a/drivers/crypto/intel/qat/qat_common/qat_comp_zstd_utils.h b/drivers/crypto/intel/qat/qat_common/qat_comp_zstd_utils.h
new file mode 100644
index 000000000000..89fc8c8dceea
--- /dev/null
+++ b/drivers/crypto/intel/qat/qat_common/qat_comp_zstd_utils.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2025 Intel Corporation */
+#ifndef QAT_COMP_ZSTD_UTILS_H_
+#define QAT_COMP_ZSTD_UTILS_H_
+#include <linux/zstd_lib.h>
+
+#define QAT_ZSTD_LIT_COPY_LEN	8
+
+size_t qat_alg_dec_lz4s(ZSTD_Sequence *out_seqs, size_t out_seqs_capacity,
+			unsigned char *lz4s_buff, unsigned int lz4s_buff_size,
+			unsigned char *literals, unsigned int *lit_len);
+
+#endif
diff --git a/drivers/crypto/intel/qat/qat_common/qat_compression.c b/drivers/crypto/intel/qat/qat_common/qat_compression.c
index 53a4db5507ec..1424d7a9bcd3 100644
--- a/drivers/crypto/intel/qat/qat_common/qat_compression.c
+++ b/drivers/crypto/intel/qat/qat_common/qat_compression.c
@@ -46,12 +46,14 @@ static int qat_compression_free_instances(struct adf_accel_dev *accel_dev)
 	return 0;
 }
 
-struct qat_compression_instance *qat_compression_get_instance_node(int node)
+struct qat_compression_instance *qat_compression_get_instance_node(int node, int alg)
 {
 	struct qat_compression_instance *inst = NULL;
+	struct adf_hw_device_data *hw_data = NULL;
 	struct adf_accel_dev *accel_dev = NULL;
 	unsigned long best = ~0;
 	struct list_head *itr;
+	u32 caps, mask;
 
 	list_for_each(itr, adf_devmgr_get_head()) {
 		struct adf_accel_dev *tmp_dev;
@@ -61,6 +63,15 @@ struct qat_compression_instance *qat_compression_get_instance_node(int node)
 		tmp_dev = list_entry(itr, struct adf_accel_dev, list);
 		tmp_dev_node = dev_to_node(&GET_DEV(tmp_dev));
 
+		if (alg == QAT_ZSTD || alg == QAT_LZ4S) {
+			hw_data = tmp_dev->hw_device;
+			caps = hw_data->accel_capabilities_ext_mask;
+			mask = ADF_ACCEL_CAPABILITIES_EXT_ZSTD |
+			       ADF_ACCEL_CAPABILITIES_EXT_ZSTD_LZ4S;
+			if (!(caps & mask))
+				continue;
+		}
+
 		if ((node == tmp_dev_node || tmp_dev_node < 0) &&
 		    adf_dev_started(tmp_dev) && !list_empty(&tmp_dev->compression_list)) {
 			ctr = atomic_read(&tmp_dev->ref_count);
@@ -78,6 +89,16 @@ struct qat_compression_instance *qat_compression_get_instance_node(int node)
 			struct adf_accel_dev *tmp_dev;
 
 			tmp_dev = list_entry(itr, struct adf_accel_dev, list);
+
+			if (alg == QAT_ZSTD || alg == QAT_LZ4S) {
+				hw_data = tmp_dev->hw_device;
+				caps = hw_data->accel_capabilities_ext_mask;
+				mask = ADF_ACCEL_CAPABILITIES_EXT_ZSTD |
+				       ADF_ACCEL_CAPABILITIES_EXT_ZSTD_LZ4S;
+				if (!(caps & mask))
+					continue;
+			}
+
 			if (adf_dev_started(tmp_dev) &&
 			    !list_empty(&tmp_dev->compression_list)) {
 				accel_dev = tmp_dev;
-- 
2.51.1



* [RFC PATCH 15/16] crypto: qat - add support for compression levels
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (14 preceding siblings ...)
  2025-11-28 19:05 ` [RFC PATCH 14/16] crypto: qat - add support for zstd Giovanni Cabiddu
@ 2025-11-28 19:05 ` Giovanni Cabiddu
  2025-11-28 19:05 ` [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload Giovanni Cabiddu
  2025-12-02  7:53 ` [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Christoph Hellwig
  17 siblings, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:05 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

Add compression level support to all compression implementations in the
QAT driver by implementing the setparam() API.

Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 .../intel/qat/qat_6xxx/adf_6xxx_hw_data.c     | 12 ++-
 .../intel/qat/qat_common/adf_accel_devices.h  |  2 +-
 drivers/crypto/intel/qat/qat_common/adf_dc.c  |  5 +-
 drivers/crypto/intel/qat/qat_common/adf_dc.h  |  3 +-
 .../intel/qat/qat_common/adf_gen2_hw_data.c   | 16 +++-
 .../intel/qat/qat_common/adf_gen4_hw_data.c   | 10 ++-
 .../intel/qat/qat_common/qat_comp_algs.c      | 85 ++++++++++++++++++-
 7 files changed, 122 insertions(+), 11 deletions(-)

diff --git a/drivers/crypto/intel/qat/qat_6xxx/adf_6xxx_hw_data.c b/drivers/crypto/intel/qat/qat_6xxx/adf_6xxx_hw_data.c
index b04d6b947da2..07582ea35182 100644
--- a/drivers/crypto/intel/qat/qat_6xxx/adf_6xxx_hw_data.c
+++ b/drivers/crypto/intel/qat/qat_6xxx/adf_6xxx_hw_data.c
@@ -459,7 +459,7 @@ static int ring_pair_reset(struct adf_accel_dev *accel_dev, u32 bank_number)
 	return ret;
 }
 
-static int build_comp_block(void *ctx, enum adf_dc_algo algo)
+static int build_comp_block(void *ctx, enum adf_dc_algo algo, unsigned int level)
 {
 	struct icp_qat_fw_comp_req *req_tmpl = ctx;
 	struct icp_qat_fw_comp_req_hdr_cd_pars *cd_pars = &req_tmpl->cd_pars;
@@ -478,8 +478,16 @@ static int build_comp_block(void *ctx, enum adf_dc_algo algo)
 		return -EINVAL;
 	}
 
+	if (level < 6)
+		hw_comp_lower_csr.sd = ICP_QAT_HW_COMP_51_SEARCH_DEPTH_LEVEL_1;
+	else if (level < 9)
+		hw_comp_lower_csr.sd = ICP_QAT_HW_COMP_51_SEARCH_DEPTH_LEVEL_6;
+	else if (level < 10)
+		hw_comp_lower_csr.sd = ICP_QAT_HW_COMP_51_SEARCH_DEPTH_LEVEL_9;
+	else
+		hw_comp_lower_csr.sd = ICP_QAT_HW_COMP_51_SEARCH_DEPTH_LEVEL_10;
+
 	hw_comp_lower_csr.lllbd = ICP_QAT_HW_COMP_51_LLLBD_CTRL_LLLBD_DISABLED;
-	hw_comp_lower_csr.sd = ICP_QAT_HW_COMP_51_SEARCH_DEPTH_LEVEL_1;
 	lower_val = ICP_QAT_FW_COMP_51_BUILD_CONFIG_LOWER(hw_comp_lower_csr);
 	cd_pars->u.sl.comp_slice_cfg_word[0] = lower_val;
 	cd_pars->u.sl.comp_slice_cfg_word[1] = 0;
diff --git a/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h b/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h
index aea24173efe4..bc6951aa36a0 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_accel_devices.h
@@ -245,7 +245,7 @@ struct adf_pfvf_ops {
 };
 
 struct adf_dc_ops {
-	int (*build_comp_block)(void *ctx, enum adf_dc_algo algo);
+	int (*build_comp_block)(void *ctx, enum adf_dc_algo algo, unsigned int level);
 	int (*build_decomp_block)(void *ctx, enum adf_dc_algo algo);
 };
 
diff --git a/drivers/crypto/intel/qat/qat_common/adf_dc.c b/drivers/crypto/intel/qat/qat_common/adf_dc.c
index 3e8fb4e3ed97..c6aeda2258af 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_dc.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_dc.c
@@ -4,7 +4,8 @@
 #include "adf_dc.h"
 #include "icp_qat_fw_comp.h"
 
-int qat_comp_build_ctx(struct adf_accel_dev *accel_dev, void *ctx, enum adf_dc_algo algo)
+int qat_comp_build_ctx(struct adf_accel_dev *accel_dev, void *ctx,
+		       enum adf_dc_algo algo, unsigned int level)
 {
 	struct icp_qat_fw_comp_req *req_tmpl = ctx;
 	struct icp_qat_fw_comp_cd_hdr *comp_cd_ctrl = &req_tmpl->comp_cd_ctrl;
@@ -27,7 +28,7 @@ int qat_comp_build_ctx(struct adf_accel_dev *accel_dev, void *ctx, enum adf_dc_a
 					    ICP_QAT_FW_COMP_ENABLE_SECURE_RAM_USED_AS_INTMD_BUF);
 
 	/* Build HW config block for compression */
-	ret = GET_DC_OPS(accel_dev)->build_comp_block(ctx, algo);
+	ret = GET_DC_OPS(accel_dev)->build_comp_block(ctx, algo, level);
 	if (ret) {
 		dev_err(&GET_DEV(accel_dev), "Failed to build compression block\n");
 		return ret;
diff --git a/drivers/crypto/intel/qat/qat_common/adf_dc.h b/drivers/crypto/intel/qat/qat_common/adf_dc.h
index 6cb5e09054a6..eca5a63d85ab 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_dc.h
+++ b/drivers/crypto/intel/qat/qat_common/adf_dc.h
@@ -12,6 +12,7 @@ enum adf_dc_algo {
 	QAT_ZSTD,
 };
 
-int qat_comp_build_ctx(struct adf_accel_dev *accel_dev, void *ctx, enum adf_dc_algo algo);
+int qat_comp_build_ctx(struct adf_accel_dev *accel_dev, void *ctx, enum adf_dc_algo algo,
+		       unsigned int level);
 
 #endif /* ADF_DC_H */
diff --git a/drivers/crypto/intel/qat/qat_common/adf_gen2_hw_data.c b/drivers/crypto/intel/qat/qat_common/adf_gen2_hw_data.c
index 6a505e9a5cf9..8dea916ecc45 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_gen2_hw_data.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_gen2_hw_data.c
@@ -172,11 +172,12 @@ void adf_gen2_set_ssm_wdtimer(struct adf_accel_dev *accel_dev)
 }
 EXPORT_SYMBOL_GPL(adf_gen2_set_ssm_wdtimer);
 
-static int adf_gen2_build_comp_block(void *ctx, enum adf_dc_algo algo)
+static int adf_gen2_build_comp_block(void *ctx, enum adf_dc_algo algo, unsigned int level)
 {
 	struct icp_qat_fw_comp_req *req_tmpl = ctx;
 	struct icp_qat_fw_comp_req_hdr_cd_pars *cd_pars = &req_tmpl->cd_pars;
 	struct icp_qat_fw_comn_req_hdr *header = &req_tmpl->comn_hdr;
+	u32 l;
 
 	switch (algo) {
 	case QAT_DEFLATE:
@@ -186,11 +187,22 @@ static int adf_gen2_build_comp_block(void *ctx, enum adf_dc_algo algo)
 		return -EINVAL;
 	}
 
+	if (level < 2)
+		l = ICP_QAT_HW_COMPRESSION_DEPTH_1;
+	else if (level < 4)
+		l = ICP_QAT_HW_COMPRESSION_DEPTH_4;
+	else if (level < 6)
+		l = ICP_QAT_HW_COMPRESSION_DEPTH_8;
+	else if (level < 8)
+		l = ICP_QAT_HW_COMPRESSION_DEPTH_16;
+	else
+		l = ICP_QAT_HW_COMPRESSION_DEPTH_128;
+
 	cd_pars->u.sl.comp_slice_cfg_word[0] =
 		ICP_QAT_HW_COMPRESSION_CONFIG_BUILD(ICP_QAT_HW_COMPRESSION_DIR_COMPRESS,
 						    ICP_QAT_HW_COMPRESSION_DELAYED_MATCH_DISABLED,
 						    ICP_QAT_HW_COMPRESSION_ALGO_DEFLATE,
-						    ICP_QAT_HW_COMPRESSION_DEPTH_1,
+						    l,
 						    ICP_QAT_HW_COMPRESSION_FILE_TYPE_0);
 
 	return 0;
diff --git a/drivers/crypto/intel/qat/qat_common/adf_gen4_hw_data.c b/drivers/crypto/intel/qat/qat_common/adf_gen4_hw_data.c
index faeffe941591..d949ed5400d0 100644
--- a/drivers/crypto/intel/qat/qat_common/adf_gen4_hw_data.c
+++ b/drivers/crypto/intel/qat/qat_common/adf_gen4_hw_data.c
@@ -491,7 +491,7 @@ int adf_gen4_bank_drain_start(struct adf_accel_dev *accel_dev,
 	return ret;
 }
 
-static int adf_gen4_build_comp_block(void *ctx, enum adf_dc_algo algo)
+static int adf_gen4_build_comp_block(void *ctx, enum adf_dc_algo algo, unsigned int level)
 {
 	struct icp_qat_fw_comp_req *req_tmpl = ctx;
 	struct icp_qat_fw_comp_req_hdr_cd_pars *cd_pars = &req_tmpl->cd_pars;
@@ -519,7 +519,13 @@ static int adf_gen4_build_comp_block(void *ctx, enum adf_dc_algo algo)
 		return -EINVAL;
 	}
 
-	hw_comp_lower_csr.sd = ICP_QAT_HW_COMP_20_SEARCH_DEPTH_LEVEL_1;
+	if (level < 4)
+		hw_comp_lower_csr.sd = ICP_QAT_HW_COMP_20_SEARCH_DEPTH_LEVEL_1;
+	else if (level < 7)
+		hw_comp_lower_csr.sd = ICP_QAT_HW_COMP_20_SEARCH_DEPTH_LEVEL_6;
+	else
+		hw_comp_lower_csr.sd = ICP_QAT_HW_COMP_20_SEARCH_DEPTH_LEVEL_9;
+
 	hw_comp_lower_csr.hash_update = ICP_QAT_HW_COMP_20_SKIP_HASH_UPDATE_DONT_ALLOW;
 	hw_comp_lower_csr.edmm = ICP_QAT_HW_COMP_20_EXTENDED_DELAY_MATCH_MODE_EDMM_ENABLED;
 	hw_comp_upper_csr.nice = ICP_QAT_HW_COMP_20_CONFIG_CSR_NICE_PARAM_DEFAULT_VAL;
diff --git a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
index 0e237c2d7966..d549c5a315d8 100644
--- a/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
+++ b/drivers/crypto/intel/qat/qat_common/qat_comp_algs.c
@@ -6,6 +6,7 @@
 #include <crypto/scatterwalk.h>
 #include <linux/dma-mapping.h>
 #include <linux/workqueue.h>
+#include <linux/zlib.h>
 #include <linux/zstd.h>
 #include "adf_accel_devices.h"
 #include "adf_common_drv.h"
@@ -28,6 +29,7 @@
 #define QAT_ZSTD_MAX_BLOCK_SIZE			65536
 #define QAT_MAX_SEQUENCES			(128 * 1024)
 #define QAT_ZSTD_MAX_CONTENT_SIZE		4096
+#define QAT_DEFAULT_COMP_LEVEL			1
 
 static DEFINE_MUTEX(algs_lock);
 static unsigned int active_devs_deflate;
@@ -132,6 +134,7 @@ struct qat_compression_ctx {
 	int (*qat_comp_callback)(struct qat_compression_req *qat_req, void *resp,
 				 struct qat_callback_params *params);
 	struct crypto_acomp *ftfm;
+	struct crypto_acomp_params params;
 };
 
 struct qat_compression_req {
@@ -327,7 +330,7 @@ static int qat_comp_alg_init_tfm(struct crypto_acomp *acomp_tfm, int alg)
 		return -EINVAL;
 	ctx->inst = inst;
 
-	return qat_comp_build_ctx(inst->accel_dev, ctx->comp_ctx, alg);
+	return qat_comp_build_ctx(inst->accel_dev, ctx->comp_ctx, alg, QAT_DEFAULT_COMP_LEVEL);
 }
 
 static int qat_comp_alg_deflate_init_tfm(struct crypto_acomp *acomp_tfm)
@@ -340,6 +343,8 @@ static void qat_comp_alg_exit_tfm(struct crypto_acomp *acomp_tfm)
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
 	struct qat_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 
+	crypto_acomp_putparams(&ctx->params);
+
 	qat_compression_put_instance(ctx->inst);
 	memset(ctx, 0, sizeof(*ctx));
 }
@@ -691,6 +696,80 @@ static int qat_comp_alg_sw_decompress(struct acomp_req *req)
 	return ret;
 }
 
+static int qat_comp_setparam_deflate(struct crypto_acomp *tfm, const u8 *param,
+				     unsigned int len)
+{
+	struct qat_compression_ctx *ctx = acomp_tfm_ctx(tfm);
+	struct crypto_acomp_params *p = &ctx->params;
+	struct adf_accel_dev *accel_dev;
+	int ret;
+
+	if (!ctx->inst || !ctx->inst->accel_dev)
+		return -EINVAL;
+
+	accel_dev = ctx->inst->accel_dev;
+
+	ret = crypto_acomp_getparams(p, param, len);
+	if (ret)
+		return ret;
+
+	if (p->level > Z_BEST_COMPRESSION || p->level < Z_DEFAULT_COMPRESSION) {
+		dev_warn(&GET_DEV(accel_dev),
+			 "[%s]: invalid level %d\n", __func__, p->level);
+		p->level = QAT_DEFAULT_COMP_LEVEL;
+		return -EINVAL;
+	}
+
+	if (p->level == CRYPTO_COMP_NO_LEVEL)
+		p->level = QAT_DEFAULT_COMP_LEVEL;
+
+	return qat_comp_build_ctx(accel_dev, ctx->comp_ctx, QAT_DEFLATE,
+				  p->level);
+}
+
+static int qat_comp_setparam_zstd(struct crypto_acomp *tfm, const u8 *param,
+				  unsigned int len, enum adf_dc_algo algo)
+{
+	struct qat_compression_ctx *ctx = acomp_tfm_ctx(tfm);
+	struct crypto_acomp_params *p = &ctx->params;
+	struct adf_accel_dev *accel_dev;
+	int ret;
+
+	if (!ctx->inst || !ctx->inst->accel_dev)
+		return -EINVAL;
+
+	accel_dev = ctx->inst->accel_dev;
+
+	ret = crypto_acomp_getparams(p, param, len);
+	if (ret)
+		return ret;
+
+	if (p->level > zstd_max_clevel() || p->level < zstd_min_clevel()) {
+		dev_warn(&GET_DEV(accel_dev),
+			 "[%s]: invalid level %d\n", __func__, p->level);
+		p->level = QAT_DEFAULT_COMP_LEVEL;
+		return -EINVAL;
+	}
+
+	if (p->level == CRYPTO_COMP_NO_LEVEL || p->level <= 0)
+		p->level = QAT_DEFAULT_COMP_LEVEL;
+
+	return qat_comp_build_ctx(accel_dev, ctx->comp_ctx, algo,
+				  p->level);
+}
+
+static int qat_comp_setparam_zstd_native(struct crypto_acomp *tfm, const u8 *param,
+					 unsigned int len)
+{
+	return qat_comp_setparam_zstd(tfm, param, len, QAT_ZSTD);
+}
+
+static int qat_comp_setparam_zstd_lz4s(struct crypto_acomp *tfm, const u8 *param,
+				       unsigned int len)
+{
+	return qat_comp_setparam_zstd(tfm, param, len, QAT_LZ4S);
+}
+
 static struct acomp_alg qat_acomp_deflate[] = { {
 	.base = {
 		.cra_name = "deflate",
@@ -705,6 +784,7 @@ static struct acomp_alg qat_acomp_deflate[] = { {
 	.exit = qat_comp_alg_exit_tfm,
 	.compress = qat_comp_alg_compress,
 	.decompress = qat_comp_alg_decompress,
+	.setparam = qat_comp_setparam_deflate,
 }, {
 	.base = {
 		.cra_name = "zlib-deflate",
@@ -719,6 +799,7 @@ static struct acomp_alg qat_acomp_deflate[] = { {
 	.exit = qat_comp_alg_exit_tfm,
 	.compress = qat_comp_alg_rfc1950_compress,
 	.decompress = qat_comp_alg_rfc1950_decompress,
+	.setparam = qat_comp_setparam_deflate,
 }};
 
 static struct acomp_alg qat_acomp_zstd_lz4s = {
@@ -736,6 +817,7 @@ static struct acomp_alg qat_acomp_zstd_lz4s = {
 	.exit = qat_comp_alg_zstd_exit_tfm,
 	.compress = qat_comp_alg_lz4s_zstd_compress,
 	.decompress = qat_comp_alg_sw_decompress,
+	.setparam = qat_comp_setparam_zstd_lz4s,
 };
 
 static struct acomp_alg qat_acomp_zstd_native = {
@@ -753,6 +835,7 @@ static struct acomp_alg qat_acomp_zstd_native = {
 	.exit = qat_comp_alg_zstd_exit_tfm,
 	.compress = qat_comp_alg_compress,
 	.decompress = qat_comp_alg_zstd_decompress,
+	.setparam = qat_comp_setparam_zstd_native,
 };
 
 static int qat_comp_algs_register_deflate(void)
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (15 preceding siblings ...)
  2025-11-28 19:05 ` [RFC PATCH 15/16] crypto: qat - add support for compression levels Giovanni Cabiddu
@ 2025-11-28 19:05 ` Giovanni Cabiddu
  2025-11-28 21:55   ` Qu Wenruo
  2025-12-02  7:53 ` [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Christoph Hellwig
  17 siblings, 1 reply; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 19:05 UTC (permalink / raw)
  To: clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky, Giovanni Cabiddu

Add support for hardware-accelerated compression using the acomp API in
the crypto framework, enabling offload of zlib and zstd compression to
hardware accelerators. Hardware offload reduces CPU load during
compression, improving performance.

The implementation follows a generic design that should work with any
acomp implementation, though this initial enablement targets Intel QAT
devices (similar to what is done in EROFS).

Input folios are organized into a scatter-gather list and submitted to
the accelerator in a single asynchronous request. The calling thread
sleeps while the hardware performs compression, freeing the CPU for
other tasks.  Upon completion, the acomp callback wakes the thread to
continue processing.
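
Roughly, the submission path described above reduces to the following
condensed sketch (pseudocode; the acomp_* and crypto_* calls are the
actual crypto API used by this series, the surrounding steps are
abbreviated and error handling is omitted):

    build in_sgl from the input folios covering [start, start + len)
    allocate output folios and build out_sgl

    crypto_init_wait(&wait);
    acomp_request_set_params(req, in_sgl, out_sgl, len, dst_size);
    acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
                               crypto_req_done, &wait);
    /* the caller sleeps here until the accelerator completes */
    ret = crypto_wait_req(crypto_acomp_compress(req), &wait);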

Offload is supported for:
  - zlib: compression and decompression
  - zstd: compression only

Offload is only attempted when the data size exceeds a minimum threshold,
ensuring that small operations remain efficient by avoiding hardware setup
overhead. All required buffers are pre-allocated in the workspace to
eliminate allocations in the data path.

This feature maintains full compatibility with the existing BTRFS disk
format. Files compressed by hardware can be decompressed by software
implementations and vice versa.

The feature is wrapped in CONFIG_BTRFS_EXPERIMENTAL and can be enabled
at runtime via the sysfs parameter /sys/fs/btrfs/<UUID>/offload_compress.
Enabling this parameter succeeds only if a compatible acomp
implementation (e.g., QAT driver) is available. The runtime control
allows unloading the hardware driver when needed. Without it, the driver
would remain permanently in use and could not be removed.
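
For example, toggling the control from userspace could look like the
following (illustrative only; <UUID> is the filesystem UUID under
/sys/fs/btrfs/, and enabling requires a compatible acomp provider such
as the QAT driver to be loaded):

    # enable offload; the write fails with EOPNOTSUPP if no
    # compatible acomp implementation is available
    echo 1 > /sys/fs/btrfs/<UUID>/offload_compress
    # query current state (0 = disabled, 1 = enabled)
    cat /sys/fs/btrfs/<UUID>/offload_compress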

Co-developed-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 fs/btrfs/Makefile          |   2 +-
 fs/btrfs/acomp.c           | 470 +++++++++++++++++++++++++++++++++++++
 fs/btrfs/acomp_workspace.h |  61 +++++
 fs/btrfs/compression.c     |  66 ++++++
 fs/btrfs/compression.h     |  30 +++
 fs/btrfs/disk-io.c         |   6 +
 fs/btrfs/fs.h              |   8 +
 fs/btrfs/sysfs.c           |  29 +++
 fs/btrfs/zlib.c            |  81 +++++++
 fs/btrfs/zstd.c            |  64 +++++
 10 files changed, 816 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/acomp.c
 create mode 100644 fs/btrfs/acomp_workspace.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 743d7677b175..6f9959218de7 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -27,7 +27,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   transaction.o inode.o file.o defrag.o \
 	   extent_map.o sysfs.o accessors.o xattr.o ordered-data.o \
 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
-	   export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \
+	   export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o acomp.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
diff --git a/fs/btrfs/acomp.c b/fs/btrfs/acomp.c
new file mode 100644
index 000000000000..403ae19d0c18
--- /dev/null
+++ b/fs/btrfs/acomp.c
@@ -0,0 +1,470 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BTRFS acomp layer
+ *
+ * Copyright (c) 2025, Intel Corporation
+ * Author: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
+ */
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+#include <crypto/acompress.h>
+#include <linux/bio.h>
+#include <linux/err.h>
+#include <linux/gfp.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/pagemap.h>
+#include <linux/rtnetlink.h>
+#include <linux/scatterlist.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include "compression.h"
+#include "acomp_workspace.h"
+
+static int folios_to_scatterlist(struct folio **folios, unsigned int nr_folios,
+				 size_t size, struct scatterlist *sg,
+				 unsigned int first_folio_offset)
+{
+	size_t available, len;
+	unsigned int offset;
+	int i;
+
+	if (!folios || nr_folios == 0 || !sg)
+		return -EINVAL;
+
+	if (nr_folios > BTRFS_ACOMP_MAX_SGL_ENTRIES)
+		return -E2BIG;
+
+	sg_init_table(sg, nr_folios);
+
+	for (i = 0; i < nr_folios && size; i++) {
+		/* For the first folio, use the provided offset; for others, use 0 */
+		offset = (i == 0) ? first_folio_offset : 0;
+		available = folio_size(folios[i]) - offset;
+
+		len = min(size, available);
+		sg_set_folio(&sg[i], folios[i], len, offset);
+		size -= len;
+	}
+
+	return 0;
+}
+
+static int build_acomp_attr_buffer(u8 *buf, unsigned int *len, u32 level)
+{
+	unsigned int total_len;
+	struct rtattr *rta;
+	u8 *pos;
+
+	if (!buf || !len || *len == 0)
+		return -EINVAL;
+
+	total_len = RTA_SPACE(sizeof(u32)) + RTA_SPACE(0);
+	if (total_len > *len)
+		return -E2BIG;
+
+	pos = buf;
+
+	rta = (struct rtattr *)pos;
+	rta->rta_type = CRYPTO_COMP_PARAM_LEVEL;
+	rta->rta_len = RTA_LENGTH(sizeof(u32));
+	memcpy(RTA_DATA(rta), &level, sizeof(level));
+	pos += RTA_SPACE(sizeof(u32));
+
+	rta = (struct rtattr *)pos;
+	rta->rta_type = CRYPTO_COMP_PARAM_LAST;
+	rta->rta_len = RTA_LENGTH(0);
+	pos += RTA_SPACE(0);
+
+	*len = total_len;
+
+	return 0;
+}
+
+int acomp_comp_folios(struct btrfs_acomp_workspace *acomp_ws,
+		      struct btrfs_fs_info *fs_info,
+		      struct address_space *mapping, u64 start, unsigned long len,
+		      struct folio **folios, unsigned long *out_folios,
+		      unsigned long *total_in, unsigned long *total_out, int level)
+{
+	struct scatterlist *out_sgl = NULL;
+	struct scatterlist *in_sgl = NULL;
+	const u64 orig_end = start + len;
+	struct crypto_acomp *tfm = NULL;
+	struct folio **in_folios = NULL;
+	unsigned int first_folio_offset;
+	unsigned int nr_dst_folios = 0;
+	struct folio *out_folio = NULL;
+	unsigned int nr_src_folios = 0;
+	struct acomp_req *req = NULL;
+	unsigned int nr_folios = 0;
+	unsigned int dst_size = 0;
+	unsigned int raw_attr_len;
+	unsigned int bytes_left;
+	unsigned int nofs_flags;
+	struct crypto_wait wait;
+	struct folio *in_folio;
+	unsigned int cur_len;
+	unsigned int i;
+	u64 cur_start;
+	u8 *raw_attr;
+	int ret;
+
+	if (!acomp_ws)
+		return -EOPNOTSUPP;
+
+	/* Check if offload is enabled and acquire reference */
+	if (!atomic_read(&fs_info->compress_offload_enabled))
+		return -EOPNOTSUPP;
+
+	if (!atomic_inc_not_zero(&fs_info->compr_resource_refcnt))
+		return -EOPNOTSUPP;
+
+	/* Protect against GFP_KERNEL allocations in crypto subsystem */
+	nofs_flags = memalloc_nofs_save();
+
+	in_folios = btrfs_acomp_get_folios(acomp_ws);
+	if (!in_folios) {
+		btrfs_err(fs_info, "no input folios in workspace");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	cur_start = start;
+	while (cur_start < orig_end && nr_src_folios < BTRFS_ACOMP_MAX_SGL_ENTRIES) {
+		ret = btrfs_compress_filemap_get_folio(mapping, cur_start, &in_folio);
+		if (ret) {
+			btrfs_err(fs_info, "error %d getting folio at %llu", ret, cur_start);
+			goto out;
+		}
+
+		cur_len = btrfs_calc_input_length(in_folio, orig_end, cur_start);
+		cur_start += cur_len;
+
+		in_folios[nr_src_folios] = in_folio;
+		nr_src_folios++;
+	}
+
+	/* Ensure the caller's output folio array is large enough */
+	if (nr_src_folios > *out_folios) {
+		btrfs_err(fs_info, "not enough output folios: have %lu, need %u",
+			  *out_folios, nr_src_folios);
+		ret = -E2BIG;
+		goto out;
+	}
+
+	do {
+		out_folio = btrfs_alloc_compr_folio(fs_info);
+		if (!out_folio) {
+			btrfs_err(fs_info, "failed to allocate output folio %u",
+				  nr_dst_folios);
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		folios[nr_dst_folios] = out_folio;
+		nr_dst_folios++;
+		dst_size += folio_size(out_folio);
+	} while (dst_size < len && nr_dst_folios < BTRFS_ACOMP_MAX_SGL_ENTRIES);
+
+	in_sgl = btrfs_acomp_get_input_sgl(acomp_ws);
+	if (!in_sgl) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Calculate the offset within the first input folio */
+	first_folio_offset = offset_in_folio(in_folios[0], start);
+
+	ret = folios_to_scatterlist(in_folios, nr_src_folios, len, in_sgl, first_folio_offset);
+	if (ret) {
+		btrfs_err(fs_info, "failed to build input scatterlist");
+		goto out;
+	}
+
+	out_sgl = btrfs_acomp_get_output_sgl(acomp_ws);
+	if (!out_sgl) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = folios_to_scatterlist(folios, nr_dst_folios, dst_size, out_sgl, 0);
+	if (ret) {
+		btrfs_err(fs_info, "failed to build output scatterlist");
+		goto out;
+	}
+
+	crypto_init_wait(&wait);
+
+	/* Get pre-allocated tfm and request from workspace */
+	tfm = btrfs_acomp_get_tfm(acomp_ws);
+	req = btrfs_acomp_get_request(acomp_ws);
+	if (!tfm || !req) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	raw_attr = btrfs_acomp_get_attr_buffer(acomp_ws);
+	raw_attr_len = BTRFS_ACOMP_ATTR_BUF_SIZE;
+	ret = build_acomp_attr_buffer(raw_attr, &raw_attr_len, level);
+	if (ret) {
+		btrfs_err(fs_info, "failed to build acomp attr buffer: %d", ret);
+		goto out;
+	}
+
+	ret = crypto_acomp_setparam(tfm, raw_attr, raw_attr_len);
+	if (ret) {
+		btrfs_err(fs_info, "failed to set acomp params: %d", ret);
+		goto out;
+	}
+
+	acomp_request_set_params(req, in_sgl, out_sgl, len, dst_size);
+	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, crypto_req_done, &wait);
+
+	ret = crypto_wait_req(crypto_acomp_compress(req), &wait);
+	if (ret)
+		goto out;
+
+	*total_in = len;
+	*total_out = req->dlen;
+
+	/* Calculate number of folios used based on total_out */
+	bytes_left = *total_out;
+	for (i = 0, nr_folios = 0; i < nr_dst_folios && bytes_left > 0; i++) {
+		bytes_left -= min_t(size_t, bytes_left, folio_size(folios[i]));
+		nr_folios++;
+	}
+
+out:
+	/* Free unused output folios (all of them on error, since nr_folios == 0) */
+	for (i = nr_folios; i < nr_dst_folios; i++) {
+		if (folios[i]) {
+			btrfs_free_compr_folio(folios[i]);
+			folios[i] = NULL;
+		}
+	}
+
+	/* Free input folios */
+	for (i = 0; i < nr_src_folios; i++)
+		if (in_folios[i]) {
+			folio_put(in_folios[i]);
+			in_folios[i] = NULL;
+		}
+
+	*out_folios = nr_folios;
+
+	memalloc_nofs_restore(nofs_flags);
+
+	/* Release reference and wake up any waiters */
+	if (atomic_dec_and_test(&fs_info->compr_resource_refcnt))
+		wake_up(&fs_info->compr_wait_queue);
+
+	return ret;
+}
+
+int acomp_decomp_bio(struct btrfs_acomp_workspace *acomp_ws,
+		     struct btrfs_fs_info *fs_info,
+		     struct folio **in_folios,
+		     struct compressed_bio *cb, size_t srclen,
+		     unsigned long total_folios_in)
+{
+	const u32 min_folio_size = btrfs_min_folio_size(fs_info);
+	unsigned int nr_dst_folios = BTRFS_MAX_COMPRESSED_PAGES;
+	struct scatterlist *out_sgl = NULL;
+	struct scatterlist *in_sgl = NULL;
+	struct folio **out_folios = NULL;
+	struct crypto_acomp *tfm = NULL;
+	struct acomp_req *req = NULL;
+	struct crypto_wait wait;
+	unsigned int nofs_flags;
+	unsigned int dst_size;
+	char *data_out = NULL;
+	int bytes_left = 0;
+	unsigned int i;
+	int ret, ret2;
+
+	if (!acomp_ws)
+		return -EOPNOTSUPP;
+
+	/* Check if offload is enabled and acquire reference */
+	if (!atomic_read(&fs_info->compress_offload_enabled))
+		return -EOPNOTSUPP;
+
+	if (!atomic_inc_not_zero(&fs_info->compr_resource_refcnt))
+		return -EOPNOTSUPP;
+
+	/* Protect against GFP_KERNEL allocations in crypto subsystem */
+	nofs_flags = memalloc_nofs_save();
+
+	in_sgl = btrfs_acomp_get_input_sgl(acomp_ws);
+	if (!in_sgl) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	out_sgl = btrfs_acomp_get_output_sgl(acomp_ws);
+	if (!out_sgl) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	out_folios = btrfs_acomp_get_folios(acomp_ws);
+	if (!out_folios) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = folios_to_scatterlist(in_folios, total_folios_in, srclen, in_sgl, 0);
+	if (ret)
+		goto out;
+
+	for (i = 0; i < nr_dst_folios; i++) {
+		out_folios[i] = btrfs_alloc_compr_folio(fs_info);
+		if (!out_folios[i]) {
+			ret = -ENOMEM;
+			goto out;
+		}
+	}
+
+	dst_size = nr_dst_folios * min_folio_size;
+
+	ret = folios_to_scatterlist(out_folios, nr_dst_folios, dst_size, out_sgl, 0);
+	if (ret)
+		goto out;
+
+	crypto_init_wait(&wait);
+
+	/* Get pre-allocated tfm and request from workspace */
+	tfm = btrfs_acomp_get_tfm(acomp_ws);
+	req = btrfs_acomp_get_request(acomp_ws);
+	if (!tfm || !req) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	acomp_request_set_params(req, in_sgl, out_sgl, srclen, dst_size);
+	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   crypto_req_done, &wait);
+
+	ret = crypto_wait_req(crypto_acomp_decompress(req), &wait);
+	if (ret)
+		goto out;
+
+	bytes_left = req->dlen;
+	for (i = 0; i < nr_dst_folios && bytes_left > 0; i++) {
+		size_t folio_bytes = min_t(size_t, bytes_left, min_folio_size);
+		unsigned long buf_start = req->dlen - bytes_left;
+
+		data_out = kmap_local_folio(out_folios[i], 0);
+
+		ret2 = btrfs_decompress_buf2page(data_out, folio_bytes, cb, buf_start);
+		kunmap_local(data_out);
+
+		if (ret2 == 0) {
+			ret = 0;
+			goto out;
+		}
+
+		bytes_left -= folio_bytes;
+	}
+
+out:
+	if (out_folios) {
+		for (i = 0; i < nr_dst_folios; i++) {
+			if (out_folios[i]) {
+				btrfs_free_compr_folio(out_folios[i]);
+				out_folios[i] = NULL;
+			}
+		}
+	}
+
+	memalloc_nofs_restore(nofs_flags);
+
+	/* Release reference and wake up any waiters */
+	if (atomic_dec_and_test(&fs_info->compr_resource_refcnt))
+		wake_up(&fs_info->compr_wait_queue);
+
+	return ret;
+}
+
+static const char *zlib_acomp_alg_name = "qat_zlib_deflate";
+static const char *zstd_acomp_alg_name = "qat_zstd";
+
+bool acomp_has_zlib(void)
+{
+	return crypto_has_acomp(zlib_acomp_alg_name, 0, 0);
+}
+
+bool acomp_has_zstd(void)
+{
+	return crypto_has_acomp(zstd_acomp_alg_name, 0, 0);
+}
+
+static struct btrfs_acomp_workspace *acomp_workspace_alloc(struct btrfs_fs_info *fs_info,
+							   const char *alg_name)
+{
+	struct btrfs_acomp_workspace *acomp_ws;
+
+	if (!alg_name)
+		return NULL;
+
+	if (!crypto_has_acomp(alg_name, 0, 0))
+		return NULL;
+
+	/* Only allocate workspace if offload is enabled */
+	if (!fs_info || !atomic_read(&fs_info->compress_offload_enabled))
+		return NULL;
+
+	acomp_ws = kzalloc(sizeof(*acomp_ws), GFP_KERNEL);
+	if (!acomp_ws)
+		return NULL;
+
+	sg_init_table(acomp_ws->in_sgl, BTRFS_ACOMP_MAX_SGL_ENTRIES);
+	sg_init_table(acomp_ws->out_sgl, BTRFS_ACOMP_MAX_SGL_ENTRIES);
+
+	acomp_ws->alg_name = alg_name;
+	acomp_ws->attr_buf_len = BTRFS_ACOMP_ATTR_BUF_SIZE;
+
+	/* Allocate tfm and req */
+	acomp_ws->tfm = crypto_alloc_acomp(alg_name, 0, 0);
+	if (IS_ERR(acomp_ws->tfm)) {
+		btrfs_err(fs_info, "failed to allocate acomp tfm for %s: %ld",
+			  alg_name, PTR_ERR(acomp_ws->tfm));
+		kfree(acomp_ws);
+		return NULL;
+	}
+
+	acomp_ws->req = acomp_request_alloc(acomp_ws->tfm);
+	if (!acomp_ws->req) {
+		btrfs_err(fs_info, "failed to allocate acomp request for %s", alg_name);
+		crypto_free_acomp(acomp_ws->tfm);
+		kfree(acomp_ws);
+		return NULL;
+	}
+
+	return acomp_ws;
+}
+
+struct btrfs_acomp_workspace *acomp_zlib_workspace_alloc(struct btrfs_fs_info *fs_info)
+{
+	return acomp_workspace_alloc(fs_info, zlib_acomp_alg_name);
+}
+
+struct btrfs_acomp_workspace *acomp_zstd_workspace_alloc(struct btrfs_fs_info *fs_info)
+{
+	return acomp_workspace_alloc(fs_info, zstd_acomp_alg_name);
+}
+
+void acomp_workspace_free(struct btrfs_acomp_workspace *acomp_ws)
+{
+	if (acomp_ws) {
+		if (acomp_ws->req)
+			acomp_request_free(acomp_ws->req);
+		if (acomp_ws->tfm)
+			crypto_free_acomp(acomp_ws->tfm);
+	}
+	kfree(acomp_ws);
+}
+#endif /* CONFIG_BTRFS_EXPERIMENTAL */
diff --git a/fs/btrfs/acomp_workspace.h b/fs/btrfs/acomp_workspace.h
new file mode 100644
index 000000000000..e886ab5657b2
--- /dev/null
+++ b/fs/btrfs/acomp_workspace.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef BTRFS_ACOMP_WORKSPACE_H
+#define BTRFS_ACOMP_WORKSPACE_H
+
+#include <crypto/acompress.h>
+#include <linux/scatterlist.h>
+
+#include "compression.h"
+
+/*
+ * Maximum number of scatterlist entries needed for btrfs compression.
+ * Based on BTRFS_MAX_COMPRESSED_PAGES (32 pages max).
+ */
+#define BTRFS_ACOMP_MAX_SGL_ENTRIES	BTRFS_MAX_COMPRESSED_PAGES
+
+/* Maximum size needed for compression attribute buffer */
+#define BTRFS_ACOMP_ATTR_BUF_SIZE	64
+
+struct btrfs_acomp_workspace {
+	struct scatterlist in_sgl[BTRFS_ACOMP_MAX_SGL_ENTRIES];
+	struct scatterlist out_sgl[BTRFS_ACOMP_MAX_SGL_ENTRIES];
+	struct folio *folios[BTRFS_ACOMP_MAX_SGL_ENTRIES];
+	u8 attr_buffer[BTRFS_ACOMP_ATTR_BUF_SIZE];
+	unsigned int attr_buf_len;
+	const char *alg_name;
+	struct crypto_acomp *tfm;
+	struct acomp_req *req;
+};
+
+static inline struct scatterlist *btrfs_acomp_get_input_sgl(struct btrfs_acomp_workspace *acomp_ws)
+{
+	return acomp_ws ? acomp_ws->in_sgl : NULL;
+}
+
+static inline struct scatterlist *btrfs_acomp_get_output_sgl(struct btrfs_acomp_workspace *acomp_ws)
+{
+	return acomp_ws ? acomp_ws->out_sgl : NULL;
+}
+
+static inline struct folio **btrfs_acomp_get_folios(struct btrfs_acomp_workspace *acomp_ws)
+{
+	return acomp_ws ? acomp_ws->folios : NULL;
+}
+
+static inline u8 *btrfs_acomp_get_attr_buffer(struct btrfs_acomp_workspace *acomp_ws)
+{
+	return acomp_ws ? acomp_ws->attr_buffer : NULL;
+}
+
+static inline struct crypto_acomp *btrfs_acomp_get_tfm(struct btrfs_acomp_workspace *acomp_ws)
+{
+	return acomp_ws ? acomp_ws->tfm : NULL;
+}
+
+static inline struct acomp_req *btrfs_acomp_get_request(struct btrfs_acomp_workspace *acomp_ws)
+{
+	return acomp_ws ? acomp_ws->req : NULL;
+}
+
+#endif /* BTRFS_ACOMP_WORKSPACE_H */
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index bacad18357b3..7d6083e5958e 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1695,3 +1695,69 @@ int btrfs_compress_str2level(unsigned int type, const char *str, int *level_ret)
 	*level_ret = btrfs_compress_set_level(type, level);
 	return 0;
 }
+
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+int btrfs_set_compress_offload(struct btrfs_fs_info *fs_info, bool enable)
+{
+	int compress_type;
+	int ret = 0;
+
+	if (!fs_info)
+		return -EINVAL;
+
+	compress_type = fs_info->compress_type;
+	switch (compress_type) {
+	case BTRFS_COMPRESS_ZLIB:
+		if (!acomp_has_zlib()) {
+			btrfs_warn(fs_info, "Hardware does not support zlib compression offload");
+			return -EOPNOTSUPP;
+		}
+		break;
+	case BTRFS_COMPRESS_ZSTD:
+		if (!acomp_has_zstd()) {
+			btrfs_warn(fs_info, "Hardware does not support zstd compression offload");
+			return -EOPNOTSUPP;
+		}
+		break;
+	default:
+		btrfs_warn(fs_info, "Compression offload only supported for zlib and zstd (current: %d)",
+			   compress_type);
+		return -EOPNOTSUPP;
+	}
+
+	spin_lock(&fs_info->compress_offload_lock);
+
+	if (atomic_read(&fs_info->compress_offload_enabled) == (enable ? 1 : 0)) {
+		spin_unlock(&fs_info->compress_offload_lock);
+		btrfs_info(fs_info, "Compression hardware offload already %s",
+			   enable ? "enabled" : "disabled");
+		return 0;
+	}
+
+	atomic_set(&fs_info->compress_offload_enabled, enable ? 1 : 0);
+	atomic_dec(&fs_info->compr_resource_refcnt);
+	spin_unlock(&fs_info->compress_offload_lock);
+
+	wait_event(fs_info->compr_wait_queue,
+		   atomic_read(&fs_info->compr_resource_refcnt) == 0);
+
+	switch (compress_type) {
+	case BTRFS_COMPRESS_ZLIB:
+		ret = zlib_process_acomp_workspaces(fs_info, enable);
+		break;
+	case BTRFS_COMPRESS_ZSTD:
+		ret = zstd_process_acomp_workspaces(fs_info, enable);
+		break;
+	}
+
+	atomic_set(&fs_info->compr_resource_refcnt, 1);
+
+	if (ret == 0) {
+		btrfs_info(fs_info, "Compression hardware offload %s for %s",
+			   enable ? "enabled" : "disabled",
+			   compress_type == BTRFS_COMPRESS_ZLIB ? "zlib" : "zstd");
+	}
+
+	return ret;
+}
+#endif /* CONFIG_BTRFS_EXPERIMENTAL */
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index eba188a9e3bb..fa3e1e0c7d03 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -22,6 +22,7 @@ struct inode;
 struct btrfs_inode;
 struct btrfs_ordered_extent;
 struct btrfs_bio;
+struct btrfs_acomp_workspace;
 
 /*
  * We want to make sure that amount of RAM required to uncompress an extent is
@@ -162,6 +163,9 @@ int zlib_decompress(struct list_head *ws, const u8 *data_in,
 struct list_head *zlib_alloc_workspace(struct btrfs_fs_info *fs_info, unsigned int level);
 void zlib_free_workspace(struct list_head *ws);
 struct list_head *zlib_get_workspace(struct btrfs_fs_info *fs_info, unsigned int level);
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+int zlib_process_acomp_workspaces(struct btrfs_fs_info *fs_info, bool enable);
+#endif
 
 int lzo_compress_folios(struct list_head *ws, struct btrfs_inode *inode,
 			u64 start, struct folio **folios, unsigned long *out_folios,
@@ -186,5 +190,31 @@ struct list_head *zstd_alloc_workspace(struct btrfs_fs_info *fs_info, int level)
 void zstd_free_workspace(struct list_head *ws);
 struct list_head *zstd_get_workspace(struct btrfs_fs_info *fs_info, int level);
 void zstd_put_workspace(struct btrfs_fs_info *fs_info, struct list_head *ws);
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+int zstd_process_acomp_workspaces(struct btrfs_fs_info *fs_info, bool enable);
+#endif
+
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+bool acomp_has_zlib(void);
+bool acomp_has_zstd(void);
+struct btrfs_acomp_workspace;
+
+struct btrfs_acomp_workspace *acomp_zlib_workspace_alloc(struct btrfs_fs_info *fs_info);
+struct btrfs_acomp_workspace *acomp_zstd_workspace_alloc(struct btrfs_fs_info *fs_info);
+void acomp_workspace_free(struct btrfs_acomp_workspace *acomp_ws);
+int acomp_comp_folios(struct btrfs_acomp_workspace *acomp_ws,
+		      struct btrfs_fs_info *fs_info,
+		      struct address_space *mapping, u64 start, unsigned long len,
+		      struct folio **folios, unsigned long *out_folios,
+		      unsigned long *total_in, unsigned long *total_out, int level);
+int acomp_decomp_bio(struct btrfs_acomp_workspace *acomp_ws,
+		     struct btrfs_fs_info *fs_info, struct folio **in_folios,
+		     struct compressed_bio *cb, size_t srclen,
+		     unsigned long total_folios_in);
+int btrfs_set_compress_offload(struct btrfs_fs_info *fs_info, bool enable);
+#else
+static inline bool acomp_has_zlib(void) { return false; }
+static inline bool acomp_has_zstd(void) { return false; }
+#endif
 
 #endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0aa7e5d1b05f..2f63f8221c95 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2778,6 +2778,12 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	spin_lock_init(&fs_info->relocation_bg_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	rwlock_init(&fs_info->global_root_lock);
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	spin_lock_init(&fs_info->compress_offload_lock);
+	atomic_set(&fs_info->compress_offload_enabled, 0);
+	atomic_set(&fs_info->compr_resource_refcnt, 1);
+	init_waitqueue_head(&fs_info->compr_wait_queue);
+#endif
 	mutex_init(&fs_info->unused_bg_unpin_mutex);
 	mutex_init(&fs_info->reclaim_bgs_lock);
 	mutex_init(&fs_info->reloc_mutex);
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 814bbc9417d2..c6ff901c557c 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -525,6 +525,14 @@ struct btrfs_fs_info {
 
 	int compress_type;
 	int compress_level;
+
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	spinlock_t compress_offload_lock; /* serializes enable/disable transitions */
+	atomic_t compress_offload_enabled;
+	atomic_t compr_resource_refcnt;
+	wait_queue_head_t compr_wait_queue;
+#endif
+
 	u32 commit_interval;
 	/*
 	 * It is a suggestive number, the read side is safe even it gets a
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 81f52c1f55ce..73420373b62c 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1578,6 +1578,34 @@ static ssize_t btrfs_offload_csum_store(struct kobject *kobj,
 	return len;
 }
 BTRFS_ATTR_RW(, offload_csum, btrfs_offload_csum_show, btrfs_offload_csum_store);
+
+static ssize_t offload_compress_show(struct kobject *kobj,
+				     struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+
+	return sysfs_emit(buf, "%d\n", atomic_read(&fs_info->compress_offload_enabled));
+}
+
+static ssize_t offload_compress_store(struct kobject *kobj,
+				      struct kobj_attribute *a, const char *buf,
+				      size_t len)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+	bool val;
+	int ret;
+
+	ret = kstrtobool(buf, &val);
+	if (ret)
+		return ret;
+
+	ret = btrfs_set_compress_offload(fs_info, val);
+	if (ret)
+		return ret;
+
+	return len;
+}
+BTRFS_ATTR_RW(, offload_compress, offload_compress_show, offload_compress_store);
 #endif
 
 /*
@@ -1601,6 +1629,7 @@ static const struct attribute *btrfs_attrs[] = {
 	BTRFS_ATTR_PTR(, temp_fsid),
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
 	BTRFS_ATTR_PTR(, offload_csum),
+	BTRFS_ATTR_PTR(, offload_compress),
 #endif
 	NULL,
 };
diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index 6caba8be7c84..9adb1defaea3 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -18,6 +18,9 @@
 #include <linux/pagemap.h>
 #include <linux/bio.h>
 #include <linux/refcount.h>
+#include <linux/scatterlist.h>
+#include <crypto/acompress.h>
+#include "acomp_workspace.h"
 #include "btrfs_inode.h"
 #include "compression.h"
 #include "fs.h"
@@ -32,6 +35,7 @@ struct workspace {
 	unsigned int buf_size;
 	struct list_head list;
 	int level;
+	struct btrfs_acomp_workspace *acomp_ws;
 };
 
 struct list_head *zlib_get_workspace(struct btrfs_fs_info *fs_info, unsigned int level)
@@ -48,11 +52,50 @@ void zlib_free_workspace(struct list_head *ws)
 {
 	struct workspace *workspace = list_entry(ws, struct workspace, list);
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	if (workspace->acomp_ws)
+		acomp_workspace_free(workspace->acomp_ws);
+#endif
+
 	kvfree(workspace->strm.workspace);
 	kfree(workspace->buf);
 	kfree(workspace);
 }
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+int zlib_process_acomp_workspaces(struct btrfs_fs_info *fs_info, bool enable)
+{
+	struct workspace_manager *wsm = fs_info->compr_wsm[BTRFS_COMPRESS_ZLIB];
+	struct list_head *ws, *tmp;
+
+	if (!wsm)
+		return 0;
+
+	spin_lock(&wsm->ws_lock);
+
+	list_for_each_safe(ws, tmp, &wsm->idle_ws) {
+		struct workspace *workspace = list_entry(ws, struct workspace, list);
+
+		if (enable) {
+			if (!workspace->acomp_ws) {
+				workspace->acomp_ws = acomp_zlib_workspace_alloc(fs_info);
+				if (!workspace->acomp_ws)
+					btrfs_warn(fs_info, "Failed to allocate zlib acomp workspace");
+			}
+		} else {
+			if (workspace->acomp_ws) {
+				acomp_workspace_free(workspace->acomp_ws);
+				workspace->acomp_ws = NULL;
+			}
+		}
+	}
+
+	spin_unlock(&wsm->ws_lock);
+
+	return 0;
+}
+#endif
+
 /*
  * For s390 hardware acceleration, the buffer size should be at least
  * ZLIB_DFLTCC_BUF_SIZE to achieve the best performance.
@@ -97,6 +140,14 @@ struct list_head *zlib_alloc_workspace(struct btrfs_fs_info *fs_info, unsigned i
 	if (!workspace->strm.workspace || !workspace->buf)
 		goto fail;
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	/* Try to allocate acomp workspace (will be NULL if offload disabled) */
+	workspace->acomp_ws = acomp_zlib_workspace_alloc(fs_info);
+	/* It's OK if this returns NULL when offload is disabled */
+#else
+	workspace->acomp_ws = NULL;
+#endif
+
 	INIT_LIST_HEAD(&workspace->list);
 
 	return &workspace->list;
@@ -165,6 +216,20 @@ int zlib_compress_folios(struct list_head *ws, struct btrfs_inode *inode,
 	const u32 blocksize = fs_info->sectorsize;
 	const u64 orig_end = start + len;
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	if (workspace->acomp_ws && len >= 1024) {
+		ret = acomp_comp_folios(workspace->acomp_ws, fs_info, mapping, start,
+					len, folios, out_folios, total_in,
+					total_out, workspace->level);
+		/*
+		 * If hardware offload succeeded, or if there is an expansion,
+		 * return. Otherwise, compress in software.
+		 */
+		if (ret == 0 || ret == -E2BIG)
+			return ret;
+	}
+#endif
+
 	*out_folios = 0;
 	*total_out = 0;
 	*total_in = 0;
@@ -348,6 +413,22 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 	unsigned long buf_start;
 	struct folio **folios_in = cb->compressed_folios;
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	if (workspace->acomp_ws && srclen >= 1024) {
+		ret = acomp_decomp_bio(workspace->acomp_ws, fs_info, folios_in, cb,
+				       srclen, total_folios_in);
+		/* If hardware offload succeeded, return. */
+		if (ret == 0)
+			return 0;
+
+		/* Otherwise fall back to software decompression. */
+		if (ret)
+			btrfs_info(fs_info,
+				   "zlib hardware decompression offload failed, falling back to software ret=%d",
+				   ret);
+	}
+#endif
+
 	data_in = kmap_local_folio(folios_in[folio_in_index], 0);
 	workspace->strm.next_in = data_in;
 	workspace->strm.avail_in = min_t(size_t, srclen, min_folio_size);
diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
index c9cddcfa337b..530aa5b7efee 100644
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -22,6 +22,7 @@
 #include "btrfs_inode.h"
 #include "compression.h"
 #include "super.h"
+#include "acomp_workspace.h"
 
 #define ZSTD_BTRFS_MAX_WINDOWLOG 17
 #define ZSTD_BTRFS_MAX_INPUT (1U << ZSTD_BTRFS_MAX_WINDOWLOG)
@@ -54,6 +55,7 @@ struct workspace {
 	zstd_in_buffer in_buf;
 	zstd_out_buffer out_buf;
 	zstd_parameters params;
+	struct btrfs_acomp_workspace *acomp_ws;
 };
 
 /*
@@ -363,11 +365,53 @@ void zstd_free_workspace(struct list_head *ws)
 {
 	struct workspace *workspace = list_entry(ws, struct workspace, list);
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	if (workspace->acomp_ws)
+		acomp_workspace_free(workspace->acomp_ws);
+#endif
 	kvfree(workspace->mem);
 	kfree(workspace->buf);
 	kfree(workspace);
 }
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+int zstd_process_acomp_workspaces(struct btrfs_fs_info *fs_info, bool enable)
+{
+	struct zstd_workspace_manager *zwsm = fs_info->compr_wsm[BTRFS_COMPRESS_ZSTD];
+	int i;
+
+	if (!zwsm)
+		return 0;
+
+	spin_lock_bh(&zwsm->lock);
+
+	for (i = 0; i < ZSTD_BTRFS_MAX_LEVEL; i++) {
+		struct list_head *ws, *tmp;
+
+		list_for_each_safe(ws, tmp, &zwsm->idle_ws[i]) {
+			struct workspace *workspace = list_entry(ws, struct workspace, list);
+
+			if (enable) {
+				if (!workspace->acomp_ws) {
+					workspace->acomp_ws = acomp_zstd_workspace_alloc(fs_info);
+					if (!workspace->acomp_ws)
+						btrfs_warn(fs_info, "Failed to allocate zstd acomp workspace");
+				}
+			} else {
+				if (workspace->acomp_ws) {
+					acomp_workspace_free(workspace->acomp_ws);
+					workspace->acomp_ws = NULL;
+				}
+			}
+		}
+	}
+
+	spin_unlock_bh(&zwsm->lock);
+
+	return 0;
+}
+#endif
+
 struct list_head *zstd_alloc_workspace(struct btrfs_fs_info *fs_info, int level)
 {
 	const u32 blocksize = fs_info->sectorsize;
@@ -387,6 +431,12 @@ struct list_head *zstd_alloc_workspace(struct btrfs_fs_info *fs_info, int level)
 	if (!workspace->mem || !workspace->buf)
 		goto fail;
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	/* Try to allocate acomp workspace (will be NULL if offload disabled) */
+	workspace->acomp_ws = acomp_zstd_workspace_alloc(fs_info);
+	/* It's OK if this returns NULL when offload is disabled */
+#endif
+
 	INIT_LIST_HEAD(&workspace->list);
 	INIT_LIST_HEAD(&workspace->lru_list);
 
@@ -418,6 +468,20 @@ int zstd_compress_folios(struct list_head *ws, struct btrfs_inode *inode,
 	unsigned long max_out = nr_dest_folios * min_folio_size;
 	unsigned int cur_len;
 
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	if (workspace->acomp_ws && len >= 2048) {
+		ret = acomp_comp_folios(workspace->acomp_ws, fs_info, mapping, start,
+					len, folios, out_folios, total_in,
+					total_out, workspace->req_level);
+		/*
+		 * If hardware offload succeeded, or if there is an expansion,
+		 * return. Otherwise, compress in software.
+		 */
+		if (ret == 0 || ret == -E2BIG)
+			return ret;
+	}
+#endif
+
 	workspace->params = zstd_get_btrfs_parameters(workspace->req_level, len);
 	*out_folios = 0;
 	*total_out = 0;
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-11-28 19:05 ` [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload Giovanni Cabiddu
@ 2025-11-28 21:55   ` Qu Wenruo
  2025-11-28 22:40     ` Giovanni Cabiddu
  0 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-11-28 21:55 UTC (permalink / raw)
  To: Giovanni Cabiddu, clm, dsterba, terrelln, herbert
  Cc: linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky



On 2025/11/29 05:35, Giovanni Cabiddu wrote:
> Add support for hardware-accelerated compression using the acomp API in
> the crypto framework, enabling offload of zlib and zstd compression to
> hardware accelerators. Hardware offload reduces CPU load during
> compression, improving performance.
> 
> The implementation follows a generic design that works with any acomp
> implementation, though this enablement targets Intel QAT devices
> (similar to what is done in EROFS).
> 
> Input folios are organized into a scatter-gather list and submitted to
> the accelerator in a single asynchronous request. The calling thread
> sleeps while the hardware performs compression, freeing the CPU for
> other tasks.  Upon completion, the acomp callback wakes the thread to
> continue processing.
> 
> Offload is supported for:
>    - zlib: compression and decompression
>    - zstd: compression only
> 
> Offload is only attempted when the data size exceeds a minimum threshold,
> ensuring that small operations remain efficient by avoiding hardware setup
> overhead. All required buffers are pre-allocated in the workspace to
> eliminate allocations in the data path.
> 
> This feature maintains full compatibility with the existing BTRFS disk
> format. Files compressed by hardware can be decompressed by software
> implementations and vice versa.
> 
> The feature is wrapped in CONFIG_BTRFS_EXPERIMENTAL and can be enabled
> at runtime via the sysfs parameter /sys/fs/btrfs/<UUID>/offload_compress.

Not a compression/crypto expert, thus just commenting on the btrfs part.

sysfs is not a good long-term solution. Since it's already behind 
experimental flags, you can just enable it unconditionally (with proper 
checks of course).

[...]
> +int acomp_comp_folios(struct btrfs_acomp_workspace *acomp_ws,
> +		      struct btrfs_fs_info *fs_info,
> +		      struct address_space *mapping, u64 start, unsigned long len,
> +		      struct folio **folios, unsigned long *out_folios,
> +		      unsigned long *total_in, unsigned long *total_out, int level)
> +{
> +	struct scatterlist *out_sgl = NULL;
> +	struct scatterlist *in_sgl = NULL;
> +	const u64 orig_end = start + len;
> +	struct crypto_acomp *tfm = NULL;
> +	struct folio **in_folios = NULL;
> +	unsigned int first_folio_offset;
> +	unsigned int nr_dst_folios = 0;
> +	struct folio *out_folio = NULL;
> +	unsigned int nr_src_folios = 0;
> +	struct acomp_req *req = NULL;
> +	unsigned int nr_folios = 0;
> +	unsigned int dst_size = 0;
> +	unsigned int raw_attr_len;
> +	unsigned int bytes_left;
> +	unsigned int nofs_flags;
> +	struct crypto_wait wait;
> +	struct folio *in_folio;
> +	unsigned int cur_len;
> +	unsigned int i;
> +	u64 cur_start;
> +	u8 *raw_attr;
> +	int ret;
> +
> +	if (!acomp_ws)
> +		return -EOPNOTSUPP;
> +
> +	/* Check if offload is enabled and acquire reference */
> +	if (!atomic_read(&fs_info->compress_offload_enabled))
> +		return -EOPNOTSUPP;
> +
> +	if (!atomic_inc_not_zero(&fs_info->compr_resource_refcnt))
> +		return -EOPNOTSUPP;
> +
> +	/* Protect against GFP_KERNEL allocations in crypto subsystem */
> +	nofs_flags = memalloc_nofs_save();
> +
> +	in_folios = btrfs_acomp_get_folios(acomp_ws);
> +	if (!in_folios) {
> +		btrfs_err(fs_info, "No input folios in workspace");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	cur_start = start;
> +	while (cur_start < orig_end && nr_src_folios < BTRFS_ACOMP_MAX_SGL_ENTRIES) {

This function gets all input/output folios in a batch, but ubifs, which 
also uses acomp for compression, seems to compress only one page at a 
time (ubifs_compress_folio()).

I'm wondering what's preventing us from doing the existing 
folio-by-folio compression.
Is the batch folio acquiring just for performance or something else?

The folio-by-folio compression gives us more control over detecting 
incompressible data, which is now gone.
We will only know that the data is incompressible after all dst folios 
are allocated and compression has been tried.

[...]
> @@ -165,6 +216,20 @@ int zlib_compress_folios(struct list_head *ws, struct btrfs_inode *inode,
>   	const u32 blocksize = fs_info->sectorsize;
>   	const u64 orig_end = start + len;
>   
> +#ifdef CONFIG_BTRFS_EXPERIMENTAL
> +	if (workspace->acomp_ws && len >= 1024) {

The length threshold looks way smaller than I expected. I was expecting 
multi-page lengths like on s390, considering all the batch folio preparations.

If the threshold is really this low, what's preventing an acomp 
interface from providing the same multi-shot compression/decompression?


And please do not use a magic number; use a macro instead.

[...]
>   
> @@ -418,6 +468,20 @@ int zstd_compress_folios(struct list_head *ws, struct btrfs_inode *inode,
>   	unsigned long max_out = nr_dest_folios * min_folio_size;
>   	unsigned int cur_len;
>   
> +#ifdef CONFIG_BTRFS_EXPERIMENTAL
> +	if (workspace->acomp_ws && len >= 2048) {

And why does zstd have a different threshold compared to zlib?

Thanks,
Qu

> +		ret = acomp_comp_folios(workspace->acomp_ws, fs_info, mapping, start,
> +					len, folios, out_folios, total_in,
> +					total_out, workspace->req_level);
> +		/*
> +		 * If hardware offload succeeded, or if there is an expansion,
> +		 * return. Otherwise, compress in software.
> +		 */
> +		if (ret == 0 || ret == -E2BIG)
> +			return ret;
> +	}
> +#endif
> +
>   	workspace->params = zstd_get_btrfs_parameters(workspace->req_level, len);
>   	*out_folios = 0;
>   	*total_out = 0;



* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-11-28 21:55   ` Qu Wenruo
@ 2025-11-28 22:40     ` Giovanni Cabiddu
  2025-11-28 23:59       ` Qu Wenruo
  2025-11-29  1:08       ` David Sterba
  0 siblings, 2 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-11-28 22:40 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: clm, dsterba, terrelln, herbert, linux-btrfs, linux-crypto,
	qat-linux, cyan, brian.will, weigang.li, senozhatsky

Thanks for your feedback, Qu Wenruo.

On Sat, Nov 29, 2025 at 08:25:30AM +1030, Qu Wenruo wrote:
> On 2025/11/29 05:35, Giovanni Cabiddu wrote:
> > Add support for hardware-accelerated compression using the acomp API in
> > the crypto framework, enabling offload of zlib and zstd compression to
> > hardware accelerators. Hardware offload reduces CPU load during
> > compression, improving performance.
> > 
> > The implementation follows a generic design that works with any acomp
> > implementation, though this enablement targets Intel QAT devices
> > (similar to what is done in EROFS).
> > 
> > Input folios are organized into a scatter-gather list and submitted to
> > the accelerator in a single asynchronous request. The calling thread
> > sleeps while the hardware performs compression, freeing the CPU for
> > other tasks.  Upon completion, the acomp callback wakes the thread to
> > continue processing.
> > 
> > Offload is supported for:
> >    - zlib: compression and decompression
> >    - zstd: compression only
> > 
> > Offload is only attempted when the data size exceeds a minimum threshold,
> > ensuring that small operations remain efficient by avoiding hardware setup
> > overhead. All required buffers are pre-allocated in the workspace to
> > eliminate allocations in the data path.
> > 
> > This feature maintains full compatibility with the existing BTRFS disk
> > format. Files compressed by hardware can be decompressed by software
> > implementations and vice versa.
> > 
> > The feature is wrapped in CONFIG_BTRFS_EXPERIMENTAL and can be enabled
> > at runtime via the sysfs parameter /sys/fs/btrfs/<UUID>/offload_compress.
> 
> Not a compression/crypto expert, thus just commenting on the btrfs part.
> 
> sysfs is not a good long-term solution. Since it's already behind
> experimental flags, you can just enable it unconditionally (with proper
> checks of course).
The reason for introducing a sysfs attribute is to allow disabling the
feature to be able to unload the QAT driver or to assign a QAT device to
user space for example to QATlib or DPDK.

In the initial implementation, there was no sysfs switch because the
acomp tfm was allocated in the data path. With the current design,
where the tfm is allocated in the workspace, the driver remains
permanently in use.

Is there any other alternative to a sysfs attribute to dynamically
enable/disable this feature?

> 
> [...]
> > +int acomp_comp_folios(struct btrfs_acomp_workspace *acomp_ws,
> > +		      struct btrfs_fs_info *fs_info,
> > +		      struct address_space *mapping, u64 start, unsigned long len,
> > +		      struct folio **folios, unsigned long *out_folios,
> > +		      unsigned long *total_in, unsigned long *total_out, int level)
> > +{
> > +	struct scatterlist *out_sgl = NULL;
> > +	struct scatterlist *in_sgl = NULL;
> > +	const u64 orig_end = start + len;
> > +	struct crypto_acomp *tfm = NULL;
> > +	struct folio **in_folios = NULL;
> > +	unsigned int first_folio_offset;
> > +	unsigned int nr_dst_folios = 0;
> > +	struct folio *out_folio = NULL;
> > +	unsigned int nr_src_folios = 0;
> > +	struct acomp_req *req = NULL;
> > +	unsigned int nr_folios = 0;
> > +	unsigned int dst_size = 0;
> > +	unsigned int raw_attr_len;
> > +	unsigned int bytes_left;
> > +	unsigned int nofs_flags;
> > +	struct crypto_wait wait;
> > +	struct folio *in_folio;
> > +	unsigned int cur_len;
> > +	unsigned int i;
> > +	u64 cur_start;
> > +	u8 *raw_attr;
> > +	int ret;
> > +
> > +	if (!acomp_ws)
> > +		return -EOPNOTSUPP;
> > +
> > +	/* Check if offload is enabled and acquire reference */
> > +	if (!atomic_read(&fs_info->compress_offload_enabled))
> > +		return -EOPNOTSUPP;
> > +
> > +	if (!atomic_inc_not_zero(&fs_info->compr_resource_refcnt))
> > +		return -EOPNOTSUPP;
> > +
> > +	/* Protect against GFP_KERNEL allocations in crypto subsystem */
> > +	nofs_flags = memalloc_nofs_save();
> > +
> > +	in_folios = btrfs_acomp_get_folios(acomp_ws);
> > +	if (!in_folios) {
> > +		btrfs_err(fs_info, "No input folios in workspace");
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	cur_start = start;
> > +	while (cur_start < orig_end && nr_src_folios < BTRFS_ACOMP_MAX_SGL_ENTRIES) {
> 
> This function gets all input/output folios in a batch, but ubifs, which also
> uses acomp for compression, seems to compress only one page at a time
> (ubifs_compress_folio()).
> 
> I'm wondering what's preventing us from doing the existing folio-by-folio
> compression.
> Is the batch folio acquiring just for performance or something else?
There are a few reasons for using batch folio processing instead of
folio-by-folio compression:

* Performance: for a hardware accelerator like QAT, it is more efficient to
  process a larger chunk of data in a single request rather than issuing
  multiple small requests. (BTW, it would be even better if we could batch
  multiple requests and run them asynchronously!)

* API limitations: The current acomp interface is stateless. Supporting
  folio-by-folio compression with proper streaming would require changes
  to introduce a stateful API.

* Support for stateful compression in QAT: QAT would also need to implement
  stateful request handling to support incremental
  compression/decompression.

> 
> The folio-by-folio compression gives us more control over detecting
> incompressible data, which is now gone.
> We will only know that the data is incompressible after all dst folios
> are allocated and compression has been tried.
> 
> [...]
> > @@ -165,6 +216,20 @@ int zlib_compress_folios(struct list_head *ws, struct btrfs_inode *inode,
> >   	const u32 blocksize = fs_info->sectorsize;
> >   	const u64 orig_end = start + len;
> > +#ifdef CONFIG_BTRFS_EXPERIMENTAL
> > +	if (workspace->acomp_ws && len >= 1024) {
> 
> The length threshold looks way smaller than I expected. I was expecting
> multi-page lengths like on s390, considering all the batch folio preparations.
> 
> If the threshold is really this low, what's preventing an acomp interface
> from providing the same multi-shot compression/decompression?
Performance (i.e., latency) and support for stateful operation (see the points above).

> And please do not use a magic number; use a macro instead.
Sure.

> 
> [...]
> > @@ -418,6 +468,20 @@ int zstd_compress_folios(struct list_head *ws, struct btrfs_inode *inode,
> >   	unsigned long max_out = nr_dest_folios * min_folio_size;
> >   	unsigned int cur_len;
> > +#ifdef CONFIG_BTRFS_EXPERIMENTAL
> > +	if (workspace->acomp_ws && len >= 2048) {
> 
> And why does zstd have a different threshold compared to zlib?
These thresholds vary by algorithm, device, and compression level.
I tried to generalize them. For ZSTD on QAT GEN4, for example, QAT only
implements part of the algorithm. The remaining steps are handled in
software via the ZSTD library. This makes the offload threshold
different from zlib's, where the algorithm is executed completely in
HW.

That is why I asked in the cover letter whether it would make sense to
expose these thresholds through acomp. The concern is that I don’t want
to waste cycles converting folios to scatterlists only to discover that
software compression would have been more efficient.

Thanks,

-- 
Giovanni


* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-11-28 22:40     ` Giovanni Cabiddu
@ 2025-11-28 23:59       ` Qu Wenruo
  2025-11-29  0:23         ` Qu Wenruo
  2025-11-29  1:00         ` David Sterba
  2025-11-29  1:08       ` David Sterba
  1 sibling, 2 replies; 42+ messages in thread
From: Qu Wenruo @ 2025-11-28 23:59 UTC (permalink / raw)
  To: Giovanni Cabiddu, Qu Wenruo
  Cc: clm, dsterba, terrelln, herbert, linux-btrfs, linux-crypto,
	qat-linux, cyan, brian.will, weigang.li, senozhatsky



On 2025/11/29 09:10, Giovanni Cabiddu wrote:
> Thanks for your feedback, Qu Wenruo.
> 
> On Sat, Nov 29, 2025 at 08:25:30AM +1030, Qu Wenruo wrote:
[...]
>> Not a compression/crypto expert, thus just commenting on the btrfs part.
>>
>> sysfs is not a good long-term solution. Since it's already behind
>> experimental flags, you can just enable it unconditionally (with proper
>> checks of course).
> The reason for introducing a sysfs attribute is to allow disabling the
> feature to be able to unload the QAT driver or to assign a QAT device to
> user space for example to QATlib or DPDK.
> 
> In the initial implementation, there was no sysfs switch because the
> acomp tfm was allocated in the data path. With the current design,
> where the tfm is allocated in the workspace, the driver remains
> permanently in use.
> 
> Is there any other alternative to a sysfs attribute to dynamically
> enable/disable this feature?

All needed compression algorithm modules are loaded at btrfs module 
load time (not mount time), thus I was expecting the driver to be there 
until the btrfs module is removed from the kernel.

This is a completely new use case. I have no good idea on this at all. 
I never expected an accelerated algorithm would even get removed halfway.

[...]
>>
>> This function gets all input/output folios in a batch, but ubifs, which also
>> uses acomp for compression, seems to compress only one page at a time
>> (ubifs_compress_folio()).
>>
>> I'm wondering what's preventing us from doing the existing folio-by-folio
>> compression.
>> Is the batch folio acquiring just for performance or something else?
> There are a few reasons for using batch folio processing instead of
> folio-by-folio compression:
> 
> * Performance: for a hardware accelerator like QAT, it is more efficient to
>    process a larger chunk of data in a single request rather than issuing
>    multiple small requests. (BTW, it would be even better if we could batch
>    multiple requests and run them asynchronously!)
> 
> * API limitations: The current acomp interface is stateless. Supporting
>    folio-by-folio compression with proper streaming would require changes
>    to introduce a stateful API.

BTW, is there any extra benefit of using the acomp interface (independent of 
QAT acceleration)?

If so, we may consider experimenting with it first and migrating btrfs 
to that interface step by step, making extra accelerated algorithms 
easier to implement.

[...]
> 
>>
>> [...]
>>> @@ -418,6 +468,20 @@ int zstd_compress_folios(struct list_head *ws, struct btrfs_inode *inode,
>>>    	unsigned long max_out = nr_dest_folios * min_folio_size;
>>>    	unsigned int cur_len;
>>> +#ifdef CONFIG_BTRFS_EXPERIMENTAL
>>> +	if (workspace->acomp_ws && len >= 2048) {
>>
>> And why does zstd have a different threshold compared to zlib?
> These thresholds vary by algorithm, device, and compression level.
> I tried to generalize them. For ZSTD on QAT GEN4, for example, QAT only
> implements part of the algorithm. The remaining steps are handled in
> software via the ZSTD library. This makes the offload threshold
> different compared to zlib where the algorithm is completely executed in
> HW.
> 
> That is why I asked in the cover letter whether it would make sense to
> expose these thresholds through acomp. The concern is that I don’t want
> to waste cycles converting folios to scatterlists only to discover that
> software compression would have been more efficient.

At least for btrfs' use case, the threshold is so small that it only 
makes a difference for inlined extents (the inline threshold is 
configurable at mount time; by default it's 2K, with a maximum of the fs 
block size, normally 4K).

Considering how small the inline extent is already, you may want to 
focus only on multi-block compression, and use a single threshold (fs 
block size) inside btrfs.

Thanks,
Qu

> 
> Thanks,
> 



* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-11-28 23:59       ` Qu Wenruo
@ 2025-11-29  0:23         ` Qu Wenruo
  2025-12-01 14:32           ` Giovanni Cabiddu
  2025-11-29  1:00         ` David Sterba
  1 sibling, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-11-29  0:23 UTC (permalink / raw)
  To: Giovanni Cabiddu, Qu Wenruo
  Cc: clm, dsterba, terrelln, herbert, linux-btrfs, linux-crypto,
	qat-linux, cyan, brian.will, weigang.li, senozhatsky



On 2025/11/29 10:29, Qu Wenruo wrote:
> 
> 
> On 2025/11/29 09:10, Giovanni Cabiddu wrote:
>> Thanks for your feedback, Qu Wenruo.
>>
>> On Sat, Nov 29, 2025 at 08:25:30AM +1030, Qu Wenruo wrote:
> [...]
>>> Not a compression/crypto expert, thus just commenting on the btrfs part.
>>>
>>> sysfs is not a good long-term solution. Since it's already behind
>>> experimental flags, you can just enable it unconditionally (with proper
>>> checks of course).
>> The reason for introducing a sysfs attribute is to allow disabling the
>> feature to be able to unload the QAT driver or to assign a QAT device to
>> user space for example to QATlib or DPDK.
>>
>> In the initial implementation, there was no sysfs switch because the
>> acomp tfm was allocated in the data path. With the current design,
>> where the tfm is allocated in the workspace, the driver remains
>> permanently in use.
>>
>> Is there any other alternative to a sysfs attribute to dynamically
>> enable/disable this feature?
> 
> All needed compression algorithm modules are loaded at btrfs module 
> load time (not mount time), thus I was expecting the driver to be there 
> until the btrfs module is removed from the kernel.
> 
> This is a completely new use case. I have no good idea on this at all. 
> I never expected an accelerated algorithm would even get removed halfway.

Personally speaking, I'd prefer the acomp API/internals to handle the 
selection of hardware-accelerated algorithms.

If every fs type that utilizes this new accelerated path needs an 
interface to disable QAT acceleration, it doesn't look sane that one 
has to toggle every involved fs type to disable it.

Thus hiding the acceleration details behind the common acomp API looks more sane.

Thanks,
Qu



* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-11-28 23:59       ` Qu Wenruo
  2025-11-29  0:23         ` Qu Wenruo
@ 2025-11-29  1:00         ` David Sterba
  1 sibling, 0 replies; 42+ messages in thread
From: David Sterba @ 2025-11-29  1:00 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Giovanni Cabiddu, Qu Wenruo, clm, dsterba, terrelln, herbert,
	linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky

On Sat, Nov 29, 2025 at 10:29:04AM +1030, Qu Wenruo wrote:
> 
> 
> On 2025/11/29 09:10, Giovanni Cabiddu wrote:
> > Thanks for your feedback, Qu Wenruo.
> > 
> > On Sat, Nov 29, 2025 at 08:25:30AM +1030, Qu Wenruo wrote:
> [...]
> >> Not a compression/crypto expert, thus just commenting on the btrfs part.
> >>
> >> sysfs is not a good long-term solution. Since it's already behind
> >> experimental flags, you can just enable it unconditionally (with proper
> >> checks of course).
> > The reason for introducing a sysfs attribute is to allow disabling the
> > feature to be able to unload the QAT driver or to assign a QAT device to
> > user space for example to QATlib or DPDK.
> > 
> > In the initial implementation, there was no sysfs switch because the
> > acomp tfm was allocated in the data path. With the current design,
> > where the tfm is allocated in the workspace, the driver remains
> > permanently in use.
> > 
> > Is there any other alternative to a sysfs attribute to dynamically
> > enable/disable this feature?
> 
> All needed compression algorithm modules are loaded at btrfs module 
> load time (not mount time), thus I was expecting the driver to be there 
> until the btrfs module is removed from the kernel.
> 
> This is a completely new use case. I have no good idea on this at all. 
> I never expected an accelerated algorithm would even get removed halfway.
> 
> [...]
> >>
> >> This function gets all input/output folios in a batch, but ubifs, which also
> >> uses acomp for compression, seems to compress only one page at a time
> >> (ubifs_compress_folio()).
> >>
> >> I'm wondering what's preventing us from doing the existing folio-by-folio
> >> compression.
> >> Is the batch folio acquiring just for performance or something else?
> > There are a few reasons for using batch folio processing instead of
> > folio-by-folio compression:
> > 
> > * Performance: for a hardware accelerator like QAT, it is more efficient to
> >    process a larger chunk of data in a single request rather than issuing
> >    multiple small requests. (BTW, it would be even better if we could batch
> >    multiple requests and run them asynchronously!)
> > 
> > * API limitations: The current acomp interface is stateless. Supporting
> >    folio-by-folio compression with proper streaming would require changes
> >    to introduce a stateful API.
> 
> BTW, is there any extra benefit of using the acomp interface (independent of 
> QAT acceleration)?

The promise of the acomp interface is that the CPU can be handed to
other processes while the async hw does the compression. Using async
requests has some problems with deadlocks and allocation. The async
crypto API has been deprecated or removed where it was used, IIRC in
fsverity.

> If so, we may consider experimenting with it first and migrating btrfs 
> to that interface step by step, making extra accelerated algorithms 
> easier to implement.

I had a prototype for acomp and switching to it via sysfs, but once I
saw the patches removing acomp I scrapped it. QAT is a bit different,
as it's extensible; the acomp hw engines I assume the interface was
made for were hw cards and accelerators that e.g. implemented zlib.
With QAT we can have LZO and ZSTD, so it's worth adding the support.


* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-11-28 22:40     ` Giovanni Cabiddu
  2025-11-28 23:59       ` Qu Wenruo
@ 2025-11-29  1:08       ` David Sterba
  1 sibling, 0 replies; 42+ messages in thread
From: David Sterba @ 2025-11-29  1:08 UTC (permalink / raw)
  To: Giovanni Cabiddu
  Cc: Qu Wenruo, clm, dsterba, terrelln, herbert, linux-btrfs,
	linux-crypto, qat-linux, cyan, brian.will, weigang.li,
	senozhatsky

On Fri, Nov 28, 2025 at 10:40:33PM +0000, Giovanni Cabiddu wrote:
> On Sat, Nov 29, 2025 at 08:25:30AM +1030, Qu Wenruo wrote:
> > On 2025/11/29 05:35, Giovanni Cabiddu wrote:
> > > Add support for hardware-accelerated compression using the acomp API in
> > > the crypto framework, enabling offload of zlib and zstd compression to
> > > hardware accelerators. Hardware offload reduces CPU load during
> > > compression, improving performance.
> > > 
> > > The implementation follows a generic design that works with any acomp
> > > implementation, though this enablement targets Intel QAT devices
> > > (similar to what is done in EROFS).
> > > 
> > > Input folios are organized into a scatter-gather list and submitted to
> > > the accelerator in a single asynchronous request. The calling thread
> > > sleeps while the hardware performs compression, freeing the CPU for
> > > other tasks.  Upon completion, the acomp callback wakes the thread to
> > > continue processing.
> > > 
> > > Offload is supported for:
> > >    - zlib: compression and decompression
> > >    - zstd: compression only
> > > 
> > > Offload is only attempted when the data size exceeds a minimum threshold,
> > > ensuring that small operations remain efficient by avoiding hardware setup
> > > overhead. All required buffers are pre-allocated in the workspace to
> > > eliminate allocations in the data path.
> > > 
> > > This feature maintains full compatibility with the existing BTRFS disk
> > > format. Files compressed by hardware can be decompressed by software
> > > implementations and vice versa.
> > > 
> > > The feature is wrapped in CONFIG_BTRFS_EXPERIMENTAL and can be enabled
> > > at runtime via the sysfs parameter /sys/fs/btrfs/<UUID>/offload_compress.
> > 
> > Not a compression/crypto expert, thus just commenting on the btrfs part.
> > 
> > sysfs is not a good long-term solution. Since it's already behind
> > experimental flags, you can just enable it unconditionally (with proper
> > checks of course).
> The reason for introducing a sysfs attribute is to allow disabling the
> feature to be able to unload the QAT driver or to assign a QAT device to
> user space for example to QATlib or DPDK.
> 
> In the initial implementation, there was no sysfs switch because the
> acomp tfm was allocated in the data path. With the current design,
> where the tfm is allocated in the workspace, the driver remains
> permanently in use.
> 
> Is there any other alternative to a sysfs attribute to dynamically
> enable/disable this feature?

At least for the duration of the experimental phase, sysfs is the
right interface for enabling/disabling the QAT support.

In the long term I'd still recommend sysfs; it is easily accessible
from scripts, unlike the alternative of an ioctl. I don't know of other
reasonable options for that, certainly not a mount option.


* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-11-29  0:23         ` Qu Wenruo
@ 2025-12-01 14:32           ` Giovanni Cabiddu
  2025-12-01 15:10             ` Giovanni Cabiddu
  0 siblings, 1 reply; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-12-01 14:32 UTC (permalink / raw)
  To: Qu Wenruo, dsterba, herbert
  Cc: Qu Wenruo, clm, terrelln, linux-btrfs, linux-crypto, qat-linux,
	cyan, brian.will, weigang.li, senozhatsky

On Sat, Nov 29, 2025 at 10:53:02AM +1030, Qu Wenruo wrote:
> 
> 
On 2025/11/29 10:29, Qu Wenruo wrote:
> > 
> > 
> > On 2025/11/29 09:10, Giovanni Cabiddu wrote:
> > > Thanks for your feedback, Qu Wenruo.
> > > 
> > > On Sat, Nov 29, 2025 at 08:25:30AM +1030, Qu Wenruo wrote:
> > [...]
> > > > Not a compression/crypto expert, thus just commenting on the btrfs part.
> > > > 
> > > > sysfs is not a good long-term solution. Since it's already behind
> > > > experimental flags, you can just enable it unconditionally (with proper
> > > > checks of course).
> > > The reason for introducing a sysfs attribute is to allow disabling the
> > > feature to be able to unload the QAT driver or to assign a QAT device to
> > > user space for example to QATlib or DPDK.
> > > 
> > > In the initial implementation, there was no sysfs switch because the
> > > acomp tfm was allocated in the data path. With the current design,
> > > where the tfm is allocated in the workspace, the driver remains
> > > permanently in use.
> > > 
> > > Is there any other alternative to a sysfs attribute to dynamically
> > > enable/disable this feature?
> > 
> > For all needed compression algorithm modules are loaded at btrfs module
> > load time (not mount time), thus I was expecting the driver being there
> > until the btrfs module is removed from kernel.
> > 
> > This is a completely new use case. Have no good idea on this at all.
> > Never expected an accelerated algorithm would even get removed halfway.
To clarify, the sysfs switch does not disable the algorithms
themselves; it only controls whether hardware acceleration of an
algorithm is used, where supported. If enabled, the filesystem can
offload operations to the accelerator; if disabled, it performs them
in software. The implementation also handles the case where
acceleration is disabled or enabled while the filesystem is in use.
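
The intended behavior of the switch can be sketched as a small
userspace model (Python used purely for illustration; the class and
names are hypothetical, not the actual btrfs code):

```python
class Workspace:
    """Toy model of a btrfs compression workspace with an optional accelerator."""

    def __init__(self, accel_supported):
        self.accel_supported = accel_supported
        self.accel_enabled = False  # the sysfs switch, off by default

    def compress(self, data):
        # Offload only when the switch is on AND hardware support exists;
        # otherwise fall back to the software implementation.
        if self.accel_enabled and self.accel_supported:
            return ("hw", data)
        return ("sw", data)

ws = Workspace(accel_supported=True)
assert ws.compress(b"x")[0] == "sw"  # disabled by default
ws.accel_enabled = True              # e.g. writing 1 to the sysfs attribute
assert ws.compress(b"x")[0] == "hw"
ws.accel_enabled = False             # toggled while the fs is in use
assert ws.compress(b"x")[0] == "sw"
```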

BTW, currently, the feature is disabled by default. If that is
not preferable, we can enable it by default.

> Personally speaking, I'd prefer the acomp API/internals to handle those
> hardware acceleration algorithms selection.
> 
> If every fs type utilizes this new accelerated path needs an interface to
> disable QAT acceleration, it doesn't look sane that one has to toggle every
> involved fs type to disable QAT acceleration.
> 
> Thus hiding the accelerated details behind common acomp API looks more sane.
Even if we hide these details behind the acomp API, we would still face
a similar issue with the current acomp algorithms. If we need to disable
compression acceleration, for example, to assign a QAT device to user
space, we would have to unmount the filesystem.

What's needed is an `acomp_alg` implementation that is independent of the
QAT driver (or any specific accelerator) and can transparently fall back
to software. We already have a software fallback in the QAT driver, but
as explained, that does not prevent unloading the driver or re-purposing
the device. @David and @Herbert, any thoughts?

Thanks,

-- 
Giovanni


* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-12-01 14:32           ` Giovanni Cabiddu
@ 2025-12-01 15:10             ` Giovanni Cabiddu
  2025-12-01 20:57               ` Qu Wenruo
  0 siblings, 1 reply; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-12-01 15:10 UTC (permalink / raw)
  To: Qu Wenruo, dsterba, herbert
  Cc: Qu Wenruo, clm, terrelln, linux-btrfs, linux-crypto, qat-linux,
	cyan, brian.will, weigang.li, senozhatsky

On Mon, Dec 01, 2025 at 02:32:35PM +0000, Giovanni Cabiddu wrote:
> On Sat, Nov 29, 2025 at 10:53:02AM +1030, Qu Wenruo wrote:
> > 
> > 
> > 在 2025/11/29 10:29, Qu Wenruo 写道:
> > > 
> > > 
> > > 在 2025/11/29 09:10, Giovanni Cabiddu 写道:
> > > > Thanks for your feedback, Qu Wenruo.
> > > > 
> > > > On Sat, Nov 29, 2025 at 08:25:30AM +1030, Qu Wenruo wrote:
> > > [...]
> > > > > Not an compression/crypto expert, thus just comment on the btrfs part.
> > > > > 
> > > > > sysfs is not a good long-term solution. Since it's already behind
> > > > > experiemental flags, you can just enable it unconditionally (with proper
> > > > > checks of-course).
> > > > The reason for introducing a sysfs attribute is to allow disabling the
> > > > feature to be able to unload the QAT driver or to assign a QAT device to
> > > > user space for example to QATlib or DPDK.
> > > > 
> > > > In the initial implementation, there was no sysfs switch because the
> > > > acomp tfm was allocated in the data path. With the current design,
> > > > where the tfm is allocated in the workspace, the driver remains
> > > > permanently in use.
> > > > 
> > > > Is there any other alternative to a sysfs attribute to dynamically
> > > > enable/disable this feature?
> > > 
> > > For all needed compression algorithm modules are loaded at btrfs module
> > > load time (not mount time), thus I was expecting the driver being there
> > > until the btrfs module is removed from kernel.
> > > 
> > > This is a completely new use case. Have no good idea on this at all.
> > > Never expected an accelerated algorithm would even get removed halfway.
> To clarify, the sysfs switch does not disable the algorithms themselves,
> it only controls whether acceleration of that algorithm is used, if
> supported.  If enabled, the filesystem can offload operations to the
> accelerator. If disabled, it performs them in software. The
> implementation also handles the case where acceleration is disabled or
> enabled while the filesystem is in use.
> 
> BTW, currently, the feature is disabled by default. If that is
> not preferable, we can enable it by default.
> 
> > Personally speaking, I'd prefer the acomp API/internals to handle those
> > hardware acceleration algorithms selection.
> > 
> > If every fs type utilizes this new accelerated path needs an interface to
> > disable QAT acceleration, it doesn't look sane that one has to toggle every
> > involved fs type to disable QAT acceleration.
> > 
> > Thus hiding the accelerated details behind common acomp API looks more sane.
> Even if we hide these details behind the acomp API, we would still face
> a similar issue with the current acomp algorithms. If we need to disable
> compression acceleration, for example, to assign a QAT device to user
> space, we would have to unmount the filesystem.
> 
> What's needed is an `acomp_alg` implementation that is independent of the
> QAT driver (or any specific accelerator) and can transparently fall back
> to software. We already have a software fallback in the QAT driver, but
> as explained, that does not prevent unloading the driver or re-purposing
> the device. @David and @Herbert, any thoughts?
Perhaps I should clarify the use case to remove ambiguity.

I added the `enable/disable` switch to allow disabling acceleration on
the QAT device so it can be reassigned to user space.  In the current
design, the acomp tfm is allocated in the workspace and persists for the
lifetime of the filesystem (unlike the previous preliminary version of
this series where the acomp tfm was allocated in the datapath).
This change was introduced after a review comment.

Here is what happens:
1. The acomp tfm is allocated as part of the compression workspace.
2. The selected compression implementation is chosen by the acomp
   framework.
3. The driver (in this case, the QAT driver) increments its reference
   count.

At this point, if we try to remove the QAT driver, that operation is
blocked (as expected) because the driver is in use.

Without a switch to disable the feature, the only way to free the device
is to unmount the filesystem, perform the required operation on the QAT
driver/device (e.g., assign it to user space), and then re-mount the
filesystem (which will then use software compression).
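
The sequence above can be sketched as a toy reference-counting model
(illustrative Python, not the kernel's module/crypto code; all names
are made up):

```python
class Driver:
    """Toy model of module refcounting around a long-lived acomp tfm."""

    def __init__(self):
        self.refs = 0

    def alloc_tfm(self):   # like crypto_alloc_acomp() taking a module ref
        self.refs += 1

    def free_tfm(self):    # like crypto_free_acomp() dropping the ref
        self.refs -= 1

    def unload(self):      # like rmmod: fails while the module is in use
        if self.refs:
            raise RuntimeError("module in use")

qat = Driver()
qat.alloc_tfm()            # mount: tfm allocated in the workspace
try:
    qat.unload()
    blocked = False
except RuntimeError:
    blocked = True
assert blocked             # rmmod blocked while the fs holds the tfm
qat.free_tfm()             # disable switch (or unmount): tfm released
qat.unload()               # now the device can be reassigned
```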

This becomes a real problem for root filesystems with compression
enabled (e.g., Fedora). If the QAT device is automatically used and
there is no way to disable its usage in BTRFS, how can we repurpose it
for something else?
[BTW, for context, QAT is currently used in user space via VFIO, which
requires disabling all kernel users and enabling VFs. There is no
hardware mechanism to partition resources between VFs and kernel usage.]

Regards,

-- 
Giovanni


* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-12-01 15:10             ` Giovanni Cabiddu
@ 2025-12-01 20:57               ` Qu Wenruo
  2025-12-01 22:18                 ` Giovanni Cabiddu
  0 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-12-01 20:57 UTC (permalink / raw)
  To: Giovanni Cabiddu, dsterba, herbert
  Cc: Qu Wenruo, clm, terrelln, linux-btrfs, linux-crypto, qat-linux,
	cyan, brian.will, weigang.li, senozhatsky



On 2025/12/2 01:40, Giovanni Cabiddu wrote:
> On Mon, Dec 01, 2025 at 02:32:35PM +0000, Giovanni Cabiddu wrote:
>> On Sat, Nov 29, 2025 at 10:53:02AM +1030, Qu Wenruo wrote:
>>>
>>>
>>> 在 2025/11/29 10:29, Qu Wenruo 写道:
>>>>
>>>>
>>>> 在 2025/11/29 09:10, Giovanni Cabiddu 写道:
>>>>> Thanks for your feedback, Qu Wenruo.
>>>>>
>>>>> On Sat, Nov 29, 2025 at 08:25:30AM +1030, Qu Wenruo wrote:
>>>> [...]
>>>>>> Not an compression/crypto expert, thus just comment on the btrfs part.
>>>>>>
>>>>>> sysfs is not a good long-term solution. Since it's already behind
>>>>>> experiemental flags, you can just enable it unconditionally (with proper
>>>>>> checks of-course).
>>>>> The reason for introducing a sysfs attribute is to allow disabling the
>>>>> feature to be able to unload the QAT driver or to assign a QAT device to
>>>>> user space for example to QATlib or DPDK.
>>>>>
>>>>> In the initial implementation, there was no sysfs switch because the
>>>>> acomp tfm was allocated in the data path. With the current design,
>>>>> where the tfm is allocated in the workspace, the driver remains
>>>>> permanently in use.
>>>>>
>>>>> Is there any other alternative to a sysfs attribute to dynamically
>>>>> enable/disable this feature?
>>>>
>>>> For all needed compression algorithm modules are loaded at btrfs module
>>>> load time (not mount time), thus I was expecting the driver being there
>>>> until the btrfs module is removed from kernel.
>>>>
>>>> This is a completely new use case. Have no good idea on this at all.
>>>> Never expected an accelerated algorithm would even get removed halfway.
>> To clarify, the sysfs switch does not disable the algorithms themselves,
>> it only controls whether acceleration of that algorithm is used, if
>> supported.  If enabled, the filesystem can offload operations to the
>> accelerator. If disabled, it performs them in software. The
>> implementation also handles the case where acceleration is disabled or
>> enabled while the filesystem is in use.
>>
>> BTW, currently, the feature is disabled by default. If that is
>> not preferable, we can enable it by default.
>>
>>> Personally speaking, I'd prefer the acomp API/internals to handle those
>>> hardware acceleration algorithms selection.
>>>
>>> If every fs type utilizes this new accelerated path needs an interface to
>>> disable QAT acceleration, it doesn't look sane that one has to toggle every
>>> involved fs type to disable QAT acceleration.
>>>
>>> Thus hiding the accelerated details behind common acomp API looks more sane.
>> Even if we hide these details behind the acomp API, we would still face
>> a similar issue with the current acomp algorithms. If we need to disable
>> compression acceleration, for example, to assign a QAT device to user
>> space, we would have to unmount the filesystem.
>>
>> What's needed is an `acomp_alg` implementation that is independent of the
>> QAT driver (or any specific accelerator) and can transparently fall back
>> to software. We already have a software fallback in the QAT driver, but
>> as explained, that does not prevent unloading the driver or re-purposing
>> the device. @David and @Herbert, any thoughts?
> Perhaps I should clarify the use case to remove ambiguity.
> 
> I added the `enable/disable` switch to allow disabling acceleration on
> the QAT device so it can be reassigned to user space.  In the current
> design, the acomp tfm is allocated in the workspace and persists for the
> lifetime of the filesystem (unlike the previous preliminary version of
> this series where the acomp tfm was allocated in the datapath).
> This change was introduced after a review comment.
> 
> Here is what happens:
> 1. The acomp tfm is allocated as part of the compression workspace.

Not an expert on crypto, but I guess acomp is not able to dynamically 
queue the workload to different implementations, and has to pick one 
at workspace allocation time due to the differences in tfm/buffer 
size/scatterlist size?


This may be unrealistic, but is it even feasible to hide QAT behind 
the generic acomp compress/decompress algorithm names, and only queue 
the workload to QAT devices when one is available?

Just like we have several different implementations for RAID6 and 
select one at module load time, but more dynamically in this case.

With runtime workload delivery, removal of a QAT device can be pretty 
generic and transparent: just mark the QAT device unavailable for new 
workloads, and wait for any existing workload to finish.

And this also makes the btrfs part easier to implement: just add 
acomp interface support, with no special handling for QAT, and acomp 
will select the best implementation for us.

But for sure, this is just some wild idea from an uneducated non-crypto guy.

Thanks,
Qu

> 2. The selected compression implementation is chosen by the acomp
>     framework.
> 3. The driver (in this case, the QAT driver) increments its reference
>     count.
> 
> At this point, if we try to remove the QAT driver, that operation is
> blocked (as expected) because the driver is in use.
> 
> Without a switch to disable the feature, the only way to free the device
> is to unmount the filesystem, perform the required operation on the QAT
> driver/device (e.g., assign it to user space), and then re-mount the
> filesystem (which will now use sw compression).
> 
> This becomes a real problem for root filesystems with compression
> enabled (e.g., Fedora). If the QAT device is automatically used and
> there is no way to disable its usage in BTRFS, how can we repurpose it
> for something else?
> [BTW, for context, QAT is currently used in user space via VFIO, which
> requires disabling all kernel users and enabling VFs. There is no
> hardware mechanism to partition resources between VFs and kernel usage.]
> 
> Regards,
> 



* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-12-01 20:57               ` Qu Wenruo
@ 2025-12-01 22:18                 ` Giovanni Cabiddu
  2025-12-01 23:13                   ` Qu Wenruo
  0 siblings, 1 reply; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-12-01 22:18 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: dsterba, herbert, Qu Wenruo, clm, terrelln, linux-btrfs,
	linux-crypto, qat-linux, cyan, brian.will, weigang.li,
	senozhatsky

On Tue, Dec 02, 2025 at 07:27:18AM +1030, Qu Wenruo wrote:
> 在 2025/12/2 01:40, Giovanni Cabiddu 写道:
> > On Mon, Dec 01, 2025 at 02:32:35PM +0000, Giovanni Cabiddu wrote:
> > > On Sat, Nov 29, 2025 at 10:53:02AM +1030, Qu Wenruo wrote:
> > > > 
> > > > 
> > > > 在 2025/11/29 10:29, Qu Wenruo 写道:
> > > > > 
> > > > > 
> > > > > 在 2025/11/29 09:10, Giovanni Cabiddu 写道:
> > > > > > Thanks for your feedback, Qu Wenruo.
> > > > > > 
> > > > > > On Sat, Nov 29, 2025 at 08:25:30AM +1030, Qu Wenruo wrote:
> > > > > [...]
> > > > > > > Not an compression/crypto expert, thus just comment on the btrfs part.
> > > > > > > 
> > > > > > > sysfs is not a good long-term solution. Since it's already behind
> > > > > > > experiemental flags, you can just enable it unconditionally (with proper
> > > > > > > checks of-course).
> > > > > > The reason for introducing a sysfs attribute is to allow disabling the
> > > > > > feature to be able to unload the QAT driver or to assign a QAT device to
> > > > > > user space for example to QATlib or DPDK.
> > > > > > 
> > > > > > In the initial implementation, there was no sysfs switch because the
> > > > > > acomp tfm was allocated in the data path. With the current design,
> > > > > > where the tfm is allocated in the workspace, the driver remains
> > > > > > permanently in use.
> > > > > > 
> > > > > > Is there any other alternative to a sysfs attribute to dynamically
> > > > > > enable/disable this feature?
> > > > > 
> > > > > For all needed compression algorithm modules are loaded at btrfs module
> > > > > load time (not mount time), thus I was expecting the driver being there
> > > > > until the btrfs module is removed from kernel.
> > > > > 
> > > > > This is a completely new use case. Have no good idea on this at all.
> > > > > Never expected an accelerated algorithm would even get removed halfway.
> > > To clarify, the sysfs switch does not disable the algorithms themselves,
> > > it only controls whether acceleration of that algorithm is used, if
> > > supported.  If enabled, the filesystem can offload operations to the
> > > accelerator. If disabled, it performs them in software. The
> > > implementation also handles the case where acceleration is disabled or
> > > enabled while the filesystem is in use.
> > > 
> > > BTW, currently, the feature is disabled by default. If that is
> > > not preferable, we can enable it by default.
> > > 
> > > > Personally speaking, I'd prefer the acomp API/internals to handle those
> > > > hardware acceleration algorithms selection.
> > > > 
> > > > If every fs type utilizes this new accelerated path needs an interface to
> > > > disable QAT acceleration, it doesn't look sane that one has to toggle every
> > > > involved fs type to disable QAT acceleration.
> > > > 
> > > > Thus hiding the accelerated details behind common acomp API looks more sane.
> > > Even if we hide these details behind the acomp API, we would still face
> > > a similar issue with the current acomp algorithms. If we need to disable
> > > compression acceleration, for example, to assign a QAT device to user
> > > space, we would have to unmount the filesystem.
> > > 
> > > What's needed is an `acomp_alg` implementation that is independent of the
> > > QAT driver (or any specific accelerator) and can transparently fall back
> > > to software. We already have a software fallback in the QAT driver, but
> > > as explained, that does not prevent unloading the driver or re-purposing
> > > the device. @David and @Herbert, any thoughts?
> > Perhaps I should clarify the use case to remove ambiguity.
> > 
> > I added the `enable/disable` switch to allow disabling acceleration on
> > the QAT device so it can be reassigned to user space.  In the current
> > design, the acomp tfm is allocated in the workspace and persists for the
> > lifetime of the filesystem (unlike the previous preliminary version of
> > this series where the acomp tfm was allocated in the datapath).
> > This change was introduced after a review comment.
> > 
> > Here is what happens:
> > 1. The acomp tfm is allocated as part of the compression workspace.
> 
> Not an expert on crypto, but I guess acomp is not able to really dynamically
> queue the workload into different implementations, but has to determine it
> at workspace allocation time due to the differences in
> tfm/buffersize/scatter list size?
Correct. There isn't an intermediate layer that can enqueue a request
to a separate implementation; that can only be done within a specific
implementation. The QAT driver does exactly that to implement its
software fallback.

> This may be unrealistic, but is it even feasible to hide QAT behind generic
> acomp decompress/compress algorithm names.
> Then only queue the workload to QAT devices when it's available?
That is possible: a generic algorithm name can be passed to
crypto_alloc_acomp(), and the implementation with the highest
priority will be selected.
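
For illustration, that name-based, priority-ordered selection can be
modeled like this (a Python sketch; the priority values and the
software driver names are invented, only "qat_zstd" comes from this
series):

```python
# Registered implementations: (generic name, driver name, priority).
# A hardware provider registers with a higher priority than the
# software one, so a lookup by generic name prefers it.
ALGS = [
    ("zstd", "zstd-generic", 100),
    ("zstd", "qat_zstd", 4001),
    ("deflate", "deflate-generic", 100),
]

def alloc_acomp(name):
    """Model of crypto_alloc_acomp(): pick the highest-priority match."""
    candidates = [a for a in ALGS if a[0] == name]
    if not candidates:
        raise LookupError(name)
    return max(candidates, key=lambda a: a[2])

assert alloc_acomp("zstd")[1] == "qat_zstd"          # hardware wins
assert alloc_acomp("deflate")[1] == "deflate-generic"  # only sw registered
```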

> Just like that we have several different implementation for RAID6 and can
> select at module load time, but more dynamically in this case.
> 
> With runtime workload delivery, the removal of QAT device can be pretty
> generic and transparent. Just mark the QAT device unavailable for new
> workload, and wait for any existing workload to finish.
> 
> And this also makes btrfs part easier to implement, just add acomp interface
> support, no special handling for QAT and acomp will select the best
> implementation for us.
> 
> But for sure, this is just some wild idea from an uneducated non-crypto guy.

I'm trying to better understand the concern:

Is the issue that QAT-specific details are leaking into BTRFS?
Or that we currently have two APIs performing similar functions being
called (acomp and the sw libs)?

If it is the first case, the only QAT-related details exposed are:

 * Offload threshold – This can be hidden inside the implementation of
   crypto_acomp_compress/decompress() in the QAT driver or exposed as a
   tfm attribute (that would be my preference), so we can decide early
   whether offloading makes sense without going through the conversions
   between folios and scatterlists

 * QAT implementation names, i.e.:
       static const char *zlib_acomp_alg_name = "qat_zlib_deflate";
       static const char *zstd_acomp_alg_name = "qat_zstd";
   We can use the generic names instead. If the returned implementation is
   software, we simply ignore it. This way we will enable all the devices
   that implement the acomp API, not only QAT. However, the risk is testing.
   I won't be able to test such devices...

Beyond that, the BTRFS/acomp code can use a software backend without any
changes.
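
The early-decision idea from the first bullet can be sketched as
follows (the threshold value and helper name are hypothetical; the
real value would come from the tfm attribute):

```python
OFFLOAD_THRESHOLD = 4096  # illustrative; the real value is device-specific

def use_accelerator(src_len, threshold=OFFLOAD_THRESHOLD):
    """Decide early, before building scatterlists, whether offload pays off.

    Small buffers are cheaper to compress on the CPU than to ship to
    the device, so only inputs at or above the threshold are offloaded.
    """
    return src_len >= threshold

assert not use_accelerator(1024)   # small input: compress in software
assert use_accelerator(64 * 1024)  # large input: worth offloading
```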

If the concern is about having two APIs, we could remove direct calls to
the software libraries and rely only on acomp. One option might be to
allocate two tfms in the workspace, one for software and one for the
accelerator, since the software names are stable and hardcoded, and
perform the switch.  However, the trend in the kernel nowadays is to
prefer direct calls to the libraries, rather than going through the
crypto layer.  That said, I still need a mechanism to indicate when the
accelerator should not be used. (BTW, I saw David's email confirming
that using sysfs for this is acceptable.)

Thanks,

-- 
Giovanni


* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-12-01 22:18                 ` Giovanni Cabiddu
@ 2025-12-01 23:13                   ` Qu Wenruo
  2025-12-02 17:09                     ` Giovanni Cabiddu
  0 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-12-01 23:13 UTC (permalink / raw)
  To: Giovanni Cabiddu
  Cc: dsterba, herbert, Qu Wenruo, clm, terrelln, linux-btrfs,
	linux-crypto, qat-linux, cyan, brian.will, weigang.li,
	senozhatsky



On 2025/12/2 08:48, Giovanni Cabiddu wrote:
> On Tue, Dec 02, 2025 at 07:27:18AM +1030, Qu Wenruo wrote:
[...]
>>> Here is what happens:
>>> 1. The acomp tfm is allocated as part of the compression workspace.
>>
>> Not an expert on crypto, but I guess acomp is not able to really dynamically
>> queue the workload into different implementations, but has to determine it
>> at workspace allocation time due to the differences in
>> tfm/buffersize/scatter list size?
> Correct. There isn't an intermediate layer that can enqueue to a
> separate implementation. The enqueue to a separate implementation can be
> done in a specific implementation. The QAT driver does that to implement
> a fallback to software.
> 
>> This may be unrealistic, but is it even feasible to hide QAT behind generic
>> acomp decompress/compress algorithm names.
>> Then only queue the workload to QAT devices when it's available?
> That is possible. It is possible to specify a generic algorithm name to
> crypto_alloc_acomp() and the implementation that has the highest
> priority will be selected.

I think it will be the best solution, and the most transparent one.

> 
>> Just like that we have several different implementation for RAID6 and can
>> select at module load time, but more dynamically in this case.
>>
>> With runtime workload delivery, the removal of QAT device can be pretty
>> generic and transparent. Just mark the QAT device unavailable for new
>> workload, and wait for any existing workload to finish.
>>
>> And this also makes btrfs part easier to implement, just add acomp interface
>> support, no special handling for QAT and acomp will select the best
>> implementation for us.
>>
>> But for sure, this is just some wild idea from an uneducated non-crypto guy.
> 
> I'm trying to better understand the concern:
> 
> Is the issue that QAT specific details are leaking into BTRFS?
> Or that we currently have two APIs performing similar functions being
> called (acomp and the sw libs)?
> 
> If it is the first case, the only QAT-related details exposed are:
> 
>   * Offload threshold – This can be hidden inside the implementation of
>     crypto_acomp_compress/decompress() in the QAT driver or exposed as a
>     tfm attribute (that would be my preference), so we can decide early
>     whether offloading makes sense without going throught the conversions
>     between folios and scatterlists

This part is fine, the practical threshold will be larger than 1024 and 
2048 anyway.

> 
>   * QAT implementation names, i.e.:
>         static const char *zlib_acomp_alg_name = "qat_zlib_deflate";
>         static const char *zstd_acomp_alg_name = "qat_zstd";
>     We can use the generic names instead. If the returned implementation is
>     software, we simply ignore it. This way we will enable all the devices
>     that implement the acomp API, not only QAT. However, the risk is testing.
>     I won't be able to test such devices...

This is only a minor part of the concern.

The other is the removal of QAT, which is implemented as a per-fs 
interface and fully exposed to btrfs.
And that's really the only blocker for me.

If QAT is the first one doing this, will every other driver have to 
implement the same interface for its removal in the future?
To me this doesn't look like it scales.


And that also looks like a layering violation, exporting low-level 
crypto details into a fs, which shouldn't really care about the 
fastest implementation or the details of how to remove a QAT device.

Thus I really want to follow the RAID6 scheme: let the module select 
the fastest implementation, and btrfs just uses the provided 
interface.
(Plus the missing part: dynamically removing one implementation at 
runtime.)


I understand your concern related to QAT device removal, but 
considering the layer separation, it would be better handled inside 
the acomp layer, so that not every QAT user needs to implement a kill 
switch.

Considering acomp was never designed for such runtime workload 
delivery, this may be too much to ask though.

> 
> Beyond that, the BTRFS/acomp code can use a software backend without any
> changes.
> 
> If the concern is about having two APIs, we could remove direct calls to
> the software libraries and rely only on acomp.

That's not a huge deal, at least not to me.

I'm fine having the acomp interface alongside the existing one, as 
long as it provides better performance.

Thanks,
Qu

> One option might be to
> allocate two tfms in the workspace, one for software and one for the
> accelerator, since the software names are stable and hardcoded, and
> perform the switch.  However, the trend in the kernel nowadays is to
> prefer direct calls to the libraries, rather than going through the
> crypto layer.  That said, I still need a mechanism to indicate when the
> accelerator should not be used. (BTW, I saw David's email confirming
> that using sysfs for this is acceptable.)
> 
> Thanks,
> 



* Re: [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators
  2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
                   ` (16 preceding siblings ...)
  2025-11-28 19:05 ` [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload Giovanni Cabiddu
@ 2025-12-02  7:53 ` Christoph Hellwig
  2025-12-02 15:46   ` Jani Partanen
  17 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2025-12-02  7:53 UTC (permalink / raw)
  To: Giovanni Cabiddu
  Cc: clm, dsterba, terrelln, herbert, linux-btrfs, linux-crypto,
	qat-linux, cyan, brian.will, weigang.li, senozhatsky

On Fri, Nov 28, 2025 at 07:04:48PM +0000, Giovanni Cabiddu wrote:
> +---------------------------+---------+---------+---------+---------+
> |                           | QAT-L9  | ZSTD-L3 | ZLIB-L3 | LZO-L1  |
> +---------------------------+---------+---------+---------+---------+
> | Disk Write TPUT (GiB/s)   | 6.5     | 5.2     | 2.2     | 6.5     |
> +---------------------------+---------+---------+---------+---------+
> | CPU utils %age @208 cores | 4.56%   | 15.67%  | 12.79%  | 19.85%  |
> +---------------------------+---------+---------+---------+---------+
> | Compression Ratio         | 34%     | 35%     | 37%     | 58%     |
> +---------------------------+---------+---------+---------+---------+

Is it just me, or do the numbers not look all that great at least
when comparing to ZSTD-L3 and LZO-L1?  What are the decompression
numbers?



* Re: [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators
  2025-12-02  7:53 ` [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Christoph Hellwig
@ 2025-12-02 15:46   ` Jani Partanen
  2025-12-02 17:19     ` Giovanni Cabiddu
  2025-12-03  7:00     ` Christoph Hellwig
  0 siblings, 2 replies; 42+ messages in thread
From: Jani Partanen @ 2025-12-02 15:46 UTC (permalink / raw)
  To: Christoph Hellwig, Giovanni Cabiddu
  Cc: clm, dsterba, terrelln, herbert, linux-btrfs, linux-crypto,
	qat-linux, cyan, brian.will, weigang.li, senozhatsky


On 02/12/2025 9.53, Christoph Hellwig wrote:
> On Fri, Nov 28, 2025 at 07:04:48PM +0000, Giovanni Cabiddu wrote:
>> +---------------------------+---------+---------+---------+---------+
>> |                           | QAT-L9  | ZSTD-L3 | ZLIB-L3 | LZO-L1  |
>> +---------------------------+---------+---------+---------+---------+
>> | Disk Write TPUT (GiB/s)   | 6.5     | 5.2     | 2.2     | 6.5     |
>> +---------------------------+---------+---------+---------+---------+
>> | CPU utils %age @208 cores | 4.56%   | 15.67%  | 12.79%  | 19.85%  |
>> +---------------------------+---------+---------+---------+---------+
>> | Compression Ratio         | 34%     | 35%     | 37%     | 58%     |
>> +---------------------------+---------+---------+---------+---------+
> Is it just me, or do the numbers not look all that great at least
> when comparing to ZSTD-L3 and LZO-L1?  What are the decompression
> numbers?
>

What makes you think so?

If the CPU utilization numbers were single-core percentages, then I 
would agree with you. But they are across 208 cores, so there is 
quite a big saving: over 20 cores saved, if I have understood this 
right.
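
For context, the core savings implied by the table work out as
follows (a quick Python check; the utilization figures are read as
fractions of all 208 cores):

```python
CORES = 208
# CPU utilization from the table, as a fraction of all cores.
util = {"QAT-L9": 0.0456, "ZSTD-L3": 0.1567, "ZLIB-L3": 0.1279, "LZO-L1": 0.1985}

def cores_saved(baseline):
    """Whole cores freed by QAT offload relative to a software codec."""
    return CORES * (util[baseline] - util["QAT-L9"])

assert round(cores_saved("ZSTD-L3"), 1) == 23.1  # vs zstd level 3
assert round(cores_saved("LZO-L1"), 1) == 31.8   # vs lzo
```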




* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-12-01 23:13                   ` Qu Wenruo
@ 2025-12-02 17:09                     ` Giovanni Cabiddu
  2025-12-02 20:38                       ` Qu Wenruo
  0 siblings, 1 reply; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-12-02 17:09 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: dsterba, herbert, Qu Wenruo, clm, terrelln, linux-btrfs,
	linux-crypto, qat-linux, cyan, brian.will, weigang.li,
	senozhatsky

On Tue, Dec 02, 2025 at 09:43:24AM +1030, Qu Wenruo wrote:
> 
> 
> 在 2025/12/2 08:48, Giovanni Cabiddu 写道:
> > On Tue, Dec 02, 2025 at 07:27:18AM +1030, Qu Wenruo wrote:
> [...]
> > > > Here is what happens:
> > > > 1. The acomp tfm is allocated as part of the compression workspace.
> > > 
> > > Not an expert on crypto, but I guess acomp is not able to really dynamically
> > > queue the workload into different implementations, but has to determine it
> > > at workspace allocation time due to the differences in
> > > tfm/buffersize/scatter list size?
> > Correct. There isn't an intermediate layer that can enqueue to a
> > separate implementation. The enqueue to a separate implementation can be
> > done in a specific implementation. The QAT driver does that to implement
> > a fallback to software.
> > 
> > > This may be unrealistic, but is it even feasible to hide QAT behind generic
> > > acomp decompress/compress algorithm names.
> > > Then only queue the workload to QAT devices when it's available?
> > That is possible. It is possible to specify a generic algorithm name to
> > crypto_alloc_acomp() and the implementation that has the highest
> > priority will be selected.
> 
> I think it will be the best solution, and the most transparent one.
> 
> > 
> > > Just like that we have several different implementation for RAID6 and can
> > > select at module load time, but more dynamically in this case.
> > > 
> > > With runtime workload delivery, the removal of QAT device can be pretty
> > > generic and transparent. Just mark the QAT device unavailable for new
> > > workload, and wait for any existing workload to finish.
> > > 
> > > And this also makes btrfs part easier to implement, just add acomp interface
> > > support, no special handling for QAT and acomp will select the best
> > > implementation for us.
> > > 
> > > But for sure, this is just some wild idea from an uneducated non-crypto guy.
> > 
> > I'm trying to better understand the concern:
> > 
> > Is the issue that QAT specific details are leaking into BTRFS?
> > Or that we currently have two APIs performing similar functions being
> > called (acomp and the sw libs)?
> > 
> > If it is the first case, the only QAT-related details exposed are:
> > 
> >   * Offload threshold – This can be hidden inside the implementation of
> >     crypto_acomp_compress/decompress() in the QAT driver or exposed as a
> >     tfm attribute (that would be my preference), so we can decide early
> >     whether offloading makes sense without going through the conversions
> >     between folios and scatterlists
> 
> This part is fine, the practical threshold will be larger than 1024 and 2048
> anyway.
> 
> > 
> >   * QAT implementation names, i.e.:
> >         static const char *zlib_acomp_alg_name = "qat_zlib_deflate";
> >         static const char *zstd_acomp_alg_name = "qat_zstd";
> >     We can use the generic names instead. If the returned implementation is
> >     software, we simply ignore it. This way we will enable all the devices
> >     that implement the acomp API, not only QAT. However, the risk is testing.
> >     I won't be able to test such devices...
> 
> This is only a minor part of the concern.
> 
> The other is the removal of QAT, which is implemented as a per-fs interface
> and fully exposed to btrfs.
> And that's really the only blockage to me.
> 
> If QAT is the first one doing this, would there be another driver
> implementing the same interface for its removal in the future?
> To me this doesn't look like it scales.

I should have explained this better.

The switch is not QAT specific:

    /sys/fs/btrfs/<UUID>/offload_compress

It does not require any other compression engine that plugs into the
acomp framework to implement anything.

Here's how it works:

  * If `offload_compress` is enabled, an acomp tfm is allocated. The tfm
    allocation in the algorithm implementation typically increments the
    reference count on the driver that provides the algorithm. At this
    point, the hardware implementation of the algorithm is selected.
    Compression/decompression is done through the acomp APIs.

  * If `offload_compress` is disabled, the acomp tfms in the workspace are
    freed, and the software libraries are used instead.

So there is nothing QAT specific here. The mechanism is generic.

Have a look at the code, it is pretty straightforward :-).

Hopefully this clarifies.

Thanks,

-- 
Giovanni

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators
  2025-12-02 15:46   ` Jani Partanen
@ 2025-12-02 17:19     ` Giovanni Cabiddu
  2025-12-03  7:00     ` Christoph Hellwig
  1 sibling, 0 replies; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-12-02 17:19 UTC (permalink / raw)
  To: Jani Partanen
  Cc: Christoph Hellwig, clm, dsterba, terrelln, herbert, linux-btrfs,
	linux-crypto, qat-linux, cyan, brian.will, weigang.li,
	senozhatsky

On Tue, Dec 02, 2025 at 05:46:29PM +0200, Jani Partanen wrote:
> 
> On 02/12/2025 9.53, Christoph Hellwig wrote:
> > On Fri, Nov 28, 2025 at 07:04:48PM +0000, Giovanni Cabiddu wrote:
> > > +---------------------------+---------+---------+---------+---------+
> > > |                           | QAT-L9  | ZSTD-L3 | ZLIB-L3 | LZO-L1  |
> > > +---------------------------+---------+---------+---------+---------+
> > > | Disk Write TPUT (GiB/s)   | 6.5     | 5.2     | 2.2     | 6.5     |
> > > +---------------------------+---------+---------+---------+---------+
> > > | CPU utils %age @208 cores | 4.56%   | 15.67%  | 12.79%  | 19.85%  |
> > > +---------------------------+---------+---------+---------+---------+
> > > | Compression Ratio         | 34%     | 35%     | 37%     | 58%     |
> > > +---------------------------+---------+---------+---------+---------+
> > Is it just me, or do the numbers not look all that great at least
> > when comparing to ZSTD-L3 and LZO-L1?  What are the decompression
> > numbers?
> > 
> 
> What makes you think so?
> 
> If the CPU util numbers were single-core percentages, then I would agree
> with you. But it's 208 cores, so there is quite a big saving, like over 20
> cores saved, if I have understood this right.

Probably what triggered the question is that all compression ratios are
similar, except for LZO.

Also, in the previous version of the table I didn't specify that for QAT
we are running ZLIB (even though this set supports ZSTD w/ QAT).

Here is the updated table:

 +---------------------------+-------------+-----------+-----------+-----------+
 |                           | QAT-ZLIB-L9 | ZSTD-L3   | ZLIB-L3   | LZO-L1    |
 +---------------------------+-------------+-----------+-----------+-----------+
 | Disk Write TPUT (GiB/s)   | 6.5         | 5.2       | 2.2       | 6.5       |
 | (higher is better)        |             |           |           |           |
 +---------------------------+-------------+-----------+-----------+-----------+
 | CPU utils %age @208 cores | 4.56%       | 15.67%    | 12.79%    | 19.85%    |
 | (lower is better)         | ~9 cores    | ~33 cores | ~27 cores | ~41 cores |
 +---------------------------+-------------+-----------+-----------+-----------+
 | Compression Ratio         | 34%         | 35%       | 37%       | 58%       |
 | (lower is better)         |             |           |           |           |
 +---------------------------+-------------+-----------+-----------+-----------+

Key takeaway: QAT offload aims to reduce CPU utilization while maintaining
competitive throughput and compression ratio.

At a throughput of 6.5 GiB/s, QAT-ZLIB-L9 stores the data in significantly
less space (compression ratio 34% vs 58%) and uses far fewer cores to do
so (4.56% vs 19.85%) compared to SW-LZO-L1.

As for decompression:
  * zlib offload supports both compression and decompression.
  * zstd offload currently supports compression only; decompression
    falls back to software.

I'll share zstd and decompression benchmarks in a future revision once
they pass internal approval. I will also include measurements for SW-ZLIB-L9
for a direct comparison.

Thanks,

-- 
Giovanni

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-12-02 17:09                     ` Giovanni Cabiddu
@ 2025-12-02 20:38                       ` Qu Wenruo
  2025-12-02 22:37                         ` Giovanni Cabiddu
  0 siblings, 1 reply; 42+ messages in thread
From: Qu Wenruo @ 2025-12-02 20:38 UTC (permalink / raw)
  To: Giovanni Cabiddu
  Cc: dsterba, herbert, Qu Wenruo, clm, terrelln, linux-btrfs,
	linux-crypto, qat-linux, cyan, brian.will, weigang.li,
	senozhatsky



On 2025/12/3 03:39, Giovanni Cabiddu wrote:
> On Tue, Dec 02, 2025 at 09:43:24AM +1030, Qu Wenruo wrote:
>>
>>
>> On 2025/12/2 08:48, Giovanni Cabiddu wrote:
>>> On Tue, Dec 02, 2025 at 07:27:18AM +1030, Qu Wenruo wrote:
>> [...]
>>>>> Here is what happens:
>>>>> 1. The acomp tfm is allocated as part of the compression workspace.
>>>>
>>>> Not an expert on crypto, but I guess acomp is not able to really dynamically
>>>> queue the workload into different implementations, but has to determine it
>>>> at workspace allocation time due to the differences in
>>>> tfm/buffersize/scatter list size?
>>> Correct. There isn't an intermediate layer that can enqueue work to a
>>> different implementation; that can only be done within a specific
>>> implementation. The QAT driver does that to implement a fallback to
>>> software.
>>>
>>>> This may be unrealistic, but is it even feasible to hide QAT behind generic
>>>> acomp decompress/compress algorithm names.
>>>> Then only queue the workload to QAT devices when it's available?
>>> That is possible. It is possible to specify a generic algorithm name to
>>> crypto_alloc_acomp() and the implementation that has the highest
>>> priority will be selected.
>>
>> I think it will be the best solution, and the most transparent one.
>>
>>>
>>>> Just like that we have several different implementation for RAID6 and can
>>>> select at module load time, but more dynamically in this case.
>>>>
>>>> With runtime workload delivery, the removal of QAT device can be pretty
>>>> generic and transparent. Just mark the QAT device unavailable for new
>>>> workload, and wait for any existing workload to finish.
>>>>
>>>> And this also makes btrfs part easier to implement, just add acomp interface
>>>> support, no special handling for QAT and acomp will select the best
>>>> implementation for us.
>>>>
>>>> But for sure, this is just some wild idea from an uneducated non-crypto guy.
>>>
>>> I'm trying to better understand the concern:
>>>
>>> Is the issue that QAT specific details are leaking into BTRFS?
>>> Or that we currently have two APIs performing similar functions being
>>> called (acomp and the sw libs)?
>>>
>>> If it is the first case, the only QAT-related details exposed are:
>>>
>>>    * Offload threshold – This can be hidden inside the implementation of
>>>      crypto_acomp_compress/decompress() in the QAT driver or exposed as a
>>>      tfm attribute (that would be my preference), so we can decide early
>>>      whether offloading makes sense without going through the conversions
>>>      between folios and scatterlists
>>
>> This part is fine, the practical threshold will be larger than 1024 and 2048
>> anyway.
>>
>>>
>>>    * QAT implementation names, i.e.:
>>>          static const char *zlib_acomp_alg_name = "qat_zlib_deflate";
>>>          static const char *zstd_acomp_alg_name = "qat_zstd";
>>>      We can use the generic names instead. If the returned implementation is
>>>      software, we simply ignore it. This way we will enable all the devices
>>>      that implement the acomp API, not only QAT. However, the risk is testing.
>>>      I won't be able to test such devices...
>>
>> This is only a minor part of the concern.
>>
>> The other is the removal of QAT, which is implemented as a per-fs interface
>> and fully exposed to btrfs.
>> And that's really the only blockage to me.
>>
>> If QAT is the first one doing this, would there be another driver
>> implementing the same interface for its removal in the future?
>> To me this doesn't look like it scales.
> 
> I should have explained this better.
> 
> The switch is not QAT specific:
> 
>      /sys/fs/btrfs/<UUID>/offload_compress
> 
> It does not require any other compression engine that plugs into the
> acomp framework to implement anything.
> 
> Here's how it works:
> 
>    * If `offload_compress` is enabled, an acomp tfm is allocated. The tfm
>      allocation in the algorithm implementation typically increments the
>      reference count on the driver that provides the algorithm. At this
>      point, the hardware implementation of the algorithm is selected.
>      Compression/decompression is done through the acomp APIs.
> 
>    * If `offload_compress` is disabled, the acomp tfms in the workspace are
>      freed, and the software libraries are used instead.
> 
> So there is nothing QAT specific here. The mechanism is generic.

But only QAT requires this, a "generic" mechanism only for QAT.

> 
> Have a look at the code, it is pretty straightforward :-).
> 
> Hopefully this clarifies.
> 
> Thanks,
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-12-02 20:38                       ` Qu Wenruo
@ 2025-12-02 22:37                         ` Giovanni Cabiddu
  2025-12-02 22:59                           ` Qu Wenruo
  0 siblings, 1 reply; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-12-02 22:37 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: dsterba, herbert, Qu Wenruo, clm, terrelln, linux-btrfs,
	linux-crypto, qat-linux, cyan, brian.will, weigang.li,
	senozhatsky

On Wed, Dec 03, 2025 at 07:08:50AM +1030, Qu Wenruo wrote:
> 
> 
> On 2025/12/3 03:39, Giovanni Cabiddu wrote:
> > On Tue, Dec 02, 2025 at 09:43:24AM +1030, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2025/12/2 08:48, Giovanni Cabiddu wrote:
> > > > On Tue, Dec 02, 2025 at 07:27:18AM +1030, Qu Wenruo wrote:
> > > [...]
> > > > > > Here is what happens:
> > > > > > 1. The acomp tfm is allocated as part of the compression workspace.
> > > > > 
> > > > > Not an expert on crypto, but I guess acomp is not able to really dynamically
> > > > > queue the workload into different implementations, but has to determine it
> > > > > at workspace allocation time due to the differences in
> > > > > tfm/buffersize/scatter list size?
> > > > Correct. There isn't an intermediate layer that can enqueue work to a
> > > > different implementation; that can only be done within a specific
> > > > implementation. The QAT driver does that to implement a fallback to
> > > > software.
> > > > 
> > > > > This may be unrealistic, but is it even feasible to hide QAT behind generic
> > > > > acomp decompress/compress algorithm names.
> > > > > Then only queue the workload to QAT devices when it's available?
> > > > That is possible. It is possible to specify a generic algorithm name to
> > > > crypto_alloc_acomp() and the implementation that has the highest
> > > > priority will be selected.
> > > 
> > > I think it will be the best solution, and the most transparent one.
> > > 
> > > > 
> > > > > Just like that we have several different implementation for RAID6 and can
> > > > > select at module load time, but more dynamically in this case.
> > > > > 
> > > > > With runtime workload delivery, the removal of QAT device can be pretty
> > > > > generic and transparent. Just mark the QAT device unavailable for new
> > > > > workload, and wait for any existing workload to finish.
> > > > > 
> > > > > And this also makes btrfs part easier to implement, just add acomp interface
> > > > > support, no special handling for QAT and acomp will select the best
> > > > > implementation for us.
> > > > > 
> > > > > But for sure, this is just some wild idea from an uneducated non-crypto guy.
> > > > 
> > > > I'm trying to better understand the concern:
> > > > 
> > > > Is the issue that QAT specific details are leaking into BTRFS?
> > > > Or that we currently have two APIs performing similar functions being
> > > > called (acomp and the sw libs)?
> > > > 
> > > > If it is the first case, the only QAT-related details exposed are:
> > > > 
> > > >    * Offload threshold – This can be hidden inside the implementation of
> > > >      crypto_acomp_compress/decompress() in the QAT driver or exposed as a
> > > >      tfm attribute (that would be my preference), so we can decide early
> > > >      whether offloading makes sense without going through the conversions
> > > >      between folios and scatterlists
> > > 
> > > This part is fine, the practical threshold will be larger than 1024 and 2048
> > > anyway.
> > > 
> > > > 
> > > >    * QAT implementation names, i.e.:
> > > >          static const char *zlib_acomp_alg_name = "qat_zlib_deflate";
> > > >          static const char *zstd_acomp_alg_name = "qat_zstd";
> > > >      We can use the generic names instead. If the returned implementation is
> > > >      software, we simply ignore it. This way we will enable all the devices
> > > >      that implement the acomp API, not only QAT. However, the risk is testing.
> > > >      I won't be able to test such devices...
> > > 
> > > This is only a minor part of the concern.
> > > 
> > > The other is the removal of QAT, which is implemented as a per-fs interface
> > > and fully exposed to btrfs.
> > > And that's really the only blockage to me.
> > > 
> > > If QAT is the first one doing this, would there be another driver
> > > implementing the same interface for its removal in the future?
> > > To me this doesn't look like it scales.
> > 
> > I should have explained this better.
> > 
> > The switch is not QAT specific:
> > 
> >      /sys/fs/btrfs/<UUID>/offload_compress
> > 
> > It does not require any other compression engine that plugs into the
> > acomp framework to implement anything.
> > 
> > Here's how it works:
> > 
> >    * If `offload_compress` is enabled, an acomp tfm is allocated. The tfm
> >      allocation in the algorithm implementation typically increments the
> >      reference count on the driver that provides the algorithm. At this
> >      point, the hardware implementation of the algorithm is selected.
> >      Compression/decompression is done through the acomp APIs.
> > 
> >    * If `offload_compress` is disabled, the acomp tfms in the workspace are
> >      freed, and the software libraries are used instead.
> > 
> > So there is nothing QAT specific here. The mechanism is generic.
> 
> But only QAT requires this, a "generic" mechanism only for QAT.

If this solution is adopted for other accelerators, they will need it as
well.

I tested sending traffic to another device that plugs into the acomp API
(Intel IAA) and then tried removing the module while in-flight
compression operations were ongoing. It was not possible to remove it
(as expected!).

IAA currently only implements deflate (not zlib-deflate), so it cannot
be used for this specific case, but the same limitation applies. There
are also other drivers in drivers/crypto, including non-Intel ones, that
integrate with acomp for compression.

So this is not QAT-specific. The problem exists for any accelerator using
acomp when dynamic removal is required.

Thanks,

-- 
Giovanni

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload
  2025-12-02 22:37                         ` Giovanni Cabiddu
@ 2025-12-02 22:59                           ` Qu Wenruo
  0 siblings, 0 replies; 42+ messages in thread
From: Qu Wenruo @ 2025-12-02 22:59 UTC (permalink / raw)
  To: Giovanni Cabiddu
  Cc: dsterba, herbert, Qu Wenruo, clm, terrelln, linux-btrfs,
	linux-crypto, qat-linux, cyan, brian.will, weigang.li,
	senozhatsky

[...]
>>> So there is nothing QAT specific here. The mechanism is generic.
>>
>> But only QAT requires this, a "generic" mechanism only for QAT.
> If this solution is adopted for other accelerators, they will need it as
> well.

Does this sound sane to you? Every future acomp user needs to implement 
a removal? At least it doesn't sound sane to me.

> 
> I tested sending traffic to another device that plugs into the acomp API
> (Intel IAA) and then tried removing the module while in-flight
> compression operations were ongoing. It was not possible to remove it
> (as expected!).

I know, but that doesn't mean it's the correct way to go.

You're pushing a QAT-specific workaround into an acomp user, completely 
breaking the layer separation.

If the current acomp layer can't handle this feature (dynamically 
delivering work to different implementations), then please add such a 
feature to the acomp layer.

> 
> IAA currently only implements deflate (not zlib-deflate), so it cannot
> be used for this specific case, but the same limitation applies. There
> are also other drivers in drivers/crypto, including non-Intel ones, that
> integrate with acomp for compression.
> 
> So this is not QAT-specific. The problem exists for any accelerator using
> acomp when dynamic removal is required.

Then please give an example where such removal is needed in the existing 
code.

I tried searching for those acomp users, but none of them seems to need 
to bother with the removal of an implementation.

Again, if QAT requires a new ability (dynamic removal) from acomp, add 
it into acomp, not the acomp user.

Thanks,
Qu

> 
> Thanks,
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators
  2025-12-02 15:46   ` Jani Partanen
  2025-12-02 17:19     ` Giovanni Cabiddu
@ 2025-12-03  7:00     ` Christoph Hellwig
  2025-12-03 10:15       ` Giovanni Cabiddu
  2025-12-03 10:47       ` Simon Richter
  1 sibling, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2025-12-03  7:00 UTC (permalink / raw)
  To: Jani Partanen
  Cc: Christoph Hellwig, Giovanni Cabiddu, clm, dsterba, terrelln,
	herbert, linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky

On Tue, Dec 02, 2025 at 05:46:29PM +0200, Jani Partanen wrote:
> 
> On 02/12/2025 9.53, Christoph Hellwig wrote:
> > On Fri, Nov 28, 2025 at 07:04:48PM +0000, Giovanni Cabiddu wrote:
> > > +---------------------------+---------+---------+---------+---------+
> > > |                           | QAT-L9  | ZSTD-L3 | ZLIB-L3 | LZO-L1  |
> > > +---------------------------+---------+---------+---------+---------+
> > > | Disk Write TPUT (GiB/s)   | 6.5     | 5.2     | 2.2     | 6.5     |
> > > +---------------------------+---------+---------+---------+---------+
> > > | CPU utils %age @208 cores | 4.56%   | 15.67%  | 12.79%  | 19.85%  |
> > > +---------------------------+---------+---------+---------+---------+
> > > | Compression Ratio         | 34%     | 35%     | 37%     | 58%     |
> > > +---------------------------+---------+---------+---------+---------+
> > Is it just me, or do the numbers not look all that great at least
> > when comparing to ZSTD-L3 and LZO-L1?  What are the decompression
> > numbers?
> > 
> 
> What makes you think so?

Well, if you compare QAT-L9 to LZO-L1 specifically:

 - yes, cpu usage is reduced to a quarter
 - disk performance is the same
 - the compression ratio is much, much worse

and we don't know anything about the decompression speed.

All the while you significantly complicate the code.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators
  2025-12-03  7:00     ` Christoph Hellwig
@ 2025-12-03 10:15       ` Giovanni Cabiddu
  2025-12-04  9:59         ` Christoph Hellwig
  2025-12-03 10:47       ` Simon Richter
  1 sibling, 1 reply; 42+ messages in thread
From: Giovanni Cabiddu @ 2025-12-03 10:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jani Partanen, clm, dsterba, terrelln, herbert, linux-btrfs,
	linux-crypto, qat-linux, cyan, brian.will, weigang.li,
	senozhatsky

Apologies, I should have included you in the TO list on my earlier reply
to Jani:
https://lore.kernel.org/all/aS8fo0h32Yp+ZSPV@gcabiddu-mobl.ger.corp.intel.com/

On Tue, Dec 02, 2025 at 11:00:05PM -0800, Christoph Hellwig wrote:
> On Tue, Dec 02, 2025 at 05:46:29PM +0200, Jani Partanen wrote:
> > 
> > On 02/12/2025 9.53, Christoph Hellwig wrote:
> > > On Fri, Nov 28, 2025 at 07:04:48PM +0000, Giovanni Cabiddu wrote:
> > > > +---------------------------+---------+---------+---------+---------+
> > > > |                           | QAT-L9  | ZSTD-L3 | ZLIB-L3 | LZO-L1  |
> > > > +---------------------------+---------+---------+---------+---------+
> > > > | Disk Write TPUT (GiB/s)   | 6.5     | 5.2     | 2.2     | 6.5     |
> > > > +---------------------------+---------+---------+---------+---------+
> > > > | CPU utils %age @208 cores | 4.56%   | 15.67%  | 12.79%  | 19.85%  |
> > > > +---------------------------+---------+---------+---------+---------+
> > > > | Compression Ratio         | 34%     | 35%     | 37%     | 58%     |
> > > > +---------------------------+---------+---------+---------+---------+
> > > Is it just me, or do the numbers not look all that great at least
> > > when comparing to ZSTD-L3 and LZO-L1?  What are the decompression
> > > numbers?
> > > 
> > 
> > What makes you think so?
> 
> Well, if you compared QAT-L9 to LZO-L1 specifically:
> 
>  - yes, cpu usage is reduced to a quarter
>  - disk performance is the same
>  - the compression ratio is much, much worse
The compression ratio with QAT-ZLIB-L9 is close to SW-ZSTD-L3 (lower is better).

Here is an updated version of the table for clarity:
 +---------------------------+-------------+-----------+-----------+-----------+
 |                           | QAT-ZLIB-L9 | ZSTD-L3   | ZLIB-L3   | LZO-L1    |
 +---------------------------+-------------+-----------+-----------+-----------+
 | Disk Write TPUT (GiB/s)   | 6.5         | 5.2       | 2.2       | 6.5       |
 | (higher is better)        |             |           |           |           |
 +---------------------------+-------------+-----------+-----------+-----------+
 | CPU utils %age @208 cores | 4.56%       | 15.67%    | 12.79%    | 19.85%    |
 | (lower is better)         | ~9 cores    | ~33 cores | ~27 cores | ~41 cores |
 +---------------------------+-------------+-----------+-----------+-----------+
 | Compression Ratio         | 34%         | 35%       | 37%       | 58%       |
 | (lower is better)         |             |           |           |           |
 +---------------------------+-------------+-----------+-----------+-----------+

> 
> and we don't know anything about the decompression speed.
I'll share the decompression benchmarks in a future revision once they
pass internal approval.

> All the while you significantly complicate the code.
The changes in BTRFS are about 800 LOC and the core logic is straightforward:
convert folios to scatterlists and invoke the acomp APIs for offloading
compression/decompression.

The majority of the code is in the QAT driver to enable the algorithms.

Regards,

-- 
Giovanni

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators
  2025-12-03  7:00     ` Christoph Hellwig
  2025-12-03 10:15       ` Giovanni Cabiddu
@ 2025-12-03 10:47       ` Simon Richter
  2025-12-04 10:06         ` Christoph Hellwig
  1 sibling, 1 reply; 42+ messages in thread
From: Simon Richter @ 2025-12-03 10:47 UTC (permalink / raw)
  To: Christoph Hellwig, Jani Partanen
  Cc: Giovanni Cabiddu, clm, dsterba, terrelln, herbert, linux-btrfs,
	linux-crypto, qat-linux, cyan, brian.will, weigang.li,
	senozhatsky

Hi,

On 12/3/25 4:00 PM, Christoph Hellwig wrote:

> Well, if you compared QAT-L9 to LZO-L1 specifically:

FWIW I think the same infrastructure is useful for nx-gzip on POWER9+, 
which is a definite win:

1. With 4 GiB of random data, effectively uncompressible:

$ time nx_gzip -c test.bin >test.bin.nxgz
real    0m4.716s
user    0m1.381s
sys     0m2.237s

$ time gzip -c test.bin >test.bin.gz
real    2m58.536s
user    2m56.098s
sys     0m2.084s

2. With 4 GiB of NUL bytes:

$ time nx_gzip -c zero.bin >zero.bin.nxgz
real    0m0.855s
user    0m0.613s
sys     0m0.241s

$ time gzip -c zero.bin >zero.bin.gz
real    0m25.944s
user    0m25.600s
sys     0m0.336s

This includes quite a bit of overhead because we're commanding the 
coprocessor from a userspace library with a zlib-compatible interface: 
there is syscall overhead for reading, poking the coprocessor, and 
writing, and the blocks submitted aren't as large as they could be. So 
I'd expect an acomp module, which runs before the data is transferred to 
userspace, to be a bit faster still.

Unpacking is quite a bit faster as well, to the point where unpacking 
the compressed block of 4GiB NUL bytes is faster than reading 4 GiB from 
/dev/zero for me.

For acomp, I pretty much always expect offloading to be worth the 
overhead if hardware is available, simply because working with 
bitstreams is awkward on any architecture that isn't specifically 
designed for it, and when an algorithm requires building a dictionary, 
gathering statistics and two-pass processing, that becomes even more 
visible.

For ahash/acrypt, there is a trade-off here, and where it is depends on 
CPU features, the overhead of offloading, the overhead of receiving the 
result, and how much of that overhead can be mitigated by submitting a 
batch of operations.

For the latter, we also need a better submission interface that actually 
allows large batches, and submitters to use that.

Much of the discussion about hardware offload has been circular -- no 
one is submitting large requests because for CPU-based implementations 
there is no benefit in doing so (it just makes the interface more 
complex), and hardware-based implementations are sequentially processing 
one small request at a time because no one is submitting larger batches, 
and as a result we can't see a lot of performance improvements.

As an example of interface pain points: ahash has synchronous 
import/export functions, and no way for the driver to indicate that the 
result buffer must be reachable by DMA as well, so even with a mailbox 
interface that allows me to submit operations with low overhead, I need 
to synthesize state readbacks into an auxiliary buffer and request an 
interrupt to be delivered after each "update" operation, simply so I can 
have the state available in case it is requested, while normally I would 
only generate an interrupt after an "export" or "final" operation is 
completed (and also rate-limit these).

There's a lot of additional things that I think a good API would allow, 
such as directly feeding data from mass storage to a coprocessor if they 
have compatible interfaces -- since the initial fetch is asynchronous 
anyway, the offload overhead becomes even less relevant then.

I also think that zswap is going to be an important use case here, and 
there has been quite a bit of discussion about large folios and batching 
requests here. It would be cool if QAT or nx-gzip could be plugged into 
zswap.

    Simon

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators
  2025-12-03 10:15       ` Giovanni Cabiddu
@ 2025-12-04  9:59         ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2025-12-04  9:59 UTC (permalink / raw)
  To: Giovanni Cabiddu
  Cc: Christoph Hellwig, Jani Partanen, clm, dsterba, terrelln, herbert,
	linux-btrfs, linux-crypto, qat-linux, cyan, brian.will,
	weigang.li, senozhatsky

On Wed, Dec 03, 2025 at 10:15:29AM +0000, Giovanni Cabiddu wrote:
> The compression ratio with QAT-ZLIB-L9 is close to SW-ZSTD-L3 (lower is better).

Oh, right, this makes the numbers look much better.

> > All the while you significantly complicate the code.
> The changes in BTRFS are about 800 LOC and the core logic is straightforward:
> convert folios to scatterlists and invoke the acomp APIs for offloading
> compression/decompression.

That is quite a bit of code.  Even more so given that we're really
trying to kill the spread of scatterlists.

I think a better argument could be made if we had a generic compression
API that doesn't require structures like the scatterlist and handles
software compression without significant slowdowns.  I.e. something
replacing the btrfs internal method table for the different algorithms.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators
  2025-12-03 10:47       ` Simon Richter
@ 2025-12-04 10:06         ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2025-12-04 10:06 UTC (permalink / raw)
  To: Simon Richter
  Cc: Christoph Hellwig, Jani Partanen, Giovanni Cabiddu, clm, dsterba,
	terrelln, herbert, linux-btrfs, linux-crypto, qat-linux, cyan,
	brian.will, weigang.li, senozhatsky

On Wed, Dec 03, 2025 at 07:47:11PM +0900, Simon Richter wrote:
> Unpacking is quite a bit faster as well, to the point where unpacking the
> compressed block of 4GiB NUL bytes is faster than reading 4 GiB from
> /dev/zero for me.

Which makes me wonder why Intel isn't showing decompression numbers.
For file system workloads those are generally much more common, and
they are generally synchronous while writes often are not, and
compressible ones should be even less so.

> For acomp, I pretty much always expect offloading to be worth the overhead
> if hardware is available, simply because working with bitstreams is awkward
> on any architecture that isn't specifically designed for it, and when an
> algorithm requires building a dictionary, gathering statistics and two-pass
> processing, that becomes even more visible.

I would be really surprised if it makes sense for just a few kilobytes,
e.g. a single compressible btrfs extent.  I'd love to see numbers
proving me wrong, though.

> For ahash/acrypt, there is a trade-off here, and where it is depends on CPU
> features, the overhead of offloading, the overhead of receiving the result,
> and how much of that overhead can be mitigated by submitting a batch of
> operations.
>
> For the latter, we also need a better submission interface that actually
> allows large batches, and submitters to use that.

For acrypt Eric has shown pretty devastating numbers for offloads.  Which
doesn't surprise me at all given how well modern CPUs handle the
low-level building blocks for cryptographic algorithms.

> As an example of interface pain points: ahash has synchronous import/export
> functions, and no way for the driver to indicate that the result buffer must
> be reachable by DMA as well, so even with a mailbox interface that allows me
> to submit operations with low overhead, I need to synthesize state readbacks
> into an auxiliary buffer and request an interrupt to be delivered after each
> "update" operation, simply so I can have the state available in case it is
> requested, while normally I would only generate an interrupt after an
> "export" or "final" operation is completed (and also rate-limit these).

Which brings me to my previous point:  ahash, from the looks of it,
just looks like a pretty horrible interface.  So someone really needs
to come up with an easy to use interface that covers both hardware and
software needs.  Note that on the software side offloading to multiple
other CPU cores would be a natural fit and make it look a lot like an
async hardware offload.   You'd need to make it use the correct data
structures, e.g. bio_vecs provided for source and destination instead
of scatterlists, and clear definitions of addressability.  Bonus points
for supporting PCIe P2P transfers, which seems like a very natural fit
here.


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2025-12-04 10:06 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-28 19:04 [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Giovanni Cabiddu
2025-11-28 14:25 ` Giovanni Cabiddu
2025-11-28 16:13   ` Chris Mason
2025-11-28 19:04 ` [RFC PATCH 01/16] crypto: zstd - fix double-free in per-CPU stream cleanup Giovanni Cabiddu
2025-11-28 19:04 ` [RFC PATCH 02/16] Revert "crypto: qat - remove unused macros in qat_comp_alg.c" Giovanni Cabiddu
2025-11-28 19:04 ` [RFC PATCH 03/16] Revert "crypto: qat - Remove zlib-deflate" Giovanni Cabiddu
2025-11-28 19:04 ` [RFC PATCH 04/16] crypto: qat - use memcpy_*_sglist() in zlib deflate Giovanni Cabiddu
2025-11-28 19:04 ` [RFC PATCH 05/16] Revert "crypto: testmgr - Remove zlib-deflate" Giovanni Cabiddu
2025-11-28 19:04 ` [RFC PATCH 06/16] crypto: deflate - add support for deflate rfc1950 (zlib) Giovanni Cabiddu
2025-11-28 19:04 ` [RFC PATCH 07/16] crypto: scomp - Add setparam interface Giovanni Cabiddu
2025-11-28 19:04 ` [RFC PATCH 08/16] crypto: acomp " Giovanni Cabiddu
2025-11-28 19:04 ` [RFC PATCH 09/16] crypto: acomp - Add comp_params helpers Giovanni Cabiddu
2025-11-28 19:04 ` [RFC PATCH 10/16] crypto: acomp - add NUMA-aware stream allocation Giovanni Cabiddu
2025-11-28 19:04 ` [RFC PATCH 11/16] crypto: deflate - add support for compression levels Giovanni Cabiddu
2025-11-28 19:05 ` [RFC PATCH 12/16] crypto: zstd " Giovanni Cabiddu
2025-11-28 19:05 ` [RFC PATCH 13/16] crypto: qat - increase number of preallocated sgl descriptors Giovanni Cabiddu
2025-11-28 19:05 ` [RFC PATCH 14/16] crypto: qat - add support for zstd Giovanni Cabiddu
2025-11-28 19:05 ` [RFC PATCH 15/16] crypto: qat - add support for compression levels Giovanni Cabiddu
2025-11-28 19:05 ` [RFC PATCH 16/16] btrfs: add compression hw-accelerated offload Giovanni Cabiddu
2025-11-28 21:55   ` Qu Wenruo
2025-11-28 22:40     ` Giovanni Cabiddu
2025-11-28 23:59       ` Qu Wenruo
2025-11-29  0:23         ` Qu Wenruo
2025-12-01 14:32           ` Giovanni Cabiddu
2025-12-01 15:10             ` Giovanni Cabiddu
2025-12-01 20:57               ` Qu Wenruo
2025-12-01 22:18                 ` Giovanni Cabiddu
2025-12-01 23:13                   ` Qu Wenruo
2025-12-02 17:09                     ` Giovanni Cabiddu
2025-12-02 20:38                       ` Qu Wenruo
2025-12-02 22:37                         ` Giovanni Cabiddu
2025-12-02 22:59                           ` Qu Wenruo
2025-11-29  1:00         ` David Sterba
2025-11-29  1:08       ` David Sterba
2025-12-02  7:53 ` [RFC PATCH 00/16] btrfs: offload compression to hardware accelerators Christoph Hellwig
2025-12-02 15:46   ` Jani Partanen
2025-12-02 17:19     ` Giovanni Cabiddu
2025-12-03  7:00     ` Christoph Hellwig
2025-12-03 10:15       ` Giovanni Cabiddu
2025-12-04  9:59         ` Christoph Hellwig
2025-12-03 10:47       ` Simon Richter
2025-12-04 10:06         ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox