* [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Leonid Ravich @ 2026-06-15 11:14 UTC (permalink / raw)
To: Herbert Xu
Cc: Alasdair Kergon, Ard Biesheuvel, Eric Biggers, Jens Axboe,
Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
linux-block
This is v4, addressing Herbert's review of v3. Two architectural
changes:
- data_unit_size is now per-request (on struct skcipher_request)
rather than per-tfm. Reverts to the v1 placement.
- The crypto API auto-splits multi-data-unit requests when the
underlying algorithm does not advertise
CRYPTO_ALG_SKCIPHER_NATIVE_MULTI_DU. Consumers no longer test
for multi-DU support before submitting; setting data_unit_size
on any skcipher request whose algorithm uses the 128-bit LE
counter IV convention "just works".
These two changes shrink the series from 4 patches to 3 (the
generic xts(...) template needs no special handling - the
auto-splitter calls its single-DU encrypt/decrypt once per data
unit) and simplify the dm-crypt consumer (no advertise-flag check,
no per-tfm setup).
v3: https://lore.kernel.org/linux-crypto/20260601085641.16028-1-lravich@amazon.com/
v2: https://lore.kernel.org/linux-crypto/20260527065021.19525-1-lravich@amazon.com/
v1: https://lore.kernel.org/linux-crypto/20260519115955.27267-1-lravich@amazon.com/
The series adds a per-request "data unit size" to the skcipher API
so a caller can submit several data units (typically 512..4096-byte
sectors) sharing one starting IV in a single request. Algorithms
derive each data unit's IV from the caller-supplied IV by treating
it as a 128-bit little-endian counter and adding the data-unit
index, matching the layout produced by dm-crypt's plain64 IV mode
and by typical inline-encryption hardware.
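The per-data-unit IV derivation described above can be sketched in userspace C. This is only an illustration under stated assumptions: le128_add_index() is a hypothetical helper name, not a function from the series, and the byte-wise carry loop is one way among several to do the 128-bit little-endian addition.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the series): compute
 *   out = in + index
 * where the 16-byte IV is interpreted as a 128-bit little-endian
 * integer, i.e. byte 0 is the least significant byte.
 */
static void le128_add_index(uint8_t out[16], const uint8_t in[16],
                            uint64_t index)
{
    unsigned int carry = 0;

    for (int i = 0; i < 16; i++) {
        /* Only the low 8 bytes receive bits of the 64-bit index. */
        unsigned int add = (i < 8) ?
            (unsigned int)((index >> (8 * i)) & 0xff) : 0;
        unsigned int sum = in[i] + add + carry;

        out[i] = (uint8_t)sum;   /* keep the low 8 bits */
        carry = sum >> 8;        /* propagate overflow upward */
    }
}
```

With a plain64-style starting IV (64-bit little-endian sector number in bytes 0..7, zeros above), data unit n of the request would use the IV for sector + n under this convention.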
This mirrors the data_unit_size concept already exposed by
struct blk_crypto_config for inline encryption.
The first user is dm-crypt, which today issues one skcipher request
per sector and so pays a per-sector cost in request allocation,
callback dispatch, completion handling, and scatterlist setup.
Proof-of-concept performance numbers from the RFC reply [1]: +19%
throughput / -40% CPU on a single-core arm64 system with a hardware
XTS-AES-256 accelerator running fio 4 KiB sequential writes through
dm-crypt, when an out-of-tree arm64 xts driver advertises
CRYPTO_ALG_SKCIPHER_NATIVE_MULTI_DU. This series itself does not
include arch enablement; the fast path is opt-in per driver, the
slow path is universal via the auto-splitter.
The native fast path amortises both per-sector dispatch and per-sector
crypto setup across a bio - the measured win above, on an engine that
offloads the AES compute. The auto-splitter is for correctness and
reach: any consumer can set data_unit_size and get correct output with
the per-request allocation/callback/completion cost removed, but it
still issues one alg->encrypt per data unit, so on a software cipher it
saves only dispatch overhead (no throughput figure claimed - that is
hardware- and workload-dependent). What it guarantees unconditionally
is byte-identical output (Verification below) at O(entries + units),
walking the scatterlists with a pair of struct scatter_walk cursors
rather than rescanning from the head per unit.
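To illustrate the O(entries + units) bound, here is a minimal userspace sketch of a single-pass cursor walk. It does not use the kernel scatterwalk API: struct seg and struct cursor are hypothetical stand-ins for a scatterlist entry and a struct scatter_walk, and the step counter exists only to make the bound visible.

```c
#include <stddef.h>

struct seg { size_t len; };                 /* stand-in for one sg entry */
struct cursor { size_t idx; size_t off; };  /* stand-in for scatter_walk */

/*
 * Advance the cursor by nbytes. Every loop iteration either finishes
 * the current data unit or exhausts the current segment, so over a
 * whole request the iteration count is at most units + entries.
 * Returns 0 on success, -1 if the segments run out.
 */
static int cursor_advance(const struct seg *segs, size_t nsegs,
                          struct cursor *c, size_t nbytes, size_t *steps)
{
    while (nbytes) {
        if (c->idx >= nsegs)
            return -1;
        size_t avail = segs[c->idx].len - c->off;
        size_t take = nbytes < avail ? nbytes : avail;

        c->off += take;
        nbytes -= take;
        if (c->off == segs[c->idx].len) {   /* segment consumed */
            c->idx++;
            c->off = 0;
        }
        (*steps)++;
    }
    return 0;
}

/*
 * Walk `total` bytes in data units of `du` bytes, never rewinding.
 * Returns the number of data units visited, or -1 on short input.
 */
static long walk_units(const struct seg *segs, size_t nsegs,
                       size_t total, size_t du, size_t *steps)
{
    struct cursor c = { 0, 0 };
    long units = 0;

    *steps = 0;
    for (size_t done = 0; done < total; done += du) {
        if (cursor_advance(segs, nsegs, &c, du, steps))
            return -1;
        units++;   /* a real splitter would invoke the cipher here */
    }
    return units;
}
```

A per-unit rescan from the head would instead revisit all earlier entries for each unit, which is where the quadratic behaviour under fragmentation comes from.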
[1] https://lore.kernel.org/linux-crypto/20260428101225.24316-1-lravich@amazon.com/
Changes since v3
----------------
- data_unit_size moved from struct crypto_skcipher (per-tfm) to
struct skcipher_request (per-request). (Herbert)
- Crypto API auto-splits multi-data-unit requests when the algorithm
does not advertise CRYPTO_ALG_SKCIPHER_NATIVE_MULTI_DU. Drops the
per-tfm setter/probe in favour of a single
skcipher_request_set_data_unit_size() usable by every consumer.
(Herbert)
- CRYPTO_ALG_SKCIPHER_NATIVE_MULTI_DU is a type-specific cra_flags
bit (0x01000000) in crypto/internal/skcipher.h, not a generic bit
in the public header; drivers set it to opt OUT of auto-splitting.
- The auto-splitter advances through src/dst with a pair of struct
scatter_walk cursors (scatterwalk_start / scatterwalk_get_sglist /
scatterwalk_skip) instead of scatterwalk_ffwd() per unit, which
rescans from the head and is O(units^2) under fragmentation; the
cursors give a single linear pass. (Eric)
- crypto_skcipher_validate_multi_du() reports -EINVAL for a malformed
geometry (du not a power of two, cryptlen not a positive multiple)
and -EOPNOTSUPP for a target that cannot do multi-DU (ivsize != 16,
lskcipher, or async without the native flag), so a caller can fall
back. Gates the native path too, not just the auto-splitter.
(Eric)
- testmgr cross-checks the batched dispatch against an independent
N x single-DU reference with LE128-walked IVs over a fragmented
scatterlist (pins the IV convention and exercises the cursor),
round-trips, and checks IV preservation. Ineligible algorithms
skip via -EOPNOTSUPP; a real mismatch returns -EBADMSG.
- dm-crypt enables batching only for IV modes flagged sector_iv_le128
(a new bool on struct crypt_iv_operations, set on plain64 only),
plus ivsize 16, sync, single-tfm, no integrity, no post() hook. The
flag replaces a hardcoded plain64 pointer-compare, so eligibility is
a self-documenting property of the IV mode rather than a special
case. plain stays excluded (its 32-bit counter wraps differently
past 2^32 sectors). Sets req->data_unit_size = sector_size and
submits; -EOPNOTSUPP/-EAGAIN fall back to the per-sector path.
Mikulas's v2 Reviewed-by is dropped as the dm-crypt patch was
substantially rewritten.
- The generic xts(...) template needs no separate handling, dropping
the v3 crypto/xts.c patch (4 -> 3 patches).
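The error split described for crypto_skcipher_validate_multi_du() above can be sketched as a standalone predicate. This is a userspace approximation, not the patch: struct alg_caps and validate_multi_du() are hypothetical names, the order of the checks is an assumption, and a data unit size of 0 is treated as malformed here on the assumption that the helper is only reached for multi-DU requests.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the algorithm properties the check needs. */
struct alg_caps {
    unsigned int ivsize;
    bool is_lskcipher;
    bool is_async;
    bool native_multi_du;   /* models CRYPTO_ALG_SKCIPHER_NATIVE_MULTI_DU */
};

static int validate_multi_du(const struct alg_caps *alg,
                             size_t cryptlen, size_t du)
{
    /* Malformed geometry: the caller passed bad sizes. */
    if (du == 0 || (du & (du - 1)) != 0)
        return -EINVAL;             /* not a power of two */
    if (cryptlen == 0 || (cryptlen & (du - 1)) != 0)
        return -EINVAL;             /* not a positive multiple of du */

    /* Valid geometry, but this target cannot do multi-DU. */
    if (alg->ivsize != 16 || alg->is_lskcipher)
        return -EOPNOTSUPP;
    if (alg->is_async && !alg->native_multi_du)
        return -EOPNOTSUPP;

    return 0;
}
```

Keeping the two error codes distinct is what lets a caller treat -EOPNOTSUPP as "fall back to one request per data unit" while still surfacing -EINVAL as a caller bug.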
Design overview
---------------
* Patch 1 adds the data_unit_size field, the setter, the
CRYPTO_ALG_SKCIPHER_NATIVE_MULTI_DU flag, and the auto-splitter in
crypto_skcipher_encrypt()/decrypt(). skcipher_request_set_tfm()
resets the field so a reused request defaults to single-DU.
* Patch 2 adds the testmgr multi-DU test (every ivsize == 16
skcipher).
* Patch 3 turns dm-crypt batching on automatically under the
conditions above and sets req->data_unit_size = cc->sector_size.
This series does NOT add the capability flag to any arch driver; the
auto-splitter ensures correctness without that opt-in.
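The dm-crypt activation conditions listed in the changelog can be read as a single predicate. The sketch below is a userspace approximation only: sector_iv_le128 is the flag named in this cover letter, while struct batch_inputs, its other field names, and can_batch() are hypothetical and do not correspond to actual dm-crypt structures.

```c
#include <stdbool.h>

/* Hypothetical flattened view of the state dm-crypt would consult. */
struct batch_inputs {
    bool sector_iv_le128;     /* IV mode flag; set on plain64 only */
    unsigned int ivsize;      /* cipher IV size in bytes */
    bool sync_tfm;            /* synchronous transform */
    unsigned int tfm_count;   /* number of tfms in use */
    bool integrity;           /* integrity (AEAD/tag) configured */
    bool has_post_hook;       /* IV mode defines a post() hook */
};

/* Batching is enabled only when every listed condition holds. */
static bool can_batch(const struct batch_inputs *in)
{
    return in->sector_iv_le128 &&
           in->ivsize == 16 &&
           in->sync_tfm &&
           in->tfm_count == 1 &&
           !in->integrity &&
           !in->has_post_hook;
}
```

Any mapping that fails one of these checks would stay on the existing per-sector path.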
Verification
------------
A regression protocol is included in the project tree
(.claude/regression-protocol.md, .claude/run-regression.sh). The
reference run reports 12/12 PASS:
- x86 + arm64 build clean; checkpatch.pl --strict clean.
- testmgr multi-DU: PASS for every ivsize == 16 skcipher in-tree.
- dm-crypt activation gating: plain64 enabled; essiv:sha256 /
plain64be / plain fall back.
- dm-crypt round-trip plain64 with multi-DU via the auto-splitter
(xts-aes-aesni, no native flag): PASS.
- dm-crypt round-trip essiv:sha256 (per-sector path): PASS.
- dm-crypt low-memory (mem=128M): PASS, no OOM kill.
- Byte-equivalence: 256 MB of ciphertext through the auto-splitter
is bit-identical to an unpatched axboe/for-next baseline (sha256
4913910b1aa6f8859fcb8f4adec20230274993a3ade8f4dd0140a323dc43efc0).
- arm64 functional under qemu-aarch64: PASS.
Leonid Ravich (3):
  crypto: skcipher - add per-request data_unit_size with auto-splitting
  crypto: testmgr - test for multi-data-unit dispatch
  dm crypt: batch all sectors of a bio per crypto request

 crypto/skcipher.c                  | 132 +++++++++++++++++++
 crypto/testmgr.c                   | 192 +++++++++++++++++++++++++
 drivers/md/dm-crypt.c              | 215 +++++++++++++++++++++++++++--
 include/crypto/internal/skcipher.h |  10 ++
 include/crypto/skcipher.h          |  28 ++++
 5 files changed, 569 insertions(+), 8 deletions(-)
base-commit: a8cafdf8c949f17c92eca0045532e88ac0dac30d
--
2.47.3
* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Eric Biggers @ 2026-06-15 22:53 UTC (permalink / raw)
To: Leonid Ravich
Cc: Herbert Xu, Alasdair Kergon, Ard Biesheuvel, Jens Axboe,
Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
linux-block
On Mon, Jun 15, 2026 at 11:14:56AM +0000, Leonid Ravich wrote:
> The series adds a per-request "data unit size" to the skcipher API
> so a caller can submit several data units (typically 512..4096-byte
> sectors) sharing one starting IV in a single request. Algorithms
> derive each data unit's IV from the caller-supplied IV by treating
> it as a 128-bit little-endian counter and adding the data-unit
> index, matching the layout produced by dm-crypt's plain64 IV mode
> and by typical inline-encryption hardware.
>
> This mirrors the data_unit_size concept already exposed by
> struct blk_crypto_config for inline encryption.
>
> The first user is dm-crypt, which today issues one skcipher request
> per sector and so pays a per-sector cost in request allocation,
> callback dispatch, completion handling, and scatterlist setup.
>
> Proof-of-concept performance numbers from the RFC reply [1]: +19%
> throughput / -40% CPU on a single-core arm64 system with a hardware
> XTS-AES-256 accelerator running fio 4 KiB sequential writes through
> dm-crypt, when an out-of-tree arm64 xts driver advertises
> CRYPTO_ALG_SKCIPHER_NATIVE_MULTI_DU. This series itself does not
> include arch enablement; the fast path is opt-in per driver, the
> slow path is universal via the auto-splitter.
>
> The native fast path amortises both per-sector dispatch and per-sector
> crypto setup across a bio - the measured win above, on an engine that
> offloads the AES compute. The auto-splitter is for correctness and
> reach: any consumer can set data_unit_size and get correct output with
> the per-request allocation/callback/completion cost removed, but it
> still issues one alg->encrypt per data unit, so on a software cipher it
> saves only dispatch overhead (no throughput figure claimed - that is
> hardware- and workload-dependent). What it guarantees unconditionally
> is byte-identical output (Verification below) at O(entries + units),
> walking the scatterlists with a pair of struct scatter_walk cursors
> rather than rescanning from the head per unit.
So in other words, this series slows down dm-crypt and crypto_skcipher
for everyone to optimize for an out-of-tree driver. And there's also no
benchmark showing that your driver is even worth it over just using the
CPU.
- Eric
* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Herbert Xu @ 2026-06-16 4:13 UTC (permalink / raw)
To: Eric Biggers
Cc: Leonid Ravich, Alasdair Kergon, Ard Biesheuvel, Jens Axboe,
Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
linux-block
On Mon, Jun 15, 2026 at 03:53:17PM -0700, Eric Biggers wrote:
>
> So in other words, this series slows down dm-crypt and crypto_skcipher
> for everyone to optimize for an out-of-tree driver. And there's also no
> benchmark showing that your driver is even worth it over just using the
> CPU.
There is no reason why the software fallback should be slower
than the status quo. Existing callers of the Crypto API will
be issuing one indirect function call per data unit. With the
new scheme, the indirect call per unit moves from the caller
into the Crypto API.
In fact, we could move it down further and improve upon the
status quo by splitting the data in each algorithm implementation
so that the calls per unit become direct function calls and only
the overall call into the Crypto API remains indirect.
But yes it would be nice to provide numbers for the fallback
path to verify that we didn't get this case terribly wrong.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Eric Biggers @ 2026-06-16 4:50 UTC (permalink / raw)
To: Herbert Xu
Cc: Leonid Ravich, Alasdair Kergon, Ard Biesheuvel, Jens Axboe,
Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
linux-block
On Tue, Jun 16, 2026 at 12:13:03PM +0800, Herbert Xu wrote:
> On Mon, Jun 15, 2026 at 03:53:17PM -0700, Eric Biggers wrote:
> >
> > So in other words, this series slows down dm-crypt and crypto_skcipher
> > for everyone to optimize for an out-of-tree driver. And there's also no
> > benchmark showing that your driver is even worth it over just using the
> > CPU.
>
> There is no reason why the software fallback should be slower
> than the status quo. Existing callers of the Crypto API will
> be issuing one indirect function call per data unit. With the
> new scheme, the indirect call per unit moves from the caller
> into the Crypto API.
Have you checked the code? This patchset adds overhead in multiple
places. Dynamically allocating multiple scatterlists and then parsing
them, adding a new field to skcipher_request for everyone, new checks in
crypto_skcipher_en/decrypt for everyone, new checks to validate the data
unit size that the caller knew was valid in the first place, etc.
> In fact, we could move it down further and improve upon the
> status quo by splitting the data in each algorithm implemntation
> so that the calls per unit become direct function calls and only
> the overall call into the Crypto API remains indirect.
That's not what this patchset does. But also, as we know, a better way
to eliminate "Crypto API" overhead is to call the algorithms directly.
- Eric
* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
From: Herbert Xu @ 2026-06-16 4:53 UTC (permalink / raw)
To: Eric Biggers
Cc: Leonid Ravich, Alasdair Kergon, Ard Biesheuvel, Jens Axboe,
Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
linux-block
On Mon, Jun 15, 2026 at 09:50:23PM -0700, Eric Biggers wrote:
>
> Have you checked the code? This patchset adds overhead in multiple
> places. Dynamically allocating multiple scatterlists and then parsing
> them, adding a new field to skcipher_request for everyone, new checks in
> crypto_skcipher_en/decrypt for everyone, new checks to validate the data
> unit size that the caller knew was valid in the first place, etc.
No I have not :)
I'm just stating the general principle.
Of course I will not apply the patch-set until I have reviewed it
properly.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH v4 0/3] crypto: skcipher - per-request multi-data-unit batching
[not found] <20260622182328.GB1250822@google.com>
From: Leonid Ravich @ 2026-06-24 19:52 UTC (permalink / raw)
To: Eric Biggers, Herbert Xu
Cc: Alasdair Kergon, Ard Biesheuvel, Jens Axboe, dm-devel,
linux-block, linux-crypto
On Mon, Jun 22, 2026 at 06:23:28PM +0000, Eric Biggers wrote:
> I don't think there's a path forward without an in-tree user that's
> shown to be worthwhile over just using the acceleration built directly
> into the CPU. As well as confirmation of no regression to existing
> users, including in cases where the inline sg list can't be used.
Agreed. Proposing a smaller v5 that meets the no-regression bar now and
leaves "beats the CPU" to a follow-up with a real in-tree user.
dm-crypt submits one request per contiguous bio segment (a single
bio_vec) with data_unit_size = sector_size, instead of one per sector.
E.g. default sector_size 512 with a 4 KiB bio_vec: one request of 8
data units, which the fallback splitter walks as 8 per-sector calls --
dm-crypt no longer open-codes the per-data-unit loop itself.
- Uses only the existing inline sg_in[0]/sg_out[0] entry. No per-bio
scatterlist, no kmalloc -- the "inline sg list can't be used" case
doesn't exist here, so there's nothing to regress.
- For a non-native algorithm the core auto-splits into the same
per-sector calls dm-crypt makes today: identical output and cost.
This is what Herbert predicted -- the per-unit indirect call just
moves from the caller into the API; the fallback is no slower.
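The request-count arithmetic behind the 4 KiB example above can be sketched as follows. This is illustrative userspace C, not dm-crypt code: struct req_counts and count_requests() are hypothetical, and each segment length is assumed to be a multiple of sector_size.

```c
#include <stddef.h>

struct req_counts {
    size_t per_sector;    /* requests issued today: one per sector */
    size_t per_segment;   /* requests under the proposal: one per segment */
};

/*
 * Compare how many skcipher requests a bio would need under the
 * per-sector scheme and under the proposed per-segment scheme.
 * Assumes every seg_len[i] is a multiple of sector_size.
 */
static struct req_counts count_requests(const size_t *seg_len, size_t nsegs,
                                        size_t sector_size)
{
    struct req_counts rc = { 0, 0 };

    for (size_t i = 0; i < nsegs; i++) {
        rc.per_sector += seg_len[i] / sector_size;  /* data units in segment */
        rc.per_segment += 1;
    }
    return rc;
}
```

For a single 4096-byte segment with sector_size 512 this gives 8 per-sector requests against 1 per-segment request carrying 8 data units. The number of cipher invocations in the fallback path is unchanged; only the number of requests drops.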
So it stands on no-regression alone, with no software throughput claim.
What it adds is the interface a native one-pass driver needs. I'd land
that now and bring a native offload user + numbers as the follow-up,
rather than block the interface on the driver.
Acceptable? If so I'll respin v5 as the minimal version.
Thanks,
Leonid