Linux block layer


* Re: [PATCH] Revert "nbd: freeze the queue while we're adding connections"
From: yangerkun @ 2026-05-27  3:52 UTC (permalink / raw)
  To: josef, axboe; +Cc: linux-block, nbd, yangerkun
In-Reply-To: <20260526115253.746625-1-yangerkun@huawei.com>



On 2026/5/26 19:52, Yang Erkun wrote:
> This reverts commit b98e762e3d71e893b221f871825dc64694cfb258.
> 
> Commit b98e762e3d71 ("nbd: freeze the queue while we're adding
> connections") added blk_mq_freeze_queue/blk_mq_unfreeze_queue in
> nbd_add_socket() to protect krealloc(config->socks) from concurrent I/O
> that could cause a Use-After-Free.
> 
> However, analysis shows that in all current code paths, concurrent I/O
> cannot actually reach nbd_add_socket():
> 
> 1. nbd_genl_connect() path:
>     nbd_add_socket() is called first, and nbd_start_device() -- which
>     starts the queue and enables I/O -- is called only after all sockets
>     have been added. So the freeze/unfreeze runs against an idle queue,
>     marking then waiting on a percpu_ref that is already zero, and then
>     resurrecting it -- a pure no-op that burns an RCU grace period per
>     socket on multi-core systems.
> 
> 2. nbd_ioctl(NBD_SET_SOCK) path:
>     The task_setup check enforces that only the thread which performed
>     the first NBD_SET_SOCK can call NBD_SET_SOCK again. That thread is
>     blocked in NBD_DO_IT's wait_event_interruptible, so it cannot issue
>     another NBD_SET_SOCK concurrently with I/O. Other threads are
>     rejected by the task_setup != current check.

Apologies, but the analysis provided here is inadequate. A 
use-after-free (UAF) can still occur in the following scenario:

task A: ioctl NBD_SET_SOCK => task_setup = A
task B: ioctl NBD_DO_IT    => nbd_start_device_ioctl, nbd can receive IO
task A: ioctl NBD_SET_SOCK => task_setup == A, so this call can race
with concurrent IO!

This patch is misleading, please disregard it. Sorry once again.
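
The interleaving above can be modelled in a few lines of userspace C. This
is only a toy sketch: struct nbd_model, model_set_sock() and model_do_it()
are hypothetical names, tasks are plain integers, the NBD_RT_BOUND test is
left out, and returning -EBUSY to a rejected caller is an assumption. It
shows only that the ownership check keys on who issued the first
NBD_SET_SOCK, not on who is blocked in NBD_DO_IT.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model, not kernel code: tasks are small non-zero integers. */
struct nbd_model {
	int task_setup;    /* 0 means "no owner yet" */
	bool io_running;   /* set once some task has entered NBD_DO_IT */
	int num_socks;
};

/*
 * Models the ownership check on the ioctl path (assumed simplification).
 * Returns 0 on success or a negative errno. *raced is set when the socket
 * array would be resized while I/O may already be in flight.
 */
static int model_set_sock(struct nbd_model *nbd, int current_task, bool *raced)
{
	assert(current_task != 0);

	if (!nbd->task_setup)
		nbd->task_setup = current_task;
	if (nbd->task_setup != current_task)
		return -EBUSY;

	*raced = nbd->io_running;
	nbd->num_socks++;
	return 0;
}

/* Models NBD_DO_IT as used in the scenario: called by a non-owner task. */
static void model_do_it(struct nbd_model *nbd)
{
	nbd->io_running = true;
}
```

With task A as 1 and task B as 2, the sequence model_set_sock(A),
model_do_it() from B, model_set_sock(A) succeeds on the last call with
*raced set, which is the window the queue freeze was covering.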

> 
> 3. nbd_genl_reconfigure() does not call nbd_add_socket() at all; it
>     uses nbd_reconnect_socket() which replaces a dead socket in-place
>     without reallocating config->socks.
> 
> Therefore the freeze/unfreeze provides no actual protection in any
> reachable code path, while imposing the cost of blk_mq_freeze_queue
> (percpu_ref_kill + RCU grace period wait + percpu_ref_resurrect) on
> every socket addition during device setup[1].
> 
> Revert the change to eliminate the unnecessary overhead.
> 
> Link: https://lore.kernel.org/all/20260327091223.4147956-1-leo.lilong@huaweicloud.com/ [1]
> Signed-off-by: Yang Erkun <yangerkun@huawei.com>
> ---
>   drivers/block/nbd.c | 11 +----------
>   1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index fe63f3c55d0d..9033d996c9a9 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -1245,22 +1245,16 @@ static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg,
>   	struct socket *sock;
>   	struct nbd_sock **socks;
>   	struct nbd_sock *nsock;
> -	unsigned int memflags;
>   	int err;
>   
>   	/* Arg will be cast to int, check it to avoid overflow */
>   	if (arg > INT_MAX)
>   		return -EINVAL;
> +
>   	sock = nbd_get_socket(nbd, arg, &err);
>   	if (!sock)
>   		return err;
>   
> -	/*
> -	 * We need to make sure we don't get any errant requests while we're
> -	 * reallocating the ->socks array.
> -	 */
> -	memflags = blk_mq_freeze_queue(nbd->disk->queue);
> -
>   	if (!netlink && !nbd->task_setup &&
>   	    !test_bit(NBD_RT_BOUND, &config->runtime_flags))
>   		nbd->task_setup = current;
> @@ -1300,12 +1294,9 @@ static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg,
>   	INIT_WORK(&nsock->work, nbd_pending_cmd_work);
>   	socks[config->num_connections++] = nsock;
>   	atomic_inc(&config->live_connections);
> -	blk_mq_unfreeze_queue(nbd->disk->queue, memflags);
> -
>   	return 0;
>   
>   put_socket:
> -	blk_mq_unfreeze_queue(nbd->disk->queue, memflags);
>   	sockfd_put(sock);
>   	return err;
>   }



* Re: [PATCH V4 0/3] md/nvme: Enable PCI P2PDMA support for RAID0 and NVMe Multipath
From: Chaitanya Kulkarni @ 2026-05-27  4:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: song@kernel.org, yukuai@fnnas.com, Christoph Hellwig,
	linan122@huawei.com, kbusch@kernel.org, sagi@grimberg.me,
	linux-block@vger.kernel.org, linux-raid@vger.kernel.org,
	linux-nvme@lists.infradead.org, Kiran Modukuri
In-Reply-To: <b89e372e-3068-4c26-9552-13e6853ba000@kernel.dk>

On 5/26/26 14:51, Jens Axboe wrote:
>> There is outstanding work I want to send out based on this one.
> Out standing, outstanding, or both? 🙂

Just a few fixes :)

>> May I please request you to merge this patch series ?
> Was waiting on the md parts to get reviewed, but I missed that Xiao Ni
> already did. I'll queue it up.
>
> -- Jens Axboe

Thanks a lot.

-ck




* Re: [PATCH] zram: fix use-after-free in zram_bvec_write_partial()
From: Cunlong Li @ 2026-05-27  4:48 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Minchan Kim, Jens Axboe, Andrew Morton, linux-kernel, linux-block,
	Christoph Hellwig, stable
In-Reply-To: <ahZov_99kMxaTH2P@google.com>

On Wed, May 27, 2026 at 12:45:37PM +0900, Sergey Senozhatsky wrote:
> On (26/05/27 11:26), Cunlong Li wrote:
> > zram_read_page() picks the sync or async backing device read path
> > based on whether the parent bio is NULL.  zram_bvec_write_partial()
> > passes its parent bio down, so for ZRAM_WB slots the read is
> > dispatched asynchronously and zram_read_page() returns 0 while the
> > bio is still in flight.  The caller then runs memcpy_from_bvec(),
> > zram_write_page() and __free_page() on the buffer, leaving the
> > async read to write into a freed page.
> > 
> > zram_bvec_read_partial() was switched to NULL in commit 4e3c87b9421d
> > ("zram: fix synchronous reads") for the same reason; the
> > write_partial counterpart was missed.
> > 
> > Fixes: 4e3c87b9421d ("zram: fix synchronous reads")
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Cunlong Li <shenxiaogll@gmail.com>
> > ---
> >  drivers/block/zram/zram_drv.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index aebc710f0d6a..b23a8bbb687c 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -2333,7 +2333,7 @@ static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
> >  	if (!page)
> >  		return -ENOMEM;
> >  
> > -	ret = zram_read_page(zram, page, index, bio);
> > +	ret = zram_read_page(zram, page, index, NULL);
> 
> Sounds like zram_bvec_write_partial() doesn't need bio parameter then?

Right -- v2 follows up with a cleanup patch that drops the bio
parameter from both zram_bvec_write_partial() and zram_bvec_write().

Will send v2 shortly.

Thanks,
Cunlong


* [PATCH v2 0/2] zram: fix UAF in zram_bvec_write_partial() and drop dead bio plumbing
From: Cunlong Li @ 2026-05-27  4:49 UTC (permalink / raw)
  To: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Andrew Morton
  Cc: Christoph Hellwig, linux-block, linux-mm, linux-kernel,
	Cunlong Li, stable

Patch 1 fixes a use-after-free in zram_bvec_write_partial() that
happens on PAGE_SIZE > 4K configurations when a partial write hits a
ZRAM_WB slot.

Patch 2 is a follow-up cleanup that drops the now-unused bio parameter
from zram_bvec_write_partial() and zram_bvec_write(), no functional
change.

Patch 1 is tagged for stable; patch 2 is not.

Signed-off-by: Cunlong Li <shenxiaogll@gmail.com>
---
Changes in v2:
- Add patch 2: drop the now-unused bio parameter from
  zram_bvec_write_partial() and zram_bvec_write(), per Sergey's
  suggestion on v1.
- Link to v1: https://lore.kernel.org/r/20260527-zram-v1-1-ce1acb2bfaf9@gmail.com

---
Cunlong Li (2):
      zram: fix use-after-free in zram_bvec_write_partial()
      zram: drop unused bio parameter from write helpers

 drivers/block/zram/zram_drv.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
---
base-commit: e8c2f9fdadee7cbc75134dc463c1e0d856d6e5c7
change-id: 20260526-zram-b01425b7e6c6

Best regards,
-- 
Cunlong Li <shenxiaogll@gmail.com>



* [PATCH v2 1/2] zram: fix use-after-free in zram_bvec_write_partial()
From: Cunlong Li @ 2026-05-27  4:49 UTC (permalink / raw)
  To: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Andrew Morton
  Cc: Christoph Hellwig, linux-block, linux-mm, linux-kernel,
	Cunlong Li, stable
In-Reply-To: <20260527-zram-v2-0-2fb84b054b5c@gmail.com>

zram_read_page() picks the sync or async backing device read path
based on whether the parent bio is NULL.  zram_bvec_write_partial()
passes its parent bio down, so for ZRAM_WB slots the read is
dispatched asynchronously and zram_read_page() returns 0 while the
bio is still in flight.  The caller then runs memcpy_from_bvec(),
zram_write_page() and __free_page() on the buffer, leaving the
async read to write into a freed page.

zram_bvec_read_partial() was switched to NULL in commit 4e3c87b9421d
("zram: fix synchronous reads") for the same reason; the
write_partial counterpart was missed.
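
A minimal userspace model of the failure mode described above, assuming
hypothetical names (struct model_bio, struct model_read, model_read_page(),
model_page_safe_to_free()) and a flag in place of a real in-flight bio. It
illustrates only why a return value of 0 does not mean the buffer is ready
when a parent bio is passed down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins, not the zram or block layer types. */
struct model_bio { int unused; };

struct model_read {
	bool in_flight;   /* the backing-device read has not completed yet */
	bool page_filled; /* the destination page holds valid data */
};

/*
 * Mirrors the sync/async choice described for written-back slots:
 * a NULL parent means "wait for the read", a non-NULL parent means
 * "chain the read to the parent and return immediately".
 */
static int model_read_page(struct model_read *rd, struct model_bio *parent)
{
	assert(rd != NULL);

	if (parent) {
		rd->in_flight = true;   /* completes later, after we return */
		rd->page_filled = false;
		return 0;
	}
	rd->in_flight = false;
	rd->page_filled = true;         /* synchronous: data is ready now */
	return 0;
}

/* True when the caller may safely modify and then free the page. */
static bool model_page_safe_to_free(const struct model_read *rd)
{
	return !rd->in_flight && rd->page_filled;
}
```

In this model both calls return 0, but only the NULL-parent call leaves the
page safe to copy into and free, which is the distinction the one-line fix
relies on.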

Fixes: 4e3c87b9421d ("zram: fix synchronous reads")
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Cunlong Li <shenxiaogll@gmail.com>
---
 drivers/block/zram/zram_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index aebc710f0d6a..b23a8bbb687c 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2333,7 +2333,7 @@ static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
 	if (!page)
 		return -ENOMEM;

-	ret = zram_read_page(zram, page, index, bio);
+	ret = zram_read_page(zram, page, index, NULL);
 	if (!ret) {
 		memcpy_from_bvec(page_address(page) + offset, bvec);
 		ret = zram_write_page(zram, page, index);

-- 
2.30.2


* [PATCH v2 2/2] zram: drop unused bio parameter from write helpers
From: Cunlong Li @ 2026-05-27  4:49 UTC (permalink / raw)
  To: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Andrew Morton
  Cc: Christoph Hellwig, linux-block, linux-mm, linux-kernel,
	Cunlong Li
In-Reply-To: <20260527-zram-v2-0-2fb84b054b5c@gmail.com>

After the previous fix, zram_bvec_write_partial() always passes NULL
to zram_read_page() and no longer needs the parent bio.  Mirror the
read side (zram_bvec_read_partial() has not taken a bio since commit
4e3c87b9421d ("zram: fix synchronous reads")) and drop the parameter
from zram_bvec_write_partial() and zram_bvec_write().

No functional change.

Signed-off-by: Cunlong Li <shenxiaogll@gmail.com>
---
 drivers/block/zram/zram_drv.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b23a8bbb687c..66347915a2cc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2325,7 +2325,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
  * This is a partial IO. Read the full page before writing the changes.
  */
 static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
-				   u32 index, int offset, struct bio *bio)
+				   u32 index, int offset)
 {
 	struct page *page = alloc_page(GFP_NOIO);
 	int ret;
@@ -2343,10 +2343,10 @@ static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
 }
 
 static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
-			   u32 index, int offset, struct bio *bio)
+			   u32 index, int offset)
 {
 	if (is_partial_io(bvec))
-		return zram_bvec_write_partial(zram, bvec, index, offset, bio);
+		return zram_bvec_write_partial(zram, bvec, index, offset);
 	return zram_write_page(zram, bvec->bv_page, index);
 }
 
@@ -2743,7 +2743,7 @@ static void zram_bio_write(struct zram *zram, struct bio *bio)
 
 		bv.bv_len = min_t(u32, bv.bv_len, PAGE_SIZE - offset);
 
-		if (zram_bvec_write(zram, &bv, index, offset, bio) < 0) {
+		if (zram_bvec_write(zram, &bv, index, offset) < 0) {
 			atomic64_inc(&zram->stats.failed_writes);
 			bio->bi_status = BLK_STS_IOERR;
 			break;

-- 
2.30.2



* Re: [PATCHv2 1/2] block: export passthrough stats enabled
From: Nilay Shroff @ 2026-05-27  6:15 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme; +Cc: axboe, hch, Keith Busch
In-Reply-To: <20260526153921.2402015-2-kbusch@meta.com>

On 5/26/26 9:09 PM, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> A user can enable io accounting for passthrough requests, so export the
> helper that checks if the request should be tracked. This will enable
> stacking drivers to report iostats for passthrough workloads.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>   block/blk-mq.c         | 32 +-------------------------------
>   include/linux/blk-mq.h | 30 ++++++++++++++++++++++++++++++
>   2 files changed, 31 insertions(+), 31 deletions(-)

Looks good to me.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>


* Re: [PATCHv2 2/2] nvme: add support multipath passthrough iostats
From: Nilay Shroff @ 2026-05-27  6:15 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme; +Cc: axboe, hch, Keith Busch
In-Reply-To: <20260526153921.2402015-3-kbusch@meta.com>

On 5/26/26 9:09 PM, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Don't skip io accounting for passthrough commands if the user enabled
> tracking these.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>   drivers/nvme/host/ioctl.c     | 4 ++++
>   drivers/nvme/host/multipath.c | 5 ++++-
>   2 files changed, 8 insertions(+), 1 deletion(-)

Looks good to me.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>


* Re: [PATCH] block: Add bvec_folio()
From: Christoph Hellwig @ 2026-05-27  6:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-kernel,
	io-uring, linux-mm, Leon Romanovsky
In-Reply-To: <ahXcsrxUFfzoVCOr@casper.infradead.org>

On Tue, May 26, 2026 at 06:47:30PM +0100, Matthew Wilcox wrote:
> How about:
> 
> /**
>  * bvec_folio - Return the first folio referenced by this bvec
>  * @bv: bvec to access
>  *
>  * bvecs can contain non-folio memory, so this should only be called by
>  * the creator of the bvec; drivers have no business looking at the owner
>  * of the memory.  It may not even be the right interface for the caller
>  * to use as bvecs can span multiple folios.  You may be better off using
>  * something like bio_for_each_folio_all() which iterates over all folios.
>  */

Sounds good, although I'd capitalize the first word in the sentence.
(Not that anyone should follow my spelling advice in general)



* Re: [PATCH] bvec: make the bvec_iter helpers inline functions
From: Christoph Hellwig @ 2026-05-27  6:39 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, axboe, linux-block
In-Reply-To: <ahX50HKkPtN7rNEq@kbusch-mbp>

On Tue, May 26, 2026 at 01:51:44PM -0600, Keith Busch wrote:
> On Tue, May 26, 2026 at 09:00:27AM +0200, Christoph Hellwig wrote:
> > -#define __bvec_iter_bvec(bvec, iter)	(&(bvec)[(iter).bi_idx])
> > +static __always_inline const struct bio_vec *
> > +__bvec_iter_bvec(const struct bio_vec *bvecs, const struct bvec_iter iter)
> > +{
> > +	return bvecs + iter.bi_idx;
> > +}
> 
> There's a couple drivers, nvme-tcp and loop, that call this without the
> const qualifier, so this will produce new warnings. The nvme-tcp one is
> simpler to fix by just adding the 'const', where loop looks like it
> needs a little more consideration to get there, but still doable.

Yeah.  I actually had these fixes but sent out just HEAD^..HEAD instead
of including the cleanups for them, which basically copy the nicer
code pattern from the zloop driver.


* [PATCH v2 0/4] crypto: skcipher - per-tfm multi-data-unit batching
From: Leonid Ravich @ 2026-05-27  6:50 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S . Miller, Mike Snitzer, Mikulas Patocka, Alasdair Kergon,
	Ard Biesheuvel, Eric Biggers, Jens Axboe, Horia Geanta,
	Gilad Ben-Yossef, linux-crypto, dm-devel, linux-block

This is v2 of the multi-data-unit skcipher request series, addressing
review feedback from Mikulas Patocka on v1.

v1: https://lore.kernel.org/linux-crypto/20260519115955.27267-1-lravich@amazon.com/

The series adds a per-tfm "data unit size" to the skcipher API so a
caller can submit several data units in one crypto request, mirroring
the data_unit_size concept already exposed by struct blk_crypto_config
for inline encryption hardware.  The first user is dm-crypt, which
today issues one skcipher request per sector and so pays a per-sector
cost in request allocation, callback dispatch, completion handling,
and scatterlist setup.

Proof-of-concept performance numbers from the RFC reply [1]: +19%
throughput / -40% CPU on a single-core arm64 system with a hardware
XTS-AES-256 accelerator running fio 4 KiB sequential writes through
dm-crypt, when an out-of-tree arm64 xts driver advertises the new
flag.  This series itself does not include arch enablement.

[1] https://lore.kernel.org/linux-crypto/20260428101225.24316-1-lravich@amazon.com/

Changes since v1
----------------

Patch 4 (dm-crypt) only.  Patches 1-3 are unchanged from v1.

  - Multi-DU scatterlist allocation now uses
    GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN, so the allocator does
    not loop forever waiting for memory that won't come on the
    swap-out-to-dm-crypt path.  (Mikulas)

  - On scatterlist allocation failure, return -EAGAIN instead of
    -ENOMEM.  crypt_convert() handles -EAGAIN by clearing its
    local multi_du flag and re-entering the per-sector path for
    the rest of this crypt_convert() invocation.  The per-tfm
    data_unit_size on the cipher remains set, so subsequent bios
    (which start a fresh crypt_convert() and re-read cipher_flags)
    get to try multi-DU again once memory pressure eases.

    This gives forward progress under total memory exhaustion: the
    per-sector path uses only cc->req_pool (a mempool with
    reservoir set up at table-load time) and the inline
    dmreq->sg_in[]/sg_out[] arrays, never doing any allocation
    that could fail.  The previous v1 mapping of -ENOMEM to
    BLK_STS_DEV_RESOURCE could loop indefinitely on swap, since
    the bio retry would try the same multi-DU allocation again.
    (Mikulas)

  - Walk the bio with __bio_for_each_bvec instead of
    __bio_for_each_segment.  __bio_for_each_segment splits each
    bvec at PAGE_SIZE boundaries; __bio_for_each_bvec keeps
    multi-page bvecs as single units, which is faster with folios
    and produces fewer scatterlist entries.  (Mikulas)
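
The -EAGAIN fallback described in the second item can be sketched in
userspace C as follows. All names here (struct convert_ctx,
convert_multi_du(), convert_one_sector(), convert()) are hypothetical
stand-ins rather than the dm-crypt functions, and allocation failure is
simulated with a flag. The point is the control flow: the multi-DU decision
is a local variable, so a failure affects only the current invocation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state one conversion pass walks. */
struct convert_ctx {
	size_t sectors_left;     /* sectors still to process */
	bool alloc_fails;        /* simulate scatterlist allocation failure */
	size_t batched_calls;    /* number of multi-DU requests issued */
	size_t per_sector_calls; /* number of per-sector requests issued */
};

/* Batched path: needs an allocation, so it may ask the caller to retry. */
static int convert_multi_du(struct convert_ctx *ctx)
{
	if (ctx->alloc_fails)
		return -EAGAIN;
	ctx->batched_calls++;
	ctx->sectors_left = 0;
	return 0;
}

/* Per-sector path: uses only preallocated resources, so it cannot fail here. */
static int convert_one_sector(struct convert_ctx *ctx)
{
	assert(ctx->sectors_left > 0);
	ctx->per_sector_calls++;
	ctx->sectors_left--;
	return 0;
}

static int convert(struct convert_ctx *ctx, bool cipher_supports_multi_du)
{
	/* Local flag: re-derived from the cipher on every invocation. */
	bool multi_du = cipher_supports_multi_du;

	while (ctx->sectors_left) {
		int r;

		if (multi_du) {
			r = convert_multi_du(ctx);
			if (r == -EAGAIN) {
				/* Fall back for the rest of this pass only. */
				multi_du = false;
				continue;
			}
		} else {
			r = convert_one_sector(ctx);
		}
		if (r)
			return r;
	}
	return 0;
}
```

Because the per-sector branch never allocates in this model, the loop always
terminates even when every batched attempt fails, which is the forward
progress argument made above.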

Design overview (unchanged from v1)
-----------------------------------

* Patch 1 adds an `unsigned int data_unit_size` field to
  `struct crypto_skcipher` (per-tfm: invariant for the consumer's
  lifetime, set once via `crypto_skcipher_set_data_unit_size()`),
  plus a capability flag CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT in
  `cra_flags` (type-specific high-byte range, mirroring the
  CRYPTO_AHASH_ALG_BLOCK_ONLY precedent).  `crypto_skcipher_encrypt()`
  and `crypto_skcipher_decrypt()` validate that `cryptlen` is a
  positive multiple of `data_unit_size`.  The setter rejects
  sub-blocksize values; algorithm registration rejects the flag
  for algorithms with `ivsize != 16`.

  Also exposes `skcipher_walk_data_units()` in
  <crypto/internal/skcipher.h> as a default per-DU dispatcher for
  drivers that don't want to roll their own.

* Patch 2 lets the generic `xts(...)` template advertise the flag
  when the inner cipher is synchronous.  This is the in-tree
  software producer of the new capability.

* Patch 3 extends `testmgr` with a self-comparison test that fires
  automatically for every alg advertising the flag.

* Patch 4 turns dm-crypt on automatically when all of the
  following hold at table load: skcipher (not aead), tfms_count
  == 1, IV mode is plain or plain64, no per-sector
  iv_gen_ops->post() hook, no dm-integrity stacking, and the
  underlying cipher advertises the capability.
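
The activation conditions listed for patch 4 amount to a single predicate.
The sketch below is a userspace model under stated assumptions: struct
crypt_model and model_use_multi_du() are hypothetical names, and the IV mode
is compared as a plain string, which is not necessarily how dm-crypt
represents it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of what is known at table load. */
struct crypt_model {
	bool is_aead;            /* aead rather than skcipher */
	unsigned int tfms_count;
	const char *iv_mode;     /* e.g. "plain", "plain64", "essiv:sha256" */
	bool has_iv_post_hook;   /* per-sector iv_gen_ops->post() present */
	bool integrity_stacked;  /* dm-integrity in use */
	bool cipher_multi_du;    /* cipher advertises the capability flag */
};

/* True only when every condition from the list holds. */
static bool model_use_multi_du(const struct crypt_model *cc)
{
	assert(cc != NULL && cc->iv_mode != NULL);

	if (cc->is_aead || cc->tfms_count != 1)
		return false;
	if (strcmp(cc->iv_mode, "plain") != 0 &&
	    strcmp(cc->iv_mode, "plain64") != 0)
		return false;
	if (cc->has_iv_post_hook || cc->integrity_stacked)
		return false;
	return cc->cipher_multi_du;
}
```

Fed the four IV modes from the activation-gating check in the Verification
section, this model enables batching for plain and plain64 and falls back
for essiv:sha256 and plain64be.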

This series intentionally does NOT add the capability flag to any
arch crypto driver.  Arch maintainers can opt in independently in
follow-up patches by wrapping their xts(aes) entry points with
skcipher_walk_data_units() or, for hardware engines, by submitting
one HW command for the whole multi-DU request.

Verification
------------

* checkpatch.pl --strict: clean on all 4 patches.
* Builds clean on x86_64 (defconfig + DM_CRYPT + CRYPTO_AES_NI_INTEL)
  and arm64 (cross-compile, defconfig + DM_CRYPT +
  CRYPTO_AES_ARM64_CE_BLK + CRYPTO_AES_ARM64_NEON_BLK) on top of
  axboe/for-next (a8cafdf8c949).
* QEMU boots; existing xts-aes-aesni / xts-aes-ce / xts-aes-neon
  crypto self-tests pass.
* In-kernel testmgr self-comparison passes for any algorithm
  advertising the flag.
* dm-crypt round-trip with plain64: PASS on x86 and arm64.
* dm-crypt round-trip with essiv:sha256 (single-DU path): PASS.
* dm-crypt large-bio: PASS.
* dm-crypt activation gating: plain -> enabled, plain64 ->
  enabled, essiv:sha256 -> fallback, plain64be -> fallback.
* Byte-equivalence: 256 MB of ciphertext written through the
  multi-DU path is bit-identical to ciphertext written on an
  unpatched axboe/for-next baseline (sha256
  4913910b1aa6f8859fcb8f4adec20230274993a3ade8f4dd0140a323dc43efc0).
* Low-memory boot (mem=128M): PASS — no regression in the
  per-sector path under tight memory.

The OOM-fallback path (multi-DU helper returns -EAGAIN, caller
reverts to per-sector) is verified by inspection: the fallback is
two lines in crypt_convert(), the per-sector path uses only the
existing mempool reserve and the inline dmreq SG arrays (no
allocation that could fail), and there is no shared state between
the two paths that could deadlock.

Leonid Ravich (4):
  crypto: skcipher - add per-tfm data_unit_size for batched requests
  crypto: xts - support multiple data units per request in template
  crypto: testmgr - exercise multi-data-unit path for skcipher
  dm crypt: batch all sectors of a bio per crypto request

 crypto/skcipher.c                  | 120 +++++++++++++
 crypto/testmgr.c                   | 129 ++++++++++++++
 crypto/xts.c                       |  25 ++-
 drivers/md/dm-crypt.c              | 272 ++++++++++++++++++++++++++++-
 include/crypto/internal/skcipher.h |  34 ++++
 include/crypto/skcipher.h          |  85 +++++++++
 6 files changed, 656 insertions(+), 9 deletions(-)

base-commit: a8cafdf8c949f17c92eca0045532e88ac0dac30d
-- 
2.47.3


* [PATCH v2 1/4] crypto: skcipher - add per-tfm data_unit_size for batched requests
From: Leonid Ravich @ 2026-05-27  6:50 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S . Miller, Mike Snitzer, Mikulas Patocka, Alasdair Kergon,
	Ard Biesheuvel, Eric Biggers, Jens Axboe, Horia Geanta,
	Gilad Ben-Yossef, linux-crypto, dm-devel, linux-block
In-Reply-To: <20260527065021.19525-1-lravich@amazon.com>

Add a per-tfm data_unit_size and an algorithm capability flag that
together allow a caller to submit several data units in a single
skcipher request.  The IV passed in the request applies to the first
data unit; the algorithm advances the tweak between data units
according to the mode specification (e.g., LE128 multiply for XTS per
IEEE 1619).
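
For the default dispatcher added in this patch, skcipher_walk_data_units(),
the kerneldoc states that the IV seen by data unit n is the caller-supplied
IV plus n, using a 128-bit little-endian add. A portable userspace sketch of
that one step follows; iv_add_le128() is a hypothetical name and a byte-wise
variant, not the helper in the patch, and it does not cover the tweak
multiplication that XTS performs inside a data unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IV_LEN 16

/*
 * Compute out = base + n, treating the 16-byte buffers as one 128-bit
 * little-endian integer (byte 0 is least significant). Wraps on overflow.
 */
static void iv_add_le128(uint8_t out[IV_LEN], const uint8_t base[IV_LEN],
			 uint64_t n)
{
	unsigned int carry = 0;
	int i;

	assert(out != NULL && base != NULL);

	for (i = 0; i < IV_LEN; i++) {
		/* Bytes 0..7 take one byte of n each; bytes 8..15 only carry. */
		unsigned int add = (i < 8) ?
			(unsigned int)((n >> (8 * i)) & 0xff) : 0;
		unsigned int sum = base[i] + add + carry;

		out[i] = (uint8_t)sum;
		carry = sum >> 8;
	}
}
```

Computing each unit's IV from the saved base plus n, rather than from the
buffer the cipher just used, keeps the result independent of any in-place
changes the single-unit body makes to the IV.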

This mirrors the data_unit_size concept already exposed by
struct blk_crypto_config for inline encryption hardware, but at the
software skcipher layer.  The first user is dm-crypt, which today
issues one request per sector and so pays a per-sector cost in
request allocation, IV generation, callback dispatch, and completion
handling.  Allowing the cipher to consume a whole bio per request
removes that overhead for drivers that can chain across data units
internally.

The data_unit_size lives on struct crypto_skcipher rather than on
struct skcipher_request because it does not change between requests
for any plausible consumer: dm-crypt picks one sector size per
mapped target at table load time; fscrypt would pick one per master
key.  Anchoring it to the tfm also lets the driver validate it once
at setkey() time and avoids per-request initialisation hazards on
mempool-recycled requests.

Capability is advertised with CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT
in cra_flags (type-specific high-byte range, mirroring the
CRYPTO_AHASH_ALG_* convention).  This makes the capability visible
in /proc/crypto and lets templates OR it into their derived
algorithms.

crypto_skcipher_set_data_unit_size() returns -EOPNOTSUPP if the
algorithm does not advertise the flag, and accepts 0 (the default)
unconditionally so callers can re-disable batching cheaply.

crypto_skcipher_encrypt()/decrypt() reject requests whose cryptlen
is not a multiple of the configured data_unit_size with -EINVAL.
The check is gated on data_unit_size != 0 so it costs nothing for
the common single-data-unit case.

No in-tree algorithm advertises the flag yet; subsequent patches
add the generic xts() template, arm64, and x86 producers as well
as the dm-crypt consumer.

Signed-off-by: Leonid Ravich <lravich@amazon.com>
---
 crypto/skcipher.c                  | 120 +++++++++++++++++++++++++++++
 include/crypto/internal/skcipher.h |  34 ++++++++
 include/crypto/skcipher.h          |  85 ++++++++++++++++++++
 3 files changed, 239 insertions(+)

diff --git a/crypto/skcipher.c b/crypto/skcipher.c
index 2b31d1d5d268..bc37bd554aec 100644
--- a/crypto/skcipher.c
+++ b/crypto/skcipher.c
@@ -432,13 +432,119 @@ int crypto_skcipher_setkey(struct crypto_skcipher *tfm, const u8 *key,
 }
 EXPORT_SYMBOL_GPL(crypto_skcipher_setkey);
 
+int crypto_skcipher_set_data_unit_size(struct crypto_skcipher *tfm,
+				       unsigned int data_unit_size)
+{
+	unsigned int blocksize;
+
+	if (!data_unit_size) {
+		tfm->data_unit_size = 0;
+		return 0;
+	}
+
+	if (!crypto_skcipher_supports_multi_data_unit(tfm))
+		return -EOPNOTSUPP;
+
+	blocksize = crypto_skcipher_blocksize(tfm);
+	if (data_unit_size < blocksize || data_unit_size % blocksize)
+		return -EINVAL;
+
+	tfm->data_unit_size = data_unit_size;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(crypto_skcipher_set_data_unit_size);
+
+static int crypto_skcipher_check_data_unit_size(struct crypto_skcipher *tfm,
+						struct skcipher_request *req)
+{
+	unsigned int du = tfm->data_unit_size;
+
+	if (likely(!du))
+		return 0;
+	if (req->cryptlen % du)
+		return -EINVAL;
+	return 0;
+}
+
+/*
+ * Increment a 16-byte little-endian counter held in @iv.  See
+ * crypto_skcipher_set_data_unit_size() for the convention.
+ */
+static inline void skcipher_iv_inc_le128(u8 *iv)
+{
+	__le64 lo_le, hi_le;
+	u64 lo;
+
+	memcpy(&lo_le, iv, 8);
+	memcpy(&hi_le, iv + 8, 8);
+	lo = le64_to_cpu(lo_le) + 1;
+	lo_le = cpu_to_le64(lo);
+	memcpy(iv, &lo_le, 8);
+	if (unlikely(lo == 0)) {
+		hi_le = cpu_to_le64(le64_to_cpu(hi_le) + 1);
+		memcpy(iv + 8, &hi_le, 8);
+	}
+}
+
+int skcipher_walk_data_units(struct skcipher_request *req,
+			     int (*body)(struct skcipher_request *))
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	const unsigned int du = tfm->data_unit_size;
+	const unsigned int total = req->cryptlen;
+	struct scatterlist *orig_src = req->src;
+	struct scatterlist *orig_dst = req->dst;
+	struct scatterlist src_sg[2], dst_sg[2];
+	u8 iv_save[16];
+	unsigned int off;
+	int err = 0;
+
+	if (likely(!du))
+		return body(req);
+
+	/*
+	 * Registration of an algorithm advertising
+	 * CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT enforces ivsize == 16
+	 * (see skcipher_prepare_alg_common()), so this is purely
+	 * defensive against algorithm-registration bugs.
+	 */
+	if (WARN_ON_ONCE(crypto_skcipher_ivsize(tfm) != 16))
+		return -EINVAL;
+
+	memcpy(iv_save, req->iv, 16);
+
+	for (off = 0; off < total; off += du) {
+		req->cryptlen = du;
+		req->src = scatterwalk_ffwd(src_sg, orig_src, off);
+		req->dst = (orig_src == orig_dst) ? req->src :
+			   scatterwalk_ffwd(dst_sg, orig_dst, off);
+
+		err = body(req);
+		if (err)
+			break;
+
+		skcipher_iv_inc_le128(iv_save);
+		memcpy(req->iv, iv_save, 16);
+	}
+
+	req->src = orig_src;
+	req->dst = orig_dst;
+	req->cryptlen = total;
+	return err;
+}
+EXPORT_SYMBOL_GPL(skcipher_walk_data_units);
+
 int crypto_skcipher_encrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct skcipher_alg *alg = crypto_skcipher_alg(tfm);
+	int err;
 
 	if (crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
 		return -ENOKEY;
+	err = crypto_skcipher_check_data_unit_size(tfm, req);
+	if (err)
+		return err;
 	if (alg->co.base.cra_type != &crypto_skcipher_type)
 		return crypto_lskcipher_encrypt_sg(req);
 	return alg->encrypt(req);
@@ -449,9 +555,13 @@ int crypto_skcipher_decrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct skcipher_alg *alg = crypto_skcipher_alg(tfm);
+	int err;
 
 	if (crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
 		return -ENOKEY;
+	err = crypto_skcipher_check_data_unit_size(tfm, req);
+	if (err)
+		return err;
 	if (alg->co.base.cra_type != &crypto_skcipher_type)
 		return crypto_lskcipher_decrypt_sg(req);
 	return alg->decrypt(req);
@@ -680,6 +790,16 @@ int skcipher_prepare_alg_common(struct skcipher_alg_common *alg)
 	    (alg->ivsize + alg->statesize) > PAGE_SIZE / 2)
 		return -EINVAL;
 
+	/*
+	 * Algorithms advertising multi-data-unit support must use the
+	 * 16-byte little-endian counter convention documented in
+	 * crypto_skcipher_set_data_unit_size(); see also
+	 * skcipher_walk_data_units().
+	 */
+	if ((base->cra_flags & CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT) &&
+	    alg->ivsize != 16)
+		return -EINVAL;
+
 	if (!alg->chunksize)
 		alg->chunksize = base->cra_blocksize;
 
diff --git a/include/crypto/internal/skcipher.h b/include/crypto/internal/skcipher.h
index a965b6aabf61..bed1b1f1bbdc 100644
--- a/include/crypto/internal/skcipher.h
+++ b/include/crypto/internal/skcipher.h
@@ -21,6 +21,40 @@
  */
 #define CRYPTO_ALG_SKCIPHER_REQSIZE_LARGE CRYPTO_ALG_OPTIONAL_KEY
 
+/**
+ * skcipher_walk_data_units - dispatch a request as one body call per data unit
+ * @req: the caller's skcipher request
+ * @body: the algorithm's single-data-unit encrypt or decrypt function
+ *
+ * When tfm->data_unit_size is zero this is a tail call into @body with
+ * @req unchanged.  Otherwise the request is split into
+ * cryptlen / data_unit_size sub-ranges and @body is called once per
+ * sub-range with req->cryptlen, req->src, req->dst, and req->iv adjusted
+ * for that sub-range.  The IV passed to data unit n is the caller-
+ * supplied IV plus n, where + is a 128-bit little-endian add — this
+ * matches the convention documented in
+ * crypto_skcipher_set_data_unit_size().
+ *
+ * Many single-data-unit XTS bodies modify the IV buffer in place during
+ * processing (the tweak is walked block by block).  This helper saves
+ * the caller's IV before each call and rewrites the next data unit's
+ * IV from the saved value, so the body always sees a fresh per-DU IV
+ * regardless of any in-place mutation it performs.
+ *
+ * The body MUST run to completion synchronously.  Drivers that use this
+ * helper therefore advertise CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT only
+ * for synchronous configurations.
+ *
+ * After the call returns, the contents of req->iv are unspecified per
+ * the documented contract.  src/dst/cryptlen are restored to the
+ * caller's values to keep skcipher request post-conditions intact.
+ *
+ * Return: 0 on success, or the body's negative errno on the first
+ *	   data unit that returned non-zero.
+ */
+int skcipher_walk_data_units(struct skcipher_request *req,
+			     int (*body)(struct skcipher_request *));
+
 struct aead_request;
 struct rtattr;
 
diff --git a/include/crypto/skcipher.h b/include/crypto/skcipher.h
index 4efe2ca8c4d1..5941b6b24b98 100644
--- a/include/crypto/skcipher.h
+++ b/include/crypto/skcipher.h
@@ -26,6 +26,15 @@
 /* Set this bit if the skcipher operation is not final. */
 #define CRYPTO_SKCIPHER_REQ_NOTFINAL	0x00000002
 
+/*
+ * Set in cra_flags by an skcipher algorithm that supports processing
+ * multiple data units in a single request.  See
+ * crypto_skcipher_set_data_unit_size().
+ *
+ * Type-specific flag in the 0xff000000 reserved range.
+ */
+#define CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT	0x01000000
+
 struct scatterlist;
 
 /**
@@ -53,6 +62,22 @@ struct skcipher_request {
 struct crypto_skcipher {
 	unsigned int reqsize;
 
+	/*
+	 * Number of bytes in one data unit when batching multiple data units
+	 * per request.  0 means "single data unit per request" (legacy
+	 * behaviour).  Set via crypto_skcipher_set_data_unit_size().
+	 *
+	 * When non-zero, cryptlen must be a multiple of data_unit_size.  The
+	 * IV passed in skcipher_request::iv applies to the first data unit;
+	 * the IV for data unit n is that IV plus n, treated as a 128-bit
+	 * little-endian integer, as documented for
+	 * crypto_skcipher_set_data_unit_size().
+	 *
+	 * Only algorithms that advertise CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT
+	 * in cra_flags accept a non-zero value.
+	 */
+	unsigned int data_unit_size;
+
 	struct crypto_tfm base;
 };
 
@@ -492,6 +517,66 @@ static inline unsigned int crypto_lskcipher_chunksize(
 	return crypto_lskcipher_alg(tfm)->co.chunksize;
 }
 
+/**
+ * crypto_skcipher_supports_multi_data_unit() - test multi-data-unit support
+ * @tfm: cipher handle
+ *
+ * Return: true if the algorithm advertises that it can process multiple
+ *	   data units in a single skcipher_request.
+ */
+static inline bool
+crypto_skcipher_supports_multi_data_unit(struct crypto_skcipher *tfm)
+{
+	return crypto_skcipher_alg_common(tfm)->base.cra_flags &
+		CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT;
+}
+
+/**
+ * crypto_skcipher_set_data_unit_size() - set data unit size for the tfm
+ * @tfm: cipher handle
+ * @data_unit_size: data unit size in bytes; 0 disables multi-data-unit mode
+ *
+ * Configure the tfm to process multiple data units per request.  When set
+ * to a non-zero value, every subsequent encrypt/decrypt request must have
+ * cryptlen that is a multiple of @data_unit_size.  Each data unit is
+ * processed as if it were a separate request whose IV is derived from the
+ * preceding data unit's IV by the algorithm-specific tweak update rule:
+ * the implementation treats the caller-supplied IV as a 128-bit
+ * little-endian counter and adds the data-unit index for each subsequent
+ * data unit.
+ *
+ * The contents of req->iv after a multi-data-unit request returns are
+ * unspecified — callers MUST NOT rely on it being either the original
+ * value or the final-data-unit value.  Set a fresh IV before every
+ * request.
+ *
+ * The algorithm must advertise CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT in its
+ * cra_flags.  @data_unit_size must be a positive multiple of the
+ * algorithm's cra_blocksize, otherwise -EINVAL is returned.
+ *
+ * Setting @data_unit_size to 0 reverts the tfm to single-data-unit
+ * behaviour and is always permitted.
+ *
+ * Return: 0 on success; -EOPNOTSUPP if the algorithm does not advertise
+ *	   multi-data-unit support; -EINVAL if @data_unit_size is not a
+ *	   positive multiple of the cipher block size.
+ */
+int crypto_skcipher_set_data_unit_size(struct crypto_skcipher *tfm,
+				       unsigned int data_unit_size);
+
+/**
+ * crypto_skcipher_data_unit_size() - obtain data unit size
+ * @tfm: cipher handle
+ *
+ * Return: configured data unit size in bytes; 0 if multi-data-unit mode
+ *	   is disabled.
+ */
+static inline unsigned int
+crypto_skcipher_data_unit_size(struct crypto_skcipher *tfm)
+{
+	return tfm->data_unit_size;
+}
+
 /**
  * crypto_skcipher_statesize() - obtain state size
  * @tfm: cipher handle
-- 
2.47.3
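
For readers following the helper's kernel-doc, the dispatch contract can
be modelled in userspace. This is only a sketch under stated assumptions:
struct toy_req, toy_walk_data_units() and toy_xor_body() are hypothetical
stand-ins, not the kernel implementation, and the -EINVAL returned for a
length that is not a multiple of the data unit size is an assumption (the
kernel-doc only says cryptlen must be such a multiple).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_IV_SIZE 16

/* Hypothetical stand-in for struct skcipher_request (in-place only). */
struct toy_req {
	uint8_t *buf;			/* data transformed in place */
	size_t len;			/* plays the role of cryptlen */
	uint8_t iv[TOY_IV_SIZE];	/* plays the role of req->iv */
};

typedef int (*toy_body_t)(struct toy_req *req);

/* Add n to the 128-bit little-endian integer stored in iv[]. */
static void toy_le128_add(uint8_t iv[TOY_IV_SIZE], uint64_t n)
{
	unsigned int carry = 0;
	size_t i;

	for (i = 0; i < TOY_IV_SIZE; i++) {
		unsigned int sum = iv[i] + (unsigned int)(n & 0xff) + carry;

		iv[i] = (uint8_t)sum;
		carry = sum >> 8;
		n >>= 8;
	}
}

/* Toy single-data-unit body: XOR with the IV, then clobber the IV. */
static int toy_xor_body(struct toy_req *req)
{
	size_t j;

	for (j = 0; j < req->len; j++)
		req->buf[j] ^= req->iv[j % TOY_IV_SIZE];
	memset(req->iv, 0xAA, TOY_IV_SIZE);	/* simulate in-place mutation */
	return 0;
}

/* Call body once per data unit with IV = caller IV + unit index. */
static int toy_walk_data_units(struct toy_req *req, size_t du_size,
			       toy_body_t body)
{
	uint8_t base_iv[TOY_IV_SIZE];
	uint8_t *buf = req->buf;
	size_t len = req->len;
	size_t i, nr_units;
	int err = 0;

	assert(body != NULL);

	if (du_size == 0)
		return body(req);	/* legacy: one data unit per request */
	if (len % du_size)
		return -EINVAL;		/* assumed error code, see lead-in */

	memcpy(base_iv, req->iv, TOY_IV_SIZE);	/* body may scribble on iv */
	nr_units = len / du_size;

	for (i = 0; i < nr_units; i++) {
		memcpy(req->iv, base_iv, TOY_IV_SIZE);
		toy_le128_add(req->iv, i);
		req->buf = buf + i * du_size;
		req->len = du_size;
		err = body(req);
		if (err)
			break;		/* first failing data unit wins */
	}

	/* Restore buffer and length; the IV contents stay unspecified. */
	req->buf = buf;
	req->len = len;
	return err;
}
```

Because the saved IV is reloaded before every call, the toy body is free
to overwrite req->iv, which mirrors the in-place tweak walking the
kernel-doc describes for XTS bodies.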


^ permalink raw reply related

* [PATCH v2 2/4] crypto: xts - support multiple data units per request in template
From: Leonid Ravich @ 2026-05-27  6:50 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S . Miller, Mike Snitzer, Mikulas Patocka, Alasdair Kergon,
	Ard Biesheuvel, Eric Biggers, Jens Axboe, Horia Geanta,
	Gilad Ben-Yossef, linux-crypto, dm-devel, linux-block
In-Reply-To: <20260527065021.19525-1-lravich@amazon.com>

Teach the generic xts() template to consume cryptlen larger than one
data unit when the caller has configured a non-zero data_unit_size on
the tfm.  Each data unit is processed with its own IV, derived from
the caller-supplied IV by treating it as a 128-bit little-endian
counter and adding the data-unit index.  This matches the
sector-indexed XTS used by dm-crypt's plain64 IV mode and by typical
inline-encryption hardware.

The single-data-unit bodies are unchanged apart from being renamed to
xts_encrypt_one() and xts_decrypt_one().  They are now reached via the
skcipher_walk_data_units() helper, which tail-calls the body when
data_unit_size is zero (the legacy default), so existing users see
no extra cost.

Advertise CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT in cra_flags only when
the inner cipher is synchronous.  An async inner cipher would require
a per-DU completion chain which is out of scope for the slow software
template; consumers that need multi-DU on async hardware will use one
of the arch-specific drivers added later in this series.

Signed-off-by: Leonid Ravich <lravich@amazon.com>
---
 crypto/xts.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)
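
The claim that the per-data-unit IV rule matches sector-indexed plain64
IVs can be checked with a small userspace sketch. It assumes a
plain64-style IV is the 64-bit little-endian sector number padded with
zeros to 16 bytes; plain64_iv(), iv_add_le128() and derived_iv_matches()
are hypothetical helper names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IV_SIZE 16

/* Model of a plain64-style IV: LE64 sector number, zero padded. */
static void plain64_iv(uint8_t iv[IV_SIZE], uint64_t sector)
{
	int i;

	memset(iv, 0, IV_SIZE);
	for (i = 0; i < 8; i++)
		iv[i] = (uint8_t)(sector >> (8 * i));
}

/* iv += n, with iv interpreted as a 128-bit little-endian integer. */
static void iv_add_le128(uint8_t iv[IV_SIZE], uint64_t n)
{
	unsigned int carry = 0;
	int i;

	for (i = 0; i < IV_SIZE; i++) {
		unsigned int sum = iv[i] + (unsigned int)(n & 0xff) + carry;

		iv[i] = (uint8_t)sum;
		carry = sum >> 8;
		n >>= 8;
	}
}

/*
 * Returns 1 when "IV of first_sector, plus index" equals the IV that
 * would be generated directly for sector first_sector + index.
 */
static int derived_iv_matches(uint64_t first_sector, uint64_t index)
{
	uint8_t derived[IV_SIZE], direct[IV_SIZE];

	/* Keep the sector sum inside 64 bits for the direct comparison. */
	assert(first_sector <= UINT64_MAX - index);

	plain64_iv(derived, first_sector);
	iv_add_le128(derived, index);
	plain64_iv(direct, first_sector + index);
	return memcmp(derived, direct, IV_SIZE) == 0;
}
```

The 128-bit add only differs from a 64-bit sector increment when the low
half overflows, in which case the carry lands in byte 8 of the IV.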

diff --git a/crypto/xts.c b/crypto/xts.c
index ad97c8091582..f0585ea9d6d5 100644
--- a/crypto/xts.c
+++ b/crypto/xts.c
@@ -258,7 +258,7 @@ static int xts_init_crypt(struct skcipher_request *req,
 	return 0;
 }
 
-static int xts_encrypt(struct skcipher_request *req)
+static int xts_encrypt_one(struct skcipher_request *req)
 {
 	struct xts_request_ctx *rctx = skcipher_request_ctx(req);
 	struct skcipher_request *subreq = &rctx->subreq;
@@ -275,7 +275,7 @@ static int xts_encrypt(struct skcipher_request *req)
 	return xts_cts_final(req, crypto_skcipher_encrypt);
 }
 
-static int xts_decrypt(struct skcipher_request *req)
+static int xts_decrypt_one(struct skcipher_request *req)
 {
 	struct xts_request_ctx *rctx = skcipher_request_ctx(req);
 	struct skcipher_request *subreq = &rctx->subreq;
@@ -292,6 +292,16 @@ static int xts_decrypt(struct skcipher_request *req)
 	return xts_cts_final(req, crypto_skcipher_decrypt);
 }
 
+static int xts_encrypt(struct skcipher_request *req)
+{
+	return skcipher_walk_data_units(req, xts_encrypt_one);
+}
+
+static int xts_decrypt(struct skcipher_request *req)
+{
+	return skcipher_walk_data_units(req, xts_decrypt_one);
+}
+
 static int xts_init_tfm(struct crypto_skcipher *tfm)
 {
 	struct skcipher_instance *inst = skcipher_alg_instance(tfm);
@@ -427,6 +437,17 @@ static int xts_create(struct crypto_template *tmpl, struct rtattr **tb)
 	inst->alg.base.cra_alignmask = alg->base.cra_alignmask |
 				       (__alignof__(u64) - 1);
 
+	/*
+	 * Advertise multi-data-unit support only when the inner cipher is
+	 * synchronous.  The dispatcher in skcipher_walk_data_units() calls
+	 * the single-DU body in a loop and assumes synchronous completion;
+	 * supporting async would require a per-DU callback chain, which
+	 * the slow software template does not need.
+	 */
+	if (!(alg->base.cra_flags & CRYPTO_ALG_ASYNC))
+		inst->alg.base.cra_flags |=
+			CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT;
+
 	inst->alg.ivsize = XTS_BLOCK_SIZE;
 	inst->alg.min_keysize = alg->min_keysize * 2;
 	inst->alg.max_keysize = alg->max_keysize * 2;
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 3/4] crypto: testmgr - exercise multi-data-unit path for skcipher
From: Leonid Ravich @ 2026-05-27  6:50 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S . Miller, Mike Snitzer, Mikulas Patocka, Alasdair Kergon,
	Ard Biesheuvel, Eric Biggers, Jens Axboe, Horia Geanta,
	Gilad Ben-Yossef, linux-crypto, dm-devel, linux-block
In-Reply-To: <20260527065021.19525-1-lravich@amazon.com>

Add a self-comparison test that runs whenever an skcipher algorithm
advertises CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT in cra_flags.  The test
encrypts the same random plaintext two ways:

  1. as one batched request with data_unit_size set, and
  2. as N back-to-back single-data-unit requests with IVs derived from
     the original IV by adding the data-unit index (treated as a
     128-bit little-endian counter, matching the convention documented
     in crypto_skcipher_set_data_unit_size()).

Both encryptions must produce byte-identical ciphertext; otherwise the
algorithm's multi-DU implementation is inconsistent with its single-DU
behaviour.  The test iterates over a fixed set of typical data unit
sizes (512, 1024, 2048, 4096), which cover the dm-crypt sector-size
range.

The test is gated on ivsize == 16 (XTS, the only multi-DU consumer in
the kernel today) and on the algorithm advertising the capability,
so it costs nothing for the existing fleet of skcipher drivers.

Signed-off-by: Leonid Ravich <lravich@amazon.com>
---
 crypto/testmgr.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 4d86efae65b2..8ca92ee6b37c 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3211,6 +3211,123 @@ static int test_skcipher(int enc, const struct cipher_test_suite *suite,
 	return 0;
 }
 
+/*
+ * For algorithms that advertise CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT,
+ * verify that one request batching N data units produces the same
+ * ciphertext as N back-to-back single-data-unit requests with IVs
+ * derived from the original IV by adding the data-unit index (treated
+ * as a 128-bit little-endian counter).
+ *
+ * This is a self-comparison: it does not depend on test-vector
+ * authoritativeness, only on the algorithm being internally consistent
+ * between its single-DU and multi-DU paths.
+ */
+#define TEST_MDU_NR_UNITS	4
+static int test_skcipher_multi_du(struct crypto_skcipher *tfm,
+				  unsigned int du_size)
+{
+	const char *driver = crypto_skcipher_driver_name(tfm);
+	const unsigned int ivsize = crypto_skcipher_ivsize(tfm);
+	const unsigned int total = du_size * TEST_MDU_NR_UNITS;
+	struct skcipher_request *req = NULL;
+	struct scatterlist sg_in, sg_out;
+	DECLARE_CRYPTO_WAIT(wait);
+	u8 iv_orig[16] = {0};
+	u8 iv_work[16];
+	u8 *plain = NULL, *batched = NULL, *unit = NULL;
+	unsigned int i;
+	int err;
+
+	if (ivsize != 16)
+		return 0;
+
+	plain = kmalloc(total, GFP_KERNEL);
+	batched = kmalloc(total, GFP_KERNEL);
+	unit = kmalloc(total, GFP_KERNEL);
+	req = skcipher_request_alloc(tfm, GFP_KERNEL);
+	if (!plain || !batched || !unit || !req) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	get_random_bytes(plain, total);
+	get_random_bytes(iv_orig, ivsize);
+
+	/* Pass 1: one batched encrypt with data_unit_size set. */
+	err = crypto_skcipher_set_data_unit_size(tfm, du_size);
+	if (err) {
+		pr_err("alg: skcipher: %s set_data_unit_size(%u) failed: %d\n",
+		       driver, du_size, err);
+		goto out;
+	}
+	memcpy(batched, plain, total);
+	memcpy(iv_work, iv_orig, ivsize);
+	sg_init_one(&sg_in, batched, total);
+	sg_out = sg_in;
+	skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
+				      CRYPTO_TFM_REQ_MAY_SLEEP,
+				      crypto_req_done, &wait);
+	skcipher_request_set_crypt(req, &sg_in, &sg_out, total, iv_work);
+	err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);
+	if (err) {
+		pr_err("alg: skcipher: %s multi-DU batched encrypt failed: %d\n",
+		       driver, err);
+		goto out_clear_du;
+	}
+
+	/* Pass 2: TEST_MDU_NR_UNITS single-DU encrypts with derived IVs. */
+	err = crypto_skcipher_set_data_unit_size(tfm, 0);
+	if (err)
+		goto out;
+	memcpy(unit, plain, total);
+	memcpy(iv_work, iv_orig, ivsize);
+	for (i = 0; i < TEST_MDU_NR_UNITS; i++) {
+		sg_init_one(&sg_in, unit + i * du_size, du_size);
+		sg_out = sg_in;
+		skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
+					      CRYPTO_TFM_REQ_MAY_SLEEP,
+					      crypto_req_done, &wait);
+		skcipher_request_set_crypt(req, &sg_in, &sg_out, du_size,
+					   iv_work);
+		err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);
+		if (err) {
+			pr_err("alg: skcipher: %s single-DU[%u] encrypt failed: %d\n",
+			       driver, i, err);
+			goto out;
+		}
+		/* Increment iv_work as a 128-bit little-endian counter. */
+		{
+			__le64 lo_le, hi_le;
+			u64 lo;
+
+			memcpy(&lo_le, iv_work, 8);
+			memcpy(&hi_le, iv_work + 8, 8);
+			lo = le64_to_cpu(lo_le) + 1;
+			lo_le = cpu_to_le64(lo);
+			memcpy(iv_work, &lo_le, 8);
+			if (lo == 0) {
+				hi_le = cpu_to_le64(le64_to_cpu(hi_le) + 1);
+				memcpy(iv_work + 8, &hi_le, 8);
+			}
+		}
+	}
+
+	if (memcmp(batched, unit, total) != 0) {
+		pr_err("alg: skcipher: %s multi-DU mismatch (du=%u, n=%u)\n",
+		       driver, du_size, TEST_MDU_NR_UNITS);
+		err = -EINVAL;
+	}
+
+out_clear_du:
+	(void)crypto_skcipher_set_data_unit_size(tfm, 0);
+out:
+	skcipher_request_free(req);
+	kfree(unit);
+	kfree(batched);
+	kfree(plain);
+	return err;
+}
+
 static int alg_test_skcipher(const struct alg_test_desc *desc,
 			     const char *driver, u32 type, u32 mask)
 {
@@ -3259,6 +3376,18 @@ static int alg_test_skcipher(const struct alg_test_desc *desc,
 	if (err)
 		goto out;
 
+	if (crypto_skcipher_supports_multi_data_unit(tfm)) {
+		static const unsigned int du_sizes[] = { 512, 1024, 2048, 4096 };
+		unsigned int j;
+
+		for (j = 0; j < ARRAY_SIZE(du_sizes); j++) {
+			err = test_skcipher_multi_du(tfm, du_sizes[j]);
+			if (err)
+				goto out;
+			cond_resched();
+		}
+	}
+
 	err = test_skcipher_vs_generic_impl(desc->generic_driver, req, tsgls);
 out:
 	free_cipher_test_sglists(tsgls);
-- 
2.47.3


^ permalink raw reply related

* [PATCH v2 4/4] dm crypt: batch all sectors of a bio per crypto request
From: Leonid Ravich @ 2026-05-27  6:50 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S . Miller, Mike Snitzer, Mikulas Patocka, Alasdair Kergon,
	Ard Biesheuvel, Eric Biggers, Jens Axboe, Horia Geanta,
	Gilad Ben-Yossef, linux-crypto, dm-devel, linux-block
In-Reply-To: <20260527065021.19525-1-lravich@amazon.com>

When the underlying skcipher driver advertises support for multiple
data units in a single request (CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT),
configure the cipher with cc->sector_size as data_unit_size and
submit one request per bio instead of one request per sector.  This
removes per-sector overhead in the crypto API hot path: request
allocation, callback dispatch, completion handling, and SG setup.

The optimisation is enabled automatically at table load when all
of the following hold:

 - the cipher is non-aead (i.e. skcipher);
 - tfms_count is 1 (interleaved per-sector keys would break batching);
 - the IV mode is plain or plain64 (the only modes whose generator
   produces a sequential 64-bit little-endian counter that the cipher
   can extend by adding the data-unit index, matching the convention
   documented in crypto_skcipher_set_data_unit_size());
 - the iv_gen_ops->post() hook is unset (lmk and tcw use it; both are
   already excluded by the IV-mode test, but the explicit check makes
   the assumption durable against future IV modes);
 - dm-integrity is not stacked (no integrity tag or integrity IV);
 - the cipher driver advertises multi-data-unit support.

A new CRYPT_MULTI_DATA_UNIT cipher_flag, set once at construction
time, gates the multi-data-unit path.  The existing per-sector path
in crypt_convert_block_skcipher() is unchanged; the new
crypt_convert_block_skcipher_multi() is reached from a small dispatch
in crypt_convert() and shares the same backlog/-EBUSY/-EINPROGRESS
flow control with the per-sector path.

Heap-allocated scatterlists are stashed in dm_crypt_request and freed
in crypt_free_req_skcipher() to avoid races between the synchronous-
success free path and async-completion reuse from the request pool.
If scatterlist allocation fails, crypt_convert_block_skcipher_multi()
returns -EAGAIN and crypt_convert() falls back to the per-sector path
for the rest of the bio, which uses only mempool reserves.

Verified end-to-end with a byte-equivalence test: encrypted output of
plain64 dm-crypt with the multi-data-unit path matches output of the
single-data-unit path bit-for-bit over a 256 MB device.

Signed-off-by: Leonid Ravich <lravich@amazon.com>
---
 drivers/md/dm-crypt.c | 272 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 265 insertions(+), 7 deletions(-)
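
The -EAGAIN fallback in crypt_convert() can be pictured with a
simplified userspace model. This is a sketch, not the driver code:
struct toy_ctx, toy_convert_multi() and toy_convert() are hypothetical,
the bio is reduced to a byte count, and the alloc_ok flag stands in for
whether the scatterlist allocation succeeds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SECTOR_SHIFT 9	/* 512-byte sectors */

struct toy_ctx {
	size_t remaining;		/* bytes left in the bio */
	unsigned long long sector;	/* next 512-byte sector number */
	unsigned int requests;		/* crypto requests submitted */
};

/* Batched attempt: covers every whole data unit left in the bio. */
static int toy_convert_multi(const struct toy_ctx *ctx, size_t sector_size,
			     bool alloc_ok, size_t *processed)
{
	if (ctx->remaining < sector_size)
		return -EIO;
	if (!alloc_ok)
		return -EAGAIN;		/* scatterlist allocation failed */
	*processed = ctx->remaining - ctx->remaining % sector_size;
	return 0;
}

/* Simplified conversion loop with the per-sector fallback. */
static int toy_convert(struct toy_ctx *ctx, size_t sector_size,
		       bool multi_du, bool alloc_ok)
{
	/* Data unit size must be a power of two of at least 512 bytes. */
	assert(sector_size >= 512 && (sector_size & (sector_size - 1)) == 0);

	while (ctx->remaining) {
		size_t processed = sector_size;	/* per-sector default */
		int r;

		if (multi_du)
			r = toy_convert_multi(ctx, sector_size, alloc_ok,
					      &processed);
		else
			r = ctx->remaining >= sector_size ? 0 : -EIO;

		if (r == -EAGAIN) {
			multi_du = false;	/* only for this invocation */
			continue;
		}
		if (r)
			return r;

		ctx->requests++;
		ctx->remaining -= processed;
		ctx->sector += processed >> TOY_SECTOR_SHIFT;
	}
	return 0;
}
```

In this model a 16 KiB bio with 4096-byte data units needs one request
on the batched path and four on the fallback path, and both paths end at
the same sector number.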

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 608b617fb817..e3cc88cf0095 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -101,6 +101,14 @@ struct dm_crypt_request {
 	struct scatterlist sg_in[4];
 	struct scatterlist sg_out[4];
 	u64 iv_sector;
+	/*
+	 * Heap-allocated scatterlists used by the multi-data-unit path
+	 * when one bio is processed in a single skcipher request.  NULL
+	 * when the inline sg_in[]/sg_out[] arrays above are sufficient
+	 * (single-data-unit path).  Freed in crypt_free_req_skcipher().
+	 */
+	struct scatterlist *sg_in_ext;
+	struct scatterlist *sg_out_ext;
 };
 
 struct crypt_config;
@@ -151,6 +159,7 @@ enum cipher_flags {
 	CRYPT_IV_LARGE_SECTORS,		/* Calculate IV from sector_size, not 512B sectors */
 	CRYPT_ENCRYPT_PREPROCESS,	/* Must preprocess data for encryption (elephant) */
 	CRYPT_KEY_MAC_SIZE_SET,		/* The integrity_key_size option was used */
+	CRYPT_MULTI_DATA_UNIT,		/* Batch all sectors of a bio per crypto request */
 };
 
 /*
@@ -1426,12 +1435,153 @@ static int crypt_convert_block_skcipher(struct crypt_config *cc,
 	return r;
 }
 
+/*
+ * Multi-data-unit variant of crypt_convert_block_skcipher.  Submits all
+ * remaining sectors of the current bio in one skcipher request whose
+ * data_unit_size is cc->sector_size.  The cipher walks the IV between
+ * data units (see crypto_skcipher_set_data_unit_size()).
+ *
+ * Returns the same set of values as crypt_convert_block_skcipher:
+ *   0 on synchronous success (full chunk processed),
+ *   -EINPROGRESS / -EBUSY on asynchronous dispatch,
+ *   -EAGAIN if the per-bio scatterlist allocation cannot be made.  The
+ *           caller MUST disable multi-data-unit batching for the rest
+ *           of this bio and re-enter the per-sector path, which uses
+ *           only mempool reserves and is therefore safe even on the
+ *           swap-out-to-dm-crypt path under total memory exhaustion.
+ *   negative errno otherwise.
+ *
+ * On success the bio iterators have been advanced by the chunk size.
+ *
+ * Walks the bio with __bio_for_each_bvec so that multi-page folios
+ * produce one scatterlist entry rather than N (one per PAGE_SIZE).
+ */
+static int crypt_convert_block_skcipher_multi(struct crypt_config *cc,
+					      struct convert_context *ctx,
+					      struct skcipher_request *req,
+					      unsigned int *out_processed)
+{
+	const unsigned int sector_size = cc->sector_size;
+	const gfp_t gfp = GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN;
+	unsigned int total_in = ctx->iter_in.bi_size;
+	unsigned int total_out = ctx->iter_out.bi_size;
+	unsigned int total = min(total_in, total_out);
+	unsigned int n_sectors;
+	unsigned int n_sg_in = 0, n_sg_out = 0;
+	struct dm_crypt_request *dmreq = dmreq_of_req(cc, req);
+	struct scatterlist *sg_in = NULL, *sg_out = NULL;
+	struct bvec_iter iter_in, iter_out;
+	struct bio_vec bv;
+	u8 *iv, *org_iv;
+	int r;
+
+	if (unlikely(total < sector_size))
+		return -EIO;
+	n_sectors = total / sector_size;
+	total = n_sectors * sector_size;
+
+	/*
+	 * Walk the bio_vec iterators to count how many SG entries we need
+	 * for exactly @total bytes.  bi_size of the iterators is at least
+	 * @total by construction above.
+	 */
+	iter_in = ctx->iter_in;
+	iter_in.bi_size = total;
+	__bio_for_each_bvec(bv, ctx->bio_in, iter_in, iter_in)
+		n_sg_in++;
+
+	iter_out = ctx->iter_out;
+	iter_out.bi_size = total;
+	__bio_for_each_bvec(bv, ctx->bio_out, iter_out, iter_out)
+		n_sg_out++;
+
+	sg_in = kmalloc_array(n_sg_in, sizeof(*sg_in), gfp);
+	sg_out = (ctx->bio_in == ctx->bio_out) ? sg_in :
+		 kmalloc_array(n_sg_out, sizeof(*sg_out), gfp);
+	if (!sg_in || !sg_out) {
+		/*
+		 * Allocation may legitimately fail under memory pressure on
+		 * the swap-out-to-dm-crypt path.  Return -EAGAIN so the
+		 * caller falls back to the per-sector path for this bio
+		 * rather than looping forever in the allocator or requeueing
+		 * the bio just to fail again.
+		 */
+		kfree(sg_in);
+		if (sg_out != sg_in)
+			kfree(sg_out);
+		return -EAGAIN;
+	}
+
+	sg_init_table(sg_in, n_sg_in);
+	{
+		unsigned int i = 0;
+
+		iter_in = ctx->iter_in;
+		iter_in.bi_size = total;
+		__bio_for_each_bvec(bv, ctx->bio_in, iter_in, iter_in)
+			sg_set_page(&sg_in[i++], bv.bv_page, bv.bv_len,
+				    bv.bv_offset);
+	}
+
+	if (sg_out != sg_in) {
+		unsigned int i = 0;
+
+		sg_init_table(sg_out, n_sg_out);
+		iter_out = ctx->iter_out;
+		iter_out.bi_size = total;
+		__bio_for_each_bvec(bv, ctx->bio_out, iter_out, iter_out)
+			sg_set_page(&sg_out[i++], bv.bv_page, bv.bv_len,
+				    bv.bv_offset);
+	}
+
+	/*
+	 * Compute the IV for the first data unit.  The cipher will derive
+	 * IVs for subsequent data units by treating this one as a 128-bit
+	 * little-endian counter and adding the data-unit index, which
+	 * matches the layout produced by plain and plain64.
+	 */
+	dmreq->iv_sector = ctx->cc_sector;
+	if (test_bit(CRYPT_IV_LARGE_SECTORS, &cc->cipher_flags))
+		dmreq->iv_sector >>= cc->sector_shift;
+	dmreq->ctx = ctx;
+
+	iv = iv_of_dmreq(cc, dmreq);
+	org_iv = org_iv_of_dmreq(cc, dmreq);
+	r = cc->iv_gen_ops->generator(cc, org_iv, dmreq);
+	if (r < 0)
+		goto out_free_sg;
+	memcpy(iv, org_iv, cc->iv_size);
+
+	/* Stash the SG arrays for cleanup on completion / free. */
+	dmreq->sg_in_ext = sg_in;
+	dmreq->sg_out_ext = (sg_out == sg_in) ? NULL : sg_out;
+
+	skcipher_request_set_crypt(req, sg_in, sg_out, total, iv);
+
+	if (bio_data_dir(ctx->bio_in) == WRITE)
+		r = crypto_skcipher_encrypt(req);
+	else
+		r = crypto_skcipher_decrypt(req);
+
+	*out_processed = total;
+	return r;
+
+out_free_sg:
+	kfree(sg_in);
+	if (sg_out != sg_in)
+		kfree(sg_out);
+	dmreq->sg_in_ext = NULL;
+	dmreq->sg_out_ext = NULL;
+	return r;
+}
+
 static void kcryptd_async_done(void *async_req, int error);
 
 static int crypt_alloc_req_skcipher(struct crypt_config *cc,
 				     struct convert_context *ctx)
 {
 	unsigned int key_index = ctx->cc_sector & (cc->tfms_count - 1);
+	struct dm_crypt_request *dmreq;
 
 	if (!ctx->r.req) {
 		ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO);
@@ -1441,6 +1591,18 @@ static int crypt_alloc_req_skcipher(struct crypt_config *cc,
 
 	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
 
+	/*
+	 * Initialise the heap-allocated scatterlist pointers so that
+	 * crypt_free_req_skcipher() does not read uninitialised memory
+	 * for paths that don't take the multi-data-unit branch.  The
+	 * dmreq trailer lives in the per-bio data area which is not
+	 * zeroed by the dm core, and the request is reused from the
+	 * mempool across many bios.
+	 */
+	dmreq = dmreq_of_req(cc, ctx->r.req);
+	dmreq->sg_in_ext = NULL;
+	dmreq->sg_out_ext = NULL;
+
 	/*
 	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
 	 * requests if driver request queue is full.
@@ -1487,6 +1649,12 @@ static void crypt_free_req_skcipher(struct crypt_config *cc,
 				    struct skcipher_request *req, struct bio *base_bio)
 {
 	struct dm_crypt_io *io = dm_per_bio_data(base_bio, cc->per_bio_data_size);
+	struct dm_crypt_request *dmreq = dmreq_of_req(cc, req);
+
+	kfree(dmreq->sg_in_ext);
+	dmreq->sg_in_ext = NULL;
+	kfree(dmreq->sg_out_ext);
+	dmreq->sg_out_ext = NULL;
 
 	if ((struct skcipher_request *)(io + 1) != req)
 		mempool_free(req, &cc->req_pool);
@@ -1515,7 +1683,9 @@ static void crypt_free_req(struct crypt_config *cc, void *req, struct bio *base_
 static blk_status_t crypt_convert(struct crypt_config *cc,
 			 struct convert_context *ctx, bool atomic, bool reset_pending)
 {
-	unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT;
+	const unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT;
+	bool multi_du = test_bit(CRYPT_MULTI_DATA_UNIT, &cc->cipher_flags);
+	unsigned int processed;
 	int r;
 
 	/*
@@ -1536,8 +1706,13 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 
 		atomic_inc(&ctx->cc_pending);
 
+		processed = cc->sector_size;
 		if (crypt_integrity_aead(cc))
 			r = crypt_convert_block_aead(cc, ctx, ctx->r.req_aead, ctx->tag_offset);
+		else if (multi_du)
+			r = crypt_convert_block_skcipher_multi(cc, ctx,
+							       ctx->r.req,
+							       &processed);
 		else
 			r = crypt_convert_block_skcipher(cc, ctx, ctx->r.req, ctx->tag_offset);
 
@@ -1559,8 +1734,19 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 					 * exit and continue processing in a workqueue
 					 */
 					ctx->r.req = NULL;
-					ctx->tag_offset++;
-					ctx->cc_sector += sector_step;
+					if (!multi_du) {
+						ctx->tag_offset++;
+						ctx->cc_sector += sector_step;
+					} else {
+						bio_advance_iter(ctx->bio_in,
+								 &ctx->iter_in,
+								 processed);
+						bio_advance_iter(ctx->bio_out,
+								 &ctx->iter_out,
+								 processed);
+						ctx->cc_sector +=
+							processed >> SECTOR_SHIFT;
+					}
 					return BLK_STS_DEV_RESOURCE;
 				}
 			} else {
@@ -1574,19 +1760,52 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 		 */
 		case -EINPROGRESS:
 			ctx->r.req = NULL;
-			ctx->tag_offset++;
-			ctx->cc_sector += sector_step;
+			if (!multi_du) {
+				ctx->tag_offset++;
+				ctx->cc_sector += sector_step;
+			} else {
+				bio_advance_iter(ctx->bio_in, &ctx->iter_in,
+						 processed);
+				bio_advance_iter(ctx->bio_out, &ctx->iter_out,
+						 processed);
+				ctx->cc_sector += processed >> SECTOR_SHIFT;
+			}
 			continue;
 		/*
 		 * The request was already processed (synchronously).
 		 */
 		case 0:
 			atomic_dec(&ctx->cc_pending);
-			ctx->cc_sector += sector_step;
-			ctx->tag_offset++;
+			if (!multi_du) {
+				ctx->cc_sector += sector_step;
+				ctx->tag_offset++;
+			} else {
+				bio_advance_iter(ctx->bio_in, &ctx->iter_in,
+						 processed);
+				bio_advance_iter(ctx->bio_out, &ctx->iter_out,
+						 processed);
+				ctx->cc_sector += processed >> SECTOR_SHIFT;
+			}
 			if (!atomic)
 				cond_resched();
 			continue;
+		/*
+		 * Multi-data-unit scatterlist allocation failed.  This can
+		 * happen on the swap-out-to-dm-crypt path under memory
+		 * pressure, where retrying with the same allocation policy
+		 * could loop forever.  Disable multi-data-unit batching for
+		 * the rest of this crypt_convert() invocation and re-enter
+		 * the per-sector path, which uses only mempool reserves and
+		 * is guaranteed to make forward progress even under total
+		 * memory exhaustion.  The per-tfm data_unit_size is left
+		 * unchanged, so subsequent bios (which start a fresh
+		 * crypt_convert() and re-read cipher_flags) will retry the
+		 * multi-data-unit path once memory pressure eases.
+		 */
+		case -EAGAIN:
+			atomic_dec(&ctx->cc_pending);
+			multi_du = false;
+			continue;
 		/*
 		 * There was a data integrity error.
 		 */
@@ -3063,6 +3282,45 @@ static int crypt_ctr_cipher(struct dm_target *ti, char *cipher_in, char *key)
 		}
 	}
 
+	/*
+	 * Enable multi-data-unit batching when the cipher supports it and
+	 * the IV layout is one we can derive per-DU from a single starting
+	 * IV: plain or plain64 produce a sequential 64-bit little-endian
+	 * counter, which matches the convention of
+	 * crypto_skcipher_set_data_unit_size().  Restrict to the simple
+	 * case (single tfm, no integrity, no per-sector post() callback)
+	 * to keep the consumer path small; modes like essiv, lmk, tcw,
+	 * eboiv, plain64be, random, null, benbi, and elephant are
+	 * deliberately excluded because their generators or post-IV hooks
+	 * cannot be re-derived by the cipher between data units.
+	 */
+	if (!crypt_integrity_aead(cc) && cc->tfms_count == 1 &&
+	    cc->iv_gen_ops &&
+	    (cc->iv_gen_ops == &crypt_iv_plain_ops ||
+	     cc->iv_gen_ops == &crypt_iv_plain64_ops) &&
+	    !cc->iv_gen_ops->post &&
+	    !cc->integrity_tag_size && !cc->integrity_iv_size &&
+	    crypto_skcipher_supports_multi_data_unit(cc->cipher_tfm.tfms[0])) {
+		ret = crypto_skcipher_set_data_unit_size(cc->cipher_tfm.tfms[0],
+							 cc->sector_size);
+		if (!ret) {
+			set_bit(CRYPT_MULTI_DATA_UNIT, &cc->cipher_flags);
+			DMINFO("Using multi-data-unit crypto offload (du=%u)",
+			       cc->sector_size);
+		} else {
+			/*
+			 * The driver advertised the capability via cra_flags
+			 * but rejected the requested data unit size.  This is
+			 * a driver bug worth seeing in dmesg; fall back to
+			 * the per-sector path so the device still activates.
+			 */
+			DMWARN_LIMIT("multi-DU offload disabled: %s rejected du=%u (%d)",
+				     crypto_skcipher_driver_name(cc->cipher_tfm.tfms[0]),
+				     cc->sector_size, ret);
+			ret = 0;
+		}
+	}
+
 	/* wipe the kernel key payload copy */
 	if (cc->key_string)
 		memset(cc->key, 0, cc->key_size * sizeof(u8));
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCHv2 2/2] nvme: add support multipath passthrough iostats
From: Nitesh Shetty @ 2026-05-27  6:46 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-nvme, axboe, hch, nilay, Keith Busch
In-Reply-To: <20260526153921.2402015-3-kbusch@meta.com>

On 26/05/26 08:39AM, Keith Busch wrote:
>From: Keith Busch <kbusch@kernel.org>
>
>Don't skip io accounting for passthrough commands if the user enabled
>tracking these.
>
>Signed-off-by: Keith Busch <kbusch@kernel.org>
>---

Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>


^ permalink raw reply

* [PATCH] block/partitions/ldm: fix TOCBLOCK bitmap mismatch message argument order
From: dayou5941 @ 2026-05-27  6:52 UTC (permalink / raw)
  To: ldm, axboe; +Cc: linux-block, liyouhong

From: liyouhong <liyouhong@kylinos.cn>

The ldm_crit() calls in ldm_parse_tocblock() have the format string
arguments in the wrong order. The format string reads:

  "TOCBLOCK's first bitmap is '%s', should be '%s'."

The intent is to print the actual (on-disk) name first and the expected
name second. However, the constant TOC_BITMAP1/TOC_BITMAP2 (expected) is
passed as the first argument and toc->bitmapX_name (actual) as the
second, producing misleading diagnostic output on corrupt disks.

Swap the two arguments so the printed message matches its intended
semantics.

Signed-off-by: liyouhong <liyouhong@kylinos.cn>
---
 block/partitions/ldm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
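
The effect of the argument order is easy to reproduce in userspace. The
sketch below is illustrative only: format_bitmap_mismatch() is a
hypothetical helper that reuses the format string quoted above, and the
sample names passed to it are placeholders rather than values taken from
ldm.h.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the diagnostic with the on-disk (actual) name first and the
 * expected name second, which is the order the format string implies.
 */
static int format_bitmap_mismatch(char *out, size_t outlen,
				  const char *actual, const char *expected)
{
	assert(out && actual && expected);
	return snprintf(out, outlen,
			"TOCBLOCK's first bitmap is '%s', should be '%s'.",
			actual, expected);
}
```

Passing the expected constant first, as the old code did, would print
the expected name after "is" and the corrupt on-disk name after
"should be".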

diff --git a/block/partitions/ldm.c b/block/partitions/ldm.c
index c0bdcae58a3e..fca55f9a583e 100644
--- a/block/partitions/ldm.c
+++ b/block/partitions/ldm.c
@@ -138,7 +138,7 @@ static bool ldm_parse_tocblock (const u8 *data, struct tocblock *toc)
 	if (strncmp (toc->bitmap1_name, TOC_BITMAP1,
 			sizeof (toc->bitmap1_name)) != 0) {
 		ldm_crit ("TOCBLOCK's first bitmap is '%s', should be '%s'.",
-				TOC_BITMAP1, toc->bitmap1_name);
+				toc->bitmap1_name, TOC_BITMAP1);
 		return false;
 	}
 	strscpy_pad(toc->bitmap2_name, data + 0x46, sizeof(toc->bitmap2_name));
@@ -147,7 +147,7 @@ static bool ldm_parse_tocblock (const u8 *data, struct tocblock *toc)
 	if (strncmp (toc->bitmap2_name, TOC_BITMAP2,
 			sizeof (toc->bitmap2_name)) != 0) {
 		ldm_crit ("TOCBLOCK's second bitmap is '%s', should be '%s'.",
-				TOC_BITMAP2, toc->bitmap2_name);
+				toc->bitmap2_name, TOC_BITMAP2);
 		return false;
 	}
 	ldm_debug ("Parsed TOCBLOCK successfully.");
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCHv2 1/2] block: export passthrough stats enabled
From: Nitesh Shetty @ 2026-05-27  6:46 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-nvme, axboe, hch, nilay, Keith Busch
In-Reply-To: <20260526153921.2402015-2-kbusch@meta.com>

On 26/05/26 08:39AM, Keith Busch wrote:
>From: Keith Busch <kbusch@kernel.org>
>
>A user can enable io accounting for passthrough requests, so export the
>helper that checks if the request should be tracked. This will enable
>stacking drivers to report iostats for passthrough workloads.
>
>Signed-off-by: Keith Busch <kbusch@kernel.org>

Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>

^ permalink raw reply

* Re: [PATCH v2 0/2] zram: fix UAF in zram_bvec_write_partial() and drop dead bio plumbing
From: Sergey Senozhatsky @ 2026-05-27  7:21 UTC (permalink / raw)
  To: Cunlong Li
  Cc: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Andrew Morton,
	Christoph Hellwig, linux-block, linux-mm, linux-kernel, stable
In-Reply-To: <20260527-zram-v2-0-2fb84b054b5c@gmail.com>

On (26/05/27 12:49), Cunlong Li wrote:
> Patch 1 fixes a use-after-free in zram_bvec_write_partial() that
> happens on PAGE_SIZE > 4K configurations when a partial write hits a
> ZRAM_WB slot.
> 
> Patch 2 is a follow-up cleanup that drops the now-unused bio parameter
> from zram_bvec_write_partial() and zram_bvec_write(), no functional
> change.

Did you test it?

Looks reasonable (unless I'm missing something):
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>

^ permalink raw reply

* Re: [PATCH v2 1/2] zram: fix use-after-free in zram_bvec_write_partial()
From: Christoph Hellwig @ 2026-05-27  7:24 UTC (permalink / raw)
  To: Cunlong Li
  Cc: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Andrew Morton,
	Christoph Hellwig, linux-block, linux-mm, linux-kernel, stable
In-Reply-To: <20260527-zram-v2-1-2fb84b054b5c@gmail.com>

On Wed, May 27, 2026 at 12:49:24PM +0800, Cunlong Li wrote:
> zram_read_page() picks the sync or async backing device read path
> based on whether the parent bio is NULL.  zram_bvec_write_partial()
> passes its parent bio down, so for ZRAM_WB slots the read is
> dispatched asynchronously and zram_read_page() returns 0 while the
> bio is still in flight.  The caller then runs memcpy_from_bvec(),
> zram_write_page() and __free_page() on the buffer, leaving the
> async read to write into a freed page.
> 
> zram_bvec_read_partial() was switched to NULL in commit 4e3c87b9421d
> ("zram: fix synchronous reads") for the same reason; the
> write_partial counterpart was missed.
> 
> Fixes: 4e3c87b9421d ("zram: fix synchronous reads")

That's just the last patch touching the line.  This bio chaining goes
further back.  AFAICS all the way to introducing backing device support
in: 8e654f8fbff5 ("zram: read page from backing device")

The patch itself looks good, though:

Reviewed-by: Christoph Hellwig <hch@lst.de>
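
The sync/async dispatch described in the quoted commit message can be
modelled in userspace. This is a toy sketch under stated assumptions, not
the zram code: toy_bio, toy_pending, toy_read_page() and
data_ready_on_return() are hypothetical names, and the "device" is
simulated by a memset().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Toy stand-in for a parent bio; only NULL versus non-NULL matters. */
struct toy_bio { int unused; };

/* One deferred read: the buffer the "device" would fill in later. */
struct toy_pending { char *dst; };

/*
 * Toy model of the dispatch: with a parent bio the read is only queued
 * and the function returns 0 before any data has arrived; with
 * parent == NULL the buffer is filled before returning.
 */
static int toy_read_page(char *page, struct toy_bio *parent,
			 struct toy_pending *pending)
{
	assert(page != NULL && pending != NULL);
	if (parent) {
		pending->dst = page;	/* would complete after return */
		return 0;
	}
	memset(page, 'D', TOY_PAGE_SIZE);	/* synchronous: data is ready */
	pending->dst = NULL;
	return 0;
}

/* Returns 1 if the buffer holds the data when toy_read_page() returns. */
static int data_ready_on_return(struct toy_bio *parent)
{
	char page[TOY_PAGE_SIZE] = { 0 };
	struct toy_pending pending = { NULL };

	toy_read_page(page, parent, &pending);
	return page[0] == 'D' && pending.dst == NULL;
}
```

In this model a caller that passes a parent bio gets a return value of 0
while the buffer is still pending, which is the window in which the real
caller copies from and frees the page.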

^ permalink raw reply

* Re: [PATCH v2 2/2] zram: drop unused bio parameter from write helpers
From: Christoph Hellwig @ 2026-05-27  7:24 UTC (permalink / raw)
  To: Cunlong Li
  Cc: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Andrew Morton,
	Christoph Hellwig, linux-block, linux-mm, linux-kernel
In-Reply-To: <20260527-zram-v2-2-2fb84b054b5c@gmail.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply

* [PATCH] rust: block: mq: align init_request numa_node arg with C signature
From: Andreas Hindborg @ 2026-05-27  9:18 UTC (permalink / raw)
  To: Andreas Hindborg, Boqun Feng, Miguel Ojeda, Gary Guo,
	Björn Roy Baron, Benno Lossin, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Jens Axboe
  Cc: Mateusz Nowicki, linux-block, rust-for-linux, linux-kernel

Commit b040a1a4523d ("block: switch numa_node to int in
blk_mq_hw_ctx and init_request") changed the type of the
`numa_node` argument of `blk_mq_ops::init_request` from
`unsigned int` to `int`. Update the Rust callback signature to
match, so that the function item can be coerced to the C fn
pointer type stored in `blk_mq_ops`.

Without this change the Rust block layer fails to build:

  error[E0308]: mismatched types
     --> rust/kernel/block/mq/operations.rs:274:28
      |
  274 |         init_request: Some(Self::init_request_callback),
      |                       ---- ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      |                       expected fn pointer, found fn item
      |
      = note: expected fn pointer
                `unsafe extern "C" fn(_, _, _, i32) -> _`
                    found fn item
                `unsafe extern "C" fn(_, _, _, u32) -> _ {...}`

The argument is unused on the Rust side, so this is a pure
type-signature change with no functional impact.

Fixes: b040a1a4523d ("block: switch numa_node to int in blk_mq_hw_ctx and init_request")
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
---
 rust/kernel/block/mq/operations.rs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rust/kernel/block/mq/operations.rs b/rust/kernel/block/mq/operations.rs
index 8ad46129a52c..861903e18fbf 100644
--- a/rust/kernel/block/mq/operations.rs
+++ b/rust/kernel/block/mq/operations.rs
@@ -218,7 +218,7 @@ impl<T: Operations> OperationsVTable<T> {
         _set: *mut bindings::blk_mq_tag_set,
         rq: *mut bindings::request,
         _hctx_idx: crate::ffi::c_uint,
-        _numa_node: crate::ffi::c_uint,
+        _numa_node: crate::ffi::c_int,
     ) -> crate::ffi::c_int {
         from_result(|| {
             // SAFETY: By the safety requirements of this function, `rq` points

---
base-commit: 27236c051c01c1c1025e0e0d12a107082557e8f1
change-id: 20260527-block-for-next-2026-05-26-2200-failure-64907085fc49

Best regards,
-- 
Andreas Hindborg <a.hindborg@kernel.org>



^ permalink raw reply related

* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Jan Kara @ 2026-05-27  9:42 UTC (permalink / raw)
  To: Tal Zussman
  Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox (Oracle),
	Christian Brauner, Darrick J. Wong, Carlos Maiolino,
	Alexander Viro, Jan Kara, Dave Chinner, Bart Van Assche,
	linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm,
	Gao Xiang
In-Reply-To: <f1b8eeb6-2397-4f48-a21a-a023eb0c80ab@columbia.edu>

On Tue 26-05-26 15:29:28, Tal Zussman wrote:
> On 5/25/26 1:17 AM, Christoph Hellwig wrote:
> > On Fri, May 22, 2026 at 06:47:43PM -0400, Tal Zussman wrote:
> >> > But this 1-jiffy delay also means we unconditionally increase
> >> > completion latency, which feels like a bad idea.  Do you have any
> >> > measurements that show where it does benefit?  Note that queueing work
> >> > already often has very measurable latency on its own.  This also
> >> > directly contradicts the erofs experience that even went to a RT
> >> > thread to reduce the latency.
> >> 
> >> I added this per Dave's feedback on v4, where he noted that XFS inodegc
> >> uses a delayed work item to avoid context switch storms. There's only a
> >> delay for the first bio in a batch to complete, as we only delay when the
> >> list is empty. I'll run some experiments and measure context switches,
> >> completion latency, etc. to see if this is necessary.
> > 
> > The difference is that XFS inodegc is not latency bound.  Most of the
> > time no one cares if it is delayed a bit, in the cases where someone
> > cares we explicitly flush the queues.  I/O completion on the other hand
> > is something where users very much care about latency.
> > 
> 
> I ran some experiments with fio on both XFS and a raw block device. Five
> iterations each for 60s. Results below.
> 
> TLDR: Removing the delay doesn't significantly decrease user-visible
> latency or otherwise improve performance, but does significantly reduce
> throughput and increase context switches in some workloads (e.g. C).
> I think it makes sense to leave the delay as-is. Thoughts?

Thanks for the test! One question below:

> Results:
> 
> Workloads (all `uncached=1`):
>   A: rw=write     bs=128k iodepth=1   ioengine=pvsync2     # XFS
>   B: rw=write     bs=128k iodepth=128 ioengine=io_uring    # XFS
>   C: rw=randwrite bs=4k   iodepth=32  ioengine=io_uring    # XFS
>   D: rw=rw 50/50  bs=64k  iodepth=32  ioengine=io_uring    # XFS
>   E: rw=write     bs=128k iodepth=128 ioengine=io_uring    # raw /dev/nvmeXn1
>   F: rw=write     bs=128k iodepth=128 numjobs=4
>      + vm.dirty_bytes=64MB, vm.dirty_background_bytes=32MB # XFS
> 
> Mean ± stddev across 5 iterations:
> 
>     metric                     delay=1           delay=0     delta
>     --------------------------------------------------------------
> 
>   A seq 128k qd1
>     BW (MB/s)                4333 ± 27         4374 ± 34     +0.9%
>     p99   (us)              36.2 ± 0.8        35.8 ± 0.4     -1.1%
>     p999  (us)               3260 ± 75         3228 ± 29     -1.0%
>     ctx-switches          184 k ± 59 k     3.68 M ± 65 k    +1903%
>     cs / io                0.09 ± 0.03       1.86 ± 0.03    +1888%
>     avg bios/run            80.4 ± 0.6         1.1 ± 0.0    -98.7%

So a 1-jiffy delay is (with default HZ=1000) 1ms. That means for this load
the completion latency should be at least 1000us, but your results show a
p99 latency of 36us. What am I missing?
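
The arithmetic can be checked directly: one jiffy lasts 1/HZ seconds, so
its length in microseconds is 1000000 / HZ. A minimal sketch, where
jiffy_to_usecs() is a hypothetical helper and not a kernel function:

```c
#include <assert.h>

/* Hypothetical helper: length of one jiffy in microseconds for a given HZ. */
static unsigned long jiffy_to_usecs(unsigned long hz)
{
	assert(hz > 0);
	return 1000000UL / hz;
}
```

With HZ=1000 the helper gives 1000 us and with HZ=250 it gives 4000 us,
both far above the 36.2 us p99 reported for workload A.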

>   B seq 128k qd128
>     BW (MB/s)               4393 ± 3.3        4311 ± 5.3     -1.9%
>     p99   (us)               8461 ± 73        8638 ± 105     +2.1%
>     p999  (us)             12465 ± 213       12386 ± 299     -0.6%
>     ctx-switches        6.90 M ± 186 k    9.72 M ± 184 k    +40.7%
>     cs / io                3.43 ± 0.10       4.92 ± 0.10    +43.4%
>     avg bios/run            51.9 ± 2.2         1.3 ± 0.0    -97.4%
> 
>   C rand 4k qd32
>     BW (MB/s)               66.2 ± 0.8        44.6 ± 7.4    -32.7%
>     p99   (us)              8002 ± 174      17990 ± 6800   +124.8%
>     p999  (us)             11390 ± 554     31890 ± 11076   +180.0%
>     ctx-switches         3.67 M ± 45 k    3.59 M ± 106 k     -2.2%
>     cs / io                3.78 ± 0.04       5.62 ± 0.83    +48.7%
>     avg bios/run            32.3 ± 1.0         3.1 ± 0.3    -90.5%

I'm somewhat surprised by how much larger the completion latency is here
without the delay. Is that due to contention on the local lock between the
IO completion interrupt and the worker? Or why is the completion latency so
big here when case B, with more IOs in flight and fewer bios per run, still
had significantly lower latency in the delay=0 case?

								Honza

>   D mixed 50/50 r/w 64k qd32
>     write BW (MB/s)       892.4 ± 20.9      925.3 ± 18.3     +3.7%
>     write p99 (us)          3562 ± 107         3601 ± 82     +1.1%
>     write p999 (us)         4673 ± 217        4647 ± 107     -0.6%
>     read BW (MB/s)        893.6 ± 20.8      926.6 ± 18.4     +3.7%
>     read p99 (us)            1003 ± 48         1035 ± 39     +3.2%
>     read p999 (us)           1545 ± 63         1476 ± 50     -4.5%
>     ctx-switches         5.15 M ± 75 k    5.79 M ± 230 k    +12.6%
>     cs / io                6.32 ± 0.15       6.85 ± 0.20     +8.5%
>     avg bios/run            23.9 ± 0.3         2.5 ± 0.0    -89.4%
> 
>   E raw 128k qd128
>     BW (MB/s)               1043 ± 1.0        1045 ± 0.5     +0.1%
>     p99   (us)             26922 ± 105       27027 ± 128     +0.4%
>     p999  (us)            37906 ± 4527      37408 ± 2464     -1.3%
>     ctx-switches          3.20 M ± 6 k     3.33 M ± 10 k     +3.8%
>     cs / io                6.71 ± 0.01       6.95 ± 0.02     +3.7%
>     avg bios/run            38.0 ± 0.1        32.0 ± 0.0    -15.6%
> 
>   F mem-pressure (dirty_bytes=64MB, 4 writers)
>     BW (MB/s)                4361 ± 24         4444 ± 40     +1.9%
>     p99   (us)             29439 ± 419       30173 ± 788     +2.5%
>     p999  (us)            35704 ± 1773       36648 ± 535     +2.6%
>     ctx-switches        20.8 M ± 1.6 M    27.1 M ± 1.4 M    +30.1%
>     cs / io                6.94 ± 0.49       8.87 ± 0.46    +27.8%
>     avg bios/run            23.6 ± 0.3         1.2 ± 0.0    -94.9%
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] block: partitions: replace __get_free_page() with kmalloc()
From: Hannes Reinecke @ 2026-05-27 10:04 UTC (permalink / raw)
  To: Vlastimil Babka, Matthew Wilcox
  Cc: Mike Rapoport, Christoph Hellwig, Jens Axboe, linux-block,
	linux-kernel, linux-mm
In-Reply-To: <8b780901-8184-4908-8bda-56fac42fe6b3@suse.com>

On 5/26/26 22:57, Vlastimil Babka wrote:
> On 5/26/26 16:37, Matthew Wilcox wrote:
>> On Tue, May 26, 2026 at 02:07:36PM +0200, Vlastimil Babka wrote:
>>> The main reasons for switching AFAIU would be related with the
>>> folio/memdesc conversions? If one needs just a kernel memory buffer,
>>> kmalloc() it is, even if it happens to be page size. Page allocator
>>> should be only used if you need e.g. the refcounting or anything else
>>> that struct page provides. But then in some cases the memdesc conversion
>>> would need adjustments at some point. With kmalloc() we can forget about
>>> this user.
>>
>> No, I think this is unrelated to memdescs.
>>
>> I've seen a few people say slightly wrong things about
>> folios/pages/memdescs recently, so let me try to clarify the end state.
>>
>> I do not intend to get rid of the ability to allocate a bare page of
>> memory with something like alloc_pages() or get_free_page().  It's
>> just that the struct page associated with it will contain far less
>> information (because it's smaller).
> 
> Alright, but isn't it still the case that if you don't need any of what
> struct page provides today or will do in the future, it's better if you just
> use kmalloc()? I thought you said so yourself?
> 
> https://lore.kernel.org/all/aPQxN7-FeFB6vTuv@casper.infradead.org/
> 
Precisely my reasoning. In most cases, __get_free_page() is just a
lazy way of saying "I need some memory and the allocation should not 
fail". And typically these callers don't really care about the page
mapping, too.
Additionally, these applications get in the way when using large block
sizes on the system, as they needlessly increase the pressure on
compaction.
So switching them over to kmalloc() is a good thing IMO.
If nothing else it allows us to differentiate which places _actually_
need struct page, and which callers just want to do a memory allocation.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply

* Re: [PATCH] rust: block: mq: align init_request numa_node arg with C signature
From: Gary Guo @ 2026-05-27 10:56 UTC (permalink / raw)
  To: Andreas Hindborg, Boqun Feng, Miguel Ojeda, Gary Guo,
	Björn Roy Baron, Benno Lossin, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Jens Axboe
  Cc: Mateusz Nowicki, linux-block, rust-for-linux, linux-kernel
In-Reply-To: <20260527-block-for-next-2026-05-26-2200-failure-v1-1-4865889e282c@kernel.org>

On Wed May 27, 2026 at 10:18 AM BST, Andreas Hindborg wrote:
> Commit b040a1a4523d ("block: switch numa_node to int in
> blk_mq_hw_ctx and init_request") changed the type of the
> `numa_node` argument of `blk_mq_ops::init_request` from
> `unsigned int` to `int`. Update the Rust callback signature to
> match, so that the function item can be coerced to the C fn
> pointer type stored in `blk_mq_ops`.
> 
> Without this change the Rust block layer fails to build:
> 
>   error[E0308]: mismatched types
>      --> rust/kernel/block/mq/operations.rs:274:28
>       |
>   274 |         init_request: Some(Self::init_request_callback),
>       |                       ---- ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>       |                       expected fn pointer, found fn item
>       |
>       = note: expected fn pointer
>                 `unsafe extern "C" fn(_, _, _, i32) -> _`
>                     found fn item
>                 `unsafe extern "C" fn(_, _, _, u32) -> _ {...}`
> 
> The argument is unused on the Rust side, so this is a pure
> type-signature change with no functional impact.
> 
> Fixes: b040a1a4523d ("block: switch numa_node to int in blk_mq_hw_ctx and init_request")
> Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>

You could also just use `i32` instead of `ffi::c_int`. But it doesn't really
matter for this patch.

Reviewed-by: Gary Guo <gary@garyguo.net>

> ---
>  rust/kernel/block/mq/operations.rs | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)


^ permalink raw reply
