Linux block layer


* Re: [PATCH v5 05/12] block/cgroup: Improve lock context annotations
From: Christoph Hellwig @ 2026-06-01  7:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
	Tejun Heo, Josef Bacik
In-Reply-To: <a30486e2e9695614e4407d5ad8c75d637f5b31db.1779997063.git.bvanassche@acm.org>

On Thu, May 28, 2026 at 12:45:42PM -0700, Bart Van Assche wrote:
> Add lock context annotations where these are missing. Move the
> blkg_conf_prep() annotation into block/blk-cgroup.h to make it visible
> to all blkg_conf_prep() callers.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply

* Re: [PATCH v5 06/12] block/cgroup: Inline blkg_conf_{open,close}_bdev_frozen()
From: Christoph Hellwig @ 2026-06-01  7:30 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
	Tejun Heo, Josef Bacik
In-Reply-To: <f7be976e5a79afc88970ea16158d0a64bbf8f25e.1779997063.git.bvanassche@acm.org>

On Thu, May 28, 2026 at 12:45:43PM -0700, Bart Van Assche wrote:
> The blkg_conf_open_bdev_frozen() calling convention is not compatible
> with lock context annotations. Inline both blkg_conf_open_bdev_frozen()

Maybe say "fold into the only caller" here?  To me, "inline" implies
turning it into an inline function.
> +	q = ctx.bdev->bd_queue;
> +	blkg_conf_close_bdev(&ctx);
> +	blk_mq_unfreeze_queue(q, memflags);
> +
>  	return nbytes;

[...]

> +	q = ctx.bdev->bd_queue;
> +	blkg_conf_close_bdev(&ctx);
> +	blk_mq_unfreeze_queue(q, memflags);
>  	return ret;

It looks like these two exit paths could easily share a single label
if you return ret when it is non-zero and nbytes otherwise.

Otherwise this looks good.
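
The shared-label idea above can be sketched in userspace as follows. This is only an illustration under stated assumptions: store_sketch() and release_all() are hypothetical names, and release_all() merely stands in for the close-bdev and unfreeze calls shown in the quoted hunks.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the close-bdev/unfreeze cleanup pair. */
static int cleanup_calls;

static void release_all(void)
{
	cleanup_calls++;
}

/*
 * Model of a write handler: 'fail' simulates an error from the parsing
 * or configuration step.  Success and failure leave through one label.
 */
static ssize_t store_sketch(int fail, size_t nbytes)
{
	int ret = 0;

	if (fail) {
		ret = -EINVAL;
		goto out;
	}

	/* ... apply the new configuration here ... */
out:
	release_all();
	/* A non-zero ret is the error; otherwise report the bytes consumed. */
	return ret ? ret : (ssize_t)nbytes;
}
```

With this shape the cleanup sequence appears once, so a later change to it cannot be applied to one exit path and missed on the other.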


^ permalink raw reply

* Re: [PATCH v5 08/12] block/blk-iocost: Add lock context annotations
From: Christoph Hellwig @ 2026-06-01  7:32 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
	Tejun Heo, Josef Bacik, Marco Elver
In-Reply-To: <669c0e4dd3a8e8e038dff8c19aa61f0359df2bd8.1779997063.git.bvanassche@acm.org>

On Thu, May 28, 2026 at 12:45:45PM -0700, Bart Van Assche wrote:
> Since iocg_lock() and iocg_unlock() both use conditional locking,
> annotate both with __no_context_analysis and use token_context_lock() to
> introduce a new lock context.

Both of these are only called from two functions.  Have you looked into
merging the locking logic into these helpers to see if this improves
the situation?  __no_context_analysis just seems like a big hammer
for something relatively simple like this.

> 
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-iocost.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/block/blk-iocost.c b/block/blk-iocost.c
> index 353c165c5cd4..3bb8ce50af42 100644
> --- a/block/blk-iocost.c
> +++ b/block/blk-iocost.c
> @@ -727,7 +727,11 @@ static void iocg_commit_bio(struct ioc_gq *iocg, struct bio *bio,
>  	put_cpu_ptr(gcs);
>  }
>  
> +token_context_lock(ioc_lock);
> +
>  static void iocg_lock(struct ioc_gq *iocg, bool lock_ioc, unsigned long *flags)
> +	__acquires(ioc_lock)
> +	__context_unsafe(conditional locking)
>  {
>  	if (lock_ioc) {
>  		spin_lock_irqsave(&iocg->ioc->lock, *flags);
> @@ -738,6 +742,8 @@ static void iocg_lock(struct ioc_gq *iocg, bool lock_ioc, unsigned long *flags)
>  }
>  
>  static void iocg_unlock(struct ioc_gq *iocg, bool unlock_ioc, unsigned long *flags)
> +	__releases(ioc_lock)
> +	__context_unsafe(conditional locking)
>  {
>  	if (unlock_ioc) {
>  		spin_unlock(&iocg->waitq.lock);
---end quoted text---

^ permalink raw reply

* Re: [PATCH v5 09/12] block/blk-mq-debugfs: Improve lock context annotations
From: Christoph Hellwig @ 2026-06-01  7:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
	Nathan Chancellor
In-Reply-To: <b49bc91945356894b8ab1dc2de1aabad6db3ce16.1779997063.git.bvanassche@acm.org>

On Thu, May 28, 2026 at 12:45:46PM -0700, Bart Van Assche wrote:
> Make the existing lock context annotations compatible with Clang. Add
> the lock context annotations that are missing.
> 
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-mq-debugfs.c | 12 ++++++------
>  block/blk.h            |  4 ++++
>  2 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 047ec887456b..5c168e82273e 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -20,7 +20,7 @@ static int queue_poll_stat_show(void *data, struct seq_file *m)
>  }
>  
>  static void *queue_requeue_list_start(struct seq_file *m, loff_t *pos)
> -	__acquires(&q->requeue_lock)
> +	__acquires(&((struct request_queue *)m->private)->requeue_lock)

I'm trying to remember where we got stuck on this, but isn't there a way
to at least have a macro for this dereference that can have a comment?

Also I guess most seq_file users will have a mess like this, but I
don't really know a good way around it.
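
One possible shape for such a macro, as a userspace sketch only. The name seq_file_to_queue() is hypothetical, the two structs are minimal mocks, and whether the context analysis accepts a macro inside __acquires() is not confirmed by this thread.

```c
#include <stddef.h>

/* Minimal mock types so the sketch compiles outside the kernel. */
struct request_queue {
	int requeue_lock;
};

struct seq_file {
	void *private;
};

/*
 * Hypothetical helper: for these debugfs attributes the seq_file private
 * pointer is the request queue, so the cast is named and documented once
 * instead of being repeated in every annotation.
 */
#define seq_file_to_queue(m) ((struct request_queue *)(m)->private)

/* Example user: reach the lock through the named dereference. */
static int *requeue_lock_of(struct seq_file *m)
{
	return &seq_file_to_queue(m)->requeue_lock;
}
```

An annotation could then read __acquires(&seq_file_to_queue(m)->requeue_lock), assuming the analysis expands the macro as ordinary C.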


^ permalink raw reply

* Re: [PATCH v5 10/12] block/kyber: Make the lock context annotations compatible with Clang
From: Christoph Hellwig @ 2026-06-01  7:34 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
	Nathan Chancellor
In-Reply-To: <5165d212359b6b2ac70f52917282b85ea6c75fdf.1779997063.git.bvanassche@acm.org>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply

* Re: [PATCH v5 11/12] block/mq-deadline: Make the lock context annotations compatible with Clang
From: Christoph Hellwig @ 2026-06-01  7:34 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
	Nathan Chancellor
In-Reply-To: <be44a8b8ed93792d33b07de74c971d1a8a5703f8.1779997063.git.bvanassche@acm.org>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply

* Re: [PATCH v5 12/12] block: Enable lock context analysis
From: Christoph Hellwig @ 2026-06-01  7:34 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal
In-Reply-To: <e4d44af70627b83fedadd9501609a2eec5d21ec3.1779997063.git.bvanassche@acm.org>

On Thu, May 28, 2026 at 12:45:49PM -0700, Bart Van Assche wrote:
> Now that all block/*.c files have been annotated, enable lock context
> analysis for all these source files.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply

* Re: [PATCH RFC] blk-integrity: fix slab-out-of-bounds in t10_pi_verify on namespace revalidation
From: Christoph Hellwig @ 2026-06-01  7:40 UTC (permalink / raw)
  To: samin_c
  Cc: Jens Axboe, Martin K. Petersen, Keith Busch, linux-block,
	linux-kernel, Sungwoo Kim, Dave Tian, Weidong Zhu, Ruimin Sun
In-Reply-To: <20260531-blk-integrity-fix-v1-1-cc7084f42cf1@outlook.com>

On Sun, May 31, 2026 at 06:45:07PM -0400, Samin Y. Chowdhury via B4 Relay wrote:
> When a namespace is revalidated between bio_integrity_prep() and
> bio_integrity_verify_fn(), the integrity profile's metadata_size may
> change under the in-flight bio. bio_integrity_verify_fn() re-reads the
> live blk_integrity via blk_get_integrity(), so blk_integrity_iterate()
> uses the new metadata_size as the per-interval step size against a
> buffer sized for the old one, advancing iter->prot_buf past the end of
> the allocation.

I don't think changing fundamental device properties such as the LBA
or integrity tag size under a live device is a good model.  So instead
of coming up with bandaids like this, we should probably just fail
any such revalidation when there are openers instead of trying to deal
with the fallout.


^ permalink raw reply

* Re: [PATCH] block: use blk_validate_byte_range() for BLKZEROOUT and BLKSECDISCARD
From: Christoph Hellwig @ 2026-06-01  7:42 UTC (permalink / raw)
  To: dayou5941; +Cc: axboe, linux-block, liyouhong
In-Reply-To: <20260529065618.3091286-1-dayou5941@163.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply

* [PATCH] blk-iolatency: fix child_lat lock irq state
From: Yu Kuai @ 2026-06-01  8:02 UTC (permalink / raw)
  To: tj, axboe; +Cc: linux-block, yukuai

iolatency_clear_scaling() takes child_lat.lock with hardirqs enabled.
The bio completion path can take the same lock from hardirq context.

This triggers lockdep after io.latency is configured and I/O completes.
Full lockdep report:

  WARNING: inconsistent lock state
  7.1.0-rc2-g6a04b2279273 #1 Not tainted
  --------------------------------
  inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
  swapper/0/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
  ffff88810c682d10 (&iolat->child_lat.lock){?.+.}-{3:3}, at: blkcg_iolatency_done_bio+0x6e7/0xb90
  {HARDIRQ-ON-W} state was registered at:
    lock_acquire+0xd4/0x290
    _raw_spin_lock+0x3a/0x70
    iolatency_set_limit+0x49b/0x590
    cgroup_file_write+0x1c5/0x4b0
    kernfs_fop_write_iter+0x1d7/0x280
    vfs_write+0x580/0x630
    ksys_write+0xec/0x190
    do_syscall_64+0x156/0x490
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
  irq event stamp: 328476
  hardirqs last  enabled at (328475): [<ffffffffa4dd93b1>] do_idle+0x261/0x400
  hardirqs last disabled at (328476): [<ffffffffa68347f3>] common_interrupt+0x13/0x90
  softirqs last  enabled at (328398): [<ffffffffa4d508ac>] __irq_exit_rcu+0x8c/0x150
  softirqs last disabled at (328387): [<ffffffffa4d508ac>] __irq_exit_rcu+0x8c/0x150

                            other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&iolat->child_lat.lock);
    <Interrupt>
      lock(&iolat->child_lat.lock);

                             *** DEADLOCK ***

  1 lock held by swapper/0/0:
   #0: ffff888103365450 (&virtscsi_vq->vq_lock){-.-.}-{3:3}, at: virtscsi_vq_done+0x9f/0x130

                            stack backtrace:
  CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 7.1.0-rc2-g6a04b2279273 #1 PREEMPT  1c49bdb9e32f352d2b66a5ca23d36d656c610458
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
  Call Trace:
   <IRQ>
   dump_stack_lvl+0x54/0x70
   print_usage_bug+0x26d/0x280
   mark_lock_irq+0x3ef/0x400
   ? save_trace+0x3d/0x2f0
   ? __pfx_stack_trace_consume_entry+0x10/0x10
   mark_lock+0x117/0x190
   __lock_acquire+0x570/0x2850
   ? stack_trace_save+0xa1/0xe0
   ? __pfx_stack_trace_save+0x10/0x10
   ? filter_irq_stacks+0x27/0x80
   ? stack_depot_save_flags+0x32/0x7f0
   lock_acquire+0xd4/0x290
   ? blkcg_iolatency_done_bio+0x6e7/0xb90
   ? kvm_sched_clock_read+0x11/0x20
   ? local_clock_noinstr+0xc/0xc0
   ? local_clock+0x15/0x30
   ? lock_release+0x111/0x470
   ? blkcg_iolatency_done_bio+0x6e7/0xb90
   _raw_spin_lock_irqsave+0x4c/0x90
   ? blkcg_iolatency_done_bio+0x6e7/0xb90
   blkcg_iolatency_done_bio+0x6e7/0xb90
   ? __pfx_blkcg_iolatency_done_bio+0x10/0x10
   __rq_qos_done_bio+0x51/0x60
   bio_endio+0x135/0x320
   blk_update_request+0x1e6/0x570
   scsi_end_request+0x4b/0x410
   scsi_io_completion+0x83/0x170
   ? __pfx_virtscsi_complete_cmd+0x10/0x10
   virtscsi_vq_done+0xd7/0x130
   ? lock_acquire+0xd4/0x290
   ? __pfx_virtscsi_vq_done+0x10/0x10
   ? local_clock_noinstr+0xc/0xc0
   ? local_clock+0x15/0x30
   vring_interrupt+0x13b/0x150
   ? __pfx_vring_interrupt+0x10/0x10
   __handle_irq_event_percpu+0x145/0x4b0
   handle_irq_event+0x54/0xb0
   handle_edge_irq+0x111/0x320
   __common_interrupt+0x97/0xf0
   common_interrupt+0x7e/0x90
   </IRQ>
   <TASK>
   asm_common_interrupt+0x26/0x40
  RIP: 0010:pv_native_safe_halt+0x13/0x20
  Code: d3 a5 01 00 cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 2f 39 21 00 f3 0f 1e fa fb f4 <c3> cc cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90
  RSP: 0018:ffffffffa7607e00 EFLAGS: 00000246
  RAX: 000000000005031b RBX: ffffffffa4dd93b1 RCX: ffffffffa683884b
  RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffffa4dd93b1
  RBP: ffffffffa7607ed0 R08: ffff888117bf408b R09: 1ffff11022f7e811
  R10: dffffc0000000000 R11: ffffed1022f7e812 R12: 0000000000000000
  R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffa7f6cff0
   ? do_idle+0x261/0x400
   ? ct_kernel_exit+0xcb/0x110
   ? do_idle+0x261/0x400
   default_idle+0x9/0x20
   default_idle_call+0x73/0xb0
   do_idle+0x261/0x400
   ? __pfx_do_idle+0x10/0x10
   ? local_clock_noinstr+0x30/0xc0
   ? local_clock+0x15/0x30
   cpu_startup_entry+0x36/0x40
   rest_init+0x207/0x210
   start_kernel+0x321/0x370
   x86_64_start_reservations+0x24/0x30
   x86_64_start_kernel+0x13a/0x140
   common_startup_64+0x13e/0x147
   </TASK>

Fix it by using spin_lock_irqsave() in iolatency_clear_scaling().
Use irqsave rather than spin_lock_irq() because the same helper is also
called from pd_offline_fn paths where hardirqs can already be disabled
by blkcg teardown/deactivation locks. spin_unlock_irq() would wrongly
enable hardirqs in those paths.

Fixes: d70675121546 ("block: introduce blk-iolatency io controller")
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-iolatency.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index 53e8dd2dfa8a..9152dc86b08b 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -811,16 +811,18 @@ static void iolatency_clear_scaling(struct blkcg_gq *blkg)
 	if (blkg->parent) {
 		struct iolatency_grp *iolat = blkg_to_lat(blkg->parent);
 		struct child_latency_info *lat_info;
+		unsigned long flags;
+
 		if (!iolat)
 			return;
 
 		lat_info = &iolat->child_lat;
-		spin_lock(&lat_info->lock);
+		spin_lock_irqsave(&lat_info->lock, flags);
 		atomic_set(&lat_info->scale_cookie, DEFAULT_SCALE_COOKIE);
 		lat_info->last_scale_event = 0;
 		lat_info->scale_grp = NULL;
 		lat_info->scale_lat = 0;
-		spin_unlock(&lat_info->lock);
+		spin_unlock_irqrestore(&lat_info->lock, flags);
 	}
 }
 
-- 
2.51.0
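
The reasoning in the commit message for preferring irqsave over spin_lock_irq() can be illustrated with a small userspace model. Nothing here is a kernel API: the model_* functions and the irqs_enabled flag are hypothetical and only simulate the interrupt-enable state that the real primitives manipulate.

```c
#include <stdbool.h>

/* Simulated per-CPU interrupt-enable flag; not a kernel API. */
static bool irqs_enabled = true;

static void model_lock_irq(void)
{
	irqs_enabled = false;
}

static void model_unlock_irq(void)
{
	irqs_enabled = true;	/* unconditional enable */
}

static void model_lock_irqsave(bool *flags)
{
	*flags = irqs_enabled;	/* remember the caller's state */
	irqs_enabled = false;
}

static void model_unlock_irqrestore(bool flags)
{
	irqs_enabled = flags;	/* put back what the caller had */
}

/* Helper called by a path that has already disabled interrupts. */
static bool nested_with_irq(void)
{
	irqs_enabled = false;	/* outer lock disabled interrupts */
	model_lock_irq();
	model_unlock_irq();
	return irqs_enabled;	/* true: interrupts were wrongly re-enabled */
}

static bool nested_with_irqsave(void)
{
	bool flags;

	irqs_enabled = false;	/* outer lock disabled interrupts */
	model_lock_irqsave(&flags);
	model_unlock_irqrestore(flags);
	return irqs_enabled;	/* false: outer state preserved */
}
```

In the model, the irq variant leaves interrupts enabled while the outer critical section is still open, which is the hazard the commit message describes for the pd_offline_fn paths.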


^ permalink raw reply related

* Re: [PATCH] MAINTAINERS: use new drbd-dev mailing list
From: Christoph Böhmwalder @ 2026-06-01  8:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Philipp Reisner, Lars Ellenberg, drbd-dev, linux-block,
	linux-kernel
In-Reply-To: <20260513065557.36042-1-christoph.boehmwalder@linbit.com>

On Wed, May 13, 2026 at 08:55:57AM +0200, Christoph Böhmwalder wrote:
>We are migrating from our own infrastructure to lists.linux.dev, so
>change the drbd-dev address to point to the new domain.
>
>Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
>---
> MAINTAINERS | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/MAINTAINERS b/MAINTAINERS
>index 4abb3345bc4e..b8db0c038c55 100644
>--- a/MAINTAINERS
>+++ b/MAINTAINERS
>@@ -7773,7 +7773,7 @@ DRBD DRIVER
> M:	Philipp Reisner <philipp.reisner@linbit.com>
> M:	Lars Ellenberg <lars.ellenberg@linbit.com>
> M:	Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
>-L:	drbd-dev@lists.linbit.com
>+L:	drbd-dev@lists.linux.dev
> S:	Supported
> W:	http://www.drbd.org
> T:	git git://git.linbit.com/linux-drbd.git
>
>base-commit: 36446de0c30c62b9d89502fd36c4904996d86ecd
>-- 
>2.53.0

Ping. @Jens is there anything still missing before this can be applied?

Thanks,
Christoph

^ permalink raw reply

* [PATCH v3 0/4] crypto: skcipher - per-tfm multi-data-unit batching
From: Leonid Ravich @ 2026-06-01  8:56 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alasdair Kergon, Ard Biesheuvel, Eric Biggers, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block

This is v3 of the multi-data-unit skcipher request series, addressing
review feedback from Mikulas Patocka on v2.

v2: https://lore.kernel.org/linux-crypto/20260527065021.19525-1-lravich@amazon.com/
v1: https://lore.kernel.org/linux-crypto/20260519115955.27267-1-lravich@amazon.com/

The series adds a per-tfm "data unit size" to the skcipher API so a
caller can submit several data units in one crypto request, mirroring
the data_unit_size concept already exposed by struct blk_crypto_config
for inline encryption hardware.  The first user is dm-crypt, which
today issues one skcipher request per sector and so pays a per-sector
cost in request allocation, callback dispatch, completion handling,
and scatterlist setup.

Proof-of-concept performance numbers from the RFC reply [1]: +19%
throughput / -40% CPU on a single-core arm64 system with a hardware
XTS-AES-256 accelerator running fio 4 KiB sequential writes through
dm-crypt, when an out-of-tree arm64 xts driver advertises the new
flag.  This series itself does not include arch enablement.

[1] https://lore.kernel.org/linux-crypto/20260428101225.24316-1-lravich@amazon.com/

Changes since v2
----------------

Patch 4 (dm-crypt) only.  Patches 1-3 are unchanged from v2.

  - Replace integer division with the equivalent shift, and tighten
    the size sanity check from "is total < sector_size?" to "is
    total a multiple of sector_size?".  Reject unaligned residues
    explicitly instead of silently truncating them.  The local
    n_sectors variable used only for a now-redundant !=0 check was
    dropped — crypt_convert()'s outer while-loop already guarantees
    iter_in.bi_size > 0 on entry.  (Mikulas)

  - Drop `min(iter_in.bi_size, iter_out.bi_size)` in favour of using
    iter_in.bi_size directly, with a WARN_ON_ONCE() to flag any
    future violation of the "iter_in and iter_out describe equally-
    sized payloads" invariant maintained by crypt_convert_init().
    Replaces a silent mask of a real bug with an explicit warning.
    (Mikulas)
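
A minimal sketch of the first change above, assuming a power-of-two sector size expressed as a shift. bytes_to_sectors() is a hypothetical helper, not code from patch 4, and it keeps an explicit zero check only because it runs standalone without the outer loop guarantee mentioned above.

```c
#include <errno.h>

/*
 * Hypothetical helper: convert a byte count into a sector count for a
 * power-of-two sector size given as a shift (9 for 512 bytes).
 * Returns -EINVAL for an empty or unaligned length.
 */
static int bytes_to_sectors(unsigned int total, unsigned int sector_shift,
			    unsigned int *n_sectors)
{
	unsigned int sector_size = 1u << sector_shift;

	/* Reject residues explicitly instead of truncating them. */
	if (!total || (total & (sector_size - 1)))
		return -EINVAL;

	*n_sectors = total >> sector_shift;	/* same as total / sector_size */
	return 0;
}
```

A "total < sector_size" test would let a length such as 4100 through and silently drop the last 4 bytes after division by 512; the mask test rejects it.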

Changes since v1
----------------

Patch 4 only.  Addressed Mikulas's review of v1:

  - Multi-DU scatterlist allocation uses
    GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN.

  - On scatterlist allocation failure, return -EAGAIN.
    crypt_convert() handles -EAGAIN by clearing its local multi_du
    flag and re-entering the per-sector path for the rest of this
    crypt_convert() invocation.  The per-tfm data_unit_size on the
    cipher remains set, so subsequent bios (which start a fresh
    crypt_convert() and re-read cipher_flags) get to try multi-DU
    again once memory pressure eases.

    This gives forward progress under total memory exhaustion: the
    per-sector path uses only cc->req_pool (a mempool with reservoir
    set up at table-load time) and the inline
    dmreq->sg_in[]/sg_out[] arrays, never doing any allocation that
    could fail.

  - Walk the bio with __bio_for_each_bvec instead of
    __bio_for_each_segment for folio-friendly SG construction.
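
The -EAGAIN fallback described above can be modeled in userspace roughly as follows. All names are hypothetical and the counters exist only to make the behaviour observable; the real logic lives in crypt_convert() and is not reproduced here.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical failure switch and counters for the model. */
static bool batch_alloc_fails;
static unsigned int batched_units, single_units;

/* Batched path: needs an allocation that may fail under memory pressure. */
static int convert_batched(unsigned int units)
{
	if (batch_alloc_fails)
		return -EAGAIN;
	batched_units += units;
	return 0;
}

/* Per-unit path: uses only preallocated resources in this model. */
static void convert_one(void)
{
	single_units++;
}

static int convert_all(unsigned int units, bool multi_du)
{
	while (units) {
		if (multi_du) {
			int r = convert_batched(units);

			if (r == -EAGAIN) {
				/* Local flag only; the next call tries batching again. */
				multi_du = false;
				continue;
			}
			if (r)
				return r;
			units = 0;
			continue;
		}
		convert_one();
		units--;
	}
	return 0;
}
```

Because only the local flag is cleared, a later call to convert_all() starts with batching enabled again, mirroring how subsequent bios retry multi-DU once memory pressure eases.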

Design overview (unchanged from v1)
-----------------------------------

* Patch 1 adds an `unsigned int data_unit_size` field to
  `struct crypto_skcipher` (per-tfm: invariant for the consumer's
  lifetime, set once via `crypto_skcipher_set_data_unit_size()`),
  plus a capability flag CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT in
  `cra_flags` (type-specific high-byte range, mirroring the
  CRYPTO_AHASH_ALG_BLOCK_ONLY precedent).  `crypto_skcipher_encrypt()`
  and `crypto_skcipher_decrypt()` validate that `cryptlen` is a
  positive multiple of `data_unit_size`.  The setter rejects
  sub-blocksize values; algorithm registration rejects the flag for
  algorithms with `ivsize != 16`.

  Also exposes `skcipher_walk_data_units()` in
  <crypto/internal/skcipher.h> as a default per-DU dispatcher for
  drivers that don't want to roll their own.

* Patch 2 lets the generic `xts(...)` template advertise the flag
  when the inner cipher is synchronous.

* Patch 3 extends `testmgr` with a self-comparison test that fires
  automatically for every alg advertising the flag.

* Patch 4 enables multi-data-unit batching in dm-crypt automatically
  when all of the following hold at table load: skcipher (not aead),
  tfms_count == 1, IV mode is plain or plain64, no per-sector
  iv_gen_ops->post() hook, no dm-integrity stacking, and the
  underlying cipher advertises the capability.
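
The activation conditions for patch 4 can be read as a single predicate. The sketch below is an assumption-laden illustration: struct crypt_setup_sketch and its field names are hypothetical summaries of the listed conditions, not dm-crypt's actual data structures.

```c
#include <stdbool.h>

/* Hypothetical summary of the table-load state named in the list above. */
struct crypt_setup_sketch {
	bool is_aead;			/* aead rather than skcipher */
	unsigned int tfms_count;
	bool iv_plain_or_plain64;
	bool has_iv_post_hook;		/* per-sector iv_gen_ops->post() */
	bool integrity_stacked;		/* dm-integrity stacking */
	bool cipher_multi_du;		/* cipher advertises the capability */
};

/* Batching is enabled only when every condition holds. */
static bool can_batch(const struct crypt_setup_sketch *cs)
{
	return !cs->is_aead &&
	       cs->tfms_count == 1 &&
	       cs->iv_plain_or_plain64 &&
	       !cs->has_iv_post_hook &&
	       !cs->integrity_stacked &&
	       cs->cipher_multi_du;
}
```

Any single failing condition leaves the target on the existing per-sector path, which matches the fallback behaviour reported for essiv:sha256 and plain64be in the verification list.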

This series intentionally does NOT add the capability flag to any
arch crypto driver.  Arch maintainers can opt in independently in
follow-up patches.

Verification
------------

A formal regression protocol is included in the project tree
(.claude/regression-protocol.md, .claude/run-regression.sh).  The
v3 reference run reports 12/12 cases PASS:

  - x86 + arm64 build clean (with and without out-of-tree arch
    enablement).
  - checkpatch.pl --strict: clean on all 4 patches.
  - testmgr self-comparison: PASS for any algorithm advertising the
    flag (verified end-to-end against an out-of-tree arm64/x86 xts
    driver during regression).
  - dm-crypt activation gating: plain/plain64 enabled,
    essiv:sha256 / plain64be fall back.
  - dm-crypt round-trip plain64: PASS with multi-DU active.
  - dm-crypt round-trip essiv:sha256 (per-sector path on multi-DU
    kernel): PASS.
  - dm-crypt low-memory (mem=128M): PASS, no OOM kill.
  - Byte-equivalence: 256 MB of ciphertext written through the
    multi-DU path is bit-identical to ciphertext written through
    the per-sector path on an unpatched axboe/for-next baseline
    (sha256
    4913910b1aa6f8859fcb8f4adec20230274993a3ade8f4dd0140a323dc43efc0).
    The on-disk format is unchanged.
  - arm64 functional (activation + round-trip) under qemu-aarch64:
    PASS.

The OOM-fallback path (multi-DU helper returns -EAGAIN, caller
reverts to per-sector) is verified by inspection: the fallback is
two lines in crypt_convert(), the per-sector path uses only the
existing mempool reserve and the inline dmreq SG arrays (no
allocation that could fail), and there is no shared state between
the two paths that could deadlock.

Leonid Ravich (4):
  crypto: skcipher - add per-tfm data_unit_size for batched requests
  crypto: xts - support multiple data units per request in template
  crypto: testmgr - exercise multi-data-unit path for skcipher
  dm crypt: batch all sectors of a bio per crypto request

 crypto/skcipher.c                  | 120 ++++++++++++
 crypto/testmgr.c                   | 129 +++++++++++++
 crypto/xts.c                       |  25 ++-
 drivers/md/dm-crypt.c              | 281 ++++++++++++++++++++++++++++-
 include/crypto/internal/skcipher.h |  34 ++++
 include/crypto/skcipher.h          |  85 +++++++++
 6 files changed, 665 insertions(+), 9 deletions(-)

base-commit: a8cafdf8c949f17c92eca0045532e88ac0dac30d
-- 
2.47.3

^ permalink raw reply

* [PATCH v3 1/4] crypto: skcipher - add per-tfm data_unit_size for batched requests
From: Leonid Ravich @ 2026-06-01  8:56 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alasdair Kergon, Ard Biesheuvel, Eric Biggers, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <20260601085644.13026-1-lravich@amazon.com>

Add a per-tfm data_unit_size and an algorithm capability flag that
together allow a caller to submit several data units in a single
skcipher request.  The IV passed in the request applies to the first
data unit; the algorithm advances the tweak between data units
according to the mode specification (e.g., LE128 multiply for XTS per
IEEE 1619).

This mirrors the data_unit_size concept already exposed by
struct blk_crypto_config for inline encryption hardware, but at the
software skcipher layer.  The first user is dm-crypt, which today
issues one request per sector and so pays a per-sector cost in
request allocation, IV generation, callback dispatch, and completion
handling.  Allowing the cipher to consume a whole bio per request
removes that overhead for drivers that can chain across data units
internally.

The data_unit_size lives on struct crypto_skcipher rather than on
struct skcipher_request because it does not change between requests
for any plausible consumer: dm-crypt picks one sector size per
mapped target at table load time; fscrypt would pick one per master
key.  Anchoring it to the tfm also lets the driver validate it once
at setkey() time and avoids per-request initialisation hazards on
mempool-recycled requests.

Capability is advertised with CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT
in cra_flags (type-specific high-byte range, mirroring the
CRYPTO_AHASH_ALG_* convention).  This makes the capability visible
in /proc/crypto and lets templates OR it into their derived
algorithms.

crypto_skcipher_set_data_unit_size() returns -EOPNOTSUPP if the
algorithm does not advertise the flag, and accepts 0 (the default)
unconditionally so callers can re-disable batching cheaply.

crypto_skcipher_encrypt()/decrypt() reject requests whose cryptlen
is not a multiple of the configured data_unit_size with -EINVAL.
The check is gated on data_unit_size != 0 so it costs nothing for
the common single-data-unit case.

No in-tree algorithm advertises the flag yet; subsequent patches in
this series add the generic xts() template producer, testmgr coverage,
and the dm-crypt consumer.

Signed-off-by: Leonid Ravich <lravich@amazon.com>
---
 crypto/skcipher.c                  | 120 +++++++++++++++++++++++++++++
 include/crypto/internal/skcipher.h |  34 ++++++++
 include/crypto/skcipher.h          |  85 ++++++++++++++++++++
 3 files changed, 239 insertions(+)

diff --git a/crypto/skcipher.c b/crypto/skcipher.c
index 2b31d1d5d268..bc37bd554aec 100644
--- a/crypto/skcipher.c
+++ b/crypto/skcipher.c
@@ -432,13 +432,119 @@ int crypto_skcipher_setkey(struct crypto_skcipher *tfm, const u8 *key,
 }
 EXPORT_SYMBOL_GPL(crypto_skcipher_setkey);
 
+int crypto_skcipher_set_data_unit_size(struct crypto_skcipher *tfm,
+				       unsigned int data_unit_size)
+{
+	unsigned int blocksize;
+
+	if (!data_unit_size) {
+		tfm->data_unit_size = 0;
+		return 0;
+	}
+
+	if (!crypto_skcipher_supports_multi_data_unit(tfm))
+		return -EOPNOTSUPP;
+
+	blocksize = crypto_skcipher_blocksize(tfm);
+	if (data_unit_size < blocksize || data_unit_size % blocksize)
+		return -EINVAL;
+
+	tfm->data_unit_size = data_unit_size;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(crypto_skcipher_set_data_unit_size);
+
+static int crypto_skcipher_check_data_unit_size(struct crypto_skcipher *tfm,
+						struct skcipher_request *req)
+{
+	unsigned int du = tfm->data_unit_size;
+
+	if (likely(!du))
+		return 0;
+	if (req->cryptlen % du)
+		return -EINVAL;
+	return 0;
+}
+
+/*
+ * Increment a 16-byte little-endian counter held in @iv.  See
+ * crypto_skcipher_set_data_unit_size() for the convention.
+ */
+static inline void skcipher_iv_inc_le128(u8 *iv)
+{
+	__le64 lo_le, hi_le;
+	u64 lo;
+
+	memcpy(&lo_le, iv, 8);
+	memcpy(&hi_le, iv + 8, 8);
+	lo = le64_to_cpu(lo_le) + 1;
+	lo_le = cpu_to_le64(lo);
+	memcpy(iv, &lo_le, 8);
+	if (unlikely(lo == 0)) {
+		hi_le = cpu_to_le64(le64_to_cpu(hi_le) + 1);
+		memcpy(iv + 8, &hi_le, 8);
+	}
+}
+
+int skcipher_walk_data_units(struct skcipher_request *req,
+			     int (*body)(struct skcipher_request *))
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	const unsigned int du = tfm->data_unit_size;
+	const unsigned int total = req->cryptlen;
+	struct scatterlist *orig_src = req->src;
+	struct scatterlist *orig_dst = req->dst;
+	struct scatterlist src_sg[2], dst_sg[2];
+	u8 iv_save[16];
+	unsigned int off;
+	int err = 0;
+
+	if (likely(!du))
+		return body(req);
+
+	/*
+	 * Registration of an algorithm advertising
+	 * CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT enforces ivsize == 16
+	 * (see skcipher_prepare_alg_common()), so this is purely
+	 * defensive against algorithm-registration bugs.
+	 */
+	if (WARN_ON_ONCE(crypto_skcipher_ivsize(tfm) != 16))
+		return -EINVAL;
+
+	memcpy(iv_save, req->iv, 16);
+
+	for (off = 0; off < total; off += du) {
+		req->cryptlen = du;
+		req->src = scatterwalk_ffwd(src_sg, orig_src, off);
+		req->dst = (orig_src == orig_dst) ? req->src :
+			   scatterwalk_ffwd(dst_sg, orig_dst, off);
+
+		err = body(req);
+		if (err)
+			break;
+
+		skcipher_iv_inc_le128(iv_save);
+		memcpy(req->iv, iv_save, 16);
+	}
+
+	req->src = orig_src;
+	req->dst = orig_dst;
+	req->cryptlen = total;
+	return err;
+}
+EXPORT_SYMBOL_GPL(skcipher_walk_data_units);
+
 int crypto_skcipher_encrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct skcipher_alg *alg = crypto_skcipher_alg(tfm);
+	int err;
 
 	if (crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
 		return -ENOKEY;
+	err = crypto_skcipher_check_data_unit_size(tfm, req);
+	if (err)
+		return err;
 	if (alg->co.base.cra_type != &crypto_skcipher_type)
 		return crypto_lskcipher_encrypt_sg(req);
 	return alg->encrypt(req);
@@ -449,9 +555,13 @@ int crypto_skcipher_decrypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	struct skcipher_alg *alg = crypto_skcipher_alg(tfm);
+	int err;
 
 	if (crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
 		return -ENOKEY;
+	err = crypto_skcipher_check_data_unit_size(tfm, req);
+	if (err)
+		return err;
 	if (alg->co.base.cra_type != &crypto_skcipher_type)
 		return crypto_lskcipher_decrypt_sg(req);
 	return alg->decrypt(req);
@@ -680,6 +790,16 @@ int skcipher_prepare_alg_common(struct skcipher_alg_common *alg)
 	    (alg->ivsize + alg->statesize) > PAGE_SIZE / 2)
 		return -EINVAL;
 
+	/*
+	 * Algorithms advertising multi-data-unit support must use the
+	 * 16-byte little-endian counter convention documented in
+	 * crypto_skcipher_set_data_unit_size(); see also
+	 * skcipher_walk_data_units().
+	 */
+	if ((base->cra_flags & CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT) &&
+	    alg->ivsize != 16)
+		return -EINVAL;
+
 	if (!alg->chunksize)
 		alg->chunksize = base->cra_blocksize;
 
diff --git a/include/crypto/internal/skcipher.h b/include/crypto/internal/skcipher.h
index a965b6aabf61..bed1b1f1bbdc 100644
--- a/include/crypto/internal/skcipher.h
+++ b/include/crypto/internal/skcipher.h
@@ -21,6 +21,40 @@
  */
 #define CRYPTO_ALG_SKCIPHER_REQSIZE_LARGE CRYPTO_ALG_OPTIONAL_KEY
 
+/**
+ * skcipher_walk_data_units - dispatch a request as one body call per data unit
+ * @req: the caller's skcipher request
+ * @body: the algorithm's single-data-unit encrypt or decrypt function
+ *
+ * When tfm->data_unit_size is zero this is a tail call into @body with
+ * @req unchanged.  Otherwise the request is split into
+ * cryptlen / data_unit_size sub-ranges and @body is called once per
+ * sub-range with req->cryptlen, req->src, req->dst, and req->iv adjusted
+ * for that sub-range.  The IV passed to data unit n is the caller-
+ * supplied IV plus n, where + is a 128-bit little-endian add — this
+ * matches the convention documented in
+ * crypto_skcipher_set_data_unit_size().
+ *
+ * Many single-data-unit XTS bodies modify the IV buffer in place during
+ * processing (the tweak is walked block by block).  This helper saves
+ * the caller's IV before each call and rewrites the next data unit's
+ * IV from the saved value, so the body always sees a fresh per-DU IV
+ * regardless of any in-place mutation it performs.
+ *
+ * The body MUST run to completion synchronously.  Drivers that use this
+ * helper therefore advertise CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT only
+ * for synchronous configurations.
+ *
+ * After the call returns, the contents of req->iv are unspecified per
+ * the documented contract.  src/dst/cryptlen are restored to the
+ * caller's values to keep skcipher request post-conditions intact.
+ *
+ * Return: 0 on success, or the body's negative errno on the first
+ *	   data unit that returned non-zero.
+ */
+int skcipher_walk_data_units(struct skcipher_request *req,
+			     int (*body)(struct skcipher_request *));
+
 struct aead_request;
 struct rtattr;
 
diff --git a/include/crypto/skcipher.h b/include/crypto/skcipher.h
index 4efe2ca8c4d1..5941b6b24b98 100644
--- a/include/crypto/skcipher.h
+++ b/include/crypto/skcipher.h
@@ -26,6 +26,15 @@
 /* Set this bit if the skcipher operation is not final. */
 #define CRYPTO_SKCIPHER_REQ_NOTFINAL	0x00000002
 
+/*
+ * Set in cra_flags by an skcipher algorithm that supports processing
+ * multiple data units in a single request.  See
+ * crypto_skcipher_set_data_unit_size().
+ *
+ * Type-specific flag in the 0xff000000 reserved range.
+ */
+#define CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT	0x01000000
+
 struct scatterlist;
 
 /**
@@ -53,6 +62,22 @@ struct skcipher_request {
 struct crypto_skcipher {
 	unsigned int reqsize;
 
+	/*
+	 * Number of bytes in one data unit when batching multiple data units
+	 * per request.  0 means "single data unit per request" (legacy
+	 * behaviour).  Set via crypto_skcipher_set_data_unit_size().
+	 *
+	 * When non-zero, cryptlen must be a multiple of data_unit_size.  The
+	 * IV passed in skcipher_request::iv applies to the first data unit;
+	 * the IV for data unit n is that IV plus n, computed as a 128-bit
+	 * little-endian add (see crypto_skcipher_set_data_unit_size()); the
+	 * mode's own tweak update applies only within a data unit.
+	 *
+	 * Only algorithms that advertise CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT
+	 * in cra_flags accept a non-zero value.
+	 */
+	unsigned int data_unit_size;
+
 	struct crypto_tfm base;
 };
 
@@ -492,6 +517,66 @@ static inline unsigned int crypto_lskcipher_chunksize(
 	return crypto_lskcipher_alg(tfm)->co.chunksize;
 }
 
+/**
+ * crypto_skcipher_supports_multi_data_unit() - test multi-data-unit support
+ * @tfm: cipher handle
+ *
+ * Return: true if the algorithm advertises that it can process multiple
+ *	   data units in a single skcipher_request.
+ */
+static inline bool
+crypto_skcipher_supports_multi_data_unit(struct crypto_skcipher *tfm)
+{
+	return crypto_skcipher_alg_common(tfm)->base.cra_flags &
+		CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT;
+}
+
+/**
+ * crypto_skcipher_set_data_unit_size() - set data unit size for the tfm
+ * @tfm: cipher handle
+ * @data_unit_size: data unit size in bytes; 0 disables multi-data-unit mode
+ *
+ * Configure the tfm to process multiple data units per request.  When set
+ * to a non-zero value, every subsequent encrypt/decrypt request must have
+ * cryptlen that is a multiple of @data_unit_size.  Each data unit is
+ * processed as if it were a separate request whose IV is derived from the
+ * preceding data unit's IV by the algorithm-specific tweak update rule:
+ * the implementation treats the caller-supplied IV as a 128-bit
+ * little-endian counter and adds the data-unit index for each subsequent
+ * data unit.
+ *
+ * The contents of req->iv after a multi-data-unit request returns are
+ * unspecified — callers MUST NOT rely on it being either the original
+ * value or the final-data-unit value.  Set a fresh IV before every
+ * request.
+ *
+ * The algorithm must advertise CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT in its
+ * cra_flags.  @data_unit_size must be a positive multiple of the
+ * algorithm's cra_blocksize, otherwise -EINVAL is returned.
+ *
+ * Setting @data_unit_size to 0 reverts the tfm to single-data-unit
+ * behaviour and is always permitted.
+ *
+ * Return: 0 on success; -EOPNOTSUPP if the algorithm does not advertise
+ *	   multi-data-unit support; -EINVAL if @data_unit_size is not a
+ *	   positive multiple of the cipher block size.
+ */
+int crypto_skcipher_set_data_unit_size(struct crypto_skcipher *tfm,
+				       unsigned int data_unit_size);
+
+/**
+ * crypto_skcipher_data_unit_size() - obtain data unit size
+ * @tfm: cipher handle
+ *
+ * Return: configured data unit size in bytes; 0 if multi-data-unit mode
+ *	   is disabled.
+ */
+static inline unsigned int
+crypto_skcipher_data_unit_size(struct crypto_skcipher *tfm)
+{
+	return tfm->data_unit_size;
+}
+
 /**
  * crypto_skcipher_statesize() - obtain state size
  * @tfm: cipher handle
-- 
2.47.3


^ permalink raw reply related

* [PATCH v3 2/4] crypto: xts - support multiple data units per request in template
From: Leonid Ravich @ 2026-06-01  8:56 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alasdair Kergon, Ard Biesheuvel, Eric Biggers, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <20260601085644.13026-1-lravich@amazon.com>

Teach the generic xts() template to consume cryptlen larger than one
data unit when the caller has configured a non-zero data_unit_size on
the tfm.  Each data unit is processed with its own IV, derived from
the caller-supplied IV by treating it as a 128-bit little-endian
counter and adding the data-unit index.  This matches the
sector-indexed XTS used by dm-crypt's plain64 IV mode and by typical
inline-encryption hardware.
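
As an illustration of the IV convention described above, the following
user-space sketch adds a data-unit index to a 16-byte IV treated as a
little-endian integer.  The helper name le128_add_u64() is hypothetical
and is not part of this series; it only models the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): out = iv + n, where iv and
 * out are 16-byte little-endian integers.  out may alias iv.
 */
static void le128_add_u64(uint8_t out[16], const uint8_t iv[16], uint64_t n)
{
	unsigned int carry = 0;
	int i;

	assert(out && iv);
	for (i = 0; i < 16; i++) {
		/* Byte i of n, or 0 once past its eight bytes. */
		unsigned int add = i < 8 ?
			(unsigned int)((n >> (8 * i)) & 0xff) : 0;
		unsigned int sum = iv[i] + add + carry;

		out[i] = (uint8_t)sum;
		carry = sum >> 8;
	}
}
```

With a plain64 IV (sector number in the low 8 bytes, zeros above), the
carry into the upper half only matters when the low 64 bits wrap, but
the sketch handles that case as well.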

The single-data-unit bodies are unchanged apart from being renamed to
xts_encrypt_one() and xts_decrypt_one().  They are now reached via
skcipher_walk_data_units(), which calls straight into the body when
data_unit_size is zero (the legacy default), so existing users see
no extra cost.

Advertise CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT in cra_flags only when
the inner cipher is synchronous.  An async inner cipher would require
a per-DU completion chain which is out of scope for the slow software
template; consumers that need multi-DU on async hardware will use one
of the arch-specific drivers added later in this series.

Signed-off-by: Leonid Ravich <lravich@amazon.com>
---
 crypto/xts.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/crypto/xts.c b/crypto/xts.c
index ad97c8091582..f0585ea9d6d5 100644
--- a/crypto/xts.c
+++ b/crypto/xts.c
@@ -258,7 +258,7 @@ static int xts_init_crypt(struct skcipher_request *req,
 	return 0;
 }
 
-static int xts_encrypt(struct skcipher_request *req)
+static int xts_encrypt_one(struct skcipher_request *req)
 {
 	struct xts_request_ctx *rctx = skcipher_request_ctx(req);
 	struct skcipher_request *subreq = &rctx->subreq;
@@ -275,7 +275,7 @@ static int xts_encrypt(struct skcipher_request *req)
 	return xts_cts_final(req, crypto_skcipher_encrypt);
 }
 
-static int xts_decrypt(struct skcipher_request *req)
+static int xts_decrypt_one(struct skcipher_request *req)
 {
 	struct xts_request_ctx *rctx = skcipher_request_ctx(req);
 	struct skcipher_request *subreq = &rctx->subreq;
@@ -292,6 +292,16 @@ static int xts_decrypt(struct skcipher_request *req)
 	return xts_cts_final(req, crypto_skcipher_decrypt);
 }
 
+static int xts_encrypt(struct skcipher_request *req)
+{
+	return skcipher_walk_data_units(req, xts_encrypt_one);
+}
+
+static int xts_decrypt(struct skcipher_request *req)
+{
+	return skcipher_walk_data_units(req, xts_decrypt_one);
+}
+
 static int xts_init_tfm(struct crypto_skcipher *tfm)
 {
 	struct skcipher_instance *inst = skcipher_alg_instance(tfm);
@@ -427,6 +437,17 @@ static int xts_create(struct crypto_template *tmpl, struct rtattr **tb)
 	inst->alg.base.cra_alignmask = alg->base.cra_alignmask |
 				       (__alignof__(u64) - 1);
 
+	/*
+	 * Advertise multi-data-unit support only when the inner cipher is
+	 * synchronous.  The dispatcher in skcipher_walk_data_units() calls
+	 * the single-DU body in a loop and assumes synchronous completion;
+	 * supporting async would require a per-DU callback chain, which
+	 * the slow software template does not need.
+	 */
+	if (!(alg->base.cra_flags & CRYPTO_ALG_ASYNC))
+		inst->alg.base.cra_flags |=
+			CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT;
+
 	inst->alg.ivsize = XTS_BLOCK_SIZE;
 	inst->alg.min_keysize = alg->min_keysize * 2;
 	inst->alg.max_keysize = alg->max_keysize * 2;
-- 
2.47.3


^ permalink raw reply related

* [PATCH v3 4/4] dm crypt: batch all sectors of a bio per crypto request
From: Leonid Ravich @ 2026-06-01  8:56 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alasdair Kergon, Ard Biesheuvel, Eric Biggers, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <20260601085644.13026-1-lravich@amazon.com>

When the underlying skcipher driver advertises support for multiple
data units in a single request (CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT),
configure the cipher with cc->sector_size as data_unit_size and
submit one request per bio instead of one request per sector.  This
removes per-sector overhead in the crypto API hot path: request
allocation, callback dispatch, completion handling, and SG setup.

The optimisation is enabled automatically at table load when all
of the following hold:

 - the cipher is non-aead (i.e. skcipher);
 - tfms_count is 1 (interleaved per-sector keys would break batching);
 - the IV mode is plain or plain64 (the only modes whose generator
   produces a sequential 64-bit little-endian counter that the cipher
   can extend by adding the data-unit index, matching the convention
   documented in crypto_skcipher_set_data_unit_size());
 - the iv_gen_ops->post() hook is unset (lmk and tcw use it; both are
   already excluded by the IV-mode test, but the explicit check makes
   the assumption durable against future IV modes);
 - dm-integrity is not stacked (no integrity tag or integrity IV);
 - the cipher driver advertises multi-data-unit support.
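
The conditions above can be read as a single predicate.  The sketch
below is a simplified user-space model; struct mdu_cfg, enum iv_mode
and mdu_eligible() are hypothetical names chosen for illustration and
do not exist in dm-crypt.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, flattened view of the state the patch inspects. */
enum iv_mode { IV_PLAIN, IV_PLAIN64, IV_ESSIV, IV_OTHER };

struct mdu_cfg {
	bool is_aead;			/* cipher is an AEAD */
	unsigned int tfms_count;	/* number of interleaved keys */
	enum iv_mode iv_mode;
	bool has_post_hook;		/* iv_gen_ops->post is set */
	unsigned int integrity_tag_size;
	unsigned int integrity_iv_size;
	bool driver_multi_du;		/* driver advertises the flag */
};

/* True only when every condition in the list holds. */
static bool mdu_eligible(const struct mdu_cfg *c)
{
	assert(c);
	return !c->is_aead &&
	       c->tfms_count == 1 &&
	       (c->iv_mode == IV_PLAIN || c->iv_mode == IV_PLAIN64) &&
	       !c->has_post_hook &&
	       !c->integrity_tag_size && !c->integrity_iv_size &&
	       c->driver_multi_du;
}
```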

A new CRYPT_MULTI_DATA_UNIT cipher_flag, set once at construction
time, gates the multi-data-unit path.  The existing per-sector path
in crypt_convert_block_skcipher() is unchanged; the new
crypt_convert_block_skcipher_multi() is reached from a small dispatch
in crypt_convert() and shares the same backlog/-EBUSY/-EINPROGRESS
flow control with the per-sector path.
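
The dispatch in crypt_convert() also handles -EAGAIN from the
multi-data-unit helper by dropping to the per-sector path for the rest
of the bio.  The following user-space model sketches that control flow
only; struct conv_state, convert_multi(), convert_single() and
convert_all() are hypothetical stand-ins, and the allocation failure is
simulated with a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-bio conversion state. */
struct conv_state {
	unsigned int remaining;		/* bytes left in the bio */
	unsigned int sector_size;
	bool alloc_fails;		/* simulate scatterlist ENOMEM */
	unsigned int multi_calls;
	unsigned int single_calls;
};

/* Batched path: handles everything that is left, or asks for fallback. */
static int convert_multi(struct conv_state *s, unsigned int *processed)
{
	s->multi_calls++;
	if (s->alloc_fails)
		return -EAGAIN;
	*processed = s->remaining;
	return 0;
}

/* Per-sector path: always makes progress in this model. */
static int convert_single(struct conv_state *s)
{
	s->single_calls++;
	return 0;
}

static int convert_all(struct conv_state *s, bool multi_du)
{
	assert(s->sector_size && s->remaining % s->sector_size == 0);
	while (s->remaining) {
		unsigned int processed = s->sector_size;
		int r = multi_du ? convert_multi(s, &processed) :
				   convert_single(s);

		if (r == -EAGAIN) {
			multi_du = false;	/* this bio only */
			continue;
		}
		if (r)
			return r;
		s->remaining -= processed;
	}
	return 0;
}
```

Because multi_du is a local copy of the flag, the next bio starts with
batching enabled again, which is the behaviour the -EAGAIN comment in
the diff describes.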

Heap-allocated scatterlists are stashed in dm_crypt_request and freed
in crypt_free_req_skcipher() to avoid races between the
synchronous-success free path and async-completion reuse from the
request pool.  If the scatterlist allocation fails,
crypt_convert_block_skcipher_multi() returns -EAGAIN and crypt_convert()
falls back to the per-sector path for the rest of that bio, which uses
only mempool reserves.  Subsequent bios retry the multi-data-unit path.

Verified end-to-end with a byte-equivalence test: encrypted output of
plain64 dm-crypt with the multi-data-unit path matches output of the
single-data-unit path bit-for-bit over a 256 MB device.

Signed-off-by: Leonid Ravich <lravich@amazon.com>
---
 drivers/md/dm-crypt.c | 281 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 274 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 608b617fb817..df20ffa6e61e 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -101,6 +101,14 @@ struct dm_crypt_request {
 	struct scatterlist sg_in[4];
 	struct scatterlist sg_out[4];
 	u64 iv_sector;
+	/*
+	 * Heap-allocated scatterlists used by the multi-data-unit path
+	 * when one bio is processed in a single skcipher request.  NULL
+	 * when the inline sg_in[]/sg_out[] arrays above are sufficient
+	 * (single-data-unit path).  Freed in crypt_free_req_skcipher().
+	 */
+	struct scatterlist *sg_in_ext;
+	struct scatterlist *sg_out_ext;
 };
 
 struct crypt_config;
@@ -151,6 +159,7 @@ enum cipher_flags {
 	CRYPT_IV_LARGE_SECTORS,		/* Calculate IV from sector_size, not 512B sectors */
 	CRYPT_ENCRYPT_PREPROCESS,	/* Must preprocess data for encryption (elephant) */
 	CRYPT_KEY_MAC_SIZE_SET,		/* The integrity_key_size option was used */
+	CRYPT_MULTI_DATA_UNIT,		/* Batch all sectors of a bio per crypto request */
 };
 
 /*
@@ -1426,12 +1435,162 @@ static int crypt_convert_block_skcipher(struct crypt_config *cc,
 	return r;
 }
 
+/*
+ * Multi-data-unit variant of crypt_convert_block_skcipher.  Submits all
+ * remaining sectors of the current bio in one skcipher request whose
+ * data_unit_size is cc->sector_size.  The cipher walks the IV between
+ * data units (see crypto_skcipher_set_data_unit_size()).
+ *
+ * Returns the same set of values as crypt_convert_block_skcipher:
+ *   0 on synchronous success (full chunk processed),
+ *   -EINPROGRESS / -EBUSY on asynchronous dispatch,
+ *   -EAGAIN if the per-bio scatterlist allocation cannot be made.  The
+ *           caller MUST disable multi-data-unit batching for the rest
+ *           of this bio and re-enter the per-sector path, which uses
+ *           only mempool reserves and is therefore safe even on the
+ *           swap-out-to-dm-crypt path under total memory exhaustion.
+ *   negative errno otherwise.
+ *
+ * Once submitted, *out_processed is the byte count the caller advances by.
+ *
+ * Walks the bio with __bio_for_each_bvec so that multi-page folios
+ * produce one scatterlist entry rather than N (one per PAGE_SIZE).
+ */
+static int crypt_convert_block_skcipher_multi(struct crypt_config *cc,
+					      struct convert_context *ctx,
+					      struct skcipher_request *req,
+					      unsigned int *out_processed)
+{
+	const unsigned int sector_size = cc->sector_size;
+	const gfp_t gfp = GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN;
+	unsigned int total = ctx->iter_in.bi_size;
+	unsigned int n_sg_in = 0, n_sg_out = 0;
+	struct dm_crypt_request *dmreq = dmreq_of_req(cc, req);
+	struct scatterlist *sg_in = NULL, *sg_out = NULL;
+	struct bvec_iter iter_in, iter_out;
+	struct bio_vec bv;
+	u8 *iv, *org_iv;
+	int r;
+
+	/*
+	 * crypt_convert_init() sets bio_in == bio_out for reads and aligns
+	 * the read/write iterators to the same byte count, so iter_in and
+	 * iter_out always describe equally-sized payloads.  WARN if that
+	 * invariant is ever violated by a future change.
+	 */
+	if (WARN_ON_ONCE(ctx->iter_in.bi_size != ctx->iter_out.bi_size))
+		return -EIO;
+
+	/*
+	 * crypt_convert()'s outer loop only enters this helper when
+	 * iter_in.bi_size > 0, so total is non-zero here; reject any
+	 * sub-DU residue.
+	 */
+	if (unlikely(total & (sector_size - 1)))
+		return -EIO;
+
+	/*
+	 * Walk the bio_vec iterators to count how many SG entries we need
+	 * for exactly @total bytes.  bi_size of the iterators is at least
+	 * @total by construction above.
+	 */
+	iter_in = ctx->iter_in;
+	iter_in.bi_size = total;
+	__bio_for_each_bvec(bv, ctx->bio_in, iter_in, iter_in)
+		n_sg_in++;
+
+	iter_out = ctx->iter_out;
+	iter_out.bi_size = total;
+	__bio_for_each_bvec(bv, ctx->bio_out, iter_out, iter_out)
+		n_sg_out++;
+
+	sg_in = kmalloc_array(n_sg_in, sizeof(*sg_in), gfp);
+	sg_out = (ctx->bio_in == ctx->bio_out) ? sg_in :
+		 kmalloc_array(n_sg_out, sizeof(*sg_out), gfp);
+	if (!sg_in || !sg_out) {
+		/*
+		 * Allocation may legitimately fail under memory pressure on
+		 * the swap-out-to-dm-crypt path.  Return -EAGAIN so the
+		 * caller falls back to the per-sector path for this bio
+		 * rather than looping forever in the allocator or requeueing
+		 * the bio just to fail again.
+		 */
+		kfree(sg_in);
+		if (sg_out != sg_in)
+			kfree(sg_out);
+		return -EAGAIN;
+	}
+
+	sg_init_table(sg_in, n_sg_in);
+	{
+		unsigned int i = 0;
+
+		iter_in = ctx->iter_in;
+		iter_in.bi_size = total;
+		__bio_for_each_bvec(bv, ctx->bio_in, iter_in, iter_in)
+			sg_set_page(&sg_in[i++], bv.bv_page, bv.bv_len,
+				    bv.bv_offset);
+	}
+
+	if (sg_out != sg_in) {
+		unsigned int i = 0;
+
+		sg_init_table(sg_out, n_sg_out);
+		iter_out = ctx->iter_out;
+		iter_out.bi_size = total;
+		__bio_for_each_bvec(bv, ctx->bio_out, iter_out, iter_out)
+			sg_set_page(&sg_out[i++], bv.bv_page, bv.bv_len,
+				    bv.bv_offset);
+	}
+
+	/*
+	 * Compute the IV for the first data unit.  The cipher will derive
+	 * IVs for subsequent data units by treating this one as a 128-bit
+	 * little-endian counter and adding the data-unit index, which
+	 * matches the layout produced by plain and plain64.
+	 */
+	dmreq->iv_sector = ctx->cc_sector;
+	if (test_bit(CRYPT_IV_LARGE_SECTORS, &cc->cipher_flags))
+		dmreq->iv_sector >>= cc->sector_shift;
+	dmreq->ctx = ctx;
+
+	iv = iv_of_dmreq(cc, dmreq);
+	org_iv = org_iv_of_dmreq(cc, dmreq);
+	r = cc->iv_gen_ops->generator(cc, org_iv, dmreq);
+	if (r < 0)
+		goto out_free_sg;
+	memcpy(iv, org_iv, cc->iv_size);
+
+	/* Stash the SG arrays for cleanup on completion / free. */
+	dmreq->sg_in_ext = sg_in;
+	dmreq->sg_out_ext = (sg_out == sg_in) ? NULL : sg_out;
+
+	skcipher_request_set_crypt(req, sg_in, sg_out, total, iv);
+
+	if (bio_data_dir(ctx->bio_in) == WRITE)
+		r = crypto_skcipher_encrypt(req);
+	else
+		r = crypto_skcipher_decrypt(req);
+
+	*out_processed = total;
+	return r;
+
+out_free_sg:
+	kfree(sg_in);
+	if (sg_out != sg_in)
+		kfree(sg_out);
+	dmreq->sg_in_ext = NULL;
+	dmreq->sg_out_ext = NULL;
+	return r;
+}
+
 static void kcryptd_async_done(void *async_req, int error);
 
 static int crypt_alloc_req_skcipher(struct crypt_config *cc,
 				     struct convert_context *ctx)
 {
 	unsigned int key_index = ctx->cc_sector & (cc->tfms_count - 1);
+	struct dm_crypt_request *dmreq;
 
 	if (!ctx->r.req) {
 		ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO);
@@ -1441,6 +1600,18 @@ static int crypt_alloc_req_skcipher(struct crypt_config *cc,
 
 	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
 
+	/*
+	 * Initialise the heap-allocated scatterlist pointers so that
+	 * crypt_free_req_skcipher() does not read uninitialised memory
+	 * for paths that don't take the multi-data-unit branch.  The
+	 * dmreq trailer lives in the per-bio data area which is not
+	 * zeroed by the dm core, and the request is reused from the
+	 * mempool across many bios.
+	 */
+	dmreq = dmreq_of_req(cc, ctx->r.req);
+	dmreq->sg_in_ext = NULL;
+	dmreq->sg_out_ext = NULL;
+
 	/*
 	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
 	 * requests if driver request queue is full.
@@ -1487,6 +1658,12 @@ static void crypt_free_req_skcipher(struct crypt_config *cc,
 				    struct skcipher_request *req, struct bio *base_bio)
 {
 	struct dm_crypt_io *io = dm_per_bio_data(base_bio, cc->per_bio_data_size);
+	struct dm_crypt_request *dmreq = dmreq_of_req(cc, req);
+
+	kfree(dmreq->sg_in_ext);
+	dmreq->sg_in_ext = NULL;
+	kfree(dmreq->sg_out_ext);
+	dmreq->sg_out_ext = NULL;
 
 	if ((struct skcipher_request *)(io + 1) != req)
 		mempool_free(req, &cc->req_pool);
@@ -1515,7 +1692,9 @@ static void crypt_free_req(struct crypt_config *cc, void *req, struct bio *base_
 static blk_status_t crypt_convert(struct crypt_config *cc,
 			 struct convert_context *ctx, bool atomic, bool reset_pending)
 {
-	unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT;
+	const unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT;
+	bool multi_du = test_bit(CRYPT_MULTI_DATA_UNIT, &cc->cipher_flags);
+	unsigned int processed;
 	int r;
 
 	/*
@@ -1536,8 +1715,13 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 
 		atomic_inc(&ctx->cc_pending);
 
+		processed = cc->sector_size;
 		if (crypt_integrity_aead(cc))
 			r = crypt_convert_block_aead(cc, ctx, ctx->r.req_aead, ctx->tag_offset);
+		else if (multi_du)
+			r = crypt_convert_block_skcipher_multi(cc, ctx,
+							       ctx->r.req,
+							       &processed);
 		else
 			r = crypt_convert_block_skcipher(cc, ctx, ctx->r.req, ctx->tag_offset);
 
@@ -1559,8 +1743,19 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 					 * exit and continue processing in a workqueue
 					 */
 					ctx->r.req = NULL;
-					ctx->tag_offset++;
-					ctx->cc_sector += sector_step;
+					if (!multi_du) {
+						ctx->tag_offset++;
+						ctx->cc_sector += sector_step;
+					} else {
+						bio_advance_iter(ctx->bio_in,
+								 &ctx->iter_in,
+								 processed);
+						bio_advance_iter(ctx->bio_out,
+								 &ctx->iter_out,
+								 processed);
+						ctx->cc_sector +=
+							processed >> SECTOR_SHIFT;
+					}
 					return BLK_STS_DEV_RESOURCE;
 				}
 			} else {
@@ -1574,19 +1769,52 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 		 */
 		case -EINPROGRESS:
 			ctx->r.req = NULL;
-			ctx->tag_offset++;
-			ctx->cc_sector += sector_step;
+			if (!multi_du) {
+				ctx->tag_offset++;
+				ctx->cc_sector += sector_step;
+			} else {
+				bio_advance_iter(ctx->bio_in, &ctx->iter_in,
+						 processed);
+				bio_advance_iter(ctx->bio_out, &ctx->iter_out,
+						 processed);
+				ctx->cc_sector += processed >> SECTOR_SHIFT;
+			}
 			continue;
 		/*
 		 * The request was already processed (synchronously).
 		 */
 		case 0:
 			atomic_dec(&ctx->cc_pending);
-			ctx->cc_sector += sector_step;
-			ctx->tag_offset++;
+			if (!multi_du) {
+				ctx->cc_sector += sector_step;
+				ctx->tag_offset++;
+			} else {
+				bio_advance_iter(ctx->bio_in, &ctx->iter_in,
+						 processed);
+				bio_advance_iter(ctx->bio_out, &ctx->iter_out,
+						 processed);
+				ctx->cc_sector += processed >> SECTOR_SHIFT;
+			}
 			if (!atomic)
 				cond_resched();
 			continue;
+		/*
+		 * Multi-data-unit scatterlist allocation failed.  This can
+		 * happen on the swap-out-to-dm-crypt path under memory
+		 * pressure, where retrying with the same allocation policy
+		 * could loop forever.  Disable multi-data-unit batching for
+		 * the rest of this crypt_convert() invocation and re-enter
+		 * the per-sector path, which uses only mempool reserves and
+		 * is guaranteed to make forward progress even under total
+		 * memory exhaustion.  The per-tfm data_unit_size is left
+		 * unchanged, so subsequent bios (which start a fresh
+		 * crypt_convert() and re-read cipher_flags) will retry the
+		 * multi-data-unit path once memory pressure eases.
+		 */
+		case -EAGAIN:
+			atomic_dec(&ctx->cc_pending);
+			multi_du = false;
+			continue;
 		/*
 		 * There was a data integrity error.
 		 */
@@ -3063,6 +3291,45 @@ static int crypt_ctr_cipher(struct dm_target *ti, char *cipher_in, char *key)
 		}
 	}
 
+	/*
+	 * Enable multi-data-unit batching when the cipher supports it and
+	 * the IV layout is one we can derive per-DU from a single starting
+	 * IV: plain or plain64 produce a sequential 64-bit little-endian
+	 * counter, which matches the convention of
+	 * crypto_skcipher_set_data_unit_size().  Restrict to the simple
+	 * case (single tfm, no integrity, no per-sector post() callback)
+	 * to keep the consumer path small; modes like essiv, lmk, tcw,
+	 * eboiv, plain64be, random, null, benbi, and elephant are
+	 * deliberately excluded because their generators or post-IV hooks
+	 * cannot be re-derived by the cipher between data units.
+	 */
+	if (!crypt_integrity_aead(cc) && cc->tfms_count == 1 &&
+	    cc->iv_gen_ops &&
+	    (cc->iv_gen_ops == &crypt_iv_plain_ops ||
+	     cc->iv_gen_ops == &crypt_iv_plain64_ops) &&
+	    !cc->iv_gen_ops->post &&
+	    !cc->integrity_tag_size && !cc->integrity_iv_size &&
+	    crypto_skcipher_supports_multi_data_unit(cc->cipher_tfm.tfms[0])) {
+		ret = crypto_skcipher_set_data_unit_size(cc->cipher_tfm.tfms[0],
+							 cc->sector_size);
+		if (!ret) {
+			set_bit(CRYPT_MULTI_DATA_UNIT, &cc->cipher_flags);
+			DMINFO("Using multi-data-unit crypto offload (du=%u)",
+			       cc->sector_size);
+		} else {
+			/*
+			 * The driver advertised the capability via cra_flags
+			 * but rejected the requested data unit size.  This is
+			 * a driver bug worth seeing in dmesg; fall back to
+			 * the per-sector path so the device still activates.
+			 */
+			DMWARN_LIMIT("multi-DU offload disabled: %s rejected du=%u (%d)",
+				     crypto_skcipher_driver_name(cc->cipher_tfm.tfms[0]),
+				     cc->sector_size, ret);
+			ret = 0;
+		}
+	}
+
 	/* wipe the kernel key payload copy */
 	if (cc->key_string)
 		memset(cc->key, 0, cc->key_size * sizeof(u8));
-- 
2.47.3


^ permalink raw reply related

* [PATCH v3 3/4] crypto: testmgr - exercise multi-data-unit path for skcipher
From: Leonid Ravich @ 2026-06-01  8:56 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alasdair Kergon, Ard Biesheuvel, Eric Biggers, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <20260601085644.13026-1-lravich@amazon.com>

Add a self-comparison test that runs whenever an skcipher algorithm
advertises CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT in cra_flags.  The test
encrypts the same random plaintext two ways:

  1. as one batched request with data_unit_size set, and
  2. as N back-to-back single-data-unit requests with IVs derived from
     the original IV by adding the data-unit index (treated as a
     128-bit little-endian counter, matching the convention documented
     in crypto_skcipher_set_data_unit_size()).

Both encrypts must produce byte-identical ciphertext, otherwise the
algorithm's multi-DU implementation is inconsistent with its single-DU
behaviour.  Iterates over a fixed set of typical data unit sizes
(512, 1024, 2048, 4096) which cover the dm-crypt sector-size range.
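
The shape of the self-comparison can be sketched in user space with a
toy transform in place of a real cipher.  Everything below (struct
toy_req, toy_body(), toy_walk_data_units(), toy_self_compare()) is
hypothetical and only models the idea; toy_body() is not cryptography.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct toy_req {
	uint8_t *buf;
	unsigned int len;
	uint8_t iv[16];
};

/* out = in + n over a 16-byte little-endian integer. */
static void le128_add(uint8_t out[16], const uint8_t in[16], uint64_t n)
{
	unsigned int carry = 0;
	int i;

	for (i = 0; i < 16; i++) {
		unsigned int add = i < 8 ?
			(unsigned int)((n >> (8 * i)) & 0xff) : 0;
		unsigned int sum = in[i] + add + carry;

		out[i] = (uint8_t)sum;
		carry = sum >> 8;
	}
}

/* Toy single-data-unit body (NOT a cipher); mutates req->iv in place. */
static void toy_body(struct toy_req *req)
{
	unsigned int i;

	for (i = 0; i < req->len; i++) {
		req->buf[i] ^= req->iv[i % 16];
		req->iv[i % 16] = (uint8_t)(req->iv[i % 16] * 5 + 1);
	}
}

/* Batched path: one call covering every data unit in the request. */
static void toy_walk_data_units(struct toy_req *req, unsigned int du_size)
{
	uint8_t saved_iv[16];
	uint8_t *buf = req->buf;
	unsigned int len = req->len, n;

	memcpy(saved_iv, req->iv, sizeof(saved_iv));
	for (n = 0; n < len / du_size; n++) {
		req->buf = buf + n * du_size;
		req->len = du_size;
		le128_add(req->iv, saved_iv, n);	/* fresh per-DU IV */
		toy_body(req);
	}
	req->buf = buf;		/* restore the caller's view */
	req->len = len;
}

/* Returns 0 when the batched and per-unit outputs are byte-identical. */
static int toy_self_compare(const uint8_t *plain, unsigned int total,
			    unsigned int du_size, const uint8_t iv[16])
{
	uint8_t batched[64], unit[64];
	struct toy_req req;
	unsigned int n;

	assert(du_size && total <= sizeof(batched) && total % du_size == 0);
	memcpy(batched, plain, total);
	memcpy(unit, plain, total);

	req.buf = batched;
	req.len = total;
	memcpy(req.iv, iv, 16);
	toy_walk_data_units(&req, du_size);

	for (n = 0; n < total / du_size; n++) {
		req.buf = unit + n * du_size;
		req.len = du_size;
		le128_add(req.iv, iv, n);	/* derive from original IV */
		toy_body(&req);
	}
	return memcmp(batched, unit, total);
}
```

The reference loop in this sketch re-derives each IV from the original
value rather than incrementing a working buffer.  That choice stays
correct even when the body mutates req->iv in place, which the
skcipher_walk_data_units() documentation says many XTS bodies do.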

The test is gated on ivsize == 16 (XTS, the only multi-DU consumer in
the kernel today) and on the algorithm advertising the capability,
so it costs nothing for the existing fleet of skcipher drivers.

Signed-off-by: Leonid Ravich <lravich@amazon.com>
---
 crypto/testmgr.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 4d86efae65b2..8ca92ee6b37c 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3211,6 +3211,123 @@ static int test_skcipher(int enc, const struct cipher_test_suite *suite,
 	return 0;
 }
 
+/*
+ * For algorithms that advertise CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT,
+ * verify that one request batching N data units produces the same
+ * ciphertext as N back-to-back single-data-unit requests with IVs
+ * derived from the original IV by adding the data-unit index (treated
+ * as a 128-bit little-endian counter).
+ *
+ * This is a self-comparison: it does not depend on test-vector
+ * authoritativeness, only on the algorithm being internally consistent
+ * between its single-DU and multi-DU paths.
+ */
+#define TEST_MDU_NR_UNITS	4
+static int test_skcipher_multi_du(struct crypto_skcipher *tfm,
+				  unsigned int du_size)
+{
+	const char *driver = crypto_skcipher_driver_name(tfm);
+	const unsigned int ivsize = crypto_skcipher_ivsize(tfm);
+	const unsigned int total = du_size * TEST_MDU_NR_UNITS;
+	struct skcipher_request *req = NULL;
+	struct scatterlist sg_in, sg_out;
+	DECLARE_CRYPTO_WAIT(wait);
+	u8 iv_orig[16] = {0};
+	u8 iv_work[16];
+	u8 *plain = NULL, *batched = NULL, *unit = NULL;
+	unsigned int i;
+	int err;
+
+	if (ivsize != 16)
+		return 0;
+
+	plain = kmalloc(total, GFP_KERNEL);
+	batched = kmalloc(total, GFP_KERNEL);
+	unit = kmalloc(total, GFP_KERNEL);
+	req = skcipher_request_alloc(tfm, GFP_KERNEL);
+	if (!plain || !batched || !unit || !req) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	get_random_bytes(plain, total);
+	get_random_bytes(iv_orig, ivsize);
+
+	/* Pass 1: one batched encrypt with data_unit_size set. */
+	err = crypto_skcipher_set_data_unit_size(tfm, du_size);
+	if (err) {
+		pr_err("alg: skcipher: %s set_data_unit_size(%u) failed: %d\n",
+		       driver, du_size, err);
+		goto out;
+	}
+	memcpy(batched, plain, total);
+	memcpy(iv_work, iv_orig, ivsize);
+	sg_init_one(&sg_in, batched, total);
+	sg_out = sg_in;
+	skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
+				      CRYPTO_TFM_REQ_MAY_SLEEP,
+				      crypto_req_done, &wait);
+	skcipher_request_set_crypt(req, &sg_in, &sg_out, total, iv_work);
+	err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);
+	if (err) {
+		pr_err("alg: skcipher: %s multi-DU batched encrypt failed: %d\n",
+		       driver, err);
+		goto out_clear_du;
+	}
+
+	/* Pass 2: TEST_MDU_NR_UNITS single-DU encrypts with derived IVs. */
+	err = crypto_skcipher_set_data_unit_size(tfm, 0);
+	if (err)
+		goto out;
+	memcpy(unit, plain, total);
+	memcpy(iv_work, iv_orig, ivsize);
+	for (i = 0; i < TEST_MDU_NR_UNITS; i++) {
+		sg_init_one(&sg_in, unit + i * du_size, du_size);
+		sg_out = sg_in;
+		skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
+					      CRYPTO_TFM_REQ_MAY_SLEEP,
+					      crypto_req_done, &wait);
+		skcipher_request_set_crypt(req, &sg_in, &sg_out, du_size,
+					   iv_work);
+		err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);
+		if (err) {
+			pr_err("alg: skcipher: %s single-DU[%u] encrypt failed: %d\n",
+			       driver, i, err);
+			goto out;
+		}
+		/* Increment iv_work as a 128-bit little-endian counter. */
+		{
+			__le64 lo_le, hi_le;
+			u64 lo;
+
+			memcpy(&lo_le, iv_work, 8);
+			memcpy(&hi_le, iv_work + 8, 8);
+			lo = le64_to_cpu(lo_le) + 1;
+			lo_le = cpu_to_le64(lo);
+			memcpy(iv_work, &lo_le, 8);
+			if (lo == 0) {
+				hi_le = cpu_to_le64(le64_to_cpu(hi_le) + 1);
+				memcpy(iv_work + 8, &hi_le, 8);
+			}
+		}
+	}
+
+	if (memcmp(batched, unit, total) != 0) {
+		pr_err("alg: skcipher: %s multi-DU mismatch (du=%u, n=%u)\n",
+		       driver, du_size, TEST_MDU_NR_UNITS);
+		err = -EINVAL;
+	}
+
+out_clear_du:
+	(void)crypto_skcipher_set_data_unit_size(tfm, 0);
+out:
+	skcipher_request_free(req);
+	kfree(unit);
+	kfree(batched);
+	kfree(plain);
+	return err;
+}
+
 static int alg_test_skcipher(const struct alg_test_desc *desc,
 			     const char *driver, u32 type, u32 mask)
 {
@@ -3259,6 +3376,18 @@ static int alg_test_skcipher(const struct alg_test_desc *desc,
 	if (err)
 		goto out;
 
+	if (crypto_skcipher_supports_multi_data_unit(tfm)) {
+		static const unsigned int du_sizes[] = { 512, 1024, 2048, 4096 };
+		unsigned int j;
+
+		for (j = 0; j < ARRAY_SIZE(du_sizes); j++) {
+			err = test_skcipher_multi_du(tfm, du_sizes[j]);
+			if (err)
+				goto out;
+			cond_resched();
+		}
+	}
+
 	err = test_skcipher_vs_generic_impl(desc->generic_driver, req, tsgls);
 out:
 	free_cipher_test_sglists(tsgls);
-- 
2.47.3


^ permalink raw reply related

* block: fix handling of dead zone write plugs
From: Gyokhan Kochmarla @ 2026-06-01  9:29 UTC (permalink / raw)
  To: stable
  Cc: gregkh, axboe, dlemoal, johannes.thumshirn, linux-block,
	Shin'ichiro Kawasaki, Gyokhan Kochmarla

From: Damien Le Moal <dlemoal@kernel.org>

commit 836efd35c472d89c838d7b17ef339ddb3286ffc5 upstream.

Shin'ichiro reported hard-to-reproduce unaligned write errors with zoned
block devices. Under normal operating conditions (e.g. running XFS on an
SMR disk), these errors are nearly impossible to trigger. But using a
"slow" kernel with many debug options enabled and some specific use
cases (e.g. fio zbd test case 46), the errors can be reproduced fairly
easily.

The unaligned write errors come from mishandling a valid reference
counting pattern of zone write plugs. Such a pattern triggers, for
instance, if a process A writes to a zone (not necessarily to the full
state), another process B immediately resets the zone and, immediately
following the completion of the zone reset, starts issuing writes to the
zone. With such a pattern, in some cases, the zone write plugs worker
thread of the device may still be holding a reference to the zone write
plug of the zone, taken when process A was writing to the zone. The
following zone reset from process B marks the zone as dead but does not
remove the zone write plug from the device hash table as a reference to
the plug still exists. Once process B starts issuing new writes, the zone
write plug is seen as dead and the writes from process B are immediately
failed, despite this write pattern being perfectly legal.

Fix this by allowing a dead zone write plug to be restored to a live
state if a write is issued to the zone while the plug is marked as dead,
the zone is empty, and the write sector corresponds to the first sector
of the zone (that is, the write is aligned to the zone write pointer).
This is done with the new helper function disk_check_zone_wplug_dead(),
which restores a dead zone write plug to a live state by clearing the
BLK_ZONE_WPLUG_DEAD flag and restoring the initial reference to the zone
write plug taken when the plug was added to the device hash table.
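
The revival rule can be modelled in user space as follows.  This is a
sketch under stated assumptions: struct toy_wplug, WPLUG_DEAD and
toy_check_wplug_dead() are hypothetical stand-ins, a plain counter
replaces refcount_t and the bio list, and no locking is modelled.  It
mirrors the checks visible in the diff below (empty zone, no queued
BIOs).

```c
#include <assert.h>
#include <stdbool.h>

#define WPLUG_DEAD 0x1u	/* stand-in for BLK_ZONE_WPLUG_DEAD */

/* Hypothetical, reduced model of a zone write plug. */
struct toy_wplug {
	unsigned int flags;
	unsigned int wp_offset;		/* 0 means the zone is empty */
	unsigned int queued_bios;	/* stand-in for bio_list */
	unsigned int ref;
};

/* Returns true if the plug stays dead and the write must be failed. */
static bool toy_check_wplug_dead(struct toy_wplug *p)
{
	assert(p);
	if (!(p->flags & WPLUG_DEAD))
		return false;

	/* Empty zone with nothing queued: revive the plug. */
	if (!p->wp_offset && !p->queued_bios) {
		p->flags &= ~WPLUG_DEAD;
		p->ref++;	/* restore the hash-table reference */
		return false;
	}
	return true;
}
```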

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: b7d4ffb51037 ("block: fix zone write plug removal")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://patch.msgid.link/20260513111129.108809-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

[ context conflict due to different line offsets in blk-zoned.c ]
Signed-off-by: Gyokhan Kochmarla <gyokhan@amazon.de>

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -505,6 +505,28 @@ static void disk_mark_zone_wplug_dead(struct blk_zone_wplug *zwplug)
 	}
 }

+static inline bool disk_check_zone_wplug_dead(struct blk_zone_wplug *zwplug)
+{
+	if (!(zwplug->flags & BLK_ZONE_WPLUG_DEAD))
+		return false;
+
+	/*
+	 * If a new write is received right after a zone reset completes and
+	 * while the disk_zone_wplugs_worker() thread has not yet released the
+	 * reference on the zone write plug after processing the last write to
+	 * the zone, then the new write BIO will see the zone write plug marked
+	 * as dead. This case is however a false positive and a perfectly valid
+	 * pattern. In such case, restore the zone write plug to a live one.
+	 */
+	if (!zwplug->wp_offset && bio_list_empty(&zwplug->bio_list)) {
+		zwplug->flags &= ~BLK_ZONE_WPLUG_DEAD;
+		refcount_inc(&zwplug->ref);
+		return false;
+	}
+
+	return true;
+}
+
 static void blk_zone_wplug_bio_work(struct work_struct *work);

 /*
@@ -1027,12 +1049,12 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 	}

 	/*
-	 * If we got a zone write plug marked as dead, then the user is issuing
-	 * writes to a full zone, or without synchronizing with zone reset or
-	 * zone finish operations. In such case, fail the BIO to signal this
-	 * invalid usage.
+	 * Check if we got a zone write plug marked as dead. If yes, then the
+	 * user is likely issuing writes to a full zone, or without
+	 * synchronizing with zone reset or zone finish operations. In such
+	 * case, fail the BIO to signal this invalid usage.
 	 */
-	if (zwplug->flags & BLK_ZONE_WPLUG_DEAD) {
+	if (disk_check_zone_wplug_dead(zwplug)) {
 		spin_unlock_irqrestore(&zwplug->lock, flags);
 		disk_put_zone_wplug(zwplug);
 		bio_io_error(bio);

^ permalink raw reply

* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Jan Kara @ 2026-06-01 11:04 UTC (permalink / raw)
  To: Tal Zussman
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Matthew Wilcox (Oracle),
	Christian Brauner, Darrick J. Wong, Carlos Maiolino,
	Alexander Viro, Dave Chinner, Bart Van Assche, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <80cb53f3-e7e5-4f96-bacc-f9fb7661d976@columbia.edu>

On Fri 29-05-26 16:46:15, Tal Zussman wrote:
> On 5/27/26 9:00 AM, Christoph Hellwig wrote:
> > On Wed, May 27, 2026 at 11:42:28AM +0200, Jan Kara wrote:
> >> > I ran some experiments with fio on both XFS and a raw block device. Five
> >> > iterations each for 60s. Results below.
> >> > 
> >> > TLDR: Removing the delay doesn't significantly decrease user-visible
> >> > latency or otherwise improve performance, but does significantly reduce
> >> > throughput and increase context switches in some workloads (e.g. C).
> >> > I think it makes sense to leave the delay as-is. Thoughts?
> >> 
> >> Thanks for the test! One question below:
> > 
> > Thanks from me as well!
> > 
> >> 
> >> > Results:
> >> > 
> >> > Workloads (all `uncached=1`):
> >> >   A: rw=write     bs=128k iodepth=1   ioengine=pvsync2     # XFS
> >> >   B: rw=write     bs=128k iodepth=128 ioengine=io_uring    # XFS
> >> >   C: rw=randwrite bs=4k   iodepth=32  ioengine=io_uring    # XFS
> >> >   D: rw=rw 50/50  bs=64k  iodepth=32  ioengine=io_uring    # XFS
> >> >   E: rw=write     bs=128k iodepth=128 ioengine=io_uring    # raw /dev/nvmeXn1
> >> >   F: rw=write     bs=128k iodepth=128 numjobs=4
> >> >      + vm.dirty_bytes=64MB, vm.dirty_background_bytes=32MB # XFS
> >> > 
> >> > Mean ± stddev across 5 iterations:
> >> > 
> >> >     metric                     delay=1           delay=0     delta
> >> >     --------------------------------------------------------------
> >> > 
> >> >   A seq 128k qd1
> >> >     BW (MB/s)                4333 ± 27         4374 ± 34     +0.9%
> >> >     p99   (us)              36.2 ± 0.8        35.8 ± 0.4     -1.1%
> >> >     p999  (us)               3260 ± 75         3228 ± 29     -1.0%
> >> >     ctx-switches          184 k ± 59 k     3.68 M ± 65 k    +1903%
> >> >     cs / io                0.09 ± 0.03       1.86 ± 0.03    +1888%
> >> >     avg bios/run            80.4 ± 0.6         1.1 ± 0.0    -98.7%
> >> 
> >> So 1 jiffie delay is (with default HZ=1000) 1ms. That means for this load
> >> the completion latency should be at least 1000us but your results show p99
> >> latency of 36. What am I missing?
> > 
> > Yes, this looks a bit odd.  Unless there's multiple threads submitting
> > and somehow the completions get batched this should complete one
> > bio at a time and be the worst case for the delay scheme.
> 
> Sorry, I should've clarified - the latency here is the userspace-visible
> I/O completion latency (i.e. fio's clat value).
> 
> I ran again and traced to get the actual time from __bio_complete_in_task()
> to calling ->bi_end_io(). The results match the 1 jiffie delay now:
> 
>   metric                  delay=1  delay=0
> 
>   A seq 128k qd1
>     fio clat p99             38us     36us
>     bio cb p50             1.23ms    2.5us
>     bio cb p99             4.13ms   1.44ms
>     bio cb p999            5.01ms   2.63ms

So I'm clearly missing something fundamental as I don't see how the IO
completion time reported by fio can be lower than the end_io callback
latency... Ahh, it is the strange meaning of clat in fio in combination
with the sync engine, where clat means: "how long after the syscall has
returned the data is ready". For the sync engine that is immediately, so
the clat number is meaningless. I think reporting 'lat' numbers from fio
would make more sense but whatever.

The bio cb latency indeed looks like what I'd roughly expect now. And
notice how the median latency of IO completion is 1.23ms in the delay=1
case and your throughput isn't abysmal only because writes end up
accumulating in the page cache and writeback infrastructure ends up
submitting a lot of writeback IOs in parallel (you have ~80 bios to
complete per run which amortizes the latency to a decent level).

However, if you had IO using the BIO_COMPLETE_IN_TASK infrastructure that
doesn't have so many IOs in flight (like direct IO with lower queue depth
which has to do extent conversion on completion), you would very much see
the latency hit on your throughput as well. In the extreme case of qd=1
direct IO you'd reduce the throughput to ~4MB/s.
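
As a rough check of the ~4MB/s estimate: at queue depth 1 only one IO is
in flight, so throughput is bounded by the IO size divided by the per-IO
completion latency. The sketch below is only an illustration and makes
two assumptions that are not stated in the thread: 4 KiB IOs and device
latency of zero (so the deferral delay is the whole latency). The
function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Best-case qd=1 throughput in bytes per second when every completion
 * is deferred by `delay_jiffies` timer ticks.
 *
 * Assumption: device latency is ignored, so the real number can only
 * be lower than what this returns.
 */
static uint64_t qd1_max_bytes_per_sec(uint64_t block_bytes, unsigned int hz,
				      unsigned int delay_jiffies)
{
	assert(delay_jiffies > 0);
	/* One IO completes every delay_jiffies / hz seconds. */
	return block_bytes * hz / delay_jiffies;
}
```

With 4096-byte IOs, HZ=1000 and a one-jiffy delay this gives 4,096,000
bytes per second, which is in line with the ~4MB/s figure. With HZ=250
the same delay would cap throughput at roughly 1 MB/s.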

Now I'm not saying the delay is bad - it is a tradeoff with clear wins in
CPU overhead your benchmarks are showing. I just wanted to point out
there's also the cost side which your benchmarks don't show very clearly.
So we might need to keep some stats showing how many IO completions we are
offloading per second on each CPU and switch to delaying the work only once
it crosses a threshold like 1000000/HZ per second or so (so we at most
double the IO latency by delaying the end_io callback).

								Honza

>   B seq 128k qd128
>     fio clat p99           8.74ms   8.85ms
>     bio cb p50             1.27ms    3.1us
>     bio cb p99             4.05ms   2.27ms
>     bio cb p999            4.91ms   2.77ms
> 
>   C rand 4k qd32
>     fio clat p99           8.16ms   8.11ms
>     bio cb p50             1.09ms   97.7us
>     bio cb p99             3.73ms   2.06ms
>     bio cb p999           11.87ms   3.79ms
> 
>   D mixed 64k qd32
>     fio clat p99            981us   1.03ms
>     bio cb p50             1.14ms   39.5us
>     bio cb p99             2.83ms    275us
>     bio cb p999            3.06ms    595us
> 
>   E raw 128k qd128
>     fio clat p99          26.97ms  27.34ms
>     bio cb p50             1.58ms   41.5us
>     bio cb p99             2.98ms    325us
>     bio cb p999            3.02ms    575us
> 
>   F mem-pressure
>     fio clat p99          29.75ms  30.43ms
>     bio cb p50             1.32ms    2.5us
>     bio cb p99             3.73ms   2.48ms
>     bio cb p999            4.62ms   2.83ms
> 
> Note that in the above, the C degradation didn't reproduce as much. The
> bandwidth does go down from 64.5 MB/s with delay=1 to 54.9 MB/s with delay=0,
> but it's a much smaller drop. I ran it several more times and ran into the
> degradation ~20% of the time. The lack of batching means the completion
> kworker fires for nearly every bio, leading to heavier preemption when a
> writer is placed on a CPU that receives many completion IRQs. The degradation
> seems to occur when the writers are migrated less often, leading to more
> preemption. But I haven't dug into why the scheduler chooses to migrate more
> in some runs vs. others. However, when pinning to 16 cores, the difference
> between delay=0 and delay=1 goes away.
> 
> C specifically also seems to get worse because we're doing random writes to a
> sparse file, so each bio goes through the IOMAP_IOEND_UNWRITTEN path and the
> completion path is heavier, leading to more CPU stealing from the writing
> threads compared to the other workloads.
> 
> >> >   C rand 4k qd32
> >> >     BW (MB/s)               66.2 ± 0.8        44.6 ± 7.4    -32.7%
> >> >     p99   (us)              8002 ± 174      17990 ± 6800   +124.8%
> >> >     p999  (us)             11390 ± 554     31890 ± 11076   +180.0%
> >> >     ctx-switches         3.67 M ± 45 k    3.59 M ± 106 k     -2.2%
> >> >     cs / io                3.78 ± 0.04       5.62 ± 0.83    +48.7%
> >> >     avg bios/run            32.3 ± 1.0         3.1 ± 0.3    -90.5%
> >> 
> >> I'm somewhat surprised how much larger the completion latency is here without
> >> the delay. Is that due to a contention on local lock between the IO completion
> >> interrupt and the worker? Or why is the completion latency so big here when
> >> the case B with more IOs in flight, less bios per run, still had significantly
> >> lower latency in the delay=0 case?
> > 
> > Note that in the past we had major problems with workqueue scheduling
> > latency.  At some point these got mitigated a lot, but if they are back
> > for this workload that might be one reason.
> > 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 04/79] block: rust: fix generation of bindings to `BLK_STS_.*`
From: Andreas Hindborg @ 2026-06-01 11:16 UTC (permalink / raw)
  To: Alice Ryhl
  Cc: Boqun Feng, Jens Axboe, Miguel Ojeda, Gary Guo,
	Björn Roy Baron, Benno Lossin, Trevor Gross,
	Danilo Krummrich, FUJITA Tomonori, Frederic Weisbecker,
	Lyude Paul, Thomas Gleixner, Anna-Maria Behnsen, John Stultz,
	Stephen Boyd, Lorenzo Stoakes, Liam R. Howlett, linux-block,
	rust-for-linux, linux-kernel, linux-mm
In-Reply-To: <CAH5fLgi+K_5So-kLwyZGhaagEx-rRXPmk=zCOTqG3yh=bSe9Ww@mail.gmail.com>

Alice Ryhl <aliceryhl@google.com> writes:

> On Mon, Mar 16, 2026 at 10:27 AM Alice Ryhl <aliceryhl@google.com> wrote:
>>
>> On Mon, Feb 16, 2026 at 12:34:51AM +0100, Andreas Hindborg wrote:
>> > Bindgen generates constants for CPP integer literals as u32. The
>> > `blk_status_t` type is defined as `u8` but the variants of the type are
>> > defined as integer literals via CPP macros. Thus the defined variants of
>> > the type are not of the same type as the type itself.
>> >
>> > Prevent bindgen from emitting generated bindings for the `BLK_STS_.*`
>> > defines and instead define constants manually in `bindings_helper.h`
>> >
>> > Also remove casts that are no longer necessary.
>> >
>> > Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
>>
>> It'd be ideal to change the C header to use an enum, but that may not
>> work as I'm not sure you can specify the integer width you want for an
>> enum.
>>
>> Reviewed-by: Alice Ryhl <aliceryhl@google.com>
>
> Honestly, it might be better just to declare a Rust module somewhere
> with each constant redeclared:
>
> const BLK_STS_FOO: blk_status_t = bindings::BLK_STS_FOO as blk_status_t;

As it turns out, `bindgen` only emits a binding for `BLK_STS_OK`. It
cannot parse the rest, because they are declared as `((__force
blk_status_t)<N>)`. So in order to avoid special casing the `BLK_STS_OK`
constant, I think we should just keep the patch as is. I don't think we
will gain much from re-declaring all these in a local module.

I will update the comment on the `blocklist-item`:

+# Bindgen cannot extract values from the `((__force blk_status_t)N)`
+# CPP-macro form used by most of these and emits the few it can extract
+# as `u32`. Block them entirely; the `RUST_CONST_HELPER_BLK_STS_*`
+# definitions in `bindings_helper.h` expose them as `blk_status_t`.
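
To illustrate the type mismatch on the C side, here is a minimal
userspace sketch. It assumes that `__force` expands to nothing outside
of sparse runs and uses `uint8_t` as a stand-in for the kernel's `u8`.
`BLK_STS_EXAMPLE` and its value are hypothetical and only mimic the
cast form quoted above; the two helper functions are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: outside of sparse, __force is an empty macro. */
#define __force

/* Stand-in for the kernel typedef; u8 is modelled as uint8_t. */
typedef uint8_t blk_status_t;

/* Plain literal form: the expression has type int, not blk_status_t. */
#define BLK_STS_OK 0

/* Cast form (hypothetical name and value): already a blk_status_t. */
#define BLK_STS_EXAMPLE ((__force blk_status_t)1)

/* Size of the type the plain-literal macro evaluates to. */
static size_t size_of_ok(void)
{
	return sizeof(BLK_STS_OK);
}

/* Size of the type the cast-form macro evaluates to. */
static size_t size_of_example(void)
{
	return sizeof(BLK_STS_EXAMPLE);
}
```

The plain literal has the size of `int`, while the cast form has the
size of `blk_status_t` (one byte). A tool that only evaluates simple
literal macros therefore sees a wider integer for `BLK_STS_OK` and
nothing it can evaluate for the cast form.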


Best regards,
Andreas Hindborg



^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
From: Philipp Reisner @ 2026-06-01 12:26 UTC (permalink / raw)
  To: Haris Iqbal; +Cc: lsf-pc, linux-block, Jia Li
In-Reply-To: <CAJpMwyjcUc7n9g0YMbpBPZorUOiyseBOHHKoUhYDriEu5gzLEg@mail.gmail.com>

On Wed, May 27, 2026 at 2:16 PM Haris Iqbal <haris.iqbal@ionos.com> wrote:
>
> On Tue, May 5, 2026 at 11:20 AM Philipp Reisner
> <philipp.reisner@linbit.com> wrote:
> >
> > On Tue, Feb 03, 2026 at 04:09:59PM +0100, Haris Iqbal wrote:
> > > Hi Haris,
> > >
> > > We are working on a pair of kernel modules which would offer a new
> > > replication solution in the Linux kernel. It would be a block level,
> > > active-active replication solution for RDMA transport.
> > >
> > > The existing block level replication solution in the Linux kernel is
> > > DRBD, which is an active-passive solution. The data replication in
> > > DRBD happens through 2 network hops.
> > >
> > >
> > > An active-active solution which one can build is by exporting block
> > > devices, either through NVMeOF or RNBD/RTRS, over the network, and
> > > then creating a raid1 device over it. It would provide a single hop
> > > replication solution, but the synchronization during a degraded state
> > > goes through 2 hops.
> > >
> > > The proposed solution would provide an active-active single hop
> > > replication, and a single hop synchronization (directly between
> > > storage nodes) in case of a degraded state.
> > [...]
> >
> >
> > I stumbled across this post because of the newer replies.
> >
> > I want to point out that we have significantly developed DRBD over the
> > last 15 Years as an out-of-tree module. In the past months, we began
> > the process of getting all those improvements back into Linux
> > upstream.
> >
> > With that, DRBD9 became multi-node. It does the “active-active single
> > hop replication” as it is. The networking part is now abstracted into
> > transport modules. We have one for TCP, one for load balancing across
> > multiple TCP connections, and one for RDMA.
> >
> > What you are doing here, in DRBD lingo, is a diskless primary
> > connected to multiple storage nodes.
> >
> > Find everything here https://github.com/LINBIT.
> > The latest edition of what we bring to the upstreaming discussion:
> > https://github.com/LINBIT/linux-drbd/tree/drbd-next
>
> Hi Philipp,
>
> Interesting.
> I looked into the diskless primary mode configuration for DRBD, and it
> does look similar to what RMR/BRMR offers.
> We plan to do comparison runs of DRBD diskless primary mode, and RMR/BRMR.
>
> I see the DRBD version in the current kernel is still 8.x.x.
> Do you have an ETA by when can we have version 9 in the kernel?
>

Hi Haris,

We are working on it. We are currently aligning the way we do things
with Generic Netlink, so that our code follows the upstream conventions
and uses the YAML files and code generator used nowadays in this area.

Unless we discover another area where similar cleanups are necessary,
I expect that we will submit it in August.

Philipp

^ permalink raw reply

* Re: [PATCH] rbd: check snap_count against RBD_MAX_SNAP_COUNT
From: Jens Axboe @ 2026-06-01 14:23 UTC (permalink / raw)
  To: linux-block, Rosen Penev
  Cc: Ilya Dryomov, Dongsheng Yang, Nathan Chancellor, Nick Desaulniers,
	Bill Wendling, Justin Stitt, ceph-devel, linux-kernel, llvm
In-Reply-To: <20260530011255.52916-1-rosenp@gmail.com>


On Fri, 29 May 2026 18:12:55 -0700, Rosen Penev wrote:
> snap_count is u32 but the comparison is against a SIZE_MAX-derived value
> (~2^61 on 64-bit), which clang flags as always false with
> -Wtautological-constant-out-of-range-compare.
> 
> The proper check here should be that snap_count does not go over
> RBD_MAX_SNAP_COUNT.
> 
> [...]

Applied, thanks!

[1/1] rbd: check snap_count against RBD_MAX_SNAP_COUNT
      commit: 2e1b3f4c51ace14f67201bd2a92ca6312a3c3724

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH] MAINTAINERS: use new drbd-dev mailing list
From: Jens Axboe @ 2026-06-01 14:23 UTC (permalink / raw)
  To: Christoph Böhmwalder
  Cc: Philipp Reisner, Lars Ellenberg, drbd-dev, linux-block,
	linux-kernel
In-Reply-To: <20260513065557.36042-1-christoph.boehmwalder@linbit.com>


On Wed, 13 May 2026 08:55:57 +0200, Christoph Böhmwalder wrote:
> We are migrating from our own infrastructure to lists.linux.dev, so
> change the drbd-dev address to point to the new domain.

Applied, thanks!

[1/1] MAINTAINERS: use new drbd-dev mailing list
      commit: 9310b955c85ceb4700c7208baff2373a611a5070

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Christoph Hellwig @ 2026-06-01 14:40 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Damien Le Moal, Tetsuo Handa, Ming Lei,
	Jens Axboe, Bart Van Assche, linux-block, LKML, Andrew Morton,
	Linus Torvalds, linux-btrfs, David Sterba, linux-fsdevel,
	Christian Brauner, Brian Foster
In-Reply-To: <36571f8a-4df8-4152-b078-d82dbff4ad7e@suse.com>

On Thu, May 28, 2026 at 07:46:24PM +0930, Qu Wenruo wrote:
>> e.g. 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against
>> unmount")
> Considering the xfs fix is pretty old, it's before the fix hint thus no 
> such mention in fstests.
>
> Do you happen to know which test case is for that fix?
> I'd like to adapt it for btrfs as a reproducer.

No.  Adding Brian who authored that commit.


^ permalink raw reply

* Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Ming Lei @ 2026-06-01 15:29 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Damien Le Moal, Tetsuo Handa, Jens Axboe,
	Bart Van Assche, linux-block, LKML, Andrew Morton, Linus Torvalds,
	linux-btrfs, David Sterba, linux-fsdevel, Christian Brauner
In-Reply-To: <36571f8a-4df8-4152-b078-d82dbff4ad7e@suse.com>

On Thu, May 28, 2026 at 5:16 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2026/5/28 18:08, Christoph Hellwig wrote:
> > On Thu, May 28, 2026 at 03:11:05AM +0900, Damien Le Moal wrote:
> >> It sounds like the VFS unmount call needs to have something that waits for
> >> sync() to complete. Though, it really feels very strange that an FS can complete
> >
> > I don't think this is the VFS-controlled VFS file data writeback, which
> > we wait on, but some kind of fs controlled metadata.  And yes, it looks
> > like those file systems are buggy in that area.  We definitively had
> > such bugs in XFS before and fixed them.
> >
> > e.g. 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against
> > unmount")
> Considering the xfs fix is pretty old, it's before the fix hint thus no
> such mention in fstests.
>
> Do you happen to know which test case is for that fix?
> I'd like to adapt it for btrfs as a reproducer.
>
> This syzbot report doesn't provide a reproducer.
>
>
> Another thing is, if it's some btrfs bios on-the-fly after
> close_ctree(), the most common symptom should be NULL pointer
> dereference inside various btrfs endio functions.
> As all those end_bbio_*() functions are referring to either fs_info or
> inode/eb, thus if the fs is unmounted before the bio finished, they
> should all cause use-after-free.
>
> The only exception is discard, which is using blkdev_issue_discard()
> thus has no such reference to btrfs internal structure, but that's out
> of my understanding.

The syzbot log shows the null-ptr-deref is on WRITE, not DISCARD.

https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28

Adding WARN_ON(!lo->lo_backing_file) in loop_queue_rq() might capture
this bio submission context if this req isn't issued via wq.

Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH] block, bfq: release cgroup stats with bfq_group
From: Jan Kara @ 2026-06-01 16:13 UTC (permalink / raw)
  To: Yu Kuai; +Cc: axboe, linux-block, linux-kernel, jack
In-Reply-To: <20260601061502.899552-1-yukuai@fygo.io>

On Mon 01-06-26 14:15:02, Yu Kuai wrote:
> BFQ cgroup stats contain percpu counters embedded in struct bfq_group,
> but the old free path destroys them from bfq_pd_free(), which is tied
> to blkg policy-data teardown.
> 
> That is not the same lifetime as struct bfq_group. BFQ pins bfq_group
> while bfq_queue entities refer to it, so bfq_pd_free() can drop the
> policy-data reference while other bfq_group references still exist. The
> following blkcg change also defers policy-data release through RCU and
> leaves BFQ to run the final bfqg_put() from an RCU callback. For that
> conversion, stats teardown must belong to the last bfq_group put, not to
> policy-data teardown.
> 
> Move stats teardown to bfqg_put() so the embedded counters are destroyed
> exactly when the last bfq_group reference is released, before kfree(bfqg).
> 
> Without this preparatory change, the RCU-delayed policy-data free
> conversion reproduced the following KASAN report:
> 
>   BUG: KASAN: slab-use-after-free in percpu_counter_destroy_many+0xf1/0x2e0
>   Write of size 8 at addr ffff88811d9409e0 by task test_blkcg/535
> 
>   CPU: 0 UID: 0 PID: 535 Comm: test_blkcg Not tainted 7.1.0-rc2-g1e14adca0199 #1 PREEMPT  ea13f83d4b74a12510d20db4a7d9a0fe8275f05c
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
>   Call Trace:
>    <TASK>
>    dump_stack_lvl+0x54/0x70
>    print_address_description+0x77/0x200
>    ? percpu_counter_destroy_many+0xf1/0x2e0
>    print_report+0x64/0x70
>    kasan_report+0x118/0x150
>    ? percpu_counter_destroy_many+0xf1/0x2e0
>    percpu_counter_destroy_many+0xf1/0x2e0
>    __mmdrop+0x1d8/0x350
>    finish_task_switch+0x3f5/0x570
>    __schedule+0xe8e/0x18a0
>    schedule+0xfe/0x1c0
>    schedule_timeout+0x7f/0x1d0
>    __wait_for_common+0x26c/0x3f0
>    wait_for_completion_state+0x21/0x40
>    call_usermodehelper_exec+0x271/0x2c0
>    __request_module+0x296/0x410
>    elv_iosched_store+0x1bc/0x2c0
>    queue_attr_store+0x152/0x1c0
>    kernfs_fop_write_iter+0x1d7/0x280
>    vfs_write+0x580/0x630
>    ksys_write+0xec/0x190
>    do_syscall_64+0x156/0x490
>    entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
>   Allocated by task 535:
>    kasan_save_track+0x3e/0x80
>    __kasan_kmalloc+0x72/0x90
>    bfq_pd_alloc+0x60/0x100 [bfq]
>    blkg_create+0x3bb/0xbe0
>    blkg_lookup_create+0x3a2/0x460
>    blkg_conf_start+0x24a/0x2d0
>    bfq_io_set_weight+0x17f/0x430 [bfq]
>    cgroup_file_write+0x1c5/0x4b0
>    kernfs_fop_write_iter+0x1d7/0x280
>    vfs_write+0x580/0x630
>    ksys_write+0xec/0x190
>    do_syscall_64+0x156/0x490
>    entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
>   Freed by task 0:
>    kasan_save_track+0x3e/0x80
>    kasan_save_free_info+0x46/0x50
>    __kasan_slab_free+0x3a/0x60
>    kfree+0x14e/0x4f0
>    rcu_core+0x6f3/0xcd0
>    handle_softirqs+0x1a0/0x550
>    __irq_exit_rcu+0x8c/0x150
>    irq_exit_rcu+0xe/0x20
>    sysvec_apic_timer_interrupt+0x6e/0x80
>    asm_sysvec_apic_timer_interrupt+0x1a/0x20
> 
>   Last potentially related work creation:
>    kasan_save_stack+0x3e/0x60
>    kasan_record_aux_stack+0x99/0xb0
>    call_rcu+0x55/0x5c0
>    blkg_free_workfn+0x130/0x220
>    process_scheduled_works+0x655/0xb60
>    worker_thread+0x446/0x600
>    kthread+0x1f4/0x230
>    ret_from_fork+0x259/0x420
>    ret_from_fork_asm+0x1a/0x30
> 
> Signed-off-by: Yu Kuai <yukuai@fygo.io>

Makes sense. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
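
The lifetime rule in the commit message can be sketched in plain
userspace C: resources embedded in a refcounted object are torn down by
whichever put drops the last reference, not by one particular holder.
Everything below is a simplified, hypothetical model and not the BFQ
code: the struct and function names are made up, the refcount is a
plain int rather than refcount_t, and a heap array stands in for the
percpu counters.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the embedded percpu counters. */
struct stats {
	int *counters;
};

struct group {
	int ref;		/* stand-in for refcount_t, not atomic */
	struct stats stats;	/* embedded: shares the group's lifetime */
};

/* Counts teardown calls so the ordering can be observed. */
static int stats_exit_calls;

static void stats_exit(struct stats *s)
{
	free(s->counters);
	s->counters = NULL;
	stats_exit_calls++;
}

/* Allocate a group holding one initial reference. */
static struct group *group_alloc(void)
{
	struct group *g = calloc(1, sizeof(*g));

	if (!g)
		return NULL;
	g->stats.counters = calloc(4, sizeof(*g->stats.counters));
	if (!g->stats.counters) {
		free(g);
		return NULL;
	}
	g->ref = 1;
	return g;
}

static void group_get(struct group *g)
{
	g->ref++;
}

/* Drop one reference; returns 1 if this put freed the group. */
static int group_put(struct group *g)
{
	if (--g->ref)
		return 0;
	/* Last reference: destroy embedded stats, then the group. */
	stats_exit(&g->stats);
	free(g);
	return 1;
}
```

In this model the policy-data owner and a queue each hold a reference.
Whichever drops its reference first leaves the stats intact, and the
stats are destroyed exactly once, immediately before the memory is
freed.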

> ---
>  block/bfq-cgroup.c | 43 ++++++++++++++++++++++---------------------
>  1 file changed, 22 insertions(+), 21 deletions(-)
> 
> diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
> index ac83b0668764..37ab70930c8d 100644
> --- a/block/bfq-cgroup.c
> +++ b/block/bfq-cgroup.c
> @@ -300,6 +300,25 @@ static struct bfq_group *bfqg_parent(struct bfq_group *bfqg)
>  	return pblkg ? blkg_to_bfqg(pblkg) : NULL;
>  }
>  
> +static void bfqg_stats_exit(struct bfqg_stats *stats)
> +{
> +	blkg_rwstat_exit(&stats->bytes);
> +	blkg_rwstat_exit(&stats->ios);
> +#ifdef CONFIG_BFQ_CGROUP_DEBUG
> +	blkg_rwstat_exit(&stats->merged);
> +	blkg_rwstat_exit(&stats->service_time);
> +	blkg_rwstat_exit(&stats->wait_time);
> +	blkg_rwstat_exit(&stats->queued);
> +	bfq_stat_exit(&stats->time);
> +	bfq_stat_exit(&stats->avg_queue_size_sum);
> +	bfq_stat_exit(&stats->avg_queue_size_samples);
> +	bfq_stat_exit(&stats->dequeue);
> +	bfq_stat_exit(&stats->group_wait_time);
> +	bfq_stat_exit(&stats->idle_time);
> +	bfq_stat_exit(&stats->empty_time);
> +#endif
> +}
> +
>  struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
>  {
>  	struct bfq_entity *group_entity = bfqq->entity.parent;
> @@ -321,8 +340,10 @@ static void bfqg_get(struct bfq_group *bfqg)
>  
>  static void bfqg_put(struct bfq_group *bfqg)
>  {
> -	if (refcount_dec_and_test(&bfqg->ref))
> +	if (refcount_dec_and_test(&bfqg->ref)) {
> +		bfqg_stats_exit(&bfqg->stats);
>  		kfree(bfqg);
> +	}
>  }
>  
>  static void bfqg_and_blkg_get(struct bfq_group *bfqg)
> @@ -433,25 +454,6 @@ void bfq_init_entity(struct bfq_entity *entity, struct bfq_group *bfqg)
>  	entity->sched_data = &bfqg->sched_data;
>  }
>  
> -static void bfqg_stats_exit(struct bfqg_stats *stats)
> -{
> -	blkg_rwstat_exit(&stats->bytes);
> -	blkg_rwstat_exit(&stats->ios);
> -#ifdef CONFIG_BFQ_CGROUP_DEBUG
> -	blkg_rwstat_exit(&stats->merged);
> -	blkg_rwstat_exit(&stats->service_time);
> -	blkg_rwstat_exit(&stats->wait_time);
> -	blkg_rwstat_exit(&stats->queued);
> -	bfq_stat_exit(&stats->time);
> -	bfq_stat_exit(&stats->avg_queue_size_sum);
> -	bfq_stat_exit(&stats->avg_queue_size_samples);
> -	bfq_stat_exit(&stats->dequeue);
> -	bfq_stat_exit(&stats->group_wait_time);
> -	bfq_stat_exit(&stats->idle_time);
> -	bfq_stat_exit(&stats->empty_time);
> -#endif
> -}
> -
>  static int bfqg_stats_init(struct bfqg_stats *stats, gfp_t gfp)
>  {
>  	if (blkg_rwstat_init(&stats->bytes, gfp) ||
> @@ -552,7 +554,6 @@ static void bfq_pd_free(struct blkg_policy_data *pd)
>  {
>  	struct bfq_group *bfqg = pd_to_bfqg(pd);
>  
> -	bfqg_stats_exit(&bfqg->stats);
>  	bfqg_put(bfqg);
>  }
>  
> -- 
> 2.51.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply
