Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* [PATCH] blk-mq: reinsert cached request to the list
From: Keith Busch @ 2026-05-25 16:07 UTC (permalink / raw)
  To: linux-block, axboe; +Cc: Keith Busch, Ming Lei, Christoph Hellwig

From: Keith Busch <kbusch@kernel.org>

A previous commit removed an optimization out of caution for a scenario
that turns out not to be real: all the "queue_exit" goto's are safe to
reinsert the request into the cached_rq's plug list as they are either
from a non-blocking path, or a successful merge that already holds the
queue reference. This optimization is most needed for small sequential
workloads that successfully merge into larger requests.

Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable")
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 28c2d931e75ea..72c8ac805882c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3246,7 +3246,7 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (!rq)
 		blk_queue_exit(q);
 	else
-		blk_mq_free_request(rq);
+		rq_list_push(&plug->cached_rqs, rq);
 }

 #ifdef CONFIG_BLK_MQ_STACKING
-- 
2.53.0-Meta

^ permalink raw reply related

* Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Ming Lei @ 2026-05-25 15:19 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jens Axboe, Bart Van Assche, Christoph Hellwig, Damien Le Moal,
	linux-block, LKML, Andrew Morton
In-Reply-To: <fda8abc8-6aa2-463b-bf72-865f6b838034@I-love.SAKURA.ne.jp>

On Mon, May 25, 2026 at 12:40:19PM +0900, Tetsuo Handa wrote:
> Some commit which was merged in the merge window for 7.1 broke the loop
> driver; a race window where lo_release() clears the backing file via
> __loop_clr_fd() despite some I/O requests are pending was introduced [1][2].
> 
> The exact commit which changed the behavior is not known due to lack of
> reproducer and timing dependent behavior, but it seems that we need to
> solve this problem in the loop driver despite there was no change for the
> loop driver during this merge window.
> 
> To close this race, try to flush pending I/O requests. However, calling
> drain_workqueue() from __loop_clr_fd() with disk->open_mutex held causes
> lockdep warnings [3][4]. We need to flush pending I/O requests without
> disk->open_mutex held.

No, please don't workaround before root cause.

No proof shows that the issue is in block layer or loop driver, the IO isn't
expected, you need to figure out why btrfs still issues IO after this loop
disk is closed by everyone and writeback is done.

https://syzkaller.appspot.com/x/log.txt?x=101e4702580000


Thanks,
Ming

^ permalink raw reply

* Re: [PATCH v6] block: propagate in_flight to whole disk on partition I/O
From: Yizhou Tang @ 2026-05-25 14:31 UTC (permalink / raw)
  To: Tang Yizhou
  Cc: axboe, hch, kbusch, yukuai, linux-block, linux-kernel, Leon Hwang
In-Reply-To: <20260525021917.273190-1-yizhou.tang@shopee.com>

Sorry, I just let Claude Code review this patch with Opus 4.7 high
model, and it pointed out that blk_account_io_merge_request() was
missed. I think that's right. I’ll resend an updated patch tomorrow
and add the corresponding Assisted-by tag.

Best regards,
Yi

^ permalink raw reply

* Re: [PATCH v2] cgroup/rstat: validate cpu before css_rstat_cpu() access
From: patchwork-bot+netdevbpf @ 2026-05-25 13:53 UTC (permalink / raw)
  To: Qing Ming
  Cc: tj, josef, axboe, hannes, mkoutny, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, akpm, ast, haoluo, yosry, cgroups,
	linux-block, linux-kernel, linux-mm, bpf
In-Reply-To: <20260516070849.106141-1-a0yami@mailbox.org>

Hello:

This patch was applied to bpf/bpf.git (master)
by Tejun Heo <tj@kernel.org>:

On Sat, 16 May 2026 15:08:49 +0800 you wrote:
> css_rstat_updated() is exposed as a BPF kfunc and accepts a
> caller-provided cpu argument. The function uses cpu for per-cpu rstat
> lookups without checking whether it refers to a valid possible CPU.
> 
> A BPF iter/cgroup program with CAP_BPF and CAP_PERFMON can pass an
> invalid cpu value. On an unfixed UBSCAN_BOUNDS test kernel, cpu ==
> 0x7fffffff triggers:
> 
> [...]

Here is the summary with links:
  - [v2] cgroup/rstat: validate cpu before css_rstat_cpu() access
    https://git.kernel.org/bpf/bpf/c/8817005efbdf

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH] block: Add bvec_folio()
From: Matthew Wilcox @ 2026-05-25 13:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, linux-kernel, io-uring, linux-mm,
	Leon Romanovsky
In-Reply-To: <ahPm4h2gKgyEEuvV@infradead.org>

On Sun, May 24, 2026 at 11:06:26PM -0700, Christoph Hellwig wrote:
> > +/**
> > + * bvec_folio - Return the first folio referenced by this bvec
> > + * @bv: bvec to access
> > + *
> > + * bvecs can span multiple folios.  Unless you know that this
> > + * bvec does not, you may be better off using something like
> > + * bio_for_each_folio_all() which iterates over all folios.
> > + */
> > +static inline struct folio *bvec_folio(const struct bio_vec *bv)
> > +{
> > +	return page_folio(bv->bv_page);
> > +}
> 
> The comment here is confusing.  bio_for_each_folio_all is a helper that
> only works in the submitter side, and not for anything using the
> bvec_iter required for drivers or anything else sitting below a
> potential bio clone/split or using bvecs from an upper layer (like
> ITER_BVEC direct I/O).  Additionally bv_page can be a different
> page than the fist page due to large bv_offset on split bios.
> 
> So I'm not against the function per se, but the documentation must
> explain the minefields it is stepping into a bit better.

Lower level drivers shouldn't be concerning themselves with folios.
For a start, we can put non-folios (eg slab memory) into bvecs.
I'm happy to clarify this comment further, but I don't understand
who's going to look at this function and need to have more explanation.

^ permalink raw reply

* Re: blktests failures with v7.1-rc1 kernel
From: Nilay Shroff @ 2026-05-25 12:44 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org, nbd,
	linux-rdma
In-Reply-To: <afB5syZbUrppgsDQ@shinmob>

hi Shinichiro,

On 4/28/26 2:43 PM, Shin'ichiro Kawasaki wrote:
> Hi all,
> 
> I ran the latest blktests (git hash: ea5472c1adc8) with the v7.1-rc1 kernel. I
> observed 8 failures listed below. Comparing with the previous report for the
> v7.0 kernel [1], 2 failures are new (nvme/045, scsi/002). Your actions for fix
> will be welcomed as always.
> 
> [1]https://lore.kernel.org/linux-block/aeCDXI5hY_ivSWm4@shinmob/
> 
> 
> List of failures
> ================
> #1: nvme/005,063 (tcp transport)
> #2: nvme/045 (new)(kmemleak)
> #3: nvme/058 (fc transport)(hang)(kmemleak)
> #4: nvme/060
> #5: nvme/061 (rdma transport, siw driver)(kmemleak)
> #6: nvme/061 (fc transport)
> #7: nbd/002
> #8: scsi/002 (new)
> 
> 
> Failure description
> ===================
> 
> #1: nvme/005,063 (tcp transport)
> 
>      The test cases nvme/005 and 063 fail for tcp transport due to the lockdep
>      WARN related to the three locks q->q_usage_counter, q->elevator_lock and
>      set->srcu. The failure was reported first time for nvme/063 and v6.16-rc1
>      kernel [2].
> 
>      Chaitanya provided a fix patch (thanks!), and it is queued for v7.1-rcX tags
>      [3]. However, nvme/005 and 063 still fail even when I apply the fix patch to
>      v7.1-rc1 kernel. The call traces of the lockdep WARN are different between
>      "v7.1-rc1" kernel [4] and "v7.1-rc1+the fix patch" kernel [5]. I guess that
>      there exist two lockdep problems with similar symptoms and patch [3] fixed
>      one of them. I guess that still one problem is left.
> 
>      [2]https://lore.kernel.org/linux-block/4fdm37so3o4xricdgfosgmohn63aa7wj3ua4e5vpihoamwg3ui@fq42f5q5t5ic/
>      [3]https://lore.kernel.org/all/20260413171628.6204-1-kch@nvidia.com/


I looked into this lockdep warning, and it seems that Chaitanya's patch indeed fixes the
original issue reported in [4]. However, the new warning reported in [5] appears to be a
separate lockdep splat and, from what I can tell, likely a false positive. There are two
reasons why I think so:

1. The lockdep report suggests that thread #1 is sending data over a TCP socket while
    another thread #2 is still in the process of establishing that same socket connection.
    In practice, this should not be possible because request dispatch over the socket can
    only happen after the connection setup has completed successfully.

2. The warning also suggests that while thread #0 is deleting the gendisk and unregistering
    the corresponding request queue, another thread #5 is concurrently attempting to change
    the queue elevator. However, once gendisk deletion starts, elevator switching is already
    inhibited for that queue (see disable_elv_switch()), so the reported locking scenario
    should not be reachable in practice.

Based on the above, I suspect this is a lockdep false positive caused by dependency tracking
across different queue/socket lifecycle phases. We may need to suppress lock dependency tracking
in some of these paths to avoid the false warning.

Thanks,
--Nilay




^ permalink raw reply

* Re: [PATCH 4/4] dm crypt: batch all sectors of a bio per crypto request
From: Mikulas Patocka @ 2026-05-25 12:02 UTC (permalink / raw)
  To: Leonid Ravich
  Cc: Herbert Xu, David S . Miller, Mike Snitzer, Alasdair Kergon,
	Ard Biesheuvel, Eric Biggers, Jens Axboe, Horia Geanta,
	Gilad Ben-Yossef, linux-crypto, dm-devel, linux-block
In-Reply-To: <20260519120002.27267-5-lravich@amazon.com>

[-- Attachment #1: Type: text/plain, Size: 15091 bytes --]

Hi


On Tue, 19 May 2026, Leonid Ravich wrote:

> When the underlying skcipher driver advertises support for multiple
> data units in a single request (CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT),
> configure the cipher with cc->sector_size as data_unit_size and
> submit one request per bio instead of one request per sector.  This
> removes per-sector overhead in the crypto API hot path: request
> allocation, callback dispatch, completion handling, and SG setup.
> 
> The optimisation is enabled automatically at table load when all
> of the following hold:
> 
>  - the cipher is non-aead (i.e. skcipher);
>  - tfms_count is 1 (interleaved per-sector keys would break batching);
>  - the IV mode is plain or plain64 (the only modes whose generator
>    produces a sequential 64-bit little-endian counter that the cipher
>    can extend by adding the data-unit index, matching the convention
>    documented in crypto_skcipher_set_data_unit_size());
>  - the iv_gen_ops->post() hook is unset (lmk and tcw use it; both are
>    already excluded by the IV-mode test, but the explicit check makes
>    the assumption durable against future IV modes);
>  - dm-integrity is not stacked (no integrity tag or integrity IV);
>  - the cipher driver advertises multi-data-unit support.
> 
> A new CRYPT_MULTI_DATA_UNIT cipher_flag, set once at construction
> time, gates the multi-data-unit path.  The existing per-sector path
> in crypt_convert_block_skcipher() is unchanged; the new
> crypt_convert_block_skcipher_multi() is reached from a small dispatch
> in crypt_convert() and shares the same backlog/-EBUSY/-EINPROGRESS
> flow control with the per-sector path.
> 
> Heap-allocated scatterlists are stashed in dm_crypt_request and freed
> in crypt_free_req_skcipher() to avoid races between the synchronous-
> success free path and async-completion reuse from the request pool.
> 
> On -ENOMEM during scatterlist allocation, the bio is requeued via
> BLK_STS_DEV_RESOURCE rather than failed, matching the behaviour of
> the existing -ENOMEM path for crypto request allocation.

You should make sure that you do not attempt to use the multi-data-unit 
mode when you retry the bio, otherwise it could loop indefinitely. Note 
that there are people who swap to dm-crypt - and so, it must work even if 
the memory is totally exhausted.

You should also use GFP_NOIO | __GFP_NORETRY instead of GFP_NOIO, so that 
the code doesn't loop in the allocator forever.


Perhaps __bio_for_each_bvec would be better than __bio_for_each_segment, 
so that it works faster with folios.

Mikulas

> Verified end-to-end with a byte-equivalence test: encrypted output of
> plain64 dm-crypt with the multi-data-unit path matches output of the
> single-data-unit path bit-for-bit over a 256 MB device.
> 
> Signed-off-by: Leonid Ravich <lravich@amazon.com>
> ---
>  drivers/md/dm-crypt.c | 248 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 241 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 5ef43231fe77..b35831d43f0e 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -98,6 +98,14 @@ struct dm_crypt_request {
>  	struct scatterlist sg_in[4];
>  	struct scatterlist sg_out[4];
>  	u64 iv_sector;
> +	/*
> +	 * Heap-allocated scatterlists used by the multi-data-unit path
> +	 * when one bio is processed in a single skcipher request.  NULL
> +	 * when the inline sg_in[]/sg_out[] arrays above are sufficient
> +	 * (single-data-unit path).  Freed in crypt_free_req_skcipher().
> +	 */
> +	struct scatterlist *sg_in_ext;
> +	struct scatterlist *sg_out_ext;
>  };
>  
>  struct crypt_config;
> @@ -149,6 +157,7 @@ enum cipher_flags {
>  	CRYPT_IV_LARGE_SECTORS,		/* Calculate IV from sector_size, not 512B sectors */
>  	CRYPT_ENCRYPT_PREPROCESS,	/* Must preprocess data for encryption (elephant) */
>  	CRYPT_KEY_MAC_SIZE_SET,		/* The integrity_key_size option was used */
> +	CRYPT_MULTI_DATA_UNIT,		/* Batch all sectors of a bio per crypto request */
>  };
>  
>  /*
> @@ -1501,12 +1510,139 @@ static int crypt_convert_block_skcipher(struct crypt_config *cc,
>  	return r;
>  }
>  
> +/*
> + * Multi-data-unit variant of crypt_convert_block_skcipher.  Submits all
> + * remaining sectors of the current bio in one skcipher request whose
> + * data_unit_size is cc->sector_size.  The cipher walks the IV between
> + * data units (see crypto_skcipher_set_data_unit_size()).
> + *
> + * Returns the same set of values as crypt_convert_block_skcipher:
> + *   0 on synchronous success (full chunk processed),
> + *   -EINPROGRESS / -EBUSY on asynchronous dispatch,
> + *   -ENOMEM if scatterlist allocation fails (caller maps to
> + *           BLK_STS_DEV_RESOURCE so the bio is requeued, not failed),
> + *   negative errno otherwise.
> + *
> + * On success the bio iterators have been advanced by the chunk size.
> + */
> +static int crypt_convert_block_skcipher_multi(struct crypt_config *cc,
> +					      struct convert_context *ctx,
> +					      struct skcipher_request *req,
> +					      unsigned int *out_processed)
> +{
> +	const unsigned int sector_size = cc->sector_size;
> +	unsigned int total_in = ctx->iter_in.bi_size;
> +	unsigned int total_out = ctx->iter_out.bi_size;
> +	unsigned int total = min(total_in, total_out);
> +	unsigned int n_sectors;
> +	unsigned int n_sg_in = 0, n_sg_out = 0;
> +	struct dm_crypt_request *dmreq = dmreq_of_req(cc, req);
> +	struct scatterlist *sg_in = NULL, *sg_out = NULL;
> +	struct bvec_iter iter_in, iter_out;
> +	struct bio_vec bv;
> +	u8 *iv, *org_iv;
> +	int r;
> +
> +	if (unlikely(total < sector_size))
> +		return -EIO;
> +	n_sectors = total / sector_size;
> +	total = n_sectors * sector_size;
> +
> +	/*
> +	 * Walk the bio_vec iterators to count how many SG entries we need
> +	 * for exactly @total bytes.  bi_size of the iterators is at least
> +	 * @total by construction above.
> +	 */
> +	iter_in = ctx->iter_in;
> +	iter_in.bi_size = total;
> +	__bio_for_each_segment(bv, ctx->bio_in, iter_in, iter_in)
> +		n_sg_in++;
> +
> +	iter_out = ctx->iter_out;
> +	iter_out.bi_size = total;
> +	__bio_for_each_segment(bv, ctx->bio_out, iter_out, iter_out)
> +		n_sg_out++;
> +
> +	sg_in = kmalloc_array(n_sg_in, sizeof(*sg_in), GFP_NOIO);
> +	sg_out = (ctx->bio_in == ctx->bio_out) ? sg_in :
> +		 kmalloc_array(n_sg_out, sizeof(*sg_out), GFP_NOIO);
> +	if (!sg_in || !sg_out) {
> +		kfree(sg_in);
> +		if (sg_out != sg_in)
> +			kfree(sg_out);
> +		return -ENOMEM;
> +	}
> +
> +	sg_init_table(sg_in, n_sg_in);
> +	{
> +		unsigned int i = 0;
> +
> +		iter_in = ctx->iter_in;
> +		iter_in.bi_size = total;
> +		__bio_for_each_segment(bv, ctx->bio_in, iter_in, iter_in)
> +			sg_set_page(&sg_in[i++], bv.bv_page, bv.bv_len,
> +				    bv.bv_offset);
> +	}
> +
> +	if (sg_out != sg_in) {
> +		unsigned int i = 0;
> +
> +		sg_init_table(sg_out, n_sg_out);
> +		iter_out = ctx->iter_out;
> +		iter_out.bi_size = total;
> +		__bio_for_each_segment(bv, ctx->bio_out, iter_out, iter_out)
> +			sg_set_page(&sg_out[i++], bv.bv_page, bv.bv_len,
> +				    bv.bv_offset);
> +	}
> +
> +	/*
> +	 * Compute the IV for the first data unit.  The cipher will derive
> +	 * IVs for subsequent data units by treating this one as a 128-bit
> +	 * little-endian counter and adding the data-unit index, which
> +	 * matches the layout produced by plain and plain64.
> +	 */
> +	dmreq->iv_sector = ctx->cc_sector;
> +	if (test_bit(CRYPT_IV_LARGE_SECTORS, &cc->cipher_flags))
> +		dmreq->iv_sector >>= cc->sector_shift;
> +	dmreq->ctx = ctx;
> +
> +	iv = iv_of_dmreq(cc, dmreq);
> +	org_iv = org_iv_of_dmreq(cc, dmreq);
> +	r = cc->iv_gen_ops->generator(cc, org_iv, dmreq);
> +	if (r < 0)
> +		goto out_free_sg;
> +	memcpy(iv, org_iv, cc->iv_size);
> +
> +	/* Stash the SG arrays for cleanup on completion / free. */
> +	dmreq->sg_in_ext = sg_in;
> +	dmreq->sg_out_ext = (sg_out == sg_in) ? NULL : sg_out;
> +
> +	skcipher_request_set_crypt(req, sg_in, sg_out, total, iv);
> +
> +	if (bio_data_dir(ctx->bio_in) == WRITE)
> +		r = crypto_skcipher_encrypt(req);
> +	else
> +		r = crypto_skcipher_decrypt(req);
> +
> +	*out_processed = total;
> +	return r;
> +
> +out_free_sg:
> +	kfree(sg_in);
> +	if (sg_out != sg_in)
> +		kfree(sg_out);
> +	dmreq->sg_in_ext = NULL;
> +	dmreq->sg_out_ext = NULL;
> +	return r;
> +}
> +
>  static void kcryptd_async_done(void *async_req, int error);
>  
>  static int crypt_alloc_req_skcipher(struct crypt_config *cc,
>  				     struct convert_context *ctx)
>  {
>  	unsigned int key_index = ctx->cc_sector & (cc->tfms_count - 1);
> +	struct dm_crypt_request *dmreq;
>  
>  	if (!ctx->r.req) {
>  		ctx->r.req = mempool_alloc(&cc->req_pool, in_interrupt() ? GFP_ATOMIC : GFP_NOIO);
> @@ -1516,6 +1652,18 @@ static int crypt_alloc_req_skcipher(struct crypt_config *cc,
>  
>  	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
>  
> +	/*
> +	 * Initialise the heap-allocated scatterlist pointers so that
> +	 * crypt_free_req_skcipher() does not read uninitialised memory
> +	 * for paths that don't take the multi-data-unit branch.  The
> +	 * dmreq trailer lives in the per-bio data area which is not
> +	 * zeroed by the dm core, and the request is reused from the
> +	 * mempool across many bios.
> +	 */
> +	dmreq = dmreq_of_req(cc, ctx->r.req);
> +	dmreq->sg_in_ext = NULL;
> +	dmreq->sg_out_ext = NULL;
> +
>  	/*
>  	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
>  	 * requests if driver request queue is full.
> @@ -1562,6 +1710,12 @@ static void crypt_free_req_skcipher(struct crypt_config *cc,
>  				    struct skcipher_request *req, struct bio *base_bio)
>  {
>  	struct dm_crypt_io *io = dm_per_bio_data(base_bio, cc->per_bio_data_size);
> +	struct dm_crypt_request *dmreq = dmreq_of_req(cc, req);
> +
> +	kfree(dmreq->sg_in_ext);
> +	dmreq->sg_in_ext = NULL;
> +	kfree(dmreq->sg_out_ext);
> +	dmreq->sg_out_ext = NULL;
>  
>  	if ((struct skcipher_request *)(io + 1) != req)
>  		mempool_free(req, &cc->req_pool);
> @@ -1590,7 +1744,9 @@ static void crypt_free_req(struct crypt_config *cc, void *req, struct bio *base_
>  static blk_status_t crypt_convert(struct crypt_config *cc,
>  			 struct convert_context *ctx, bool atomic, bool reset_pending)
>  {
> -	unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT;
> +	const unsigned int sector_step = cc->sector_size >> SECTOR_SHIFT;
> +	const bool multi_du = test_bit(CRYPT_MULTI_DATA_UNIT, &cc->cipher_flags);
> +	unsigned int processed;
>  	int r;
>  
>  	/*
> @@ -1611,8 +1767,13 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
>  
>  		atomic_inc(&ctx->cc_pending);
>  
> +		processed = cc->sector_size;
>  		if (crypt_integrity_aead(cc))
>  			r = crypt_convert_block_aead(cc, ctx, ctx->r.req_aead, ctx->tag_offset);
> +		else if (multi_du)
> +			r = crypt_convert_block_skcipher_multi(cc, ctx,
> +							       ctx->r.req,
> +							       &processed);
>  		else
>  			r = crypt_convert_block_skcipher(cc, ctx, ctx->r.req, ctx->tag_offset);
>  
> @@ -1634,8 +1795,19 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
>  					 * exit and continue processing in a workqueue
>  					 */
>  					ctx->r.req = NULL;
> -					ctx->tag_offset++;
> -					ctx->cc_sector += sector_step;
> +					if (!multi_du) {
> +						ctx->tag_offset++;
> +						ctx->cc_sector += sector_step;
> +					} else {
> +						bio_advance_iter(ctx->bio_in,
> +								 &ctx->iter_in,
> +								 processed);
> +						bio_advance_iter(ctx->bio_out,
> +								 &ctx->iter_out,
> +								 processed);
> +						ctx->cc_sector +=
> +							processed >> SECTOR_SHIFT;
> +					}
>  					return BLK_STS_DEV_RESOURCE;
>  				}
>  			} else {
> @@ -1649,19 +1821,42 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
>  		 */
>  		case -EINPROGRESS:
>  			ctx->r.req = NULL;
> -			ctx->tag_offset++;
> -			ctx->cc_sector += sector_step;
> +			if (!multi_du) {
> +				ctx->tag_offset++;
> +				ctx->cc_sector += sector_step;
> +			} else {
> +				bio_advance_iter(ctx->bio_in, &ctx->iter_in,
> +						 processed);
> +				bio_advance_iter(ctx->bio_out, &ctx->iter_out,
> +						 processed);
> +				ctx->cc_sector += processed >> SECTOR_SHIFT;
> +			}
>  			continue;
>  		/*
>  		 * The request was already processed (synchronously).
>  		 */
>  		case 0:
>  			atomic_dec(&ctx->cc_pending);
> -			ctx->cc_sector += sector_step;
> -			ctx->tag_offset++;
> +			if (!multi_du) {
> +				ctx->cc_sector += sector_step;
> +				ctx->tag_offset++;
> +			} else {
> +				bio_advance_iter(ctx->bio_in, &ctx->iter_in,
> +						 processed);
> +				bio_advance_iter(ctx->bio_out, &ctx->iter_out,
> +						 processed);
> +				ctx->cc_sector += processed >> SECTOR_SHIFT;
> +			}
>  			if (!atomic)
>  				cond_resched();
>  			continue;
> +		/*
> +		 * Out of memory for the multi-DU SG arrays — bounce back
> +		 * to the caller for requeue rather than failing the bio.
> +		 */
> +		case -ENOMEM:
> +			atomic_dec(&ctx->cc_pending);
> +			return BLK_STS_DEV_RESOURCE;
>  		/*
>  		 * There was a data integrity error.
>  		 */
> @@ -3142,6 +3337,45 @@ static int crypt_ctr_cipher(struct dm_target *ti, char *cipher_in, char *key)
>  		}
>  	}
>  
> +	/*
> +	 * Enable multi-data-unit batching when the cipher supports it and
> +	 * the IV layout is one we can derive per-DU from a single starting
> +	 * IV: plain or plain64 produce a sequential 64-bit little-endian
> +	 * counter, which matches the convention of
> +	 * crypto_skcipher_set_data_unit_size().  Restrict to the simple
> +	 * case (single tfm, no integrity, no per-sector post() callback)
> +	 * to keep the consumer path small; modes like essiv, lmk, tcw,
> +	 * eboiv, plain64be, random, null, benbi, and elephant are
> +	 * deliberately excluded because their generators or post-IV hooks
> +	 * cannot be re-derived by the cipher between data units.
> +	 */
> +	if (!crypt_integrity_aead(cc) && cc->tfms_count == 1 &&
> +	    cc->iv_gen_ops &&
> +	    (cc->iv_gen_ops == &crypt_iv_plain_ops ||
> +	     cc->iv_gen_ops == &crypt_iv_plain64_ops) &&
> +	    !cc->iv_gen_ops->post &&
> +	    !cc->integrity_tag_size && !cc->integrity_iv_size &&
> +	    crypto_skcipher_supports_multi_data_unit(cc->cipher_tfm.tfms[0])) {
> +		ret = crypto_skcipher_set_data_unit_size(cc->cipher_tfm.tfms[0],
> +							 cc->sector_size);
> +		if (!ret) {
> +			set_bit(CRYPT_MULTI_DATA_UNIT, &cc->cipher_flags);
> +			DMINFO("Using multi-data-unit crypto offload (du=%u)",
> +			       cc->sector_size);
> +		} else {
> +			/*
> +			 * The driver advertised the capability via cra_flags
> +			 * but rejected the requested data unit size.  This is
> +			 * a driver bug worth seeing in dmesg; fall back to
> +			 * the per-sector path so the device still activates.
> +			 */
> +			DMWARN_LIMIT("multi-DU offload disabled: %s rejected du=%u (%d)",
> +				     crypto_skcipher_driver_name(cc->cipher_tfm.tfms[0]),
> +				     cc->sector_size, ret);
> +			ret = 0;
> +		}
> +	}
> +
>  	/* wipe the kernel key payload copy */
>  	if (cc->key_string)
>  		memset(cc->key, 0, cc->key_size * sizeof(u8));
> -- 
> 2.47.3
> 

^ permalink raw reply

* Re: [PATCH] block: partitions: replace __get_free_page() with kmalloc()
From: Mike Rapoport @ 2026-05-25  9:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, linux-kernel, linux-mm
In-Reply-To: <ahP3R2nH8-bHMVf0@infradead.org>

On Mon, May 25, 2026 at 12:16:23AM -0700, Christoph Hellwig wrote:
> On Mon, May 25, 2026 at 09:52:09AM +0300, Mike Rapoport wrote:
> > On Sun, May 24, 2026 at 11:08:31PM -0700, Christoph Hellwig wrote:
> > > On Wed, May 20, 2026 at 11:15:52AM +0300, Mike Rapoport (Microsoft) wrote:
> > > > check_partition() allocates a buffer to use as backing buffer for
> > > > seq_buf.
> > > > 
> > > > This buffer can be allocated with kmalloc() as there's nothing special
> > > > about it to go directly to the page allocator.
> > > > 
> > > > Replace use of __get_free_page() with kmalloc() and free_page() with
> > > > kfree().
> > > 
> > > So I heard various vague references that we should replace
> > > __get_free_page with kmalloc, but nothing definitive.  Can you please
> > > point to a good resource for that?
> > 
> > There was quite recent discussion when I posted patches that change
> > __get_free_page to return void *:
> > 
> > https://lore.kernel.org/all/20251018093002.3660549-1-rppt@kernel.org/
> 
> This doesn't tell much more.
> 
> > And an old thread when Al posted similar patches:
> > 
> > https://lore.kernel.org/all/CA+55aFwp4iy4rtX2gE2WjBGFL=NxMVnoFeHqYa2j1dYOMMGqxg@mail.gmail.com/T/#u<S-Del>
> 
> This does, but it still fails to explain why kmalloc performs just as
> well as __get_free_page(s) these days.

I don't think that in this case - a single allocation on the cold path -
the performance difference is even measurable.

Nevertheless allocations from slab caches are way faster than
__get_free_page() (i.e.  alloc_pages()) as it's essentially lockless
cmpxchg. Allocations that need to refill the cache do alloc_pages() with a
little of slab bookkeeping overhead.

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH] block: fix dio leak on integrity metadata mapping failure
From: Christoph Hellwig @ 2026-05-25  7:45 UTC (permalink / raw)
  To: Aaron Esau
  Cc: linux-block, Jens Axboe, Anuj Gupta, Kanchan Joshi,
	Christoph Hellwig, Keith Busch, linux-kernel, stable
In-Reply-To: <20260518074258.1600307-1-aaron1esau@gmail.com>

> +			if (unlikely(ret)) {
> +				bio->bi_status = BLK_STS_IOERR;
> +				bio_endio(bio);
> +				break;
> +			}

AFAICS the same issue also exists for the other goto fail case,
so we should convert the code at that label to a bio_endio().

I think both this and the existing -EAGAIN case and even the
-EIOCBQUEUED case leak the reference on the original bio.  Or am
I missing something?

It might makes sense to stop playing games with that bio refcount
and just have a status field in struct blkdev_dio.

^ permalink raw reply

* Re: [PATCH] block, nvme: export and use passthrough stats
From: Christoph Hellwig @ 2026-05-25  7:35 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-nvme, axboe, hch, nilay, Keith Busch
In-Reply-To: <20260525073219.GA5305@lst.de>

On Mon, May 25, 2026 at 09:32:19AM +0200, Christoph Hellwig wrote:
> On Fri, May 22, 2026 at 08:15:37AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > So stacking drivers can also report passthrough workloads through
> > iostat.
> 
> Except that this patch also wires it up in nvme.  So please split
> it and write proper commit messages for both parts.

.. and while we're at it - it actually is called from the lower nvme
drivers, just conditional on being a mpath request.  So this is
in many ways a bit weird, and the comments / commit logs should
probably explain it a bit.


^ permalink raw reply

* Re: [PATCH] block, nvme: export and use passthrough stats
From: Christoph Hellwig @ 2026-05-25  7:32 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-nvme, axboe, hch, nilay, Keith Busch
In-Reply-To: <20260522151537.1509784-1-kbusch@meta.com>

On Fri, May 22, 2026 at 08:15:37AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> So stacking drivers can also report passthrough workloads through
> iostat.

Except that this patch also wires it up in nvme.  So please split
it and write proper commit messages for both parts.

> +static inline bool blk_rq_passthrough_stats(struct request *req)
> +{

And maybe add a kerneldoc comment now that this is public.


^ permalink raw reply

* Re: [PATCH v3 04/10] block: introduce dma map backed bio type
From: Pavel Begunkov @ 2026-05-25  7:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-block, linux-kernel, linux-nvme,
	linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
	Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
	William Power, Phil Cayton, Jason Gunthorpe
In-Reply-To: <20260520083043.GA18893@lst.de>

On 5/20/26 09:30, Christoph Hellwig wrote:
> On Mon, May 18, 2026 at 11:29:54AM +0100, Pavel Begunkov wrote:
>>>>    	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
>>>>    	BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
>>>> +	BIO_DMABUF_MAP, /* Using premmaped dma buffers */
>>>
>>> Shouldn't this be a REQ_ flag as we should never mix and match bios with
>>> and without this flag in a single request?
>>
>> Do you mean adding both and propagating it from bio to req? submit_bio()
>> takes a bio, so we still need to set it there before it reaches blk-mq.
>> And there might be bio-based drivers using it in the future.
> 
> I think I forgot to reply to this, so let's do this now.
> 
> REQ_ is actually used by both bios and requests, so if you set it in
> bio->bi_opf it will automatically get propagated to the request, but
> it can also always be tested on the bio, including by bio-based
> drivers.

Ah yes, good point, thanks

-- 
Pavel Begunkov


^ permalink raw reply

* Re: [PATCH] block: partitions: replace __get_free_page() with kmalloc()
From: Christoph Hellwig @ 2026-05-25  7:16 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-kernel,
	linux-mm
In-Reply-To: <ahPxmSqT4Y4jcAbQ@kernel.org>

On Mon, May 25, 2026 at 09:52:09AM +0300, Mike Rapoport wrote:
> On Sun, May 24, 2026 at 11:08:31PM -0700, Christoph Hellwig wrote:
> > On Wed, May 20, 2026 at 11:15:52AM +0300, Mike Rapoport (Microsoft) wrote:
> > > check_partition() allocates a buffer to use as backing buffer for
> > > seq_buf.
> > > 
> > > This buffer can be allocated with kmalloc() as there's nothing special
> > > about it to go directly to the page allocator.
> > > 
> > > Replace use of __get_free_page() with kmalloc() and free_page() with
> > > kfree().
> > 
> > So I heard various vague references that we should replace
> > __get_free_page with kmalloc, but nothing definitive.  Can you please
> > point to a good resource for that?
> 
> There was quite recent discussion when I posted patches that change
> __get_free_page to return void *:
> 
> https://lore.kernel.org/all/20251018093002.3660549-1-rppt@kernel.org/

This doesn't tell much more.

> And an old thread when Al posted similar patches:
> 
> https://lore.kernel.org/all/CA+55aFwp4iy4rtX2gE2WjBGFL=NxMVnoFeHqYa2j1dYOMMGqxg@mail.gmail.com/T/#u<S-Del>

This does, but it still fails to explain why kmalloc performs just as
well as __get_free_page(s) these days.

And such an explanation or link to it should go into every patch doing
this switch.


^ permalink raw reply

* Re: [PATCH] block: partitions: replace __get_free_page() with kmalloc()
From: Mike Rapoport @ 2026-05-25  6:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, linux-kernel, linux-mm
In-Reply-To: <ahPnX-CwVgK-9chp@infradead.org>

On Sun, May 24, 2026 at 11:08:31PM -0700, Christoph Hellwig wrote:
> On Wed, May 20, 2026 at 11:15:52AM +0300, Mike Rapoport (Microsoft) wrote:
> > check_partition() allocates a buffer to use as backing buffer for
> > seq_buf.
> > 
> > This buffer can be allocated with kmalloc() as there's nothing special
> > about it to go directly to the page allocator.
> > 
> > Replace use of __get_free_page() with kmalloc() and free_page() with
> > kfree().
> 
> So I heard various vague references that we should replace
> __get_free_page with kmalloc, but nothing definitive.  Can you please
> point to a good resource for that?

There was quite recent discussion when I posted patches that change
__get_free_page to return void *:

https://lore.kernel.org/all/20251018093002.3660549-1-rppt@kernel.org/

And an old thread when Al posted similar patches:

https://lore.kernel.org/all/CA+55aFwp4iy4rtX2gE2WjBGFL=NxMVnoFeHqYa2j1dYOMMGqxg@mail.gmail.com/T/#u<S-Del>

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v2 06/21] iov_iter: Make iov_iter_get_pages*() wrap iov_iter_extract_pages()
From: Christoph Hellwig @ 2026-05-25  6:13 UTC (permalink / raw)
  To: David Howells
  Cc: Christian Brauner, Matthew Wilcox, Christoph Hellwig,
	Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French,
	ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
	Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs,
	linux-afs, linux-cifs, linux-nfs, ceph-devel, v9fs, linux-erofs,
	linux-fsdevel, linux-kernel, linux-block
In-Reply-To: <20260518222959.488126-7-dhowells@redhat.com>

On Mon, May 18, 2026 at 11:29:38PM +0100, David Howells wrote:
> Make iov_iter_get_pages*() wrap iov_iter_extract_pages() for kernel
> iterator types (e.g. ITER_BVEC, ITER_FOLIOQ, ITER_XARRAY).  The pages
> obtained have their refcounts incremented afterwards if they're not slab
> pages.  ITER_KVEC is left returning -EFAULT.

Just kill off iov_iter_get_pages*, please.  Or at least stop using it
where these types matter.


^ permalink raw reply

* Re: [PATCH v2 05/21] Add a function to kmap one page of a multipage bio_vec
From: Christoph Hellwig @ 2026-05-25  6:13 UTC (permalink / raw)
  To: David Howells
  Cc: Christian Brauner, Matthew Wilcox, Christoph Hellwig,
	Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French,
	ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
	Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs,
	linux-afs, linux-cifs, linux-nfs, ceph-devel, v9fs, linux-erofs,
	linux-fsdevel, linux-kernel, linux-block
In-Reply-To: <20260518222959.488126-6-dhowells@redhat.com>

On Mon, May 18, 2026 at 11:29:37PM +0100, David Howells wrote:
> Add a function to kmap one page of a multipage bio_vec by offset (which is
> added to the offset in the bio_vec internally).  The caller is responsible
> for calculating how much of the page is then available.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Christoph Hellwig <hch@infradead.org>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: linux-block@vger.kernel.org
> cc: netfs@lists.linux.dev
> cc: linux-fsdevel@vger.kernel.org
> ---
>  include/linux/bvec.h | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index d36dd476feda..9df4a56fef61 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -299,4 +299,21 @@ static inline phys_addr_t bvec_phys(const struct bio_vec *bvec)
>  	return page_to_phys(bvec->bv_page) + bvec->bv_offset;
>  }
>  
> +/**
> + * kmap_local_bvec - Map part of a bvec into the kernel virtual address space
> + * @bvec: bvec to map
> + * @offset: Offset into bvec
> + *
> + * Map the page containing the byte at @offset into the kernel virtual address
> + * space.  The caller is responsible for making sure this doesn't overrun.
> + *
> + * Call kunmap_local on the returned address to unmap.
> + */
> +static inline void *kmap_local_bvec(struct bio_vec *bvec, size_t offset)

The name is rather confusing for something that does not map the entire
bvec, and is an anagram of the existing bvec_kmap_local.  So please
rename it to bvec_kmap_partial or something.

> +{
> +	offset += bvec->bv_offset;
> +
> +	return kmap_local_page(bvec->bv_page + offset / PAGE_SIZE) + offset % PAGE_SIZE;

Overly long line.  Also this can use shits and byte masking to be a tad
more efficient and matching the rest of the bvec code.

Users of this would be interesting and why you're not simply using a
bvec_iter at page granularity, which is what other block kmap code
does.


^ permalink raw reply

* Re: [PATCH] block: partitions: replace __get_free_page() with kmalloc()
From: Christoph Hellwig @ 2026-05-25  6:08 UTC (permalink / raw)
  To: Mike Rapoport (Microsoft); +Cc: Jens Axboe, linux-block, linux-kernel, linux-mm
In-Reply-To: <20260520-block-v1-1-6463dc2cf042@kernel.org>

On Wed, May 20, 2026 at 11:15:52AM +0300, Mike Rapoport (Microsoft) wrote:
> check_partition() allocates a buffer to use as backing buffer for
> seq_buf.
> 
> This buffer can be allocated with kmalloc() as there's nothing special
> about it to go directly to the page allocator.
> 
> Replace use of __get_free_page() with kmalloc() and free_page() with
> kfree().

So I heard various vague references that we should replace
__get_free_page with kmalloc, but nothing definitive.  Can you please
point to a good resource for that?


^ permalink raw reply

* Re: [PATCH] block: Avoid mounting the bdev pseudo-filesystem in userspace
From: Christoph Hellwig @ 2026-05-25  6:07 UTC (permalink / raw)
  To: Denis Arefev; +Cc: Jens Axboe, linux-block, linux-kernel, lvc-project, stable
In-Reply-To: <20260521072857.5078-1-arefev@swemel.ru>

On Thu, May 21, 2026 at 10:28:56AM +0300, Denis Arefev wrote:
> The bdev pseudo-filesystem is an internal kernel filesystem with which
> userspace should not interfere. Unregister it so that userspace cannot
> even attempt to mount it.
> 
> This fixes a bug [1] that occurs when attempting to access files,
> because the system call move_mount() uses pointers declared in the
> inode_operations structure, which for the bdev pseudo-filesystem
> are always equal to 0. `inode->i_op = &empty_iops;`

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply

* Re: [PATCH] block: Add bvec_folio()
From: Christoph Hellwig @ 2026-05-25  6:06 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: Jens Axboe, linux-block, linux-kernel, io-uring, linux-mm,
	Leon Romanovsky
In-Reply-To: <20260522182122.2489391-1-willy@infradead.org>

> +/**
> + * bvec_folio - Return the first folio referenced by this bvec
> + * @bv: bvec to access
> + *
> + * bvecs can span multiple folios.  Unless you know that this
> + * bvec does not, you may be better off using something like
> + * bio_for_each_folio_all() which iterates over all folios.
> + */
> +static inline struct folio *bvec_folio(const struct bio_vec *bv)
> +{
> +	return page_folio(bv->bv_page);
> +}

The comment here is confusing.  bio_for_each_folio_all is a helper that
only works in the submitter side, and not for anything using the
bvec_iter required for drivers or anything else sitting below a
potential bio clone/split or using bvecs from an upper layer (like
ITER_BVEC direct I/O).  Additionally bv_page can be a different
page than the fist page due to large bv_offset on split bios.

So I'm not against the function per se, but the documentation must
explain the minefields it is stepping into a bit better.

^ permalink raw reply

* Re: [PATCH v1] block: switch numa_node to int in blk_mq_hw_ctx and init_request
From: Christoph Hellwig @ 2026-05-25  6:03 UTC (permalink / raw)
  To: Mateusz Nowicki
  Cc: Jens Axboe, Caleb Sander Mateos, Sung-woo Kim, Josef Bacik,
	Alasdair Kergon, Mike Snitzer, Mikulas Patocka,
	Benjamin Marzinski, Ulf Hansson, Richard Weinberger, Zhihao Cheng,
	Miquel Raynal, Vignesh Raghavendra, Sven Peter, Janne Grunau,
	Neal Gompa, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	James E.J. Bottomley, Martin K. Petersen, Thomas Fourier, Al Viro,
	Luke Wang, Kees Cook, linux-block, linux-kernel, nbd, dm-devel,
	linux-mmc, linux-mtd, asahi, linux-arm-kernel, linux-nvme,
	linux-scsi
In-Reply-To: <20260523125210.272274-1-mateusz.nowicki@posteo.net>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply

* Re: [PATCH] block: skip sync_blockdev() on surprise removal in bdev_mark_dead()
From: Christoph Hellwig @ 2026-05-25  5:58 UTC (permalink / raw)
  To: Chao Shi
  Cc: Jens Axboe, Christoph Hellwig, Christian Brauner, Josef Bacik,
	linux-block, linux-kernel, Sungwoo Kim, Dave Tian, Weidong Zhu
In-Reply-To: <20260522220025.1770388-1-coshi036@gmail.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply

* Re: [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices
From: Christoph Hellwig @ 2026-05-25  5:30 UTC (permalink / raw)
  To: Tal Zussman
  Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
	Christoph Hellwig, Dave Chinner, Bart Van Assche, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <f8d509c7-c0c0-4196-9a82-a6f8ae24ffc2@columbia.edu>

On Fri, May 22, 2026 at 07:17:15PM -0400, Tal Zussman wrote:
> A: So this actually seems legit... doesn't look like anything actually calls 
> blkdev_write_begin() or blkdev_write_end(), unless I'm missing something.
> block_write_begin_iocb() usage seems necessary for bh-based filesystems, but
> block devices seem to use iomap for writes unconditionally.

Yes.  Maybe send a separate patch to remove these now unused methods?
Or I could do that since I forgot to remove them when I should have.


^ permalink raw reply

* Re: Observing higher CPU utilization during random IO fio testing
From: Yu Kuai @ 2026-05-25  5:28 UTC (permalink / raw)
  To: Jens Axboe, Wen Xiong, linux-block
  Cc: tom.leiming, jmoyer, Gjoyce, wenxiong, yukuai
In-Reply-To: <1cb74987-34e2-422f-93cf-9174fe913538@kernel.dk>

Hi,

在 2026/5/22 5:52, Jens Axboe 写道:
> On 5/21/26 1:44 PM, Wen Xiong wrote:
>> Hi All,
>>
>> Our performance team observed the higher CPU utilization in RHEL10 compared to RHEL9.8, observed the similar issue in upstream kernel(v7.1-rc4) as well when running FIO random IO tests.
>>
>> System configuration:
>> 47 dedicate cores
>> 120 GB memory
>> PCIe4 2-Port 64Gb FC Adapter
>> FlashSystem: FS9500, 12 LUNs/FC port, 100G each LUN.
>>
>> Random IO tests are more CPU intensive than sequential IO tests due to several factors: more context switching, Interrupt Handling,  cache Inefficiency etc. We found out the following patch which caused the higher CPU utilization in rhel10 and newer linux kernel:
>>
>> commit 060406c61c7cb4bbd82a02d179decca9c9bb3443 (HEAD)
>> Author: Yu Kuai <yukuai3@huawei.com>
>> Date:   Thu May 9 20:38:25 2024 +0800
>>
>> block: add plug while submitting IO
>>
>> So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()
>> and __blkdev_direct_IO_async(), block layer can still benefit from caching
>> nsec time in the plug.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> We reverted above patch in rhel10 kernel and upstream 7.1-rc4, saw lower CPU utilization when doing the same FIO test.
>>
>> The patch adds plugging in __submit_bio() in block layer, maybe cause performance degradation:
>> - Random IO tests have less merging, flush overhead.
>> - More IO scheduler interaction, forces requests through scheduler instead of direct dispatch(direct dispatch to hardware queue)

I don't understand this point. Can you explain more? I think plug should not matter if request go through scheduler or not.

>> - Poor cache locality during plug operation
>>
>> Below are some performance data that our performance team collected:
>>
>> RHEL9.8 comparison RHEL10.0
>> Iotype     qd        nj         rmix      mpstat busy delta    lparstat delta
>> Randrw     1         20         100       135%                 109%
>> Randrw     1         40         100       72%                  81%
>> Randrw     1         20         70        278%                 174%
>> Randrw     1         40         70        272%                 191%
>> Randrw     1         20         0         93%                  30%
>> Randrw     1         40         0         104%                 36%
>>
>> RHEL 9.8 comparison RHEL10 with reverting above plugging patch in block layer.h
>> Iotype     qd        nj         rmix       mpstat busy delta    lparstat deltab
>> Randrw     1         20         100        -12%                 20%
>> Randrw     1         40         100        -42%                 -4%
>> Randrw     1         20         70         70%                  71%
>> Randrw     1         40         70         %51                  60%
>> Randrw     1         20         0          -14%                 -43%
>> Randrw     1         40         0          -33%                 -51%
>>
>> Can a block layer expert help us resolve this high CPU utilization performance issue?

And I assume you're testing raw disk, because filesystems should always enable plug.

>> Let us know if you need more performance data or other perf data.

Yes, perf data will be helpful. And please show your test in details and I'll
check if I can reproduce it.

> Let's CC Yu Kuai who wrote that commit, that might help.
>
-- 
Thansk,
Kuai

^ permalink raw reply

* Re: [PATCH v6 3/4] buffer: add dropbehind writeback support
From: Christoph Hellwig @ 2026-05-25  5:25 UTC (permalink / raw)
  To: Tal Zussman
  Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
	Christoph Hellwig, Dave Chinner, Bart Van Assche, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <d98a35c1-882b-4752-a2c7-378872fc97a3@columbia.edu>

On Fri, May 22, 2026 at 07:14:46PM -0400, Tal Zussman wrote:
> >  	if (IS_ERR(folio))
> >  		return PTR_ERR(folio);
> > @@ -2160,6 +2165,13 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
> >  	*foliop = folio;
> >  	return status;
> >  }
> 
> Q: "Is an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() needed for
> block_write_begin_iocb()?
> Since the legacy wrapper block_write_begin() remains exported, it appears
> filesystems compiled as loadable modules (like ext4 or fat) will encounter
> linker errors if they attempt to transition to the new API."
> 
> A: Can be added once filesystems want to use it.

Yeah, Sashiko is really stupid and counterproductive here.

> > @@ -2715,6 +2727,9 @@ static void submit_bh_wbc(blk_opf_t opf, struct buffer_head *bh,
> >  
> >  	bio = bio_alloc(bh->b_bdev, 1, opf, GFP_NOIO);
> >  
> > +	if (folio_test_dropbehind(bh->b_folio))
> > +		bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
> > +
> 
> Q: "Does this add unnecessary workqueue overhead for read operations?
> While task-context deferral is required for asynchronous writeback completions
> to allow folio_end_writeback() to sleep and invalidate the folio, read
> dropbehind invalidation is handled synchronously in the reader's task context
> via filemap_end_dropbehind_read().
> Could this be gated by an op == REQ_OP_WRITE or op_is_write(opf) check so
> that read bio completions aren't forced into a workqueue?"
> 
> A: Seems worth adding a op_is_write(opf) check to avoid deferring read
> dropbehind completions.

Yes.


^ permalink raw reply

* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Christoph Hellwig @ 2026-05-25  5:24 UTC (permalink / raw)
  To: Tal Zussman
  Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
	Christoph Hellwig, Dave Chinner, Bart Van Assche, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	linux-rt-devel
In-Reply-To: <ea6fc01f-5cb7-4a04-9f92-bbd2791fea51@columbia.edu>

[adding the PREEMPT-RT maintainers and list for one and a half questions
for them a bit below]

On Fri, May 22, 2026 at 07:09:59PM -0400, Tal Zussman wrote:
> > +		while ((bio = bio_list_pop(&list)))
> > +			bio->bi_end_io(bio);
> 
> Q: "Does this code handle bios that do not have a completion callback?
> If a bio is submitted with the BIO_COMPLETE_IN_TASK flag but has a NULL
> bi_end_io pointer (like some flush or discard bios), executing
> bio->bi_end_io(bio) directly will cause a NULL pointer dereference.
> The synchronous completion path in bio_endio() checks if bio->bi_end_io is
> non-NULL before calling it. Should a similar check be added here?"
> 
> A: Every bio that reaches this point should have bi_end_io set. There's no
> point in deferring a bio without a completion handler.

Yes.  The only bios without bi_end_io that can end up in bio_endio
are those attached to passthrough requests, and deferring them of
course doesn't make sense.

> Q: "Could this loop also monopolize the CPU and cause latency spikes?
> Because the work uses a 1-jiffie delay, tens of thousands of bios might
> accumulate under heavy I/O load. Processing the entire batch sequentially
> without yielding or calling cond_resched() inside the inner loop could
> lead to scheduler latency or watchdog warnings on non-preemptible kernels."
> 
> A: With PREEMPT_LAZY as the default, the worker is preemptible, as per 
> Hillf and Christoph, so this is unnecessary.

Yes.  Sashiko tends to spew this a lot, which is odd as there were clear
recommendations to only do this absolutely if needed even before
PREEMPT_LAZY.

> > +
> > +	local_lock_irqsave(&bio_complete_batch.lock, flags);
> 
> Q: "Is it safe to use local_lock_irqsave() here when called from an atomic
> context?
> On CONFIG_PREEMPT_RT kernels, local_lock_t maps to a per-CPU spinlock_t,
> which is a sleepable rt_mutex. Since __bio_complete_in_task() is specifically
> called when bio_in_atomic() is true (which includes hardware interrupts or
> execution under a raw_spinlock_t), attempting to acquire a sleepable lock
> here would trigger an "Invalid wait context" lockdep warning.
> Would a lockless list (llist) be more appropriate here to avoid sleeping
> in atomic contexts?"
> 
> A: This seems legit, but I'm not super familiar with PREEMPT_RT. I don't want
> to switch to raw_spinlock_t, as it seems like that would add unnecessary
> overhead on non-PREEMPT_RT kernels. I think switching to use local_irq_save()
> (as is done for the per-CPU bio allocation cache) should work.

Adding the PREEMPT_RT maintainers for this as it is above my pay grade.

> Q: "Does creating this workqueue with WQ_MEM_RECLAIM break the local_lock_irq()
> protection in bio_complete_work_fn()?
> When a workqueue has WQ_MEM_RECLAIM, it spawns a global rescuer thread that
> can execute per-CPU work items during worker pool congestion. This rescuer
> thread executes unbound, meaning it could run on CPU B while processing
> CPU A's work item.
> Since local_lock operates strictly on the currently executing CPU, the
> rescuer thread on CPU B would acquire CPU B's lock, while popping elements
> from CPU A's list (derived via container_of()).
> If an interrupt on CPU A concurrently calls __bio_complete_in_task(),
> it will acquire CPU A's lock and modify the same list without mutual
> exclusion, potentially causing list corruption."
> 
> A: The rescuer should run on the same CPU, not unbound, so this is not an
> issue.

This is another area where the PREEMPT_RT/scheduler folks might be able
to help.

> static inline bool bio_complete_in_task(struct bio *bio)
> {
> 	if (bio_flagged(bio, BIO_COMPLETE_IN_TASK))
> 		return false;
> 	if (!bio_in_atomic())
> 		return false;
> 	bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
> 	__bio_complete_in_task(bio);
> 	return true;
> }
> 
> We can use the BIO_COMPLETE_IN_TASK flag to indicate that it's already
> been deferred to the workqueue as is safe to run.

Would be nice to avoid this, but yes.


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox