Linux block layer
* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Tal Zussman @ 2026-05-22 23:09 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
	Christoph Hellwig
  Cc: Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
	linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <20260514-blk-dontcache-v6-1-782e2fa7477b@columbia.edu>

On 5/14/26 5:51 PM, Tal Zussman wrote:

Responding to Sashiko review inline:

Link: https://sashiko.dev/#/patchset/20260514-blk-dontcache-v6-0-782e2fa7477b%40columbia.edu

> +static void bio_complete_work_fn(struct work_struct *w)
> +{
> +	struct delayed_work *dw = to_delayed_work(w);
> +	struct bio_complete_batch *batch =
> +		container_of(dw, struct bio_complete_batch, work);
> +
> +	while (1) {
> +		struct bio_list list;
> +		struct bio *bio;
> +
> +		local_lock_irq(&bio_complete_batch.lock);
> +		list = batch->list;
> +		bio_list_init(&batch->list);
> +		local_unlock_irq(&bio_complete_batch.lock);
> +
> +		if (bio_list_empty(&list))
> +			break;
> +
> +		while ((bio = bio_list_pop(&list)))
> +			bio->bi_end_io(bio);

Q: "Does this code handle bios that do not have a completion callback?
If a bio is submitted with the BIO_COMPLETE_IN_TASK flag but has a NULL
bi_end_io pointer (like some flush or discard bios), executing
bio->bi_end_io(bio) directly will cause a NULL pointer dereference.
The synchronous completion path in bio_endio() checks if bio->bi_end_io is
non-NULL before calling it. Should a similar check be added here?"

A: Every bio that reaches this point should have bi_end_io set. There's no
point in deferring a bio without a completion handler.

Q: "Could this loop also monopolize the CPU and cause latency spikes?
Because the work uses a 1-jiffie delay, tens of thousands of bios might
accumulate under heavy I/O load. Processing the entire batch sequentially
without yielding or calling cond_resched() inside the inner loop could
lead to scheduler latency or watchdog warnings on non-preemptible kernels."

A: With PREEMPT_LAZY as the default, the worker is preemptible, as per 
Hillf and Christoph, so this is unnecessary.

> +		if (need_resched()) {
> +			bool is_empty;
> +
> +			local_lock_irq(&bio_complete_batch.lock);
> +			is_empty = bio_list_empty(&batch->list);
> +			local_unlock_irq(&bio_complete_batch.lock);
> +			if (!is_empty)
> +				mod_delayed_work_on(batch->cpu,
> +						    bio_complete_wq,
> +						    &batch->work, 0);
> +			break;
> +		}
> +	}
> +}
> +
> +void __bio_complete_in_task(struct bio *bio)
> +{
> +	struct bio_complete_batch *batch;
> +	unsigned long flags;
> +	bool was_empty;
> +
> +	local_lock_irqsave(&bio_complete_batch.lock, flags);

Q: "Is it safe to use local_lock_irqsave() here when called from an atomic
context?
On CONFIG_PREEMPT_RT kernels, local_lock_t maps to a per-CPU spinlock_t,
which is a sleepable rt_mutex. Since __bio_complete_in_task() is specifically
called when bio_in_atomic() is true (which includes hardware interrupts or
execution under a raw_spinlock_t), attempting to acquire a sleepable lock
here would trigger an "Invalid wait context" lockdep warning.
Would a lockless list (llist) be more appropriate here to avoid sleeping
in atomic contexts?"

A: This seems legit, but I'm not super familiar with PREEMPT_RT. I don't want
to switch to raw_spinlock_t, as it seems like that would add unnecessary
overhead on non-PREEMPT_RT kernels. I think switching to use local_irq_save()
(as is done for the per-CPU bio allocation cache) should work.

> +	batch = this_cpu_ptr(&bio_complete_batch);
> +	was_empty = bio_list_empty(&batch->list);
> +	bio_list_add(&batch->list, bio);
> +	local_unlock_irqrestore(&bio_complete_batch.lock, flags);
> +
> +	if (was_empty)
> +		mod_delayed_work_on(batch->cpu, bio_complete_wq,
> +				    &batch->work, 1);
> +}
> +EXPORT_SYMBOL_GPL(__bio_complete_in_task);
> +
>  static inline bool bio_remaining_done(struct bio *bio)
>  {
>  	/*
> @@ -1791,7 +1865,9 @@ void bio_endio(struct bio *bio)
>  	}
>  #endif
>  
> -	if (bio->bi_end_io)
> +	if (bio_flagged(bio, BIO_COMPLETE_IN_TASK) && bio_in_atomic())
> +		__bio_complete_in_task(bio);
> +	else if (bio->bi_end_io)
>  		bio->bi_end_io(bio);
>  }
>  EXPORT_SYMBOL(bio_endio);
> @@ -1977,6 +2053,51 @@ int bioset_init(struct bio_set *bs,
>  }
>  EXPORT_SYMBOL(bioset_init);
>  
> +static int bio_complete_batch_cpu_online(unsigned int cpu)
> +{
> +	enable_delayed_work(&per_cpu(bio_complete_batch, cpu).work);
> +	return 0;
> +}
> +

Q: "Could a CPU hotplug rollback strand deferred bios indefinitely?
During the early teardown phase (CPUHP_AP_ONLINE_DYN),
bio_complete_batch_cpu_down_prep() disables the delayed work. However, the
CPU remains active and can continue completing bios. When these bios
complete, __bio_complete_in_task() adds them to the list with
was_empty = true, but mod_delayed_work_on() fails to queue the work
because it is disabled.
If the CPU offline operation is aborted, bio_complete_batch_cpu_online()
re-enables the delayed work but does not explicitly queue it. Because the
list is no longer empty, any future bio completions will see
was_empty == false and skip calling mod_delayed_work_on(), permanently
stranding the bios."

A: This is legit. The solution is to requeue the work if the list isn't
empty on CPU online.
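
To make the stranding scenario concrete, here is a small userspace model of
the rollback sequence. It is only a sketch: every type and function below
(struct batch_model, model_queue_work(), and so on) is a hypothetical
stand-in for the kernel pieces, and the "requeue" flag models the proposed
fix of requeueing the work from the online callback when the list is
non-empty.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: these types and helpers are hypothetical. */
struct batch_model {
	int pending;		/* bios sitting on the per-CPU list */
	bool work_enabled;	/* false between down_prep and online */
	bool work_queued;	/* true once the drain work is queued */
};

/* Models mod_delayed_work_on(): queueing is a no-op while disabled. */
static void model_queue_work(struct batch_model *b)
{
	if (b->work_enabled)
		b->work_queued = true;
}

/* Models __bio_complete_in_task(): only the first bio kicks the work. */
static void model_defer_bio(struct batch_model *b)
{
	bool was_empty = (b->pending == 0);

	b->pending++;
	if (was_empty)
		model_queue_work(b);
}

/* Models the down_prep callback disabling the delayed work. */
static void model_cpu_down_prep(struct batch_model *b)
{
	b->work_enabled = false;
	b->work_queued = false;
}

/* Models the online callback; 'requeue' selects the proposed fix. */
static void model_cpu_online(struct batch_model *b, bool requeue)
{
	b->work_enabled = true;
	if (requeue && b->pending > 0)
		model_queue_work(b);
}

/* Models the worker: it drains the list only if it was actually queued. */
static void model_run_worker(struct batch_model *b)
{
	if (!b->work_queued)
		return;
	assert(b->work_enabled);	/* never queued while disabled */
	b->work_queued = false;
	b->pending = 0;
}

/* Returns the number of bios left stranded after an aborted offline. */
static int model_rollback(bool requeue)
{
	struct batch_model b = { 0, true, false };

	model_cpu_down_prep(&b);
	model_defer_bio(&b);		/* list was empty, but queueing fails */
	model_cpu_online(&b, requeue);	/* offline operation aborted */
	model_defer_bio(&b);		/* was_empty is false: no queueing */
	model_run_worker(&b);
	return b.pending;
}
```

In this model, model_rollback(false) leaves both bios on the list, while
model_rollback(true) drains them, which matches the reasoning above.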

> +/*
> + * Disable this CPU's delayed work so that it cannot run on an unbound worker
> + * after the CPU is offlined.
> + */
> +static int bio_complete_batch_cpu_down_prep(unsigned int cpu)
> +{
> +	disable_delayed_work_sync(&per_cpu(bio_complete_batch, cpu).work);
> +	return 0;
> +}
> +
> +/*
> + * Drain a dead CPU's deferred bio completions. The CPU is dead and the worker
> + * is canceled so no locking is needed.
> + */
> +static int bio_complete_batch_cpu_dead(unsigned int cpu)
> +{
> +	struct bio_complete_batch *batch =
> +		per_cpu_ptr(&bio_complete_batch, cpu);
> +	struct bio *bio;
> +
> +	while ((bio = bio_list_pop(&batch->list)))
> +		bio->bi_end_io(bio);
> +

Q: "Is it possible to hit the same NULL pointer dereference issue here if the
bio lacks a bi_end_io callback?"

A: Same as above.

> +	return 0;
> +}
> +
> +static void __init bio_complete_batch_init(int cpu)
> +{
> +	struct bio_complete_batch *batch =
> +		per_cpu_ptr(&bio_complete_batch, cpu);
> +
> +	bio_list_init(&batch->list);
> +	INIT_DELAYED_WORK(&batch->work, bio_complete_work_fn);
> +	batch->cpu = cpu;
> +
> +	if (!cpu_online(cpu))
> +		disable_delayed_work_sync(&batch->work);
> +}
> +
>  static int __init init_bio(void)
>  {
>  	int i;
> @@ -1991,6 +2112,30 @@ static int __init init_bio(void)
>  				SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>  	}
>  
> +	for_each_possible_cpu(i)
> +		bio_complete_batch_init(i);
> +
> +	bio_complete_wq = alloc_workqueue("bio_complete",
> +					   WQ_MEM_RECLAIM | WQ_PERCPU, 0);

Q: "Does creating this workqueue with WQ_MEM_RECLAIM break the local_lock_irq()
protection in bio_complete_work_fn()?
When a workqueue has WQ_MEM_RECLAIM, it spawns a global rescuer thread that
can execute per-CPU work items during worker pool congestion. This rescuer
thread executes unbound, meaning it could run on CPU B while processing
CPU A's work item.
Since local_lock operates strictly on the currently executing CPU, the
rescuer thread on CPU B would acquire CPU B's lock, while popping elements
from CPU A's list (derived via container_of()).
If an interrupt on CPU A concurrently calls __bio_complete_in_task(),
it will acquire CPU A's lock and modify the same list without mutual
exclusion, potentially causing list corruption."

A: The rescuer should run on the same CPU, not unbound, so this is not an
issue.

> +	if (!bio_complete_wq)
> +		panic("bio: can't allocate bio_complete workqueue\n");
> +
> +	/*
> +	 * bio task-context completion draining on hot-unplugged CPUs:
> +	 *
> +	 *   1. Stop the per-CPU delayed work while the CPU is still online, so
> +	 *      that it cannot run on an unbound worker later.
> +	 *   2. Drain leftover bios added between worker disabling and CPU
> +	 *      offlining.
> +	 */
> +	cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> +				  "block/bio:complete:online",
> +				  bio_complete_batch_cpu_online,
> +				  bio_complete_batch_cpu_down_prep);
> +	cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
> +				  "block/bio:complete:dead",
> +				  NULL, bio_complete_batch_cpu_dead);
> +
>  	cpuhp_setup_state_multi(CPUHP_BIO_DEAD, "block/bio:dead", NULL,
>  					bio_cpu_dead);
>  
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 97d747320b35..c0214d6c28d6 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -369,6 +369,38 @@ static inline struct bio *bio_alloc(struct block_device *bdev,
>  
>  void submit_bio(struct bio *bio);
>  
> +/**
> + * bio_in_atomic - check if the current context is unsafe for bio completion
> + *
> + * Return: %true in atomic contexts (e.g. hard/soft IRQ, preempt-disabled);
> + * %false when a bio can be safely completed in the current context.
> + */
> +static inline bool bio_in_atomic(void)
> +{
> +	if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
> +		return true;
> +	if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
> +		return true;

Q: "Will this cause an infinite loop of bio offloading on kernels with
CONFIG_PREEMPT_COUNT disabled?
Because bio_in_atomic() unconditionally returns true without preempt count
support, a dynamic call to bio_complete_in_task() from within a bi_end_io()
callback will always offload the bio to the workqueue.
When the workqueue executes bio->bi_end_io(bio), the callback will evaluate
bio_complete_in_task() again, which will return true again, creating a
permanent offloading loop."

A: Legit issue. This can be solved by changing bio_complete_in_task() to:

static inline bool bio_complete_in_task(struct bio *bio)
{
	if (bio_flagged(bio, BIO_COMPLETE_IN_TASK))
		return false;
	if (!bio_in_atomic())
		return false;
	bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
	__bio_complete_in_task(bio);
	return true;
}

We can use the BIO_COMPLETE_IN_TASK flag to indicate that the bio has
already been deferred to the workqueue and is safe to run.
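
As a sanity check of the "defer at most once" idea, here is a userspace
model. It is a sketch under stated assumptions: struct model_bio, the
single global list, and all model_*() helpers are hypothetical stand-ins,
and model_in_atomic() always returns true to mimic a kernel without
CONFIG_PREEMPT_COUNT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model: all names below are hypothetical stand-ins. */
struct model_bio {
	bool complete_in_task;	/* models BIO_COMPLETE_IN_TASK */
	int end_io_calls;	/* times the callback was entered */
	bool finished;		/* callback ran its real work */
	struct model_bio *next;
};

static struct model_bio *model_deferred;	/* models the batch list */

/* Models bio_in_atomic() with CONFIG_PREEMPT_COUNT disabled. */
static bool model_in_atomic(void)
{
	return true;
}

/* Models the revised bio_complete_in_task(): defer at most once. */
static bool model_complete_in_task(struct model_bio *bio)
{
	if (bio->complete_in_task)
		return false;	/* already went through the workqueue */
	if (!model_in_atomic())
		return false;
	bio->complete_in_task = true;
	bio->next = model_deferred;
	model_deferred = bio;
	return true;
}

/* A callback that asks for task context before doing its real work. */
static void model_end_io(struct model_bio *bio)
{
	bio->end_io_calls++;
	if (model_complete_in_task(bio))
		return;		/* the worker will call us again */
	bio->finished = true;
}

/* Models the worker; returns how many callbacks it dispatched. */
static int model_drain(void)
{
	int passes = 0;

	while (model_deferred) {
		struct model_bio *bio = model_deferred;

		model_deferred = bio->next;
		bio->next = NULL;
		model_end_io(bio);
		passes++;
		assert(passes < 1000);	/* would trip on an offload loop */
	}
	return passes;
}

/* Runs one bio through the model and reports how often end_io ran. */
static int model_single_bio_callback_entries(void)
{
	struct model_bio bio = { false, 0, false, NULL };

	model_end_io(&bio);	/* first entry, e.g. from the completion IRQ */
	model_drain();
	assert(bio.finished);
	return bio.end_io_calls;
}
```

The callback is entered twice in this model: once when it defers itself and
once from the worker, where the flag makes the helper return false so the
real work runs. Without the flag check, the worker pass would defer again
and the bounded assert in model_drain() would fire.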

> +	return !preemptible();
> +}
> +
> +void __bio_complete_in_task(struct bio *bio);
> +
> +/**
> + * bio_complete_in_task - ensure a bio is completed in preemptible task context
> + * @bio: bio to complete
> + *
> + * If called from non-task context, offload the bio completion to a worker
> + * thread and return %true. Else return %false and do nothing.
> + */
> +static inline bool bio_complete_in_task(struct bio *bio)
> +{
> +	if (!bio_in_atomic())
> +		return false;
> +	__bio_complete_in_task(bio);
> +	return true;
> +}
> +
>  extern void bio_endio(struct bio *);
>  
>  static inline void bio_io_error(struct bio *bio)
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 8808ee76e73c..d49d97a050d0 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -322,6 +322,7 @@ enum {
>  	BIO_REMAPPED,
>  	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
>  	BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
> +	BIO_COMPLETE_IN_TASK, /* complete bi_end_io() in task context */
>  	BIO_FLAG_LAST
>  };
>  
> 



* Re: [PATCHv3] blk-mq: pop cached request if it is usable
From: Keith Busch @ 2026-05-22 22:55 UTC (permalink / raw)
  To: Ming Lei; +Cc: Keith Busch, axboe, hch, linux-block
In-Reply-To: <ahDT-JZZGyyTXyii@kbusch-mbp>

On Fri, May 22, 2026 at 04:08:56PM -0600, Keith Busch wrote:
> On Fri, May 22, 2026 at 12:12:18PM +0800, Ming Lei wrote:
> > On Thu, May 21, 2026 at 07:44:50PM -0600, Keith Busch wrote:
> > > On Fri, May 22, 2026 at 07:33:39AM +0800, Ming Lei wrote:
> > > > 
> > > > BTW, as mentioned in v2, the request may be added back in case of merge,
> > > > but seems not a big deal given blk_mq_free_plug_rqs() doesn't free requests
> > > > in batch.
> > > 
> > > We could introduce a special goto label for the merge case to push it
> > 
> > It can be done simply by replacing the added `blk_mq_free_request` with moving
> > it back to plug list.
> 
> What I'm worried about is hitting a blocking allocation, then the
> cached_rqs list is freed, leaving the current request from it the only
> one still holding a queue reference. I think we ought to re-enter the
> queue in that case.

Hmm, I may be mistaken here. The blocking allocation doesn't call
blk_finish_plug(), so the current plug is left intact; the other
queue_exit goto's are either from non-blocking contexts or a successful
merge that holds queue references in other ways. I guess there is no
queue_exit goto where unconditionally pushing back might be a problem.
So yeah, sorry, maybe restoring it to the cached_rqs is a worthy
optimization to make.


* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Tal Zussman @ 2026-05-22 22:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
	Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
	linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <agq2KRd8RkP1TAf5@infradead.org>

On 5/18/26 2:48 AM, Christoph Hellwig wrote:
> On Thu, May 14, 2026 at 05:51:14PM -0400, Tal Zussman wrote:
>> Some bio completion handlers need to run from preemptible task context,
>> but bio_endio() may be called from IRQ context (e.g., buffer_head
>> writeback). Callers need a way to ensure their callback eventually runs
>> from a sleepable context. Add infrastructure for that, in two forms:
>> 
>>   1. BIO_COMPLETE_IN_TASK, a bio flag the submitter sets when it knows
>>      in advance that its callback needs task context (e.g., dropbehind
>>      writeback). bio_endio() sees the flag and offloads completion to a
>>      worker automatically.
>> 
>>   2. bio_complete_in_task(), a helper that completion callbacks can
>>      invoke from within bi_end_io() when the deferral decision is
>>      dynamic (e.g., fserror reporting).
> 
> Note that method 2 is unused as of this series.  I do plan to add users
> ASAP, and at least one or two could even land through the block layer in
> this merge window.
> 
>> Both share a per-CPU batch list drained by a delayed work item on a
>> WQ_PERCPU workqueue. Producers push the bio onto the local CPU's batch
>> and schedule the work item, which then dispatches each bio's bi_end_io()
>> from task context. The delayed work item uses a 1-jiffie delay to allow
>> batches of completions to accumulate before processing.
> 
> But this 1-jiffie delay also means we unconditionally increase
> completion latency, which feels like a bad idea.  Do you have any
> measurements that show where it does benefit?  Note that queueing work
> already often has very measurable latency on its own.  This also
> directly contradicts the erofs experience that even went to an RT
> thread to reduce the latency.

I added this per Dave's feedback on v4, where he noted that XFS inodegc
uses a delayed work item to avoid context switch storms. There's only a
delay for the first bio in a batch to complete, as we only delay when the
list is empty. I'll run some experiments and measure context switches,
completion latency, etc. to see if this is necessary.
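
For reference, the "only the first bio in a batch arms the delay" behavior
can be modeled in userspace as below. This is a sketch, not a measurement:
struct model_batch and the model_*() helpers are hypothetical, and it
assumes every bio of a burst arrives before the delayed work fires.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; every name here is a hypothetical stand-in. */
struct model_batch {
	int pending;	/* bios waiting on the per-CPU list */
	int kicks;	/* times the delayed work was armed */
	int completed;	/* bios whose callback has run */
};

/* Models __bio_complete_in_task(): arm only on empty -> non-empty. */
static void model_add_bio(struct model_batch *b)
{
	bool was_empty = (b->pending == 0);

	b->pending++;
	if (was_empty)
		b->kicks++;	/* models mod_delayed_work_on(..., 1) */
}

/* Models the worker running after the delay and draining the list. */
static void model_worker(struct model_batch *b)
{
	assert(b->pending > 0);	/* only armed when the list is non-empty */
	b->completed += b->pending;
	b->pending = 0;
}

/* 'bursts' groups of 'per_burst' completions, one worker run per group. */
static struct model_batch model_run_bursts(int bursts, int per_burst)
{
	struct model_batch b = { 0, 0, 0 };

	for (int i = 0; i < bursts; i++) {
		for (int j = 0; j < per_burst; j++)
			model_add_bio(&b);
		if (b.pending > 0)
			model_worker(&b);
	}
	return b;
}
```

Under that assumption, three bursts of 100 completions arm the work three
times rather than 300, so the worker wakeups scale with the number of
bursts. The cost is that the first bio of each burst waits for the delay,
which is the latency concern raised above.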

>> Both methods are gated on bio_in_atomic(), which returns true in any
>> context where a sleeping bi_end_io() is unsafe, including
>> non-preemptible task context. This logic is copied from commit
>> c99fab6e80b7 ("erofs: fix atomic context detection when
>> !CONFIG_DEBUG_LOCK_ALLOC").
> 
> Let's not copy it, but have a prep patch that moves the erofs logic
> into the block layer under the new bio_in_atomic name.

Will do.

>> +		while ((bio = bio_list_pop(&list)))
>> +			bio->bi_end_io(bio);
>> +
>> +		if (need_resched()) {
>> +			bool is_empty;
>> +
>> +			local_lock_irq(&bio_complete_batch.lock);
>> +			is_empty = bio_list_empty(&batch->list);
>> +			local_unlock_irq(&bio_complete_batch.lock);
>> +			if (!is_empty)
>> +				mod_delayed_work_on(batch->cpu,
>> +						    bio_complete_wq,
>> +						    &batch->work, 0);
>> +			break;
>> +		}
>> +	}
> 
> On all mainstream architectures we now default to lazy preempt, which
> should remove the need for need_resched() calls.  Do you have evidence
> that we actually need this handling on recent kernels?

No evidence - I added this per feedback on v3, but agreed that it can be
simplified.

> Otherwise this looks good to me.
> 

Thanks - AI review found a couple more small things, which I'll respond to
in a separate message.



* Re: [PATCHv3] blk-mq: pop cached request if it is usable
From: Keith Busch @ 2026-05-22 22:08 UTC (permalink / raw)
  To: Ming Lei; +Cc: Keith Busch, axboe, hch, linux-block
In-Reply-To: <ag_XoloTHEgt3Y8s@fedora>

On Fri, May 22, 2026 at 12:12:18PM +0800, Ming Lei wrote:
> On Thu, May 21, 2026 at 07:44:50PM -0600, Keith Busch wrote:
> > On Fri, May 22, 2026 at 07:33:39AM +0800, Ming Lei wrote:
> > > 
> > > BTW, as mentioned in v2, the request may be added back in case of merge,
> > > but seems not a big deal given blk_mq_free_plug_rqs() doesn't free requests
> > > in batch.
> > 
> > We could introduce a special goto label for the merge case to push it
> 
> It can be done simply by replacing the added `blk_mq_free_request` with moving
> it back to plug list.

What I'm worried about is hitting a blocking allocation, then the
cached_rqs list is freed, leaving the current request from it the only
one still holding a queue reference. I think we ought to re-enter the
queue in that case.

I suggested a special goto for the successful merge because a plug merge
couldn't happen if a previous allocation did schedule since that flushes
the plug. I guess we'd have to distinguish a plug merge vs a sched
merge, though.


* [PATCH] block: skip sync_blockdev() on surprise removal in bdev_mark_dead()
From: Chao Shi @ 2026-05-22 22:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Christian Brauner, Josef Bacik, linux-block,
	linux-kernel, Chao Shi, Sungwoo Kim, Dave Tian, Weidong Zhu

bdev_mark_dead()'s @surprise == true means the device is already gone.
The filesystem callback fs_bdev_mark_dead() honours this and skips
sync_filesystem(), but the bare block device path (no ->mark_dead op)
lost its !surprise guard when the holder ->mark_dead callback was wired
up (see Fixes), and now calls sync_blockdev() unconditionally, which can
hang forever waiting on writeback that can no longer complete.

syzkaller hit this via nvme_reset_work()'s "I/O queues lost" path:
nvme_mark_namespaces_dead() -> blk_mark_disk_dead() ->
bdev_mark_dead(bdev, true) -> sync_blockdev() blocks in
folio_wait_writeback(), wedging the reset worker and every task waiting
on it.

Skip the sync on surprise removal, matching fs_bdev_mark_dead();
invalidate_bdev() still runs. Orderly removal (surprise == false) is
unchanged.

Fixes: d8530de5a6e8 ("block: call into the file system for bdev_mark_dead")
Found by FuzzNvme (Syzkaller with FEMU fuzzing framework).
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
 block/bdev.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/bdev.c b/block/bdev.c
index b8fbb9576110..7fc3f5ba22a3 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1259,7 +1259,13 @@ void bdev_mark_dead(struct block_device *bdev, bool surprise)
 		bdev->bd_holder_ops->mark_dead(bdev, surprise);
 	else {
 		mutex_unlock(&bdev->bd_holder_lock);
-		sync_blockdev(bdev);
+		/*
+		 * On surprise removal the device is already gone; syncing is
+		 * futile and can hang forever waiting on I/O that will never
+		 * complete.  Match fs_bdev_mark_dead(), which also skips it.
+		 */
+		if (!surprise)
+			sync_blockdev(bdev);
 	}
 
 	invalidate_bdev(bdev);
-- 
2.43.0



* Re: [GIT PULL] Block fixes for 7.1-rc5
From: pr-tracker-bot @ 2026-05-22 19:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, linux-block@vger.kernel.org
In-Reply-To: <a050fd86-b9dc-4a39-a274-57bbe1931d42@kernel.dk>

The pull request you sent on Fri, 22 May 2026 09:53:29 -0600:

> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git tags/block-7.1-20260522

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/3997e3bb1d30a426c0599918ebaac51698fcc959

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


* [PATCH] block: Add bvec_folio()
From: Matthew Wilcox (Oracle) @ 2026-05-22 18:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Matthew Wilcox (Oracle), linux-block, linux-kernel, io-uring,
	linux-mm, Leon Romanovsky

This is a simple helper which replaces page_folio(bvec->bv_page).
Minor improvement in readability, but the real motivation is to reduce
the number of references to bvec->bv_page so that it can be changed
with less work.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Leon Romanovsky <leon@kernel.org>
---

Hi Jens,

I have a pile of other patches which depend on this one, but they're
spread all over the kernel and don't really have anything in common
with each other.  Getting this in the next merge window will let me send
those patches next cycle.

 block/bio.c          |  6 +++---
 include/linux/bio.h  |  2 +-
 include/linux/bvec.h | 13 +++++++++++++
 io_uring/rsrc.c      |  2 +-
 mm/page_io.c         |  4 ++--
 5 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f10900b3f42..85aab3140909 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1300,7 +1300,7 @@ static void bio_free_folios(struct bio *bio)
 	int i;
 
 	bio_for_each_bvec_all(bv, bio, i) {
-		struct folio *folio = page_folio(bv->bv_page);
+		struct folio *folio = bvec_folio(bv);
 
 		if (!is_zero_folio(folio))
 			folio_put(folio);
@@ -1409,7 +1409,7 @@ int bio_iov_iter_bounce(struct bio *bio, struct iov_iter *iter, size_t maxlen,
 
 static void bvec_unpin(struct bio_vec *bv, bool mark_dirty)
 {
-	struct folio *folio = page_folio(bv->bv_page);
+	struct folio *folio = bvec_folio(bv);
 	size_t nr_pages = (bv->bv_offset + bv->bv_len - 1) / PAGE_SIZE -
 			bv->bv_offset / PAGE_SIZE + 1;
 
@@ -1443,7 +1443,7 @@ static void bio_iov_iter_unbounce_read(struct bio *bio, bool is_error,
 			bvec_unpin(&bio->bi_io_vec[1 + i], mark_dirty);
 	}
 
-	folio_put(page_folio(bio->bi_io_vec[0].bv_page));
+	folio_put(bvec_folio(&bio->bi_io_vec[0]));
 }
 
 /**
diff --git a/include/linux/bio.h b/include/linux/bio.h
index dc17780d6c1e..6613ab4519bd 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -283,7 +283,7 @@ static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
 		return;
 	}
 
-	fi->folio = page_folio(bvec->bv_page);
+	fi->folio = bvec_folio(bvec);
 	fi->offset = bvec->bv_offset +
 			PAGE_SIZE * folio_page_idx(fi->folio, bvec->bv_page);
 	fi->_seg_count = bvec->bv_len;
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index d36dd476feda..32846079b853 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -74,6 +74,19 @@ static inline void bvec_set_virt(struct bio_vec *bv, void *vaddr,
 	bvec_set_page(bv, virt_to_page(vaddr), len, offset_in_page(vaddr));
 }
 
+/**
+ * bvec_folio - Return the first folio referenced by this bvec
+ * @bv: bvec to access
+ *
+ * bvecs can span multiple folios.  Unless you know that this
+ * bvec does not, you may be better off using something like
+ * bio_for_each_folio_all() which iterates over all folios.
+ */
+static inline struct folio *bvec_folio(const struct bio_vec *bv)
+{
+	return page_folio(bv->bv_page);
+}
+
 struct bvec_iter {
 	/*
 	 * Current device address in 512 byte sectors. Only updated by the bio
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 650303626be6..5d792f70ec1e 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -102,7 +102,7 @@ static void io_release_ubuf(void *priv)
 	unsigned int i;
 
 	for (i = 0; i < imu->nr_bvecs; i++) {
-		struct folio *folio = page_folio(imu->bvec[i].bv_page);
+		struct folio *folio = bvec_folio(&imu->bvec[i]);
 
 		unpin_user_folio(folio, 1);
 	}
diff --git a/mm/page_io.c b/mm/page_io.c
index 70cea9e24d2f..a59b73f8bdd9 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -490,7 +490,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
 
 	if (ret == sio->len) {
 		for (p = 0; p < sio->pages; p++) {
-			struct folio *folio = page_folio(sio->bvec[p].bv_page);
+			struct folio *folio = bvec_folio(&sio->bvec[p]);
 
 			count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN);
 			count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio));
@@ -500,7 +500,7 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
 		count_vm_events(PSWPIN, sio->len >> PAGE_SHIFT);
 	} else {
 		for (p = 0; p < sio->pages; p++) {
-			struct folio *folio = page_folio(sio->bvec[p].bv_page);
+			struct folio *folio = bvec_folio(&sio->bvec[p]);
 
 			folio_unlock(folio);
 		}
-- 
2.47.3



* Re: [PATCH 00/12] Block storage copy offloading
From: Bart Van Assche @ 2026-05-22 16:22 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki
  Cc: Jens Axboe, linux-block, linux-scsi, linux-nvme,
	Christoph Hellwig, Nitesh Shetty
In-Reply-To: <ahBD9fRrPDuoB2cj@shinmob>

On 5/22/26 5:00 AM, Shin'ichiro Kawasaki wrote:
> FYI, a blktests CI trial run detected that this patch series triggers an
> nvme/018 failure. I manually applied this series on top of the v7.1-rc4
> kernel and observed that the failure reproduces reliably.
> 
> nvme/018 (tr=loop) (unit test NVMe-oF out of range access on a file backend) [failed]
>      runtime  1.208s  ...  1.189s
>      --- tests/nvme/018.out      2025-04-22 13:13:27.738873155 +0900
>      +++ /home/shin/Blktests/blktests/results/nodev_tr_loop/nvme/018.out.bad     2026-05-22 20:57:31.060000000 +0900
>      @@ -1,3 +1,4 @@
>       Running nvme/018
>      +ERROR: nvme read for out of range LBA was not rejected
>       disconnected 1 controller(s)
>       Test complete

Thanks Shin'ichiro for having reported this. I plan to include a fix
when I publish v2 of this patch series.

Bart.


* [GIT PULL] Block fixes for 7.1-rc5
From: Jens Axboe @ 2026-05-22 15:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-block@vger.kernel.org

Hi Linus,

A few fixes for block that should go into the 7.1 kernel release. This
pull request contains:

- NVMe pull request via Keith
	- Fix memory leak for peer-to-peer addresses
	- Fix dma map leaks on resource errors

- Another bio integrity fix, fixing a recent regression.

- Fix for an issue with the request pre-allocation and caching when IO
  is queued, where if a bio split occurred and ended up blocking, the
  list could be corrupted.

Please pull!


The following changes since commit 4141f46daa4cf1f8caa14129f8b6db86f17452f5:

  Merge tag 'nvme-7.1-2026-05-14' of git://git.infradead.org/nvme into block-7.1 (2026-05-14 19:14:33 -0600)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git tags/block-7.1-20260522

for you to fetch changes up to f6982769910ecddabdb5b8b9afdab0bb8b6668ac:

  block: avoid use-after-free in disk_free_zone_resources() (2026-05-22 08:01:52 -0600)

----------------------------------------------------------------
block-7.1-20260522

----------------------------------------------------------------
Caleb Sander Mateos (1):
      bio-integrity-fs: pass data iter to bio_integrity_verify()

Damien Le Moal (1):
      block: avoid use-after-free in disk_free_zone_resources()

Jens Axboe (1):
      Merge tag 'nvme-7.1-2026-05-21' of git://git.infradead.org/nvme into block-7.1

Keith Busch (3):
      nvme-pci: fix dma_vecs leak on p2p memory
      nvme-pci: fix dma mapping leak on data setup error
      blk-mq: pop cached request if it is usable

 block/bio-integrity-fs.c |  6 +++++-
 block/blk-mq.c           | 34 +++++++++-------------------------
 block/blk-zoned.c        |  7 +++----
 drivers/nvme/host/pci.c  | 34 ++++++++++++++++++++++++++++++----
 4 files changed, 47 insertions(+), 34 deletions(-)

-- 
Jens Axboe



* [PATCH] block, nvme: export and use passthrough stats
From: Keith Busch @ 2026-05-22 15:15 UTC (permalink / raw)
  To: linux-block, linux-nvme; +Cc: axboe, hch, nilay, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Move blk_rq_passthrough_stats() into include/linux/blk-mq.h so that
stacking drivers can also report passthrough workloads through iostat,
and use it in nvme multipath.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/blk-mq.c                | 30 ------------------------------
 drivers/nvme/host/multipath.c |  4 +++-
 include/linux/blk-mq.h        | 29 +++++++++++++++++++++++++++++
 3 files changed, 32 insertions(+), 31 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 28c2d931e75ea..c794b70fefe26 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1088,36 +1088,6 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 	}
 }
 
-static inline bool blk_rq_passthrough_stats(struct request *req)
-{
-	struct bio *bio = req->bio;
-
-	if (!blk_queue_passthrough_stat(req->q))
-		return false;
-
-	/* Requests without a bio do not transfer data. */
-	if (!bio)
-		return false;
-
-	/*
-	 * Stats are accumulated in the bdev, so must have one attached to a
-	 * bio to track stats. Most drivers do not set the bdev for passthrough
-	 * requests, but nvme is one that will set it.
-	 */
-	if (!bio->bi_bdev)
-		return false;
-
-	/*
-	 * We don't know what a passthrough command does, but we know the
-	 * payload size and data direction. Ensuring the size is aligned to the
-	 * block size filters out most commands with payloads that don't
-	 * represent sector access.
-	 */
-	if (blk_rq_bytes(req) & (bdev_logical_block_size(bio->bi_bdev) - 1))
-		return false;
-	return true;
-}
-
 static inline void blk_account_io_start(struct request *req)
 {
 	trace_block_io_start(req);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 263161cb8ac06..435fab0be6401 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -175,9 +175,11 @@ void nvme_mpath_start_request(struct request *rq)
 		nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
 	}
 
-	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq) ||
+	if (!blk_queue_io_stat(disk->queue) ||
 	    (nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
 		return;
+	if (blk_rq_is_passthrough(rq) && !blk_rq_passthrough_stats(rq))
+		return;
 
 	nvme_req(rq)->flags |= NVME_MPATH_IO_STATS;
 	nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0, req_op(rq),
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581d..8301830ece8b7 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1243,4 +1243,33 @@ static inline int blk_rq_map_sg(struct request *rq, struct scatterlist *sglist)
 }
 void blk_dump_rq_flags(struct request *, char *);
 
+static inline bool blk_rq_passthrough_stats(struct request *req)
+{
+	struct bio *bio = req->bio;
+
+	if (!blk_queue_passthrough_stat(req->q))
+		return false;
+
+	/* Requests without a bio do not transfer data. */
+	if (!bio)
+		return false;
+
+	/*
+	 * Stats are accumulated in the bdev, so must have one attached to a
+	 * bio to track stats. Most drivers do not set the bdev for passthrough
+	 * requests, but nvme is one that will set it.
+	 */
+	if (!bio->bi_bdev)
+		return false;
+
+	/*
+	 * We don't know what a passthrough command does, but we know the
+	 * payload size and data direction. Ensuring the size is aligned to the
+	 * block size filters out most commands with payloads that don't
+	 * represent sector access.
+	 */
+	if (blk_rq_bytes(req) & (bdev_logical_block_size(bio->bi_bdev) - 1))
+		return false;
+	return true;
+}
 #endif /* BLK_MQ_H */
-- 
2.53.0-Meta


^ permalink raw reply related
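
The size filter at the end of blk_rq_passthrough_stats() above relies on the logical block size being a power of two, so that masking with (size - 1) yields the remainder. A minimal user-space sketch of that check follows; payload_is_block_aligned() is a hypothetical name used only for illustration and is not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for the size check in
 * blk_rq_passthrough_stats(). block_size must be a power of two,
 * otherwise the mask below does not compute a remainder.
 */
static bool payload_is_block_aligned(uint32_t bytes, uint32_t block_size)
{
	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

	/* Any low bit set means the payload is not a whole number of blocks. */
	return (bytes & (block_size - 1)) == 0;
}
```

With a 512-byte block size, a 4096-byte payload passes the filter and a 4100-byte payload does not.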

* Re: [PATCH v5] block: propagate in_flight to whole disk on partition I/O
From: Jens Axboe @ 2026-05-22 14:27 UTC (permalink / raw)
  To: Keith Busch, Tang Yizhou
  Cc: hch, yukuai, linux-block, linux-kernel, Leon Hwang
In-Reply-To: <ahBnKR-IunwxVDzg@kbusch-mbp>

On 5/22/26 8:24 AM, Keith Busch wrote:
> On Fri, May 22, 2026 at 10:16:38PM +0800, Tang Yizhou wrote:
>> @@ -1073,7 +1073,7 @@ void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
>>  	part_stat_inc(bdev, ios[sgrp]);
>>  	part_stat_add(bdev, sectors[sgrp], sectors);
>>  	part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
>> -	part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
>> +	bdev_inc_in_flight(bdev, op);
> 
> This one should be bdev_dec_in_flight().

Yes, and let's chill the repostings. Send 1 per day, we're up to v5 in a
very short amount of time. Take your time and get it right, and give
people a chance to comment and review before randomly throwing version
N+1 over the wall.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH v5] block: propagate in_flight to whole disk on partition I/O
From: Keith Busch @ 2026-05-22 14:24 UTC (permalink / raw)
  To: Tang Yizhou; +Cc: axboe, hch, yukuai, linux-block, linux-kernel, Leon Hwang
In-Reply-To: <20260522141638.298530-1-yizhou.tang@shopee.com>

On Fri, May 22, 2026 at 10:16:38PM +0800, Tang Yizhou wrote:
> @@ -1073,7 +1073,7 @@ void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
>  	part_stat_inc(bdev, ios[sgrp]);
>  	part_stat_add(bdev, sectors[sgrp], sectors);
>  	part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
> -	part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
> +	bdev_inc_in_flight(bdev, op);

This one should be bdev_dec_in_flight().

^ permalink raw reply
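
To make the effect of the inc/dec mix-up concrete, here is a small user-space sketch. It is a toy model only: toy_counter and the toy_* functions are hypothetical names, and a plain long stands in for the per-CPU in_flight[] counter.

```c
#include <assert.h>

/* Toy in-flight counter; a stand-in for the per-CPU in_flight[] stat. */
struct toy_counter {
	long in_flight;
};

static void toy_start(struct toy_counter *c)
{
	c->in_flight++;
}

/* Balanced completion path: undo the increment made at start. */
static void toy_end(struct toy_counter *c)
{
	assert(c->in_flight > 0);
	c->in_flight--;
}

/* Completion path as posted: increments a second time. */
static void toy_end_buggy(struct toy_counter *c)
{
	c->in_flight++;
}

/* Run nr_ios start/end pairs and return the leftover count. */
static long toy_run(int nr_ios, int buggy)
{
	struct toy_counter c = { 0 };
	int i;

	for (i = 0; i < nr_ios; i++) {
		toy_start(&c);
		if (buggy)
			toy_end_buggy(&c);
		else
			toy_end(&c);
	}
	return c.in_flight;
}
```

In this model the balanced path returns to zero once all I/O has completed, while the posted variant leaves the count at twice the number of I/Os, so the device would look permanently busy.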

* [PATCH v5] block: propagate in_flight to whole disk on partition I/O
From: Tang Yizhou @ 2026-05-22 14:16 UTC (permalink / raw)
  To: axboe, hch; +Cc: yukuai, linux-block, linux-kernel, Tang Yizhou, Leon Hwang

From: Tang Yizhou <yizhou.tang@shopee.com>

Now when I/O is submitted to a partition, the per-CPU in_flight[]
counter is incremented only on the partition's block_device, not on the
underlying whole disk. This leads to a problem which can be shown by a
fio test:

lsblk
  NAME     MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
  mydev    252:1    0   20G  0 disk
  └─mydev1 259:0    0   10G  0 part

iostat -xp 1
  Device       r/s        rkB/s      ... aqu-sz   %util
  mydev    128153.00  512612.00      ...  13.22   72.20
  mydev1   128154.00  512616.00      ...  13.22  100.00

%util is different between mydev and mydev1, which is unexpected.

This is the cumulative effect of a series of patches. The root cause is
commit e016b78201a2 ("block: return just one value from part_in_flight"),
which deleted the branch in part_in_flight() that aggregated the whole-disk
in_flight count on top of the partition's. Then the second commit is
commit 10ec5e86f9b8 ("block: merge part_{inc,dev}_in_flight into their
only callers"), which folded the whole-disk in_flight accounting into
generic_start_io_acct() and generic_end_io_acct(). Those two helpers
were then removed by commit e722fff238bb ("block: remove
generic_{start,end}_io_acct"), and from that point on the whole disk's
in_flight is no longer accounted at all.

In update_io_ticks(), if bdev_count_inflight() reports that the whole
device has no I/O in flight, the accumulation of io_ticks is skipped,
so the reported %util value is underestimated.

Fix it by restoring the whole-disk in_flight accounting.

Fixes: e016b78201a2 ("block: return just one value from part_in_flight")
Suggested-by: Leon Hwang <leon.huangfu@shopee.com>
Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
v2: Update commit message.
v3: Take Christoph's advice and factor the common code into two helpers.
v4: Remove my redundant new line in blk.h. Add Christoph's Reviewed-by
tag.
v5: Remove the changelog from the commit message.
 block/blk-core.c |  4 ++--
 block/blk-mq.c   |  5 ++---
 block/blk.h      | 21 +++++++++++++++++++++
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d..81b322b8a385 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1042,7 +1042,7 @@ unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
 {
 	part_stat_lock();
 	update_io_ticks(bdev, start_time, false);
-	part_stat_local_inc(bdev, in_flight[op_is_write(op)]);
+	bdev_inc_in_flight(bdev, op);
 	part_stat_unlock();
 
 	return start_time;
@@ -1073,7 +1073,7 @@ void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
 	part_stat_inc(bdev, ios[sgrp]);
 	part_stat_add(bdev, sectors[sgrp], sectors);
 	part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
-	part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
+	bdev_inc_in_flight(bdev, op);
 	part_stat_unlock();
 }
 EXPORT_SYMBOL(bdev_end_io_acct);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d0c37daf568f..6bdfe642bd93 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1082,8 +1082,7 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 		update_io_ticks(req->part, jiffies, true);
 		part_stat_inc(req->part, ios[sgrp]);
 		part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
-		part_stat_local_dec(req->part,
-				    in_flight[op_is_write(req_op(req))]);
+		bdev_dec_in_flight(req->part, req_op(req));
 		part_stat_unlock();
 	}
 }
@@ -1143,7 +1142,7 @@ static inline void blk_account_io_start(struct request *req)
 
 	part_stat_lock();
 	update_io_ticks(req->part, jiffies, false);
-	part_stat_local_inc(req->part, in_flight[op_is_write(req_op(req))]);
+	bdev_inc_in_flight(req->part, req_op(req));
 	part_stat_unlock();
 }
 
diff --git a/block/blk.h b/block/blk.h
index b998a7761faf..11245a494c43 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -4,6 +4,7 @@
 
 #include <linux/bio-integrity.h>
 #include <linux/blk-crypto.h>
+#include <linux/part_stat.h>
 #include <linux/lockdep.h>
 #include <linux/memblock.h>	/* for max_pfn/max_low_pfn */
 #include <linux/sched/sysctl.h>
@@ -485,6 +486,26 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
 		q->last_merge = NULL;
 }
 
+static inline void bdev_inc_in_flight(struct block_device *bdev,
+				      enum req_op op)
+{
+	bool rw = op_is_write(op);
+
+	part_stat_local_inc(bdev, in_flight[rw]);
+	if (bdev_is_partition(bdev))
+		part_stat_local_inc(bdev_whole(bdev), in_flight[rw]);
+}
+
+static inline void bdev_dec_in_flight(struct block_device *bdev,
+				      enum req_op op)
+{
+	bool rw = op_is_write(op);
+
+	part_stat_local_dec(bdev, in_flight[rw]);
+	if (bdev_is_partition(bdev))
+		part_stat_local_dec(bdev_whole(bdev), in_flight[rw]);
+}
+
 /*
  * Internal io_context interface
  */
-- 
2.43.0


^ permalink raw reply related
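
The two helpers added to blk.h can be modelled in user space as follows. This is a sketch under simplifying assumptions: toy_bdev is a hypothetical struct, plain longs replace the per-CPU counters, and a NULL whole pointer marks a whole disk, which is not how the kernel's bdev_whole() represents it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy block device: whole is NULL for a whole disk (model assumption). */
struct toy_bdev {
	struct toy_bdev *whole;
	long in_flight[2];	/* [0] reads, [1] writes */
};

static bool toy_is_partition(const struct toy_bdev *bdev)
{
	return bdev->whole != NULL;
}

/* Count the I/O on the device itself and, for a partition, on its disk. */
static void toy_inc_in_flight(struct toy_bdev *bdev, bool is_write)
{
	bdev->in_flight[is_write]++;
	if (toy_is_partition(bdev))
		bdev->whole->in_flight[is_write]++;
}

/* Mirror image of toy_inc_in_flight(); must be called once per increment. */
static void toy_dec_in_flight(struct toy_bdev *bdev, bool is_write)
{
	assert(bdev->in_flight[is_write] > 0);
	bdev->in_flight[is_write]--;
	if (toy_is_partition(bdev))
		bdev->whole->in_flight[is_write]--;
}
```

A write submitted to the partition raises both counters, while I/O submitted directly to the whole disk leaves the partition's counters untouched.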

* Re: [PATCH 7.2 0/2] ublk: enable UBLK_F_SHMEM_ZC on zone appends
From: Jens Axboe @ 2026-05-22 14:06 UTC (permalink / raw)
  To: Ming Lei, Caleb Sander Mateos; +Cc: linux-block, linux-kernel
In-Reply-To: <20260520203654.1413640-1-csander@purestorage.com>


On Wed, 20 May 2026 14:36:52 -0600, Caleb Sander Mateos wrote:
> Commit 4d4a512a1f87 ("ublk: add PFN-based buffer matching in I/O path")
> added support to ublk_setup_iod() for matching request buffers against
> registered UBLK_F_SHMEM_ZC buffers, but missed adding it to
> ublk_setup_iod_zoned() for zoned requests. ublk_setup_iod_zoned()
> duplicates the code for initializing struct ublksrv_io_desc, making it
> easy to forget to keep them in sync. Move the common code to a helper
> function ublk_init_iod(). This allows zone appends to leverage the
> shared memory zero copy optimization.
> 
> [...]

Applied, thanks!

[1/2] ublk: move ublk_req_build_flags() earlier
      commit: eee9224affae6c1bfd664e5b769e40e3ff099879
[2/2] ublk: factor out ublk_init_iod() helper
      commit: 23130b3ffcdb1568a9ef178ab3cba866e5486082

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH 7.2 0/2] ublk: optimize ublk_rq_has_data()
From: Jens Axboe @ 2026-05-22 14:06 UTC (permalink / raw)
  To: Ming Lei, Caleb Sander Mateos; +Cc: linux-block, linux-kernel
In-Reply-To: <20260513211846.1956810-1-csander@purestorage.com>


On Wed, 13 May 2026 15:18:44 -0600, Caleb Sander Mateos wrote:
> ublk_rq_has_data() currently uses bio_has_data(), which involves 2
> indirections and several branches. Introduce a blk_rq_has_data()
> analogue for struct request and use it instead to save an indirection
> and NULL check.
> 
> Caleb Sander Mateos (2):
>   blk-mq: introduce blk_rq_has_data()
>   ublk: optimize ublk_rq_has_data()
> 
> [...]

Applied, thanks!

[1/2] blk-mq: introduce blk_rq_has_data()
      commit: 999722b34441b4ab65b7ca7fb16dd4b62fc3c354
[2/2] ublk: optimize ublk_rq_has_data()
      commit: 5995e751d2612cd8254cdf9c1155a96bbbb2d509

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH] block: avoid use-after-free in disk_free_zone_resources()
From: Jens Axboe @ 2026-05-22 14:06 UTC (permalink / raw)
  To: linux-block, Damien Le Moal; +Cc: Christoph Hellwig
In-Reply-To: <20260522115622.588535-1-dlemoal@kernel.org>


On Fri, 22 May 2026 20:56:22 +0900, Damien Le Moal wrote:
> The function disk_update_zone_resources() may call
> disk_free_zone_resources() in case of error, and following this,
> blk_revalidate_disk_zones() will again call disk_free_zone_resources() if
> disk_update_zone_resources() failed. If a zone worker thread is being used
> (which is the default for a rotational media zoned device),
> disk_free_zone_resources() will try to stop the zone worker thread twice
> because disk->zone_wplugs_worker is not reset to NULL when the worker
> thread is stopped the first time.
> 
> [...]

Applied, thanks!

[1/1] block: avoid use-after-free in disk_free_zone_resources()
      commit: f6982769910ecddabdb5b8b9afdab0bb8b6668ac

Best regards,
-- 
Jens Axboe




^ permalink raw reply
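
The double-stop described in the quoted commit message is an instance of a teardown path that is not idempotent. The user-space sketch below shows the general remedy of clearing the pointer after the first teardown. It assumes the fix is along those lines; the applied commit may differ, and toy_zdisk, toy_worker and the toy_* functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_worker {
	int running;
};

struct toy_zdisk {
	struct toy_worker *worker;	/* NULL when no worker exists */
	int stop_calls;			/* how often a worker was really stopped */
};

static int toy_start_worker(struct toy_zdisk *d)
{
	assert(d->worker == NULL);
	d->worker = calloc(1, sizeof(*d->worker));
	if (!d->worker)
		return -1;
	d->worker->running = 1;
	return 0;
}

/* Safe to call more than once: the second call sees NULL and returns. */
static void toy_free_zone_resources(struct toy_zdisk *d)
{
	if (!d->worker)
		return;
	d->stop_calls++;
	free(d->worker);
	d->worker = NULL;	/* without this, a second call would free twice */
}
```

Calling toy_free_zone_resources() from both an error path and its caller then stops the worker exactly once.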

* Re: [PATCH v3] block: propagate in_flight to whole disk on partition I/O
From: Jens Axboe @ 2026-05-22 14:00 UTC (permalink / raw)
  To: Tang Yizhou, hch; +Cc: yukuai, linux-block, linux-kernel, Leon Hwang
In-Reply-To: <20260522131409.261259-1-yizhou.tang@shopee.com>

On 5/22/26 7:14 AM, Tang Yizhou wrote:
> Fixes: e016b78201a2 ("block: return just one value from part_in_flight")
> 
> v2: Update commit message.
> v3: Take Christoph's advice and factor the common code into two helpers.
> 
> Suggested-by: Leon Hwang <leon.huangfu@shopee.com>
> Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
> ---

Changelog goes _below_ this line, we don't want it in the git commit
message. And Fixes goes with the other tags.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v3] block: propagate in_flight to whole disk on partition I/O
From: Leon Hwang @ 2026-05-22 13:53 UTC (permalink / raw)
  To: Tang Yizhou, axboe, hch; +Cc: yukuai, linux-block, linux-kernel, Leon Hwang
In-Reply-To: <20260522131409.261259-1-yizhou.tang@shopee.com>

On 2026/5/22 21:14, Tang Yizhou wrote:
> From: Tang Yizhou <yizhou.tang@shopee.com>
> 
[...]
> diff --git a/block/blk.h b/block/blk.h
> index b998a7761faf..05099aab6863 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -4,6 +4,7 @@
>  
>  #include <linux/bio-integrity.h>
>  #include <linux/blk-crypto.h>
> +#include <linux/part_stat.h>
>  #include <linux/lockdep.h>
>  #include <linux/memblock.h>	/* for max_pfn/max_low_pfn */
>  #include <linux/sched/sysctl.h>
> @@ -11,6 +12,7 @@
>  #include <xen/xen.h>
>  #include "blk-crypto-internal.h"
>  
> +

NIT: I think this new line is added unintentionally. Should be dropped.

Thanks,
Leon


>  struct elv_change_ctx;
>  
>  /*
> @@ -485,6 +487,26 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
>  		q->last_merge = NULL;
>  }
>  
> +static inline void bdev_inc_in_flight(struct block_device *bdev,
> +				      enum req_op op)
> +{
> +	bool rw = op_is_write(op);
> +
> +	part_stat_local_inc(bdev, in_flight[rw]);
> +	if (bdev_is_partition(bdev))
> +		part_stat_local_inc(bdev_whole(bdev), in_flight[rw]);
> +}
> +
> +static inline void bdev_dec_in_flight(struct block_device *bdev,
> +				      enum req_op op)
> +{
> +	bool rw = op_is_write(op);
> +
> +	part_stat_local_dec(bdev, in_flight[rw]);
> +	if (bdev_is_partition(bdev))
> +		part_stat_local_dec(bdev_whole(bdev), in_flight[rw]);
> +}
> +
>  /*
>   * Internal io_context interface
>   */


^ permalink raw reply

* [PATCH v4] block: propagate in_flight to whole disk on partition I/O
From: Tang Yizhou @ 2026-05-22 13:30 UTC (permalink / raw)
  To: axboe, hch; +Cc: yukuai, linux-block, linux-kernel, Tang Yizhou, Leon Hwang

From: Tang Yizhou <yizhou.tang@shopee.com>

Now when I/O is submitted to a partition, the per-CPU in_flight[]
counter is incremented only on the partition's block_device, not on the
underlying whole disk. This leads to a problem which can be shown by a
fio test:

lsblk
  NAME     MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
  mydev    252:1    0   20G  0 disk
  └─mydev1 259:0    0   10G  0 part

iostat -xp 1
  Device       r/s        rkB/s      ... aqu-sz   %util
  mydev    128153.00  512612.00      ...  13.22   72.20
  mydev1   128154.00  512616.00      ...  13.22  100.00

%util is different between mydev and mydev1, which is unexpected.

This is the cumulative effect of a series of patches. The root cause is
commit e016b78201a2 ("block: return just one value from part_in_flight"),
which deleted the branch in part_in_flight() that aggregated the whole-disk
in_flight count on top of the partition's. Then the second commit is
commit 10ec5e86f9b8 ("block: merge part_{inc,dev}_in_flight into their
only callers"), which folded the whole-disk in_flight accounting into
generic_start_io_acct() and generic_end_io_acct(). Those two helpers
were then removed by commit e722fff238bb ("block: remove
generic_{start,end}_io_acct"), and from that point on the whole disk's
in_flight is no longer accounted at all.

In update_io_ticks(), if bdev_count_inflight() reports that the whole
device has no I/O in flight, the accumulation of io_ticks is skipped,
so the reported %util value is underestimated.

Fix it by restoring the whole-disk in_flight accounting.

v2: Update commit message.
v3: Take Christoph's advice and factor the common code into two helpers.
v4: Remove my redundant new line in blk.h. Add Christoph's Reviewed-by
tag.

Fixes: e016b78201a2 ("block: return just one value from part_in_flight")

Suggested-by: Leon Hwang <leon.huangfu@shopee.com>
Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c |  4 ++--
 block/blk-mq.c   |  5 ++---
 block/blk.h      | 21 +++++++++++++++++++++
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d..81b322b8a385 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1042,7 +1042,7 @@ unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
 {
 	part_stat_lock();
 	update_io_ticks(bdev, start_time, false);
-	part_stat_local_inc(bdev, in_flight[op_is_write(op)]);
+	bdev_inc_in_flight(bdev, op);
 	part_stat_unlock();
 
 	return start_time;
@@ -1073,7 +1073,7 @@ void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
 	part_stat_inc(bdev, ios[sgrp]);
 	part_stat_add(bdev, sectors[sgrp], sectors);
 	part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
-	part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
+	bdev_inc_in_flight(bdev, op);
 	part_stat_unlock();
 }
 EXPORT_SYMBOL(bdev_end_io_acct);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d0c37daf568f..6bdfe642bd93 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1082,8 +1082,7 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 		update_io_ticks(req->part, jiffies, true);
 		part_stat_inc(req->part, ios[sgrp]);
 		part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
-		part_stat_local_dec(req->part,
-				    in_flight[op_is_write(req_op(req))]);
+		bdev_dec_in_flight(req->part, req_op(req));
 		part_stat_unlock();
 	}
 }
@@ -1143,7 +1142,7 @@ static inline void blk_account_io_start(struct request *req)
 
 	part_stat_lock();
 	update_io_ticks(req->part, jiffies, false);
-	part_stat_local_inc(req->part, in_flight[op_is_write(req_op(req))]);
+	bdev_inc_in_flight(req->part, req_op(req));
 	part_stat_unlock();
 }
 
diff --git a/block/blk.h b/block/blk.h
index b998a7761faf..11245a494c43 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -4,6 +4,7 @@
 
 #include <linux/bio-integrity.h>
 #include <linux/blk-crypto.h>
+#include <linux/part_stat.h>
 #include <linux/lockdep.h>
 #include <linux/memblock.h>	/* for max_pfn/max_low_pfn */
 #include <linux/sched/sysctl.h>
@@ -485,6 +486,26 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
 		q->last_merge = NULL;
 }
 
+static inline void bdev_inc_in_flight(struct block_device *bdev,
+				      enum req_op op)
+{
+	bool rw = op_is_write(op);
+
+	part_stat_local_inc(bdev, in_flight[rw]);
+	if (bdev_is_partition(bdev))
+		part_stat_local_inc(bdev_whole(bdev), in_flight[rw]);
+}
+
+static inline void bdev_dec_in_flight(struct block_device *bdev,
+				      enum req_op op)
+{
+	bool rw = op_is_write(op);
+
+	part_stat_local_dec(bdev, in_flight[rw]);
+	if (bdev_is_partition(bdev))
+		part_stat_local_dec(bdev_whole(bdev), in_flight[rw]);
+}
+
 /*
  * Internal io_context interface
  */
-- 
2.43.0


^ permalink raw reply related
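
The %util underestimate described in the commit message can be reproduced with a simplified user-space model of the accounting. This is a sketch, not the kernel code: toy_disk and the toy_* functions are hypothetical, time is an abstract tick count, and the model keeps only the rule that an interval is credited to io_ticks when the event is a completion or when I/O is already in flight.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_disk {
	unsigned long stamp;	/* time of the last accounting event */
	unsigned long io_ticks;	/* accumulated busy time */
	long in_flight;		/* I/Os currently outstanding */
	bool track_in_flight;	/* false models the missing whole-disk count */
};

/* Credit the interval since the last event only if the disk was busy. */
static void toy_update_io_ticks(struct toy_disk *d, unsigned long now, bool end)
{
	if (now > d->stamp) {
		unsigned long delta = now - d->stamp;

		d->stamp = now;
		if (end || d->in_flight > 0)
			d->io_ticks += delta;
	}
}

static void toy_io_start(struct toy_disk *d, unsigned long now)
{
	toy_update_io_ticks(d, now, false);
	if (d->track_in_flight)
		d->in_flight++;
}

static void toy_io_end(struct toy_disk *d, unsigned long now)
{
	toy_update_io_ticks(d, now, true);
	if (d->track_in_flight) {
		assert(d->in_flight > 0);
		d->in_flight--;
	}
}

/* Two overlapping I/Os: one over [0, 8], one over [5, 10]. */
static unsigned long toy_busy_ticks(bool track_in_flight)
{
	struct toy_disk d = { 0, 0, 0, track_in_flight };

	toy_io_start(&d, 0);
	toy_io_start(&d, 5);
	toy_io_end(&d, 8);
	toy_io_end(&d, 10);
	return d.io_ticks;
}
```

With in_flight tracked, the model credits all 10 ticks. With in_flight stuck at zero, the interval ending at the second submission is dropped and only 5 ticks are credited, which matches the direction of the mydev versus mydev1 discrepancy in the iostat output.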

* Re: [PATCH v3] block: propagate in_flight to whole disk on partition I/O
From: Christoph Hellwig @ 2026-05-22 13:27 UTC (permalink / raw)
  To: Tang Yizhou; +Cc: axboe, hch, yukuai, linux-block, linux-kernel, Leon Hwang
In-Reply-To: <20260522131409.261259-1-yizhou.tang@shopee.com>

Nice!

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply

* Re: [PATCH v2] block: propagate in_flight to whole disk on partition I/O
From: Tang Yizhou @ 2026-05-22 13:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, yukuai, linux-block, linux-kernel, Leon Hwang
In-Reply-To: <20260522130157.GA25237@lst.de>

Sorry I sent this before reading your email. Please refer to patch v3.

Best regards,
Yi

^ permalink raw reply

* [PATCH v3] block: propagate in_flight to whole disk on partition I/O
From: Tang Yizhou @ 2026-05-22 13:14 UTC (permalink / raw)
  To: axboe, hch; +Cc: yukuai, linux-block, linux-kernel, Tang Yizhou, Leon Hwang

From: Tang Yizhou <yizhou.tang@shopee.com>

Now when I/O is submitted to a partition, the per-CPU in_flight[]
counter is incremented only on the partition's block_device, not on the
underlying whole disk. This leads to a problem which can be shown by a
fio test:

lsblk
  NAME     MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
  mydev    252:1    0   20G  0 disk
  └─mydev1 259:0    0   10G  0 part

iostat -xp 1
  Device       r/s        rkB/s      ... aqu-sz   %util
  mydev    128153.00  512612.00      ...  13.22   72.20
  mydev1   128154.00  512616.00      ...  13.22  100.00

%util is different between mydev and mydev1, which is unexpected.

This is the cumulative effect of a series of patches. The root cause is
commit e016b78201a2 ("block: return just one value from part_in_flight"),
which deleted the branch in part_in_flight() that aggregated the whole-disk
in_flight count on top of the partition's. Then the second commit is
commit 10ec5e86f9b8 ("block: merge part_{inc,dev}_in_flight into their
only callers"), which folded the whole-disk in_flight accounting into
generic_start_io_acct() and generic_end_io_acct(). Those two helpers
were then removed by commit e722fff238bb ("block: remove
generic_{start,end}_io_acct"), and from that point on the whole disk's
in_flight is no longer accounted at all.

In update_io_ticks(), if bdev_count_inflight() reports that the whole
device has no I/O in flight, the accumulation of io_ticks is skipped,
so the reported %util value is underestimated.

Fix it by restoring the whole-disk in_flight accounting.

Fixes: e016b78201a2 ("block: return just one value from part_in_flight")

v2: Update commit message.
v3: Take Christoph's advice and factor the common code into two helpers.

Suggested-by: Leon Hwang <leon.huangfu@shopee.com>
Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
---
 block/blk-core.c |  4 ++--
 block/blk-mq.c   |  5 ++---
 block/blk.h      | 22 ++++++++++++++++++++++
 3 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d..81b322b8a385 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1042,7 +1042,7 @@ unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
 {
 	part_stat_lock();
 	update_io_ticks(bdev, start_time, false);
-	part_stat_local_inc(bdev, in_flight[op_is_write(op)]);
+	bdev_inc_in_flight(bdev, op);
 	part_stat_unlock();
 
 	return start_time;
@@ -1073,7 +1073,7 @@ void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
 	part_stat_inc(bdev, ios[sgrp]);
 	part_stat_add(bdev, sectors[sgrp], sectors);
 	part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
-	part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
+	bdev_inc_in_flight(bdev, op);
 	part_stat_unlock();
 }
 EXPORT_SYMBOL(bdev_end_io_acct);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d0c37daf568f..6bdfe642bd93 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1082,8 +1082,7 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 		update_io_ticks(req->part, jiffies, true);
 		part_stat_inc(req->part, ios[sgrp]);
 		part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
-		part_stat_local_dec(req->part,
-				    in_flight[op_is_write(req_op(req))]);
+		bdev_dec_in_flight(req->part, req_op(req));
 		part_stat_unlock();
 	}
 }
@@ -1143,7 +1142,7 @@ static inline void blk_account_io_start(struct request *req)
 
 	part_stat_lock();
 	update_io_ticks(req->part, jiffies, false);
-	part_stat_local_inc(req->part, in_flight[op_is_write(req_op(req))]);
+	bdev_inc_in_flight(req->part, req_op(req));
 	part_stat_unlock();
 }
 
diff --git a/block/blk.h b/block/blk.h
index b998a7761faf..05099aab6863 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -4,6 +4,7 @@
 
 #include <linux/bio-integrity.h>
 #include <linux/blk-crypto.h>
+#include <linux/part_stat.h>
 #include <linux/lockdep.h>
 #include <linux/memblock.h>	/* for max_pfn/max_low_pfn */
 #include <linux/sched/sysctl.h>
@@ -11,6 +12,7 @@
 #include <xen/xen.h>
 #include "blk-crypto-internal.h"
 
+
 struct elv_change_ctx;
 
 /*
@@ -485,6 +487,26 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
 		q->last_merge = NULL;
 }
 
+static inline void bdev_inc_in_flight(struct block_device *bdev,
+				      enum req_op op)
+{
+	bool rw = op_is_write(op);
+
+	part_stat_local_inc(bdev, in_flight[rw]);
+	if (bdev_is_partition(bdev))
+		part_stat_local_inc(bdev_whole(bdev), in_flight[rw]);
+}
+
+static inline void bdev_dec_in_flight(struct block_device *bdev,
+				      enum req_op op)
+{
+	bool rw = op_is_write(op);
+
+	part_stat_local_dec(bdev, in_flight[rw]);
+	if (bdev_is_partition(bdev))
+		part_stat_local_dec(bdev_whole(bdev), in_flight[rw]);
+}
+
 /*
  * Internal io_context interface
  */
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v2] block: propagate in_flight to whole disk on partition I/O
From: Christoph Hellwig @ 2026-05-22 13:01 UTC (permalink / raw)
  To: Tang Yizhou; +Cc: axboe, hch, yukuai, linux-block, linux-kernel, Leon Hwang
In-Reply-To: <20260522123437.214058-1-yizhou.tang@shopee.com>

This looks unchanged from v1, did you just resend that?


^ permalink raw reply

* Re: [PATCH] block: propagate in_flight to whole disk on partition I/O
From: Tang Yizhou @ 2026-05-22 12:36 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, yukuai, linux-block, linux-kernel, Leon Hwang
In-Reply-To: <20260522121219.GB21338@lst.de>

On Fri, May 22, 2026 at 8:12 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, May 22, 2026 at 07:37:51PM +0800, Tang Yizhou wrote:
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1043,6 +1043,8 @@ unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
> >       part_stat_lock();
> >       update_io_ticks(bdev, start_time, false);
> >       part_stat_local_inc(bdev, in_flight[op_is_write(op)]);
> > +     if (bdev_is_partition(bdev))
> > +             part_stat_local_inc(bdev_whole(bdev), in_flight[op_is_write(op)]);
>
> overly long line.

OK. I will update in the next patch.

>
> > +     if (bdev_is_partition(bdev))
> > +             part_stat_local_dec(bdev_whole(bdev), in_flight[op_is_write(op)]);
>
> Same.
>
> >  }
> > @@ -1144,6 +1147,9 @@ static inline void blk_account_io_start(struct request *req)
> >       part_stat_lock();
> >       update_io_ticks(req->part, jiffies, false);
> >       part_stat_local_inc(req->part, in_flight[op_is_write(req_op(req))]);
> > +     if (bdev_is_partition(req->part))
> > +             part_stat_local_inc(bdev_whole(req->part),
> > +                                 in_flight[op_is_write(req_op(req))]);
>
> and this duplicates the above logic.  Maybe factor the common code
> into two little helpers?

Sure.

>

^ permalink raw reply

* [PATCH v2] block: propagate in_flight to whole disk on partition I/O
From: Tang Yizhou @ 2026-05-22 12:34 UTC (permalink / raw)
  To: axboe, hch; +Cc: yukuai, linux-block, linux-kernel, Tang Yizhou, Leon Hwang

From: Tang Yizhou <yizhou.tang@shopee.com>

Now when I/O is submitted to a partition, the per-CPU in_flight[]
counter is incremented only on the partition's block_device, not on the
underlying whole disk. This leads to a problem which can be shown by a
fio test:

lsblk
  NAME     MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
  mydev    252:1    0   20G  0 disk
  └─mydev1 259:0    0   10G  0 part

iostat -xp 1
  Device       r/s        rkB/s      ... aqu-sz   %util
  mydev    128153.00  512612.00      ...  13.22   72.20
  mydev1   128154.00  512616.00      ...  13.22  100.00

%util is different between mydev and mydev1, which is unexpected.

This is the cumulative effect of a series of patches. The root cause is
commit e016b78201a2 ("block: return just one value from part_in_flight"),
which deleted the branch in part_in_flight() that aggregated the whole-disk
in_flight count on top of the partition's. Then the second commit is
commit 10ec5e86f9b8 ("block: merge part_{inc,dev}_in_flight into their
only callers"), which folded the whole-disk in_flight accounting into
generic_start_io_acct() and generic_end_io_acct(). Those two helpers
were then removed by commit e722fff238bb ("block: remove
generic_{start,end}_io_acct"), and from that point on the whole disk's
in_flight is no longer accounted at all.

In update_io_ticks(), if bdev_count_inflight() reports that the whole
device has no I/O in flight, the accumulation of io_ticks is skipped,
so the reported %util value is underestimated.

Fix it by restoring the whole-disk in_flight accounting.

Fixes: e016b78201a2 ("block: return just one value from part_in_flight")

Suggested-by: Leon Hwang <leon.huangfu@shopee.com>
Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
---
v2: Update commit message.
 block/blk-core.c | 4 ++++
 block/blk-mq.c   | 6 ++++++
 2 files changed, 10 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d..03f4b7015e69 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1043,6 +1043,8 @@ unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
 	part_stat_lock();
 	update_io_ticks(bdev, start_time, false);
 	part_stat_local_inc(bdev, in_flight[op_is_write(op)]);
+	if (bdev_is_partition(bdev))
+		part_stat_local_inc(bdev_whole(bdev), in_flight[op_is_write(op)]);
 	part_stat_unlock();
 
 	return start_time;
@@ -1074,6 +1076,8 @@ void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
 	part_stat_add(bdev, sectors[sgrp], sectors);
 	part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
 	part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
+	if (bdev_is_partition(bdev))
+		part_stat_local_dec(bdev_whole(bdev), in_flight[op_is_write(op)]);
 	part_stat_unlock();
 }
 EXPORT_SYMBOL(bdev_end_io_acct);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d0c37daf568f..60ead16f1496 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1084,6 +1084,9 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 		part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
 		part_stat_local_dec(req->part,
 				    in_flight[op_is_write(req_op(req))]);
+		if (bdev_is_partition(req->part))
+			part_stat_local_dec(bdev_whole(req->part),
+					    in_flight[op_is_write(req_op(req))]);
 		part_stat_unlock();
 	}
 }
@@ -1144,6 +1147,9 @@ static inline void blk_account_io_start(struct request *req)
 	part_stat_lock();
 	update_io_ticks(req->part, jiffies, false);
 	part_stat_local_inc(req->part, in_flight[op_is_write(req_op(req))]);
+	if (bdev_is_partition(req->part))
+		part_stat_local_inc(bdev_whole(req->part),
+				    in_flight[op_is_write(req_op(req))]);
 	part_stat_unlock();
 }
 
-- 
2.43.0


^ permalink raw reply related

