Linux block layer


* Re: [PATCH] Revert "nbd: freeze the queue while we're adding connections"
From: yangerkun @ 2026-05-27  3:52 UTC (permalink / raw)
  To: josef, axboe; +Cc: linux-block, nbd, yangerkun
In-Reply-To: <20260526115253.746625-1-yangerkun@huawei.com>



On 2026/5/26 19:52, Yang Erkun wrote:
> This reverts commit b98e762e3d71e893b221f871825dc64694cfb258.
> 
> Commit b98e762e3d71 ("nbd: freeze the queue while we're adding
> connections") added blk_mq_freeze_queue/blk_mq_unfreeze_queue in
> nbd_add_socket() to protect krealloc(config->socks) from concurrent I/O
> that could cause a Use-After-Free.
> 
> However, analysis shows that in all current code paths, concurrent I/O
> cannot actually reach nbd_add_socket():
> 
> 1. nbd_genl_connect() path:
>     nbd_add_socket() is called first, and nbd_start_device() -- which
>     starts the queue and enables I/O -- is called only after all sockets
>     have been added. So the freeze/unfreeze runs against an idle queue,
>     marking then waiting on a percpu_ref that is already zero, and then
>     resurrecting it -- a pure no-op that burns an RCU grace period per
>     socket on multi-core systems.
> 
> 2. nbd_ioctl(NBD_SET_SOCK) path:
>     The task_setup check enforces that only the thread which performed
>     the first NBD_SET_SOCK can call NBD_SET_SOCK again. That thread is
>     blocked in NBD_DO_IT's wait_event_interruptible, so it cannot issue
>     another NBD_SET_SOCK concurrently with I/O. Other threads are
>     rejected by the task_setup != current check.

Apologies, but the analysis provided here is inadequate. A 
use-after-free (UAF) can still occur in the following scenario:

task A: ioctl NBD_SET_SOCK => task_setup = A
task B: ioctl NBD_DO_IT    => nbd_start_device_ioctl, nbd can receive IO
task A: ioctl NBD_SET_SOCK => task_setup == A, so a race can happen with
concurrent IO!

This patch is misleading; please disregard it. Sorry once again.
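
To make the scenario above concrete, here is a minimal userspace sketch, assuming a much simplified model of the ownership check. The names `nbd_model`, `model_set_sock_allowed`, `model_do_it` and `model_add_races_with_io` are hypothetical and are not the driver's code; the netlink and NBD_RT_BOUND conditions are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; not the nbd driver's actual code. */
struct task { int id; };

struct nbd_model {
	const struct task *task_setup; /* task that issued the first NBD_SET_SOCK */
	bool io_running;               /* set once NBD_DO_IT has started the device */
};

/* Models the ownership check: only the task recorded in task_setup may add sockets. */
static bool model_set_sock_allowed(struct nbd_model *nbd, const struct task *current_task)
{
	if (!nbd->task_setup)
		nbd->task_setup = current_task;
	return nbd->task_setup == current_task;
}

/* Models NBD_DO_IT issued by any task: afterwards the device can receive I/O. */
static void model_do_it(struct nbd_model *nbd)
{
	nbd->io_running = true;
}

/* True when a socket addition (and so the array reallocation) can overlap I/O. */
static bool model_add_races_with_io(struct nbd_model *nbd, const struct task *current_task)
{
	return model_set_sock_allowed(nbd, current_task) && nbd->io_running;
}
```

In this model, once a second task has called `model_do_it()`, the first task still passes the ownership check, so the check alone does not keep socket addition away from running I/O.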

> 
> 3. nbd_genl_reconfigure() does not call nbd_add_socket() at all; it
>     uses nbd_reconnect_socket() which replaces a dead socket in-place
>     without reallocating config->socks.
> 
> Therefore the freeze/unfreeze provides no actual protection in any
> reachable code path, while imposing the cost of blk_mq_freeze_queue
> (percpu_ref_kill + RCU grace period wait + percpu_ref_resurrect) on
> every socket addition during device setup[1].
> 
> Revert the change to eliminate the unnecessary overhead.
> 
> Link: https://lore.kernel.org/all/20260327091223.4147956-1-leo.lilong@huaweicloud.com/ [1]
> Signed-off-by: Yang Erkun <yangerkun@huawei.com>
> ---
>   drivers/block/nbd.c | 11 +----------
>   1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index fe63f3c55d0d..9033d996c9a9 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -1245,22 +1245,16 @@ static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg,
>   	struct socket *sock;
>   	struct nbd_sock **socks;
>   	struct nbd_sock *nsock;
> -	unsigned int memflags;
>   	int err;
>   
>   	/* Arg will be cast to int, check it to avoid overflow */
>   	if (arg > INT_MAX)
>   		return -EINVAL;
> +
>   	sock = nbd_get_socket(nbd, arg, &err);
>   	if (!sock)
>   		return err;
>   
> -	/*
> -	 * We need to make sure we don't get any errant requests while we're
> -	 * reallocating the ->socks array.
> -	 */
> -	memflags = blk_mq_freeze_queue(nbd->disk->queue);
> -
>   	if (!netlink && !nbd->task_setup &&
>   	    !test_bit(NBD_RT_BOUND, &config->runtime_flags))
>   		nbd->task_setup = current;
> @@ -1300,12 +1294,9 @@ static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg,
>   	INIT_WORK(&nsock->work, nbd_pending_cmd_work);
>   	socks[config->num_connections++] = nsock;
>   	atomic_inc(&config->live_connections);
> -	blk_mq_unfreeze_queue(nbd->disk->queue, memflags);
> -
>   	return 0;
>   
>   put_socket:
> -	blk_mq_unfreeze_queue(nbd->disk->queue, memflags);
>   	sockfd_put(sock);
>   	return err;
>   }


^ permalink raw reply

* Re: [PATCH] zram: fix use-after-free in zram_bvec_write_partial()
From: Sergey Senozhatsky @ 2026-05-27  3:45 UTC (permalink / raw)
  To: Cunlong Li
  Cc: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Andrew Morton,
	linux-kernel, linux-block, Christoph Hellwig, stable
In-Reply-To: <20260527-zram-v1-1-ce1acb2bfaf9@gmail.com>

On (26/05/27 11:26), Cunlong Li wrote:
> zram_read_page() picks the sync or async backing device read path
> based on whether the parent bio is NULL.  zram_bvec_write_partial()
> passes its parent bio down, so for ZRAM_WB slots the read is
> dispatched asynchronously and zram_read_page() returns 0 while the
> bio is still in flight.  The caller then runs memcpy_from_bvec(),
> zram_write_page() and __free_page() on the buffer, leaving the
> async read to write into a freed page.
> 
> zram_bvec_read_partial() was switched to NULL in commit 4e3c87b9421d
> ("zram: fix synchronous reads") for the same reason; the
> write_partial counterpart was missed.
> 
> Fixes: 4e3c87b9421d ("zram: fix synchronous reads")
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: stable@vger.kernel.org
> Signed-off-by: Cunlong Li <shenxiaogll@gmail.com>
> ---
>  drivers/block/zram/zram_drv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index aebc710f0d6a..b23a8bbb687c 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -2333,7 +2333,7 @@ static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
>  	if (!page)
>  		return -ENOMEM;
>  
> -	ret = zram_read_page(zram, page, index, bio);
> +	ret = zram_read_page(zram, page, index, NULL);

Sounds like zram_bvec_write_partial() doesn't need bio parameter then?

^ permalink raw reply

* [PATCH] zram: fix use-after-free in zram_bvec_write_partial()
From: Cunlong Li @ 2026-05-27  3:26 UTC (permalink / raw)
  To: Minchan Kim, Sergey Senozhatsky, Jens Axboe, Andrew Morton
  Cc: linux-kernel, linux-block, Christoph Hellwig, stable, Cunlong Li

zram_read_page() picks the sync or async backing device read path
based on whether the parent bio is NULL.  zram_bvec_write_partial()
passes its parent bio down, so for ZRAM_WB slots the read is
dispatched asynchronously and zram_read_page() returns 0 while the
bio is still in flight.  The caller then runs memcpy_from_bvec(),
zram_write_page() and __free_page() on the buffer, leaving the
async read to write into a freed page.

zram_bvec_read_partial() was switched to NULL in commit 4e3c87b9421d
("zram: fix synchronous reads") for the same reason; the
write_partial counterpart was missed.

Fixes: 4e3c87b9421d ("zram: fix synchronous reads")
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Cunlong Li <shenxiaogll@gmail.com>
---
 drivers/block/zram/zram_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index aebc710f0d6a..b23a8bbb687c 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2333,7 +2333,7 @@ static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
 	if (!page)
 		return -ENOMEM;

-	ret = zram_read_page(zram, page, index, bio);
+	ret = zram_read_page(zram, page, index, NULL);
 	if (!ret) {
 		memcpy_from_bvec(page_address(page) + offset, bvec);
 		ret = zram_write_page(zram, page, index);

---
base-commit: e8c2f9fdadee7cbc75134dc463c1e0d856d6e5c7
change-id: 20260526-zram-b01425b7e6c6

Best regards,
-- 
Cunlong Li <shenxiaogll@gmail.com>
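
The hazard described in the commit message can be sketched in userspace, assuming a toy model in which a non-NULL parent selects the asynchronous path. `read_model`, `model_read_page` and `model_safe_to_free` are hypothetical names used only for illustration; they are not zram functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page read that is either synchronous or asynchronous. */
struct read_model {
	bool in_flight; /* true while a modeled async read still targets the buffer */
};

/*
 * Models the behavior described in the patch: a non-NULL parent selects the
 * async path, which returns 0 before the data has arrived.
 */
static int model_read_page(struct read_model *r, const void *parent)
{
	r->in_flight = (parent != NULL);
	return 0;
}

/* The caller may free the buffer safely only if no read is still in flight. */
static bool model_safe_to_free(const struct read_model *r)
{
	return !r->in_flight;
}
```

Both calls return 0, so the return value alone cannot tell the caller whether the buffer is still in use; passing NULL is what makes the buffer safe to free on return.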

^ permalink raw reply related

* Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Ming Lei @ 2026-05-27  3:00 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jens Axboe, Bart Van Assche, Christoph Hellwig, Damien Le Moal,
	linux-block, LKML, Andrew Morton, Linus Torvalds, linux-btrfs,
	David Sterba, linux-fsdevel, Christian Brauner
In-Reply-To: <d1b5a737-f0e3-4927-b762-430b37fbb2f9@I-love.SAKURA.ne.jp>

On Wed, May 27, 2026 at 10:35:56AM +0900, Tetsuo Handa wrote:
> On 2026/05/27 10:20, Ming Lei wrote:
> >> Of course we should try to figure out the root cause first, but how can we do?
> > 
> > Unexpected write IO from btrfs (after umount and after the loop device is closed) is
> > definitely more serious, since it may cause data loss, so I'm CCing the btrfs list and maintainer.
> 
> Why do you assume that the culprit is btrfs?
> 
> https://syzkaller.appspot.com/bug?extid=bc273027d5643e48e5b3 indicated that
> this similar race is also happening with jfs.

I hadn't seen the above report on jfs.

It doesn't change anything; the same question still stands: why is unexpected write IO
issued after, or still in flight across, umount and the last close of the loop disk?



Thanks,
Ming

^ permalink raw reply

* Should the "loop" driver reject using pseudo files as backing file?
From: Tetsuo Handa @ 2026-05-27  1:52 UTC (permalink / raw)
  To: Jens Axboe, linux-block

I noticed that /dev/loopX accepts pseudo files, because
loop_validate_file() currently does

	if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
		return -EINVAL;

and pseudo files are treated as S_ISREG().

Reading pseudo files via /dev/loopX produces bogus results (the read
repeats the entire content up to the size reported by the "ls"
command). I think that allowing such usage will confuse userspace
programs.

[root@localhost ~]# ls -l /sys/power/pm_test
-rw-r--r-- 1 root root 4096 May 26 22:14 /sys/power/pm_test
[root@localhost ~]# cat /sys/power/pm_test | wc
      1       6      48
[root@localhost ~]# cat $(losetup -f --show /sys/power/pm_test) | wc
     85     513    4096

Writing to pseudo files via /dev/loopX seems to work (at least works for
/sys/power/pm_test ), but can/should we forbid binding to pseudo files?

[root@localhost ~]# echo none > /sys/power/pm_test
[root@localhost ~]# echo none > $(losetup -f --show /sys/power/pm_test)

F.Y.I. An analysis by Google AI mode (expires in 7 days) is at https://share.google/aimode/prhETTZMQEzsw5HV9 .

^ permalink raw reply

* Re: [PATCH] block: rename need_dispatch to cautious_dispatch in blk-mq sched
From: Guixin Liu @ 2026-05-27  1:47 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: linux-block, xlpang, oliver.yang
In-Reply-To: <458c5c9c-8740-4079-8800-f30e074fafef@kernel.dk>



On 2026/5/26 23:54, Jens Axboe wrote:
> On 5/26/26 7:11 AM, Guixin Liu wrote:
>> The local boolean in __blk_mq_sched_dispatch_requests() decides whether
>> to fall back to the per-ctx round-robin path (blk_mq_do_dispatch_ctx())
>> instead of the batch flush path (blk_mq_flush_busy_ctxs()).  The whole
>> function is about dispatching anyway, so the name "need_dispatch" is
>> not particularly informative and can mislead readers into thinking that
>> a false value means "skip dispatching".
>>
>> Rename it to "cautious_dispatch" to match the comment right above the
>> check ("dequeue request one by one from sw queue if queue is busy")
>> and to convey the actual intent: take the cautious, fair, one-at-a-time
>> path either when we just drained hctx->dispatch (so the device has
>> recently pushed back) or when the dispatch_busy EWMA still indicates
>> congestion.  The fast batch path is only taken when neither signal
>> suggests recent backpressure.
> If we're going to do churn like that, it should at least be an
> improvement. 'cautious_dispatch' tells the reader nothing about
> what kind of behavior this modifies. 'piecemeal_dispatch' would
> be much better, as it actually accurately describes what it
> does.
Sure, 'piecemeal_dispatch' is indeed better; changed in v2.

Best Regards,
Guixin Liu
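
The decision described in the quoted commit message can be sketched as a small predicate. This is only an illustration under assumptions: `hctx_model` and its fields are hypothetical stand-ins for the two signals named there, not the kernel's data structures.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two signals named in the patch description. */
struct hctx_model {
	bool drained_dispatch_list; /* requests were just taken from the dispatch list */
	unsigned int dispatch_busy; /* moving average of recent busy results; 0 means idle */
};

/* True selects the one-at-a-time path; false selects the batch flush path. */
static bool piecemeal_dispatch(const struct hctx_model *h)
{
	return h->drained_dispatch_list || h->dispatch_busy != 0;
}
```

A false value still dispatches; it only selects the batch path, which is the confusion the rename is meant to remove.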


^ permalink raw reply

* Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Tetsuo Handa @ 2026-05-27  1:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Bart Van Assche, Christoph Hellwig, Damien Le Moal,
	linux-block, LKML, Andrew Morton, Linus Torvalds, linux-btrfs,
	David Sterba
In-Reply-To: <ahZGxoI6oHQ_vSrx@fedora>

On 2026/05/27 10:20, Ming Lei wrote:
>> Of course we should try to figure out the root cause first, but how can we do?
> 
> Unexpected write IO from btrfs (after umount and after the loop device is closed) is
> definitely more serious, since it may cause data loss, so I'm CCing the btrfs list and maintainer.

Why do you assume that the culprit is btrfs?

https://syzkaller.appspot.com/bug?extid=bc273027d5643e48e5b3 indicated that
this similar race is also happening with jfs.

[  678.816570][ T1038] read_mapping_page failed!
[  678.816584][ T1038] ERROR: (device loop3): txCommit: 
[  678.816584][ T1038] 
[  678.816633][ T1038] jfs_write_inode: jfs_commit_inode failed!
[  678.895688][ T2183] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  678.956225][ T2183] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  678.970652][   T12] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  679.102838][ T4281] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  679.104701][ T4281] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  679.121329][ T2183] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  679.122119][ T2183] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  679.199283][ T2183] lo_rw_aio(loop3) starting read with raw_refcnt=0x0, refcnt=1
[  679.200014][ T2183] lo_rw_aio(loop3) starting write with raw_refcnt=0x0, refcnt=1
[  679.275613][ T5615] __loop_clr_fd(loop3) clearing lo_backing_file with raw_refcnt=0x0, refcnt=1
[  679.397358][   T13] bridge_slave_1: left allmulticast mode
[  679.397399][   T13] bridge_slave_1: left promiscuous mode
[  679.410004][   T13] bridge0: port 2(bridge_slave_1) entered disabled state
[  679.433576][ T2183] ------------[ cut here ]------------
[  679.433592][ T2183] d_inode(dentry) != file_inode(file)
[  679.433617][ T2183] WARNING: ./include/linux/fs.h:1368 at file_remove_privs_flags+0x58c/0x640, CPU#0: kworker/u8:12/2183
[  679.433676][ T2183] Modules linked in:
[  679.433695][ T2183] CPU: 0 UID: 0 PID: 2183 Comm: kworker/u8:12 Not tainted syzkaller #0 PREEMPT_{RT,(full)} 
[  679.433720][ T2183] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
[  679.433739][ T2183] Workqueue: loop3 loop_workfn
[  679.433805][ T2183] RIP: 0010:file_remove_privs_flags+0x58c/0x640
[  679.433848][ T2183] Code: 00 75 4d 44 89 e8 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc e8 5f d4 80 ff e9 90 fe ff ff e8 55 d4 80 ff 90 <0f> 0b 90 e9 85 fb ff ff 44 89 f1 80 e1 07 80 c1 03 38 c1 0f 8c b7
[  679.433867][ T2183] RSP: 0018:ffffc90007e374e0 EFLAGS: 00010293
[  679.433885][ T2183] RAX: ffffffff8243f7cb RBX: ffff888036fa8ca0 RCX: ffff88802c0abd80
[  679.433902][ T2183] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  679.433933][ T2183] RBP: ffffc90007e37638 R08: 0000000000000000 R09: 0000000000000000
[  679.433946][ T2183] R10: dffffc0000000000 R11: fffffbfff1f1597f R12: ffff888063726220
[  679.433962][ T2183] R13: 1ffff11006df5194 R14: 0000000000000000 R15: 1ffff1100c6e4c44
[  679.433978][ T2183] FS:  0000000000000000(0000) GS:ffff888125f1f000(0000) knlGS:0000000000000000
[  679.433998][ T2183] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  679.434016][ T2183] CR2: 00007f22e1be7dac CR3: 000000003e332000 CR4: 00000000003526f0
[  679.434038][ T2183] Call Trace:
[  679.434049][ T2183]  <TASK>
[  679.434072][ T2183]  ? __pfx_file_remove_privs_flags+0x10/0x10
[  679.434118][ T2183]  ? rt_mutex_post_schedule+0xd1/0x1c0
[  679.434172][ T2183]  ? generic_write_checks_count+0x449/0x550
[  679.434212][ T2183]  ? generic_write_checks+0xc8/0x110
[  679.434249][ T2183]  shmem_file_write_iter+0xaa/0x120
[  679.434286][ T2183]  lo_rw_aio+0xef0/0x1170
[  679.434349][ T2183]  ? __pfx_lo_rw_aio+0x10/0x10
[  679.434401][ T2183]  ? kthread_associate_blkcg+0x490/0x600
[  679.434432][ T2183]  ? rt_spin_unlock+0x160/0x200
[  679.434476][ T2183]  loop_process_work+0x637/0x11b0
[  679.434539][ T2183]  ? __pfx_loop_process_work+0x10/0x10
[  679.434582][ T2183]  ? look_up_lock_class+0x57/0x110
[  679.434626][ T2183]  ? register_lock_class+0x31/0x2e0
[  679.434661][ T2183]  ? __lock_acquire+0x6b5/0x2d10
[  679.434741][ T2183]  ? do_raw_spin_lock+0x12b/0x2f0
[  679.434785][ T2183]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  679.434830][ T2183]  ? process_one_work+0x8be/0x1630
[  679.434870][ T2183]  ? process_one_work+0x8be/0x1630
[  679.434922][ T2183]  ? process_one_work+0x8be/0x1630
[  679.434959][ T2183]  process_one_work+0x98b/0x1630
[  679.435026][ T2183]  ? __pfx_process_one_work+0x10/0x10
[  679.435060][ T2183]  ? do_raw_spin_lock+0x12b/0x2f0
[  679.435128][ T2183]  worker_thread+0xb49/0x1140
[  679.435202][ T2183]  kthread+0x388/0x470
[  679.435233][ T2183]  ? __pfx_worker_thread+0x10/0x10
[  679.435276][ T2183]  ? __pfx_kthread+0x10/0x10
[  679.435309][ T2183]  ret_from_fork+0x514/0xb70
[  679.435348][ T2183]  ? __pfx_ret_from_fork+0x10/0x10
[  679.435382][ T2183]  ? __switch_to+0xc79/0x1410
[  679.435415][ T2183]  ? __pfx_kthread+0x10/0x10
[  679.435447][ T2183]  ret_from_fork_asm+0x1a/0x30
[  679.435517][ T2183]  </TASK>


^ permalink raw reply

* Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Ming Lei @ 2026-05-27  1:20 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jens Axboe, Bart Van Assche, Christoph Hellwig, Damien Le Moal,
	linux-block, LKML, Andrew Morton, Linus Torvalds, linux-btrfs,
	David Sterba
In-Reply-To: <1a9f53d4-6f48-4df8-a3d8-2b0e442a163a@I-love.SAKURA.ne.jp>

On Tue, May 26, 2026 at 09:25:30AM +0900, Tetsuo Handa wrote:
> On 2026/05/26 0:19, Ming Lei wrote:
> > On Mon, May 25, 2026 at 12:40:19PM +0900, Tetsuo Handa wrote:
> >> Some commit which was merged in the merge window for 7.1 broke the loop
> >> driver; a race window where lo_release() clears the backing file via
> >> __loop_clr_fd() despite some I/O requests are pending was introduced [1][2].
> >>
> >> The exact commit which changed the behavior is not known due to lack of
> >> reproducer and timing dependent behavior, but it seems that we need to
> >> solve this problem in the loop driver despite there was no change for the
> >> loop driver during this merge window.
> >>
> >> To close this race, try to flush pending I/O requests. However, calling
> >> drain_workqueue() from __loop_clr_fd() with disk->open_mutex held causes
> >> lockdep warnings [3][4]. We need to flush pending I/O requests without
> >> disk->open_mutex held.
> > 
> > No, please don't workaround before root cause.
> > 
> > No proof shows that the issue is in block layer or loop driver, the IO isn't
> > expected, you need to figure out why btrfs still issues IO after this loop
> > disk is closed by everyone and writeback is done.
> > 
> > https://syzkaller.appspot.com/x/log.txt?x=101e4702580000
> > 
> 
> Of course we should try to figure out the root cause first, but how can we do?

Unexpected write IO from btrfs (after umount and after the loop device is closed) is
definitely more serious, since it may cause data loss, so I'm CCing the btrfs list and maintainer.

...
 
> Possible approaches for finding the exact commit that is causing this problem:
> 
>   (a) Revert all changes in the block layer from linux.git and monitor for one week for whether this
>       problem is still happening (because linux.git is more frequently hitting this problem than
>       linux-next.git ).
> 
>   (b) Revert all changes in the block layer from linux-next.git and monitor for two weeks for
>       whether this problem is still happening (less reliable than linux.git but a candidate).
> 
>   (c) Let sashiko review all changes between v7.0 and v7.1 that may cause this problem.
>       (Human developers have no time to review. But is investigation with moving baseline commit
>       possible for sashiko ?)
> 
>   (d) Any ideas?
> 
> P.S. Since the loop driver is a critical infrastructure for testing filesystems by syzbot,
> I want this problem be addressed before 7.1 is released.

syzbot is for finding real problems; here the real trouble is unexpected write IO from btrfs.

So please do not try to paper over a real bug by 'fixing' loop.


Thanks,
Ming

^ permalink raw reply

* Re: [PATCHv2] blk-mq: reinsert cached request to the list
From: Ming Lei @ 2026-05-27  0:53 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, axboe, Keith Busch, Christoph Hellwig
In-Reply-To: <20260526153531.2365935-1-kbusch@meta.com>

On Tue, May 26, 2026 at 10:35 AM Keith Busch <kbusch@meta.com> wrote:
>
> From: Keith Busch <kbusch@kernel.org>
>
> A previous commit removed an optimization out of caution for a scenario
> that turns out not to be real: all the "queue_exit" goto's are safe to
> reinsert the request into the cached_rq's plug list as they are either
> from a non-blocking path, or a successful merge that already holds the
> queue reference. This optimization is most needed for small sequential
> workloads that successfully merge into larger requests.
>
> Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable")
> Suggested-by: Ming Lei <tom.leiming@gmail.com>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> v1->v2:
>
>   Actually use the correct rq_list function to return the rq to the
>   list.
>
>  block/blk-mq.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 28c2d931e75ea..a24175441380e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3246,7 +3246,7 @@ void blk_mq_submit_bio(struct bio *bio)
>         if (!rq)
>                 blk_queue_exit(q);
>         else
> -               blk_mq_free_request(rq);
> +               rq_list_add_head(&plug->cached_rqs, rq);
>  }

Reviewed-by: Ming Lei <tom.leiming@gmail.com>

Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH] block: blk-zoned: fix zwplug refcount leak on write error path
From: Damien Le Moal @ 2026-05-26 23:15 UTC (permalink / raw)
  To: Haris Iqbal, Wentao Liang, Jens Axboe; +Cc: linux-block, linux-kernel, stable
In-Reply-To: <d8be2a57-c950-46c2-b9d8-120b6e53da91@linux.dev>

On 5/27/26 3:54 AM, Haris Iqbal wrote:
> 
> 
> On 5/26/26 16:18, Wentao Liang wrote:
>> blk_zone_wplug_handle_write() increments zwplug->ref via kref_get()
>> when preparing to handle a zone write. On the error path where
>> blk_zone_wplug_handle_write_noalloc() fails, the function returns
>> without calling kref_put() on zwplug->ref, leaking the reference.
>>
>> Add kref_put(&zwplug->ref, ...) on the error path to properly release
>> the reference.
>>
>> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
>> ---
>>   block/blk-zoned.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index 42ef830054dc..24b899663a48 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -1503,6 +1503,7 @@ static bool blk_zone_wplug_handle_write(struct bio
>> *bio, unsigned int nr_segs)
>>         if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
>>           spin_unlock_irqrestore(&zwplug->lock, flags);
>> +        disk_put_zone_wplug(zwplug);
> 
> I am not sure if this is needed. The code above adds the
> BIO_ZONE_WRITE_PLUGGING flag to the bio, which means the
> blk_zone_write_plug_bio_endio would be called which should then call
> disk_put_zone_wplug.

Correct. This patch is not correct at all. The write plug reference is dropped
in the BIO completion path.

Wentao,

You clearly did not test this at all because if you had, you would have seen
all the warning splats that your patch triggers.

-- 
Damien Le Moal
Western Digital Research
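
Why an extra put is harmful when the completion path already drops the reference can be shown with a toy counter. This is a minimal sketch, assuming a plain integer count; `ref_model`, `model_get` and `model_put` are hypothetical and much simpler than the kernel's reference counting.

```c
#include <stdbool.h>

/* Hypothetical reference counter; the kernel uses its own refcount type. */
struct ref_model {
	int count;
	bool underflow; /* set when a put has no matching get */
};

static void model_get(struct ref_model *r)
{
	r->count++;
}

static void model_put(struct ref_model *r)
{
	if (r->count == 0) {
		r->underflow = true; /* an unbalanced put; a real counter would complain here */
		return;
	}
	r->count--;
}
```

With one get, a put on the error path followed by the put from the completion path leaves the second put without a matching get.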

^ permalink raw reply

* Re: [PATCH v15 0/8] blk: honor isolcpus configuration
From: Aaron Tomlin @ 2026-05-26 22:02 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
	martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, bigeasy, yphbchou0911, wagi, frederic,
	longman, chenridong, hare, kch, ming.lei, tom.leiming, steve,
	sean, chjohnst, neelx, mproche, nick.lange, marco.crivellari,
	rishil1999, linux-block, linux-kernel
In-Reply-To: <a276d7fa-8ab1-4cc2-a095-e7e4c060a4ad@flourine.local>


On Tue, May 26, 2026 at 06:05:54PM +0200, Daniel Wagner wrote:
> > Please let me know your thoughts.
> > 
> > 
> > Changes since v14:
> 
> You’re moving fast with these updates! It’s great energy, but it’s
> actually moving a bit faster than the review process can keep up with.
> I’ve heard from some folks in the CC that they waiting for a 'final'
> version.
> 
> Is this latest version ready for a full, deep-dive review, or are there
> still a few 'knacks' you’re looking to iron out first?

Hi Daniel,

Thank you for making me aware of this. I entirely understand—I certainly
wish to avoid causing review fatigue or unnecessary churn for those copied
on this thread.

To address your query: there remain a few minor concerns. However,
considering the rather rapid pace of recent updates, I am more than happy
to pause the process.

Let us allow the current iteration to settle. I shall hold off on
submitting the next version to ensure everyone has ample time to run their
tests, conduct a thorough review, and offer further feedback on the present
state of the patches.


Kind regards,
-- 
Aaron Tomlin


^ permalink raw reply

* Re: [PATCH V4 0/3] md/nvme: Enable PCI P2PDMA support for RAID0 and NVMe Multipath
From: Jens Axboe @ 2026-05-26 21:52 UTC (permalink / raw)
  To: song, yukuai, linan122, kbusch, hch, sagi, Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, kmodukuri
In-Reply-To: <20260513185153.95552-1-kch@nvidia.com>


On Wed, 13 May 2026 11:51:50 -0700, Chaitanya Kulkarni wrote:
> This patch series extends PCI peer-to-peer DMA (P2PDMA) support to enable
> direct data transfers between PCIe devices through RAID and NVMe multipath
> block layers.
> 
> Current Linux kernel P2PDMA infrastructure supports direct peer-to-peer
> transfers, but this support is not propagated through certain storage
> stacks like MD RAID and NVMe multipath. This adds two patches for
> MD RAID 0/1/10 and NVMe to propogate P2PDMA support through the
> storage stack.
> 
> [...]

Applied, thanks!

[1/3] block: clear BLK_FEAT_PCI_P2PDMA in blk_stack_limits() for non-supporting devices
      commit: 7882834048f110931275357db60dccff906dc96a
[2/3] md: propagate BLK_FEAT_PCI_P2PDMA from member devices to RAID device
      commit: 02666132403aec8fc5de315002894f713ef17dbc
[3/3] nvme-multipath: enable PCI P2PDMA for multipath devices
      commit: fb0eeeed91f3236133383445fee5cc8f20330e6e

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH V4 0/3] md/nvme: Enable PCI P2PDMA support for RAID0 and NVMe Multipath
From: Jens Axboe @ 2026-05-26 21:51 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: song@kernel.org, yukuai@fnnas.com, Christoph Hellwig,
	linan122@huawei.com, kbusch@kernel.org, sagi@grimberg.me,
	linux-block@vger.kernel.org, linux-raid@vger.kernel.org,
	linux-nvme@lists.infradead.org, Kiran Modukuri
In-Reply-To: <053b99b2-c994-42ff-af63-6e63ab468557@nvidia.com>

On 5/26/26 11:09 AM, Chaitanya Kulkarni wrote:
> Jens,
> 
> On 5/19/26 17:11, Chaitanya Kulkarni wrote:
>> Jens,
>>
>>
>> On 5/14/26 9:35 PM, Christoph Hellwig wrote:
>>> Still looks good to me as per the reviews.
>>>
>> If there no objection, can we merge this ?
>>
>> -Chaitanya
>>
>>
> There is outstanding work I want to send out based on this one.

Out standing, outstanding, or both? :-)

> May I please request you to merge this patch series ?

Was waiting on the md parts to get reviewed, but I missed that Xiao Ni
already did. I'll queue it up.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCHv2] blk-mq: reinsert cached request to the list
From: Jens Axboe @ 2026-05-26 21:06 UTC (permalink / raw)
  To: linux-block, Keith Busch; +Cc: Keith Busch, Ming Lei, Christoph Hellwig
In-Reply-To: <20260526153531.2365935-1-kbusch@meta.com>


On Tue, 26 May 2026 08:35:31 -0700, Keith Busch wrote:
> A previous commit removed an optimization out of caution for a scenario
> that turns out not to be real: all the "queue_exit" goto's are safe to
> reinsert the request into the cached_rq's plug list as they are either
> from a non-blocking path, or a successful merge that already holds the
> queue reference. This optimization is most needed for small sequential
> workloads that successfully merge into larger requests.
> 
> [...]

Applied, thanks!

[1/1] blk-mq: reinsert cached request to the list
      commit: b051bb6bf0a231117036aa607cadf55be8e63910

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH] block: partitions: replace __get_free_page() with kmalloc()
From: Vlastimil Babka @ 2026-05-26 20:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mike Rapoport, Christoph Hellwig, Jens Axboe, linux-block,
	linux-kernel, linux-mm
In-Reply-To: <ahWwD0eqj3cUgaKn@casper.infradead.org>

On 5/26/26 16:37, Matthew Wilcox wrote:
> On Tue, May 26, 2026 at 02:07:36PM +0200, Vlastimil Babka wrote:
>> The main reasons for switching AFAIU would be related with the
>> folio/memdesc conversions? If one needs just a kernel memory buffer,
>> kmalloc() it is, even if it happens to be page size. Page allocator
>> should be only used if you need e.g. the refcounting or anything else
>> that struct page provides. But then in some cases the memdesc conversion
>> would need adjustments at some point. With kmalloc() we can forget about
>> this user.
> 
> No, I think this is unrelated to memdescs.
> 
> I've seen a few people say slightly wrong things about
> folios/pages/memdescs recently, so let me try to clarify the end state.
> 
> I do not intend to get rid of the ability to allocate a bare page of
> memory with something like alloc_pages() or get_free_page().  It's
> just that the struct page associated with it will contain far less
> information (because it's smaller).

Alright, but isn't it still the case that if you don't need any of what
struct page provides today or will do in the future, it's better if you just
use kmalloc()? I thought you said so yourself?

https://lore.kernel.org/all/aPQxN7-FeFB6vTuv@casper.infradead.org/

So what exactly would your rationale for "Most of them shouldn't be using
get_free_pages() at all, they should be using kmalloc()." be?

> https://kernelnewbies.org/MatthewWilcox/Memdescs has a bit more
> information, but to distill it:
> 
> You get a u64 worth of data (technically one per page, but if you
> allocate multiple pages, they're all going to be the same).
> Bits 0-3 will be type 0 (to indicate that it has no memdesc).  
> Bits 4-10 will be subtype 2 (to indicate no information about owner).
> Bit 11 will be clear to indicate that this page should not be mappable
> to userspace.
> Bits 12-17 will store the allocation order.
> The top few bits will encode zone/node/section like page->flags
> do today.
> 
> That doesn't leave many free bits for the user, but that's OK because
> most allocations don't actually need any bits in struct page.  If you do
> want something like a refcount or list_head, see the "Managed memory"
> section on that page.  If you actually want a full-fat folio, well,
> allocate a folio, not a page.
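
The bit layout listed in the quoted text can be written down as masks and shifts. This is a sketch of that description only, assuming the bit ranges exactly as stated; the `MD_*` names, `md_field` and `md_pack` are hypothetical, and the zone/node/section bits at the top are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define MD_TYPE_SHIFT     0
#define MD_TYPE_BITS      4   /* bits 0-3 */
#define MD_SUBTYPE_SHIFT  4
#define MD_SUBTYPE_BITS   7   /* bits 4-10 */
#define MD_MAPPABLE_BIT   11  /* clear: not mappable to userspace */
#define MD_ORDER_SHIFT    12
#define MD_ORDER_BITS     6   /* bits 12-17 */

/* Extracts a field of the given width starting at the given bit. */
static uint64_t md_field(uint64_t v, unsigned int shift, unsigned int bits)
{
	return (v >> shift) & ((UINT64_C(1) << bits) - 1);
}

/* Packs the four described fields into one 64-bit word. */
static uint64_t md_pack(unsigned int type, unsigned int subtype,
			bool mappable, unsigned int order)
{
	return ((uint64_t)(type & 0xf) << MD_TYPE_SHIFT) |
	       ((uint64_t)(subtype & 0x7f) << MD_SUBTYPE_SHIFT) |
	       ((uint64_t)mappable << MD_MAPPABLE_BIT) |
	       ((uint64_t)(order & 0x3f) << MD_ORDER_SHIFT);
}
```

For a bare order-3 allocation as described (type 0, subtype 2, not mappable), `md_pack(0, 2, false, 3)` sets only the subtype and order fields.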


^ permalink raw reply

* Re: [PATCH] bvec: make the bvec_iter helpers inline functions
From: Keith Busch @ 2026-05-26 19:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-block
In-Reply-To: <20260526070042.1817997-1-hch@lst.de>

On Tue, May 26, 2026 at 09:00:27AM +0200, Christoph Hellwig wrote:
> -#define __bvec_iter_bvec(bvec, iter)	(&(bvec)[(iter).bi_idx])
> +static __always_inline const struct bio_vec *
> +__bvec_iter_bvec(const struct bio_vec *bvecs, const struct bvec_iter iter)
> +{
> +	return bvecs + iter.bi_idx;
> +}

There are a couple of drivers, nvme-tcp and loop, that call this without
the const qualifier, so this will produce new warnings. The nvme-tcp one
is simple to fix by just adding the 'const', whereas loop looks like it
needs a little more consideration to get there, but it is still doable.

^ permalink raw reply

* Re: [PATCH 1/2] block: Use struct_size() helper in kmalloc()
From: Bart Van Assche @ 2026-05-26 19:44 UTC (permalink / raw)
  To: luoqing, Jens Axboe; +Cc: linux-block, linux-kernel
In-Reply-To: <20260526085648.1784798-1-l1138897701@163.com>

On 5/26/26 1:56 AM, luoqing wrote:
> From: luoqing <luoqing@kylinos.cn>
> 
> Make use of the struct_size() helper instead of an open-coded version,
> in order to avoid any potential type mistakes or integer overflows that,
> in the worst scenario, could lead to heap overflows.
> 
> Signed-off-by: luoqing <luoqing@kylinos.cn>
> ---
>   block/bio.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index d80d5d26804e..397fc3bc0ede 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -657,8 +657,7 @@ struct bio *bio_kmalloc(unsigned short nr_vecs, gfp_t gfp_mask)
>   
>   	if (nr_vecs > BIO_MAX_INLINE_VECS)
>   		return NULL;
> -	return kmalloc(sizeof(*bio) + nr_vecs * sizeof(struct bio_vec),
> -			gfp_mask);
> +	return kmalloc(struct_size(bio, bio_vec, nr_vecs), gfp_mask);
>   }
>   EXPORT_SYMBOL(bio_kmalloc);

Has this patch been tested? If I apply it and try to build the kernel,
the following appears:

block/bio.c:633:34: error: no member named 'bio_vec' in 'struct bio'
   633 |         return kmalloc(struct_size(bio, bio_vec, nr_vecs), gfp_mask);

Bart.
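
As background on why the member name matters: struct_size() takes the
name of the flexible array member, so a name that does not exist in the
structure fails at compile time. The following is a rough userspace
approximation, not the kernel macro. The demo_* types and
DEMO_STRUCT_SIZE are hypothetical stand-ins, and the overflow builtins
assume GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types, not the kernel's struct bio / struct bio_vec. */
struct demo_vec {
	void *page;
	unsigned int len;
	unsigned int offset;
};

struct demo_bio {
	unsigned short max_vecs;
	struct demo_vec vecs[];	/* flexible array member */
};

/*
 * Header size plus count elements, saturating to SIZE_MAX on overflow
 * so that a later allocation fails instead of being too small.
 */
static size_t demo_struct_size(size_t header, size_t elem, size_t count)
{
	size_t bytes;

	if (__builtin_mul_overflow(elem, count, &bytes) ||
	    __builtin_add_overflow(bytes, header, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* The member argument must name a real array member of *ptr. */
#define DEMO_STRUCT_SIZE(ptr, member, count) \
	demo_struct_size(sizeof(*(ptr)), sizeof((ptr)->member[0]), (count))
```

DEMO_STRUCT_SIZE(b, vecs, 4) compiles for a struct demo_bio pointer b,
while DEMO_STRUCT_SIZE(b, demo_vec, 4) does not, because demo_vec is a
type name rather than a member of struct demo_bio.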



^ permalink raw reply

* Re: [PATCH] bvec: make the bvec_iter helpers inline functions
From: Bart Van Assche @ 2026-05-26 19:32 UTC (permalink / raw)
  To: Christoph Hellwig, axboe; +Cc: linux-block
In-Reply-To: <20260526070042.1817997-1-hch@lst.de>

On 5/26/26 12:00 AM, Christoph Hellwig wrote:
> The macros are impossible to follow due to the lack of visual type
> information and all the braces.  Replace them with inline helpers to
> improve on that.  Because the calling conventions are a bit problematic
> with a lot of passing structures by value, all the helpers are marked
> as __always_inline so that they are force inlined.

Thanks Christoph!

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply

* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Tal Zussman @ 2026-05-26 19:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
	Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
	linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <ahPbaSEoNA755Nt3@infradead.org>

On 5/25/26 1:17 AM, Christoph Hellwig wrote:
> On Fri, May 22, 2026 at 06:47:43PM -0400, Tal Zussman wrote:
>> > But this 1-jiffy delay also means we unconditionally increase
>> > completion latency, which feels like a bad idea.  Do you have any
>> > measurements that show where it does benefit?  Note that queueing work
>> > already often has very measurable latency on its own.  This also
>> > directly contradicts the erofs experience that even went to a RT
>> > thread to reduce the latency.
>> 
>> I added this per Dave's feedback on v4, where he noted that XFS inodegc
>> uses a delayed work item to avoid context switch storms. There's only a
>> delay for the first bio in a batch to complete, as we only delay when the
>> list is empty. I'll run some experiments and measure context switches,
>> completion latency, etc. to see if this is necessary.
> 
> The difference is that XFS inodegc is not latency bound.  Most of the
> time no one cares if it is delayed a bit, in the cases where someone
> cares we explicitly flush the queues.  I/O completion on the other hand
> is something where users very much care about latency.
> 

I ran some experiments with fio on both XFS and a raw block device. Five
iterations each for 60s. Results below.

TLDR: Removing the delay doesn't significantly decrease user-visible
latency or otherwise improve performance, but does significantly reduce
throughput and increase context switches in some workloads (e.g. C).
I think it makes sense to leave the delay as-is. Thoughts?

Results:

Workloads (all `uncached=1`):
  A: rw=write     bs=128k iodepth=1   ioengine=pvsync2     # XFS
  B: rw=write     bs=128k iodepth=128 ioengine=io_uring    # XFS
  C: rw=randwrite bs=4k   iodepth=32  ioengine=io_uring    # XFS
  D: rw=rw 50/50  bs=64k  iodepth=32  ioengine=io_uring    # XFS
  E: rw=write     bs=128k iodepth=128 ioengine=io_uring    # raw /dev/nvmeXn1
  F: rw=write     bs=128k iodepth=128 numjobs=4
     + vm.dirty_bytes=64MB, vm.dirty_background_bytes=32MB # XFS

Mean ± stddev across 5 iterations:

    metric                     delay=1           delay=0     delta
    --------------------------------------------------------------

  A seq 128k qd1
    BW (MB/s)                4333 ± 27         4374 ± 34     +0.9%
    p99   (us)              36.2 ± 0.8        35.8 ± 0.4     -1.1%
    p999  (us)               3260 ± 75         3228 ± 29     -1.0%
    ctx-switches          184 k ± 59 k     3.68 M ± 65 k    +1903%
    cs / io                0.09 ± 0.03       1.86 ± 0.03    +1888%
    avg bios/run            80.4 ± 0.6         1.1 ± 0.0    -98.7%

  B seq 128k qd128
    BW (MB/s)               4393 ± 3.3        4311 ± 5.3     -1.9%
    p99   (us)               8461 ± 73        8638 ± 105     +2.1%
    p999  (us)             12465 ± 213       12386 ± 299     -0.6%
    ctx-switches        6.90 M ± 186 k    9.72 M ± 184 k    +40.7%
    cs / io                3.43 ± 0.10       4.92 ± 0.10    +43.4%
    avg bios/run            51.9 ± 2.2         1.3 ± 0.0    -97.4%

  C rand 4k qd32
    BW (MB/s)               66.2 ± 0.8        44.6 ± 7.4    -32.7%
    p99   (us)              8002 ± 174      17990 ± 6800   +124.8%
    p999  (us)             11390 ± 554     31890 ± 11076   +180.0%
    ctx-switches         3.67 M ± 45 k    3.59 M ± 106 k     -2.2%
    cs / io                3.78 ± 0.04       5.62 ± 0.83    +48.7%
    avg bios/run            32.3 ± 1.0         3.1 ± 0.3    -90.5%

  D mixed 50/50 r/w 64k qd32
    write BW (MB/s)       892.4 ± 20.9      925.3 ± 18.3     +3.7%
    write p99 (us)          3562 ± 107         3601 ± 82     +1.1%
    write p999 (us)         4673 ± 217        4647 ± 107     -0.6%
    read BW (MB/s)        893.6 ± 20.8      926.6 ± 18.4     +3.7%
    read p99 (us)            1003 ± 48         1035 ± 39     +3.2%
    read p999 (us)           1545 ± 63         1476 ± 50     -4.5%
    ctx-switches         5.15 M ± 75 k    5.79 M ± 230 k    +12.6%
    cs / io                6.32 ± 0.15       6.85 ± 0.20     +8.5%
    avg bios/run            23.9 ± 0.3         2.5 ± 0.0    -89.4%

  E raw 128k qd128
    BW (MB/s)               1043 ± 1.0        1045 ± 0.5     +0.1%
    p99   (us)             26922 ± 105       27027 ± 128     +0.4%
    p999  (us)            37906 ± 4527      37408 ± 2464     -1.3%
    ctx-switches          3.20 M ± 6 k     3.33 M ± 10 k     +3.8%
    cs / io                6.71 ± 0.01       6.95 ± 0.02     +3.7%
    avg bios/run            38.0 ± 0.1        32.0 ± 0.0    -15.6%

  F mem-pressure (dirty_bytes=64MB, 4 writers)
    BW (MB/s)                4361 ± 24         4444 ± 40     +1.9%
    p99   (us)             29439 ± 419       30173 ± 788     +2.5%
    p999  (us)            35704 ± 1773       36648 ± 535     +2.6%
    ctx-switches        20.8 M ± 1.6 M    27.1 M ± 1.4 M    +30.1%
    cs / io                6.94 ± 0.49       8.87 ± 0.46    +27.8%
    avg bios/run            23.6 ± 0.3         1.2 ± 0.0    -94.9%
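
To make the "only delay when the list is empty" rule concrete, here is
a small userspace model. It is not the code from the patch. The
demo_batch structure and both functions are hypothetical, and the model
covers only the scheduling decision, not the timer or the workqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_batch {
	size_t pending;		/* completions queued but not yet run */
	size_t wakeups;		/* times the worker was scheduled */
	size_t processed;	/* completions the worker has run */
};

/*
 * Queue one completion.  The worker is scheduled only when the list
 * was empty, so later completions join the existing batch.
 * Returns true if this call scheduled the worker.
 */
static bool demo_queue_completion(struct demo_batch *b)
{
	bool was_empty = (b->pending == 0);

	b->pending++;
	if (was_empty)
		b->wakeups++;
	return was_empty;
}

/* Worker body: drain everything queued so far and report how many. */
static size_t demo_run_worker(struct demo_batch *b)
{
	size_t n = b->pending;

	b->processed += n;
	b->pending = 0;
	return n;
}
```

In this model, 32 completions that arrive before the worker runs cost
one wakeup, and processed divided by wakeups gives a ratio comparable
in spirit to the "avg bios/run" rows above.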

^ permalink raw reply

* Re: [PATCH] block: blk-zoned: fix zwplug refcount leak on write error path
From: Haris Iqbal @ 2026-05-26 18:54 UTC (permalink / raw)
  To: Wentao Liang, Jens Axboe, Damien Le Moal
  Cc: linux-block, linux-kernel, stable
In-Reply-To: <20260526141824.2293025-1-vulab@iscas.ac.cn>



On 5/26/26 16:18, Wentao Liang wrote:
> blk_zone_wplug_handle_write() increments zwplug->ref via kref_get()
> when preparing to handle a zone write. On the error path where
> blk_zone_wplug_handle_write_noalloc() fails, the function returns
> without calling kref_put() on zwplug->ref, leaking the reference.
> 
> Add kref_put(&zwplug->ref, ...) on the error path to properly release
> the reference.
> 
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Cc: stable@vger.kernel.org
> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
> ---
>   block/blk-zoned.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 42ef830054dc..24b899663a48 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -1503,6 +1503,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
>   
>   	if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
>   		spin_unlock_irqrestore(&zwplug->lock, flags);
> +		disk_put_zone_wplug(zwplug);

I am not sure this is needed. The code above sets the
BIO_ZONE_WRITE_PLUGGING flag on the bio, which means
blk_zone_write_plug_bio_endio() will be called, and that should in turn
call disk_put_zone_wplug().

I do wonder whether there are special cases in which
blk_zone_bio_endio() is not called.
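
To spell out the concern, here is a toy reference-count model of the
error path. It is only a sketch under the assumption stated above, that
the endio handler drops the reference. The demo_* names are
hypothetical and nothing here is taken from blk-zoned.c.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_wplug {
	int ref;
};

static void demo_get(struct demo_wplug *z)
{
	z->ref++;
}

static void demo_put(struct demo_wplug *z)
{
	z->ref--;
}

/*
 * Model the error path for one bio and return the resulting refcount.
 * put_in_error_path: the extra put the patch adds.
 * endio_puts: whether the completion handler also drops the reference.
 */
static int demo_error_path(struct demo_wplug *z, bool put_in_error_path,
			   bool endio_puts)
{
	demo_get(z);		/* reference taken for the write */
	if (put_in_error_path)
		demo_put(z);
	if (endio_puts)
		demo_put(z);
	return z->ref;
}
```

Starting from a count of 1, the count returns to 1 when exactly one of
the two puts happens, drops to 0 when both happen (one put too many),
and stays at 2 when neither does (the leak the patch describes).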

>   		bio_io_error(bio);
>   		return true;
>   	}
> @@ -1511,6 +1512,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
>   	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
>   
>   	spin_unlock_irqrestore(&zwplug->lock, flags);
> +	disk_put_zone_wplug(zwplug);
>   
>   	return false;
>   


^ permalink raw reply

* Re: [PATCH] bvec: make the bvec_iter helpers inline functions
From: Caleb Sander Mateos @ 2026-05-26 18:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-block
In-Reply-To: <20260526070042.1817997-1-hch@lst.de>

On Tue, May 26, 2026 at 12:00 AM Christoph Hellwig <hch@lst.de> wrote:
>
> The macros are impossible to follow due to the lack of visual type
> information and all the braces.  Replace them with inline helpers to
> improve on that.  Because the calling conventions are a bit problematic
> with a lot of passing structures by value, all the helpers are marked
> as __always_inline so that they are force inlined.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

> ---
>  include/linux/bvec.h | 101 +++++++++++++++++++++++++++----------------
>  1 file changed, 64 insertions(+), 37 deletions(-)
>
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index d36dd476feda..f4c7ec282ac9 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -104,51 +104,78 @@ struct bvec_iter_all {
>         unsigned        done;
>  };
>
> -/*
> - * various member access, note that bio_data should of course not be used
> - * on highmem page vectors
> - */
> -#define __bvec_iter_bvec(bvec, iter)   (&(bvec)[(iter).bi_idx])
> +static __always_inline const struct bio_vec *
> +__bvec_iter_bvec(const struct bio_vec *bvecs, const struct bvec_iter iter)
> +{
> +       return bvecs + iter.bi_idx;
> +}
>
>  /* multi-page (mp_bvec) helpers */
> -#define mp_bvec_iter_page(bvec, iter)                          \
> -       (__bvec_iter_bvec((bvec), (iter))->bv_page)
> +static __always_inline struct page *
> +mp_bvec_iter_page(const struct bio_vec *bvecs, const struct bvec_iter iter)
> +{
> +       return __bvec_iter_bvec(bvecs, iter)->bv_page;
> +}
>
> -#define mp_bvec_iter_len(bvec, iter)                           \
> -       min((iter).bi_size,                                     \
> -           __bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
> +static __always_inline unsigned int
> +mp_bvec_iter_len(const struct bio_vec *bvecs, const struct bvec_iter iter)
> +{
> +       return min(__bvec_iter_bvec(bvecs, iter)->bv_len - iter.bi_bvec_done,
> +                       iter.bi_size);
> +}
>
> -#define mp_bvec_iter_offset(bvec, iter)                                \
> -       (__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
> +static __always_inline unsigned int
> +mp_bvec_iter_offset(const struct bio_vec *bvecs, const struct bvec_iter iter)
> +{
> +       return __bvec_iter_bvec(bvecs, iter)->bv_offset + iter.bi_bvec_done;
> +}
>
> -#define mp_bvec_iter_page_idx(bvec, iter)                      \
> -       (mp_bvec_iter_offset((bvec), (iter)) / PAGE_SIZE)
> +static __always_inline unsigned int
> +mp_bvec_iter_page_idx(const struct bio_vec *bvecs, const struct bvec_iter iter)
> +{
> +       return mp_bvec_iter_offset(bvecs, iter) / PAGE_SIZE;
> +}
>
> -#define mp_bvec_iter_bvec(bvec, iter)                          \
> -((struct bio_vec) {                                            \
> -       .bv_page        = mp_bvec_iter_page((bvec), (iter)),    \
> -       .bv_len         = mp_bvec_iter_len((bvec), (iter)),     \
> -       .bv_offset      = mp_bvec_iter_offset((bvec), (iter)),  \
> -})
> +static __always_inline struct bio_vec
> +mp_bvec_iter_bvec(const struct bio_vec *bvecs, const struct bvec_iter iter)
> +{
> +       return (struct bio_vec) {
> +               .bv_page        = mp_bvec_iter_page(bvecs, iter),
> +               .bv_len         = mp_bvec_iter_len(bvecs, iter),
> +               .bv_offset      = mp_bvec_iter_offset(bvecs, iter),
> +       };
> +}
>
>  /* For building single-page bvec in flight */
> - #define bvec_iter_offset(bvec, iter)                          \
> -       (mp_bvec_iter_offset((bvec), (iter)) % PAGE_SIZE)
> -
> -#define bvec_iter_len(bvec, iter)                              \
> -       min_t(unsigned, mp_bvec_iter_len((bvec), (iter)),               \
> -             PAGE_SIZE - bvec_iter_offset((bvec), (iter)))
> -
> -#define bvec_iter_page(bvec, iter)                             \
> -       (mp_bvec_iter_page((bvec), (iter)) +                    \
> -        mp_bvec_iter_page_idx((bvec), (iter)))
> -
> -#define bvec_iter_bvec(bvec, iter)                             \
> -((struct bio_vec) {                                            \
> -       .bv_page        = bvec_iter_page((bvec), (iter)),       \
> -       .bv_len         = bvec_iter_len((bvec), (iter)),        \
> -       .bv_offset      = bvec_iter_offset((bvec), (iter)),     \
> -})
> +static __always_inline unsigned int
> +bvec_iter_offset(const struct bio_vec *bvecs, const struct bvec_iter iter)
> +{
> +       return mp_bvec_iter_offset(bvecs, iter) % PAGE_SIZE;
> +}
> +
> +static __always_inline unsigned int
> +bvec_iter_len(const struct bio_vec *bvecs, const struct bvec_iter iter)
> +{
> +       return min(mp_bvec_iter_len(bvecs, iter),
> +                       PAGE_SIZE - bvec_iter_offset(bvecs, iter));
> +}
> +
> +static __always_inline struct page *
> +bvec_iter_page(const struct bio_vec *bvecs, const struct bvec_iter iter)
> +{
> +       return mp_bvec_iter_page(bvecs, iter) +
> +               mp_bvec_iter_page_idx(bvecs, iter);
> +}
> +
> +static __always_inline struct bio_vec
> +bvec_iter_bvec(const struct bio_vec *bvecs, const struct bvec_iter iter)
> +{
> +       return (struct bio_vec) {
> +               .bv_page        = bvec_iter_page(bvecs, iter),
> +               .bv_len         = bvec_iter_len(bvecs, iter),
> +               .bv_offset      = bvec_iter_offset(bvecs, iter),
> +       };
> +}
>
>  static inline bool bvec_iter_advance(const struct bio_vec *bv,
>                 struct bvec_iter *iter, unsigned bytes)
> --
> 2.53.0
>
>
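
For readers new to these helpers, the arithmetic in the quoted patch
can be modelled in userspace. This is an illustration only. The demo_*
types are hypothetical stand-ins, a page number replaces struct page *,
and DEMO_PAGE_SIZE is assumed to be 4096.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size for the example */

struct demo_bvec {
	unsigned long first_page;	/* stands in for bv_page */
	unsigned int bv_len;
	unsigned int bv_offset;
};

struct demo_iter {
	unsigned int bi_size;		/* bytes left in the whole bio */
	unsigned int bi_idx;		/* current index into the bvec array */
	unsigned int bi_bvec_done;	/* bytes already done in this bvec */
};

static unsigned int demo_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Multi-page view: offset and length within the whole bvec. */
static unsigned int demo_mp_offset(const struct demo_bvec *bv, struct demo_iter it)
{
	return bv[it.bi_idx].bv_offset + it.bi_bvec_done;
}

static unsigned int demo_mp_len(const struct demo_bvec *bv, struct demo_iter it)
{
	return demo_min(bv[it.bi_idx].bv_len - it.bi_bvec_done, it.bi_size);
}

/* Single-page view: clamp the multi-page view to the current page. */
static unsigned int demo_offset(const struct demo_bvec *bv, struct demo_iter it)
{
	return demo_mp_offset(bv, it) % DEMO_PAGE_SIZE;
}

static unsigned int demo_len(const struct demo_bvec *bv, struct demo_iter it)
{
	return demo_min(demo_mp_len(bv, it),
			DEMO_PAGE_SIZE - demo_offset(bv, it));
}

static unsigned long demo_page(const struct demo_bvec *bv, struct demo_iter it)
{
	return bv[it.bi_idx].first_page + demo_mp_offset(bv, it) / DEMO_PAGE_SIZE;
}
```

The single-page helpers never return a length that crosses a page
boundary, which is the property the bvec_iter_len() clamp provides.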

^ permalink raw reply

* Re: [PATCH] block: skip sync_blockdev() on surprise removal in bdev_mark_dead()
From: Chao S @ 2026-05-26 18:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Christian Brauner, Josef Bacik, linux-block,
	linux-kernel, Sungwoo Kim, Dave Tian, Weidong Zhu
In-Reply-To: <177981343047.464267.10378401249582627288.b4-ty@b4>

Thanks, Jens! I appreciate your work.

Best,
Chao

On Tue, May 26, 2026 at 12:37 PM Jens Axboe <axboe@kernel.dk> wrote:
>
>
> On Fri, 22 May 2026 18:00:25 -0400, Chao Shi wrote:
> > bdev_mark_dead()'s @surprise == true means the device is already gone.
> > The filesystem callback fs_bdev_mark_dead() honours this and skips
> > sync_filesystem(), but the bare block device path (no ->mark_dead op)
> > lost its !surprise guard when the holder ->mark_dead callback was wired
> > up (see Fixes), and now calls sync_blockdev() unconditionally, which can
> > hang forever waiting on writeback that can no longer complete.
> >
> > [...]
>
> Applied, thanks!
>
> [1/1] block: skip sync_blockdev() on surprise removal in bdev_mark_dead()
>       commit: 304f384f34af98a205086ce67331cad4fea6504d
>
> Best regards,
> --
> Jens Axboe
>
>
>

^ permalink raw reply

* Re: [PATCH] block: skip sync_blockdev() on surprise removal in bdev_mark_dead()
From: Chao S @ 2026-05-26 18:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Christian Brauner, Josef Bacik, linux-block,
	linux-kernel, Sungwoo Kim, Dave Tian, Weidong Zhu
In-Reply-To: <20260525055810.GC3293@lst.de>

Thanks for your kind review!

Best,
Chao

On Mon, May 25, 2026 at 1:58 AM Christoph Hellwig <hch@lst.de> wrote:
>
> Looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>

^ permalink raw reply

* Re: [PATCH] block: Add bvec_folio()
From: Matthew Wilcox @ 2026-05-26 17:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, linux-kernel, io-uring, linux-mm,
	Leon Romanovsky
In-Reply-To: <ahVBCtsodsM2FHis@infradead.org>

On Mon, May 25, 2026 at 11:43:22PM -0700, Christoph Hellwig wrote:
> On Mon, May 25, 2026 at 02:29:27PM +0100, Matthew Wilcox wrote:
> > > So I'm not against the function per se, but the documentation must
> > > explain the minefields it is stepping into a bit better.
> > 
> > Lower level drivers shouldn't be concerning themselves with folios.
> > For a start, we can put non-folios (eg slab memory) into bvecs.
> 
> Well, that is a very good thing to put into the comment.  We can also
> put them into high-level bvecs, so framing this as 'only use if you
> know the memory is folios, which you can't unless you are the entity
> who filled the bio' might be a good choice.

How about:

/**
 * bvec_folio - Return the first folio referenced by this bvec
 * @bv: bvec to access
 *
 * bvecs can contain non-folio memory, so this should only be called by
 * the creator of the bvec; drivers have no business looking at the owner
 * of the memory.  It may not even be the right interface for the caller
 * to use as bvecs can span multiple folios.  You may be better off using
 * something like bio_for_each_folio_all() which iterates over all folios.
 */

^ permalink raw reply

* Re: [PATCH V4 0/3] md/nvme: Enable PCI P2PDMA support for RAID0 and NVMe Multipath
From: Chaitanya Kulkarni @ 2026-05-26 17:09 UTC (permalink / raw)
  To: axboe@kernel.dk
  Cc: song@kernel.org, yukuai@fnnas.com, Christoph Hellwig,
	linan122@huawei.com, kbusch@kernel.org, sagi@grimberg.me,
	linux-block@vger.kernel.org, linux-raid@vger.kernel.org,
	linux-nvme@lists.infradead.org, Kiran Modukuri
In-Reply-To: <4ed83782-04cf-45b5-93a0-05a08e61b82e@nvidia.com>

Jens,

On 5/19/26 17:11, Chaitanya Kulkarni wrote:
> Jens,
>
>
> On 5/14/26 9:35 PM, Christoph Hellwig wrote:
>> Still looks good to me as per the reviews.
>>
> If there are no objections, can we merge this?
>
> -Chaitanya
>
>
There is outstanding work I want to send out based on this one.

May I please request that you merge this patch series?

-ck



^ permalink raw reply
