Linux block layer

* [PATCH] blk-mq: reinsert cached request to the list
From: Keith Busch @ 2026-05-25 16:07 UTC (permalink / raw)
  To: linux-block, axboe; +Cc: Keith Busch, Ming Lei, Christoph Hellwig

From: Keith Busch <kbusch@kernel.org>

A previous commit removed an optimization out of caution for a scenario
that turns out not to be real: every "queue_exit" goto can safely
reinsert the request into the plug's cached_rqs list, because each one
comes either from a non-blocking path or from a successful merge that
already holds the queue reference. This optimization matters most for
small sequential workloads that successfully merge into larger requests.

Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable")
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 28c2d931e75ea..72c8ac805882c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3246,7 +3246,7 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (!rq)
 		blk_queue_exit(q);
 	else
-		blk_mq_free_request(rq);
+		rq_list_push(&plug->cached_rqs, rq);
 }

 #ifdef CONFIG_BLK_MQ_STACKING
-- 
2.53.0-Meta
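
The effect of the one-line change above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `model_rq`, `model_rq_list` and the `model_*` helpers are hypothetical stand-ins and not the kernel's `struct request` or rq_list implementation. It models a submission whose bio was merged into an existing request, so the request popped from the cache ends up unused, and compares releasing it (as with blk_mq_free_request() before the patch) against pushing it back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a preallocated request. */
struct model_rq {
	struct model_rq *next;
	int tag;
};

/* Hypothetical stand-in for the plug's list of cached requests. */
struct model_rq_list {
	struct model_rq *head;
};

/* Push a request onto the front of the list. */
static void model_list_push(struct model_rq_list *l, struct model_rq *rq)
{
	assert(rq != NULL);
	rq->next = l->head;
	l->head = rq;
}

/* Pop the first request, or return NULL if the list is empty. */
static struct model_rq *model_list_pop(struct model_rq_list *l)
{
	struct model_rq *rq = l->head;

	if (rq) {
		l->head = rq->next;
		rq->next = NULL;
	}
	return rq;
}

/* Count the requests currently cached. */
static int model_list_len(const struct model_rq_list *l)
{
	const struct model_rq *rq;
	int n = 0;

	for (rq = l->head; rq; rq = rq->next)
		n++;
	return n;
}

/*
 * Model one submission whose bio merged into an existing request, so
 * the popped cached request is not consumed. With reinsert != 0 the
 * request goes back on the list; otherwise it is counted as released.
 * Returns the number of requests still cached afterwards.
 */
static int model_submit_merged(struct model_rq_list *cache, int reinsert,
			       int *released)
{
	struct model_rq *rq = model_list_pop(cache);

	if (rq) {
		if (reinsert)
			model_list_push(cache, rq);
		else
			(*released)++;
	}
	return model_list_len(cache);
}
```

In this model, a run of merged submissions leaves the cache untouched when the request is reinserted, while the release variant drains one cached request per merge, which is the cost the commit message attributes to small sequential workloads.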


* [PATCH v1] mtip32xx: fix use-after-free on service thread failure
From: Yuho Choi @ 2026-05-25 16:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Thomas Fourier, Martin K . Petersen, Andy Shevchenko, Al Viro,
	linux-block, linux-kernel, Yuho Choi

If service thread creation fails after device_add_disk() succeeds,
mtip_block_initialize() calls del_gendisk() and then falls through to
put_disk(). Since mtip32xx uses .free_disk to free struct driver_data,
put_disk() can release dd on the added-disk path.

The same unwind then continues to use dd for blk_mq_free_tag_set() and
mtip_hw_exit(), and mtip_pci_probe() can later free dd again. This can
cause a use-after-free and double free.

Track whether the disk was added in the current initialization call.
For the post-add service-thread failure path, remove the disk, release
the local hardware resources, and return without dropping the final disk
reference. The probe error path can then finish its cleanup and call
put_disk() after it is done using dd. Keep the pre-add path using
put_disk() before blk_mq_free_tag_set(), and clear dd->disk so the outer
probe cleanup frees dd directly.

Fixes: e8b58ef09e84 ("mtip32xx: fix device removal")
Signed-off-by: Yuho Choi <dbgh9129@gmail.com>
---
 drivers/block/mtip32xx/mtip32xx.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 567192e371a8..ccf5c164cf46 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3405,6 +3405,7 @@ static int mtip_block_initialize(struct driver_data *dd)
 		.max_segment_size	= 0x400000,
 	};
 	int rv = 0, wait_for_rebuild = 0;
+	bool disk_added = false;
 	sector_t capacity;
 	unsigned int index = 0;
 
@@ -3438,6 +3439,7 @@ static int mtip_block_initialize(struct driver_data *dd)
 		dev_err(&dd->pdev->dev,
 			"Unable to allocate request queue\n");
 		rv = -ENOMEM;
+		dd->disk = NULL;
 		goto block_queue_alloc_init_error;
 	}
 	dd->queue		= dd->disk->queue;
@@ -3496,6 +3498,7 @@ static int mtip_block_initialize(struct driver_data *dd)
 	rv = device_add_disk(&dd->pdev->dev, dd->disk, mtip_disk_attr_groups);
 	if (rv)
 		goto read_capacity_error;
+	disk_added = true;
 
 	if (dd->mtip_svc_handler) {
 		set_bit(MTIP_DDF_INIT_DONE_BIT, &dd->dd_flag);
@@ -3511,7 +3514,9 @@ static int mtip_block_initialize(struct driver_data *dd)
 		dev_err(&dd->pdev->dev, "service thread failed to start\n");
 		dd->mtip_svc_handler = NULL;
 		rv = -EFAULT;
-		goto kthread_run_error;
+		if (disk_added)
+			goto kthread_run_error;
+		goto read_capacity_error;
 	}
 	wake_up_process(dd->mtip_svc_handler);
 	if (wait_for_rebuild == MTIP_FTL_REBUILD_MAGIC)
@@ -3522,6 +3527,10 @@ static int mtip_block_initialize(struct driver_data *dd)
 kthread_run_error:
 	/* Delete our gendisk. This also removes the device from /dev */
 	del_gendisk(dd->disk);
+	mtip_hw_debugfs_exit(dd);
+	blk_mq_free_tag_set(&dd->tags);
+	mtip_hw_exit(dd);
+	return rv;
 read_capacity_error:
 init_hw_cmds_error:
 	mtip_hw_debugfs_exit(dd);
@@ -3529,6 +3538,7 @@ static int mtip_block_initialize(struct driver_data *dd)
 	ida_free(&rssd_index_ida, index);
 ida_get_error:
 	put_disk(dd->disk);
+	dd->disk = NULL;
 block_queue_alloc_init_error:
 	blk_mq_free_tag_set(&dd->tags);
 block_queue_alloc_tag_error:
@@ -3839,7 +3849,10 @@ static int mtip_pci_probe(struct pci_dev *pdev,
 	}
 
 iomap_err:
-	kfree(dd);
+	if (dd->disk)
+		put_disk(dd->disk);
+	else
+		kfree(dd);
 	pci_set_drvdata(pdev, NULL);
 	return rv;
 done:
-- 
2.43.0
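
The ordering problem described in the commit message can be sketched in user space. The following is an illustrative model only: `model_dd`, `model_disk` and the `model_*` functions are hypothetical names, and the release step merely plays the role that the commit message assigns to the driver's .free_disk callback. The point it shows is that once dropping the last disk reference also frees the owning structure, every use of that structure has to happen before the final put.

```c
#include <assert.h>
#include <stdlib.h>

struct model_dd;

/* Hypothetical refcounted disk whose release also frees its owner. */
struct model_disk {
	int refs;
	struct model_dd *owner;
};

/* Hypothetical stand-in for the driver's per-device data. */
struct model_dd {
	struct model_disk *disk;
	int tags_freed;
};

/* Counts how many times an owner structure has been released. */
static int model_dd_frees;

/* Drop one reference; the last put frees both the disk and its owner. */
static void model_put_disk(struct model_disk *disk)
{
	assert(disk->refs > 0);
	if (--disk->refs == 0) {
		free(disk->owner);	/* models the free callback */
		free(disk);
		model_dd_frees++;
	}
}

/* Allocate an owner plus a disk holding a single reference. */
static struct model_dd *model_alloc(void)
{
	struct model_dd *dd = calloc(1, sizeof(*dd));
	struct model_disk *disk = calloc(1, sizeof(*disk));

	if (!dd || !disk) {
		free(dd);
		free(disk);
		return NULL;
	}
	disk->refs = 1;
	disk->owner = dd;
	dd->disk = disk;
	return dd;
}

/*
 * Safe unwind order: finish every use of dd first, then drop the final
 * reference. Returns the tags_freed value observed before the release.
 */
static int model_unwind(struct model_dd *dd)
{
	int seen;

	dd->tags_freed = 1;	/* stands in for tag set and hardware teardown */
	seen = dd->tags_freed;
	model_put_disk(dd->disk);	/* dd must not be touched after this */
	return seen;
}
```

Swapping the two halves of `model_unwind()` would touch freed memory, and a later explicit free of the owner would be a second release of the same object. Those are the two failure modes the patch description names.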



* Re: [PATCH 1/1] rust: block: fix GenDiskBuilder failure cleanup
From: Andreas Hindborg @ 2026-05-25 17:42 UTC (permalink / raw)
  To: Ren Wei, linux-block, rust-for-linux
  Cc: ojeda, boqun, gary, bjorn3_gh, lossin, aliceryhl, tmgross, dakr,
	daniel.almeida, axboe, sunke, tamird, yuantan098, bird,
	royenheart, n05ec
In-Reply-To: <b6411cc055080c984a67bfad72fd683aa84b8e13.1779596478.git.royenheart@gmail.com>

"Ren Wei" <n05ec@lzu.edu.cn> writes:

> From: Haoze Xie <royenheart@gmail.com>
>
> If GenDiskBuilder::build() fails after __blk_mq_alloc_disk(), the
> allocated gendisk is left behind until the caller drops the last
> tagset reference.
>
> Handle the failure path by releasing the temporary gendisk first,
> then converting the foreign queue data back, so probe failures clean
> up both resources before returning an error.
>
> Fixes: 3253aba3408aa ("rust: block: introduce `kernel::block::mq` module")
> Cc: stable@kernel.org
> Reported-by: Yuan Tan <yuantan098@gmail.com>
> Reported-by: Xin Liu <bird@lzu.edu.cn>
> Signed-off-by: Haoze Xie <royenheart@gmail.com>
> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>

Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>

Thanks for reporting and fixing!

I see we also lack a `put_disk` in `GenDisk::drop` after `del_gendisk`.
Do you want to patch that as well?

Best regards,
Andreas Hindborg




* Re: [PATCH v3 7/7] rust: doctest: use vertical import style
From: Miguel Ojeda @ 2026-05-25 17:49 UTC (permalink / raw)
  To: Alvin Sun
  Cc: Arnd Bergmann, Greg Kroah-Hartman, Miguel Ojeda, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, Danilo Krummrich, Jens Axboe,
	Brendan Higgins, David Gow, Rae Moar, rust-for-linux, linux-block,
	linux-kselftest, kunit-dev
In-Reply-To: <20260521-miscdev-use-format-v3-7-56240ca70d0c@linux.dev>

On Thu, May 21, 2026 at 8:57 AM Alvin Sun <alvin.sun@linux.dev> wrote:
>
> Convert `use` imports to vertical layout for better readability and
> maintainability.
>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>

Please check the titles used by previous commits for files, e.g.
"scripts: rust: " would probably be the prefix here. (No need to
resend just for this, maintainers may fix it on the fly).

The patch looks OK otherwise of course, thanks!

Cheers,
Miguel


* Re: [PATCH v3 0/7] rust: use vertical import style and remove redundant imports
From: Miguel Ojeda @ 2026-05-25 17:51 UTC (permalink / raw)
  To: Alvin Sun
  Cc: Arnd Bergmann, Greg Kroah-Hartman, Miguel Ojeda, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, Andreas Hindborg,
	Alice Ryhl, Trevor Gross, Danilo Krummrich, Jens Axboe,
	Brendan Higgins, David Gow, Rae Moar, rust-for-linux, linux-block,
	linux-kselftest, kunit-dev, Onur Özkan
In-Reply-To: <20260521-miscdev-use-format-v3-0-56240ca70d0c@linux.dev>

On Thu, May 21, 2026 at 8:57 AM Alvin Sun <alvin.sun@linux.dev> wrote:
>
> Adopt the vertical import style and drop redundant imports already
> re-exported via `kernel::prelude`.

I can take this if block and misc Ack. The changes are straightforward anyway.

Thanks!

Cheers,
Miguel


* Re: [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices
From: Tal Zussman @ 2026-05-25 18:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
	Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
	linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <ahPejprWrEjsh7aC@infradead.org>

On 5/25/26 1:30 AM, Christoph Hellwig wrote:
> On Fri, May 22, 2026 at 07:17:15PM -0400, Tal Zussman wrote:
>> A: So this actually seems legit... doesn't look like anything actually calls 
>> blkdev_write_begin() or blkdev_write_end(), unless I'm missing something.
>> block_write_begin_iocb() usage seems necessary for bh-based filesystems, but
>> block devices seem to use iomap for writes unconditionally.
> 
> Yes.  Maybe send a separate patch to remove these now unused methods?
> Or I could do that since I forgot to remove them when I should have.
> 

I'll send a patch. I'll also drop the block_write_begin_iocb() change from this
series, as it becomes unused.


* [PATCH] block: remove blkdev_write_begin() and blkdev_write_end()
From: Tal Zussman @ 2026-05-25 18:25 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig; +Cc: linux-block, linux-kernel, Tal Zussman

Remove blkdev_write_begin(), blkdev_write_end(), and their entries in
def_blk_aops. These have been unreachable since commit 487c607df790
("block: use iomap for writes to block devices") switched block device
buffered writes from generic_perform_write() to
iomap_file_buffered_write(), which bypasses aops->write_begin/end.

Signed-off-by: Tal Zussman <tz2294@columbia.edu>
---
 block/fops.c | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index bb6642b45937..ffe7b2042f4e 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -499,36 +499,12 @@ static void blkdev_readahead(struct readahead_control *rac)
 	mpage_readahead(rac, blkdev_get_block);
 }
 
-static int blkdev_write_begin(const struct kiocb *iocb,
-			      struct address_space *mapping, loff_t pos,
-			      unsigned len, struct folio **foliop,
-			      void **fsdata)
-{
-	return block_write_begin(mapping, pos, len, foliop, blkdev_get_block);
-}
-
-static int blkdev_write_end(const struct kiocb *iocb,
-			    struct address_space *mapping,
-			    loff_t pos, unsigned len, unsigned copied,
-			    struct folio *folio, void *fsdata)
-{
-	int ret;
-	ret = block_write_end(pos, len, copied, folio);
-
-	folio_unlock(folio);
-	folio_put(folio);
-
-	return ret;
-}
-
 const struct address_space_operations def_blk_aops = {
 	.dirty_folio	= block_dirty_folio,
 	.invalidate_folio = block_invalidate_folio,
 	.read_folio	= blkdev_read_folio,
 	.readahead	= blkdev_readahead,
 	.writepages	= blkdev_writepages,
-	.write_begin	= blkdev_write_begin,
-	.write_end	= blkdev_write_end,
 	.migrate_folio	= buffer_migrate_folio_norefs,
 	.is_dirty_writeback = buffer_check_dirty_writeback,
 };

---
base-commit: e7ae89a0c97ce2b68b0983cd01eda67cf373517d
change-id: 20260525-blk-write-cleanup-afedb5d1ab84

Best regards,
-- 
Tal Zussman <tz2294@columbia.edu>



* [syzbot] [block?] possible deadlock in blk_request_module (2)
From: syzbot @ 2026-05-25 20:24 UTC (permalink / raw)
  To: axboe, linux-block, linux-kernel, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    c1ecb239fa34 Add linux-next specific files for 20260522
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=136a20ee580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=77a9211ff284de54
dashboard link: https://syzkaller.appspot.com/bug?extid=fb0ff9bfe34ad282ebd4
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/4cb88c910144/disk-c1ecb239.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/4a9bc938cf88/vmlinux-c1ecb239.xz
kernel image: https://storage.googleapis.com/syzbot-assets/684f1e33f264/bzImage-c1ecb239.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+fb0ff9bfe34ad282ebd4@syzkaller.appspotmail.com

======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G             L     
------------------------------------------------------
syz.2.2988/16485 is trying to acquire lock:
ffffffff8e94e518 (major_names_lock){+.+.}-{4:4}, at: blk_probe_dev block/genhd.c:881 [inline]
ffffffff8e94e518 (major_names_lock){+.+.}-{4:4}, at: blk_request_module+0x35/0x2a0 block/genhd.c:897

but task is already holding lock:
ffffffff8e0756d8 (system_transition_mutex){+.+.}-{4:4}, at: software_resume+0x47/0x4c0 kernel/power/hibernate.c:1022

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #7 (system_transition_mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       lock_system_sleep+0x49/0x70 kernel/power/main.c:71
       disk_store+0xa7/0x500 kernel/power/hibernate.c:1217
       kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
       iter_file_splice_write+0x9a6/0x10f0 fs/splice.c:736
       do_splice_from fs/splice.c:936 [inline]
       direct_splice_actor+0x104/0x160 fs/splice.c:1159
       splice_direct_to_actor+0x545/0xc80 fs/splice.c:1103
       do_splice_direct_actor fs/splice.c:1202 [inline]
       do_splice_direct+0x19b/0x2a0 fs/splice.c:1228
       do_sendfile+0x547/0x7e0 fs/read_write.c:1372
       __do_sys_sendfile64 fs/read_write.c:1433 [inline]
       __se_sys_sendfile64+0x144/0x1a0 fs/read_write.c:1419
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #6 (&of->mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       kernfs_seq_start+0x5c/0x420 fs/kernfs/file.c:172
       seq_read_iter+0x3f8/0xe20 fs/seq_file.c:226
       new_sync_read fs/read_write.c:493 [inline]
       vfs_read+0x58b/0xa80 fs/read_write.c:574
       ksys_read+0x156/0x270 fs/read_write.c:717
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #5 (&p->lock){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       seq_read_iter+0xb8/0xe20 fs/seq_file.c:183
       lo_rw_aio+0xc80/0xf00 include/linux/percpu-rwsem.h:-1
       do_req_filebacked drivers/block/loop.c:435 [inline]
       loop_handle_cmd drivers/block/loop.c:1941 [inline]
       loop_process_work+0x92a/0x11b0 drivers/block/loop.c:1976
       process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
       process_scheduled_works kernel/workqueue.c:3401 [inline]
       worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
       kthread+0x388/0x470 kernel/kthread.c:436
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #4 ((work_completion)(&worker->work)){+.+.}-{0:0}:
       process_one_work+0x8d7/0x1630 kernel/workqueue.c:3294
       process_scheduled_works kernel/workqueue.c:3401 [inline]
       worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
       kthread+0x388/0x470 kernel/kthread.c:436
       ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #3 ((wq_completion)loop8){+.+.}-{0:0}:
       touch_wq_lockdep_map+0xcb/0x180 kernel/workqueue.c:4033
       __flush_workqueue+0x14b/0x14f0 kernel/workqueue.c:4075
       drain_workqueue+0xd3/0x390 kernel/workqueue.c:4239
       __loop_clr_fd drivers/block/loop.c:1130 [inline]
       lo_release+0x287/0x8f0 drivers/block/loop.c:1767
       bdev_release+0x541/0x660 block/bdev.c:-1
       blkdev_release+0x15/0x20 block/fops.c:705
       __fput+0x461/0xa70 fs/file_table.c:510
       task_work_run+0x1d9/0x270 kernel/task_work.c:233
       exit_task_work include/linux/task_work.h:40 [inline]
       do_exit+0x70f/0x22c0 kernel/exit.c:1004
       do_group_exit+0x21b/0x2d0 kernel/exit.c:1147
       get_signal+0x1284/0x1330 kernel/signal.c:3038
       arch_do_signal_or_restart+0xbc/0x840 arch/x86/kernel/signal.c:337
       __exit_to_user_mode_loop kernel/entry/common.c:64 [inline]
       exit_to_user_mode_loop+0x8c/0x4d0 kernel/entry/common.c:98
       __exit_to_user_mode_prepare include/linux/irq-entry-common.h:207 [inline]
       syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:230 [inline]
       syscall_exit_to_user_mode include/linux/entry-common.h:318 [inline]
       do_syscall_64+0x33e/0x560 arch/x86/entry/syscall_64.c:100
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #2 (&disk->open_mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       __del_gendisk+0x127/0x980 block/genhd.c:710
       del_gendisk+0xe7/0x160 block/genhd.c:823
       loop_remove+0x42/0xc0 drivers/block/loop.c:2136
       loop_control_remove drivers/block/loop.c:2195 [inline]
       loop_control_ioctl+0x4ba/0x5b0 drivers/block/loop.c:2237
       vfs_ioctl fs/ioctl.c:51 [inline]
       __do_sys_ioctl fs/ioctl.c:597 [inline]
       __se_sys_ioctl+0xff/0x170 fs/ioctl.c:583
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #1 (&set->update_nr_hwq_lock){++++}-{4:4}:
       down_read+0x97/0x200 kernel/locking/rwsem.c:1568
       add_disk_fwnode+0xe7/0x480 block/genhd.c:596
       add_disk include/linux/blkdev.h:794 [inline]
       loop_add+0x86e/0xb50 drivers/block/loop.c:2108
       blk_probe_dev block/genhd.c:884 [inline]
       blk_request_module+0x27d/0x2a0 block/genhd.c:-1
       blkdev_get_no_open+0x3f/0xe0 block/bdev.c:833
       blkdev_open+0x1f5/0x620 block/fops.c:688
       do_dentry_open+0x83d/0x13e0 fs/open.c:947
       vfs_open+0x3b/0x350 fs/open.c:1052
       do_open fs/namei.c:4688 [inline]
       path_openat+0x2eea/0x3960 fs/namei.c:4847
       do_file_open+0x23e/0x4a0 fs/namei.c:4876
       do_sys_openat2+0x115/0x200 fs/open.c:1368
       do_sys_open fs/open.c:1374 [inline]
       __do_sys_openat fs/open.c:1390 [inline]
       __se_sys_openat fs/open.c:1385 [inline]
       __x64_sys_openat+0x138/0x170 fs/open.c:1385
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (major_names_lock){+.+.}-{4:4}:
       check_prev_add kernel/locking/lockdep.c:3167 [inline]
       check_prevs_add kernel/locking/lockdep.c:3286 [inline]
       validate_chain kernel/locking/lockdep.c:3910 [inline]
       __lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
       lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
       __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
       mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
       blk_probe_dev block/genhd.c:881 [inline]
       blk_request_module+0x35/0x2a0 block/genhd.c:897
       blkdev_get_no_open+0x3f/0xe0 block/bdev.c:833
       bdev_file_open_by_dev+0xa0/0x240 block/bdev.c:1054
       swsusp_check+0x56/0x490 kernel/power/swap.c:1571
       software_resume+0x51/0x4c0 kernel/power/hibernate.c:1023
       resume_store+0x333/0x4f0 kernel/power/hibernate.c:1307
       kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
       new_sync_write fs/read_write.c:595 [inline]
       vfs_write+0x629/0xba0 fs/read_write.c:688
       ksys_write+0x156/0x270 fs/read_write.c:740
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

other info that might help us debug this:

Chain exists of:
  major_names_lock --> &of->mutex --> system_transition_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(system_transition_mutex);
                               lock(&of->mutex);
                               lock(system_transition_mutex);
  lock(major_names_lock);

 *** DEADLOCK ***

5 locks held by syz.2.2988/16485:
 #0: ffff88803c6b3528 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1260
 #1: ffff8880354b8480 (sb_writers#7){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2733 [inline]
 #1: ffff8880354b8480 (sb_writers#7){.+.+}-{0:0}, at: vfs_write+0x22d/0xba0 fs/read_write.c:684
 #2: ffff88803cf6d878 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1df/0x540 fs/kernfs/file.c:343
 #3: ffff88801e299698 (kn->active#62){.+.+}-{0:0}, at: kernfs_get_active_of fs/kernfs/file.c:80 [inline]
 #3: ffff88801e299698 (kn->active#62){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x232/0x540 fs/kernfs/file.c:344
 #4: ffffffff8e0756d8 (system_transition_mutex){+.+.}-{4:4}, at: software_resume+0x47/0x4c0 kernel/power/hibernate.c:1022

stack backtrace:
CPU: 0 UID: 0 PID: 16485 Comm: syz.2.2988 Tainted: G             L      syzkaller #0 PREEMPT_{RT,(full)} 
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_circular_bug+0x2e1/0x300 kernel/locking/lockdep.c:2045
 check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2177
 check_prev_add kernel/locking/lockdep.c:3167 [inline]
 check_prevs_add kernel/locking/lockdep.c:3286 [inline]
 validate_chain kernel/locking/lockdep.c:3910 [inline]
 __lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
 lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
 __mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
 mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
 blk_probe_dev block/genhd.c:881 [inline]
 blk_request_module+0x35/0x2a0 block/genhd.c:897
 blkdev_get_no_open+0x3f/0xe0 block/bdev.c:833
 bdev_file_open_by_dev+0xa0/0x240 block/bdev.c:1054
 swsusp_check+0x56/0x490 kernel/power/swap.c:1571
 software_resume+0x51/0x4c0 kernel/power/hibernate.c:1023
 resume_store+0x333/0x4f0 kernel/power/hibernate.c:1307
 kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
 new_sync_write fs/read_write.c:595 [inline]
 vfs_write+0x629/0xba0 fs/read_write.c:688
 ksys_write+0x156/0x270 fs/read_write.c:740
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f29a11dce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f299f436028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f29a1455fa0 RCX: 00007f29a11dce59
RDX: 0000000000000012 RSI: 0000200000000040 RDI: 0000000000000004
RBP: 00007f29a1272d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f29a1456038 R14: 00007f29a1455fa0 R15: 00007fff13825808
 </TASK>
block device autoloading is deprecated and will be removed.
PM: Image not found (code -22)
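
The dependency chain in the report above can be read as a directed graph in which an edge A -> B means "B was acquired while A was held". The sketch below is a heavily simplified user-space model of that idea and not how lockdep itself is implemented: the enum names, `record_order()`, `reaches()` and `closes_cycle()` are all hypothetical. It loads the seven existing edges (#1 through #7) and checks whether taking a new lock while holding another would close a cycle.

```c
#include <assert.h>
#include <string.h>

/* One node per lock class named in the report, in chain order. */
enum {
	MAJOR_NAMES,	/* major_names_lock */
	UPDATE_NR_HWQ,	/* &set->update_nr_hwq_lock */
	OPEN_MUTEX,	/* &disk->open_mutex */
	WQ_LOOP,	/* (wq_completion)loop8 */
	WORK_ITEM,	/* (work_completion)(&worker->work) */
	SEQ_LOCK,	/* &p->lock */
	OF_MUTEX,	/* &of->mutex */
	SYS_TRANSITION,	/* system_transition_mutex */
	NR_LOCKS
};

/* edge[a][b] != 0 means lock b was acquired while lock a was held. */
static unsigned char edge[NR_LOCKS][NR_LOCKS];

/* Record one observed "held, then acquired" ordering. */
static void record_order(int held, int acquired)
{
	assert(held != acquired);
	edge[held][acquired] = 1;
}

/* Depth-first search: is there a path of recorded orderings from -> to? */
static int reaches(int from, int to, unsigned char *seen)
{
	int next;

	if (from == to)
		return 1;
	seen[from] = 1;
	for (next = 0; next < NR_LOCKS; next++)
		if (edge[from][next] && !seen[next] && reaches(next, to, seen))
			return 1;
	return 0;
}

/*
 * Acquiring "wanted" while holding "held" adds the edge held -> wanted.
 * That closes a cycle if "held" is already reachable from "wanted".
 */
static int closes_cycle(int held, int wanted)
{
	unsigned char seen[NR_LOCKS];

	memset(seen, 0, sizeof(seen));
	return reaches(wanted, held, seen);
}

/* Load the existing chain: each lock was taken under the previous one. */
static void load_reported_chain(void)
{
	int i;

	memset(edge, 0, sizeof(edge));
	for (i = MAJOR_NAMES; i < SYS_TRANSITION; i++)
		record_order(i, i + 1);
}
```

With the chain loaded, `closes_cycle(SYS_TRANSITION, MAJOR_NAMES)` reports a cycle, which corresponds to entry #0 of the report: software_resume() holds system_transition_mutex and then reaches blk_request_module(), which takes major_names_lock.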


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup


* Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Tetsuo Handa @ 2026-05-26  0:25 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: Bart Van Assche, Christoph Hellwig, Damien Le Moal, linux-block,
	LKML, Andrew Morton, Linus Torvalds
In-Reply-To: <ahRocb0Vs_m6RF_O@fedora>

On 2026/05/26 0:19, Ming Lei wrote:
> On Mon, May 25, 2026 at 12:40:19PM +0900, Tetsuo Handa wrote:
>> Some commit which was merged in the merge window for 7.1 broke the loop
>> driver; a race window where lo_release() clears the backing file via
>> __loop_clr_fd() despite some I/O requests are pending was introduced [1][2].
>>
>> The exact commit which changed the behavior is not known due to lack of
>> reproducer and timing dependent behavior, but it seems that we need to
>> solve this problem in the loop driver despite there was no change for the
>> loop driver during this merge window.
>>
>> To close this race, try to flush pending I/O requests. However, calling
>> drain_workqueue() from __loop_clr_fd() with disk->open_mutex held causes
>> lockdep warnings [3][4]. We need to flush pending I/O requests without
>> disk->open_mutex held.
> 
> No, please don't workaround before root cause.
> 
> No proof shows that the issue is in block layer or loop driver, the IO isn't
> expected, you need to figure out why btrfs still issues IO after this loop
> disk is closed by everyone and writeback is done.
> 
> https://syzkaller.appspot.com/x/log.txt?x=101e4702580000
> 

Of course we should try to figure out the root cause first, but how can we do that?

  Absolute fact:

    This problem started happening no later than next-20260413 in the linux-next.git tree.
    ( syzbot was unable to test next-202604{03,06,07,08,09,10} due to a different bug. )

    This problem is still happening as of v7.1-rc5 in the linux.git tree.

    No one has succeeded in establishing steps to reproduce this problem.

    No one has identified the exact commit that is causing this problem.

  Likely fact:

    Since this problem did not occur with next-20260402 in the linux-next.git tree up to 2026/04/13 16:31,
    it did not exist as of next-20260402 in the linux-next.git tree.

    Since this problem did not occur up to v7.0, it did not exist as of v7.0.
    (Although last-minute changes for v7.0-rc{6,7} or v7.0 could be the culprit,
     the v7.1 merge window, which accepts big changes, is more likely.)

  My guess:

    The culprit commit is between commit a028739a4330 ("Merge tag 'block-7.0-20260305' of
    git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux") and commit 7fe6ac157b7e ("Merge tag
    'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux"), because
    changes related to bio handling were merged in this period.

    "git log --oneline block/ drivers/block/" between next-20260402 and next-20260413 shows the following diff:

--------------------
-da93b347876b Merge branch 'master' of https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
-9b75c6e054b7 Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git
-265720725a47 Merge branch 'fs-next' of linux-next
-ac9e99118030 Merge branch into tip/master: 'x86/cleanups'
-8ea5c0750d36 zram: do not forget to endio for partial discard requests
-0476d2e93477 zram: change scan_slots to return void
-1207420afea8 zram: propagate read_from_bdev_async() errors
-aafa569edb41 zram: optimize LZ4 dictionary compression performance
-24c76a259819 Merge branch 'for-7.1/block' into for-next
-eca714c0aac1 Merge branch 'vfs-7.1.bh.metadata' into vfs.all
+7d8d908556ca Merge branch 'master' of https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
+4391dc7df11d Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git
+18c6a4c24187 Merge branch 'fs-next' of linux-next
+7f828a86cfef Merge branch into tip/master: 'x86/cleanups'
+716aa108c5bb zram: reject unrecognized type= values in recompress_store()
+3470a1d34f40 zram: do not forget to endio for partial discard requests
+88a57e158619 Merge branch 'for-7.1/block' into for-next
+36446de0c30c ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
+f2bab85781e8 Merge branch 'vfs-7.1.bh.metadata' into vfs.all
+9357dc97533a Merge branch 'vfs-7.1.integrity' into vfs.all
+e0b15707598c Merge branch 'for-7.1/block' into for-next
+539fb773a3f7 block: refactor blkdev_zone_mgmt_ioctl
+ddc1dfffcbea Merge branch 'for-7.1/block' into for-next
+365ea7cc6244 ublk: allow buffer registration before device is started
+5e864438e285 ublk: replace xarray with IDA for shmem buffer index allocation
+8ea8566a9aee ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
+211ff1602b67 ublk: verify all pages in multi-page bvec fall within registered range
+23b3b6f0b584 ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
+cb793ff1353d Merge branch 'for-7.1/block' into for-next
+92c3737a2473 block: add a bio_submit_or_kill helper
+6fa747550e35 block: factor out a bio_await helper
+65565ca5f99b block: unify the synchronous bi_end_io callbacks
+cc91702dedc5 Merge branch 'for-7.1/block' into for-next
+8a34e88769f6 ublk: eliminate permanent pages[] array from struct ublk_buf
+08677040a911 ublk: enable UBLK_F_SHMEM_ZC feature flag
+4d4a512a1f87 ublk: add PFN-based buffer matching in I/O path
+2fb0ded237bb ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands
+dec615fa43c3 Merge branch 'for-7.1/block' into for-next
+fa0cac9a5158 drbd: use get_random_u64() where appropriate
+0b581d2fb4cf Merge branch 'for-7.1/block' into for-next
+a9c4b1d37622 drbd: remove DRBD_GENLA_F_MANDATORY flag handling
+d436cfb3a259 Merge branch 'for-7.1/block' into for-next
+e9b004ff8306 blk-wbt: remove WARN_ON_ONCE from wbt_init_enable_default()
+09ebc43b5edc Merge branch 'for-7.1/block' into for-next
+0842186d2c4e ublk: reset per-IO canceled flag on each fetch
+cba82993308d zram: change scan_slots to return void
+bf989ade270d zram: propagate read_from_bdev_async() errors
+f0f6f7871430 zram: optimize LZ4 dictionary compression performance
+301f39220096 zram: unify and harden algo/priority params handling
+cedfa028b54e zram: remove chained recompression
+5004a27edba5 zram: drop ->num_active_comps
+ed19b9d5504f zram: do not autocorrect bad recompression parameters
+241f9005b1c8 zram: do not permit params change after init
+c09fb53d293a zram: use statically allocated compression algorithm names
+6030f93e5c71 Merge branch 'for-7.1/io_uring-fuse' into for-next
+29ebfdd7db89 io_uring/rsrc: rename io_buffer_register_bvec()/io_buffer_unregister_bvec()
+6568edbea553 Merge branch 'for-7.1/block' into for-next
+a175ee827331 block: use sysfs_emit in sysfs show functions
+c691e4b0d80b bio: fix kmemleak false positives from percpu bio alloc cache
 f91ffe89b201 blk-iocost: fix busy_level reset when no IOs complete
 23308af722fe blk-cgroup: fix disk reference leak in blkcg_maybe_throttle_current()
 b2a78fec344e zloop: add max_open_zones option
 2a2f520fda82 block: fix zones_cond memory leak on zone revalidation error paths
 267ec4d7223a loop: fix partition scan race between udev and loop_reread_partitions()
 499d2d2f4cf9 sed-opal: Add STACK_RESET command
-c61825bb46bc Merge branch 'vfs-7.1.integrity' into vfs.all
-fc2093641448 zram: unify and harden algo/priority params handling
-4fd453f16446 zram: remove chained recompression
-e2b717936d1a zram: drop ->num_active_comps
-3578bb37f7d1 zram: do not autocorrect bad recompression parameters
-5331373bfebd zram: do not permit params change after init
 2b31e86387e6 drbd: Balance RCU calls in drbd_adm_dump_devices()
 f9480ecf939d bdev: Drop pointless invalidate_inode_buffers() call
-b00ff1b25f85 zram: use statically allocated compression algorithm names
 630bbba45cfd drbd: use genl pre_doit/post_doit
 829def1e35ca zloop: forget write cache on force removal
 eff8d1656e83 zloop: refactor zloop_rw
--------------------

    "git log --oneline block/" between next-20260402 and next-20260413 shows the following diff:

--------------------
-9b75c6e054b7 Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git
-eca714c0aac1 Merge branch 'vfs-7.1.bh.metadata' into vfs.all
+4391dc7df11d Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git
+f2bab85781e8 Merge branch 'vfs-7.1.bh.metadata' into vfs.all
+539fb773a3f7 block: refactor blkdev_zone_mgmt_ioctl
+92c3737a2473 block: add a bio_submit_or_kill helper
+6fa747550e35 block: factor out a bio_await helper
+65565ca5f99b block: unify the synchronous bi_end_io callbacks
+e9b004ff8306 blk-wbt: remove WARN_ON_ONCE from wbt_init_enable_default()
+a175ee827331 block: use sysfs_emit in sysfs show functions
+c691e4b0d80b bio: fix kmemleak false positives from percpu bio alloc cache
 f91ffe89b201 blk-iocost: fix busy_level reset when no IOs complete
 23308af722fe blk-cgroup: fix disk reference leak in blkcg_maybe_throttle_current()
 2a2f520fda82 block: fix zones_cond memory leak on zone revalidation error paths
--------------------

Possible approaches for finding the exact commit that is causing this problem:

  (a) Revert all block layer changes in linux.git and monitor for one week to see whether this
      problem still happens (linux.git hits this problem more frequently than
      linux-next.git).

  (b) Revert all block layer changes in linux-next.git and monitor for two weeks to see
      whether this problem still happens (less reliable than linux.git, but a candidate).

  (c) Let sashiko review all changes between v7.0 and v7.1 that may cause this problem.
      (Human developers have no time to review. But is investigation with a moving baseline
      commit possible for sashiko?)

  (d) Any ideas?

P.S. Since the loop driver is critical infrastructure for testing filesystems with syzbot,
I want this problem to be addressed before 7.1 is released.

^ permalink raw reply

* Re: [PATCH 1/1] rust: block: fix GenDiskBuilder failure cleanup
From: Haoze Xie @ 2026-05-26  0:43 UTC (permalink / raw)
  To: Andreas Hindborg, Ren Wei, linux-block, rust-for-linux
  Cc: ojeda, boqun, gary, bjorn3_gh, lossin, aliceryhl, tmgross, dakr,
	daniel.almeida, axboe, sunke, tamird, yuantan098, bird
In-Reply-To: <87qzmz75d9.fsf@t14s.mail-host-address-is-not-set>

On 5/26/2026 1:42 AM, Andreas Hindborg wrote:
> "Ren Wei" <n05ec@lzu.edu.cn> writes:
> 
>> From: Haoze Xie <royenheart@gmail.com>
>>
>> If GenDiskBuilder::build() fails after __blk_mq_alloc_disk(), the
>> allocated gendisk is left behind until the caller drops the last
>> tagset reference.
>>
>> Handle the failure path by releasing the temporary gendisk first,
>> then converting the foreign queue data back, so probe failures clean
>> up both resources before returning an error.
>>
>> Fixes: 3253aba3408aa ("rust: block: introduce `kernel::block::mq` module")
>> Cc: stable@kernel.org
>> Reported-by: Yuan Tan <yuantan098@gmail.com>
>> Reported-by: Xin Liu <bird@lzu.edu.cn>
>> Signed-off-by: Haoze Xie <royenheart@gmail.com>
>> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
> 
> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
> 
> Thanks for reporting and fixing!
> 
> I see we also lack a `put_disk` in `GenDisk::drop` after `del_gendisk`.
> Do you want to patch that as well?
> 
> Best regards,
> Andreas Hindborg
> 
> 

Hi Andreas,

Thanks for the review and for pointing this out.

We will run some additional experiments around the `GenDisk::drop()`
teardown path first. If it does turn out to be a real issue, we will
send a follow-up patch.

Best regards,
Haoze

^ permalink raw reply

* Re: [PATCH] blk-mq: reinsert cached request to the list
From: Ming Lei @ 2026-05-26  1:44 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, axboe, Keith Busch, Christoph Hellwig
In-Reply-To: <20260525160744.896047-1-kbusch@meta.com>

On Mon, May 25, 2026 at 09:07:44AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> A previous commit removed an optimization out of caution for a scenario
> that turns out not to be real: all the "queue_exit" goto's are safe to
> reinsert the request into the cached_rq's plug list as they are either
> from a non-blocking path, or a successful merge that already holds the
> queue reference. This optimization is most needed for small sequential
> workloads that successfully merge into larger requests.
> 
> Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable")
> Suggested-by: Ming Lei <tom.leiming@gmail.com>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  block/blk-mq.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 28c2d931e75ea..72c8ac805882c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3246,7 +3246,7 @@ void blk_mq_submit_bio(struct bio *bio)
>  	if (!rq)
>  		blk_queue_exit(q);
>  	else
> -		blk_mq_free_request(rq);
> +		rq_list_push(&plug->cached_rqs, rq);

rq_list_add_head()?


Thanks,
Ming

^ permalink raw reply
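
The review comment above concerns which list helper puts the cached request back on the plug list. As a rough userspace sketch of the idea only, assuming a head/tail singly linked list: a request pushed back at the head is the first one returned by the next pop. All `demo_*` names are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a request and a plug's cached request list. */
struct demo_rq {
	int tag;
	struct demo_rq *next;
};

struct demo_rq_list {
	struct demo_rq *head;
	struct demo_rq *tail;
};

/* Put a request back at the front so the next pop returns it first. */
static void demo_rq_list_add_head(struct demo_rq_list *l, struct demo_rq *rq)
{
	assert(l && rq);
	rq->next = l->head;
	l->head = rq;
	if (!l->tail)
		l->tail = rq;
}

/* Remove and return the first request, or NULL if the list is empty. */
static struct demo_rq *demo_rq_list_pop(struct demo_rq_list *l)
{
	struct demo_rq *rq = l->head;

	if (rq) {
		l->head = rq->next;
		if (!l->head)
			l->tail = NULL;
		rq->next = NULL;
	}
	return rq;
}
```

In this model, popping a request, deciding not to use it, and adding it back at the head leaves the list in its original order, which is the property the reinsertion relies on.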

* [PATCH v7] block: propagate in_flight to whole disk on partition I/O
From: Tang Yizhou @ 2026-05-26  2:15 UTC (permalink / raw)
  To: axboe, hch, kbusch
  Cc: yukuai, linux-block, linux-kernel, Tang Yizhou, Leon Hwang

From: Tang Yizhou <yizhou.tang@shopee.com>

Currently, when I/O is submitted to a partition, the per-CPU in_flight[]
counter is incremented only on the partition's block_device, not on the
underlying whole disk. This leads to a problem which can be shown by a
fio test:

lsblk
  NAME     MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
  mydev    252:1    0   20G  0 disk
  └─mydev1 259:0    0   10G  0 part

iostat -xp 1
  Device       r/s        rkB/s      ... aqu-sz   %util
  mydev    128153.00  512612.00      ...  13.22   72.20
  mydev1   128154.00  512616.00      ...  13.22  100.00

%util is different between mydev and mydev1, which is unexpected.

This is the cumulative effect of a series of patches. The root cause is
commit e016b78201a2 ("block: return just one value from part_in_flight"),
which deleted the branch in part_in_flight() that aggregated the whole-disk
in_flight count on top of the partition's. The next step was
commit 10ec5e86f9b8 ("block: merge part_{inc,dev}_in_flight into their
only callers"), which folded the whole-disk in_flight accounting into
generic_start_io_acct() and generic_end_io_acct(). Those two helpers
were then removed by commit e722fff238bb ("block: remove
generic_{start,end}_io_acct"), and from that point on the whole disk's
in_flight is no longer accounted at all.

In update_io_ticks(), if bdev_count_inflight() finds that the in_flight
value of the whole device is 0, the accumulation of io_ticks is skipped,
causing the reported %util value to be underestimated.

Fix it by restoring the whole-disk in_flight accounting.

Fixes: e016b78201a2 ("block: return just one value from part_in_flight")
Suggested-by: Leon Hwang <leon.huangfu@shopee.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
v2: Update commit message.
v3: Take Christoph's advice and factor the common code into two helpers.
v4: Remove my redundant new line in blk.h. Add Christoph's Reviewed-by
tag.
v5: Remove the changelog from the commit message.
v6: Accept Keith's suggestion and fix the bug in bdev_end_io_acct().
v7: Address the review feedback from Claude Opus 4.7 and update
blk_account_io_merge_request().
 block/blk-core.c  |  4 ++--
 block/blk-merge.c |  3 +--
 block/blk-mq.c    |  5 ++---
 block/blk.h       | 21 +++++++++++++++++++++
 4 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d..cee4e4a37503 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1042,7 +1042,7 @@ unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
 {
 	part_stat_lock();
 	update_io_ticks(bdev, start_time, false);
-	part_stat_local_inc(bdev, in_flight[op_is_write(op)]);
+	bdev_inc_in_flight(bdev, op);
 	part_stat_unlock();
 
 	return start_time;
@@ -1073,7 +1073,7 @@ void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
 	part_stat_inc(bdev, ios[sgrp]);
 	part_stat_add(bdev, sectors[sgrp], sectors);
 	part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
-	part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
+	bdev_dec_in_flight(bdev, op);
 	part_stat_unlock();
 }
 EXPORT_SYMBOL(bdev_end_io_acct);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index fcf09325b22e..62d68a72f569 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -723,8 +723,7 @@ static void blk_account_io_merge_request(struct request *req)
 	if (req->rq_flags & RQF_IO_STAT) {
 		part_stat_lock();
 		part_stat_inc(req->part, merges[op_stat_group(req_op(req))]);
-		part_stat_local_dec(req->part,
-				    in_flight[op_is_write(req_op(req))]);
+		bdev_dec_in_flight(req->part, req_op(req));
 		part_stat_unlock();
 	}
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d0c37daf568f..6bdfe642bd93 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1082,8 +1082,7 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 		update_io_ticks(req->part, jiffies, true);
 		part_stat_inc(req->part, ios[sgrp]);
 		part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
-		part_stat_local_dec(req->part,
-				    in_flight[op_is_write(req_op(req))]);
+		bdev_dec_in_flight(req->part, req_op(req));
 		part_stat_unlock();
 	}
 }
@@ -1143,7 +1142,7 @@ static inline void blk_account_io_start(struct request *req)
 
 	part_stat_lock();
 	update_io_ticks(req->part, jiffies, false);
-	part_stat_local_inc(req->part, in_flight[op_is_write(req_op(req))]);
+	bdev_inc_in_flight(req->part, req_op(req));
 	part_stat_unlock();
 }
 
diff --git a/block/blk.h b/block/blk.h
index b998a7761faf..11245a494c43 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -4,6 +4,7 @@
 
 #include <linux/bio-integrity.h>
 #include <linux/blk-crypto.h>
+#include <linux/part_stat.h>
 #include <linux/lockdep.h>
 #include <linux/memblock.h>	/* for max_pfn/max_low_pfn */
 #include <linux/sched/sysctl.h>
@@ -485,6 +486,26 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
 		q->last_merge = NULL;
 }
 
+static inline void bdev_inc_in_flight(struct block_device *bdev,
+				      enum req_op op)
+{
+	bool rw = op_is_write(op);
+
+	part_stat_local_inc(bdev, in_flight[rw]);
+	if (bdev_is_partition(bdev))
+		part_stat_local_inc(bdev_whole(bdev), in_flight[rw]);
+}
+
+static inline void bdev_dec_in_flight(struct block_device *bdev,
+				      enum req_op op)
+{
+	bool rw = op_is_write(op);
+
+	part_stat_local_dec(bdev, in_flight[rw]);
+	if (bdev_is_partition(bdev))
+		part_stat_local_dec(bdev_whole(bdev), in_flight[rw]);
+}
+
 /*
  * Internal io_context interface
  */
-- 
2.43.0


^ permalink raw reply related
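
The accounting rule in the patch above can be modeled in userspace. This is a simplified sketch, not the kernel code: plain counters replace the per-CPU statistics, a NULL `whole` pointer marks the whole disk (the kernel uses bdev_is_partition() and bdev_whole() instead), and every `demo_*` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified block device: in_flight[0] counts reads, in_flight[1] writes. */
struct demo_bdev {
	long in_flight[2];
	struct demo_bdev *whole;	/* NULL when this is the whole disk */
};

/* Count an I/O on the device and, for a partition, on its whole disk too. */
static void demo_inc_in_flight(struct demo_bdev *bdev, bool is_write)
{
	bdev->in_flight[is_write]++;
	if (bdev->whole)
		bdev->whole->in_flight[is_write]++;
}

/* Undo demo_inc_in_flight() when the I/O completes or is merged away. */
static void demo_dec_in_flight(struct demo_bdev *bdev, bool is_write)
{
	assert(bdev->in_flight[is_write] > 0);
	bdev->in_flight[is_write]--;
	if (bdev->whole)
		bdev->whole->in_flight[is_write]--;
}

/* True when the device has at least one request outstanding. */
static bool demo_busy(const struct demo_bdev *bdev)
{
	return bdev->in_flight[0] + bdev->in_flight[1] > 0;
}
```

With this rule, a write submitted to a partition also makes the whole disk look busy, so a busy-time accumulator that skips idle devices would no longer skip the whole disk while only its partitions carry I/O.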

* Re: [PATCH] block: partitions: replace __get_free_page() with kmalloc()
From: Christoph Hellwig @ 2026-05-26  6:27 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-kernel,
	linux-mm
In-Reply-To: <ahQX_JCgS9JWIhY-@kernel.org>

On Mon, May 25, 2026 at 12:35:56PM +0300, Mike Rapoport wrote:
> > This does, but it still fails to explain why kmalloc performs just as
> > well as __get_free_page(s) these days.
> 
> I don't think that in this case - a single allocation on the cold path -
> the performance difference is even measurable.

Well, please state that.

> Nevertheless allocations from slab caches are way faster than
> __get_free_page() (i.e.  alloc_pages()) as it's essentially lockless
> cmpxchg. Allocations that need to refill the cache do alloc_pages() with a
> little of slab bookkeeping overhead.

Please state that too.


^ permalink raw reply

* Re: [PATCH] block: remove blkdev_write_begin() and blkdev_write_end()
From: Christoph Hellwig @ 2026-05-26  6:29 UTC (permalink / raw)
  To: Tal Zussman; +Cc: Jens Axboe, Christoph Hellwig, linux-block, linux-kernel
In-Reply-To: <20260525-blk-write-cleanup-v1-1-391c073e3831@columbia.edu>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply

* Re: [PATCH] block: Add bvec_folio()
From: Christoph Hellwig @ 2026-05-26  6:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-kernel,
	io-uring, linux-mm, Leon Romanovsky
In-Reply-To: <ahROtyLcr567wM8l@casper.infradead.org>

On Mon, May 25, 2026 at 02:29:27PM +0100, Matthew Wilcox wrote:
> > So I'm not against the function per se, but the documentation must
> > explain the minefields it is stepping into a bit better.
> 
> Lower level drivers shouldn't be concerning themselves with folios.
> For a start, we can put non-folios (eg slab memory) into bvecs.

Well, that is a very good thing to put into the comment.  We can also
put them into high-level bvecs, so framing this as 'only use if you
know the memory is folios, which you can't unless you are the entity
who filled the bio' might be a good choice.


^ permalink raw reply

* [PATCH] bvec: make the bvec_iter helpers inline functions
From: Christoph Hellwig @ 2026-05-26  7:00 UTC (permalink / raw)
  To: axboe; +Cc: linux-block

The macros are impossible to follow due to the lack of visual type
information and all the braces.  Replace them with inline helpers to
improve on that.  Because the calling conventions are a bit problematic
with a lot of passing structures by value, all the helpers are marked
as __always_inline so that they are force inlined.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/bvec.h | 101 +++++++++++++++++++++++++++----------------
 1 file changed, 64 insertions(+), 37 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index d36dd476feda..f4c7ec282ac9 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -104,51 +104,78 @@ struct bvec_iter_all {
 	unsigned	done;
 };
 
-/*
- * various member access, note that bio_data should of course not be used
- * on highmem page vectors
- */
-#define __bvec_iter_bvec(bvec, iter)	(&(bvec)[(iter).bi_idx])
+static __always_inline const struct bio_vec *
+__bvec_iter_bvec(const struct bio_vec *bvecs, const struct bvec_iter iter)
+{
+	return bvecs + iter.bi_idx;
+}
 
 /* multi-page (mp_bvec) helpers */
-#define mp_bvec_iter_page(bvec, iter)				\
-	(__bvec_iter_bvec((bvec), (iter))->bv_page)
+static __always_inline struct page *
+mp_bvec_iter_page(const struct bio_vec *bvecs, const struct bvec_iter iter)
+{
+	return __bvec_iter_bvec(bvecs, iter)->bv_page;
+}
 
-#define mp_bvec_iter_len(bvec, iter)				\
-	min((iter).bi_size,					\
-	    __bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
+static __always_inline unsigned int
+mp_bvec_iter_len(const struct bio_vec *bvecs, const struct bvec_iter iter)
+{
+	return min(__bvec_iter_bvec(bvecs, iter)->bv_len - iter.bi_bvec_done,
+			iter.bi_size);
+}
 
-#define mp_bvec_iter_offset(bvec, iter)				\
-	(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
+static __always_inline unsigned int
+mp_bvec_iter_offset(const struct bio_vec *bvecs, const struct bvec_iter iter)
+{
+	return __bvec_iter_bvec(bvecs, iter)->bv_offset + iter.bi_bvec_done;
+}
 
-#define mp_bvec_iter_page_idx(bvec, iter)			\
-	(mp_bvec_iter_offset((bvec), (iter)) / PAGE_SIZE)
+static __always_inline unsigned int
+mp_bvec_iter_page_idx(const struct bio_vec *bvecs, const struct bvec_iter iter)
+{
+	return mp_bvec_iter_offset(bvecs, iter) / PAGE_SIZE;
+}
 
-#define mp_bvec_iter_bvec(bvec, iter)				\
-((struct bio_vec) {						\
-	.bv_page	= mp_bvec_iter_page((bvec), (iter)),	\
-	.bv_len		= mp_bvec_iter_len((bvec), (iter)),	\
-	.bv_offset	= mp_bvec_iter_offset((bvec), (iter)),	\
-})
+static __always_inline struct bio_vec
+mp_bvec_iter_bvec(const struct bio_vec *bvecs, const struct bvec_iter iter)
+{
+	return (struct bio_vec) {
+		.bv_page	= mp_bvec_iter_page(bvecs, iter),
+		.bv_len		= mp_bvec_iter_len(bvecs, iter),
+		.bv_offset	= mp_bvec_iter_offset(bvecs, iter),
+	};
+}
 
 /* For building single-page bvec in flight */
- #define bvec_iter_offset(bvec, iter)				\
-	(mp_bvec_iter_offset((bvec), (iter)) % PAGE_SIZE)
-
-#define bvec_iter_len(bvec, iter)				\
-	min_t(unsigned, mp_bvec_iter_len((bvec), (iter)),		\
-	      PAGE_SIZE - bvec_iter_offset((bvec), (iter)))
-
-#define bvec_iter_page(bvec, iter)				\
-	(mp_bvec_iter_page((bvec), (iter)) +			\
-	 mp_bvec_iter_page_idx((bvec), (iter)))
-
-#define bvec_iter_bvec(bvec, iter)				\
-((struct bio_vec) {						\
-	.bv_page	= bvec_iter_page((bvec), (iter)),	\
-	.bv_len		= bvec_iter_len((bvec), (iter)),	\
-	.bv_offset	= bvec_iter_offset((bvec), (iter)),	\
-})
+static __always_inline unsigned int
+bvec_iter_offset(const struct bio_vec *bvecs, const struct bvec_iter iter)
+{
+	return mp_bvec_iter_offset(bvecs, iter) % PAGE_SIZE;
+}
+
+static __always_inline unsigned int
+bvec_iter_len(const struct bio_vec *bvecs, const struct bvec_iter iter)
+{
+	return min(mp_bvec_iter_len(bvecs, iter),
+			PAGE_SIZE - bvec_iter_offset(bvecs, iter));
+}
+
+static __always_inline struct page *
+bvec_iter_page(const struct bio_vec *bvecs, const struct bvec_iter iter)
+{
+	return mp_bvec_iter_page(bvecs, iter) +
+		mp_bvec_iter_page_idx(bvecs, iter);
+}
+
+static __always_inline struct bio_vec
+bvec_iter_bvec(const struct bio_vec *bvecs, const struct bvec_iter iter)
+{
+	return (struct bio_vec) {
+		.bv_page	= bvec_iter_page(bvecs, iter),
+		.bv_len		= bvec_iter_len(bvecs, iter),
+		.bv_offset	= bvec_iter_offset(bvecs, iter),
+	};
+}
 
 static inline bool bvec_iter_advance(const struct bio_vec *bv,
 		struct bvec_iter *iter, unsigned bytes)
-- 
2.53.0


^ permalink raw reply related
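
To make the arithmetic of the single-page helpers in the patch above easier to follow, here is a small userspace model. It is a sketch under assumptions: the page size is fixed at 4096 bytes, a page is represented by a plain number instead of a struct page pointer, and all `demo_*` names are hypothetical.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096u

struct demo_bvec {
	unsigned long first_page;	/* stand-in for struct page *bv_page */
	unsigned int len;
	unsigned int offset;
};

struct demo_iter {
	unsigned int size;	/* bytes left in the whole iteration */
	unsigned int idx;	/* index of the current bvec */
	unsigned int done;	/* bytes already consumed in the current bvec */
};

static unsigned int demo_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Bytes left in the current (possibly multi-page) bvec. */
static unsigned int demo_mp_len(const struct demo_bvec *v, struct demo_iter it)
{
	assert(it.done <= v[it.idx].len);
	return demo_min(v[it.idx].len - it.done, it.size);
}

/* Offset of the current position from the start of the first page. */
static unsigned int demo_mp_offset(const struct demo_bvec *v, struct demo_iter it)
{
	return v[it.idx].offset + it.done;
}

/* Offset within the page that holds the current position. */
static unsigned int demo_offset(const struct demo_bvec *v, struct demo_iter it)
{
	return demo_mp_offset(v, it) % DEMO_PAGE_SIZE;
}

/* Bytes available without crossing into the next page. */
static unsigned int demo_len(const struct demo_bvec *v, struct demo_iter it)
{
	return demo_min(demo_mp_len(v, it), DEMO_PAGE_SIZE - demo_offset(v, it));
}

/* Page that holds the current position. */
static unsigned long demo_page(const struct demo_bvec *v, struct demo_iter it)
{
	return v[it.idx].first_page + demo_mp_offset(v, it) / DEMO_PAGE_SIZE;
}
```

For a 10000-byte bvec that starts 512 bytes into page 100, with 4000 bytes already consumed, the position is 4512 bytes from the start of page 100, so the single-page view is page 101 at offset 416 with 3680 bytes left before the page boundary.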

* Re: [PATCH v2] blk-throttle: schedule parent dispatch in tg_flush_bios()
From: Shin'ichiro Kawasaki @ 2026-05-26  8:32 UTC (permalink / raw)
  To: Tao Cui; +Cc: tj, josef, axboe, cgroups, linux-block
In-Reply-To: <20260522091530.1901437-1-cuitao@kylinos.cn>

On May 22, 2026 / 17:15, Tao Cui wrote:
> tg_flush_bios() schedules pending_timer on the child tg's own
> service_queue, which causes throtl_pending_timer_fn() to dispatch from
> the child's pending_tree.  For leaf cgroups this tree is empty, so the
> timer fires and exits without dispatching the throttled bio.
> 
> The throttled bio sits in the parent's pending_tree with disptime set
> to jiffies (THROTL_TG_CANCELING zeroes all dispatch times), but the
> parent's timer is never explicitly rescheduled.  The bio only gets
> dispatched when the parent timer eventually fires at its previously
> scheduled expiry.
> 
> Fix by calling throtl_schedule_next_dispatch(sq->parent_sq, true)
> instead, matching what tg_set_limit() already does.  This forces the
> parent's dispatch cycle to run immediately and flush all canceling
> bios without waiting for a stale timer.
> 
> For the device deletion path (blk_throtl_cancel_bios), directly
> complete throttled bios with EIO via bio_io_error() instead of
> dispatching them through the timer -> work -> submission chain.
> This avoids a race with the SCSI state machine where bios can reach
> the SCSI layer while the device is in SDEV_CANCEL state, causing
> ENODEV instead of the expected EIO.
> 
> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>

I reported that the v1 patch fails the blktests test case throtl/004,
but I did not report the problem that this patch addresses. So I don't
think this Reported-by tag is valid. Please drop it.

I confirmed that the recent blktests CI test run with this v2 patch did not
fail at throtl/004. Thanks for addressing the failure.


^ permalink raw reply

* [PATCH 1/2] block: Use struct_size() helper in kmalloc()
From: luoqing @ 2026-05-26  8:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel, l1138897701

From: luoqing <luoqing@kylinos.cn>

Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows that,
in the worst scenario, could lead to heap overflows.

Signed-off-by: luoqing <luoqing@kylinos.cn>
---
 block/bio.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index d80d5d26804e..397fc3bc0ede 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -657,8 +657,7 @@ struct bio *bio_kmalloc(unsigned short nr_vecs, gfp_t gfp_mask)
 
 	if (nr_vecs > BIO_MAX_INLINE_VECS)
 		return NULL;
-	return kmalloc(sizeof(*bio) + nr_vecs * sizeof(struct bio_vec),
-			gfp_mask);
+	return kmalloc(struct_size(bio, bio_vec, nr_vecs), gfp_mask);
 }
 EXPORT_SYMBOL(bio_kmalloc);
 
-- 
2.25.1


^ permalink raw reply related
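
The overflow concern named in the commit message above can be illustrated outside the kernel. The sketch below is only a model of the idea: it computes the size of a structure with a trailing flexible array and saturates to SIZE_MAX on overflow, so that a following allocation fails instead of returning a buffer that is too small. It assumes the GCC/Clang overflow builtins; the `demo_*` types and functions are hypothetical and do not mirror struct bio.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element and container types with a flexible array member. */
struct demo_vec {
	void *page;
	unsigned int len;
	unsigned int offset;
};

struct demo_bio {
	unsigned short nr;
	struct demo_vec vecs[];
};

/*
 * Size of a demo_bio with n trailing elements, or SIZE_MAX if the
 * multiplication or addition would wrap around.
 */
static size_t demo_struct_size(size_t n)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, sizeof(struct demo_vec), &bytes) ||
	    __builtin_add_overflow(bytes, sizeof(struct demo_bio), &bytes))
		return SIZE_MAX;

	assert(bytes >= sizeof(struct demo_bio));
	return bytes;
}
```

An open-coded `sizeof(hdr) + n * sizeof(elem)` silently wraps for a huge `n`, whereas the saturated result here is a size no allocator can satisfy.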

* [PATCH] block: partitions: fix of_node refcount leak in of_partition()
From: Wentao Liang @ 2026-05-26 10:21 UTC (permalink / raw)
  To: Jens Axboe, stable
  Cc: Josh Law, Kees Cook, linux-block, linux-kernel, Wentao Liang

of_partition() calls of_node_get() on the parent device node at the
beginning of the function, storing the reference in 'partitions_np'.
This reference is leaked in two paths:

1. The compatibility check at the top of the function returns 0
   without releasing partitions_np when the node exists but is not
   "fixed-partitions" compatible.

2. The function returns 1 at the end after successfully processing
   all partitions without releasing partitions_np.

Fix both leaks by adding of_node_put(partitions_np) on each path.

Fixes: 2e3a191e89f9 ("block: add support for partition table defined in OF")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
 block/partitions/of.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/partitions/of.c b/block/partitions/of.c
index c22b60661098..53664ea06b65 100644
--- a/block/partitions/of.c
+++ b/block/partitions/of.c
@@ -74,8 +74,10 @@ int of_partition(struct parsed_partitions *state)
 	struct device_node *partitions_np = of_node_get(ddev->of_node);
 
 	if (!partitions_np ||
-	    !of_device_is_compatible(partitions_np, "fixed-partitions"))
+	    !of_device_is_compatible(partitions_np, "fixed-partitions")) {
+		of_node_put(partitions_np);
 		return 0;
+	}
 
 	slot = 1;
 	/* Validate parition offset and size */
@@ -104,5 +106,6 @@ int of_partition(struct parsed_partitions *state)
 
 	seq_buf_puts(&state->pp_buf, "\n");
 
+	of_node_put(partitions_np);
 	return 1;
 }
-- 
2.34.1


^ permalink raw reply related
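
The leak pattern described in the patch above, a reference taken at function entry and not dropped on every return, can be shown with a toy reference counter. This is a hedged userspace sketch, not the OF API: the `demo_*` names are hypothetical, and it uses a single exit label where the patch adds a put call on each path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference-counted node. */
struct demo_node {
	int refcount;
	bool compatible;
};

/* Take a reference; a NULL node is passed through unchanged. */
static struct demo_node *demo_node_get(struct demo_node *n)
{
	if (n)
		n->refcount++;
	return n;
}

/* Drop a reference; a NULL node is ignored. */
static void demo_node_put(struct demo_node *n)
{
	if (n) {
		assert(n->refcount > 0);
		n->refcount--;
	}
}

/* Return 1 if the node was handled, 0 if it is missing or incompatible. */
static int demo_parse(struct demo_node *parent)
{
	struct demo_node *np = demo_node_get(parent);
	int ret = 0;

	if (!np || !np->compatible)
		goto out;

	/* ... partition parsing would happen here ... */
	ret = 1;
out:
	demo_node_put(np);	/* every path drops the reference taken above */
	return ret;
}
```

Whichever way the function returns, the reference count seen by the caller is the same before and after the call.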

* [PATCH] block: blk-mq: fix ws_active refcount leak in blk_mq_mark_tag_wait()
From: Wentao Liang @ 2026-05-26 10:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel, Wentao Liang, stable

blk_mq_mark_tag_wait() calls sbitmap_queue_get() which increments
sbq->ws_active. On the error path where the waitqueue_active() check
fails and the function returns early, sbq->ws_active is not decremented,
leaking the reference.

Fix this by calling sbitmap_queue_clear() to properly release the
ws_active reference before returning on the error path.

Fixes: c27d53fb445f ("blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d0c37daf568f..e1c2ac416693 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1952,6 +1952,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 	spin_lock_irq(&wq->lock);
 	spin_lock(&hctx->dispatch_wait_lock);
 	if (!list_empty(&wait->entry)) {
+		list_del_init(&wait->entry);
+		atomic_dec(&sbq->ws_active);
 		spin_unlock(&hctx->dispatch_wait_lock);
 		spin_unlock_irq(&wq->lock);
 		return false;
-- 
2.34.1


^ permalink raw reply related

* [PATCH] Revert "nbd: freeze the queue while we're adding connections"
From: Yang Erkun @ 2026-05-26 11:52 UTC (permalink / raw)
  To: josef, axboe; +Cc: linux-block, nbd

This reverts commit b98e762e3d71e893b221f871825dc64694cfb258.

Commit b98e762e3d71 ("nbd: freeze the queue while we're adding
connections") added blk_mq_freeze_queue/blk_mq_unfreeze_queue in
nbd_add_socket() to protect krealloc(config->socks) from concurrent I/O
that could cause a Use-After-Free.

However, analysis shows that in all current code paths, concurrent I/O
cannot actually reach nbd_add_socket():

1. nbd_genl_connect() path:
   nbd_add_socket() is called first, and nbd_start_device() -- which
   starts the queue and enables I/O -- is called only after all sockets
   have been added. So the freeze/unfreeze runs against an idle queue,
   marking then waiting on a percpu_ref that is already zero, and then
   resurrecting it -- a pure no-op that burns an RCU grace period per
   socket on multi-core systems.

2. nbd_ioctl(NBD_SET_SOCK) path:
   The task_setup check enforces that only the thread which performed
   the first NBD_SET_SOCK can call NBD_SET_SOCK again. That thread is
   blocked in NBD_DO_IT's wait_event_interruptible, so it cannot issue
   another NBD_SET_SOCK concurrently with I/O. Other threads are
   rejected by the task_setup != current check.

3. nbd_genl_reconfigure() does not call nbd_add_socket() at all; it
   uses nbd_reconnect_socket() which replaces a dead socket in-place
   without reallocating config->socks.

Therefore the freeze/unfreeze provides no actual protection in any
reachable code path, while imposing the cost of blk_mq_freeze_queue
(percpu_ref_kill + RCU grace period wait + percpu_ref_resurrect) on
every socket addition during device setup[1].

Revert the change to eliminate the unnecessary overhead.

Link: https://lore.kernel.org/all/20260327091223.4147956-1-leo.lilong@huaweicloud.com/ [1]
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
---
 drivers/block/nbd.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index fe63f3c55d0d..9033d996c9a9 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1245,22 +1245,16 @@ static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg,
 	struct socket *sock;
 	struct nbd_sock **socks;
 	struct nbd_sock *nsock;
-	unsigned int memflags;
 	int err;

 	/* Arg will be cast to int, check it to avoid overflow */
 	if (arg > INT_MAX)
 		return -EINVAL;
+
 	sock = nbd_get_socket(nbd, arg, &err);
 	if (!sock)
 		return err;

-	/*
-	 * We need to make sure we don't get any errant requests while we're
-	 * reallocating the ->socks array.
-	 */
-	memflags = blk_mq_freeze_queue(nbd->disk->queue);
-
 	if (!netlink && !nbd->task_setup &&
 	    !test_bit(NBD_RT_BOUND, &config->runtime_flags))
 		nbd->task_setup = current;
@@ -1300,12 +1294,9 @@ static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg,
 	INIT_WORK(&nsock->work, nbd_pending_cmd_work);
 	socks[config->num_connections++] = nsock;
 	atomic_inc(&config->live_connections);
-	blk_mq_unfreeze_queue(nbd->disk->queue, memflags);
-
 	return 0;

 put_socket:
-	blk_mq_unfreeze_queue(nbd->disk->queue, memflags);
 	sockfd_put(sock);
 	return err;
 }
-- 
2.52.0

^ permalink raw reply related

* Re: [PATCH] block: partitions: replace __get_free_page() with kmalloc()
From: Vlastimil Babka @ 2026-05-26 12:07 UTC (permalink / raw)
  To: Mike Rapoport, Christoph Hellwig, Matthew Wilcox
  Cc: Jens Axboe, linux-block, linux-kernel, linux-mm
In-Reply-To: <ahQX_JCgS9JWIhY-@kernel.org>

On 5/25/26 11:35 AM, Mike Rapoport wrote:
> On Mon, May 25, 2026 at 12:16:23AM -0700, Christoph Hellwig wrote:
>>
>> This does, but it still fails to explain why kmalloc performs just as
>> well as __get_free_page(s) these days.
> 
> I don't think that in this case - a single allocation on the cold path -
> the performance difference is even measurable.
> 
> Nevertheless allocations from slab caches are way faster than
> __get_free_page() (i.e.  alloc_pages()) as it's essentially lockless
> cmpxchg. Allocations that need to refill the cache do alloc_pages() with a

Probably not "way faster" but the fast path is quite similar - percpu
pcplist protected by spin_trylock (pages) vs sheaves with local_trylock
(slab), should slightly favour slab because spinlocks are typically not
inlined and local_trylock is.

The main reasons for switching AFAIU would be related to the
folio/memdesc conversions? If one needs just a kernel memory buffer,
kmalloc() it is, even if it happens to be page size. Page allocator
should be only used if you need e.g. the refcounting or anything else
that struct page provides. But then in some cases the memdesc conversion
would need adjustments at some point. With kmalloc() we can forget about
this user.

Matthew can probably state it better or even link to something
authoritative?

> little of slab bookkeeping overhead.
> 

^ permalink raw reply

* [PATCH] block: rename need_dispatch to cautious_dispatch in blk-mq sched
From: Guixin Liu @ 2026-05-26 13:11 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: linux-block, xlpang, oliver.yang

The local boolean in __blk_mq_sched_dispatch_requests() decides whether
to fall back to the per-ctx round-robin path (blk_mq_do_dispatch_ctx())
instead of the batch flush path (blk_mq_flush_busy_ctxs()).  The whole
function is about dispatching anyway, so the name "need_dispatch" is
not particularly informative and can mislead readers into thinking that
a false value means "skip dispatching".

Rename it to "cautious_dispatch" to match the comment right above the
check ("dequeue request one by one from sw queue if queue is busy")
and to convey the actual intent: take the cautious, fair, one-at-a-time
path either when we just drained hctx->dispatch (so the device has
recently pushed back) or when the dispatch_busy EWMA still indicates
congestion.  The fast batch path is only taken when neither signal
suggests recent backpressure.

No functional change.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
---
 block/blk-mq-sched.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 0a00f5a76f5a..ef28c3dd95a3 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -267,7 +267,7 @@ static int blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)

 static int __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
-	bool need_dispatch = false;
+	bool cautious_dispatch = false;
 	LIST_HEAD(rq_list);

 	/*
@@ -298,16 +298,16 @@ static int __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		blk_mq_sched_mark_restart_hctx(hctx);
 		if (!blk_mq_dispatch_rq_list(hctx, &rq_list, true))
 			return 0;
-		need_dispatch = true;
+		cautious_dispatch = true;
 	} else {
-		need_dispatch = hctx->dispatch_busy;
+		cautious_dispatch = hctx->dispatch_busy;
 	}

 	if (hctx->queue->elevator)
 		return blk_mq_do_dispatch_sched(hctx);

 	/* dequeue request one by one from sw queue if queue is busy */
-	if (need_dispatch)
+	if (cautious_dispatch)
 		return blk_mq_do_dispatch_ctx(hctx);
 	blk_mq_flush_busy_ctxs(hctx, &rq_list);
 	blk_mq_dispatch_rq_list(hctx, &rq_list, true);
-- 
2.43.7

^ permalink raw reply related
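
The decision that the renamed flag controls in the patch above can be reduced to a small pure function. The sketch below is a simplification under assumptions: it ignores the early return taken when dispatching the leftover list fails, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three dispatch paths named in the patch. */
enum demo_path {
	DEMO_PATH_SCHED,	/* an elevator is attached */
	DEMO_PATH_ONE_BY_ONE,	/* cautious per-ctx round-robin dispatch */
	DEMO_PATH_BATCH,	/* flush all busy software queues at once */
};

/*
 * had_leftover: requests were just drained from the hctx dispatch list.
 * dispatch_busy: non-zero while the busy average still signals congestion.
 */
static enum demo_path demo_pick_path(bool had_leftover,
				     unsigned int dispatch_busy,
				     bool has_elevator)
{
	bool cautious_dispatch = had_leftover || dispatch_busy != 0;

	if (has_elevator)
		return DEMO_PATH_SCHED;
	if (cautious_dispatch)
		return DEMO_PATH_ONE_BY_ONE;
	return DEMO_PATH_BATCH;
}
```

In this model a false flag never means that dispatching is skipped: it only selects the batch path over the one-at-a-time path.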

* Re: [PATCH] blk-mq: reinsert cached request to the list
From: kernel test robot @ 2026-05-26 13:50 UTC (permalink / raw)
  To: Keith Busch, linux-block, axboe
  Cc: oe-kbuild-all, Keith Busch, Ming Lei, Christoph Hellwig
In-Reply-To: <20260525160744.896047-1-kbusch@meta.com>

Hi Keith,

kernel test robot noticed the following build errors:

[auto build test ERROR on axboe/for-next]
[also build test ERROR on next-20260525]
[cannot apply to linus/master v6.16-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Keith-Busch/blk-mq-reinsert-cached-request-to-the-list/20260526-000916
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git for-next
patch link:    https://lore.kernel.org/r/20260525160744.896047-1-kbusch%40meta.com
patch subject: [PATCH] blk-mq: reinsert cached request to the list
config: i386-allnoconfig-bpf (https://download.01.org/0day-ci/archive/20260526/202605261526.40AHANmH-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260526/202605261526.40AHANmH-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605261526.40AHANmH-lkp@intel.com/

All errors (new ones prefixed by >>):

   block/blk-mq.c: In function 'blk_mq_submit_bio':
>> block/blk-mq.c:3249:17: error: implicit declaration of function 'rq_list_push'; did you mean 'rq_list_peek'? [-Wimplicit-function-declaration]
    3249 |                 rq_list_push(&plug->cached_rqs, rq);
         |                 ^~~~~~~~~~~~
         |                 rq_list_peek


vim +3249 block/blk-mq.c

  3110	
  3111	/**
  3112	 * blk_mq_submit_bio - Create and send a request to block device.
  3113	 * @bio: Bio pointer.
  3114	 *
  3115	 * Builds up a request structure from @q and @bio and send to the device. The
  3116	 * request may not be queued directly to hardware if:
  3117	 * * This request can be merged with another one
  3118	 * * We want to place request at plug queue for possible future merging
  3119	 * * There is an IO scheduler active at this queue
  3120	 *
  3121	 * It will not queue the request if there is an error with the bio, or at the
  3122	 * request creation.
  3123	 */
  3124	void blk_mq_submit_bio(struct bio *bio)
  3125	{
  3126		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
  3127		struct blk_plug *plug = current->plug;
  3128		const int is_sync = op_is_sync(bio->bi_opf);
  3129		unsigned int integrity_action;
  3130		struct blk_mq_hw_ctx *hctx;
  3131		unsigned int nr_segs;
  3132		struct request *rq;
  3133		blk_status_t ret;
  3134	
  3135		/*
  3136		 * If the plug has a cached request for this queue, try to use it.
  3137		 */
  3138		rq = blk_mq_get_cached_request(plug, q, bio->bi_opf);
  3139	
  3140		/*
  3141		 * A BIO that was released from a zone write plug has already been
  3142		 * through the preparation in this function, already holds a reference
  3143		 * on the queue usage counter, and is the only write BIO in-flight for
  3144		 * the target zone. Go straight to preparing a request for it.
  3145		 */
  3146		if (bio_zone_write_plugging(bio)) {
  3147			nr_segs = bio->__bi_nr_segments;
  3148			if (rq)
  3149				blk_queue_exit(q);
  3150			goto new_request;
  3151		}
  3152	
  3153		/*
  3154		 * The cached request already holds a q_usage_counter reference and we
  3155		 * don't have to acquire a new one if we use it.
  3156		 */
  3157		if (!rq) {
  3158			if (unlikely(bio_queue_enter(bio)))
  3159				return;
  3160		}
  3161	
  3162		/*
  3163		 * Device reconfiguration may change logical block size or reduce the
  3164		 * number of poll queues, so the checks for alignment and poll support
  3165		 * have to be done with queue usage counter held.
  3166		 */
  3167		if (unlikely(bio_unaligned(bio, q))) {
  3168			bio_io_error(bio);
  3169			goto queue_exit;
  3170		}
  3171	
  3172		if ((bio->bi_opf & REQ_POLLED) && !blk_mq_can_poll(q)) {
  3173			bio->bi_status = BLK_STS_NOTSUPP;
  3174			bio_endio(bio);
  3175			goto queue_exit;
  3176		}
  3177	
  3178		bio = __bio_split_to_limits(bio, &q->limits, &nr_segs);
  3179		if (!bio)
  3180			goto queue_exit;
  3181	
  3182		integrity_action = bio_integrity_action(bio);
  3183		if (integrity_action)
  3184			bio_integrity_prep(bio, integrity_action);
  3185	
  3186		blk_mq_bio_issue_init(q, bio);
  3187		if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
  3188			goto queue_exit;
  3189	
  3190		if (bio_needs_zone_write_plugging(bio)) {
  3191			if (blk_zone_plug_bio(bio, nr_segs))
  3192				goto queue_exit;
  3193		}
  3194	
  3195	new_request:
  3196		if (rq) {
  3197			rq_qos_throttle(rq->q, bio);
  3198			blk_mq_rq_time_init(rq, blk_time_get_ns());
  3199			rq->cmd_flags = bio->bi_opf;
  3200			INIT_LIST_HEAD(&rq->queuelist);
  3201		} else {
  3202			rq = blk_mq_get_new_requests(q, plug, bio);
  3203			if (unlikely(!rq)) {
  3204				if (bio->bi_opf & REQ_NOWAIT)
  3205					bio_wouldblock_error(bio);
  3206				goto queue_exit;
  3207			}
  3208		}
  3209	
  3210		trace_block_getrq(bio);
  3211	
  3212		rq_qos_track(q, rq, bio);
  3213	
  3214		blk_mq_bio_to_request(rq, bio, nr_segs);
  3215	
  3216		ret = blk_crypto_rq_get_keyslot(rq);
  3217		if (ret != BLK_STS_OK) {
  3218			bio->bi_status = ret;
  3219			bio_endio(bio);
  3220			blk_mq_free_request(rq);
  3221			return;
  3222		}
  3223	
  3224		if (bio_zone_write_plugging(bio))
  3225			blk_zone_write_plug_init_request(rq);
  3226	
  3227		if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
  3228			return;
  3229	
  3230		if (plug) {
  3231			blk_add_rq_to_plug(plug, rq);
  3232			return;
  3233		}
  3234	
  3235		hctx = rq->mq_hctx;
  3236		if ((rq->rq_flags & RQF_USE_SCHED) ||
  3237		    (hctx->dispatch_busy && (q->nr_hw_queues == 1 || !is_sync))) {
  3238			blk_mq_insert_request(rq, 0);
  3239			blk_mq_run_hw_queue(hctx, true);
  3240		} else {
  3241			blk_mq_run_dispatch_ops(q, blk_mq_try_issue_directly(hctx, rq));
  3242		}
  3243		return;
  3244	
  3245	queue_exit:
  3246		if (!rq)
  3247			blk_queue_exit(q);
  3248		else
> 3249			rq_list_push(&plug->cached_rqs, rq);
  3250	}
  3251	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH] blk-mq: reinsert cached request to the list
From: Keith Busch @ 2026-05-26 14:02 UTC (permalink / raw)
  To: Ming Lei; +Cc: Keith Busch, linux-block, axboe, Christoph Hellwig
In-Reply-To: <ahT656Dazfz5oc8r@fedora>

On Mon, May 25, 2026 at 08:44:07PM -0500, Ming Lei wrote:
> On Mon, May 25, 2026 at 09:07:44AM -0700, Keith Busch wrote:
> > +		rq_list_push(&plug->cached_rqs, rq);
> 
> rq_list_add_head()?

Yes indeed. Serves me right for trying to squeeze this in over a
holiday. Thanks.
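
A minimal sketch of what pushing the unused cached request back onto the front of the list amounts to, written as a userspace model. It assumes the list is a singly linked list tracked by head and tail pointers; `struct model_rq`, `struct model_rq_list`, and the `model_rq_list_*()` helpers are hypothetical names for illustration and are not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct request and the cached request list. */
struct model_rq {
	int tag;
	struct model_rq *next;
};

struct model_rq_list {
	struct model_rq *head;
	struct model_rq *tail;
};

/* Put rq at the front so the next lookup finds it first. */
static void model_rq_list_add_head(struct model_rq_list *rl,
				   struct model_rq *rq)
{
	rq->next = rl->head;
	rl->head = rq;
	if (!rl->tail)
		rl->tail = rq;
}

/* Remove and return the first entry, or NULL if the list is empty. */
static struct model_rq *model_rq_list_pop(struct model_rq_list *rl)
{
	struct model_rq *rq = rl->head;

	if (rq) {
		rl->head = rq->next;
		if (!rl->head)
			rl->tail = NULL;
		rq->next = NULL;
	}
	return rq;
}
```

In this model, popping a request, finding that the bio merged instead, and adding the request back at the head leaves the list in the state it had before the pop, which is the effect the queue_exit path is after.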

^ permalink raw reply
