Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* [PATCH] block: invalidate cached plug timestamp after task switch
From: Usama Arif @ 2026-06-11 23:14 UTC (permalink / raw)
  To: axboe, linux-block, bsegall, dietmar.eggemann, juri.lelli,
	kprateek.nayak, linux-kernel, mgorman, mingo, peterz, rostedt,
	vincent.guittot, vschneid
  Cc: shakeel.butt, hannes, riel, kernel-team, Usama Arif, stable

blk_time_get_ns() caches ktime_get_ns() in current->plug->cur_ktime
and marks the task with PF_BLOCK_TS. That cache is only valid while the
task keeps running; if the task is switched out, wall-clock time
advances and the cached value must not be reused when the task runs again.

The existing invalidation covers explicit plug flushes through
__blk_flush_plug(), and the schedule() / rtmutex paths through
sched_update_worker(). It does not cover in-kernel preemption paths such
as preempt_schedule(), preempt_schedule_notrace(), and
preempt_schedule_irq(), which enter __schedule(SM_PREEMPT) directly and
return without calling sched_update_worker().

As a result, a task preempted while holding a plug with PF_BLOCK_TS set
can reuse a stale plug->cur_ktime after it is scheduled back in. blk-iocost
then consumes that stale timestamp through ioc_now(), producing stale vnow
values for throttle decisions, and through ioc_rqos_done(), inflating
on-queue time and feeding false missed-QoS samples into vrate
adjustment.

Move the schedule-side invalidation to finish_task_switch(), which runs
for the scheduled-in task after every actual context switch regardless
of which schedule entry point was used. Keep __blk_flush_plug() as the
explicit flush/finish-plug invalidation path, and remove only the
PF_BLOCK_TS handling from sched_update_worker().

Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption")
Cc: stable@vger.kernel.org
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 kernel/sched/core.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b791e9e9f67..bf024ca115ff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5368,6 +5368,13 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 */
 	kmap_local_sched_in();

+	/*
+	 * Any cached block-layer timestamp (plug->cur_ktime) is stale now,
+	 * invalidate it.
+	 */
+	if (unlikely(current->flags & PF_BLOCK_TS))
+		blk_plug_invalidate_ts(current);
+
 	fire_sched_in_preempt_notifiers(current);
 	/*
 	 * When switching through a kernel thread, the loop in
@@ -7290,12 +7297,10 @@ static inline void sched_submit_work(struct task_struct *tsk)

 static void sched_update_worker(struct task_struct *tsk)
 {
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_BLOCK_TS)) {
-		if (tsk->flags & PF_BLOCK_TS)
-			blk_plug_invalidate_ts(tsk);
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
-		else if (tsk->flags & PF_IO_WORKER)
+		else
 			io_wq_worker_running(tsk);
 	}
 }
-- 
2.53.0-Meta

^ permalink raw reply related

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Andy Shevchenko @ 2026-06-11 18:31 UTC (permalink / raw)
  To: Christian König
  Cc: Kaitao Cheng, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <96f9390b-a547-442f-b0a9-99a5ba52c0e1@amd.com>

On Thu, Jun 11, 2026 at 10:39:14AM +0200, Christian König wrote:
> On 6/11/26 10:29, Andy Shevchenko wrote:
> > On Thu, Jun 11, 2026 at 10:01:25AM +0200, Christian König wrote:
> >> On 6/10/26 17:02, Andy Shevchenko wrote:
> >>> On Wed, Jun 10, 2026 at 11:11:34AM +0200, Christian König wrote:
> >>>> On 6/10/26 10:18, Kaitao Cheng wrote:
> >>>>> 在 2026/6/10 16:07, Christian König 写道:

...

> >>>>> Should we revert to v1, or keep list_for_each_entry() and
> >>>>> list_for_each_entry_safe() as they are, close this thread, and make no
> >>>>> changes?
> >>>>>
> >>>>> Link to v1:
> >>>>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
> >>>>>
> >>>>> Or do you have any better suggestions?
> >>>>
> >>>> v1 looks perfectly reasonable to me.
> >>>
> >>> But why not just hiding that once for all (in case they don't use the temporary
> >>> iterator)? Easy to automate, robust — everyone is happy?
> >>
> >> As far as I can see that is an extremely bad idea.
> >>
> >> The distinction between the use cases of 'iterating the list' and 'iterating
> >> the list while you modify it' is completely intentional.
> > 
> > What I meant is to keep the name, just drop the parameter (make it hidden and
> > being defined inside list_for_each_*_safe() cases).
> 
> Ah, sorry I was still thinking the suggestion is to merge
> list_for_each_entry() and list_for_each_entry_safe().
> 
> If the modification is done all at once or in steps doesn't really matter for
> me as long as the patch can be re-created reproducible.
> 
> But I'm wondering if we couldn't improve the name at the same time. The
> _safe() postfix has caused tons of confusion where especially beginners
> thought that it is a thread-safe variant, which it clearly isn't.
> 
> The _mutable() postfix sounds like a much better description to what happens here.

I see, no objections from my side, but with the new name we don't need to have
treewide change, the downside that one should undertake this to finish the job,
otherwise we will have _safe() and _mutable() for a long time.

> >> See the bool type can be implemented by int as well, but it is just a
> >> different use case.
> > 
> >>>> You should just include some patches in the same patch set to actually use
> >>>> the new macros.
> >>>>
> >>>> If you modify the files under drivers/dma-buf or drivers/gpu/drm/amd to use
> >>>> the new macro I'm happy to review that.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* [PATCH] blk-mq: bound blk_hctx_poll() to one jiffy
From: Anuj Gupta @ 2026-06-11 16:27 UTC (permalink / raw)
  To: axboe, hch, kbusch, lidiangang, changfengnan, tom.leiming,
	nj.shetty, joshi.k, anuj1072538
  Cc: linux-block, Anuj Gupta, Alok Rathore
In-Reply-To: <CGME20260611163403epcas5p2bba1c085eaab24852acfb0751ef148ab@epcas5p2.samsung.com>

blk_hctx_poll() can busy-poll until a completion is found or
need_resched() becomes true. On preemptible kernels, the scheduler can
set TIF_NEED_RESCHED on the timer tick and preempt the task at IRQ
return before the loop condition re-evaluates it. After the context
switch, the flag is cleared, so the poller can continue spinning instead
of returning to its caller.

This can happen with io_uring IOPOLL reads inside iocb_bio_iopoll(),
which holds the rcu_read_lock() while calling bio_poll(). If another
poller on the same polled queue drains the available completions, this
poller may repeatedly find no completions and remain inside the RCU
read-side critical section long enough to trigger RCU stall reports:

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-9): P3961
rcu:     (detected by 3, t=60002 jiffies, g=18533, q=4943 ncpus=20)
task:fio state:R  running task     stack:0     pid:3961
Call Trace:
<TASK>
? nvme_poll+0x36/0xa0 [nvme]
? blk_hctx_poll+0x39/0x90
? blk_mq_poll+0x30/0x60
? bio_poll+0x87/0x170
? iocb_bio_iopoll+0x32/0x50
? io_uring_classic_poll+0x25/0x50
? io_do_iopoll+0x216/0x420
? __do_sys_io_uring_enter+0x2c7/0x7c0

Reproducible with:

fio -filename=/dev/nvme0n1 -direct=1 -size=4g -rw=randread \
--numjobs=32 -bs=4K -ioengine=io_uring -hipri=1 -iodepth=1 \
--registerfiles=1 --group_reporting --thread

Record the starting jiffy and exit the loop once jiffies has advanced.
This bounds each blk_hctx_poll() invocation while also covering the
case where the reschedule flag was cleared by the context switch
before the loop condition could observe it.

Fixes: f22ecf9c14c1 ("blk-mq: delete task running check in blk_hctx_poll()")
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Alok Rathore <alok.rathore@samsung.com>
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..d85fa4a51e79 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -5248,6 +5248,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			 struct io_comp_batch *iob, unsigned int flags)
 {
 	int ret;
+	unsigned long start = jiffies;

 	do {
 		ret = q->mq_ops->poll(hctx, iob);
@@ -5258,7 +5259,7 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 		if (ret < 0 || (flags & BLK_POLL_ONESHOT))
 			break;
 		cpu_relax();
-	} while (!need_resched());
+	} while (!need_resched() && time_before_eq(jiffies, start));

 	return 0;
 }
-- 
2.25.1

^ permalink raw reply related

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Carlos Maiolino @ 2026-06-11 15:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, brauner, linux-block, linux-fsdevel, linux-ext4,
	linux-xfs, Hannes Reinecke, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260611133833.GA14645@lst.de>

On Thu, Jun 11, 2026 at 03:38:33PM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> > It's entirely possible a device supports byte aligned addresses. The
> > block layer just doesn't let a driver report that. So either it really
> > was successful because you found a bug that skips the alignment checks,
> > or your device silently corrupted your payload.

I tried this on different hardware, I find it hard to say all those
devices were corrupting the payload.

> > 
> > Anyway, my earlier suggestion should work. Ming thinks it may go to far,
> > though, in not taking the optimization when it was possible. So here's
> > an alternative suggestion that should get things working as expected:
> 
> The fix below looks like it is addressing a real bug.  I'm not sure if
> Carlos is hitting it, but we were missing the alignment checks for
> single-bvec fast path bios so far indeed.

You left context out so I'm assuming by the fix you meant Keith's patch.
I can give it a spin and see if it fixes the behavior I'm talking
about. Give me some time as I have a bunch of stuff to do tonight so
likely I will only manage to try this tomorrow.

^ permalink raw reply

* [PATCH 4/4] block: add configurable error injection
From: Christoph Hellwig @ 2026-06-11 14:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260611140703.2401204-1-hch@lst.de>

Add a new block error injection interface that allows to inject specific
status code for specific ranges.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
 Documentation/block/error-injection.rst |  59 +++++
 Documentation/block/index.rst           |   1 +
 block/Kconfig                           |   8 +
 block/Makefile                          |   1 +
 block/blk-core.c                        |   4 +
 block/blk-sysfs.c                       |   5 +
 block/error-injection.c                 | 315 ++++++++++++++++++++++++
 block/error-injection.h                 |  21 ++
 block/genhd.c                           |   4 +
 include/linux/blkdev.h                  |   6 +
 10 files changed, 424 insertions(+)
 create mode 100644 Documentation/block/error-injection.rst
 create mode 100644 block/error-injection.c
 create mode 100644 block/error-injection.h

diff --git a/Documentation/block/error-injection.rst b/Documentation/block/error-injection.rst
new file mode 100644
index 000000000000..81f31af82e65
--- /dev/null
+++ b/Documentation/block/error-injection.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Configurable Error Injection
+============================
+
+Overview
+--------
+
+Configurable error injection allows injecting specific block layer status codes
+for sector ranges of a block device.  Errors can be injected unconditionally, or
+with a given probability.
+
+To use configurable error injection, CONFIG_BLK_ERROR_INJECTION must be enabled.
+
+The only interface is the error_injection debugfs file, which is created for
+each registered gendisk.  Writes to this file are used to create or delete rules
+and reads return a list of the current error injection sites.
+
+Options
+-------
+
+The following options specify the operations:
+
+===================	=======================================================
+add			add a new rule
+removeall		remove all existing rules
+===================	=======================================================
+
+The following options specify the details of the rule for the add operation:
+
+===================	=======================================================
+op=<string>		block layer operation this rule applies to.  This uses
+			the XYZ for each REQ_OP_XYZ operation, e.g. READ, WRITE
+			or DISCARD. Mandatory.
+status=<string>		Status to return.  This uses XYZ for each BLK_STS_XYZ
+			code, e.g. IOERR or MEDIUM. Mandatory.
+start=<number>		First block layer sector the rule applies to.
+			Optional, defaults to 0.
+nr_sectors=<number>	Number of sectors this rule applies.
+			Optional, defaults to the remainder of the device.
+chance=<number>		Only return a failure with a likelihood of 1/chance.
+			Optional, defaults to 1 (always).
+===================	=======================================================
+
+Example
+-------
+
+Return BLK_STS_IOERR for one in 10 reads of sector 0 of /dev/nvme0n1:
+
+	$ echo 'add,op=READ,start=0,status=IOERR,chance=10' > /sys/kernel/debug/block/nvme0n1/error_injection
+
+Return BLK_STS_MEDIUM for every write to /dev/nvme0n1:
+
+	$ echo 'add,op=WRITE,start=0,status=MEDIUM' > /sys/kernel/debug/block/nvme0n1/error_injection
+
+Remove all rules for /dev/nvme0n1:
+
+	$ echo 'removeall' > /sys/kernel/debug/block/nvme0n1/error_injection
diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst
index 9fea696f9daa..bfa1bbd31ddf 100644
--- a/Documentation/block/index.rst
+++ b/Documentation/block/index.rst
@@ -22,3 +22,4 @@ Block
    switching-sched
    writeback_cache_control
    ublk
+   error-injection
diff --git a/block/Kconfig b/block/Kconfig
index 15027963472d..70e4a66d941f 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -221,6 +221,14 @@ config BLOCK_HOLDER_DEPRECATED
 config BLK_MQ_STACKING
 	bool
 
+config BLK_ERROR_INJECTION
+	bool "Enable block layer error injection"
+	select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
+	help
+	  Enable inserting arbitrary block errors through a debugfs interface.
+
+	  See Documentation/block/error-injection.rst for details.
+
 source "block/Kconfig.iosched"
 
 endif # BLOCK
diff --git a/block/Makefile b/block/Makefile
index 54130faacc21..e7bd320e3d69 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,6 +13,7 @@ obj-y		:= bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
 			genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
 			disk-events.o blk-ia-ranges.o early-lookup.o
 
+obj-$(CONFIG_BLK_ERROR_INJECTION) += error-injection.o
 obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o
 obj-$(CONFIG_BLK_DEV_BSGLIB)	+= bsg-lib.o
 obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
diff --git a/block/blk-core.c b/block/blk-core.c
index beaab7a71fba..73a41df98c9a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -50,6 +50,7 @@
 #include "blk-cgroup.h"
 #include "blk-throttle.h"
 #include "blk-ioprio.h"
+#include "error-injection.h"
 
 struct dentry *blk_debugfs_root;
 
@@ -767,6 +768,9 @@ static void __submit_bio_noacct_mq(struct bio *bio)
 
 void submit_bio_noacct_nocheck(struct bio *bio, bool split)
 {
+	if (unlikely(blk_error_inject(bio)))
+		return;
+
 	blk_cgroup_bio_start(bio);
 
 	if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f22c1f253eb3..520972676ab4 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -19,6 +19,7 @@
 #include "blk-wbt.h"
 #include "blk-cgroup.h"
 #include "blk-throttle.h"
+#include "error-injection.h"
 
 struct queue_sysfs_entry {
 	struct attribute attr;
@@ -933,6 +934,8 @@ static void blk_debugfs_remove(struct gendisk *disk)
 
 	blk_debugfs_lock_nomemsave(q);
 	blk_trace_shutdown(q);
+	if (IS_ENABLED(CONFIG_BLK_ERROR_INJECTION))
+		blk_error_injection_exit(disk);
 	debugfs_remove_recursive(q->debugfs_dir);
 	q->debugfs_dir = NULL;
 	q->sched_debugfs_dir = NULL;
@@ -963,6 +966,8 @@ int blk_register_queue(struct gendisk *disk)
 
 	memflags = blk_debugfs_lock(q);
 	q->debugfs_dir = debugfs_create_dir(disk->disk_name, blk_debugfs_root);
+	if (IS_ENABLED(CONFIG_BLK_ERROR_INJECTION))
+		blk_error_injection_init(disk);
 	if (queue_is_mq(q))
 		blk_mq_debugfs_register(q);
 	blk_debugfs_unlock(q, memflags);
diff --git a/block/error-injection.c b/block/error-injection.c
new file mode 100644
index 000000000000..d24c90e9a25f
--- /dev/null
+++ b/block/error-injection.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 Christoph Hellwig.
+ */
+#include <linux/debugfs.h>
+#include <linux/blkdev.h>
+#include <linux/parser.h>
+#include <linux/seq_file.h>
+#include "blk.h"
+#include "error-injection.h"
+
+struct blk_error_inject {
+	struct list_head		entry;
+	sector_t			start;
+	sector_t			end;
+	enum req_op			op;
+	blk_status_t			status;
+
+	/* only inject every 1 / chance times */
+	unsigned int			chance;
+};
+
+DEFINE_STATIC_KEY_FALSE(blk_error_injection_enabled);
+
+bool __blk_error_inject(struct bio *bio)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct blk_error_inject *inj;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(inj, &disk->error_injection_list, entry) {
+		if (bio_op(bio) != inj->op)
+			continue;
+		/*
+		 * This never matches 0-sized bios like empty WRITEs with
+		 * REQ_PREFLUSH or ZONE_RESET_ALL.  While adding a special case
+		 * for them would be trivial, that means any WRITE rule would
+		 * trigger for flushes.  So before we can make this work
+		 * properly, we'll need to start using REQ_OP_FLUSH for pure
+		 * flushes at the bio level like we already do in blk-mq.
+		 */
+		if (bio->bi_iter.bi_sector > inj->end ||
+		    bio_end_sector(bio) <= inj->start)
+			continue;
+		if (inj->chance > 1 && (get_random_u32() % inj->chance) != 0)
+			continue;
+
+		pr_info_ratelimited("%pg: injecting %s error for %s at sector %llu:%u\n",
+				disk->part0, blk_status_to_str(inj->status),
+				blk_op_str(inj->op), bio->bi_iter.bi_sector,
+				bio_sectors(bio));
+		bio->bi_status = inj->status;
+		rcu_read_unlock();
+		bio_endio(bio);
+		return true;
+	}
+	rcu_read_unlock();
+	return false;
+}
+
+static int error_inject_add(struct gendisk *disk, enum req_op op,
+		sector_t start, u64 nr_sectors, blk_status_t status,
+		unsigned int chance)
+{
+	struct blk_error_inject *inj;
+	int error = -EINVAL;
+
+	if (op == REQ_OP_LAST)
+		return -EINVAL;
+	if (status == BLK_STS_OK)
+		return -EINVAL;
+
+	inj = kzalloc_obj(*inj);
+	if (!inj)
+		return -ENOMEM;
+
+	if (nr_sectors) {
+		if (U64_MAX - nr_sectors < start)
+			goto out_free_inj;
+		inj->end = start + nr_sectors - 1;
+	} else {
+		inj->end = U64_MAX;
+	}
+
+	inj->op = op;
+	inj->start = start;
+	inj->status = status;
+	inj->chance = chance;
+
+	pr_debug_ratelimited("%pg: adding %s injection for %s at sector %llu:%llu\n",
+			disk->part0, blk_status_to_str(status),
+			blk_op_str(op),
+			start, nr_sectors);
+
+	/*
+	 * Add to the front of the list so that newer entries can partially
+	 * override other entries.  This also intentionally allows duplicate
+	 * entries as there is no real reason to reject them.
+	 */
+	mutex_lock(&disk->error_injection_lock);
+	if (!disk_live(disk)) {
+		mutex_unlock(&disk->error_injection_lock);
+		error = -ENODEV;
+		goto out_free_inj;
+	}
+	if (list_empty(&disk->error_injection_list))
+		static_branch_inc(&blk_error_injection_enabled);
+	list_add_rcu(&inj->entry, &disk->error_injection_list);
+	set_bit(GD_ERROR_INJECT, &disk->state);
+	mutex_unlock(&disk->error_injection_lock);
+	return 0;
+
+out_free_inj:
+	kfree(inj);
+	return error;
+}
+
+static void error_inject_removeall(struct gendisk *disk)
+{
+	struct blk_error_inject *inj;
+
+	mutex_lock(&disk->error_injection_lock);
+	clear_bit(GD_ERROR_INJECT, &disk->state);
+	while ((inj = list_first_entry_or_null(&disk->error_injection_list,
+			struct blk_error_inject, entry))) {
+		list_del_rcu(&inj->entry);
+		kfree_rcu_mightsleep(inj);
+	}
+	static_branch_dec(&blk_error_injection_enabled);
+	mutex_unlock(&disk->error_injection_lock);
+}
+
+enum options {
+	Opt_add			= (1u << 0),
+	Opt_removeall		= (1u << 1),
+
+	Opt_op			= (1u << 16),
+	Opt_start		= (1u << 17),
+	Opt_nr_sectors		= (1u << 18),
+	Opt_status		= (1u << 19),
+	Opt_chance		= (1u << 20),
+
+	Opt_invalid,
+};
+
+static const match_table_t opt_tokens = {
+	{ Opt_add,			"add",			},
+	{ Opt_removeall,		"removeall",		},
+	{ Opt_op,			"op=%s",		},
+	{ Opt_start,			"start=%u"		},
+	{ Opt_nr_sectors,		"nr_sectors=%u"		},
+	{ Opt_status,			"status=%s"		},
+	{ Opt_chance,			"chance=%u"		},
+	{ Opt_invalid,			NULL,			},
+};
+
+static int match_op(substring_t *args, enum req_op *op)
+{
+	const char *tag;
+
+	tag = match_strdup(args);
+	if (!tag)
+		return -ENOMEM;
+	*op = str_to_blk_op(tag);
+	if (*op == REQ_OP_LAST)
+		pr_warn("invalid op '%s'\n", tag);
+	kfree(tag);
+	return 0;
+}
+
+static int match_status(substring_t *args, blk_status_t *status)
+{
+	const char *tag;
+
+	tag = match_strdup(args);
+	if (!tag)
+		return -ENOMEM;
+	*status = tag_to_blk_status(tag);
+	if (!*status)
+		pr_warn("invalid status '%s'\n", tag);
+	kfree(tag);
+	return 0;
+}
+
+static ssize_t blk_error_injection_parse_options(struct gendisk *disk,
+		char *options)
+{
+	enum { Unset, Add, Removeall } action = Unset;
+	unsigned int option_mask = 0, chance = 1;
+	enum req_op op = REQ_OP_LAST;
+	u64 start = 0, nr_sectors = 0;
+	blk_status_t status = BLK_STS_OK;
+	substring_t args[MAX_OPT_ARGS];
+	char *p;
+
+	while ((p = strsep(&options, ",\n")) != NULL) {
+		int error = 0;
+		ssize_t token;
+
+		if (!*p)
+			continue;
+		token = match_token(p, opt_tokens, args);
+		option_mask |= token;
+		switch (token) {
+		case Opt_add:
+			if (action != Unset)
+				return -EINVAL;
+			action = Add;
+			break;
+		case Opt_removeall:
+			if (action != Unset)
+				return -EINVAL;
+			action = Removeall;
+			break;
+		case Opt_op:
+			error = match_op(args, &op);
+			break;
+		case Opt_start:
+			error = match_u64(args, &start);
+			break;
+		case Opt_nr_sectors:
+			error = match_u64(args, &nr_sectors);
+			break;
+		case Opt_status:
+			error = match_status(args, &status);
+			break;
+		case Opt_chance:
+			error = match_uint(args, &chance);
+			if (!error && chance == 0)
+				error = -EINVAL;
+			break;
+		default:
+			pr_warn("unknown parameter or missing value '%s'\n", p);
+			error = -EINVAL;
+		}
+		if (error)
+			return error;
+	}
+
+	switch (action) {
+	case Add:
+		return error_inject_add(disk, op, start, nr_sectors, status,
+				chance);
+	case Removeall:
+		if (option_mask & ~Opt_removeall)
+			return -EINVAL;
+		error_inject_removeall(disk);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
+static ssize_t blk_error_injection_write(struct file *file,
+		const char __user *ubuf, size_t count, loff_t *pos)
+{
+	struct gendisk *disk = file_inode(file)->i_private;
+	char *options;
+	int error;
+
+	options = memdup_user_nul(ubuf, count);
+	if (IS_ERR(options))
+		return PTR_ERR(options);
+	error = blk_error_injection_parse_options(disk, options);
+	kfree(options);
+
+	if (error)
+		return error;
+	return count;
+}
+
+static int blk_error_injection_show(struct seq_file *s, void *private)
+{
+	struct gendisk *disk = s->private;
+	struct blk_error_inject *inj;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(inj, &disk->error_injection_list, entry) {
+		seq_printf(s, "%llu:%llu status=%s,chance=%u",
+			inj->start, inj->end,
+			blk_status_to_tag(inj->status), inj->chance);
+		seq_putc(s, '\n');
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
+static int blk_error_injection_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, blk_error_injection_show, inode->i_private);
+}
+
+static int blk_error_injection_release(struct inode *inode, struct file *file)
+{
+	return single_release(inode, file);
+}
+
+static const struct file_operations blk_error_injection_fops = {
+	.owner		= THIS_MODULE,
+	.write		= blk_error_injection_write,
+	.read		= seq_read,
+	.open		= blk_error_injection_open,
+	.release	= blk_error_injection_release,
+};
+
+void blk_error_injection_init(struct gendisk *disk)
+{
+	debugfs_create_file("error_injection", 0600, disk->queue->debugfs_dir,
+			disk, &blk_error_injection_fops);
+}
+
+void blk_error_injection_exit(struct gendisk *disk)
+{
+	error_inject_removeall(disk);
+}
diff --git a/block/error-injection.h b/block/error-injection.h
new file mode 100644
index 000000000000..9821d773abab
--- /dev/null
+++ b/block/error-injection.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BLK_ERROR_INJECTION_H
+#define _BLK_ERROR_INJECTION_H 1
+
+#include <linux/jump_label.h>
+
+DECLARE_STATIC_KEY_FALSE(blk_error_injection_enabled);
+
+void blk_error_injection_init(struct gendisk *disk);
+void blk_error_injection_exit(struct gendisk *disk);
+bool __blk_error_inject(struct bio *bio);
+static inline bool blk_error_inject(struct bio *bio)
+{
+	if (IS_ENABLED(CONFIG_BLK_ERROR_INJECTION) &&
+	    static_branch_unlikely(&blk_error_injection_enabled) &&
+	    test_bit(GD_ERROR_INJECT, &bio->bi_bdev->bd_disk->state))
+		return __blk_error_inject(bio);
+	return false;
+}
+
+#endif /* _BLK_ERROR_INJECTION_H */
diff --git a/block/genhd.c b/block/genhd.c
index 7d6854fd28e9..f84b6a355b57 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1485,6 +1485,10 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 	lockdep_init_map(&disk->lockdep_map, "(bio completion)", lkclass, 0);
 #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
 	INIT_LIST_HEAD(&disk->slave_bdevs);
+#endif
+#ifdef CONFIG_BLK_ERROR_INJECTION
+	mutex_init(&disk->error_injection_lock);
+	INIT_LIST_HEAD(&disk->error_injection_list);
 #endif
 	mutex_init(&disk->rqos_state_mutex);
 	kobject_init(&disk->queue_kobj, &blk_queue_ktype);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 57e84d59a642..5070851cf924 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -176,6 +176,7 @@ struct gendisk {
 #define GD_SUPPRESS_PART_SCAN		5
 #define GD_OWNS_QUEUE			6
 #define GD_ZONE_APPEND_USED		7
+#define GD_ERROR_INJECT			8
 
 	struct mutex open_mutex;	/* open/close mutex */
 	unsigned open_partitions;	/* number of open partitions */
@@ -227,6 +228,11 @@ struct gendisk {
 	 */
 	struct blk_independent_access_ranges *ia_ranges;
 
+#ifdef CONFIG_BLK_ERROR_INJECTION
+	struct mutex		error_injection_lock;
+	struct list_head	error_injection_list;
+#endif
+
 	struct mutex rqos_state_mutex;	/* rqos state change mutex */
 };
 
-- 
2.53.0


^ permalink raw reply related

* [PATCH 3/4] block: add a str_to_blk_op helper
From: Christoph Hellwig @ 2026-06-11 14:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260611140703.2401204-1-hch@lst.de>

Add a helper to find the REQ_OP_XYZ constant from the "XYZ" string.
This will be used for the error injection debugfs interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
 block/blk-core.c | 10 ++++++++++
 block/blk.h      |  1 +
 2 files changed, 11 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 842b5c6f2fb4..beaab7a71fba 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -132,6 +132,16 @@ inline const char *blk_op_str(enum req_op op)
 }
 EXPORT_SYMBOL_GPL(blk_op_str);
 
+enum req_op str_to_blk_op(const char *op)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(blk_op_name); i++)
+		if (blk_op_name[i] && !strcmp(blk_op_name[i], op))
+			return (enum req_op)i;
+	return REQ_OP_LAST;
+}
+
 #define ENT(_tag, _errno, _desc)	\
 [BLK_STS_##_tag] = {				\
 	.errno		= _errno,		\
diff --git a/block/blk.h b/block/blk.h
index 3ab2cdd6ed12..507ab34a6e90 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -53,6 +53,7 @@ void blk_free_flush_queue(struct blk_flush_queue *q);
 const char *blk_status_to_str(blk_status_t status);
 const char *blk_status_to_tag(blk_status_t status);
 blk_status_t tag_to_blk_status(const char *tag);
+enum req_op str_to_blk_op(const char *op);
 
 bool __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
 bool blk_queue_start_drain(struct request_queue *q);
-- 
2.53.0


^ permalink raw reply related

* [PATCH 2/4] block: add a "tag" for block status codes
From: Christoph Hellwig @ 2026-06-11 14:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260611140703.2401204-1-hch@lst.de>

The full name of the status codes is not good for user interfaces as it
can contain white spaces.  Add the name of the status code without the
BLK_STS_ prefix as a tag so that it can be used for user interfaces.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
 block/blk-core.c | 28 ++++++++++++++++++++++++++++
 block/blk.h      |  2 ++
 2 files changed, 30 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43121a9f99f0..842b5c6f2fb4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -135,10 +135,12 @@ EXPORT_SYMBOL_GPL(blk_op_str);
 #define ENT(_tag, _errno, _desc)	\
 [BLK_STS_##_tag] = {				\
 	.errno		= _errno,		\
+	.tag		= __stringify(_tag),	\
 	.name		= _desc,		\
 }
 static const struct {
 	int		errno;
+	const char	*tag;
 	const char	*name;
 } blk_errors[] = {
 	ENT(OK,			0,		""),
@@ -203,6 +205,32 @@ const char *blk_status_to_str(blk_status_t status)
 	return blk_errors[idx].name;
 }
 
+const char *blk_status_to_tag(blk_status_t status)
+{
+	int idx = (__force int)status;
+
+	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors) || !blk_errors[idx].tag))
+		return "<null>";
+	return blk_errors[idx].tag;
+}
+
+blk_status_t tag_to_blk_status(const char *tag)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(blk_errors); i++) {
+		if (blk_errors[i].tag &&
+		    !strcmp(blk_errors[i].tag, tag))
+			return (__force blk_status_t)i;
+	}
+
+	/*
+	 * Return BLK_STS_OK for mismatches as this function is intended to
+	 * parse error status values.
+	 */
+	return BLK_STS_OK;
+}
+
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
diff --git a/block/blk.h b/block/blk.h
index 7fdfb9012ce1..3ab2cdd6ed12 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,6 +51,8 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
 void blk_free_flush_queue(struct blk_flush_queue *q);
 
 const char *blk_status_to_str(blk_status_t status);
+const char *blk_status_to_tag(blk_status_t status);
+blk_status_t tag_to_blk_status(const char *tag);
 
 bool __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
 bool blk_queue_start_drain(struct request_queue *q);
-- 
2.53.0


^ permalink raw reply related

* [PATCH 1/4] block: add a macro to initialize the status table
From: Christoph Hellwig @ 2026-06-11 14:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Bart Van Assche, Hannes Reinecke
In-Reply-To: <20260611140703.2401204-1-hch@lst.de>

Prepare for adding a new value to the error table by adding a macro
to fill it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
 block/blk-core.c | 45 +++++++++++++++++++++++++--------------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1c637db79e59..43121a9f99f0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -132,39 +132,44 @@ inline const char *blk_op_str(enum req_op op)
 }
 EXPORT_SYMBOL_GPL(blk_op_str);
 
+#define ENT(_tag, _errno, _desc)	\
+[BLK_STS_##_tag] = {				\
+	.errno		= _errno,		\
+	.name		= _desc,		\
+}
 static const struct {
 	int		errno;
 	const char	*name;
 } blk_errors[] = {
-	[BLK_STS_OK]		= { 0,		"" },
-	[BLK_STS_NOTSUPP]	= { -EOPNOTSUPP, "operation not supported" },
-	[BLK_STS_TIMEOUT]	= { -ETIMEDOUT,	"timeout" },
-	[BLK_STS_NOSPC]		= { -ENOSPC,	"critical space allocation" },
-	[BLK_STS_TRANSPORT]	= { -ENOLINK,	"recoverable transport" },
-	[BLK_STS_TARGET]	= { -EREMOTEIO,	"critical target" },
-	[BLK_STS_RESV_CONFLICT]	= { -EBADE,	"reservation conflict" },
-	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
-	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
-	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
-	[BLK_STS_DEV_RESOURCE]	= { -EBUSY,	"device resource" },
-	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },
-	[BLK_STS_OFFLINE]	= { -ENODEV,	"device offline" },
+	ENT(OK,			0,		""),
+	ENT(NOTSUPP,		-EOPNOTSUPP,	"operation not supported"),
+	ENT(TIMEOUT,		-ETIMEDOUT,	"timeout"),
+	ENT(NOSPC,		-ENOSPC,	"critical space allocation"),
+	ENT(TRANSPORT,		-ENOLINK,	"recoverable transport"),
+	ENT(TARGET,		-EREMOTEIO,	"critical target"),
+	ENT(RESV_CONFLICT,	-EBADE,		"reservation conflict"),
+	ENT(MEDIUM,		-ENODATA,	"critical medium"),
+	ENT(PROTECTION,		-EILSEQ,	"protection"),
+	ENT(RESOURCE,		-ENOMEM,	"kernel resource"),
+	ENT(DEV_RESOURCE,	-EBUSY,		"device resource"),
+	ENT(AGAIN,		-EAGAIN,	"nonblocking retry"),
+	ENT(OFFLINE,		-ENODEV,	"device offline"),
 
 	/* device mapper special case, should not leak out: */
-	[BLK_STS_DM_REQUEUE]	= { -EREMCHG, "dm internal retry" },
+	ENT(DM_REQUEUE,		-EREMCHG,	"dm internal retry"),
 
 	/* zone device specific errors */
-	[BLK_STS_ZONE_OPEN_RESOURCE]	= { -ETOOMANYREFS, "open zones exceeded" },
-	[BLK_STS_ZONE_ACTIVE_RESOURCE]	= { -EOVERFLOW, "active zones exceeded" },
+	ENT(ZONE_OPEN_RESOURCE, -ETOOMANYREFS,	"open zones exceeded"),
+	ENT(ZONE_ACTIVE_RESOURCE, -EOVERFLOW,	"active zones exceeded"),
 
 	/* Command duration limit device-side timeout */
-	[BLK_STS_DURATION_LIMIT]	= { -ETIME, "duration limit exceeded" },
-
-	[BLK_STS_INVAL]		= { -EINVAL,	"invalid" },
+	ENT(DURATION_LIMIT,	-ETIME,		"duration limit exceeded"),
+	ENT(INVAL,		-EINVAL,	"invalid"),
 
 	/* everything else not covered above: */
-	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
+	ENT(IOERR,		-EIO,		"I/O"),
 };
+#undef ENT
 
 blk_status_t errno_to_blk_status(int errno)
 {
-- 
2.53.0


^ permalink raw reply related

* configurable block error injection v5
From: Christoph Hellwig @ 2026-06-11 14:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc

Hi all,

this series adds a new configurable block error injection facility.
We already have a few to inject block errors, but unfortunately most
of them are either not very useful or hard to use, or both:

 - The fail_make_request failure injection point can't distinguish
   different commands, different ranges in the file and can only injection
   plain I/O errors.
 - the should_fail_bio 'dynamic' failure injection has all the same issues
   as fail_make_request
 - dm-error can only fail all command in the table using BLK_STS_IOERR
   and requires setting up a new block device
 - dm-flakey and dm-dust allow all kinds of configurability, but still
   don't have good error selection, no good support for non-read/write
   commands and are limited to the dm table alignment requirements,
   which for zoned devices enforces setting them up for an entire zone.
   They also once again require setting up a stacked block device,
   which is really annoying in harnesses like xfstests

This series adds a new debugfs-based block layer error injection
that allows to configure what operations and ranges the injection
applied to, and what status to return.  It also allows to configure a
failure ratio similar to the xfs errortag injection.

Changes since v4:
 - don't unlock in removeall to avoid a race between removeall and setup
 - document why we can't match 0-sized bios

Changes since v3:
 - use a static branch to guard the new condition
 - split out a new header so that jump_label.h doesn't get pulled into
   blk.h
 - more checking for impossible conditions in blk_status_to_tag
 - more spelling fixes

Changes since v2:
 - improve the documentation a bit
 - fix a spelling mistake in a comment

Changes since v1:
 - drop the should_fail_bio removal and cleanup depending on it, as it's
   used by eBPF programs and thus a hidden UABI.
 - as a result split the code out to it's own Kconfig symbol
 - various error handling fixed pointed out by Keith
 - documentation spelling fixes pointed out by Randy

Diffstat:
 Documentation/block/error-injection.rst |   59 +++++
 Documentation/block/index.rst           |    1 
 block/Kconfig                           |    8 
 block/Makefile                          |    1 
 block/blk-core.c                        |   87 ++++++--
 block/blk-sysfs.c                       |    5 
 block/blk.h                             |    3 
 block/error-injection.c                 |  315 ++++++++++++++++++++++++++++++++
 block/error-injection.h                 |   21 ++
 block/genhd.c                           |    4 
 include/linux/blkdev.h                  |    6 
 11 files changed, 490 insertions(+), 20 deletions(-)

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Christoph Hellwig @ 2026-06-11 13:38 UTC (permalink / raw)
  To: Keith Busch
  Cc: Carlos Maiolino, Christoph Hellwig, brauner, linux-block,
	linux-fsdevel, linux-ext4, linux-xfs, Hannes Reinecke,
	Martin K. Petersen, Jens Axboe
In-Reply-To: <aiqwy5DfHI79KXuZ@kbusch-mbp>

On Thu, Jun 11, 2026 at 06:57:47AM -0600, Keith Busch wrote:
> It's entirely possible a device supports byte aligned addresses. The
> block layer just doesn't let a driver report that. So either it really
> was successful because you found a bug that skips the alignment checks,
> or your device silently corrupted your payload.
> 
> Anyway, my earlier suggestion should work. Ming thinks it may go to far,
> though, in not taking the optimization when it was possible. So here's
> an alternative suggestion that should get things working as expected:

The fix below looks like it is addressing a real bug.  I'm not sure if
Carlos is hitting it, but we were missing the alignment checks for
single-bvec fast path bios so far indeed.


^ permalink raw reply

* Re: [PATCH v3 2/4] scsi: host: allocate struct Scsi_Host on the NUMA node of the host adapter
From: Stefan Hajnoczi @ 2026-06-10 15:37 UTC (permalink / raw)
  To: Sumit Saxena, Michael S. Tsirkin
  Cc: Martin K . Petersen, Jens Axboe, James E . J . Bottomley,
	linux-scsi, linux-block, Adam Radford, Khalid Aziz,
	Adaptec OEM Raid Solutions, Matthew Wilcox, Hannes Reinecke,
	Juergen E . Fischer, Russell King, linux-arm-kernel, Finn Thain,
	Michael Schmitz, Anil Gurumurthy, Sudarsana Kalluru,
	Oliver Neukum, Ali Akcaagac, Jamie Lenehan, Ram Vegesna,
	target-devel, Bradley Grove, Satish Kharat, Sesidhar Baddela,
	Karan Tilak Kumar, Yihang Li, Don Brace, storagedev,
	HighPoint Linux Team, Tyrel Datwyler, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy, linuxppc-dev,
	Brian King, Lee Duncan, Chris Leech, Mike Christie, open-iscsi,
	Justin Tee, Paul Ely, Kashyap Desai, Shivasharan S,
	Chandrakanth Patil, megaraidlinux.pdl, Sathya Prakash Veerichetty,
	Sreekanth Reddy, mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani,
	Ranjan Kumar, MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Eugenio Perez,
	virtualization, Vishal Bhakta, bcm-kernel-feedback-list,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	xen-devel, John Garry
In-Reply-To: <20260609121806.2121755-3-sumit.saxena@broadcom.com>

[-- Attachment #1: Type: text/plain, Size: 878 bytes --]

On Tue, Jun 09, 2026 at 05:48:01PM +0530, Sumit Saxena wrote:
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 5fdaa71f0652..88375574cb18 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -929,7 +929,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
>  	num_targets = virtscsi_config_get(vdev, max_target) + 1;
>  
>  	shost = scsi_host_alloc(&virtscsi_host_template,
> -				struct_size(vscsi, req_vqs, num_queues));
> +				struct_size(vscsi, req_vqs, num_queues), NULL);

A virtio_device has a parent (this is the virtio transport, like
virtio_pci) and that may have NUMA node.

drivers/virtio/virtio.c:register_virtio_device() could call
set_dev_node(dev, dev_to_node(dev->parent)) to propagate the NUMA node
to the virtio_device if it is not already automatically propagated.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Keith Busch @ 2026-06-11 12:57 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: Christoph Hellwig, brauner, linux-block, linux-fsdevel,
	linux-ext4, linux-xfs, Hannes Reinecke, Martin K. Petersen,
	Jens Axboe
In-Reply-To: <aiqBvF93P4NjfaDR@nidhogg.toxiclabs.cc>

On Thu, Jun 11, 2026 at 12:05:22PM +0200, Carlos Maiolino wrote:
> The passed in address 0x1003af80001 is one byte misaligned and shouldn't
> (at least in theory) ever be accepted no? Or am I missing something
> else?

It's entirely possible a device supports byte aligned addresses. The
block layer just doesn't let a driver report that. So either it really
was successful because you found a bug that skips the alignment checks,
or your device silently corrupted your payload.

Anyway, my earlier suggestion should work. Ming thinks it may go to far,
though, in not taking the optimization when it was possible. So here's
an alternative suggestion that should get things working as expected:

---
diff --git a/block/blk.h b/block/blk.h
index 1a2d9101bba04..4c31762d6fb5f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -404,6 +404,9 @@ static inline bool bio_may_need_split(struct bio *bio,
        bv = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
        if (bio->bi_iter.bi_size > bv->bv_len - bio->bi_iter.bi_bvec_done)
                return true;
+
+       if ((bv->bv_offset | bv->bv_len) & lim->dma_alignment)
+               return true;
        return bv->bv_len + bv->bv_offset > lim->max_fast_segment_size;
 }
 
-- 

^ permalink raw reply related

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-11 12:27 UTC (permalink / raw)
  To: Andy Shevchenko, Christian König
  Cc: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
	Davidlohr Bueso, Paul E . McKenney, Josh Triplett, Peter Zijlstra,
	Ingo Molnar, Will Deacon, Boqun Feng, Liam Girdwood, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin, Huang Rui,
	Eddie James, Mark Brown, Maxime Coquelin, Alexandre Torgue,
	Laxman Dewangan, Neil Armstrong, Robert Foss, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Laurent Pinchart, Jonas Karlman, Jernej Skrabec, Matthew Auld,
	Matthew Brost, Waiman Long, drbd-dev, linux-block,
	linux1394-devel, dri-devel, intel-gfx, linux-spi, linux-stm32,
	linux-arm-kernel, linux-tegra, linux-sound, linux-kernel,
	Andrew Morton, Randy Dunlap, Christian Brauner, David Howells,
	Luca Ceresoli, Kaito Cheng, Muchun Song, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder, Jens Axboe,
	Takashi Sakamoto, Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <aipx1goKIsk40vrF@ashevche-desk.local>



在 2026/6/11 16:29, Andy Shevchenko 写道:
> On Thu, Jun 11, 2026 at 10:01:25AM +0200, Christian König wrote:
>> On 6/10/26 17:02, Andy Shevchenko wrote:
>>> On Wed, Jun 10, 2026 at 11:11:34AM +0200, Christian König wrote:
>>>> On 6/10/26 10:18, Kaitao Cheng wrote:
>>>>> 在 2026/6/10 16:07, Christian König 写道:
> 
> ...
> 
>>>>> Should we revert to v1, or keep list_for_each_entry() and
>>>>> list_for_each_entry_safe() as they are, close this thread, and make no
>>>>> changes?
>>>>>
>>>>> Link to v1:
>>>>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
>>>>>
>>>>> Or do you have any better suggestions?
>>>>
>>>> v1 looks perfectly reasonable to me.
>>>
>>> But why not just hiding that once for all (in case they don't use the temporary
>>> iterator)? Easy to automate, robust — everyone is happy?
>>
>> As far as I can see that is an extremely bad idea.
>>
>> The distinction between the use cases of 'iterating the list' and 'iterating
>> the list while you modify it' is completely intentional.

I agree with this point. It is very reasonable for list_for_each_entry()
to be used only for 'iterating the list'. In practice, however, we do not
have an effective way to enforce that rule for users, whereas the distinction
between bool and int can be enforced by the compiler. The 13 patches in the
current series are all real examples where users modify the list while using
list_for_each_entry(). Is a rule that cannot actually be enforced reasonable?
This is just my humble opinion, and I am raising it here only for discussion.

> What I meant is to keep the name, just drop the parameter (make it hidden and
> being defined inside list_for_each_*_safe() cases).

I agree with this approach, but the specific details still need to be settled,
including the issue described in the link below.

https://lore.kernel.org/all/0a333eb8-fc29-4b85-993e-6b726f4c7cf0@linux.dev/

Of course, there is also the suffix-renaming issue raised by Christian.

>> See the bool type can be implemented by int as well, but it is just a
>> different use case.
> 
>>>> You should just include some patches in the same patch set to actually use
>>>> the new macros.
>>>>
>>>> If you modify the files under drivers/dma-buf or drivers/gpu/drm/amd to use
>>>> the new macro I'm happy to review that.
>>>
>>
> 

-- 
Thanks
Kaitao Cheng


^ permalink raw reply

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-11 12:04 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Christian König, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <aippVAj83dCzscTN@ashevche-desk.local>



在 2026/6/11 15:52, Andy Shevchenko 写道:
> On Thu, Jun 11, 2026 at 03:36:01PM +0800, Kaitao Cheng wrote:
>> 在 2026/6/11 14:54, Andy Shevchenko 写道:
>>> On Thu, Jun 11, 2026 at 12:42:02PM +0800, Kaitao Cheng wrote:
>>>> 在 2026/6/10 22:43, Andy Shevchenko 写道:
>>>>> On Wed, Jun 10, 2026 at 02:14:06PM +0800, Kaitao Cheng wrote:
>>>>>> 在 2026/6/9 18:33, Christian König 写道:
>>>>>>> On 6/9/26 08:13, Kaitao Cheng wrote:
>>>
>>>>>>>> This series prepares for, and then updates, the list_for_each_entry()
>>>>>>>> family so the common entry iterators cache their next or previous cursor
>>>>>>>> before the loop body runs.
>>>>>>>
>>>>>>> Why in the world would we want to do that?
>>>>>>>
>>>>>>> The safe and non-safe variants have very distinct use cases and that is completely intentional.
>>>>>>>
>>>>>>> What we could improve maybe is the documentation, from my experience an astonishing large amount of people have misconceptions about the safe variants.
>>>>>>>
>>>>>>>> The first 13 patches open-code loops that intentionally depend on the
>>>>>>>> old "derive the next entry from the current cursor at the end of the
>>>>>>>> iteration" behaviour.  These loops append work to the list being walked,
>>>>>>>> restart traversal after dropping a lock, skip an entry consumed by the
>>>>>>>> current iteration, or otherwise adjust the cursor in the loop body.
>>>>>>>
>>>>>>> Well I have to clearly reject the changes for subsystems/components I'm maintaining, that just looks horrible to me and I clearly don't see a good reason for that.
>>>>>>
>>>>>> Hi Christian and Andy Shevchenko,
>>>>>>
>>>>>> Thanks for taking a look. I would like to clarify the point you raised.
>>>>>>
>>>>>> The reason I started looking at this is the original motivation behind
>>>>>> the _safe() variants.  They exist because some users need to remove, move
>>>>>> or otherwise consume the current entry while walking the list.  In that
>>>>>> case the next cursor has to be preserved before the loop body can modify
>>>>>> the current entry.
>>>>>>
>>>>>> The unfortunate part is that this could not be expressed with the
>>>>>> existing list_for_each_entry() interface without changing its calling
>>>>>> convention.  The _safe() variants had to grow an extra argument for the
>>>>>> temporary cursor, and that is why we ended up with a separate family of
>>>>>> macros.
>>>>>>
>>>>>> But conceptually, the distinction does not have to be exposed as two
>>>>>> different iterator families forever.  The difference is an implementation
>>>>>> detail: whether the iterator keeps the next/previous cursor before the
>>>>>> body runs.  This series makes the common list_for_each_entry() iterators
>>>>>> do that internally, so the safe and non-safe forms can effectively be
>>>>>> folded together, or at least the need for a separate public _safe()
>>>>>> interface becomes much weaker.
>>>>>>
>>>>>> There is also a usability issue with the current _safe() interface.  The
>>>>>> caller is forced to define a temporary cursor outside the macro and pass
>>>>>> it in, even though almost all users never use that cursor directly.  It is
>>>>>> just boilerplate required by the macro implementation.  I find that
>>>>>> redundant and awkward: the temporary cursor is an internal detail of the
>>>>>> iteration, but every caller has to spell it out.
>>>>>
>>>>> Ah, I think the distinct macro families is that what we want.
>>>>> But the hiding of the parameter can be done inside list_for_each_*_safe().
>>>>> You can do a treewide change with coccinelle.
>>>>>
>>>>> Sorry if I didn't get the whole idea from your previous contributions.
>>>>>
>>>>> Note, even cases that would need a temporary cursor may be switched to
>>>>> new list_for_each_*_safe(), see how PCI macros for iterating over resources
>>>>> are implemented (include/linux/pci.h).
>>>>
>>>> Thanks for your suggestions. I've written a demo based on your feedback.
>>>> Could you please review it and share your thoughts on this approach?
>>>
>>> Have you checked how many users actually need the temporary storage?
>>
>> In Muchun's reply, he mentioned the following:
>>
>> There are 9,925 list_for_each_entry() call sites in total. Among them,
>> 9,919 do not require any adaptation, and only 6 need to be refactored:
>>
>> As for list_for_each_entry_safe(), there are 4,572 callers. 4,550 of them
>> can be directly replaced by the new list_for_each_entry(), while 22 cannot
>> be replaced
>>
>> https://lore.kernel.org/all/2B3BFA1E-08B8-42AB-87D6-A28BF15E5C58@linux.dev/
>>
>> I only used Coccinelle to scan for list_for_each_entry() call sites, and
>> found the 13 call sites shown in the current patch series, which cover
>> the 6 cases mentioned in Muchun's email. I have not yet run the Coccinelle
>> scan for list_for_each_entry_safe().
>>
>> If we need to handle all 9,925 list_for_each_entry() call sites or all 4,572
>> list_for_each_entry_safe() call sites in one go, would such a change be too
>> large? I expect it would affect almost every kernel subsystem.
> 
> If it's done by Linus himself during the day when he prepares -rc1, it's fine.
> You would need to provide a good justification for the change, though.
> 
> But in the above statistics the 4572 vs 4550, so the first step is to investigate
> why temporary cursor is used in those 22 cases and what we can do to avoid that.

Here is one example: in shmem_unuse() in mm/shmem.c, list_for_each_entry_safe()
is used. In this case, the caller releases shmem_swaplist_lock inside the loop.
During that window, the list may be modified, and the previously saved next may
become stale. Therefore, next needs to be recomputed so that subsequent iteration
is based on the latest list state.

This leads to two possible approaches:

1. Change list_for_each_entry_safe(pos, n, head, member) directly to
list_for_each_entry_safe(pos, head, member). If we do this, the case
above would need to be converted to an open-coded form.

2. Support both forms, list_for_each_entry_safe(pos, n, head, member)
and list_for_each_entry_safe(pos, head, member), as described in the
link below.
https://lore.kernel.org/all/9b98e860-11df-44bf-9a95-3046d2c274a6@linux.dev/

Do you have any other thoughts on this?

>> I wonder whether it would be better to first provide the necessary
>> compatibility APIs, and then let each subsystem owner update their code as
>> appropriate. That would make the impact more controlled, similar to how
>> the current folio replacement of page is being handled.
-- 
Thanks
Kaitao Cheng


^ permalink raw reply

* Re: [LSF/MM/BPF RFC PATCH 00/13]
From: Leon Romanovsky @ 2026-06-11 11:59 UTC (permalink / raw)
  To: Haris Iqbal
  Cc: linux-block, linux-rdma, linux-kernel, axboe, bvanassche, hch,
	jgg, jinpu.wang
In-Reply-To: <CAJpMwyg-6Qxskq2ktuhvf46yD5848J9BYLMPPfBLjg2Uzs=xnw@mail.gmail.com>

On Wed, May 27, 2026 at 02:44:08PM +0200, Haris Iqbal wrote:
> On Tue, May 12, 2026 at 12:34 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Tue, May 05, 2026 at 09:46:12AM +0200, Md Haris Iqbal wrote:
> > > Following a conversation with Bart yesterday, I am sending the RMR+BRMR
> > > code through patch for easier review.
> > >
> > > The patches apply over the for-next branch of the block tree over commit
> > > 07dfa981ca3
> > >
> > > For context,
> > > RMR (Reliable Multicast over RTRS) is a kernel module that provides
> > > active-active block-level replication over RDMA. It guarantees delivery
> > > of IO to a group of storage nodes and handles resynchronization of data
> > > directly between storage nodes without involving the compute client.
> > >
> > > BRMR (Block device over RMR) sits on top of RMR and exposes a standard
> > > Linux block device (/dev/brmrX) backed by an RMR pool. Together, RMR and
> > > BRMR provide a single-hop replication and resynchronization solution for
> > > RDMA-connected storage clusters.
> > >
> > > My session is on Wednesday, at 12 in the storage room (Istanbul).
> >
> > To summarize the discussion:
> >
> > 1. Move as much logic as possible into the block layer; RDMA should serve
> >    strictly as a transport.
> > 2. Identify another in‑kernel user of this functionality, and add support for
> >    it if required. At least accommodate potential users elsewhere in the
> >    kernel.
> 
> Thanks for the summary Leon.
> 
> The main logic which handles multicast/replication legs, missed I/O
> tracking, re-synchronization, etc are the core parts of RMR.
> If we move those to a separate module, there won't be much left in
> RMR. RMR already uses RTRS from the RDMA subsystem as transport.
> 
> Having said that, I am not against moving RMR out of the RDMA layer.
> It can serve as a reliable replication service/library for any other
> user in the kernel to use.
> Which subsystem (block or something else) would be a better fit then,
> can be discussed.
> 
> PS: Would this be a good candidate for a session/discussion in the upcoming LPC?

Probably yes.

Thanks

> 
> >
> > Thanks

^ permalink raw reply

* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
From: Ming Lei @ 2026-06-11 11:22 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki; +Cc: linux-block, Jens Axboe, Nilay Shroff
In-Reply-To: <20260611074200.474676-1-shinichiro.kawasaki@wdc.com>

Hi Shin'ichiro,

On Thu, Jun 11, 2026 at 04:41:59PM +0900, Shin'ichiro Kawasaki wrote:
> I observed that the blktests test case block/005 hangs on a specific
> server hardware using a specific HDD as a block device. During the test
> case run, the kernel reported a KASAN null-ptr-deref (and other memory
> corruption symptoms) [2]. This failure looked sporadic and hardware-
> dependent.
> 
> From the kernel message, I noticed that udev-worker wrote to the
> queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
> The test case block/005 also wrote to the same sysfs attribute, which

sysfs write is supposed to be serialized...

> indicated that a concurrent elevator change caused the failure. I
> created a new blktests test case that simply does the concurrent
> elevator change with a null_blk device [1]. It recreates the failure in
> a stable manner on various server hardware.
> 
> Using the new test case, I bisected and found that the failure first
> appears at the commit 370ac285f23a ("block: avoid cpu_hotplug_lock
> depedency on freeze_lock") in the kernel tag v6.17-rc3. However, that
> commit does not appear to explain the failure by itself: it changed the
> queue freeze behavior and only unveiled a race, probably. Looking back
> at the changes to elevator_change(), I think the actual cause is the
> commit 559dc11143eb ("block: move elv_register[unregister]_queue out of
> elevator_lock") in the kernel tag v6.16-rc1. This commit moved
> elevator_change_done() out of the guard of ->elevator_lock and the queue
> freeze. As a result, when two threads write to the same queue/scheduler
> attribute concurrently, elevator_change_done() runs in parallel causing
> the memory corruption and the hang.
> 
> As the fix attempt, I created the patch in this series. It adds a new
> mutex that serializes the whole elevator switch sequence, including the
> elevator_change_done() call. I ran the reproducer with lockdep enabled
> and confirmed that the patch avoids the failure and new WARN was not
> observed.
> 
> However, the fix patch adds a new lock, and I'm not sure if it is the best
> solution. Comments on the patch, or suggestions for a better solution,
> would be appreciated.
> 
> [1] https://github.com/kawasaki/blktests/commit/4f8c63ed7d049f5e9c935c3fe00142b2a3629826
> 
> [2]
> 
> [30102.760660] [ T186170] run blktests block/005 at 2026-05-11 05:53:53
> [30104.969837] [ T186111] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
> [30104.983590] [ T186111] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> [30104.992929] [ T186111] CPU: 2 UID: 0 PID: 186111 Comm: (udev-worker) Not tainted 7.1.0-rc2-kts+ #1 PREEMPT(lazy)
> [30105.004019] [ T186111] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
> [30105.013216] [ T186111] RIP: 0010:blk_mq_debugfs_register_sched+0x46/0x210
> [30105.020667] [ T186111] Code: 48 89 fa 48 c1 ea 03 48 83 ec 10 80 3c 02 00 0f 85 83 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 08 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 57 01 00 00 48 c7 c0 24 a3 b3 97 4
> 8 8b 6d 00 48
> [30105.041036] [ T186111] RSP: 0018:ffff88816b9c7708 EFLAGS: 00010246
> [30105.048111] [ T186111] RAX: dffffc0000000000 RBX: ffff888117f18000 RCX: 0000000000000000
> [30105.057097] [ T186111] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888117f18008
> [30105.066086] [ T186111] RBP: 0000000000000000 R08: ffffffff957c47ac R09: fffffbfff2f6633c
> [30105.075083] [ T186111] R10: ffff88816b9c7730 R11: 0000000000000001 R12: ffff88814c1f2000
> [30105.084088] [ T186111] R13: ffff88814c1f2018 R14: ffff8881b8a336ac R15: ffffffff95bfae30
> [30105.093111] [ T186111] FS:  00007fc1c7970c40(0000) GS:ffff8887c534e000(0000) knlGS:0000000000000000
> [30105.103093] [ T186111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [30105.110751] [ T186111] CR2: 000055fa37e182c0 CR3: 0000000108350003 CR4: 00000000001726f0
> [30105.119796] [ T186111] Call Trace:
> [30105.124154] [ T186111]  <TASK>
> [30105.128301] [ T186111]  blk_mq_sched_reg_debugfs+0x8d/0x1a0
> [30105.134193] [ T186111]  elevator_change_done+0x2f2/0x610

blk_mq_sched_reg_debugfs already includes debugfs lock, so I feel the proper
fix could be check & avoid the null-ptr-deref.

Adding new lock should be the last straw usually, especially this one is
depended by queue freeze.

Thanks,
Ming

^ permalink raw reply

* Re: [PATCH] rust: block: require `Sync` for `Operations::QueueData`
From: Andreas Hindborg @ 2026-06-10 18:02 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Boqun Feng, Miguel Ojeda, Gary Guo, Björn Roy Baron,
	Benno Lossin, Alice Ryhl, Trevor Gross, Danilo Krummrich,
	Jens Axboe, Daniel Almeida, linux-block, rust-for-linux,
	linux-kernel
In-Reply-To: <CANiq72mGmRTfauU-HUOyBVSG0GM7q39=6aP8p1djN2cHr5C_Kw@mail.gmail.com>

Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> writes:

> On Mon, Jun 8, 2026 at 10:25 AM Andreas Hindborg <a.hindborg@kernel.org> wrote:
>>
>> Fixes: 90d952fac8ac ("rust: block: add `GenDisk` private data support")
>
> Should this be Cc: stable given the hash is old? i.e. 6.18.y+

Yes it should have cc stable, sorry.


Best regards,
Andreas Hindborg




^ permalink raw reply

* Re: [PATCH v3 3/4] block: drop shared-tag fairness throttling
From: Sumit Saxena @ 2026-06-11 10:43 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Martin K . Petersen, Jens Axboe,
	James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Bart Van Assche
In-Reply-To: <aimSb9I0Vl-68hy9@kbusch-mbp>


[-- Attachment #1.1: Type: text/plain, Size: 1232 bytes --]

On Wed, Jun 10, 2026 at 10:06 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Jun 10, 2026 at 09:16:11PM +0530, Sumit Saxena wrote:
> > The motivation for this change stems from performance issue we
> > encountered due to false sharing of the 'nr_active_requests_shared_tags'
> > counter
> > on certain CPU architectures. I initially submitted a patch to move that
> > counter to
> > its own cache line to avoid conflicts with 'nr_requests' and other hot
> > fields
> > (see:
> >
https://patchwork.kernel.org/project/linux-scsi/patch/20260402074637.92417-3-sumit.saxena@broadcom.com/
> > ).
> >
> > During the review, Bart shared his work, which eliminates the
> > counter entirely by removing the fairness throttling. My testing
confirmed
> > that
> > this approach resolved the performance issues and improved IOPS.
> > This patch is part of a larger set, and I have reported the cumulative
> > performance
> > improvements in the cover letter.
>
> So the problem is just the atomic operation accounting overhead? I
> previously thought the device just really needed to consume all the tags
> to hit performance.
That's correct, it's the atomic operation accounting overhead.

Thanks,
Sumit

[-- Attachment #1.2: Type: text/html, Size: 1660 bytes --]

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Carlos Maiolino @ 2026-06-11 10:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: brauner, linux-block, linux-fsdevel, linux-ext4, linux-xfs,
	Keith Busch, Hannes Reinecke, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260611055744.GA18538@lst.de>

On Thu, Jun 11, 2026 at 07:57:44AM +0200, Christoph Hellwig wrote:
> On Wed, Jun 10, 2026 at 04:52:11PM +0200, cem@kernel.org wrote:
> > From: Carlos Maiolino <cem@kernel.org>
> > 
> > The DIO alignment check has been lifted from iomap layer to rely on the
> > block layer to enforce proper alignment when issuing direct IO
> > operations. This though, depending on the IO size and buffer address
> > passed to the IO operation may lead to user-visible behavior change.
> > 
> > This has been caught initially by LTP test diotest4 running on
> > PPC architecture, where the test fails because a read() operation
> > with a supposedly misaligned buffer succeeds instead of an expected
> > -EINVAL.
> > This has no direct relationship with PPC, but seems to do with the
> > IO size crossing page borders or not.
> 
> I don't understand the problem here.  Why do we want to insist on a
> failure when we can support it?  I think the test is just broken.

The problem I see here from my POV is this changed the behavior expected
from the syscalls when the passed in buffer is misaligned as the read()
(in the test) succeeds when the passed in buffer does not match the
alignment requirements (see below).

I am pretty happy in declaring this a test bug, but I thought it would be
worth starting a discussion about the sudden/unexpected behavior change.
Not to mention now different filesystems will have different alignment
requirements which seems at least "weird" to me. I mean, now suddenly
iomap-based filesystems have a more relaxed alignment constraint than
for example btrfs.

> 
> > The problematic behavior is reproducible on x86 by reducing the IO size
> > to something < PAGE_SIZE, so the misaligned read()s will also be accepted
> > by the block layer.
> 
> What do you mean with misaligned here?  For a long time the kernel
> supports basically arbitrary low memory alignment for diret I/O,
> just bounded by the device capabilities (typical 4 byte alignment).

The test sends to read() a buffer misplaced by 1 byte (see below) which
doesn't match the system's alignment constraints at least from the user
passed buffer perspective.
I've been assuming it should match device's dma_alignment constraints.
The typical 4 byte alignment indeed is the requirement from my PPC
machine, but not for my x86:

> 
> The supported memory alignment is reported in the statx
> dio_mem_align.  What does that say compared to the alignment
> expectations in this test?

From my x86:
dio_mem_align: 512
dio_offset_align: 512

From PPC:
dio_mem_align: 4
dio_offset_align: 512

But this does not explain how the following call would succeed in either
case (below one taken from PPC):

openat(dirfd=AT_FDCWD, pathname="testdata-4.135256", flags=O_RDWR|O_DIRECT) = 3
_llseek(fd=3, offset=4096, result=[4096], whence=SEEK_SET) = 0
read(arg1=0x3, arg2=0x1003af80001, arg3=0x1000) = 0x1000

The passed in address 0x1003af80001 is one byte misaligned and shouldn't
(at least in theory) ever be accepted no? Or am I missing something
else?

^ permalink raw reply

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Christian König @ 2026-06-11  8:39 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Kaitao Cheng, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <aipx1goKIsk40vrF@ashevche-desk.local>

On 6/11/26 10:29, Andy Shevchenko wrote:
> On Thu, Jun 11, 2026 at 10:01:25AM +0200, Christian König wrote:
>> On 6/10/26 17:02, Andy Shevchenko wrote:
>>> On Wed, Jun 10, 2026 at 11:11:34AM +0200, Christian König wrote:
>>>> On 6/10/26 10:18, Kaitao Cheng wrote:
>>>>> 在 2026/6/10 16:07, Christian König 写道:
> 
> ...
> 
>>>>> Should we revert to v1, or keep list_for_each_entry() and
>>>>> list_for_each_entry_safe() as they are, close this thread, and make no
>>>>> changes?
>>>>>
>>>>> Link to v1:
>>>>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
>>>>>
>>>>> Or do you have any better suggestions?
>>>>
>>>> v1 looks perfectly reasonable to me.
>>>
>>> But why not just hiding that once for all (in case they don't use the temporary
>>> iterator)? Easy to automate, robust — everyone is happy?
>>
>> As far as I can see that is an extremely bad idea.
>>
>> The distinction between the use cases of 'iterating the list' and 'iterating
>> the list while you modify it' is completely intentional.
> 
> What I meant is to keep the name, just drop the parameter (make it hidden and
> being defined inside list_for_each_*_safe() cases).

Ah, sorry I was still thinking the suggestion is to merge list_for_each_entry() and list_for_each_entry_safe().

If the modification is done all at once or in steps doesn't really matter for me as long as the patch can be re-created reproducible.

But I'm wondering if we couldn't improve the name at the same time. The _safe() postfix has caused tons of confusion where especially beginners thought that it is a thread-safe variant, which it clearly isn't.

The _mutable() postfix sounds like a much better description to what happens here.

Regards,
Christian.

> 
>> See the bool type can be implemented by int as well, but it is just a
>> different use case.
> 
>>>> You should just include some patches in the same patch set to actually use
>>>> the new macros.
>>>>
>>>> If you modify the files under drivers/dma-buf or drivers/gpu/drm/amd to use
>>>> the new macro I'm happy to review that.
>>>
>>
> 


^ permalink raw reply

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Andy Shevchenko @ 2026-06-11  8:29 UTC (permalink / raw)
  To: Christian König
  Cc: Kaitao Cheng, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <92683537-8404-47fe-a4ba-160e54870f0b@amd.com>

On Thu, Jun 11, 2026 at 10:01:25AM +0200, Christian König wrote:
> On 6/10/26 17:02, Andy Shevchenko wrote:
> > On Wed, Jun 10, 2026 at 11:11:34AM +0200, Christian König wrote:
> >> On 6/10/26 10:18, Kaitao Cheng wrote:
> >>> 在 2026/6/10 16:07, Christian König 写道:

...

> >>> Should we revert to v1, or keep list_for_each_entry() and
> >>> list_for_each_entry_safe() as they are, close this thread, and make no
> >>> changes?
> >>>
> >>> Link to v1:
> >>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
> >>>
> >>> Or do you have any better suggestions?
> >>
> >> v1 looks perfectly reasonable to me.
> > 
> > But why not just hiding that once for all (in case they don't use the temporary
> > iterator)? Easy to automate, robust — everyone is happy?
> 
> As far as I can see that is an extremely bad idea.
> 
> The distinction between the use cases of 'iterating the list' and 'iterating
> the list while you modify it' is completely intentional.

What I meant is to keep the name, just drop the parameter (make it hidden and
being defined inside list_for_each_*_safe() cases).

> See the bool type can be implemented by int as well, but it is just a
> different use case.

> >> You should just include some patches in the same patch set to actually use
> >> the new macros.
> >>
> >> If you modify the files under drivers/dma-buf or drivers/gpu/drm/amd to use
> >> the new macro I'm happy to review that.
> > 
> 

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Christian König @ 2026-06-11  8:01 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Kaitao Cheng, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <ail8iNvPrJnE7p58@ashevche-desk.local>

On 6/10/26 17:02, Andy Shevchenko wrote:
> On Wed, Jun 10, 2026 at 11:11:34AM +0200, Christian König wrote:
>> On 6/10/26 10:18, Kaitao Cheng wrote:
>>> 在 2026/6/10 16:07, Christian König 写道:
> 
> ...
> 
>>> Should we revert to v1, or keep list_for_each_entry() and
>>> list_for_each_entry_safe() as they are, close this thread, and make no
>>> changes?
>>>
>>> Link to v1:
>>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
>>>
>>> Or do you have any better suggestions?
>>
>> v1 looks perfectly reasonable to me.
> 
> But why not just hiding that once for all (in case they don't use the temporary
> iterator)? Easy to automate, robust — everyone is happy?

As far as I can see that is an extremely bad idea.

The distinction between the use cases of 'iterating the list' and 'iterating the list while you modify it' is completely intentional.

See the bool type can be implemented by int as well, but it is just a different use case.

Regards,
Christian.

> 
>> You should just include some patches in the same patch set to actually use
>> the new macros.
>>
>> If you modify the files under drivers/dma-buf or drivers/gpu/drm/amd to use
>> the new macro I'm happy to review that.
> 


^ permalink raw reply

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Andy Shevchenko @ 2026-06-11  7:52 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Christian König, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <83ba73d8-27d3-4ee9-a143-7dfe4cb827be@linux.dev>

On Thu, Jun 11, 2026 at 03:36:01PM +0800, Kaitao Cheng wrote:
> 在 2026/6/11 14:54, Andy Shevchenko 写道:
> > On Thu, Jun 11, 2026 at 12:42:02PM +0800, Kaitao Cheng wrote:
> >> 在 2026/6/10 22:43, Andy Shevchenko 写道:
> >>> On Wed, Jun 10, 2026 at 02:14:06PM +0800, Kaitao Cheng wrote:
> >>>> 在 2026/6/9 18:33, Christian König 写道:
> >>>>> On 6/9/26 08:13, Kaitao Cheng wrote:
> > 
> >>>>>> This series prepares for, and then updates, the list_for_each_entry()
> >>>>>> family so the common entry iterators cache their next or previous cursor
> >>>>>> before the loop body runs.
> >>>>>
> >>>>> Why in the world would we want to do that?
> >>>>>
> >>>>> The safe and non-safe variants have very distinct use cases and that is completely intentional.
> >>>>>
> >>>>> What we could improve maybe is the documentation, from my experience an astonishing large amount of people have misconceptions about the safe variants.
> >>>>>
> >>>>>> The first 13 patches open-code loops that intentionally depend on the
> >>>>>> old "derive the next entry from the current cursor at the end of the
> >>>>>> iteration" behaviour.  These loops append work to the list being walked,
> >>>>>> restart traversal after dropping a lock, skip an entry consumed by the
> >>>>>> current iteration, or otherwise adjust the cursor in the loop body.
> >>>>>
> >>>>> Well I have to clearly reject the changes for subsystems/components I'm maintaining, that just looks horrible to me and I clearly don't see a good reason for that.
> >>>>
> >>>> Hi Christian and Andy Shevchenko,
> >>>>
> >>>> Thanks for taking a look. I would like to clarify the point you raised.
> >>>>
> >>>> The reason I started looking at this is the original motivation behind
> >>>> the _safe() variants.  They exist because some users need to remove, move
> >>>> or otherwise consume the current entry while walking the list.  In that
> >>>> case the next cursor has to be preserved before the loop body can modify
> >>>> the current entry.
> >>>>
> >>>> The unfortunate part is that this could not be expressed with the
> >>>> existing list_for_each_entry() interface without changing its calling
> >>>> convention.  The _safe() variants had to grow an extra argument for the
> >>>> temporary cursor, and that is why we ended up with a separate family of
> >>>> macros.
> >>>>
> >>>> But conceptually, the distinction does not have to be exposed as two
> >>>> different iterator families forever.  The difference is an implementation
> >>>> detail: whether the iterator keeps the next/previous cursor before the
> >>>> body runs.  This series makes the common list_for_each_entry() iterators
> >>>> do that internally, so the safe and non-safe forms can effectively be
> >>>> folded together, or at least the need for a separate public _safe()
> >>>> interface becomes much weaker.
> >>>>
> >>>> There is also a usability issue with the current _safe() interface.  The
> >>>> caller is forced to define a temporary cursor outside the macro and pass
> >>>> it in, even though almost all users never use that cursor directly.  It is
> >>>> just boilerplate required by the macro implementation.  I find that
> >>>> redundant and awkward: the temporary cursor is an internal detail of the
> >>>> iteration, but every caller has to spell it out.
> >>>
> >>> Ah, I think the distinct macro families is that what we want.
> >>> But the hiding of the parameter can be done inside list_for_each_*_safe().
> >>> You can do a treewide change with coccinelle.
> >>>
> >>> Sorry if I didn't get the whole idea from your previous contributions.
> >>>
> >>> Note, even cases that would need a temporary cursor may be switched to
> >>> new list_for_each_*_safe(), see how PCI macros for iterating over resources
> >>> are implemented (include/linux/pci.h).
> >>
> >> Thanks for your suggestions. I've written a demo based on your feedback.
> >> Could you please review it and share your thoughts on this approach?
> > 
> > Have you checked how many users actually need the temporary storage?
> 
> In Muchun's reply, he mentioned the following:
> 
> There are 9,925 list_for_each_entry() call sites in total. Among them,
> 9,919 do not require any adaptation, and only 6 need to be refactored:
> 
> As for list_for_each_entry_safe(), there are 4,572 callers. 4,550 of them
> can be directly replaced by the new list_for_each_entry(), while 22 cannot
> be replaced
> 
> https://lore.kernel.org/all/2B3BFA1E-08B8-42AB-87D6-A28BF15E5C58@linux.dev/
> 
> I only used Coccinelle to scan for list_for_each_entry() call sites, and
> found the 13 call sites shown in the current patch series, which cover
> the 6 cases mentioned in Muchun's email. I have not yet run the Coccinelle
> scan for list_for_each_entry_safe().
> 
> If we need to handle all 9,925 list_for_each_entry() call sites or all 4,572
> list_for_each_entry_safe() call sites in one go, would such a change be too
> large? I expect it would affect almost every kernel subsystem.

If it's done by Linus himself during the day when he prepares -rc1, it's fine.
You would need to provide a good justification for the change, though.

But in the above statistics the 4572 vs 4550, so the first step is to investigate
why temporary cursor is used in those 22 cases and what we can do to avoid that.

> I wonder whether it would be better to first provide the necessary
> compatibility APIs, and then let each subsystem owner update their code as
> appropriate. That would make the impact more controlled, similar to how
> the current folio replacement of page is being handled.
> 
> >>>> With the updated list_for_each_entry() implementation, that extra cursor
> >>>> can be kept inside the iterator itself.  Callers that only want to walk
> >>>> the list, including callers that delete or consume the current entry, no
> >>>> longer need to carry an otherwise-unused temporary variable just to make
> >>>> the macro work.
> >>>>
> >>>>>> The final patch changes include/linux/list.h to keep a private cursor in
> >>>>>> the common entry iterators while preserving the public macro interface.
> >>>>>> The safe variants remain available when callers need the temporary
> >>>>>> cursor explicitly or have stronger mutation requirements.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* [PATCH RFC 1/1] block: serialize whole elevator change steps for the same queue
From: Shin'ichiro Kawasaki @ 2026-06-11  7:42 UTC (permalink / raw)
  To: linux-block, Jens Axboe; +Cc: Ming Lei, Nilay Shroff, Shin'ichiro Kawasaki
In-Reply-To: <20260611074200.474676-1-shinichiro.kawasaki@wdc.com>

When elevator_change() is called concurrently for the same queue, the
elevator_change_done() function runs concurrently as well. This function
adds or deletes kobjects for the debugfs entry of the queue. Then the
concurrent calls cause memory corruption of the kobjects and result in a
process hang. The core part of the elevator switch is protected by queue
freeze and q->elevator_lock. However, since the commit 559dc11143eb
("block: move elv_register[unregister]_queue out of elevator_lock"), the
elevator_change_done() is not serialized. Hence the memory corruption
and the hang.

The failures are observed when udev-worker writes to a sysfs
queue/scheduler attribute file while the blktests test case block/005
writes to the same attribute file. The failure also can be recreated by
running two processes that write to the same queue/scheduler file
concurrently. The failure is observed since another commit 370ac285f23a
("block: avoid cpu_hotplug_lock depedency on freeze_lock"). This commit
changed the behavior of queue freeze and it unveiled the failure.

Fix the failure by adding a new per-queue lock 'elevator_queue_lock',
which serializes the whole elevator switch steps for the same queue
including the elevator_change_done() call.

Fixes: 559dc11143eb ("block: move elv_register[unregister]_queue out of elevator_lock")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 block/blk-core.c       | 1 +
 block/elevator.c       | 9 +++++++++
 include/linux/blkdev.h | 7 +++++++
 3 files changed, 17 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d..c6418889897a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -430,6 +430,7 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
 	refcount_set(&q->refs, 1);
 	mutex_init(&q->debugfs_mutex);
 	mutex_init(&q->elevator_lock);
+	mutex_init(&q->elevator_queue_lock);
 	mutex_init(&q->sysfs_lock);
 	mutex_init(&q->limits_lock);
 	mutex_init(&q->rq_qos_mutex);
diff --git a/block/elevator.c b/block/elevator.c
index 3bcd37c2aa34..65bdea27aa8a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -665,6 +665,13 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 			return ret;
 	}

+	/*
+	 * Acquire elevator_queue_lock to serialize the debugfs (un)register
+	 * steps for the same queue. The elevator switch core part is protected
+	 * by queue freezing and ->elevator_lock.
+	 */
+	mutex_lock(&q->elevator_queue_lock);
+
 	memflags = blk_mq_freeze_queue(q);
 	/*
 	 * May be called before adding disk, when there isn't any FS I/O,
@@ -690,6 +697,8 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 	if (!ctx->new)
 		blk_mq_free_sched_res(&ctx->res, ctx->type, set);

+	mutex_unlock(&q->elevator_queue_lock);
+
 	return ret;
 }

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 890128cdea1c..cfeddd3ded95 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -606,6 +606,13 @@ struct request_queue {
 	 */
 	struct mutex		elevator_lock;

+	/*
+	 * Serializes the whole elevator change operation for the same queue,
+	 * including the debugfs (un)register steps. Must be acquired before
+	 * freezing the queue and acquiring elevator_lock.
+	 */
+	struct mutex		elevator_queue_lock;
+
 	struct mutex		sysfs_lock;
 	/*
 	 * Protects queue limits and also sysfs attribute read_ahead_kb.
-- 
2.54.0

^ permalink raw reply related

* [PATCH RFC 0/1] block: fix concurrent elevator change failure
From: Shin'ichiro Kawasaki @ 2026-06-11  7:41 UTC (permalink / raw)
  To: linux-block, Jens Axboe; +Cc: Ming Lei, Nilay Shroff, Shin'ichiro Kawasaki

I observed that the blktests test case block/005 hangs on a specific
server hardware using a specific HDD as a block device. During the test
case run, the kernel reported a KASAN null-ptr-deref (and other memory
corruption symptoms) [2]. This failure looked sporadic and hardware-
dependent.

From the kernel message, I noticed that udev-worker wrote to the
queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
The test case block/005 also wrote to the same sysfs attribute, which
indicated that a concurrent elevator change caused the failure. I
created a new blktests test case that simply does the concurrent
elevator change with a null_blk device [1]. It recreates the failure in
a stable manner on various server hardware.

Using the new test case, I bisected and found that the failure first
appears at the commit 370ac285f23a ("block: avoid cpu_hotplug_lock
depedency on freeze_lock") in the kernel tag v6.17-rc3. However, that
commit does not appear to explain the failure by itself: it changed the
queue freeze behavior and only unveiled a race, probably. Looking back
at the changes to elevator_change(), I think the actual cause is the
commit 559dc11143eb ("block: move elv_register[unregister]_queue out of
elevator_lock") in the kernel tag v6.16-rc1. This commit moved
elevator_change_done() out of the guard of ->elevator_lock and the queue
freeze. As a result, when two threads write to the same queue/scheduler
attribute concurrently, elevator_change_done() runs in parallel causing
the memory corruption and the hang.

As the fix attempt, I created the patch in this series. It adds a new
mutex that serializes the whole elevator switch sequence, including the
elevator_change_done() call. I ran the reproducer with lockdep enabled
and confirmed that the patch avoids the failure and new WARN was not
observed.

However, the fix patch adds a new lock, and I'm not sure if it is the best
solution. Comments on the patch, or suggestions for a better solution,
would be appreciated.

[1] https://github.com/kawasaki/blktests/commit/4f8c63ed7d049f5e9c935c3fe00142b2a3629826

[2]

[30102.760660] [ T186170] run blktests block/005 at 2026-05-11 05:53:53
[30104.969837] [ T186111] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
[30104.983590] [ T186111] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[30104.992929] [ T186111] CPU: 2 UID: 0 PID: 186111 Comm: (udev-worker) Not tainted 7.1.0-rc2-kts+ #1 PREEMPT(lazy)
[30105.004019] [ T186111] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
[30105.013216] [ T186111] RIP: 0010:blk_mq_debugfs_register_sched+0x46/0x210
[30105.020667] [ T186111] Code: 48 89 fa 48 c1 ea 03 48 83 ec 10 80 3c 02 00 0f 85 83 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 08 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 57 01 00 00 48 c7 c0 24 a3 b3 97 4
8 8b 6d 00 48
[30105.041036] [ T186111] RSP: 0018:ffff88816b9c7708 EFLAGS: 00010246
[30105.048111] [ T186111] RAX: dffffc0000000000 RBX: ffff888117f18000 RCX: 0000000000000000
[30105.057097] [ T186111] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888117f18008
[30105.066086] [ T186111] RBP: 0000000000000000 R08: ffffffff957c47ac R09: fffffbfff2f6633c
[30105.075083] [ T186111] R10: ffff88816b9c7730 R11: 0000000000000001 R12: ffff88814c1f2000
[30105.084088] [ T186111] R13: ffff88814c1f2018 R14: ffff8881b8a336ac R15: ffffffff95bfae30
[30105.093111] [ T186111] FS:  00007fc1c7970c40(0000) GS:ffff8887c534e000(0000) knlGS:0000000000000000
[30105.103093] [ T186111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30105.110751] [ T186111] CR2: 000055fa37e182c0 CR3: 0000000108350003 CR4: 00000000001726f0
[30105.119796] [ T186111] Call Trace:
[30105.124154] [ T186111]  <TASK>
[30105.128301] [ T186111]  blk_mq_sched_reg_debugfs+0x8d/0x1a0
[30105.134193] [ T186111]  elevator_change_done+0x2f2/0x610
[30105.140037] [ T186111]  ? __pfx_elevator_change_done+0x10/0x10
[30105.146409] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.152246] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.158189] [ T186111]  elevator_change+0x283/0x4f0
[30105.163342] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.168932] [ T186111]  elv_iosched_store+0x30c/0x3a0
[30105.174265] [ T186111]  ? __pfx_elv_iosched_store+0x10/0x10
[30105.180797] [ T186111]  ? lock_acquire.part.0+0xb8/0x230
[30105.187066] [ T186111]  ? kernfs_fop_write_iter+0x25b/0x5e0
[30105.193594] [ T186111]  ? lock_acquire.part.0+0xb8/0x230
[30105.199931] [ T186111]  ? lock_acquire+0x126/0x140
[30105.205683] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.211924] [ T186111]  queue_attr_store+0x23f/0x360
[30105.217796] [ T186111]  ? __pfx_queue_attr_store+0x10/0x10
[30105.224180] [ T186111]  ? __lock_acquire+0x55d/0xbd0
[30105.230049] [ T186111]  ? lock_acquire.part.0+0xb8/0x230
[30105.236247] [ T186111]  ? sysfs_file_kobj+0x1d/0x1b0
[30105.242093] [ T186111]  ? find_held_lock+0x2b/0x80
[30105.247763] [ T186111]  ? __lock_release.isra.0+0x59/0x170
[30105.254122] [ T186111]  ? lock_release.part.0+0x1c/0x50
[30105.260226] [ T186111]  ? sysfs_file_kobj+0xb9/0x1b0
[30105.266048] [ T186111]  ? sysfs_kf_write+0x65/0x170
[30105.271778] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.277934] [ T186111]  kernfs_fop_write_iter+0x3da/0x5e0
[30105.284173] [ T186111]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[30105.290926] [ T186111]  vfs_write+0x524/0x1010
[30105.296215] [ T186111]  ? __pfx_vfs_write+0x10/0x10
[30105.301905] [ T186111]  ? kasan_quarantine_put+0xf5/0x240
[30105.308092] [ T186111]  ? kasan_quarantine_put+0xf5/0x240
[30105.314246] [ T186111]  ksys_write+0xff/0x200
[30105.319331] [ T186111]  ? __pfx_ksys_write+0x10/0x10
[30105.325007] [ T186111]  do_syscall_64+0xf4/0x1550
[30105.330407] [ T186111]  ? __pfx___x64_sys_openat+0x10/0x10
[30105.336566] [ T186111]  ? seccomp_run_filters+0xeb/0x560
[30105.342517] [ T186111]  ? do_syscall_64+0x1d7/0x1550
[30105.348096] [ T186111]  ? __seccomp_filter+0xa2/0x920
[30105.353749] [ T186111]  ? __pfx___seccomp_filter+0x10/0x10
[30105.359830] [ T186111]  ? trace_hardirqs_on_prepare+0x150/0x1a0
[30105.366344] [ T186111]  ? do_syscall_64+0x1b9/0x1550
[30105.371892] [ T186111]  ? do_syscall_64+0x1d7/0x1550
[30105.377422] [ T186111]  ? do_syscall_64+0x1d7/0x1550
[30105.382922] [ T186111]  ? do_syscall_64+0x1b9/0x1550
[30105.388401] [ T186111]  ? do_syscall_64+0x34/0x1550
[30105.393777] [ T186111]  ? do_syscall_64+0xab/0x1550
[30105.399129] [ T186111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[30105.405624] [ T186111] RIP: 0033:0x7fc1c7c4fbbe
[30105.410647] [ T186111] Code: 4d 89 d8 e8 34 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa
[30105.431611] [ T186111] RSP: 002b:00007ffefd3bdd90 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[30105.440716] [ T186111] RAX: ffffffffffffffda RBX: 000055fa3f0f4b80 RCX: 00007fc1c7c4fbbe
[30105.449404] [ T186111] RDX: 000000000000000b RSI: 000055fa3ed9d550 RDI: 0000000000000015
[30105.458090] [ T186111] RBP: 00007ffefd3bdda0 R08: 0000000000000000 R09: 0000000000000000
[30105.466787] [ T186111] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000b
[30105.475479] [ T186111] R13: 000000000000000b R14: 000055fa3ed9d550 R15: 000055fa3ed9d550
[30105.484182] [ T186111]  </TASK>
[30105.487920] [ T186111] Modules linked in: iscsi_target_mod tcm_loop target_core_pscsi target_core_file target_core_iblock xfs nft_masq nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc target_core_user target_core_mod rfkill nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security nf_tables ip6table_filter ip6_tables iptable_filter ip_tables qrtr intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt intel_pmc_bxt kvm_intel kvm irqbypass rapl sunrpc intel_cstate intel_uncore pcspkr i2c_i801 i2c_smbus mei_me igb lpc_ich mei ioatdma dca wmi binfmt_misc joydev acpi_power_meter acpi_pad btrfs raid6_pq xor ses enclosure loop dm_multipath nfnetlink zram lz4hc_compress lz4_compress
[30105.488278] [ T186111]  zstd_compress ast drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper mpt3sas drm mpi3mr raid_class scsi_transport_sas scsi_dh_rdac scsi_dh_emc scsi_dh_alua i2c_dev fuse [last unloaded: zonefs]
[30105.609649] [ T186111] ---[ end trace 0000000000000000 ]---
[30105.648290] [ T186111] pstore: backend (erst) writing error (-28)
[30105.654739] [ T186111] RIP: 0010:blk_mq_debugfs_register_sched+0x46/0x210
[30105.662519] [ T186111] Code: 48 89 fa 48 c1 ea 03 48 83 ec 10 80 3c 02 00 0f 85 83 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 08 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 57 01 00 00 48 c7 c0 24 a3 b3 97 48 8b 6d 00 48
[30105.683653] [ T186111] RSP: 0018:ffff88816b9c7708 EFLAGS: 00010246
[30105.691248] [ T186111] RAX: dffffc0000000000 RBX: ffff888117f18000 RCX: 0000000000000000
[30105.700121] [ T186111] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888117f18008
[30105.708841] [ T186111] RBP: 0000000000000000 R08: ffffffff957c47ac R09: fffffbfff2f6633c
[30105.717829] [ T186111] R10: ffff88816b9c7730 R11: 0000000000000001 R12: ffff88814c1f2000
[30105.726550] [ T186111] R13: ffff88814c1f2018 R14: ffff8881b8a336ac R15: ffffffff95bfae30
[30105.735306] [ T186111] FS:  00007fc1c7970c40(0000) GS:ffff8887c54ce000(0000) knlGS:0000000000000000
[30105.745003] [ T186111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30105.752368] [ T186111] CR2: 00007f251f9bc0e8 CR3: 0000000108350002 CR4: 00000000001726f0


Shin'ichiro Kawasaki (1):
  block: serialize whole elevator change steps for the same queue

 block/blk-core.c       | 1 +
 block/elevator.c       | 9 +++++++++
 include/linux/blkdev.h | 7 +++++++
 3 files changed, 17 insertions(+)

-- 
2.54.0


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox