* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Christoph Hellwig @ 2026-05-25 5:24 UTC (permalink / raw)
To: Tal Zussman
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig, Dave Chinner, Bart Van Assche, linux-block,
linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
linux-rt-devel
In-Reply-To: <ea6fc01f-5cb7-4a04-9f92-bbd2791fea51@columbia.edu>
[adding the PREEMPT-RT maintainers and list for one and a half questions
for them a bit below]
On Fri, May 22, 2026 at 07:09:59PM -0400, Tal Zussman wrote:
> > + while ((bio = bio_list_pop(&list)))
> > + bio->bi_end_io(bio);
>
> Q: "Does this code handle bios that do not have a completion callback?
> If a bio is submitted with the BIO_COMPLETE_IN_TASK flag but has a NULL
> bi_end_io pointer (like some flush or discard bios), executing
> bio->bi_end_io(bio) directly will cause a NULL pointer dereference.
> The synchronous completion path in bio_endio() checks if bio->bi_end_io is
> non-NULL before calling it. Should a similar check be added here?"
>
> A: Every bio that reaches this point should have bi_end_io set. There's no
> point in deferring a bio without a completion handler.
Yes. The only bios without bi_end_io that can end up in bio_endio
are those attached to passthrough requests, and deferring them of
course doesn't make sense.
> Q: "Could this loop also monopolize the CPU and cause latency spikes?
> Because the work uses a 1-jiffy delay, tens of thousands of bios might
> accumulate under heavy I/O load. Processing the entire batch sequentially
> without yielding or calling cond_resched() inside the inner loop could
> lead to scheduler latency or watchdog warnings on non-preemptible kernels."
>
> A: With PREEMPT_LAZY as the default, the worker is preemptible, as per
> Hillf and Christoph, so this is unnecessary.
Yes. Sashiko tends to spew this a lot, which is odd as there were clear
recommendations to only do this if absolutely needed even before
PREEMPT_LAZY.
> > +
> > + local_lock_irqsave(&bio_complete_batch.lock, flags);
>
> Q: "Is it safe to use local_lock_irqsave() here when called from an atomic
> context?
> On CONFIG_PREEMPT_RT kernels, local_lock_t maps to a per-CPU spinlock_t,
> which is a sleepable rt_mutex. Since __bio_complete_in_task() is specifically
> called when bio_in_atomic() is true (which includes hardware interrupts or
> execution under a raw_spinlock_t), attempting to acquire a sleepable lock
> here would trigger an "Invalid wait context" lockdep warning.
> Would a lockless list (llist) be more appropriate here to avoid sleeping
> in atomic contexts?"
>
> A: This seems legit, but I'm not super familiar with PREEMPT_RT. I don't want
> to switch to raw_spinlock_t, as it seems like that would add unnecessary
> overhead on non-PREEMPT_RT kernels. I think switching to use local_irq_save()
> (as is done for the per-CPU bio allocation cache) should work.
Adding the PREEMPT_RT maintainers for this as it is above my pay grade.
> Q: "Does creating this workqueue with WQ_MEM_RECLAIM break the local_lock_irq()
> protection in bio_complete_work_fn()?
> When a workqueue has WQ_MEM_RECLAIM, it spawns a global rescuer thread that
> can execute per-CPU work items during worker pool congestion. This rescuer
> thread executes unbound, meaning it could run on CPU B while processing
> CPU A's work item.
> Since local_lock operates strictly on the currently executing CPU, the
> rescuer thread on CPU B would acquire CPU B's lock, while popping elements
> from CPU A's list (derived via container_of()).
> If an interrupt on CPU A concurrently calls __bio_complete_in_task(),
> it will acquire CPU A's lock and modify the same list without mutual
> exclusion, potentially causing list corruption."
>
> A: The rescuer should run on the same CPU, not unbound, so this is not an
> issue.
This is another area where the PREEMPT_RT/scheduler folks might be able
to help.
> static inline bool bio_complete_in_task(struct bio *bio)
> {
> if (bio_flagged(bio, BIO_COMPLETE_IN_TASK))
> return false;
> if (!bio_in_atomic())
> return false;
> bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
> __bio_complete_in_task(bio);
> return true;
> }
>
> We can use the BIO_COMPLETE_IN_TASK flag to indicate that it's already
> been deferred to the workqueue and is safe to run.
Would be nice to avoid this, but yes.
^ permalink raw reply
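The flag check in the quoted bio_complete_in_task() can be traced with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `struct bio_model`, the `in_atomic` parameter and `deferred_count` are hypothetical stand-ins for struct bio, bio_in_atomic() and the per-CPU batch list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit; stands in for BIO_COMPLETE_IN_TASK. */
#define MODEL_COMPLETE_IN_TASK (1u << 0)

/* Hypothetical stand-in for struct bio; only the flags matter here. */
struct bio_model {
	unsigned int flags;
};

/* Counts how many bios were handed to the (modeled) worker. */
static unsigned int deferred_count;

/*
 * Returns true if completion was deferred to task context, false if the
 * caller should complete the bio right away.
 */
static bool model_complete_in_task(struct bio_model *bio, bool in_atomic)
{
	assert(bio);

	/* Already deferred once: the worker is running it, complete now. */
	if (bio->flags & MODEL_COMPLETE_IN_TASK)
		return false;
	/* Already in task context: nothing to defer. */
	if (!in_atomic)
		return false;

	bio->flags |= MODEL_COMPLETE_IN_TASK;
	deferred_count++;	/* models __bio_complete_in_task() queueing */
	return true;
}
```

In this model, the first call from atomic context defers the bio and sets the flag. A second call on the same bio returns false, which is how the flag marks a bio as already deferred and safe to run.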
* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Christoph Hellwig @ 2026-05-25 5:17 UTC (permalink / raw)
To: Tal Zussman
Cc: Christoph Hellwig, Jens Axboe, Matthew Wilcox (Oracle),
Christian Brauner, Darrick J. Wong, Carlos Maiolino,
Alexander Viro, Jan Kara, Dave Chinner, Bart Van Assche,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm,
Gao Xiang
In-Reply-To: <ea1fa305-3ba2-4cfd-b7cb-86875032a300@columbia.edu>
On Fri, May 22, 2026 at 06:47:43PM -0400, Tal Zussman wrote:
> > But this 1-jiffy delay also means we unconditionally increase
> > completion latency, which feels like a bad idea. Do you have any
> > measurements that show where it does benefit? Note that queueing work
> > already often has very measurable latency on its own. This also
> > directly contradicts the erofs experience that even went to an RT
> > thread to reduce the latency.
>
> I added this per Dave's feedback on v4, where he noted that XFS inodegc
> uses a delayed work item to avoid context switch storms. There's only a
> delay for the first bio in a batch to complete, as we only delay when the
> list is empty. I'll run some experiments and measure context switches,
> completion latency, etc. to see if this is necessary.
The difference is that XFS inodegc is not latency bound. Most of the
time no one cares if it is delayed a bit, and in the cases where someone
does care we explicitly flush the queues. I/O completion, on the other
hand, is something where users very much care about latency.
^ permalink raw reply
* [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()
From: Tetsuo Handa @ 2026-05-25 3:40 UTC (permalink / raw)
To: Ming Lei, Jens Axboe
Cc: Bart Van Assche, Christoph Hellwig, Damien Le Moal, linux-block,
LKML, Andrew Morton
In-Reply-To: <ag1223nAa0wZ8ALC@fedora>
Some commit merged during the 7.1 merge window broke the loop driver: it
introduced a race window in which lo_release() clears the backing file via
__loop_clr_fd() while I/O requests are still pending [1][2]. The exact
commit that changed the behavior is not known, because there is no
reproducer and the failure is timing dependent, but it seems that we need
to solve this problem in the loop driver even though the loop driver itself
did not change during this merge window.
To close this race, try to flush pending I/O requests. However, calling
drain_workqueue() from __loop_clr_fd() with disk->open_mutex held causes
lockdep warnings [3][4]. We need to flush pending I/O requests without
disk->open_mutex held.
In the past, commit 322c4293ecc5 ("loop: make autoclear operation
asynchronous") tried to defer __loop_clr_fd() to WQ context. But it was
reverted by commit bf23747ee053 ("loop: revert "make autoclear operation
asynchronous"") because userspace might expect that fput() on the
backing file is processed before lo_release() from close() returns to user
mode.
Therefore, this patch tries to defer __loop_clr_fd() to task work context.
__loop_clr_fd() is split into three steps:
Step 1: Flush pending I/O requests without holding disk->open_mutex.
Step 2: Do what __loop_clr_fd() from lo_release() was doing with
disk->open_mutex held.
Step 3: Drop refcounts without holding disk->open_mutex.
A potential side effect of this approach is that a userspace program that
issues an open() request before __loop_clr_fd() completes might be confused
by observing -ENXIO, because lo_open() can now be called before
__loop_clr_fd() completes.
Except for the side effect above, I expect this patch to work for the
following reasons:
- The existing Lo_rundown state safely guarantees that any subsequent
lo_open() attempts will immediately fail with -ENXIO, preventing races
even after disk->open_mutex is temporarily released.
- Since returning from lo_release() normally allows the block layer to
immediately drop module and device references, this patch explicitly
increments the refcounts (__module_get() and get_device()) before
deferring the work, and safely releases them at the end of Step 3
inside __loop_clr_fd().
- It prefers task_work so that userspace processes expecting immediate
completion (such as fput() side effects) get deterministic
behavior before returning from close(). It falls back to schedule_work()
if the current context is a kernel thread (PF_KTHREAD) or if
task_work_add() fails.
Link: https://syzkaller.appspot.com/bug?extid=cd8a9a308e879a4e2c28 [1]
Link: https://syzkaller.appspot.com/bug?extid=bc273027d5643e48e5b3 [2]
Link: https://syzkaller.appspot.com/bug?extid=2f62807dc3239b8f584e [3]
Link: https://syzkaller.appspot.com/bug?extid=c4e9d077bcc86bee08dc [4]
Analyzed-by: AI Mode in Google Search (no mail address)
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
drivers/block/loop.c | 86 ++++++++++++++++++++++++++++++++++++--------
kernel/task_work.c | 1 +
2 files changed, 73 insertions(+), 14 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0000913f7efc..d97aa2c209e3 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -36,6 +36,7 @@
#include <linux/blk-mq.h>
#include <linux/spinlock.h>
#include <uapi/linux/loop.h>
+#include <linux/task_work.h>
/* Possible states of device */
enum {
@@ -74,6 +75,10 @@ struct loop_device {
struct gendisk *lo_disk;
struct mutex lo_mutex;
bool idr_visible;
+ union {
+ struct callback_head lo_clr_task_work;
+ struct work_struct lo_clr_work;
+ };
};
struct loop_cmd {
@@ -1112,12 +1117,34 @@ static int loop_configure(struct loop_device *lo, blk_mode_t mode,
return error;
}
-static void __loop_clr_fd(struct loop_device *lo)
+static void __loop_clr_fd(struct callback_head *callback)
{
+ struct loop_device *lo = container_of(callback, struct loop_device, lo_clr_task_work);
struct queue_limits lim;
struct file *filp;
gfp_t gfp = lo->old_gfp_mask;
+ /* Step 1: Flush all outstanding I/O, without open_mutex held. */
+
+ /*
+ * Now that loop_queue_rq() sees lo->lo_state != Lo_bound,
+ * wait for already started loop_queue_rq() to complete.
+ */
+ synchronize_rcu();
+ /*
+ * Now that no more works are scheduled by loop_queue_rq(),
+ * wait for already scheduled works to complete.
+ */
+ drain_workqueue(lo->workqueue);
+ /*
+ * Now that no more AIO requests are scheduled by lo_rw_aio(),
+ * wait for already started AIO to complete.
+ */
+ blk_mq_unfreeze_queue(lo->lo_queue, blk_mq_freeze_queue(lo->lo_queue));
+
+ /* Step 2: Perform remaining cleanup, with open_mutex held. */
+ mutex_lock(&lo->lo_disk->open_mutex);
+
spin_lock_irq(&lo->lo_lock);
filp = lo->lo_backing_file;
lo->lo_backing_file = NULL;
@@ -1128,12 +1155,7 @@ static void __loop_clr_fd(struct loop_device *lo)
lo->lo_sizelimit = 0;
memset(lo->lo_file_name, 0, LO_NAME_SIZE);
- /*
- * Reset the block size to the default.
- *
- * No queue freezing needed because this is called from the final
- * ->release call only, so there can't be any outstanding I/O.
- */
+ /* Reset the block size to the default. */
lim = queue_limits_start_update(lo->lo_queue);
lim.logical_block_size = SECTOR_SIZE;
lim.physical_block_size = SECTOR_SIZE;
@@ -1145,8 +1167,6 @@ static void __loop_clr_fd(struct loop_device *lo)
/* let user-space know about this change */
kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
mapping_set_gfp_mask(filp->f_mapping, gfp);
- /* This is safe: open() is still holding a reference. */
- module_put(THIS_MODULE);
disk_force_media_change(lo->lo_disk);
@@ -1154,9 +1174,6 @@ static void __loop_clr_fd(struct loop_device *lo)
int err;
/*
- * open_mutex has been held already in release path, so don't
- * acquire it if this function is called in such case.
- *
* If the reread partition isn't from release path, lo_refcnt
* must be at least one and it can only become zero when the
* current holder is released.
@@ -1181,12 +1198,31 @@ static void __loop_clr_fd(struct loop_device *lo)
WRITE_ONCE(lo->lo_state, Lo_unbound);
mutex_unlock(&lo->lo_mutex);
+ /* Step 3: Drop refcounts, without open_mutex held. */
+ mutex_unlock(&lo->lo_disk->open_mutex);
+
/*
* Need not hold lo_mutex to fput backing file. Calling fput holding
* lo_mutex triggers a circular lock dependency possibility warning as
* fput can take open_mutex which is usually taken before lo_mutex.
*/
fput(filp);
+
+ /*
+ * Drop all references that would have been dropped as soon as
+ * returning from lo_release() and releasing disk->open_mutex.
+ */
+ module_put(lo->lo_disk->fops->owner);
+ put_device(disk_to_dev(lo->lo_disk));
+
+ module_put(THIS_MODULE);
+}
+
+static void loop_clr_work(struct work_struct *work)
+{
+ struct loop_device *lo = container_of(work, struct loop_device, lo_clr_work);
+
+ __loop_clr_fd(&lo->lo_clr_task_work);
}
static int loop_clr_fd(struct loop_device *lo)
@@ -1747,8 +1783,30 @@ static void lo_release(struct gendisk *disk)
need_clear = (lo->lo_state == Lo_rundown);
mutex_unlock(&lo->lo_mutex);
- if (need_clear)
- __loop_clr_fd(lo);
+ /*
+ * In order to flush pending I/O requests before clearing the backing device,
+ * defer __loop_clr_fd() to task work context or normal workqueue context.
+ * The Lo_rundown state guarantees that lo_open() will fail with -ENXIO.
+ */
+ if (need_clear) {
+ /*
+ * Grab all references that will be dropped as soon as returning from
+ * lo_release() and releasing disk->open_mutex.
+ */
+ get_device(disk_to_dev(disk));
+ __module_get(disk->fops->owner);
+ /*
+ * Prefer task work, for userspace might be expecting that fput()
+ * on the backing file is processed before lo_release() from close()
+ * returns to user mode.
+ */
+ init_task_work(&lo->lo_clr_task_work, __loop_clr_fd);
+ if ((current->flags & PF_KTHREAD) ||
+ task_work_add(current, &lo->lo_clr_task_work, TWA_RESUME)) {
+ INIT_WORK(&lo->lo_clr_work, loop_clr_work);
+ schedule_work(&lo->lo_clr_work);
+ }
+ }
}
static void lo_free_disk(struct gendisk *disk)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 0f7519f8e7c9..45fd146b85df 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -102,6 +102,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
return 0;
}
+EXPORT_SYMBOL_GPL(task_work_add);
/**
* task_work_cancel_match - cancel a pending work added by task_work_add()
--
2.54.0
^ permalink raw reply related
* [PATCH v6] block: propagate in_flight to whole disk on partition I/O
From: Tang Yizhou @ 2026-05-25 2:19 UTC (permalink / raw)
To: axboe, hch, kbusch
Cc: yukuai, linux-block, linux-kernel, Tang Yizhou, Leon Hwang
From: Tang Yizhou <yizhou.tang@shopee.com>
Currently, when I/O is submitted to a partition, the per-CPU in_flight[]
counter is incremented only on the partition's block_device, not on the
underlying whole disk. This leads to a problem that can be shown with a
fio test:
  lsblk
  NAME     MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
  mydev    252:1    0  20G  0 disk
  └─mydev1 259:0    0  10G  0 part
  iostat -xp 1
  Device        r/s     rkB/s ... aqu-sz  %util
  mydev   128153.00 512612.00 ...  13.22  72.20
  mydev1  128154.00 512616.00 ...  13.22 100.00
%util is different between mydev and mydev1, which is unexpected.
This is the cumulative effect of a series of patches. The root cause is
commit e016b78201a2 ("block: return just one value from part_in_flight"),
which deleted the branch in part_in_flight() that aggregated the whole-disk
in_flight count on top of the partition's. Then the second commit is
commit 10ec5e86f9b8 ("block: merge part_{inc,dev}_in_flight into their
only callers"), which folded the whole-disk in_flight accounting into
generic_start_io_acct() and generic_end_io_acct(). Those two helpers
were then removed by commit e722fff238bb ("block: remove
generic_{start,end}_io_acct"), and from that point on the whole disk's
in_flight is no longer accounted at all.
In update_io_ticks(), if bdev_count_inflight() finds that the in_flight
value of the whole device is 0, the accumulation of io_ticks is
skipped, causing the reported %util value to be underestimated.
Fix it by restoring the whole-disk in_flight accounting.
Fixes: e016b78201a2 ("block: return just one value from part_in_flight")
Suggested-by: Leon Hwang <leon.huangfu@shopee.com>
Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
v2: Update commit message.
v3: Take Christoph's advice and factor the common code into two helpers.
v4: Remove my redundant new line in blk.h. Add Christoph's Reviewed-by
tag.
v5: Remove the changelog from the commit message.
v6: Accept Keith's suggestion and fix the bug in bdev_end_io_acct().
block/blk-core.c | 4 ++--
block/blk-mq.c | 5 ++---
block/blk.h | 21 +++++++++++++++++++++
3 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d..cee4e4a37503 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1042,7 +1042,7 @@ unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
{
part_stat_lock();
update_io_ticks(bdev, start_time, false);
- part_stat_local_inc(bdev, in_flight[op_is_write(op)]);
+ bdev_inc_in_flight(bdev, op);
part_stat_unlock();
return start_time;
@@ -1073,7 +1073,7 @@ void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
part_stat_inc(bdev, ios[sgrp]);
part_stat_add(bdev, sectors[sgrp], sectors);
part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
- part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
+ bdev_dec_in_flight(bdev, op);
part_stat_unlock();
}
EXPORT_SYMBOL(bdev_end_io_acct);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d0c37daf568f..6bdfe642bd93 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1082,8 +1082,7 @@ static inline void blk_account_io_done(struct request *req, u64 now)
update_io_ticks(req->part, jiffies, true);
part_stat_inc(req->part, ios[sgrp]);
part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
- part_stat_local_dec(req->part,
- in_flight[op_is_write(req_op(req))]);
+ bdev_dec_in_flight(req->part, req_op(req));
part_stat_unlock();
}
}
@@ -1143,7 +1142,7 @@ static inline void blk_account_io_start(struct request *req)
part_stat_lock();
update_io_ticks(req->part, jiffies, false);
- part_stat_local_inc(req->part, in_flight[op_is_write(req_op(req))]);
+ bdev_inc_in_flight(req->part, req_op(req));
part_stat_unlock();
}
diff --git a/block/blk.h b/block/blk.h
index b998a7761faf..11245a494c43 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -4,6 +4,7 @@
#include <linux/bio-integrity.h>
#include <linux/blk-crypto.h>
+#include <linux/part_stat.h>
#include <linux/lockdep.h>
#include <linux/memblock.h> /* for max_pfn/max_low_pfn */
#include <linux/sched/sysctl.h>
@@ -485,6 +486,26 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
q->last_merge = NULL;
}
+static inline void bdev_inc_in_flight(struct block_device *bdev,
+ enum req_op op)
+{
+ bool rw = op_is_write(op);
+
+ part_stat_local_inc(bdev, in_flight[rw]);
+ if (bdev_is_partition(bdev))
+ part_stat_local_inc(bdev_whole(bdev), in_flight[rw]);
+}
+
+static inline void bdev_dec_in_flight(struct block_device *bdev,
+ enum req_op op)
+{
+ bool rw = op_is_write(op);
+
+ part_stat_local_dec(bdev, in_flight[rw]);
+ if (bdev_is_partition(bdev))
+ part_stat_local_dec(bdev_whole(bdev), in_flight[rw]);
+}
+
/*
* Internal io_context interface
*/
--
2.43.0
^ permalink raw reply related
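The accounting change in the v6 patch above can be modeled in userspace. The following is a minimal sketch, not kernel code: `struct bdev_model`, `model_inc_in_flight()`, `model_dec_in_flight()` and `model_busy()` are hypothetical names, plain integers stand in for the per-CPU in_flight[] counters, and `model_busy()` only approximates the bdev_count_inflight() check that update_io_ticks() relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct block_device; not the kernel layout. */
struct bdev_model {
	struct bdev_model *whole;	/* NULL for a whole disk, else its disk */
	unsigned int in_flight[2];	/* [0] = reads, [1] = writes */
};

/* Same shape as bdev_inc_in_flight() in the patch. */
static void model_inc_in_flight(struct bdev_model *bdev, bool is_write)
{
	bdev->in_flight[is_write]++;
	if (bdev->whole)		/* partition: also account the disk */
		bdev->whole->in_flight[is_write]++;
}

/* Same shape as bdev_dec_in_flight() in the patch. */
static void model_dec_in_flight(struct bdev_model *bdev, bool is_write)
{
	assert(bdev->in_flight[is_write] > 0);
	bdev->in_flight[is_write]--;
	if (bdev->whole) {
		assert(bdev->whole->in_flight[is_write] > 0);
		bdev->whole->in_flight[is_write]--;
	}
}

/* Assumed simplification of the "is any I/O in flight?" check. */
static bool model_busy(const struct bdev_model *bdev)
{
	return bdev->in_flight[0] + bdev->in_flight[1] > 0;
}
```

With a partition whose `whole` pointer refers to its disk, a single `model_inc_in_flight()` on the partition makes `model_busy()` true for both objects. Without the propagation, the disk would stay idle in this model, which corresponds to the skipped io_ticks accumulation described in the commit message.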
* Re: [PATCH v5] block: propagate in_flight to whole disk on partition I/O
From: Yizhou Tang @ 2026-05-25 2:07 UTC (permalink / raw)
To: Keith Busch
Cc: Tang Yizhou, axboe, hch, yukuai, linux-block, linux-kernel,
Leon Hwang
In-Reply-To: <ahBnKR-IunwxVDzg@kbusch-mbp>
On Fri, May 22, 2026 at 10:53 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 10:16:38PM +0800, Tang Yizhou wrote:
> > @@ -1073,7 +1073,7 @@ void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
> > part_stat_inc(bdev, ios[sgrp]);
> > part_stat_add(bdev, sectors[sgrp], sectors);
> > part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
> > - part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
> > + bdev_inc_in_flight(bdev, op);
>
> This one should be bdev_dec_in_flight().
Thanks for pointing that out.
Best regards,
Yi
>
^ permalink raw reply
* [PATCH v9] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-05-25 0:51 UTC (permalink / raw)
To: axboe, rostedt, mhiramat, mathieu.desnoyers
Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
john.g.garry, loberman, neelx, sean, mproche, chjohnst,
linux-block, linux-kernel, linux-trace-kernel
In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) sharing the same
blk_mq_tag_set are starved of hardware tags.
Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptibly via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.
This patch introduces the block_rq_tag_wait tracepoint in the tag
allocation slow-path. It triggers immediately before the task state
is altered to TASK_UNINTERRUPTIBLE (ensuring safety for PREEMPT_RT
locks). It exposes the exact hardware context (hctx) that is starved,
the specific pool experiencing starvation (driver, software scheduler,
or reserved), and the exact pool depth.
This provides storage engineers with a zero-configuration, low-overhead
mechanism to definitively identify shared-tag bottlenecks. For example,
userspace can trivially replicate tag starvation counters using bpftrace:
# bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
Attaching 1 probe...
^C
@tag_waits[4]: 12
@tag_waits[12]: 87
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
Hi Johannes, Damien, Chaitanya, Laurence,
I have dropped the earlier "Reviewed-by:" and "Tested-by:" tags because
of functional logic changes in the tracepoint assignment block. A fresh
review would be highly appreciated. Thank you.
Changes since v8 [1]:
- Fixed the standard pool depth calculation in TP_fast_assign to
accurately report the unreserved capacity by subtracting
nr_reserved_tags from nr_tags
- Removed "Reviewed-by:" and "Tested-by:" tags due to the functional
logic updates in the tracepoint assignment block
Changes since v7 [2]:
- Added an is_reserved boolean to the trace record to explicitly expose
reserved pool starvation to userspace
- Fixed TP_fast_assign to report the correct nr_reserved_tags depth
when I/O schedulers utilise the reserved pool
Changes since v6 [3]:
- Dropped Patch 2. Observability is now driven entirely by the tracepoint,
with the commit message updated to demonstrate how userspace (e.g.,
bpftrace) can safely replicate counting out-of-band (Jens Axboe)
- Moved tracepoint call above sbitmap_prepare_to_wait(). This prevents
inadvertently resetting the task state under PREEMPT_RT locks
- Updated the tracepoint signature and TP_fast_assign block to evaluate
the allocation flags. If the submitting context is starved of a reserved
tag (BLK_MQ_REQ_RESERVED), the tracepoint now accurately reports the
severely constrained nr_reserved_tags depth instead of the total nr_tags
depth.
Changes since v5 [4]:
- Replaced this_cpu_inc() with raw_cpu_inc() within
blk_mq_debugfs_inc_wait_tags(). This resolves a preemption warning
triggered under CONFIG_DEBUG_PREEMPT=y, as the routine is invoked from a
preemptible context immediately prior to io_schedule(). This adjustment
deliberately prioritises the reduction of execution overhead over
absolute statistical precision for this diagnostic interface.
Changes since v4 [5]:
- Prevented a NULL pointer dereference in the tracepoint fast-assign for
disk-less request queues by safely checking q->disk before resolving the
dev_t
- Fixed a Use-After-Free (UAF) and permanent memory leak by decoupling
the per-CPU counter allocation from the volatile debugfs lifecycle and
tying it directly to the core hctx lifecycle (i.e., blk_mq_init_hctx()
and blk_mq_exit_hctx())
- Fixed a potential compiler double-fetch bug by wrapping the per-CPU
pointer evaluations with READ_ONCE() in blk_mq_debugfs_inc_wait_tags()
- Passed the appropriate gfp_t flags down to the allocation routines to
maintain the strict GFP_NOIO context
- Updated kernel-doc descriptions to clarify that the NULL pointer
checks guard against memory allocation failures under pressure, rather
than initialisation race conditions
Changes since v3 [6]:
- Transitioned tracking architecture from shared atomic_t variables to
dynamically allocated per-CPU counters to resolve cache line bouncing
(Bart Van Assche)
Changes since v2 [7]:
- Added "Reviewed-by:" and "Tested-by:" tags for patch 1
- Evaluate is_sched_tag directly within TP_fast_assign (Steven Rostedt)
- Introduced atomic counters via debugfs
Changes since v1 [8]:
- Improved the description of the trace point (Damien Le Moal)
- Removed the redundant "active requests" (Laurence Oberman)
- Introduced pool-specific starvation tracking
[1]: https://lore.kernel.org/lkml/20260524014204.622699-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260523200942.587199-1-atomlin@atomlin.com/
[3]: https://lore.kernel.org/lkml/20260517213614.350367-1-atomlin@atomlin.com/
[4]: https://lore.kernel.org/lkml/20260427020142.358912-1-atomlin@atomlin.com/
[5]: https://lore.kernel.org/lkml/20260419023036.1419514-1-atomlin@atomlin.com/
[6]: https://lore.kernel.org/lkml/20260319221956.332770-1-atomlin@atomlin.com/
[7]: https://lore.kernel.org/lkml/20260319015300.287653-1-atomlin@atomlin.com/
[8]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/
---
block/blk-mq-tag.c | 6 ++++
include/trace/events/block.h | 59 ++++++++++++++++++++++++++++++++++++
2 files changed, 65 insertions(+)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..35deee5bbc73 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -13,6 +13,7 @@
#include <linux/kmemleak.h>
#include <linux/delay.h>
+#include <trace/events/block.h>
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-sched.h"
@@ -181,6 +182,11 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
if (tag != BLK_MQ_NO_TAG)
break;
+ /* Log the starvation event before altering task state */
+ trace_block_rq_tag_wait(data->q, data->hctx,
+ data->rq_flags & RQF_SCHED_TAGS,
+ data->flags);
+
sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
tag = __blk_mq_get_tag(data, bt);
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..9c97a16850b9 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -226,6 +226,65 @@ DECLARE_EVENT_CLASS(block_rq,
IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
);
+/**
+ * block_rq_tag_wait - triggered when a request is starved of a tag
+ * @q: request queue of the target device
+ * @hctx: hardware context of the request experiencing starvation
+ * @is_sched_tag: indicates whether the starved pool is the software scheduler
+ * @alloc_flags: allocation flags dictating the specific tag pool
+ *
+ * Called immediately before the submitting context is forced to block due
+ * to the exhaustion of available tags (i.e., physical hardware driver
+ * tags, software scheduler tags, or reserved tags). This trace point
+ * indicates that the context will be placed into an uninterruptible state
+ * via sbitmap_prepare_to_wait(). If a tag is not acquired in the final
+ * lockless retry, the context will yield the CPU via io_schedule() until
+ * an active request completes and relinquishes its assigned tag.
+ */
+TRACE_EVENT(block_rq_tag_wait,
+
+ TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+ bool is_sched_tag, unsigned int alloc_flags),
+
+ TP_ARGS(q, hctx, is_sched_tag, alloc_flags),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( u32, hctx_id )
+ __field( u32, nr_tags )
+ __field( bool, is_sched_tag )
+ __field( bool, is_reserved )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = q->disk ? disk_devt(q->disk) : 0;
+ __entry->hctx_id = hctx->queue_num;
+ __entry->is_sched_tag = is_sched_tag;
+ __entry->is_reserved = alloc_flags & BLK_MQ_REQ_RESERVED;
+
+ if (__entry->is_reserved) {
+ __entry->nr_tags = is_sched_tag ?
+ hctx->sched_tags->nr_reserved_tags :
+ hctx->tags->nr_reserved_tags;
+ } else {
+ if (is_sched_tag)
+ __entry->nr_tags = hctx->sched_tags->nr_tags -
+ hctx->sched_tags->nr_reserved_tags;
+ else
+ __entry->nr_tags = hctx->tags->nr_tags -
+ hctx->tags->nr_reserved_tags;
+ }
+
+ ),
+
+ TP_printk("%d,%d hctx=%u starved on %s%s tags (depth=%u)",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->hctx_id,
+ __entry->is_sched_tag ? "scheduler" : "hardware",
+ __entry->is_reserved ? " reserved" : "",
+ __entry->nr_tags)
+);
+
/**
* block_rq_insert - insert block operation request into queue
* @rq: block IO operation request
base-commit: 6779b50faa562e6cca1aa6a4649a4d764c6c7e28
--
2.51.0
^ permalink raw reply related
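The depth selection in the TP_fast_assign block above reduces to a small pure function. This is a minimal userspace sketch, assuming a hypothetical `struct tags_model` that carries only the two fields the tracepoint reads (nr_tags and nr_reserved_tags); `reported_depth()` is an illustrative name and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduction of the tag-set bookkeeping to two fields. */
struct tags_model {
	unsigned int nr_tags;		/* total tags, reserved included */
	unsigned int nr_reserved_tags;	/* tags set aside for reserved use */
};

/*
 * Depth reported for a starved allocation: the reserved pool size for
 * reserved requests, otherwise the unreserved remainder.
 */
static unsigned int reported_depth(const struct tags_model *tags,
				   bool is_reserved)
{
	assert(tags->nr_reserved_tags <= tags->nr_tags);

	if (is_reserved)
		return tags->nr_reserved_tags;
	return tags->nr_tags - tags->nr_reserved_tags;
}
```

The caller would pass either the scheduler tag set or the driver tag set, matching the is_sched_tag choice in the tracepoint. For a set with 64 tags of which 2 are reserved, the sketch reports a depth of 2 for reserved allocations and 62 otherwise.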
* [syzbot] [block?] possible deadlock in add_disk_fwnode
From: syzbot @ 2026-05-24 23:51 UTC (permalink / raw)
To: axboe, linux-block, linux-kernel, syzkaller-bugs
Hello,
syzbot found the following issue on:
HEAD commit: c1ecb239fa34 Add linux-next specific files for 20260522
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=15526d96580000
kernel config: https://syzkaller.appspot.com/x/.config?x=e0299bf0261ddd5
dashboard link: https://syzkaller.appspot.com/bug?extid=4feabfc9641267769c97
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
Unfortunately, I don't have any reproducer for this issue yet.
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/b8845b668755/disk-c1ecb239.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/517b638d908a/vmlinux-c1ecb239.xz
kernel image: https://storage.googleapis.com/syzbot-assets/dd5e3b587fce/bzImage-c1ecb239.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+4feabfc9641267769c97@syzkaller.appspotmail.com
======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
------------------------------------------------------
syz.0.2446/19404 is trying to acquire lock:
ffff88806b47a410 (&set->update_nr_hwq_lock){++++}-{4:4}, at: add_disk_fwnode+0xe7/0x480 block/genhd.c:596
but task is already holding lock:
ffffffff8e74bdd8 (major_names_lock){+.+.}-{4:4}, at: blk_probe_dev block/genhd.c:881 [inline]
ffffffff8e74bdd8 (major_names_lock){+.+.}-{4:4}, at: blk_request_module+0x35/0x2a0 block/genhd.c:897
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #7 (major_names_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
blk_probe_dev block/genhd.c:881 [inline]
blk_request_module+0x35/0x2a0 block/genhd.c:897
blkdev_get_no_open+0x3f/0xe0 block/bdev.c:833
bdev_file_open_by_dev+0xa0/0x240 block/bdev.c:1054
swsusp_check+0x56/0x490 kernel/power/swap.c:1571
software_resume+0x51/0x4c0 kernel/power/hibernate.c:1023
resume_store+0x333/0x4f0 kernel/power/hibernate.c:1307
kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
iter_file_splice_write+0x9a6/0x10f0 fs/splice.c:736
do_splice_from fs/splice.c:936 [inline]
direct_splice_actor+0x104/0x160 fs/splice.c:1159
splice_direct_to_actor+0x545/0xc80 fs/splice.c:1103
do_splice_direct_actor fs/splice.c:1202 [inline]
do_splice_direct+0x19b/0x2a0 fs/splice.c:1228
do_sendfile+0x547/0x7e0 fs/read_write.c:1372
__do_sys_sendfile64 fs/read_write.c:1433 [inline]
__se_sys_sendfile64+0x144/0x1a0 fs/read_write.c:1419
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #6 (system_transition_mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
lock_system_sleep+0x49/0x70 kernel/power/main.c:71
resume_store+0x2ff/0x4f0 kernel/power/hibernate.c:1300
kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
iter_file_splice_write+0x9a6/0x10f0 fs/splice.c:736
do_splice_from fs/splice.c:936 [inline]
direct_splice_actor+0x104/0x160 fs/splice.c:1159
splice_direct_to_actor+0x545/0xc80 fs/splice.c:1103
do_splice_direct_actor fs/splice.c:1202 [inline]
do_splice_direct+0x19b/0x2a0 fs/splice.c:1228
do_sendfile+0x547/0x7e0 fs/read_write.c:1372
__do_sys_sendfile64 fs/read_write.c:1433 [inline]
__se_sys_sendfile64+0x144/0x1a0 fs/read_write.c:1419
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&of->mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
kernfs_seq_start+0x5c/0x420 fs/kernfs/file.c:172
seq_read_iter+0x3f8/0xe20 fs/seq_file.c:226
new_sync_read fs/read_write.c:493 [inline]
vfs_read+0x58b/0xa80 fs/read_write.c:574
ksys_read+0x156/0x270 fs/read_write.c:717
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&p->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
seq_read_iter+0xb8/0xe20 fs/seq_file.c:183
lo_rw_aio+0xc80/0xf00 include/linux/percpu-rwsem.h:-1
do_req_filebacked drivers/block/loop.c:435 [inline]
loop_handle_cmd drivers/block/loop.c:1941 [inline]
loop_process_work+0x92a/0x11b0 drivers/block/loop.c:1976
process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #3 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x8d7/0x1630 kernel/workqueue.c:3294
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 ((wq_completion)loop1){+.+.}-{0:0}:
touch_wq_lockdep_map+0xcb/0x180 kernel/workqueue.c:4033
__flush_workqueue+0x14b/0x14f0 kernel/workqueue.c:4075
drain_workqueue+0xd3/0x390 kernel/workqueue.c:4239
__loop_clr_fd drivers/block/loop.c:1130 [inline]
lo_release+0x287/0x8f0 drivers/block/loop.c:1767
bdev_release+0x541/0x660 block/bdev.c:-1
blkdev_release+0x15/0x20 block/fops.c:705
__fput+0x461/0xa70 fs/file_table.c:510
fput_close_sync+0x11f/0x240 fs/file_table.c:615
__do_sys_close fs/open.c:1511 [inline]
__se_sys_close fs/open.c:1496 [inline]
__x64_sys_close+0x7e/0x110 fs/open.c:1496
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&disk->open_mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
__del_gendisk+0x127/0x980 block/genhd.c:710
del_gendisk+0xe7/0x160 block/genhd.c:823
loop_remove+0x42/0xc0 drivers/block/loop.c:2136
loop_control_remove drivers/block/loop.c:2195 [inline]
loop_control_ioctl+0x4ba/0x5b0 drivers/block/loop.c:2237
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xff/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (&set->update_nr_hwq_lock){++++}-{4:4}:
check_prev_add kernel/locking/lockdep.c:3167 [inline]
check_prevs_add kernel/locking/lockdep.c:3286 [inline]
validate_chain kernel/locking/lockdep.c:3910 [inline]
__lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
down_read+0x97/0x200 kernel/locking/rwsem.c:1568
add_disk_fwnode+0xe7/0x480 block/genhd.c:596
add_disk include/linux/blkdev.h:794 [inline]
loop_add+0x86e/0xb50 drivers/block/loop.c:2108
blk_probe_dev block/genhd.c:884 [inline]
blk_request_module+0x27d/0x2a0 block/genhd.c:-1
blkdev_get_no_open+0x3f/0xe0 block/bdev.c:833
blkdev_open+0x1f5/0x620 block/fops.c:688
do_dentry_open+0x83d/0x13e0 fs/open.c:947
vfs_open+0x3b/0x350 fs/open.c:1052
do_open fs/namei.c:4688 [inline]
path_openat+0x2eea/0x3960 fs/namei.c:4847
do_file_open+0x23e/0x4a0 fs/namei.c:4876
do_sys_openat2+0x115/0x200 fs/open.c:1368
do_sys_open fs/open.c:1374 [inline]
__do_sys_creat fs/open.c:1452 [inline]
__se_sys_creat fs/open.c:1446 [inline]
__x64_sys_creat+0x8f/0xc0 fs/open.c:1446
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
&set->update_nr_hwq_lock --> system_transition_mutex --> major_names_lock
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(major_names_lock);
                               lock(system_transition_mutex);
                               lock(major_names_lock);
  rlock(&set->update_nr_hwq_lock);
*** DEADLOCK ***
1 lock held by syz.0.2446/19404:
#0: ffffffff8e74bdd8 (major_names_lock){+.+.}-{4:4}, at: blk_probe_dev block/genhd.c:881 [inline]
#0: ffffffff8e74bdd8 (major_names_lock){+.+.}-{4:4}, at: blk_request_module+0x35/0x2a0 block/genhd.c:897
stack backtrace:
CPU: 0 UID: 0 PID: 19404 Comm: syz.0.2446 Tainted: G L syzkaller #0 PREEMPT_{RT,(full)}
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_circular_bug+0x2e1/0x300 kernel/locking/lockdep.c:2045
check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2177
check_prev_add kernel/locking/lockdep.c:3167 [inline]
check_prevs_add kernel/locking/lockdep.c:3286 [inline]
validate_chain kernel/locking/lockdep.c:3910 [inline]
__lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
down_read+0x97/0x200 kernel/locking/rwsem.c:1568
add_disk_fwnode+0xe7/0x480 block/genhd.c:596
add_disk include/linux/blkdev.h:794 [inline]
loop_add+0x86e/0xb50 drivers/block/loop.c:2108
blk_probe_dev block/genhd.c:884 [inline]
blk_request_module+0x27d/0x2a0 block/genhd.c:-1
blkdev_get_no_open+0x3f/0xe0 block/bdev.c:833
blkdev_open+0x1f5/0x620 block/fops.c:688
do_dentry_open+0x83d/0x13e0 fs/open.c:947
vfs_open+0x3b/0x350 fs/open.c:1052
do_open fs/namei.c:4688 [inline]
path_openat+0x2eea/0x3960 fs/namei.c:4847
do_file_open+0x23e/0x4a0 fs/namei.c:4876
do_sys_openat2+0x115/0x200 fs/open.c:1368
do_sys_open fs/open.c:1374 [inline]
__do_sys_creat fs/open.c:1452 [inline]
__se_sys_creat fs/open.c:1446 [inline]
__x64_sys_creat+0x8f/0xc0 fs/open.c:1446
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f1094cfce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f1092f56028 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
RAX: ffffffffffffffda RBX: 00007f1094f75fa0 RCX: 00007f1094cfce59
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00002000000000c0
RBP: 00007f1094d92d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f1094f76038 R14: 00007f1094f75fa0 R15: 00007fff22949298
</TASK>
block device autoloading is deprecated and will be removed.
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
* [syzbot] [block?] possible deadlock in bdev_open
From: syzbot @ 2026-05-24 23:29 UTC (permalink / raw)
To: axboe, linux-block, linux-kernel, syzkaller-bugs
Hello,
syzbot found the following issue on:
HEAD commit: c1ecb239fa34 Add linux-next specific files for 20260522
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=14e5db06580000
kernel config: https://syzkaller.appspot.com/x/.config?x=e0299bf0261ddd5
dashboard link: https://syzkaller.appspot.com/bug?extid=0f427123ae84b3ba6dc7
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
Unfortunately, I don't have any reproducer for this issue yet.
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/b8845b668755/disk-c1ecb239.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/517b638d908a/vmlinux-c1ecb239.xz
kernel image: https://storage.googleapis.com/syzbot-assets/dd5e3b587fce/bzImage-c1ecb239.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+0f427123ae84b3ba6dc7@syzkaller.appspotmail.com
======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
------------------------------------------------------
syz.0.336/7301 is trying to acquire lock:
ffff88801b3974c8 (&disk->open_mutex){+.+.}-{4:4}, at: bdev_open+0xe0/0xcc0 block/bdev.c:953
but task is already holding lock:
ffffffff8e74bdd8 (major_names_lock){+.+.}-{4:4}, at: blk_probe_dev block/genhd.c:881 [inline]
ffffffff8e74bdd8 (major_names_lock){+.+.}-{4:4}, at: blk_request_module+0x35/0x2a0 block/genhd.c:897
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (major_names_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
blk_probe_dev block/genhd.c:881 [inline]
blk_request_module+0x35/0x2a0 block/genhd.c:897
blkdev_get_no_open+0x3f/0xe0 block/bdev.c:833
bdev_file_open_by_dev+0xa0/0x240 block/bdev.c:1054
swsusp_check+0x56/0x490 kernel/power/swap.c:1571
software_resume+0x51/0x4c0 kernel/power/hibernate.c:1023
resume_store+0x333/0x4f0 kernel/power/hibernate.c:1307
kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
iter_file_splice_write+0x9a6/0x10f0 fs/splice.c:736
do_splice_from fs/splice.c:936 [inline]
direct_splice_actor+0x104/0x160 fs/splice.c:1159
splice_direct_to_actor+0x545/0xc80 fs/splice.c:1103
do_splice_direct_actor fs/splice.c:1202 [inline]
do_splice_direct+0x19b/0x2a0 fs/splice.c:1228
do_sendfile+0x547/0x7e0 fs/read_write.c:1372
__do_sys_sendfile64 fs/read_write.c:1433 [inline]
__se_sys_sendfile64+0x144/0x1a0 fs/read_write.c:1419
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (system_transition_mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
lock_system_sleep+0x49/0x70 kernel/power/main.c:71
resume_store+0x2ff/0x4f0 kernel/power/hibernate.c:1300
kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
iter_file_splice_write+0x9a6/0x10f0 fs/splice.c:736
do_splice_from fs/splice.c:936 [inline]
direct_splice_actor+0x104/0x160 fs/splice.c:1159
splice_direct_to_actor+0x545/0xc80 fs/splice.c:1103
do_splice_direct_actor fs/splice.c:1202 [inline]
do_splice_direct+0x19b/0x2a0 fs/splice.c:1228
do_sendfile+0x547/0x7e0 fs/read_write.c:1372
__do_sys_sendfile64 fs/read_write.c:1433 [inline]
__se_sys_sendfile64+0x144/0x1a0 fs/read_write.c:1419
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&of->mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
kernfs_seq_start+0x5c/0x420 fs/kernfs/file.c:172
seq_read_iter+0x3f8/0xe20 fs/seq_file.c:226
copy_splice_read+0x605/0xab0 fs/splice.c:362
do_splice_read fs/splice.c:980 [inline]
splice_direct_to_actor+0x483/0xc80 fs/splice.c:1084
do_splice_direct_actor fs/splice.c:1202 [inline]
do_splice_direct+0x19b/0x2a0 fs/splice.c:1228
do_sendfile+0x547/0x7e0 fs/read_write.c:1372
__do_sys_sendfile64 fs/read_write.c:1433 [inline]
__se_sys_sendfile64+0x144/0x1a0 fs/read_write.c:1419
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (&p->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
seq_read_iter+0xb8/0xe20 fs/seq_file.c:183
lo_rw_aio+0xc80/0xf00 include/linux/percpu-rwsem.h:-1
do_req_filebacked drivers/block/loop.c:435 [inline]
loop_handle_cmd drivers/block/loop.c:1941 [inline]
loop_process_work+0x92a/0x11b0 drivers/block/loop.c:1976
process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 ((work_completion)(&worker->work)){+.+.}-{0:0}:
process_one_work+0x8d7/0x1630 kernel/workqueue.c:3294
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #1 ((wq_completion)loop1){+.+.}-{0:0}:
touch_wq_lockdep_map+0xcb/0x180 kernel/workqueue.c:4033
__flush_workqueue+0x14b/0x14f0 kernel/workqueue.c:4075
drain_workqueue+0xd3/0x390 kernel/workqueue.c:4239
__loop_clr_fd drivers/block/loop.c:1130 [inline]
lo_release+0x287/0x8f0 drivers/block/loop.c:1767
bdev_release+0x541/0x660 block/bdev.c:-1
blkdev_release+0x15/0x20 block/fops.c:705
__fput+0x461/0xa70 fs/file_table.c:510
fput_close_sync+0x11f/0x240 fs/file_table.c:615
__do_sys_close fs/open.c:1511 [inline]
__se_sys_close fs/open.c:1496 [inline]
__x64_sys_close+0x7e/0x110 fs/open.c:1496
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (&disk->open_mutex){+.+.}-{4:4}:
check_prev_add kernel/locking/lockdep.c:3167 [inline]
check_prevs_add kernel/locking/lockdep.c:3286 [inline]
validate_chain kernel/locking/lockdep.c:3910 [inline]
__lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
bdev_open+0xe0/0xcc0 block/bdev.c:953
bdev_file_open_by_dev+0x1be/0x240 block/bdev.c:1067
disk_scan_partitions+0x1c1/0x2c0 block/genhd.c:387
add_disk_final block/genhd.c:416 [inline]
add_disk_fwnode+0x321/0x480 block/genhd.c:610
add_disk include/linux/blkdev.h:794 [inline]
brd_alloc+0x5c9/0x7d0 drivers/block/brd.c:340
blk_probe_dev block/genhd.c:884 [inline]
blk_request_module+0x27d/0x2a0 block/genhd.c:-1
blkdev_get_no_open+0x3f/0xe0 block/bdev.c:833
bdev_file_open_by_dev+0xa0/0x240 block/bdev.c:1054
swsusp_check+0x56/0x490 kernel/power/swap.c:1571
software_resume+0x51/0x4c0 kernel/power/hibernate.c:1023
resume_store+0x333/0x4f0 kernel/power/hibernate.c:1307
kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
new_sync_write fs/read_write.c:595 [inline]
vfs_write+0x629/0xba0 fs/read_write.c:688
ksys_write+0x156/0x270 fs/read_write.c:740
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
&disk->open_mutex --> system_transition_mutex --> major_names_lock
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(major_names_lock);
                               lock(system_transition_mutex);
                               lock(major_names_lock);
  lock(&disk->open_mutex);
*** DEADLOCK ***
6 locks held by syz.0.336/7301:
#0: ffff888011240528 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x252/0x320 fs/file.c:1260
#1: ffff888035564480 (sb_writers#7){.+.+}-{0:0}, at: file_start_write include/linux/fs.h:2733 [inline]
#1: ffff888035564480 (sb_writers#7){.+.+}-{0:0}, at: vfs_write+0x22d/0xba0 fs/read_write.c:684
#2: ffff88801efe3078 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1df/0x540 fs/kernfs/file.c:343
#3: ffff88801eec5c38 (kn->active#59){.+.+}-{0:0}, at: kernfs_get_active_of fs/kernfs/file.c:80 [inline]
#3: ffff88801eec5c38 (kn->active#59){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x232/0x540 fs/kernfs/file.c:344
#4: ffffffff8de756d8 (system_transition_mutex){+.+.}-{4:4}, at: software_resume+0x47/0x4c0 kernel/power/hibernate.c:1022
#5: ffffffff8e74bdd8 (major_names_lock){+.+.}-{4:4}, at: blk_probe_dev block/genhd.c:881 [inline]
#5: ffffffff8e74bdd8 (major_names_lock){+.+.}-{4:4}, at: blk_request_module+0x35/0x2a0 block/genhd.c:897
stack backtrace:
CPU: 1 UID: 0 PID: 7301 Comm: syz.0.336 Tainted: G L syzkaller #0 PREEMPT_{RT,(full)}
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_circular_bug+0x2e1/0x300 kernel/locking/lockdep.c:2045
check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2177
check_prev_add kernel/locking/lockdep.c:3167 [inline]
check_prevs_add kernel/locking/lockdep.c:3286 [inline]
validate_chain kernel/locking/lockdep.c:3910 [inline]
__lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
bdev_open+0xe0/0xcc0 block/bdev.c:953
bdev_file_open_by_dev+0x1be/0x240 block/bdev.c:1067
disk_scan_partitions+0x1c1/0x2c0 block/genhd.c:387
add_disk_final block/genhd.c:416 [inline]
add_disk_fwnode+0x321/0x480 block/genhd.c:610
add_disk include/linux/blkdev.h:794 [inline]
brd_alloc+0x5c9/0x7d0 drivers/block/brd.c:340
blk_probe_dev block/genhd.c:884 [inline]
blk_request_module+0x27d/0x2a0 block/genhd.c:-1
blkdev_get_no_open+0x3f/0xe0 block/bdev.c:833
bdev_file_open_by_dev+0xa0/0x240 block/bdev.c:1054
swsusp_check+0x56/0x490 kernel/power/swap.c:1571
software_resume+0x51/0x4c0 kernel/power/hibernate.c:1023
resume_store+0x333/0x4f0 kernel/power/hibernate.c:1307
kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
new_sync_write fs/read_write.c:595 [inline]
vfs_write+0x629/0xba0 fs/read_write.c:688
ksys_write+0x156/0x270 fs/read_write.c:740
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f157962ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f157783c028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f15798a6180 RCX: 00007f157962ce59
RDX: 0000000000000024 RSI: 0000200000000040 RDI: 000000000000000d
RBP: 00007f15796c2d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f15798a6218 R14: 00007f15798a6180 R15: 00007fffa4f29e08
</TASK>
block device autoloading is deprecated and will be removed.
PM: Image not found (code -22)
* Re: [PATCH v8] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-05-24 22:39 UTC (permalink / raw)
To: axboe, rostedt, mhiramat, mathieu.desnoyers
Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
john.g.garry, loberman, neelx, sean, mproche, chjohnst,
linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <20260524014204.622699-1-atomlin@atomlin.com>
On Sat, May 23, 2026 at 09:42:04PM -0400, Aaron Tomlin wrote:
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 6aa79e2d799c..736e176f6d17 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -226,6 +226,61 @@ DECLARE_EVENT_CLASS(block_rq,
> IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
> );
>
> +/**
> + * block_rq_tag_wait - triggered when a request is starved of a tag
> + * @q: request queue of the target device
> + * @hctx: hardware context of the request experiencing starvation
> + * @is_sched_tag: indicates whether the starved pool is the software scheduler
> + * @alloc_flags: allocation flags dictating the specific tag pool
> + *
> + * Called immediately before the submitting context is forced to block due
> + * to the exhaustion of available tags (i.e., physical hardware driver
> + * tags, software scheduler tags, or reserved tags). This trace point
> + * indicates that the context will be placed into an uninterruptible state
> + * via io_schedule() until an active request completes and relinquishes its
> + * assigned tag.
> + */
> +TRACE_EVENT(block_rq_tag_wait,
> +
> + TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
> + bool is_sched_tag, unsigned int alloc_flags),
> +
> + TP_ARGS(q, hctx, is_sched_tag, alloc_flags),
> +
> + TP_STRUCT__entry(
> + __field( dev_t, dev )
> + __field( u32, hctx_id )
> + __field( u32, nr_tags )
> + __field( bool, is_sched_tag )
> + __field( bool, is_reserved )
> + ),
> +
> + TP_fast_assign(
> + __entry->dev = q->disk ? disk_devt(q->disk) : 0;
> + __entry->hctx_id = hctx->queue_num;
> + __entry->is_sched_tag = is_sched_tag;
> + __entry->is_reserved = alloc_flags & BLK_MQ_REQ_RESERVED;
> +
> + if (__entry->is_reserved) {
> + __entry->nr_tags = is_sched_tag ?
> + hctx->sched_tags->nr_reserved_tags :
> + hctx->tags->nr_reserved_tags;
> + } else {
> + __entry->nr_tags = is_sched_tag ?
> + hctx->sched_tags->nr_tags :
> + hctx->tags->nr_tags;
> + }
> +
> + ),
> +
> + TP_printk("%d,%d hctx=%u starved on %s%s tags (depth=%u)",
> + MAJOR(__entry->dev), MINOR(__entry->dev),
> + __entry->hctx_id,
> + __entry->is_sched_tag ? "scheduler" : "hardware",
> + __entry->is_reserved ? " reserved" : "",
> + __entry->nr_tags)
> +);
This is wrong.
If __entry->is_reserved is false, the current logic reports nr_tags, which is
the total pool depth (reserved and standard tags combined), rather than the
depth of the unreserved pool the waiter is actually competing for.
I have refactored the TP_fast_assign block to evaluate the reserved status
orthogonally, ensuring nr_reserved_tags is correctly reported for I/O
schedulers. Additionally, the unreserved pool calculation has been fixed to
accurately subtract nr_reserved_tags from nr_tags.
I will include these corrections in the next iteration. Given the extent of
the functional changes to the tracepoint assignment logic, I will drop the
existing "Reviewed-by:" tags.
--
Aaron Tomlin
* [PATCH 1/1] rust: block: fix GenDiskBuilder failure cleanup
From: Ren Wei @ 2026-05-24 15:26 UTC (permalink / raw)
To: linux-block, rust-for-linux
Cc: ojeda, boqun, gary, bjorn3_gh, lossin, a.hindborg, aliceryhl,
tmgross, dakr, daniel.almeida, axboe, sunke, tamird, yuantan098,
bird, royenheart, n05ec
In-Reply-To: <cover.1779596478.git.royenheart@gmail.com>
From: Haoze Xie <royenheart@gmail.com>
If GenDiskBuilder::build() fails after __blk_mq_alloc_disk(), the
allocated gendisk is left behind until the caller drops the last
tagset reference.
Handle the failure path by releasing the temporary gendisk first,
then converting the foreign queue data back, so probe failures clean
up both resources before returning an error.
Fixes: 3253aba3408aa ("rust: block: introduce `kernel::block::mq` module")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Haoze Xie <royenheart@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
rust/kernel/block/mq/gen_disk.rs | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
index 912cb805caf5..100c7b937a7e 100644
--- a/rust/kernel/block/mq/gen_disk.rs
+++ b/rust/kernel/block/mq/gen_disk.rs
@@ -149,6 +149,17 @@ pub fn build<T: Operations>(
// SAFETY: `gendisk` is a valid pointer as we initialized it above
unsafe { (*gendisk).fops = &TABLE };
+ let cleanup_failure = ScopeGuard::new_with_data((gendisk, data), |(gendisk, data)| {
+ // SAFETY: `gendisk` came from `__blk_mq_alloc_disk()` above and
+ // has not been added to the VFS on this cleanup path.
+ unsafe { bindings::put_disk(gendisk) };
+ // SAFETY: `data` came from `into_foreign()` above and has not been
+ // converted back on this cleanup path.
+ drop(unsafe { T::QueueData::from_foreign(data) });
+ });
+ // The failure guard now owns both pieces of cleanup; the early guard
+ // must not run on this path anymore.
+ recover_data.dismiss();
let mut writer = NullTerminatedFormatter::new(
// SAFETY: `gendisk` points to a valid and initialized instance. We
@@ -172,7 +183,7 @@ pub fn build<T: Operations>(
},
)?;
- recover_data.dismiss();
+ cleanup_failure.dismiss();
// INVARIANT: `gendisk` was initialized above.
// INVARIANT: `gendisk` was added to the VFS via `device_add_disk` above.
--
2.47.3
* [syzbot] [block?] possible deadlock in bdev_release (2)
From: syzbot @ 2026-05-24 13:36 UTC (permalink / raw)
To: axboe, linux-block, linux-kernel, syzkaller-bugs
Hello,
syzbot found the following issue on:
HEAD commit: c1ecb239fa34 Add linux-next specific files for 20260522
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=153a947e580000
kernel config: https://syzkaller.appspot.com/x/.config?x=77a9211ff284de54
dashboard link: https://syzkaller.appspot.com/bug?extid=2f62807dc3239b8f584e
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
Unfortunately, I don't have any reproducer for this issue yet.
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/4cb88c910144/disk-c1ecb239.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/4a9bc938cf88/vmlinux-c1ecb239.xz
kernel image: https://storage.googleapis.com/syzbot-assets/684f1e33f264/bzImage-c1ecb239.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+2f62807dc3239b8f584e@syzkaller.appspotmail.com
======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
------------------------------------------------------
udevd/5769 is trying to acquire lock:
ffff88805ecd4938 ((wq_completion)loop4){+.+.}-{0:0}, at: touch_wq_lockdep_map+0xb5/0x180 kernel/workqueue.c:4033
but task is already holding lock:
ffff88802674c4c8 (&disk->open_mutex){+.+.}-{4:4}, at: bdev_release+0x1af/0x660 block/bdev.c:1136
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #5 (&disk->open_mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
bdev_open+0xe0/0xcc0 block/bdev.c:953
bdev_file_open_by_dev+0x1be/0x240 block/bdev.c:1067
swsusp_check+0x56/0x490 kernel/power/swap.c:1571
software_resume+0x51/0x4c0 kernel/power/hibernate.c:1023
resume_store+0x333/0x4f0 kernel/power/hibernate.c:1307
kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
new_sync_write fs/read_write.c:595 [inline]
vfs_write+0x629/0xba0 fs/read_write.c:688
ksys_write+0x156/0x270 fs/read_write.c:740
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (system_transition_mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
lock_system_sleep+0x49/0x70 kernel/power/main.c:71
resume_store+0x2ff/0x4f0 kernel/power/hibernate.c:1300
kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352
new_sync_write fs/read_write.c:595 [inline]
vfs_write+0x629/0xba0 fs/read_write.c:688
ksys_write+0x156/0x270 fs/read_write.c:740
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (&of->mutex){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
kernfs_seq_start+0x5c/0x420 fs/kernfs/file.c:172
traverse+0x164/0x580 fs/seq_file.c:107
seq_read_iter+0xd09/0xe20 fs/seq_file.c:196
lo_rw_aio+0xc80/0xf00 include/linux/percpu-rwsem.h:-1
do_req_filebacked drivers/block/loop.c:435 [inline]
loop_handle_cmd drivers/block/loop.c:1941 [inline]
loop_process_work+0x92a/0x11b0 drivers/block/loop.c:1976
process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&p->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/rtmutex_api.c:559 [inline]
mutex_lock_nested+0x5a/0x1d0 kernel/locking/rtmutex_api.c:578
seq_read_iter+0xb8/0xe20 fs/seq_file.c:183
lo_rw_aio+0xc80/0xf00 include/linux/percpu-rwsem.h:-1
do_req_filebacked drivers/block/loop.c:435 [inline]
loop_handle_cmd drivers/block/loop.c:1941 [inline]
loop_process_work+0x92a/0x11b0 drivers/block/loop.c:1976
process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #1 ((work_completion)(&worker->work)){+.+.}-{0:0}:
process_one_work+0x8d7/0x1630 kernel/workqueue.c:3294
process_scheduled_works kernel/workqueue.c:3401 [inline]
worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 ((wq_completion)loop4){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3167 [inline]
check_prevs_add kernel/locking/lockdep.c:3286 [inline]
validate_chain kernel/locking/lockdep.c:3910 [inline]
__lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
touch_wq_lockdep_map+0xcb/0x180 kernel/workqueue.c:4033
__flush_workqueue+0x14b/0x14f0 kernel/workqueue.c:4075
drain_workqueue+0xd3/0x390 kernel/workqueue.c:4239
__loop_clr_fd drivers/block/loop.c:1130 [inline]
lo_release+0x287/0x8f0 drivers/block/loop.c:1767
bdev_release+0x541/0x660 block/bdev.c:-1
blkdev_release+0x15/0x20 block/fops.c:705
__fput+0x461/0xa70 fs/file_table.c:510
fput_close_sync+0x11f/0x240 fs/file_table.c:615
__do_sys_close fs/open.c:1511 [inline]
__se_sys_close fs/open.c:1496 [inline]
__x64_sys_close+0x7e/0x110 fs/open.c:1496
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
(wq_completion)loop4 --> system_transition_mutex --> &disk->open_mutex
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&disk->open_mutex);
                               lock(system_transition_mutex);
                               lock(&disk->open_mutex);
  lock((wq_completion)loop4);
*** DEADLOCK ***
1 lock held by udevd/5769:
#0: ffff88802674c4c8 (&disk->open_mutex){+.+.}-{4:4}, at: bdev_release+0x1af/0x660 block/bdev.c:1136
stack backtrace:
CPU: 0 UID: 0 PID: 5769 Comm: udevd Tainted: G L syzkaller #0 PREEMPT_{RT,(full)}
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_circular_bug+0x2e1/0x300 kernel/locking/lockdep.c:2045
check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2177
check_prev_add kernel/locking/lockdep.c:3167 [inline]
check_prevs_add kernel/locking/lockdep.c:3286 [inline]
validate_chain kernel/locking/lockdep.c:3910 [inline]
__lock_acquire+0x15a5/0x2d10 kernel/locking/lockdep.c:5239
lock_acquire+0x106/0x350 kernel/locking/lockdep.c:5870
touch_wq_lockdep_map+0xcb/0x180 kernel/workqueue.c:4033
__flush_workqueue+0x14b/0x14f0 kernel/workqueue.c:4075
drain_workqueue+0xd3/0x390 kernel/workqueue.c:4239
__loop_clr_fd drivers/block/loop.c:1130 [inline]
lo_release+0x287/0x8f0 drivers/block/loop.c:1767
bdev_release+0x541/0x660 block/bdev.c:-1
blkdev_release+0x15/0x20 block/fops.c:705
__fput+0x461/0xa70 fs/file_table.c:510
fput_close_sync+0x11f/0x240 fs/file_table.c:615
__do_sys_close fs/open.c:1511 [inline]
__se_sys_close fs/open.c:1496 [inline]
__x64_sys_close+0x7e/0x110 fs/open.c:1496
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f836530a407
Code: 48 89 fa 4c 89 df e8 38 aa 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
RSP: 002b:00007ffd5e8529c0 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
RAX: ffffffffffffffda RBX: 00007f836521c880 RCX: 00007f836530a407
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000008
RBP: 00007f836521c6e8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
R13: 000055c60f674190 R14: 0000000000000008 R15: 000055c60f683f10
</TASK>
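For readers unfamiliar with how such a report is derived, the circular dependency can be modelled with a toy lock-order graph. This is a userspace sketch only, not lockdep's implementation: the three lock classes come from the "Chain exists of" summary (intermediate classes such as &of->mutex and &p->lock are collapsed into direct edges), and the names lock_class, add_dep(), has_path() and would_deadlock() are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock classes named in the "Chain exists of" summary of the report. */
enum lock_class {
	WQ_LOOP4,			/* (wq_completion)loop4 */
	SYSTEM_TRANSITION_MUTEX,	/* system_transition_mutex */
	DISK_OPEN_MUTEX,		/* &disk->open_mutex */
	NR_CLASSES
};

/* dep[a][b] is true when b was acquired while a was held (a --> b). */
static bool dep[NR_CLASSES][NR_CLASSES];

/* Record that @acquired was taken while @held was already held. */
static void add_dep(enum lock_class held, enum lock_class acquired)
{
	assert(held < NR_CLASSES && acquired < NR_CLASSES);
	dep[held][acquired] = true;
}

/* Depth-first search: is there a dependency path from @from to @to? */
static bool has_path(enum lock_class from, enum lock_class to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_CLASSES; next++)
		if (dep[from][next] && !seen[next] &&
		    has_path((enum lock_class)next, to, seen))
			return true;
	return false;
}

/*
 * Acquiring @acquired while holding @held closes a cycle if @held is
 * already reachable from @acquired through recorded dependencies.
 */
static bool would_deadlock(enum lock_class held, enum lock_class acquired)
{
	bool seen[NR_CLASSES] = { false };

	return has_path(acquired, held, seen);
}
```

Recording loop4 --> system_transition_mutex --> open_mutex with add_dep() and then asking would_deadlock(DISK_OPEN_MUTEX, WQ_LOOP4) models udevd draining the loop4 workqueue from lo_release() while still holding &disk->open_mutex.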
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
^ permalink raw reply
* [PATCH v8] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-05-24 1:42 UTC (permalink / raw)
To: axboe, rostedt, mhiramat, mathieu.desnoyers
Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
john.g.garry, loberman, neelx, sean, mproche, chjohnst,
linux-block, linux-kernel, linux-trace-kernel
In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.
Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptible via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.
This patch introduces the block_rq_tag_wait tracepoint in the tag
allocation slow-path. It triggers immediately before the task state
is altered to TASK_UNINTERRUPTIBLE (ensuring safety for PREEMPT_RT
locks). It exposes the exact hardware context (hctx) that is starved,
the specific pool experiencing starvation (driver, software scheduler,
or reserved), and the exact pool depth.
This provides storage engineers with a zero-configuration, low-overhead
mechanism to definitively identify shared-tag bottlenecks. For example,
userspace can trivially replicate tag starvation counters using bpftrace:
# bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
Attaching 1 probe...
^C
@tag_waits[4]: 12
@tag_waits[12]: 87
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
Changes since v7 [1]:
- Added an is_reserved boolean to the trace record to explicitly expose
reserved pool starvation to userspace
- Fixed TP_fast_assign to report the correct nr_reserved_tags depth
when I/O schedulers utilise the reserved pool
Changes since v6 [2]:
- Dropped Patch 2. Observability is now driven entirely by the tracepoint,
with the commit message updated to demonstrate how userspace (e.g.,
bpftrace) can safely replicate counting out-of-band (Jens Axboe)
- Moved tracepoint call above sbitmap_prepare_to_wait(). This prevents
inadvertently resetting the task state under PREEMPT_RT locks
- Updated the tracepoint signature and TP_fast_assign block to evaluate
the allocation flags. If the submitting context is starved of a reserved
tag (BLK_MQ_REQ_RESERVED), the tracepoint now accurately reports the
severely constrained nr_reserved_tags depth instead of the total nr_tags
depth.
Changes since v5 [3]:
- Replaced this_cpu_inc() with raw_cpu_inc() within
blk_mq_debugfs_inc_wait_tags(). This resolves a preemption warning
triggered under CONFIG_DEBUG_PREEMPT=y, as the routine is invoked from a
preemptible context immediately prior to io_schedule(). This adjustment
deliberately prioritises the reduction of execution overhead over
absolute statistical precision for this diagnostic interface.
Changes since v4 [4]:
- Prevented a NULL pointer dereference in the tracepoint fast-assign for
disk-less request queues by safely checking q->disk before resolving the
dev_t
- Fixed a Use-After-Free (UAF) and permanent memory leak by decoupling
the per-CPU counter allocation from the volatile debugfs lifecycle and
tying it directly to the core hctx lifecycle (i.e., blk_mq_init_hctx()
and blk_mq_exit_hctx())
- Fixed a potential compiler double-fetch bug by wrapping the per-CPU
pointer evaluations with READ_ONCE() in blk_mq_debugfs_inc_wait_tags()
- Passed the appropriate gfp_t flags down to the allocation routines to
maintain the strict GFP_NOIO context
- Updated kernel-doc descriptions to clarify that the NULL pointer
checks guard against memory allocation failures under pressure, rather
than initialisation race conditions
Changes since v3 [5]:
- Transitioned tracking architecture from shared atomic_t variables to
dynamically allocated per-CPU counters to resolve cache line bouncing
(Bart Van Assche)
Changes since v2 [6]:
- Added "Reviewed-by:" and "Tested-by:" tags for patch 1
- Evaluate is_sched_tag directly within TP_fast_assign (Steven Rostedt)
- Introduced atomic counters via debugfs
Changes since v1 [7]:
- Improved the description of the trace point (Damien Le Moal)
- Removed the redundant "active requests" (Laurence Oberman)
- Introduced pool-specific starvation tracking
[1]: https://lore.kernel.org/lkml/20260523200942.587199-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260517213614.350367-1-atomlin@atomlin.com/
[3]: https://lore.kernel.org/lkml/20260427020142.358912-1-atomlin@atomlin.com/
[4]: https://lore.kernel.org/lkml/20260419023036.1419514-1-atomlin@atomlin.com/
[5]: https://lore.kernel.org/lkml/20260319221956.332770-1-atomlin@atomlin.com/
[6]: https://lore.kernel.org/lkml/20260319015300.287653-1-atomlin@atomlin.com/
[7]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/
---
block/blk-mq-tag.c | 6 ++++
include/trace/events/block.h | 55 ++++++++++++++++++++++++++++++++++++
2 files changed, 61 insertions(+)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..35deee5bbc73 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -13,6 +13,7 @@
#include <linux/kmemleak.h>
#include <linux/delay.h>
+#include <trace/events/block.h>
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-sched.h"
@@ -181,6 +182,11 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
if (tag != BLK_MQ_NO_TAG)
break;
+ /* Log the starvation event before altering task state */
+ trace_block_rq_tag_wait(data->q, data->hctx,
+ data->rq_flags & RQF_SCHED_TAGS,
+ data->flags);
+
sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
tag = __blk_mq_get_tag(data, bt);
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..736e176f6d17 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -226,6 +226,61 @@ DECLARE_EVENT_CLASS(block_rq,
IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
);
+/**
+ * block_rq_tag_wait - triggered when a request is starved of a tag
+ * @q: request queue of the target device
+ * @hctx: hardware context of the request experiencing starvation
+ * @is_sched_tag: indicates whether the starved pool is the software scheduler
+ * @alloc_flags: allocation flags dictating the specific tag pool
+ *
+ * Called immediately before the submitting context is forced to block due
+ * to the exhaustion of available tags (i.e., physical hardware driver
+ * tags, software scheduler tags, or reserved tags). This trace point
+ * indicates that the context will be placed into an uninterruptible state
+ * via io_schedule() until an active request completes and relinquishes its
+ * assigned tag.
+ */
+TRACE_EVENT(block_rq_tag_wait,
+
+ TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+ bool is_sched_tag, unsigned int alloc_flags),
+
+ TP_ARGS(q, hctx, is_sched_tag, alloc_flags),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( u32, hctx_id )
+ __field( u32, nr_tags )
+ __field( bool, is_sched_tag )
+ __field( bool, is_reserved )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = q->disk ? disk_devt(q->disk) : 0;
+ __entry->hctx_id = hctx->queue_num;
+ __entry->is_sched_tag = is_sched_tag;
+ __entry->is_reserved = alloc_flags & BLK_MQ_REQ_RESERVED;
+
+ if (__entry->is_reserved) {
+ __entry->nr_tags = is_sched_tag ?
+ hctx->sched_tags->nr_reserved_tags :
+ hctx->tags->nr_reserved_tags;
+ } else {
+ __entry->nr_tags = is_sched_tag ?
+ hctx->sched_tags->nr_tags :
+ hctx->tags->nr_tags;
+ }
+
+ ),
+
+ TP_printk("%d,%d hctx=%u starved on %s%s tags (depth=%u)",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->hctx_id,
+ __entry->is_sched_tag ? "scheduler" : "hardware",
+ __entry->is_reserved ? " reserved" : "",
+ __entry->nr_tags)
+);
+
/**
* block_rq_insert - insert block operation request into queue
* @rq: block IO operation request
base-commit: 6779b50faa562e6cca1aa6a4649a4d764c6c7e28
--
2.51.0
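For illustration only (not part of the patch): a userspace sketch of the event-specific text that the TP_printk() format above would produce. The struct and both helpers are hypothetical; major/minor stand in for MAJOR(__entry->dev) and MINOR(__entry->dev), and a real trace line would additionally carry the usual ftrace prefix (task, CPU, timestamp, event name).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace mirror of the fields the tracepoint records. */
struct tag_wait_entry {
	int major;		/* stands in for MAJOR(__entry->dev) */
	int minor;		/* stands in for MINOR(__entry->dev) */
	unsigned int hctx_id;
	unsigned int nr_tags;
	bool is_sched_tag;
	bool is_reserved;
};

/* Render @e using the format string from the patch's TP_printk(). */
static int format_tag_wait(char *buf, size_t len,
			   const struct tag_wait_entry *e)
{
	assert(buf && e);
	return snprintf(buf, len,
			"%d,%d hctx=%u starved on %s%s tags (depth=%u)",
			e->major, e->minor, e->hctx_id,
			e->is_sched_tag ? "scheduler" : "hardware",
			e->is_reserved ? " reserved" : "",
			e->nr_tags);
}

/* Consumer-side check: does a rendered line report the reserved pool? */
static bool line_is_reserved(const char *line)
{
	return strstr(line, " reserved tags") != NULL;
}
```

With major=8, minor=0, hctx_id=3, nr_tags=1 and only is_reserved set, format_tag_wait() yields "8,0 hctx=3 starved on hardware reserved tags (depth=1)", so a script scanning the trace output can tell reserved-pool starvation apart by the " reserved" qualifier.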
^ permalink raw reply related
* Re: [PATCH blktests v2 0/3] introduce command trace feature
From: Shin'ichiro Kawasaki @ 2026-05-24 0:17 UTC (permalink / raw)
To: linux-block; +Cc: Daniel Wagner, John Meneghini, Bart Van Assche
In-Reply-To: <20260516120729.113659-1-shinichiro.kawasaki@wdc.com>
On May 16, 2026 / 21:07, Shin'ichiro Kawasaki wrote:
> Some blktests test cases have deep nesting, making their behavior
> difficult to understand. For example, the nvme test group has many
> helper functions that set sysfs attribute values and call nvme-cli
> commands. Understanding these behaviors is essential for debugging test
> case failures.
>
> This series adds a new 'command trace' feature to blktests. The first
> patch introduces a new --cmd-trace (or -t) option to record commands
> executed during test runs. The second and third patches add a new helper
> function _set_attr(), which traces both the value and file name of sysfs
> attribute writes.
>
> With this series, blktests users can use the option to generate a
> .cmdtrace file that records all commands executed during a test case
> run. By grepping the .cmdtrace file, users can check writes to sysfs
> attributes and nvme-cli command invocations. The example below shows how
> nvme targets are set up for the nvme/008 test case with rdma transport.
FYI, I applied this series.
^ permalink raw reply
* Re: [PATCH v7] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-05-23 23:55 UTC (permalink / raw)
To: axboe, rostedt, mhiramat, mathieu.desnoyers
Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
john.g.garry, loberman, neelx, sean, mproche, chjohnst,
linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <20260523200942.587199-1-atomlin@atomlin.com>
On Sat, May 23, 2026 at 04:09:42PM -0400, Aaron Tomlin wrote:
> +/**
> + * block_rq_tag_wait - triggered when a request is starved of a tag
> + * @q: request queue of the target device
> + * @hctx: hardware context of the request experiencing starvation
> + * @is_sched_tag: indicates whether the starved pool is the software scheduler
> + * @alloc_flags: allocation flags dictating the specific tag pool
> + *
> + * Called immediately before the submitting context is forced to block due
> + * to the exhaustion of available tags (i.e., physical hardware driver
> + * tags, software scheduler tags, or reserved tags). This trace point
> + * indicates that the context will be placed into an uninterruptible state
> + * via io_schedule() until an active request completes and relinquishes its
> + * assigned tag.
> + */
> +TRACE_EVENT(block_rq_tag_wait,
> +
> + TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
> + bool is_sched_tag, unsigned int alloc_flags),
> +
> + TP_ARGS(q, hctx, is_sched_tag, alloc_flags),
> +
> + TP_STRUCT__entry(
> + __field( dev_t, dev )
> + __field( u32, hctx_id )
> + __field( u32, nr_tags )
> + __field( bool, is_sched_tag )
> + ),
> +
> + TP_fast_assign(
> + __entry->dev = q->disk ? disk_devt(q->disk) : 0;
> + __entry->hctx_id = hctx->queue_num;
> + __entry->is_sched_tag = is_sched_tag;
> +
> + if (is_sched_tag) {
> + __entry->nr_tags = hctx->sched_tags->nr_tags;
> + } else if (alloc_flags & BLK_MQ_REQ_RESERVED) {
> + __entry->nr_tags = hctx->tags->nr_reserved_tags;
> + } else {
> + __entry->nr_tags = hctx->tags->nr_tags;
> + }
> +
> + ),
> +
> + TP_printk("%d,%d hctx=%u starved on %s tags (depth=%u)",
> + MAJOR(__entry->dev), MINOR(__entry->dev),
> + __entry->hctx_id,
> + __entry->is_sched_tag ? "scheduler" : "hardware",
> + __entry->nr_tags)
> +);
> +
> /**
> * block_rq_insert - insert block operation request into queue
> * @rq: block IO operation request
I completely overlooked that a request could legitimately have both
RQF_SCHED_TAGS and BLK_MQ_REQ_RESERVED set simultaneously.
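To make the overlooked case concrete, here is a small userspace model of the depth selection in the quoted v7 hunk next to the selection used in the v8 posting. It is a sketch under assumptions: struct tags_model is a made-up stand-in for the two depth fields of the kernel's tag structure, and the depths used with it are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two depth fields of a tag pool. */
struct tags_model {
	unsigned int nr_tags;
	unsigned int nr_reserved_tags;
};

/* v7 logic: the scheduler branch is tested first and ignores "reserved". */
static unsigned int depth_v7(const struct tags_model *sched,
			     const struct tags_model *hw,
			     bool is_sched_tag, bool is_reserved)
{
	assert(sched && hw);
	if (is_sched_tag)
		return sched->nr_tags;
	if (is_reserved)
		return hw->nr_reserved_tags;
	return hw->nr_tags;
}

/* v8 logic: pick the pool first, then its reserved or regular depth. */
static unsigned int depth_v8(const struct tags_model *sched,
			     const struct tags_model *hw,
			     bool is_sched_tag, bool is_reserved)
{
	const struct tags_model *pool;

	assert(sched && hw);
	pool = is_sched_tag ? sched : hw;
	return is_reserved ? pool->nr_reserved_tags : pool->nr_tags;
}
```

The two functions agree in every case except is_sched_tag && is_reserved, where the v7 logic reports the scheduler pool's full nr_tags rather than its nr_reserved_tags.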
--
Aaron Tomlin
^ permalink raw reply
* [PATCH v7] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-05-23 20:09 UTC (permalink / raw)
To: axboe, rostedt, mhiramat, mathieu.desnoyers
Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
john.g.garry, loberman, neelx, sean, mproche, chjohnst,
linux-block, linux-kernel, linux-trace-kernel
In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.
Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptible via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.
This patch introduces the block_rq_tag_wait tracepoint in the tag
allocation slow-path. It triggers immediately before the task state
is altered to TASK_UNINTERRUPTIBLE (ensuring safety for PREEMPT_RT
locks). It exposes the exact hardware context (hctx) that is starved,
the specific pool experiencing starvation (driver, software scheduler,
or reserved), and the exact pool depth.
This provides storage engineers with a zero-configuration, low-overhead
mechanism to definitively identify shared-tag bottlenecks. For example,
userspace can trivially replicate tag starvation counters using bpftrace:
# bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
Attaching 1 probe...
^C
@tag_waits[4]: 12
@tag_waits[12]: 87
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
Changes since v6 [1]:
- Dropped Patch 2. Observability is now driven entirely by the tracepoint,
with the commit message updated to demonstrate how userspace (e.g.,
bpftrace) can safely replicate counting out-of-band (Jens Axboe)
- Moved tracepoint call above sbitmap_prepare_to_wait(). This prevents
inadvertently resetting the task state under PREEMPT_RT locks
- Updated the tracepoint signature and TP_fast_assign block to evaluate
the allocation flags. If the submitting context is starved of a reserved
tag (BLK_MQ_REQ_RESERVED), the tracepoint now accurately reports the
severely constrained nr_reserved_tags depth instead of the total nr_tags
depth.
Changes since v5 [2]:
- Replaced this_cpu_inc() with raw_cpu_inc() within
blk_mq_debugfs_inc_wait_tags(). This resolves a preemption warning
triggered under CONFIG_DEBUG_PREEMPT=y, as the routine is invoked from a
preemptible context immediately prior to io_schedule(). This adjustment
deliberately prioritises the reduction of execution overhead over
absolute statistical precision for this diagnostic interface.
Changes since v4 [3]:
- Prevented a NULL pointer dereference in the tracepoint fast-assign for
disk-less request queues by safely checking q->disk before resolving the
dev_t
- Fixed a Use-After-Free (UAF) and permanent memory leak by decoupling
the per-CPU counter allocation from the volatile debugfs lifecycle and
tying it directly to the core hctx lifecycle (i.e., blk_mq_init_hctx()
and blk_mq_exit_hctx())
- Fixed a potential compiler double-fetch bug by wrapping the per-CPU
pointer evaluations with READ_ONCE() in blk_mq_debugfs_inc_wait_tags()
- Passed the appropriate gfp_t flags down to the allocation routines to
maintain the strict GFP_NOIO context
- Updated kernel-doc descriptions to clarify that the NULL pointer
checks guard against memory allocation failures under pressure, rather
than initialisation race conditions
Changes since v3 [4]:
- Transitioned tracking architecture from shared atomic_t variables to
dynamically allocated per-CPU counters to resolve cache line bouncing
(Bart Van Assche)
Changes since v2 [5]:
- Added "Reviewed-by:" and "Tested-by:" tags for patch 1
- Evaluate is_sched_tag directly within TP_fast_assign (Steven Rostedt)
- Introduced atomic counters via debugfs
Changes since v1 [6]:
- Improved the description of the trace point (Damien Le Moal)
- Removed the redundant "active requests" (Laurence Oberman)
- Introduced pool-specific starvation tracking
[1]: https://lore.kernel.org/lkml/20260517213614.350367-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260427020142.358912-1-atomlin@atomlin.com/
[3]: https://lore.kernel.org/lkml/20260419023036.1419514-1-atomlin@atomlin.com/
[4]: https://lore.kernel.org/lkml/20260319221956.332770-1-atomlin@atomlin.com/
[5]: https://lore.kernel.org/lkml/20260319015300.287653-1-atomlin@atomlin.com/
[6]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/
---
block/blk-mq-tag.c | 6 +++++
include/trace/events/block.h | 50 ++++++++++++++++++++++++++++++++++++
2 files changed, 56 insertions(+)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..35deee5bbc73 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -13,6 +13,7 @@
#include <linux/kmemleak.h>
#include <linux/delay.h>
+#include <trace/events/block.h>
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-sched.h"
@@ -181,6 +182,11 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
if (tag != BLK_MQ_NO_TAG)
break;
+ /* Log the starvation event before altering task state */
+ trace_block_rq_tag_wait(data->q, data->hctx,
+ data->rq_flags & RQF_SCHED_TAGS,
+ data->flags);
+
sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
tag = __blk_mq_get_tag(data, bt);
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..15b2e0edd2d4 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -226,6 +226,56 @@ DECLARE_EVENT_CLASS(block_rq,
IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
);
+/**
+ * block_rq_tag_wait - triggered when a request is starved of a tag
+ * @q: request queue of the target device
+ * @hctx: hardware context of the request experiencing starvation
+ * @is_sched_tag: indicates whether the starved pool is the software scheduler
+ * @alloc_flags: allocation flags dictating the specific tag pool
+ *
+ * Called immediately before the submitting context is forced to block due
+ * to the exhaustion of available tags (i.e., physical hardware driver
+ * tags, software scheduler tags, or reserved tags). This trace point
+ * indicates that the context will be placed into an uninterruptible state
+ * via io_schedule() until an active request completes and relinquishes its
+ * assigned tag.
+ */
+TRACE_EVENT(block_rq_tag_wait,
+
+ TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+ bool is_sched_tag, unsigned int alloc_flags),
+
+ TP_ARGS(q, hctx, is_sched_tag, alloc_flags),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( u32, hctx_id )
+ __field( u32, nr_tags )
+ __field( bool, is_sched_tag )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = q->disk ? disk_devt(q->disk) : 0;
+ __entry->hctx_id = hctx->queue_num;
+ __entry->is_sched_tag = is_sched_tag;
+
+ if (is_sched_tag) {
+ __entry->nr_tags = hctx->sched_tags->nr_tags;
+ } else if (alloc_flags & BLK_MQ_REQ_RESERVED) {
+ __entry->nr_tags = hctx->tags->nr_reserved_tags;
+ } else {
+ __entry->nr_tags = hctx->tags->nr_tags;
+ }
+
+ ),
+
+ TP_printk("%d,%d hctx=%u starved on %s tags (depth=%u)",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->hctx_id,
+ __entry->is_sched_tag ? "scheduler" : "hardware",
+ __entry->nr_tags)
+);
+
/**
* block_rq_insert - insert block operation request into queue
* @rq: block IO operation request
base-commit: 6779b50faa562e6cca1aa6a4649a4d764c6c7e28
--
2.51.0
^ permalink raw reply related
* Re: [PATCH] block, nvme: export and use passthrough stats
From: Nilay Shroff @ 2026-05-23 18:54 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme; +Cc: axboe, hch, Keith Busch
In-Reply-To: <20260522151537.1509784-1-kbusch@meta.com>
On 5/22/26 8:45 PM, Keith Busch wrote:
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 263161cb8ac06..435fab0be6401 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -175,9 +175,11 @@ void nvme_mpath_start_request(struct request *rq)
> nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
> }
>
> - if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq) ||
> + if (!blk_queue_io_stat(disk->queue) ||
> (nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
> return;
> + if (blk_rq_is_passthrough(rq) && !blk_rq_passthrough_stats(rq))
> + return;
>
> nvme_req(rq)->flags |= NVME_MPATH_IO_STATS;
> nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0, req_op(rq),
Thanks for this patch! It will be very useful for the nvme top dashboard,
which can then display passthrough I/O stats under the head node.
However, I see an issue with @rq being passed to blk_rq_passthrough_stats():
@rq->q points to the nvme path queue object rather than the head node queue
object. The intention here is to account passthrough I/O statistics at the
head node level, but with the current implementation the decision depends
on the path queue's iostats_passthrough setting instead of the head queue's
setting. As a result, head-node passthrough I/O statistics may get accounted
based on whether the underlying path queue has iostats_passthrough enabled,
irrespective of the head queue configuration.
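A simplified model of the mismatch, for illustration only. The structs below are hypothetical stand-ins rather than kernel types, only the queue-setting part of the decision is modelled (the NVME_MPATH_IO_STATS flag and any other checks inside blk_rq_passthrough_stats() are left out), and the second function is just one possible way to consult the head queue instead, not a proposed fix:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; not the kernel's structures. */
struct queue_model {
	bool io_stat;			/* models blk_queue_io_stat() */
	bool iostats_passthrough;	/* models the iostats_passthrough setting */
};

struct request_model {
	const struct queue_model *q;	/* path queue the request belongs to */
	bool passthrough;		/* models blk_rq_is_passthrough() */
};

/* As in the quoted hunk: the passthrough setting is read through rq->q. */
static bool head_acct_quoted(const struct queue_model *head,
			     const struct request_model *rq)
{
	assert(head && rq && rq->q);
	if (!head->io_stat)
		return false;
	if (rq->passthrough && !rq->q->iostats_passthrough)
		return false;
	return true;
}

/* Variant that consults the head queue's own passthrough setting. */
static bool head_acct_head_setting(const struct queue_model *head,
				   const struct request_model *rq)
{
	assert(head && rq);
	if (!head->io_stat)
		return false;
	if (rq->passthrough && !head->iostats_passthrough)
		return false;
	return true;
}
```

With iostats_passthrough disabled on the head queue but enabled on the path queue, head_acct_quoted() still accounts the passthrough request on the head node while head_acct_head_setting() does not; swapping the two settings inverts the outcome.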
Another issue is that nvme_mpath_start_request() is only reached from
nvme_start_request() when REQ_NVME_MPATH is set. Normal bio-based I/O
through the head gets that flag in nvme_ns_head_submit_bio(), but the
passthrough ioctl/uring path in the quoted patch does not set it. So
head passthrough I/O would not enter nvme_mpath_start_request() and
would not get head-node stats.
Thanks,
--Nilay
^ permalink raw reply
* Re: DEPT (the dependency tracker) as AI review prompt?
From: Yunseong Kim @ 2026-05-23 15:04 UTC (permalink / raw)
To: Harry Yoo, Yunseong Kim
Cc: Byungchul Park, linux-kernel, kernel_team, torvalds,
damien.lemoal, linux-ide, adilger.kernel, linux-ext4, mingo,
peterz, will, tglx, rostedt, joel, sashal, daniel.vetter,
duyuyang, johannes.berg, tj, tytso, willy, david, amir73il,
gregkh, kernel-team, linux-mm, akpm, mhocko, minchan, hannes,
vdavydov.dev, sj, jglisse, dennis, cl, penberg, rientjes, vbabka,
ngupta, linux-block, linux-fsdevel, jack, jlayton, dan.j.williams,
hch, djwong, dri-devel, rodrigosiqueiramelo, melissa.srw,
hamohammed.sa, harry.yoo, chris.p.wilson, gwan-gyeong.mun,
max.byungchul.park, boqun.feng, longman, yunseong.kim,
yeoreum.yun, netdev, matthew.brost, her0gyugyu, corbet,
catalin.marinas, bp, x86, hpa, luto, sumit.semwal, gustavo,
christian.koenig, andi.shyti, arnd, lorenzo.stoakes, Liam.Howlett,
rppt, surenb, mcgrof, petr.pavlu, da.gomez, samitolvanen, paulmck,
frederic, neeraj.upadhyay, joelagnelf, josh, urezki,
mathieu.desnoyers, jiangshanlai, qiang.zhang, juri.lelli,
vincent.guittot, dietmar.eggemann, bsegall, mgorman, vschneid,
chuck.lever, neil, okorniev, Dai.Ngo, tom, trondmy, anna, kees,
bigeasy, clrkwllms, mark.rutland, ada.coupriediaz,
kristina.martsenko, wangkefeng.wang, broonie, kevin.brodsky, dwmw,
shakeel.butt, ast, ziy, yuzhao, baolin.wang, usamaarif642,
joel.granados, richard.weiyang, geert+renesas, tim.c.chen, linux,
alexander.shishkin, lillian, chenhuacai, francesco,
guoweikang.kernel, link, jpoimboe, masahiroy, brauner,
thomas.weissschuh, oleg, mjguzik, andrii, wangfushuai, linux-doc,
linux-arm-kernel, linux-media, linaro-mm-sig, linux-i2c,
linux-arch, linux-modules, rcu, linux-nfs, linux-rt-devel,
2407018371, dakr, miguel.ojeda.sandonis, neilb, bagasdotme,
wsa+renesas, dave.hansen, geert, ojeda, alex.gaynor, gary,
bjorn3_gh, lossin, a.hindborg, aliceryhl, tmgross, rust-for-linux,
Chris Mason, Roman Gushchin, Josef Bacik
In-Reply-To: <0592b09b-a084-4d9d-bcbf-1b77e45226cf@kernel.org>
Hi Harry,
On 5/23/26 16:34, Harry Yoo wrote:
>
>
> On 5/23/26 11:00 PM, Yunseong Kim wrote:
>> I've previously experimented with running DEPT alongside syzkaller fuzzing,
>> and many hung tasks missed by lockdep are caught by DEPT, but the resulting
>> high volume of reports makes it easy for issues to get lost in the massive
>> log output. Sorting through that output manually is a huge bottleneck, so
>> leveraging a well-crafted AI prompt to triage the warnings and filter out
>> the false positives would be incredibly valuable.
>
> I mean both 1) detection of deadlock issues AND 2) false positive elimination with AI.
I completely agree. Embedding DEPT's model into an AI review prompt
is a great idea. As you suggested, the patterns we develop for the AI
could provide valuable feedback to enhance DEPT itself.
> If the review prompt is only used to eliminate DEPT's false positives, I think that would be quite hard to get broad use.
>
> Someone would have to build out-of-tree DEPT, collect the reports, and then feed those back into the AI. I don't think building that kind of pipeline would actually work well in practice.
I also have a huge pile of DEPT reports, and manually reviewing
all of them makes me sigh. The constant kernel rebuilds required
for each round of lockup testing are also quite expensive.
Thanks for the summary!
Best Regards,
Yunseong
^ permalink raw reply
* Re: DEPT (the dependency tracker) as AI review prompt?
From: Harry Yoo @ 2026-05-23 14:34 UTC (permalink / raw)
To: Yunseong Kim
Cc: Byungchul Park, linux-kernel, kernel_team, torvalds,
damien.lemoal, linux-ide, adilger.kernel, linux-ext4, mingo,
peterz, will, tglx, rostedt, joel, sashal, daniel.vetter,
duyuyang, johannes.berg, tj, tytso, willy, david, amir73il,
gregkh, kernel-team, linux-mm, akpm, mhocko, minchan, hannes,
vdavydov.dev, sj, jglisse, dennis, cl, penberg, rientjes, vbabka,
ngupta, linux-block, linux-fsdevel, jack, jlayton, dan.j.williams,
hch, djwong, dri-devel, rodrigosiqueiramelo, melissa.srw,
hamohammed.sa, harry.yoo, chris.p.wilson, gwan-gyeong.mun,
max.byungchul.park, boqun.feng, longman, yunseong.kim,
yeoreum.yun, netdev, matthew.brost, her0gyugyu, corbet,
catalin.marinas, bp, x86, hpa, luto, sumit.semwal, gustavo,
christian.koenig, andi.shyti, arnd, lorenzo.stoakes, Liam.Howlett,
rppt, surenb, mcgrof, petr.pavlu, da.gomez, samitolvanen, paulmck,
frederic, neeraj.upadhyay, joelagnelf, josh, urezki,
mathieu.desnoyers, jiangshanlai, qiang.zhang, juri.lelli,
vincent.guittot, dietmar.eggemann, bsegall, mgorman, vschneid,
chuck.lever, neil, okorniev, Dai.Ngo, tom, trondmy, anna, kees,
bigeasy, clrkwllms, mark.rutland, ada.coupriediaz,
kristina.martsenko, wangkefeng.wang, broonie, kevin.brodsky, dwmw,
shakeel.butt, ast, ziy, yuzhao, baolin.wang, usamaarif642,
joel.granados, richard.weiyang, geert+renesas, tim.c.chen, linux,
alexander.shishkin, lillian, chenhuacai, francesco,
guoweikang.kernel, link, jpoimboe, masahiroy, brauner,
thomas.weissschuh, oleg, mjguzik, andrii, wangfushuai, linux-doc,
linux-arm-kernel, linux-media, linaro-mm-sig, linux-i2c,
linux-arch, linux-modules, rcu, linux-nfs, linux-rt-devel,
2407018371, dakr, miguel.ojeda.sandonis, neilb, bagasdotme,
wsa+renesas, dave.hansen, geert, ojeda, alex.gaynor, gary,
bjorn3_gh, lossin, a.hindborg, aliceryhl, tmgross, rust-for-linux,
Chris Mason, Roman Gushchin, Josef Bacik, Yunseong Kim
In-Reply-To: <CA+7O06GxeDLR9RcKDN2i-Rgc4kgzz6BfF4b0XAH4tFx=A723Nw@mail.gmail.com>
On 5/23/26 11:00 PM, Yunseong Kim wrote:
> I've previously experimented with running DEPT alongside syzkaller fuzzing,
> and many hung tasks missed by lockdep are caught by DEPT, but the resulting
> high volume of reports makes it easy for issues to get lost in the massive
> log output. Sorting through that output manually is a huge bottleneck, so
> leveraging a well-crafted AI prompt to triage the warnings and filter out
> the false positives would be incredibly valuable.
I mean both 1) detection of deadlock issues AND 2) false positive
elimination with AI.
If the review prompt is only used to eliminate DEPT's false positives, I
think that would be quite hard to get broad use.
Someone would have to build out-of-tree DEPT, collect the reports, and
then feed those back into the AI. I don't think building that kind of
pipeline would actually work well in practice.
--
Cheers,
Harry / Hyeonggon
* Re: DEPT (the dependency tracker) as AI review prompt? (was: DEPT v18)
From: Yunseong Kim @ 2026-05-23 14:00 UTC (permalink / raw)
To: Harry Yoo
Cc: Byungchul Park, linux-kernel, kernel_team, torvalds,
damien.lemoal, linux-ide, adilger.kernel, linux-ext4, mingo,
peterz, will, tglx, rostedt, joel, sashal, daniel.vetter,
duyuyang, johannes.berg, tj, tytso, willy, david, amir73il,
gregkh, kernel-team, linux-mm, akpm, mhocko, minchan, hannes,
vdavydov.dev, sj, jglisse, dennis, cl, penberg, rientjes, vbabka,
ngupta, linux-block, linux-fsdevel, jack, jlayton, dan.j.williams,
hch, djwong, dri-devel, rodrigosiqueiramelo, melissa.srw,
hamohammed.sa, harry.yoo, chris.p.wilson, gwan-gyeong.mun,
max.byungchul.park, boqun.feng, longman, yunseong.kim, ysk,
yeoreum.yun, netdev, matthew.brost, her0gyugyu, corbet,
catalin.marinas, bp, x86, hpa, luto, sumit.semwal, gustavo,
christian.koenig, andi.shyti, arnd, lorenzo.stoakes, Liam.Howlett,
rppt, surenb, mcgrof, petr.pavlu, da.gomez, samitolvanen, paulmck,
frederic, neeraj.upadhyay, joelagnelf, josh, urezki,
mathieu.desnoyers, jiangshanlai, qiang.zhang, juri.lelli,
vincent.guittot, dietmar.eggemann, bsegall, mgorman, vschneid,
chuck.lever, neil, okorniev, Dai.Ngo, tom, trondmy, anna, kees,
bigeasy, clrkwllms, mark.rutland, ada.coupriediaz,
kristina.martsenko, wangkefeng.wang, broonie, kevin.brodsky, dwmw,
shakeel.butt, ast, ziy, yuzhao, baolin.wang, usamaarif642,
joel.granados, richard.weiyang, geert+renesas, tim.c.chen, linux,
alexander.shishkin, lillian, chenhuacai, francesco,
guoweikang.kernel, link, jpoimboe, masahiroy, brauner,
thomas.weissschuh, oleg, mjguzik, andrii, wangfushuai, linux-doc,
linux-arm-kernel, linux-media, linaro-mm-sig, linux-i2c,
linux-arch, linux-modules, rcu, linux-nfs, linux-rt-devel,
2407018371, dakr, miguel.ojeda.sandonis, neilb, bagasdotme,
wsa+renesas, dave.hansen, geert, ojeda, alex.gaynor, gary,
bjorn3_gh, lossin, a.hindborg, aliceryhl, tmgross, rust-for-linux,
Chris Mason, Roman Gushchin, Josef Bacik, Yunseong Kim
In-Reply-To: <6b2a816f-eb3b-4e0c-a024-ee2e3743eb04@kernel.org>
Hi Harry,
On Sat, May 23, 2026 at 2:33 PM Harry Yoo <harry@kernel.org> wrote:
>
> Can we start DEPT as an AI review prompt, by documenting DEPT's
> dependency tracking model and false positive elimination rules as a
> carefully crafted prompt?
>
> While DEPT can identify deadlock issues beyond lockdep's capabilities,
> it is hard to enable in automated testing; without fine-grained
> annotations it can produce a high rate of false positives, and verifying
> them requires significant human effort.
>
> The open source AI Review Prompt has a locking.md file [1] that teaches
> the AI how to review locks and detect misuse.
>
> If we can write a review prompt for DEPT in a similar manner and have
> the AI do the deadlock detection and false positive elimination, I think
> we could identify those problems more effectively with much less human
> effort.
>
> [1]
> https://github.com/masoncl/review-prompts/blob/main/kernel/subsystem/locking.md
>
> --
> Cheers,
> Harry / Hyeonggon
I think this is an excellent idea, Harry.
I've previously experimented with running DEPT alongside syzkaller fuzzing,
and many hung tasks missed by lockdep are caught by DEPT, but the resulting
high volume of reports makes it easy for issues to get lost in the massive
log output. Sorting through that output manually is a huge bottleneck, so
leveraging a well-crafted AI prompt to triage the warnings and filter out
the false positives would be incredibly valuable.
I'd be happy to help translate DEPT's tracking model into specific rules for
reducing false positives and establishing solid filtering patterns.
> On 12/5/25 4:18 PM, Byungchul Park wrote:
> > I'm happy to see that DEPT reported real problems in practice:
> >
> > https://lore.kernel.org/lkml/6383cde5-cf4b-facf-6e07-1378a485657d@I-love.SAKURA.ne.jp/
> > https://lore.kernel.org/lkml/1674268856-31807-1-git-send-email-byungchul.park@lge.com/
> > https://lore.kernel.org/all/b6e00e77-4a8c-4e05-ab79-266bf05fcc2d@igalia.com/
> >
> > I’ve added documentation describing DEPT — this should help you
> > understand what DEPT is and how it works. You can use DEPT simply by
> > enabling CONFIG_DEPT and checking dmesg at runtime.
> > ---
> >
> > Hi Linus and folks,
> >
> > I’ve been developing a tool to detect deadlock possibilities by tracking
> > waits/events — rather than lock acquisition order — to cover all the
> > synchronization mechanisms. To summarize the design rationale, starting
> > from the problem statement, through analysis, to the solution:
> >
> > CURRENT STATUS
> > --------------
> > Lockdep tracks lock acquisition order to identify deadlock conditions.
> > Additionally, it tracks IRQ state changes — via {en,dis}able — to
> > detect cases where locks are acquired unintentionally during
> > interrupt handling.
> >
> > PROBLEM
> > -------
> > Waits and their associated events that are never reachable can
> > eventually lead to deadlocks. However, since Lockdep focuses solely
> > on lock acquisition order, it has inherent limitations when handling
> > waits and events.
> >
> > Moreover, by tracking only lock acquisition order, Lockdep cannot
> > properly handle read locks or cross-event scenarios — such as
> > wait_for_completion() and complete() — making it increasingly
> > inadequate as a general-purpose deadlock detection tool.
> >
> > SOLUTION
> > --------
> > Once again, waits and their associated events that are never
> > reachable can eventually lead to deadlocks. The new solution, DEPT,
> > focuses directly on waits and events. DEPT monitors waits and events,
> > and reports them when any become unreachable.
> >
> > DEPT provides:
> >
> > * Correct handling of read locks.
> > * Support for general waits and events.
> > * Continuous operation, even after multiple reports.
> > * Simple, intuitive annotation APIs.
> >
> > There are still false positives, and some are already being worked on
> > for suppression. Especially splitting the folio class into several
> > appropriate classes e.g. block device mapping class and regular file
> > mapping class, is currently under active development by me and Yeoreum
> > Yun.
> >
> > Anyway, these efforts will need to continue for a while, as we’ve seen
> > with lockdep over two decades. DEPT is tagged as EXPERIMENTAL in
> > Kconfig — meaning it’s not yet suitable for use as an automation tool.
> >
> > However, for those who are interested in using DEPT to analyze complex
> > synchronization patterns and extract dependency insights, DEPT would be
> > a great tool for the purpose.
Best regards,
Yunseong
* Re: [PATCHv3] blk-mq: pop cached request if it is usable
From: Jens Axboe @ 2026-05-23 13:56 UTC (permalink / raw)
To: Keith Busch, Ming Lei; +Cc: Keith Busch, hch, linux-block
In-Reply-To: <ahDe25f1Dw0YBsOL@kbusch-mbp>
On 5/22/26 4:55 PM, Keith Busch wrote:
> On Fri, May 22, 2026 at 04:08:56PM -0600, Keith Busch wrote:
>> On Fri, May 22, 2026 at 12:12:18PM +0800, Ming Lei wrote:
>>> On Thu, May 21, 2026 at 07:44:50PM -0600, Keith Busch wrote:
>>>> On Fri, May 22, 2026 at 07:33:39AM +0800, Ming Lei wrote:
>>>>>
>>>>> BTW, as mentioned in v2, the request may be added back in case of merge,
>>>>> but seems not a big deal given blk_mq_free_plug_rqs() doesn't free requests
>>>>> in batch.
>>>>
>>>> We could introduce a special goto label for the merge case to push it
>>>
>>> It can be done simply by replacing the added `blk_mq_free_request` with moving
>>> it back to plug list.
>>
>> What I'm worried about is hitting a blocking allocation, then the
>> cached_rqs list is freed, leaving the current request from it the only
>> one still holding a queue reference. I think we ought to re-enter the
>> queue in that case.
>
> Hmm, I may be mistaken here. The blocking allocation doesn't call
> blk_finish_plug(), so the current plug is left intact; the other
> queue_exit goto's are either from non-blocking contexts or a
> successful merge that holds queue references in other ways. I guess
> there is no queue_exit goto where unconditionally pushing back might
> be a problem. So yeah, sorry, maybe restoring it to the cached_rqs is
> a worthy optimization to make.
Yeah, should be fine; we also discussed this one the other day. Want to
send a patch for that? Just mark it as fixing the previous one even if
it isn't a bug fix in the strictest sense of the word, so that we ensure
that if one is backported, the other one will be too.
--
Jens Axboe
* [PATCH v1] block: switch numa_node to int in blk_mq_hw_ctx and init_request
From: Mateusz Nowicki @ 2026-05-23 12:52 UTC (permalink / raw)
To: Jens Axboe
Cc: Caleb Sander Mateos, Sung-woo Kim, Josef Bacik, Alasdair Kergon,
Mike Snitzer, Mikulas Patocka, Benjamin Marzinski, Ulf Hansson,
Richard Weinberger, Zhihao Cheng, Miquel Raynal,
Vignesh Raghavendra, Sven Peter, Janne Grunau, Neal Gompa,
Keith Busch, Christoph Hellwig, Sagi Grimberg, Justin Tee,
Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
James E.J. Bottomley, Martin K. Petersen, Thomas Fourier, Al Viro,
Luke Wang, Kees Cook, linux-block, linux-kernel, nbd, dm-devel,
linux-mmc, linux-mtd, asahi, linux-arm-kernel, linux-nvme,
linux-scsi
numa_node in blk_mq_hw_ctx and the matching argument of
blk_mq_ops::init_request can be NUMA_NO_NODE (-1). Declared as
unsigned int, NUMA_NO_NODE becomes UINT_MAX and walks off
nvme_dev::descriptor_pools[] on CONFIG_NUMA=n [1].
Switch the field and the callback prototype to int and update all
in-tree init_request implementations. No functional change:
cpu_to_node(), kmalloc_node() and blk_alloc_flush_queue() already
take int.
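
As a userspace illustration of the conversion described above (not part
of the patch), the sketch below shows how -1 behaves when passed through
an unsigned int parameter versus an int one. Only NUMA_NO_NODE mirrors
the kernel's value; the demo_* helpers, DEMO_NR_NODES and the fallback
to node 0 are assumptions made up for this example.

```c
#include <assert.h>
#include <limits.h>

/* Mirrors the kernel's value; every demo_* / DEMO_* name is hypothetical. */
#define NUMA_NO_NODE	(-1)
#define DEMO_NR_NODES	1	/* stand-in for a per-node array size */

/* Old prototype shape: by the time we see it, -1 has become UINT_MAX. */
static unsigned int demo_index_unsigned(unsigned int numa_node)
{
	return numa_node;
}

/*
 * New prototype shape: the sentinel survives, so the callee can test for
 * it. Falling back to node 0 is only one possible policy, chosen here to
 * keep the sketch short.
 */
static unsigned int demo_index_signed(int numa_node)
{
	if (numa_node == NUMA_NO_NODE)
		return 0;
	assert(numa_node >= 0 && numa_node < DEMO_NR_NODES);
	return (unsigned int)numa_node;
}
```

With the unsigned variant, demo_index_unsigned(NUMA_NO_NODE) yields
UINT_MAX, which is far past DEMO_NR_NODES; with the int variant the
callee can still recognise the sentinel before indexing anything.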
Link: https://lore.kernel.org/linux-nvme/20260522150628.399288-1-mateusz.nowicki@posteo.net/ [1]
Link: https://lore.kernel.org/linux-nvme/20260309062840.2937858-2-iam@sung-woo.kim/
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Sung-woo Kim <iam@sung-woo.kim>
Signed-off-by: Mateusz Nowicki <mateusz.nowicki@posteo.net>
---
block/bsg-lib.c | 2 +-
drivers/block/mtip32xx/mtip32xx.c | 2 +-
drivers/block/nbd.c | 2 +-
drivers/md/dm-rq.c | 2 +-
drivers/mmc/core/queue.c | 2 +-
drivers/mtd/ubi/block.c | 2 +-
drivers/nvme/host/apple.c | 2 +-
drivers/nvme/host/fc.c | 2 +-
drivers/nvme/host/pci.c | 2 +-
drivers/nvme/host/rdma.c | 2 +-
drivers/nvme/host/tcp.c | 2 +-
drivers/nvme/target/loop.c | 2 +-
drivers/scsi/scsi_lib.c | 2 +-
include/linux/blk-mq.h | 4 ++--
14 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index fdb4b290ca68..895db30a7033 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -299,7 +299,7 @@ static blk_status_t bsg_queue_rq(struct blk_mq_hw_ctx *hctx,
/* called right after the request is allocated for the request_queue */
static int bsg_init_rq(struct blk_mq_tag_set *set, struct request *req,
- unsigned int hctx_idx, unsigned int numa_node)
+ unsigned int hctx_idx, int numa_node)
{
struct bsg_job *job = blk_mq_rq_to_pdu(req);
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 567192e371a8..8aedba9b5690 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3340,7 +3340,7 @@ static void mtip_free_cmd(struct blk_mq_tag_set *set, struct request *rq,
}
static int mtip_init_cmd(struct blk_mq_tag_set *set, struct request *rq,
- unsigned int hctx_idx, unsigned int numa_node)
+ unsigned int hctx_idx, int numa_node)
{
struct driver_data *dd = set->driver_data;
struct mtip_cmd *cmd = blk_mq_rq_to_pdu(rq);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index fe63f3c55d0d..e2fe9e3308fc 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1888,7 +1888,7 @@ static void nbd_dbg_close(void)
#endif
static int nbd_init_request(struct blk_mq_tag_set *set, struct request *rq,
- unsigned int hctx_idx, unsigned int numa_node)
+ unsigned int hctx_idx, int numa_node)
{
struct nbd_cmd *cmd = blk_mq_rq_to_pdu(rq);
cmd->nbd = set->driver_data;
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 9703b3ae364e..9a386254d836 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -462,7 +462,7 @@ static void dm_start_request(struct mapped_device *md, struct request *orig)
}
static int dm_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
- unsigned int hctx_idx, unsigned int numa_node)
+ unsigned int hctx_idx, int numa_node)
{
struct mapped_device *md = set->driver_data;
struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(rq);
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 39fcb662c43f..cfa268925c26 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -208,7 +208,7 @@ static unsigned short mmc_get_max_segments(struct mmc_host *host)
}
static int mmc_mq_init_request(struct blk_mq_tag_set *set, struct request *req,
- unsigned int hctx_idx, unsigned int numa_node)
+ unsigned int hctx_idx, int numa_node)
{
struct mmc_queue_req *mq_rq = req_to_mmc_queue_req(req);
struct mmc_queue *mq = set->driver_data;
diff --git a/drivers/mtd/ubi/block.c b/drivers/mtd/ubi/block.c
index 8880a783c3bc..29c0d6941a81 100644
--- a/drivers/mtd/ubi/block.c
+++ b/drivers/mtd/ubi/block.c
@@ -312,7 +312,7 @@ static blk_status_t ubiblock_queue_rq(struct blk_mq_hw_ctx *hctx,
static int ubiblock_init_request(struct blk_mq_tag_set *set,
struct request *req, unsigned int hctx_idx,
- unsigned int numa_node)
+ int numa_node)
{
struct ubiblock_pdu *pdu = blk_mq_rq_to_pdu(req);
diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
index c692fc73babf..97586307ac1a 100644
--- a/drivers/nvme/host/apple.c
+++ b/drivers/nvme/host/apple.c
@@ -819,7 +819,7 @@ static int apple_nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
static int apple_nvme_init_request(struct blk_mq_tag_set *set,
struct request *req, unsigned int hctx_idx,
- unsigned int numa_node)
+ int numa_node)
{
struct apple_nvme_queue *q = set->driver_data;
struct apple_nvme *anv = queue_to_apple_nvme(q);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index e4f4528fe2a2..1907da499ad2 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2109,7 +2109,7 @@ __nvme_fc_init_request(struct nvme_fc_ctrl *ctrl,
static int
nvme_fc_init_request(struct blk_mq_tag_set *set, struct request *rq,
- unsigned int hctx_idx, unsigned int numa_node)
+ unsigned int hctx_idx, int numa_node)
{
struct nvme_fc_ctrl *ctrl = to_fc_ctrl(set->driver_data);
struct nvme_fcp_op_w_sgl *op = blk_mq_rq_to_pdu(rq);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 139a10cd687f..afd407df640f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -660,7 +660,7 @@ static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
static int nvme_pci_init_request(struct blk_mq_tag_set *set,
struct request *req, unsigned int hctx_idx,
- unsigned int numa_node)
+ int numa_node)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index f77c960f7632..08459c65c3d5 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -292,7 +292,7 @@ static void nvme_rdma_exit_request(struct blk_mq_tag_set *set,
static int nvme_rdma_init_request(struct blk_mq_tag_set *set,
struct request *rq, unsigned int hctx_idx,
- unsigned int numa_node)
+ int numa_node)
{
struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(set->driver_data);
struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 15d36d6a728e..36b3ec50a9fd 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -548,7 +548,7 @@ static void nvme_tcp_exit_request(struct blk_mq_tag_set *set,
static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
struct request *rq, unsigned int hctx_idx,
- unsigned int numa_node)
+ int numa_node)
{
struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(set->driver_data);
struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index d98d0cdc5d6f..ae00bcef2251 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -202,7 +202,7 @@ static int nvme_loop_init_iod(struct nvme_loop_ctrl *ctrl,
static int nvme_loop_init_request(struct blk_mq_tag_set *set,
struct request *req, unsigned int hctx_idx,
- unsigned int numa_node)
+ int numa_node)
{
struct nvme_loop_ctrl *ctrl = to_loop_ctrl(set->driver_data);
struct nvme_loop_iod *iod = blk_mq_rq_to_pdu(req);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6e8c7a42603e..67f789bd02e7 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1950,7 +1950,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
}
static int scsi_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
- unsigned int hctx_idx, unsigned int numa_node)
+ unsigned int hctx_idx, int numa_node)
{
struct Scsi_Host *shost = set->driver_data;
struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..2e7f90048171 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -428,7 +428,7 @@ struct blk_mq_hw_ctx {
struct blk_mq_tags *sched_tags;
/** @numa_node: NUMA node the storage adapter has been connected to. */
- unsigned int numa_node;
+ int numa_node;
/** @queue_num: Index of this hardware queue. */
unsigned int queue_num;
@@ -653,7 +653,7 @@ struct blk_mq_ops {
* flush request.
*/
int (*init_request)(struct blk_mq_tag_set *set, struct request *,
- unsigned int, unsigned int);
+ unsigned int, int);
/**
* @exit_request: Ditto for exit/teardown.
*/
base-commit: 45255ea1ca096b11b1303c9b54502a28f3a31dd1
--
2.53.0
* DEPT (the dependency tracker) as AI review prompt? (was: DEPT v18)
From: Harry Yoo @ 2026-05-23 12:32 UTC (permalink / raw)
To: Byungchul Park, linux-kernel
Cc: kernel_team, torvalds, damien.lemoal, linux-ide, adilger.kernel,
linux-ext4, mingo, peterz, will, tglx, rostedt, joel, sashal,
daniel.vetter, duyuyang, johannes.berg, tj, tytso, willy, david,
amir73il, gregkh, kernel-team, linux-mm, akpm, mhocko, minchan,
hannes, vdavydov.dev, sj, jglisse, dennis, cl, penberg, rientjes,
vbabka, ngupta, linux-block, josef, linux-fsdevel, jack, jlayton,
dan.j.williams, hch, djwong, dri-devel, rodrigosiqueiramelo,
melissa.srw, hamohammed.sa, harry.yoo, chris.p.wilson,
gwan-gyeong.mun, max.byungchul.park, boqun.feng, longman,
yunseong.kim, ysk, yeoreum.yun, netdev, matthew.brost, her0gyugyu,
corbet, catalin.marinas, bp, x86, hpa, luto, sumit.semwal,
gustavo, christian.koenig, andi.shyti, arnd, lorenzo.stoakes,
Liam.Howlett, rppt, surenb, mcgrof, petr.pavlu, da.gomez,
samitolvanen, paulmck, frederic, neeraj.upadhyay, joelagnelf,
josh, urezki, mathieu.desnoyers, jiangshanlai, qiang.zhang,
juri.lelli, vincent.guittot, dietmar.eggemann, bsegall, mgorman,
vschneid, chuck.lever, neil, okorniev, Dai.Ngo, tom, trondmy,
anna, kees, bigeasy, clrkwllms, mark.rutland, ada.coupriediaz,
kristina.martsenko, wangkefeng.wang, broonie, kevin.brodsky, dwmw,
shakeel.butt, ast, ziy, yuzhao, baolin.wang, usamaarif642,
joel.granados, richard.weiyang, geert+renesas, tim.c.chen, linux,
alexander.shishkin, lillian, chenhuacai, francesco,
guoweikang.kernel, link, jpoimboe, masahiroy, brauner,
thomas.weissschuh, oleg, mjguzik, andrii, wangfushuai, linux-doc,
linux-arm-kernel, linux-media, linaro-mm-sig, linux-i2c,
linux-arch, linux-modules, rcu, linux-nfs, linux-rt-devel,
2407018371, dakr, miguel.ojeda.sandonis, neilb, bagasdotme,
wsa+renesas, dave.hansen, geert, ojeda, alex.gaynor, gary,
bjorn3_gh, lossin, a.hindborg, aliceryhl, tmgross, rust-for-linux,
Chris Mason, Roman Gushchin, Josef Bacik
In-Reply-To: <20251205071855.72743-1-byungchul@sk.com>
Can we start DEPT as an AI review prompt, by documenting DEPT's
dependency tracking model and false positive elimination rules as a
carefully crafted prompt?
While DEPT can identify deadlock issues beyond lockdep's capabilities,
it is hard to enable in automated testing; without fine-grained
annotations it can produce a high rate of false positives, and verifying
them requires significant human effort.
The open source AI Review Prompt has a locking.md file [1] that teaches
the AI how to review locks and detect misuse.
If we can write a review prompt for DEPT in a similar manner and have
the AI do the deadlock detection and false positive elimination, I think
we could identify those problems more effectively with much less human
effort.
[1]
https://github.com/masoncl/review-prompts/blob/main/kernel/subsystem/locking.md
--
Cheers,
Harry / Hyeonggon
On 12/5/25 4:18 PM, Byungchul Park wrote:
> I'm happy to see that DEPT reported real problems in practice:
>
> https://lore.kernel.org/lkml/6383cde5-cf4b-facf-6e07-1378a485657d@I-love.SAKURA.ne.jp/
> https://lore.kernel.org/lkml/1674268856-31807-1-git-send-email-byungchul.park@lge.com/
> https://lore.kernel.org/all/b6e00e77-4a8c-4e05-ab79-266bf05fcc2d@igalia.com/
>
> I’ve added documentation describing DEPT — this should help you
> understand what DEPT is and how it works. You can use DEPT simply by
> enabling CONFIG_DEPT and checking dmesg at runtime.
> ---
>
> Hi Linus and folks,
>
> I’ve been developing a tool to detect deadlock possibilities by tracking
> waits/events — rather than lock acquisition order — to cover all the
> synchronization mechanisms. To summarize the design rationale, starting
> from the problem statement, through analysis, to the solution:
>
> CURRENT STATUS
> --------------
> Lockdep tracks lock acquisition order to identify deadlock conditions.
> Additionally, it tracks IRQ state changes — via {en,dis}able — to
> detect cases where locks are acquired unintentionally during
> interrupt handling.
>
> PROBLEM
> -------
> Waits and their associated events that are never reachable can
> eventually lead to deadlocks. However, since Lockdep focuses solely
> on lock acquisition order, it has inherent limitations when handling
> waits and events.
>
> Moreover, by tracking only lock acquisition order, Lockdep cannot
> properly handle read locks or cross-event scenarios — such as
> wait_for_completion() and complete() — making it increasingly
> inadequate as a general-purpose deadlock detection tool.
>
> SOLUTION
> --------
> Once again, waits and their associated events that are never
> reachable can eventually lead to deadlocks. The new solution, DEPT,
> focuses directly on waits and events. DEPT monitors waits and events,
> and reports them when any become unreachable.
>
> DEPT provides:
>
> * Correct handling of read locks.
> * Support for general waits and events.
> * Continuous operation, even after multiple reports.
> * Simple, intuitive annotation APIs.
>
> There are still false positives, and some are already being worked on
> for suppression. Especially splitting the folio class into several
> appropriate classes e.g. block device mapping class and regular file
> mapping class, is currently under active development by me and Yeoreum
> Yun.
>
> Anyway, these efforts will need to continue for a while, as we’ve seen
> with lockdep over two decades. DEPT is tagged as EXPERIMENTAL in
> Kconfig — meaning it’s not yet suitable for use as an automation tool.
>
> However, for those who are interested in using DEPT to analyze complex
> synchronization patterns and extract dependency insights, DEPT would be
> a great tool for the purpose.
* Re: [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices
From: Tal Zussman @ 2026-05-22 23:17 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig
Cc: Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <20260514-blk-dontcache-v6-4-782e2fa7477b@columbia.edu>
On 5/14/26 5:51 PM, Tal Zussman wrote:
> Block device buffered reads and writes already pass through
> filemap_read() and iomap_file_buffered_write() respectively, both of
> which handle IOCB_DONTCACHE. Enable RWF_DONTCACHE for block device files
> by setting FOP_DONTCACHE in def_blk_fops.
>
> For CONFIG_BUFFER_HEAD=y paths, use block_write_begin_iocb() in
> blkdev_write_begin() to thread the kiocb through so that buffer_head
> writeback gets dropbehind support.
>
> CONFIG_BUFFER_HEAD=n paths are handled by the previously added iomap
> BIO_COMPLETE_IN_TASK support.
>
> This support is useful for databases that operate on raw block devices,
> among other userspace applications.
>
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Responding to Sashiko review inline:
Link: https://sashiko.dev/#/patchset/20260514-blk-dontcache-v6-0-782e2fa7477b%40columbia.edu
Q: "Could this code path be unreachable during block device writes?
Block device buffered writes use blkdev_write_iter(), which unconditionally
delegates to blkdev_buffered_write() and subsequently
iomap_file_buffered_write(). The iomap infrastructure bypasses the legacy
address_space_operations .write_begin method.
During a write, iomap_write_begin() handles buffer head allocation internally
by calling __block_write_begin_int() directly. This naturally inherits the
FGP_DONTCACHE flag passed down from the kiocb via iomap_get_folio().
If the VFS write paths were actually calling .write_begin for block devices, a
CONFIG_BUFFER_HEAD=n kernel would crash with a NULL pointer dereference since
def_blk_aops does not define .write_begin or .write_end in that configuration."
A: So this actually seems legit... doesn't look like anything actually calls
blkdev_write_begin() or blkdev_write_end(), unless I'm missing something.
block_write_begin_iocb() usage seems necessary for bh-based filesystems, but
block devices seem to use iomap for writes unconditionally.
> ---
> block/fops.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/block/fops.c b/block/fops.c
> index bb6642b45937..31b073181d87 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -504,7 +504,8 @@ static int blkdev_write_begin(const struct kiocb *iocb,
> unsigned len, struct folio **foliop,
> void **fsdata)
> {
> - return block_write_begin(mapping, pos, len, foliop, blkdev_get_block);
> + return block_write_begin_iocb(iocb, mapping, pos, len, foliop,
> + blkdev_get_block);
> }
>
> static int blkdev_write_end(const struct kiocb *iocb,
> @@ -966,7 +967,7 @@ const struct file_operations def_blk_fops = {
> .splice_write = iter_file_splice_write,
> .fallocate = blkdev_fallocate,
> .uring_cmd = blkdev_uring_cmd,
> - .fop_flags = FOP_BUFFER_RASYNC,
> + .fop_flags = FOP_BUFFER_RASYNC | FOP_DONTCACHE,
> };
>
> static __init int blkdev_init(void)
>
* Re: [PATCH v6 3/4] buffer: add dropbehind writeback support
From: Tal Zussman @ 2026-05-22 23:14 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig
Cc: Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <20260514-blk-dontcache-v6-3-782e2fa7477b@columbia.edu>
On 5/14/26 5:51 PM, Tal Zussman wrote:
> Add block_write_begin_iocb() which threads the kiocb through to
> __filemap_get_folio() so that buffer_head-based I/O can use DONTCACHE
> behavior. When the iocb has IOCB_DONTCACHE set, FGP_DONTCACHE is
> passed to mark the folio for dropbehind. The existing
> block_write_begin() is preserved as a wrapper that passes a NULL iocb.
>
> Set BIO_COMPLETE_IN_TASK in submit_bh_wbc() when the folio has
> dropbehind set, so that buffer_head writeback completions get deferred
> to task context.
>
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Responding to Sashiko review inline:
Link: https://sashiko.dev/#/patchset/20260514-blk-dontcache-v6-0-782e2fa7477b%40columbia.edu
> ---
> fs/buffer.c | 19 +++++++++++++++++--
> include/linux/buffer_head.h | 3 +++
> 2 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b0b3792b1496..d0abaf44d782 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2138,14 +2138,19 @@ EXPORT_SYMBOL(block_commit_write);
> *
> * The filesystem needs to handle block truncation upon failure.
> */
> -int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
> +int block_write_begin_iocb(const struct kiocb *iocb,
> + struct address_space *mapping, loff_t pos, unsigned len,
> struct folio **foliop, get_block_t *get_block)
> {
> pgoff_t index = pos >> PAGE_SHIFT;
> + fgf_t fgp_flags = FGP_WRITEBEGIN;
> struct folio *folio;
> int status;
>
> - folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
> + if (iocb && iocb->ki_flags & IOCB_DONTCACHE)
> + fgp_flags |= FGP_DONTCACHE;
> +
> + folio = __filemap_get_folio(mapping, index, fgp_flags,
> mapping_gfp_mask(mapping));
> if (IS_ERR(folio))
> return PTR_ERR(folio);
> @@ -2160,6 +2165,13 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
> *foliop = folio;
> return status;
> }
Q: "Is an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() needed for
block_write_begin_iocb()?
Since the legacy wrapper block_write_begin() remains exported, it appears
filesystems compiled as loadable modules (like ext4 or fat) will encounter
linker errors if they attempt to transition to the new API."
A: Can be added once filesystems want to use it.
> +
> +int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
> + struct folio **foliop, get_block_t *get_block)
> +{
> + return block_write_begin_iocb(NULL, mapping, pos, len, foliop,
> + get_block);
> +}
> EXPORT_SYMBOL(block_write_begin);
>
> int block_write_end(loff_t pos, unsigned len, unsigned copied,
> @@ -2715,6 +2727,9 @@ static void submit_bh_wbc(blk_opf_t opf, struct buffer_head *bh,
>
> bio = bio_alloc(bh->b_bdev, 1, opf, GFP_NOIO);
>
> + if (folio_test_dropbehind(bh->b_folio))
> + bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
> +
Q: "Does this add unnecessary workqueue overhead for read operations?
While task-context deferral is required for asynchronous writeback completions
to allow folio_end_writeback() to sleep and invalidate the folio, read
dropbehind invalidation is handled synchronously in the reader's task context
via filemap_end_dropbehind_read().
Could this be gated by an op == REQ_OP_WRITE or op_is_write(opf) check so
that read bio completions aren't forced into a workqueue?"
A: Seems worth adding an op_is_write(opf) check to avoid deferring read
dropbehind completions.
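A rough user-space sketch of that gating follows; it is a model under stated assumptions, not the patch itself. The helper name bh_wants_task_completion() is hypothetical. op_is_write() is re-implemented here with the same bit-0 test the kernel helper uses, and the REQ_OP_* values are local stand-ins.

```c
#include <stdbool.h>

/* User-space stand-ins; the values mirror REQ_OP_READ and REQ_OP_WRITE. */
enum { REQ_OP_READ = 0, REQ_OP_WRITE = 1 };

/* Same test the kernel's op_is_write() applies: write ops have bit 0 set. */
static bool op_is_write(unsigned int opf)
{
	return (opf & 1u) != 0;
}

/*
 * Hypothetical helper: decide whether submit_bh_wbc() would set
 * BIO_COMPLETE_IN_TASK. Only dropbehind writes are deferred; dropbehind
 * reads are left to complete in whatever context they arrive in.
 */
static bool bh_wants_task_completion(unsigned int opf, bool folio_dropbehind)
{
	return folio_dropbehind && op_is_write(opf);
}
```

In the patch itself this would presumably reduce to adding op_is_write(opf) to the existing folio_test_dropbehind() condition.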
> if (IS_ENABLED(CONFIG_FS_ENCRYPTION))
> buffer_set_crypto_ctx(bio, bh, GFP_NOIO);
>
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index e4939e33b4b5..4ce50882d621 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -260,6 +260,9 @@ int block_read_full_folio(struct folio *, get_block_t *);
> bool block_is_partially_uptodate(struct folio *, size_t from, size_t count);
> int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
> struct folio **foliop, get_block_t *get_block);
> +int block_write_begin_iocb(const struct kiocb *iocb,
> + struct address_space *mapping, loff_t pos, unsigned len,
> + struct folio **foliop, get_block_t *get_block);
> int __block_write_begin(struct folio *folio, loff_t pos, unsigned len,
> get_block_t *get_block);
> int block_write_end(loff_t pos, unsigned len, unsigned copied, struct folio *);
>
^ permalink raw reply
* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Tal Zussman @ 2026-05-22 23:09 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig
Cc: Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
In-Reply-To: <20260514-blk-dontcache-v6-1-782e2fa7477b@columbia.edu>
On 5/14/26 5:51 PM, Tal Zussman wrote:
Responding to Sashiko review inline:
Link: https://sashiko.dev/#/patchset/20260514-blk-dontcache-v6-0-782e2fa7477b%40columbia.edu
> +static void bio_complete_work_fn(struct work_struct *w)
> +{
> + struct delayed_work *dw = to_delayed_work(w);
> + struct bio_complete_batch *batch =
> + container_of(dw, struct bio_complete_batch, work);
> +
> + while (1) {
> + struct bio_list list;
> + struct bio *bio;
> +
> + local_lock_irq(&bio_complete_batch.lock);
> + list = batch->list;
> + bio_list_init(&batch->list);
> + local_unlock_irq(&bio_complete_batch.lock);
> +
> + if (bio_list_empty(&list))
> + break;
> +
> + while ((bio = bio_list_pop(&list)))
> + bio->bi_end_io(bio);
Q: "Does this code handle bios that do not have a completion callback?
If a bio is submitted with the BIO_COMPLETE_IN_TASK flag but has a NULL
bi_end_io pointer (like some flush or discard bios), executing
bio->bi_end_io(bio) directly will cause a NULL pointer dereference.
The synchronous completion path in bio_endio() checks if bio->bi_end_io is
non-NULL before calling it. Should a similar check be added here?"
A: Every bio that reaches this point should have bi_end_io set. There's no
point in deferring a bio without a completion handler.
Q: "Could this loop also monopolize the CPU and cause latency spikes?
Because the work uses a 1-jiffie delay, tens of thousands of bios might
accumulate under heavy I/O load. Processing the entire batch sequentially
without yielding or calling cond_resched() inside the inner loop could
lead to scheduler latency or watchdog warnings on non-preemptible kernels."
A: With PREEMPT_LAZY as the default, the worker is preemptible, as per
Hillf and Christoph, so this is unnecessary.
> + if (need_resched()) {
> + bool is_empty;
> +
> + local_lock_irq(&bio_complete_batch.lock);
> + is_empty = bio_list_empty(&batch->list);
> + local_unlock_irq(&bio_complete_batch.lock);
> + if (!is_empty)
> + mod_delayed_work_on(batch->cpu,
> + bio_complete_wq,
> + &batch->work, 0);
> + break;
> + }
> + }
> +}
> +
> +void __bio_complete_in_task(struct bio *bio)
> +{
> + struct bio_complete_batch *batch;
> + unsigned long flags;
> + bool was_empty;
> +
> + local_lock_irqsave(&bio_complete_batch.lock, flags);
Q: "Is it safe to use local_lock_irqsave() here when called from an atomic
context?
On CONFIG_PREEMPT_RT kernels, local_lock_t maps to a per-CPU spinlock_t,
which is a sleepable rt_mutex. Since __bio_complete_in_task() is specifically
called when bio_in_atomic() is true (which includes hardware interrupts or
execution under a raw_spinlock_t), attempting to acquire a sleepable lock
here would trigger an "Invalid wait context" lockdep warning.
Would a lockless list (llist) be more appropriate here to avoid sleeping
in atomic contexts?"
A: This seems legit, but I'm not super familiar with PREEMPT_RT. I don't want
to switch to raw_spinlock_t, as it seems like that would add unnecessary
overhead on non-PREEMPT_RT kernels. I think switching to use local_irq_save()
(as is done for the per-CPU bio allocation cache) should work.
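For comparison, the lockless-list alternative raised in the question can be modeled in user space with C11 atomics. This is only a sketch of the idea, not kernel code: node_model, list_model, model_push() and model_take_all() are hypothetical names. The push returns whether the list was empty beforehand, which is the information was_empty carries in the patch, and the consumer detaches the whole list in one exchange.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a bio and the per-CPU list head. */
struct node_model {
	struct node_model *next;
	int id;
};

struct list_model {
	_Atomic(struct node_model *) first;
};

/*
 * Lock-free push. Returns true if the list was empty before the push,
 * so the caller knows it is the one that must kick the worker.
 */
static bool model_push(struct list_model *l, struct node_model *n)
{
	struct node_model *old = atomic_load(&l->first);

	do {
		/* On CAS failure 'old' is refreshed, so relink and retry. */
		n->next = old;
	} while (!atomic_compare_exchange_weak(&l->first, &old, n));

	return old == NULL;
}

/* Detach every queued node at once; the result is newest-first. */
static struct node_model *model_take_all(struct list_model *l)
{
	return atomic_exchange(&l->first, (struct node_model *)NULL);
}
```

Because the detached chain is newest-first, a consumer that cares about completion order would have to reverse it before walking it.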
> + batch = this_cpu_ptr(&bio_complete_batch);
> + was_empty = bio_list_empty(&batch->list);
> + bio_list_add(&batch->list, bio);
> + local_unlock_irqrestore(&bio_complete_batch.lock, flags);
> +
> + if (was_empty)
> + mod_delayed_work_on(batch->cpu, bio_complete_wq,
> + &batch->work, 1);
> +}
> +EXPORT_SYMBOL_GPL(__bio_complete_in_task);
> +
> static inline bool bio_remaining_done(struct bio *bio)
> {
> /*
> @@ -1791,7 +1865,9 @@ void bio_endio(struct bio *bio)
> }
> #endif
>
> - if (bio->bi_end_io)
> + if (bio_flagged(bio, BIO_COMPLETE_IN_TASK) && bio_in_atomic())
> + __bio_complete_in_task(bio);
> + else if (bio->bi_end_io)
> bio->bi_end_io(bio);
> }
> EXPORT_SYMBOL(bio_endio);
> @@ -1977,6 +2053,51 @@ int bioset_init(struct bio_set *bs,
> }
> EXPORT_SYMBOL(bioset_init);
>
> +static int bio_complete_batch_cpu_online(unsigned int cpu)
> +{
> + enable_delayed_work(&per_cpu(bio_complete_batch, cpu).work);
> + return 0;
> +}
> +
Q: "Could a CPU hotplug rollback strand deferred bios indefinitely?
During the early teardown phase (CPUHP_AP_ONLINE_DYN),
bio_complete_batch_cpu_down_prep() disables the delayed work. However, the
CPU remains active and can continue completing bios. When these bios
complete, __bio_complete_in_task() adds them to the list with
was_empty = true, but mod_delayed_work_on() fails to queue the work
because it is disabled.
If the CPU offline operation is aborted, bio_complete_batch_cpu_online()
re-enables the delayed work but does not explicitly queue it. Because the
list is no longer empty, any future bio completions will see
was_empty == false and skip calling mod_delayed_work_on(), permanently
stranding the bios."
A: This is legit. The solution is to requeue the work if the list isn't
empty on CPU online.
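The stranding scenario and the requeue fix can be traced with a small user-space model. Everything below is illustrative: batch_model and the model_*() functions are hypothetical names, and the "requeue" parameter exists only so both behaviors can be compared.

```c
#include <stdbool.h>

/* Toy model of one CPU's batch; all names here are illustrative. */
struct batch_model {
	int  pending;      /* bios waiting on the list */
	bool work_enabled; /* false between down_prep and online */
	bool work_queued;  /* true once the worker is scheduled */
};

/* Models mod_delayed_work_on(): queuing fails while the work is disabled. */
static void model_queue_work(struct batch_model *b)
{
	if (b->work_enabled)
		b->work_queued = true;
}

/* Models __bio_complete_in_task(): only the first bio kicks the worker. */
static void model_complete(struct batch_model *b)
{
	bool was_empty = (b->pending == 0);

	b->pending++;
	if (was_empty)
		model_queue_work(b);
}

/* Models the teardown callback: the work is disabled and not pending. */
static void model_cpu_down_prep(struct batch_model *b)
{
	b->work_enabled = false;
	b->work_queued = false;
}

/* Models the online callback; 'requeue' selects the proposed fix. */
static void model_cpu_online(struct batch_model *b, bool requeue)
{
	b->work_enabled = true;
	if (requeue && b->pending > 0)
		model_queue_work(b);
}
```

With requeue set to false, a bio completed between down_prep and a rolled-back offline leaves pending nonzero and work_queued false, and later completions never queue the work because was_empty is false. With requeue set to true, the online callback queues the work itself.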
> +/*
> + * Disable this CPU's delayed work so that it cannot run on an unbound worker
> + * after the CPU is offlined.
> + */
> +static int bio_complete_batch_cpu_down_prep(unsigned int cpu)
> +{
> + disable_delayed_work_sync(&per_cpu(bio_complete_batch, cpu).work);
> + return 0;
> +}
> +
> +/*
> + * Drain a dead CPU's deferred bio completions. The CPU is dead and the worker
> + * is canceled so no locking is needed.
> + */
> +static int bio_complete_batch_cpu_dead(unsigned int cpu)
> +{
> + struct bio_complete_batch *batch =
> + per_cpu_ptr(&bio_complete_batch, cpu);
> + struct bio *bio;
> +
> + while ((bio = bio_list_pop(&batch->list)))
> + bio->bi_end_io(bio);
> +
Q: "Is it possible to hit the same NULL pointer dereference issue here if the
bio lacks a bi_end_io callback?"
A: Same as above.
> + return 0;
> +}
> +
> +static void __init bio_complete_batch_init(int cpu)
> +{
> + struct bio_complete_batch *batch =
> + per_cpu_ptr(&bio_complete_batch, cpu);
> +
> + bio_list_init(&batch->list);
> + INIT_DELAYED_WORK(&batch->work, bio_complete_work_fn);
> + batch->cpu = cpu;
> +
> + if (!cpu_online(cpu))
> + disable_delayed_work_sync(&batch->work);
> +}
> +
> static int __init init_bio(void)
> {
> int i;
> @@ -1991,6 +2112,30 @@ static int __init init_bio(void)
> SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> }
>
> + for_each_possible_cpu(i)
> + bio_complete_batch_init(i);
> +
> + bio_complete_wq = alloc_workqueue("bio_complete",
> + WQ_MEM_RECLAIM | WQ_PERCPU, 0);
Q: "Does creating this workqueue with WQ_MEM_RECLAIM break the local_lock_irq()
protection in bio_complete_work_fn()?
When a workqueue has WQ_MEM_RECLAIM, it spawns a global rescuer thread that
can execute per-CPU work items during worker pool congestion. This rescuer
thread executes unbound, meaning it could run on CPU B while processing
CPU A's work item.
Since local_lock operates strictly on the currently executing CPU, the
rescuer thread on CPU B would acquire CPU B's lock, while popping elements
from CPU A's list (derived via container_of()).
If an interrupt on CPU A concurrently calls __bio_complete_in_task(),
it will acquire CPU A's lock and modify the same list without mutual
exclusion, potentially causing list corruption."
A: The rescuer should run on the same CPU, not unbound, so this is not an
issue.
> + if (!bio_complete_wq)
> + panic("bio: can't allocate bio_complete workqueue\n");
> +
> + /*
> + * bio task-context completion draining on hot-unplugged CPUs:
> + *
> + * 1. Stop the per-CPU delayed work while the CPU is still online, so
> + * that it cannot run on an unbound worker later.
> + * 2. Drain leftover bios added between worker disabling and CPU
> + * offlining.
> + */
> + cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> + "block/bio:complete:online",
> + bio_complete_batch_cpu_online,
> + bio_complete_batch_cpu_down_prep);
> + cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
> + "block/bio:complete:dead",
> + NULL, bio_complete_batch_cpu_dead);
> +
> cpuhp_setup_state_multi(CPUHP_BIO_DEAD, "block/bio:dead", NULL,
> bio_cpu_dead);
>
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 97d747320b35..c0214d6c28d6 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -369,6 +369,38 @@ static inline struct bio *bio_alloc(struct block_device *bdev,
>
> void submit_bio(struct bio *bio);
>
> +/**
> + * bio_in_atomic - check if the current context is unsafe for bio completion
> + *
> + * Return: %true in atomic contexts (e.g. hard/soft IRQ, preempt-disabled);
> + * %false when a bio can be safely completed in the current context.
> + */
> +static inline bool bio_in_atomic(void)
> +{
> + if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
> + return true;
> + if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
> + return true;
Q: "Will this cause an infinite loop of bio offloading on kernels with
CONFIG_PREEMPT_COUNT disabled?
Because bio_in_atomic() unconditionally returns true without preempt count
support, a dynamic call to bio_complete_in_task() from within a bi_end_io()
callback will always offload the bio to the workqueue.
When the workqueue executes bio->bi_end_io(bio), the callback will evaluate
bio_complete_in_task() again, which will return true again, creating a
permanent offloading loop."
A: Legit issue. This can be solved by changing bio_complete_in_task() to:
	static inline bool bio_complete_in_task(struct bio *bio)
	{
		if (bio_flagged(bio, BIO_COMPLETE_IN_TASK))
			return false;
		if (!bio_in_atomic())
			return false;
		bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
		__bio_complete_in_task(bio);
		return true;
	}
We can use the BIO_COMPLETE_IN_TASK flag to indicate that the bio has already
been deferred to the workqueue and is safe to run.
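A user-space model of this flag check shows why the offload loop terminates. It is a sketch under assumptions: bio_model, the one-slot "deferred" pointer and the model_*() functions are hypothetical stand-ins, and model_in_atomic() always returns true to mimic the !CONFIG_PREEMPT_COUNT case from the question.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio. */
struct bio_model {
	bool complete_in_task; /* stands in for BIO_COMPLETE_IN_TASK */
	int  offloads;         /* how many times the bio was deferred */
	bool done;             /* set once the callback finished its work */
};

/* One-slot stand-in for the per-CPU deferred list. */
static struct bio_model *deferred;

/* Worst case from the question: always report atomic context. */
static bool model_in_atomic(void)
{
	return true;
}

/* Mirrors the proposed bio_complete_in_task(): defer at most once. */
static bool model_complete_in_task(struct bio_model *bio)
{
	if (bio->complete_in_task)
		return false; /* already deferred once, run inline */
	if (!model_in_atomic())
		return false;
	bio->complete_in_task = true;
	bio->offloads++;
	deferred = bio;
	return true;
}

/* A completion callback that asks to run in task context. */
static void model_end_io(struct bio_model *bio)
{
	if (model_complete_in_task(bio))
		return;
	bio->done = true;
}

/* Worker: reruns deferred callbacks; the bound would expose a loop. */
static int model_run_worker(int max_rounds)
{
	int rounds = 0;

	while (deferred && rounds < max_rounds) {
		struct bio_model *bio = deferred;

		deferred = NULL;
		model_end_io(bio);
		rounds++;
	}
	return rounds;
}
```

The first model_end_io() call defers the bio and sets the flag; when the worker reruns the callback, the flag check returns false and the callback completes inline, so the worker needs a single round.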
> + return !preemptible();
> +}
> +
> +void __bio_complete_in_task(struct bio *bio);
> +
> +/**
> + * bio_complete_in_task - ensure a bio is completed in preemptible task context
> + * @bio: bio to complete
> + *
> + * If called from non-task context, offload the bio completion to a worker
> + * thread and return %true. Else return %false and do nothing.
> + */
> +static inline bool bio_complete_in_task(struct bio *bio)
> +{
> + if (!bio_in_atomic())
> + return false;
> + __bio_complete_in_task(bio);
> + return true;
> +}
> +
> extern void bio_endio(struct bio *);
>
> static inline void bio_io_error(struct bio *bio)
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 8808ee76e73c..d49d97a050d0 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -322,6 +322,7 @@ enum {
> BIO_REMAPPED,
> BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
> BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
> + BIO_COMPLETE_IN_TASK, /* complete bi_end_io() in task context */
> BIO_FLAG_LAST
> };
>
>
^ permalink raw reply