* Re: [PATCH] blk-mq: reinsert cached request to the list
From: Keith Busch @ 2026-05-26 14:02 UTC (permalink / raw)
To: Ming Lei; +Cc: Keith Busch, linux-block, axboe, Christoph Hellwig
In-Reply-To: <ahT656Dazfz5oc8r@fedora>
On Mon, May 25, 2026 at 08:44:07PM -0500, Ming Lei wrote:
> On Mon, May 25, 2026 at 09:07:44AM -0700, Keith Busch wrote:
> > + rq_list_push(&plug->cached_rqs, rq);
>
> rq_list_add_head()?
Yes indeed. Serves me right for trying to squeeze this in over a
holiday. Thanks.
^ permalink raw reply
* [PATCH] block: blk-zoned: fix zwplug refcount leak on write error path
From: Wentao Liang @ 2026-05-26 14:18 UTC (permalink / raw)
To: Jens Axboe, Damien Le Moal
Cc: linux-block, linux-kernel, Wentao Liang, stable
blk_zone_wplug_handle_write() increments zwplug->ref via kref_get()
when preparing to handle a zone write. On the error path where
blk_zone_wplug_handle_write_noalloc() fails, the function returns
without calling kref_put() on zwplug->ref, leaking the reference.
Add kref_put(&zwplug->ref, ...) on the error path to properly release
the reference.
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
block/blk-zoned.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 42ef830054dc..24b899663a48 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1503,6 +1503,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
spin_unlock_irqrestore(&zwplug->lock, flags);
+ disk_put_zone_wplug(zwplug);
bio_io_error(bio);
return true;
}
@@ -1511,6 +1512,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
spin_unlock_irqrestore(&zwplug->lock, flags);
+ disk_put_zone_wplug(zwplug);
return false;
--
2.34.1
^ permalink raw reply related
* Re: [PATCH] block: partitions: replace __get_free_page() with kmalloc()
From: Matthew Wilcox @ 2026-05-26 14:37 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Mike Rapoport, Christoph Hellwig, Jens Axboe, linux-block,
linux-kernel, linux-mm
In-Reply-To: <1dea9df9-18c2-46f5-bf47-abb3f088574b@suse.com>
On Tue, May 26, 2026 at 02:07:36PM +0200, Vlastimil Babka wrote:
> The main reasons for switching AFAIU would be related with the
> folio/memdesc conversions? If one needs just a kernel memory buffer,
> kmalloc() it is, even if it happens to be page size. Page allocator
> should be only used if you need e.g. the refcounting or anything else
> that struct page provides. But then in some cases the memdesc conversion
> would need adjustments at some point. With kmalloc() we can forget about
> this user.
No, I think this is unrelated to memdescs.
I've seen a few people say slightly wrong things about
folios/pages/memdescs recently, so let me try to clarify the end state.
I do not intend to get rid of the ability to allocate a bare page of
memory with something like alloc_pages() or get_free_page(). It's
just that the struct page associated with it will contain far less
information (because it's smaller).
https://kernelnewbies.org/MatthewWilcox/Memdescs has a bit more
information, but to distill it:
You get a u64 worth of data (technically one per page, but if you
allocate multiple pages, they're all going to be the same).
Bits 0-3 will be type 0 (to indicate that it has no memdesc).
Bits 4-10 will be subtype 2 (to indicate no information about owner).
Bit 11 will be clear to indicate that this page should not be mappable
to userspace.
Bits 12-17 will store the allocation order.
The top few bits will encode zone/node/section like page->flags
do today.
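
As a rough illustration of the layout listed above, the following C sketch packs and unpacks those fields. All names here (bare_page_info, decode_bare_page, encode_bare_page) are hypothetical and are not kernel API; only the bit positions come from the list, and the zone/node/section bits are omitted because their width is not given here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the bit positions come from the description. */
struct bare_page_info {
	unsigned int type;	/* bits 0-3: 0 means "no memdesc" */
	unsigned int subtype;	/* bits 4-10: 2 means "no owner information" */
	bool mappable;		/* bit 11: set if mappable to userspace */
	unsigned int order;	/* bits 12-17: allocation order */
};

/* Unpack the low 18 bits of the per-page u64. */
static inline struct bare_page_info decode_bare_page(uint64_t v)
{
	struct bare_page_info info = {
		.type = (unsigned int)(v & 0xf),
		.subtype = (unsigned int)((v >> 4) & 0x7f),
		.mappable = ((v >> 11) & 1) != 0,
		.order = (unsigned int)((v >> 12) & 0x3f),
	};
	return info;
}

/* Build the value for a bare, unmappable allocation of the given order. */
static inline uint64_t encode_bare_page(unsigned int order)
{
	/* type 0, subtype 2, bit 11 clear, order in bits 12-17 */
	return ((uint64_t)2 << 4) | ((uint64_t)(order & 0x3f) << 12);
}
```

For example, encode_bare_page(3) yields 0x3020, which decodes back to type 0, subtype 2, not mappable, order 3.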
That doesn't leave many free bits for the user, but that's OK because
most allocations don't actually need any bits in struct page. If you do
want something like a refcount or list_head, see the "Managed memory"
section on that page. If you actually want a full-fat folio, well,
allocate a folio, not a page.
^ permalink raw reply
* Re: [PATCH] block: partitions: fix of_node refcount leak in of_partition()
From: Haris Iqbal @ 2026-05-26 14:59 UTC (permalink / raw)
To: Wentao Liang, Jens Axboe, stable
Cc: Josh Law, Kees Cook, linux-block, linux-kernel
In-Reply-To: <20260526102124.2283846-1-vulab@iscas.ac.cn>
On 5/26/26 12:21, Wentao Liang wrote:
> of_partition() calls of_node_get() on the parent device node at the
> beginning of the function, storing the reference in 'partitions_np'.
> This reference is leaked in two paths:
>
> 1. The compatibility check at the top of the function returns 0
> without releasing partitions_np when the node exists but is not
> "fixed-partitions" compatible.
>
> 2. The function returns 1 at the end after successfully processing
> all partitions without releasing partitions_np.
>
> Fix both leaks by adding of_node_put(partitions_np) on each path.
>
> Fixes: 2e3a191e89f9 ("block: add support for partition table defined in OF")
> Cc: stable@vger.kernel.org
> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Looks good:
Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev>
> ---
> block/partitions/of.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/block/partitions/of.c b/block/partitions/of.c
> index c22b60661098..53664ea06b65 100644
> --- a/block/partitions/of.c
> +++ b/block/partitions/of.c
> @@ -74,8 +74,10 @@ int of_partition(struct parsed_partitions *state)
> struct device_node *partitions_np = of_node_get(ddev->of_node);
>
> if (!partitions_np ||
> - !of_device_is_compatible(partitions_np, "fixed-partitions"))
> + !of_device_is_compatible(partitions_np, "fixed-partitions")) {
> + of_node_put(partitions_np);
> return 0;
> + }
>
> slot = 1;
> /* Validate parition offset and size */
> @@ -104,5 +106,6 @@ int of_partition(struct parsed_partitions *state)
>
> seq_buf_puts(&state->pp_buf, "\n");
>
> + of_node_put(partitions_np);
> return 1;
> }
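
The get/put pairing that the quoted patch restores can be sketched outside the kernel as follows. This is a toy model under stated assumptions: toy_node, toy_node_get(), toy_node_put() and toy_of_partition() are hypothetical stand-ins, not the real OF API, and the single exit label is just one way to guarantee that every return path drops the reference.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted device node. */
struct toy_node {
	int refcount;
	bool compatible;
};

static inline struct toy_node *toy_node_get(struct toy_node *n)
{
	if (n)
		n->refcount++;
	return n;
}

static inline void toy_node_put(struct toy_node *n)
{
	if (n)
		n->refcount--;
}

/* Returns 1 if partitions were parsed, 0 otherwise; never leaks a ref. */
static inline int toy_of_partition(struct toy_node *dev_node)
{
	struct toy_node *np = toy_node_get(dev_node);
	int ret = 0;

	if (!np || !np->compatible)
		goto out;	/* early exit still drops the reference */

	/* ... walk the child nodes and add partitions here ... */
	ret = 1;
out:
	toy_node_put(np);
	return ret;
}
```

After any call, the node's refcount is back at its starting value, whichever path was taken.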
^ permalink raw reply
* Re: [PATCH] blk-mq: reinsert cached request to the list
From: kernel test robot @ 2026-05-26 15:02 UTC (permalink / raw)
To: Keith Busch, linux-block, axboe
Cc: llvm, oe-kbuild-all, Keith Busch, Ming Lei, Christoph Hellwig
In-Reply-To: <20260525160744.896047-1-kbusch@meta.com>
Hi Keith,
kernel test robot noticed the following build errors:
[auto build test ERROR on axboe/for-next]
[also build test ERROR on next-20260525]
[cannot apply to linus/master v6.16-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Keith-Busch/blk-mq-reinsert-cached-request-to-the-list/20260526-000916
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git for-next
patch link: https://lore.kernel.org/r/20260525160744.896047-1-kbusch%40meta.com
patch subject: [PATCH] blk-mq: reinsert cached request to the list
config: x86_64-rhel-9.4-rust (https://download.01.org/0day-ci/archive/20260526/202605261716.TITwjvlB-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260526/202605261716.TITwjvlB-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605261716.TITwjvlB-lkp@intel.com/
All errors (new ones prefixed by >>):
>> block/blk-mq.c:3249:3: error: call to undeclared function 'rq_list_push'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
3249 | rq_list_push(&plug->cached_rqs, rq);
| ^
1 error generated.
vim +/rq_list_push +3249 block/blk-mq.c
3110
3111 /**
3112 * blk_mq_submit_bio - Create and send a request to block device.
3113 * @bio: Bio pointer.
3114 *
3115 * Builds up a request structure from @q and @bio and send to the device. The
3116 * request may not be queued directly to hardware if:
3117 * * This request can be merged with another one
3118 * * We want to place request at plug queue for possible future merging
3119 * * There is an IO scheduler active at this queue
3120 *
3121 * It will not queue the request if there is an error with the bio, or at the
3122 * request creation.
3123 */
3124 void blk_mq_submit_bio(struct bio *bio)
3125 {
3126 struct request_queue *q = bdev_get_queue(bio->bi_bdev);
3127 struct blk_plug *plug = current->plug;
3128 const int is_sync = op_is_sync(bio->bi_opf);
3129 unsigned int integrity_action;
3130 struct blk_mq_hw_ctx *hctx;
3131 unsigned int nr_segs;
3132 struct request *rq;
3133 blk_status_t ret;
3134
3135 /*
3136 * If the plug has a cached request for this queue, try to use it.
3137 */
3138 rq = blk_mq_get_cached_request(plug, q, bio->bi_opf);
3139
3140 /*
3141 * A BIO that was released from a zone write plug has already been
3142 * through the preparation in this function, already holds a reference
3143 * on the queue usage counter, and is the only write BIO in-flight for
3144 * the target zone. Go straight to preparing a request for it.
3145 */
3146 if (bio_zone_write_plugging(bio)) {
3147 nr_segs = bio->__bi_nr_segments;
3148 if (rq)
3149 blk_queue_exit(q);
3150 goto new_request;
3151 }
3152
3153 /*
3154 * The cached request already holds a q_usage_counter reference and we
3155 * don't have to acquire a new one if we use it.
3156 */
3157 if (!rq) {
3158 if (unlikely(bio_queue_enter(bio)))
3159 return;
3160 }
3161
3162 /*
3163 * Device reconfiguration may change logical block size or reduce the
3164 * number of poll queues, so the checks for alignment and poll support
3165 * have to be done with queue usage counter held.
3166 */
3167 if (unlikely(bio_unaligned(bio, q))) {
3168 bio_io_error(bio);
3169 goto queue_exit;
3170 }
3171
3172 if ((bio->bi_opf & REQ_POLLED) && !blk_mq_can_poll(q)) {
3173 bio->bi_status = BLK_STS_NOTSUPP;
3174 bio_endio(bio);
3175 goto queue_exit;
3176 }
3177
3178 bio = __bio_split_to_limits(bio, &q->limits, &nr_segs);
3179 if (!bio)
3180 goto queue_exit;
3181
3182 integrity_action = bio_integrity_action(bio);
3183 if (integrity_action)
3184 bio_integrity_prep(bio, integrity_action);
3185
3186 blk_mq_bio_issue_init(q, bio);
3187 if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
3188 goto queue_exit;
3189
3190 if (bio_needs_zone_write_plugging(bio)) {
3191 if (blk_zone_plug_bio(bio, nr_segs))
3192 goto queue_exit;
3193 }
3194
3195 new_request:
3196 if (rq) {
3197 rq_qos_throttle(rq->q, bio);
3198 blk_mq_rq_time_init(rq, blk_time_get_ns());
3199 rq->cmd_flags = bio->bi_opf;
3200 INIT_LIST_HEAD(&rq->queuelist);
3201 } else {
3202 rq = blk_mq_get_new_requests(q, plug, bio);
3203 if (unlikely(!rq)) {
3204 if (bio->bi_opf & REQ_NOWAIT)
3205 bio_wouldblock_error(bio);
3206 goto queue_exit;
3207 }
3208 }
3209
3210 trace_block_getrq(bio);
3211
3212 rq_qos_track(q, rq, bio);
3213
3214 blk_mq_bio_to_request(rq, bio, nr_segs);
3215
3216 ret = blk_crypto_rq_get_keyslot(rq);
3217 if (ret != BLK_STS_OK) {
3218 bio->bi_status = ret;
3219 bio_endio(bio);
3220 blk_mq_free_request(rq);
3221 return;
3222 }
3223
3224 if (bio_zone_write_plugging(bio))
3225 blk_zone_write_plug_init_request(rq);
3226
3227 if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
3228 return;
3229
3230 if (plug) {
3231 blk_add_rq_to_plug(plug, rq);
3232 return;
3233 }
3234
3235 hctx = rq->mq_hctx;
3236 if ((rq->rq_flags & RQF_USE_SCHED) ||
3237 (hctx->dispatch_busy && (q->nr_hw_queues == 1 || !is_sync))) {
3238 blk_mq_insert_request(rq, 0);
3239 blk_mq_run_hw_queue(hctx, true);
3240 } else {
3241 blk_mq_run_dispatch_ops(q, blk_mq_try_issue_directly(hctx, rq));
3242 }
3243 return;
3244
3245 queue_exit:
3246 if (!rq)
3247 blk_queue_exit(q);
3248 else
> 3249 rq_list_push(&plug->cached_rqs, rq);
3250 }
3251
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* Re: Observing higher CPU utilization during random IO fio testing
From: Wen Xiong @ 2026-05-26 15:28 UTC (permalink / raw)
To: yukuai; +Cc: Jens Axboe, linux-block, tom.leiming, jmoyer, Gjoyce, wenxiong
In-Reply-To: <043e357f-5b37-4e05-9433-271504fc1d30@fygo.io>
On 2026-05-25 00:28, Yu Kuai wrote:
> Hi,
>
> On 2026/5/22 5:52, Jens Axboe wrote:
>>> - More IO scheduler interaction, forces requests through scheduler
>>> instead of direct dispatch(direct dispatch to hardware queue)
> I don't understand this point. Can you explain more? I think plug
> should not matter if request go through scheduler or not.
My understanding is:
Random IO tests are more CPU intensive.
Plugging delays dispatching IOs directly to the hardware queue (the
quick path).
A plug batches multiple IO requests and defers submitting them until
blk_flush_plug() is called (dispatching to the hardware queue) or the
task gets scheduled out.
>
> And I assume you're testing raw disk, because filesystems should
> always enable plug.
>
Yes. FIO random IO tests over raw disks.
> Yes, perf data will be helpful. And please show your test in details
> and I'll
> check if I can reproduce it.
System config:
47 dedicated cores
120 GB memory
PCIe4 2-Port 64Gb FC Adapter
64Gb FC switch
FlashSystem: FS9500, 12 LUNs/FC port
Below is fio config for rwmixread=100:
[global]
randrepeat=0
buffered=0
direct=1
norandommap=1
group_reporting=1
size=80g
ioengine=libaio
rw=randrw
bs=4k
iodepth=1
rwmixread=100
runtime=600
ramp_time=5
time_based=1
numjobs=20
[job1]
filename=/dev/dm-2
[job2]
filename=/dev/dm-3
...
24 jobs in total.
We collected some perf data. What kind of perf data do you want? Let me
know.
Thanks,
Wendy
^ permalink raw reply
* [PATCHv2] blk-mq: reinsert cached request to the list
From: Keith Busch @ 2026-05-26 15:35 UTC (permalink / raw)
To: linux-block, axboe; +Cc: Keith Busch, Ming Lei, Christoph Hellwig
From: Keith Busch <kbusch@kernel.org>
A previous commit removed an optimization out of caution for a scenario
that turns out not to be real: all the "queue_exit" goto's are safe to
reinsert the request into the cached_rq's plug list as they are either
from a non-blocking path, or a successful merge that already holds the
queue reference. This optimization is most needed for small sequential
workloads that successfully merge into larger requests.
Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable")
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
v1->v2:
Actually use the correct rq_list function to return the rq to the
list.
block/blk-mq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 28c2d931e75ea..a24175441380e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3246,7 +3246,7 @@ void blk_mq_submit_bio(struct bio *bio)
if (!rq)
blk_queue_exit(q);
else
- blk_mq_free_request(rq);
+ rq_list_add_head(&plug->cached_rqs, rq);
}
#ifdef CONFIG_BLK_MQ_STACKING
--
2.53.0-Meta
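
The take-then-return pattern that this patch restores can be modelled with a small head/tail list. This is a toy sketch, not the kernel's implementation: toy_rq, toy_rq_list, toy_rq_list_add_head() and toy_rq_list_pop() are hypothetical names chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct request and struct rq_list. */
struct toy_rq {
	struct toy_rq *rq_next;
	int tag;
};

struct toy_rq_list {
	struct toy_rq *head;
	struct toy_rq *tail;
};

/* Put a request back at the front so the next lookup finds it first. */
static inline void toy_rq_list_add_head(struct toy_rq_list *rl,
					struct toy_rq *rq)
{
	rq->rq_next = rl->head;
	rl->head = rq;
	if (!rl->tail)
		rl->tail = rq;
}

/* Take the first cached request, or NULL if the list is empty. */
static inline struct toy_rq *toy_rq_list_pop(struct toy_rq_list *rl)
{
	struct toy_rq *rq = rl->head;

	if (rq) {
		rl->head = rq->rq_next;
		if (!rl->head)
			rl->tail = NULL;
		rq->rq_next = NULL;
	}
	return rq;
}
```

In this model, a submission that pops a cached request and then finds the bio was merged calls toy_rq_list_add_head() with the same request, leaving the cache exactly as it was instead of freeing the entry.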
^ permalink raw reply related
* [PATCHv2 0/2] block, nvme: enable passthrough iostats
From: Keith Busch @ 2026-05-26 15:39 UTC (permalink / raw)
To: linux-block, linux-nvme; +Cc: axboe, hch, nilay, Keith Busch
From: Keith Busch <kbusch@kernel.org>
v1->v2:
Split the block and nvme parts into separate patches.
Fixed up the nvme driver to ensure passthrough commands go through the
multipath start/end functions.
Have the helper function take a request_queue parameter since the
queue accumulating the stats for multipath isn't the same as the
request's queue.
Keith Busch (2):
block: export passthrough stats enabled
nvme: add support multipath passthrough iostats
block/blk-mq.c | 32 +-------------------------------
drivers/nvme/host/ioctl.c | 4 ++++
drivers/nvme/host/multipath.c | 5 ++++-
include/linux/blk-mq.h | 30 ++++++++++++++++++++++++++++++
4 files changed, 39 insertions(+), 32 deletions(-)
--
2.53.0-Meta
^ permalink raw reply
* [PATCHv2 1/2] block: export passthrough stats enabled
From: Keith Busch @ 2026-05-26 15:39 UTC (permalink / raw)
To: linux-block, linux-nvme; +Cc: axboe, hch, nilay, Keith Busch
In-Reply-To: <20260526153921.2402015-1-kbusch@meta.com>
From: Keith Busch <kbusch@kernel.org>
A user can enable io accounting for passthrough requests, so export the
helper that checks if the request should be tracked. This will enable
stacking drivers to report iostats for passthrough workloads.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
block/blk-mq.c | 32 +-------------------------------
include/linux/blk-mq.h | 30 ++++++++++++++++++++++++++++++
2 files changed, 31 insertions(+), 31 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 28c2d931e75ea..48115e1d9d6a8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1088,43 +1088,13 @@ static inline void blk_account_io_done(struct request *req, u64 now)
}
}
-static inline bool blk_rq_passthrough_stats(struct request *req)
-{
- struct bio *bio = req->bio;
-
- if (!blk_queue_passthrough_stat(req->q))
- return false;
-
- /* Requests without a bio do not transfer data. */
- if (!bio)
- return false;
-
- /*
- * Stats are accumulated in the bdev, so must have one attached to a
- * bio to track stats. Most drivers do not set the bdev for passthrough
- * requests, but nvme is one that will set it.
- */
- if (!bio->bi_bdev)
- return false;
-
- /*
- * We don't know what a passthrough command does, but we know the
- * payload size and data direction. Ensuring the size is aligned to the
- * block size filters out most commands with payloads that don't
- * represent sector access.
- */
- if (blk_rq_bytes(req) & (bdev_logical_block_size(bio->bi_bdev) - 1))
- return false;
- return true;
-}
-
static inline void blk_account_io_start(struct request *req)
{
trace_block_io_start(req);
if (!blk_queue_io_stat(req->q))
return;
- if (blk_rq_is_passthrough(req) && !blk_rq_passthrough_stats(req))
+ if (blk_rq_is_passthrough(req) && !blk_rq_passthrough_stats(req, req->q))
return;
req->rq_flags |= RQF_IO_STAT;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581d..25931a8076d2a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1243,4 +1243,34 @@ static inline int blk_rq_map_sg(struct request *rq, struct scatterlist *sglist)
}
void blk_dump_rq_flags(struct request *, char *);
+static inline bool blk_rq_passthrough_stats(struct request *req,
+ struct request_queue *q)
+{
+ struct bio *bio = req->bio;
+
+ if (!blk_queue_passthrough_stat(q))
+ return false;
+
+ /* Requests without a bio do not transfer data. */
+ if (!bio)
+ return false;
+
+ /*
+ * Stats are accumulated in the bdev, so must have one attached to a
+ * bio to track stats. Most drivers do not set the bdev for passthrough
+ * requests, but nvme is one that will set it.
+ */
+ if (!bio->bi_bdev)
+ return false;
+
+ /*
+ * We don't know what a passthrough command does, but we know the
+ * payload size and data direction. Ensuring the size is aligned to the
+ * block size filters out most commands with payloads that don't
+ * represent sector access.
+ */
+ if (blk_rq_bytes(req) & (bdev_logical_block_size(bio->bi_bdev) - 1))
+ return false;
+ return true;
+}
#endif /* BLK_MQ_H */
--
2.53.0-Meta
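
The size filter at the end of blk_rq_passthrough_stats() relies on the logical block size being a power of two, so a mask test is enough to check alignment. A minimal standalone sketch of that test, with a hypothetical helper name (payload_is_block_aligned) that does not exist in the kernel:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if 'bytes' is a whole number of logical
 * blocks. Assumes 'lbs' is a non-zero power of two (e.g. 512 or 4096),
 * so (lbs - 1) is a mask of the low-order bits.
 */
static inline bool payload_is_block_aligned(uint32_t bytes, uint32_t lbs)
{
	return (bytes & (lbs - 1)) == 0;
}
```

A 4096-byte payload passes for a 512-byte block size, while a 4100-byte payload does not and would be skipped for accounting under this rule.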
^ permalink raw reply related
* [PATCHv2 2/2] nvme: add support multipath passthrough iostats
From: Keith Busch @ 2026-05-26 15:39 UTC (permalink / raw)
To: linux-block, linux-nvme; +Cc: axboe, hch, nilay, Keith Busch
In-Reply-To: <20260526153921.2402015-1-kbusch@meta.com>
From: Keith Busch <kbusch@kernel.org>
Don't skip io accounting for passthrough commands if the user enabled
tracking these.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/nvme/host/ioctl.c | 4 ++++
drivers/nvme/host/multipath.c | 5 ++++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 08889b20e5d8c..38ca04567406a 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -102,8 +102,12 @@ static struct request *nvme_alloc_user_request(struct request_queue *q,
struct nvme_command *cmd, blk_opf_t rq_flags,
blk_mq_req_flags_t blk_flags)
{
+ struct nvme_ns *ns = q->queuedata;
struct request *req;
+ if (ns && nvme_ns_head_multipath(ns->head))
+ rq_flags |= REQ_NVME_MPATH;
+
req = blk_mq_alloc_request(q, nvme_req_op(cmd) | rq_flags, blk_flags);
if (IS_ERR(req))
return req;
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 263161cb8ac06..d0a95cde181c4 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -175,9 +175,12 @@ void nvme_mpath_start_request(struct request *rq)
nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
}
- if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq) ||
+ if (!blk_queue_io_stat(disk->queue) ||
(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
return;
+ if (blk_rq_is_passthrough(rq) &&
+ !blk_rq_passthrough_stats(rq, disk->queue))
+ return;
nvme_req(rq)->flags |= NVME_MPATH_IO_STATS;
nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0, req_op(rq),
--
2.53.0-Meta
^ permalink raw reply related
* Re: [PATCH] block: rename need_dispatch to cautious_dispatch in blk-mq sched
From: Jens Axboe @ 2026-05-26 15:54 UTC (permalink / raw)
To: Guixin Liu, Christoph Hellwig, Keith Busch
Cc: linux-block, xlpang, oliver.yang
In-Reply-To: <20260526131103.3105411-1-kanie@linux.alibaba.com>
On 5/26/26 7:11 AM, Guixin Liu wrote:
> The local boolean in __blk_mq_sched_dispatch_requests() decides whether
> to fall back to the per-ctx round-robin path (blk_mq_do_dispatch_ctx())
> instead of the batch flush path (blk_mq_flush_busy_ctxs()). The whole
> function is about dispatching anyway, so the name "need_dispatch" is
> not particularly informative and can mislead readers into thinking that
> a false value means "skip dispatching".
>
> Rename it to "cautious_dispatch" to match the comment right above the
> check ("dequeue request one by one from sw queue if queue is busy")
> and to convey the actual intent: take the cautious, fair, one-at-a-time
> path either when we just drained hctx->dispatch (so the device has
> recently pushed back) or when the dispatch_busy EWMA still indicates
> congestion. The fast batch path is only taken when neither signal
> suggests recent backpressure.
If we're going to do churn like that, it should at least be an
improvement. 'cautious_dispatch' tells the reader nothing about
what kind of behavior this modifies. 'piecemeal_dispatch' would
be much better, as it actually accurately describes what it
does.
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH v15 0/8] blk: honor isolcpus configuration
From: Daniel Wagner @ 2026-05-26 16:05 UTC (permalink / raw)
To: Aaron Tomlin
Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
akpm, maz, ruanjinjie, bigeasy, yphbchou0911, wagi, frederic,
longman, chenridong, hare, kch, ming.lei, tom.leiming, steve,
sean, chjohnst, neelx, mproche, nick.lange, marco.crivellari,
rishil1999, linux-block, linux-kernel
In-Reply-To: <20260521232956.553287-1-atomlin@atomlin.com>
Hi Aaron,
On Thu, May 21, 2026 at 07:29:48PM -0400, Aaron Tomlin wrote:
> Please let me know your thoughts.
>
>
> Changes since v14:
You’re moving fast with these updates! It’s great energy, but it’s
actually moving a bit faster than the review process can keep up with.
I’ve heard from some folks in the CC that they are waiting for a 'final'
version.
Is this latest version ready for a full, deep-dive review, or are there
still a few 'kinks' you’re looking to iron out first?
Thanks,
Daniel
^ permalink raw reply
* Re: [PATCH] block: partitions: fix of_node refcount leak in of_partition()
From: Jens Axboe @ 2026-05-26 16:36 UTC (permalink / raw)
To: stable, Wentao Liang; +Cc: Josh Law, Kees Cook, linux-block, linux-kernel
In-Reply-To: <20260526102124.2283846-1-vulab@iscas.ac.cn>
On Tue, 26 May 2026 10:21:24 +0000, Wentao Liang wrote:
> of_partition() calls of_node_get() on the parent device node at the
> beginning of the function, storing the reference in 'partitions_np'.
> This reference is leaked in two paths:
>
> 1. The compatibility check at the top of the function returns 0
> without releasing partitions_np when the node exists but is not
> "fixed-partitions" compatible.
>
> [...]
Applied, thanks!
[1/1] block: partitions: fix of_node refcount leak in of_partition()
commit: 148cd4873115feb266c002d4d4618ea7f14342d9
Best regards,
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH] block: blk-mq: fix ws_active refcount leak in blk_mq_mark_tag_wait()
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
To: Wentao Liang; +Cc: linux-block, linux-kernel, stable
In-Reply-To: <20260526103722.2287587-1-vulab@iscas.ac.cn>
On Tue, 26 May 2026 10:37:22 +0000, Wentao Liang wrote:
> blk_mq_mark_tag_wait() calls sbitmap_queue_get() which increments
> sbq->ws_active. On the error path where the waitqueue_active() check
> fails and the function returns early, sbq->ws_active is not decremented,
> leaking the reference.
>
> Fix this by calling sbitmap_queue_clear() to properly release the
> ws_active reference before returning on the error path.
>
> [...]
Applied, thanks!
[1/1] block: blk-mq: fix ws_active refcount leak in blk_mq_mark_tag_wait()
commit: 94028f339610f5d39d101449dc27156aea03b3cb
Best regards,
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH v9] blk-mq: add tracepoint block_rq_tag_wait
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, Aaron Tomlin
Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
john.g.garry, loberman, neelx, sean, mproche, chjohnst,
linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <20260525005123.722277-1-atomlin@atomlin.com>
On Sun, 24 May 2026 20:51:23 -0400, Aaron Tomlin wrote:
> In high-performance storage environments, particularly when utilising
> RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> latency spikes can occur when fast devices (SSDs) are starved of hardware
> tags when sharing the same blk_mq_tag_set.
>
> Currently, diagnosing this specific hardware queue contention is
> difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
> forces the current thread to block uninterruptible via io_schedule().
> While this can be inferred via sched:sched_switch or dynamically
> traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
> dedicated, out-of-the-box observability for this event.
>
> [...]
Applied, thanks!
[1/1] blk-mq: add tracepoint block_rq_tag_wait
commit: 9ece10778f8931630f86e802f94dc71115de0c8c
Best regards,
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH] block: skip sync_blockdev() on surprise removal in bdev_mark_dead()
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
To: Chao Shi
Cc: Christoph Hellwig, Christian Brauner, Josef Bacik, linux-block,
linux-kernel, Sungwoo Kim, Dave Tian, Weidong Zhu
In-Reply-To: <20260522220025.1770388-1-coshi036@gmail.com>
On Fri, 22 May 2026 18:00:25 -0400, Chao Shi wrote:
> bdev_mark_dead()'s @surprise == true means the device is already gone.
> The filesystem callback fs_bdev_mark_dead() honours this and skips
> sync_filesystem(), but the bare block device path (no ->mark_dead op)
> lost its !surprise guard when the holder ->mark_dead callback was wired
> up (see Fixes), and now calls sync_blockdev() unconditionally, which can
> hang forever waiting on writeback that can no longer complete.
>
> [...]
Applied, thanks!
[1/1] block: skip sync_blockdev() on surprise removal in bdev_mark_dead()
commit: 304f384f34af98a205086ce67331cad4fea6504d
Best regards,
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH v1] block: switch numa_node to int in blk_mq_hw_ctx and init_request
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
To: Mateusz Nowicki
Cc: Caleb Sander Mateos, Sung-woo Kim, Josef Bacik, Alasdair Kergon,
Mike Snitzer, Mikulas Patocka, Benjamin Marzinski, Ulf Hansson,
Richard Weinberger, Zhihao Cheng, Miquel Raynal,
Vignesh Raghavendra, Sven Peter, Janne Grunau, Neal Gompa,
Keith Busch, Christoph Hellwig, Sagi Grimberg, Justin Tee,
Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
James E.J. Bottomley, Martin K. Petersen, Thomas Fourier, Al Viro,
Luke Wang, Kees Cook, linux-block, linux-kernel, nbd, dm-devel,
linux-mmc, linux-mtd, asahi, linux-arm-kernel, linux-nvme,
linux-scsi
In-Reply-To: <20260523125210.272274-1-mateusz.nowicki@posteo.net>
On Sat, 23 May 2026 12:52:35 +0000, Mateusz Nowicki wrote:
> numa_node in blk_mq_hw_ctx and the matching argument of
> blk_mq_ops::init_request can be NUMA_NO_NODE (-1). Declared as
> unsigned int, NUMA_NO_NODE becomes UINT_MAX and walks off
> nvme_dev::descriptor_pools[] on CONFIG_NUMA=n [1].
>
> Switch the field and the callback prototype to int and update all
> in-tree init_request implementations. No functional change:
> cpu_to_node(), kmalloc_node() and blk_alloc_flush_queue() already
> take int.
>
> [...]
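
The signedness problem described in the quoted text can be reproduced in plain C. This is a sketch under assumptions: store_node_unsigned() and node_is_valid_index() are hypothetical helpers, and NUMA_NO_NODE is redefined locally with the value (-1) stated above.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)	/* value as stated in the quoted text */

/* Models storing a node id in an 'unsigned int' field: -1 wraps around. */
static inline unsigned int store_node_unsigned(int node)
{
	return (unsigned int)node;
}

/* With a signed id, the "no node" case is easy to reject before indexing. */
static inline bool node_is_valid_index(int node, size_t nr_entries)
{
	return node >= 0 && (size_t)node < nr_entries;
}
```

Here store_node_unsigned(NUMA_NO_NODE) equals UINT_MAX, which is the out-of-range array index the patch description refers to, while the signed check rejects NUMA_NO_NODE outright.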
Applied, thanks!
[1/1] block: switch numa_node to int in blk_mq_hw_ctx and init_request
commit: 65e1c8f96ad1a1f3b72e8a91d1341d570f91d985
Best regards,
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH] block: Avoid mounting the bdev pseudo-filesystem in userspace
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
To: Denis Arefev; +Cc: linux-block, linux-kernel, lvc-project, stable
In-Reply-To: <20260521072857.5078-1-arefev@swemel.ru>
On Thu, 21 May 2026 10:28:56 +0300, Denis Arefev wrote:
> The bdev pseudo-filesystem is an internal kernel filesystem with which
> userspace should not interfere. Unregister it so that userspace cannot
> even attempt to mount it.
>
> This fixes a bug [1] that occurs when attempting to access files,
> because the system call move_mount() uses pointers declared in the
> inode_operations structure, which for the bdev pseudo-filesystem
> are always equal to 0. `inode->i_op = &empty_iops;`
>
> [...]
Applied, thanks!
[1/1] block: Avoid mounting the bdev pseudo-filesystem in userspace
commit: b518ae170f6c411cac2d5f320278c27d902bc628
Best regards,
--
Jens Axboe
^ permalink raw reply
* Re: REQ_NOAIT cleanups
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-block
In-Reply-To: <20260518063336.507369-1-hch@lst.de>
On Mon, 18 May 2026 08:33:28 +0200, Christoph Hellwig wrote:
> this series cleans up spurious code related to REQ_NOWAIT handling.
>
> I have block layer work depending on this pending, so merging it through
> the block tree would be helpful.
>
> Diffstat:
> fs/direct-io.c | 15 ++++-----------
> include/linux/bio.h | 1 -
> 2 files changed, 4 insertions(+), 12 deletions(-)
>
> [...]
Applied, thanks!
[1/2] direct-io: remove IOCB_NOWAIT support
commit: ef9049ec8b9fd6c508832d9f7ab12029f3355102
[2/2] block: don't set BIO_QUIET for BLK_STS_AGAIN
commit: 481105a949c8d11f7aa770b45fc4c8efcc53f205
Best regards,
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH v1] mtip32xx: fix use-after-free on service thread failure
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
To: Yuho Choi
Cc: Thomas Fourier, Martin K . Petersen, Andy Shevchenko, Al Viro,
linux-block, linux-kernel
In-Reply-To: <20260525162531.1406677-1-dbgh9129@gmail.com>
On Mon, 25 May 2026 12:25:31 -0400, Yuho Choi wrote:
> If service thread creation fails after device_add_disk() succeeds,
> mtip_block_initialize() calls del_gendisk() and then falls through to
> put_disk(). Since mtip32xx uses .free_disk to free struct driver_data,
> put_disk() can release dd on the added-disk path.
>
> The same unwind then continues to use dd for blk_mq_free_tag_set() and
> mtip_hw_exit(), and mtip_pci_probe() can later free dd again. This can
> cause a use-after-free and double free.
>
> [...]
Applied, thanks!
[1/1] mtip32xx: fix use-after-free on service thread failure
commit: 6b24446bee489e90f7ea843fbc0473393c73cbf9
Best regards,
--
Jens Axboe
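
The ordering problem described in the quoted commit message can be modeled in userspace. This is a minimal sketch, not the applied patch: struct fake_disk, fake_put_disk(), fake_free_disk() and unwind_added_disk() are hypothetical stand-ins for the gendisk reference, put_disk(), the driver's .free_disk callback and the probe unwind. It assumes only what the description states, namely that the final put releases the driver data, so every teardown step that touches dd has to run before that put.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the real mtip32xx or block types. */
struct driver_data {
	bool tags_freed;
	bool hw_exited;
};

struct fake_disk {
	int refcount;
	struct driver_data *dd;	/* owned by the disk once it is "added" */
	void (*free_disk)(struct fake_disk *disk);
};

static int dd_frees;	/* how many times the driver data was released */

/* Models a .free_disk callback that frees the driver data. */
static void fake_free_disk(struct fake_disk *disk)
{
	free(disk->dd);
	disk->dd = NULL;
	dd_frees++;
}

/* Models put_disk(): the last reference runs the release callback. */
static void fake_put_disk(struct fake_disk *disk)
{
	if (--disk->refcount == 0) {
		disk->free_disk(disk);
		free(disk);
	}
}

/*
 * Unwind for the "disk was added" case: finish everything that touches
 * dd first, then drop the last reference. Returns the number of times
 * dd was freed, or -1 on allocation failure.
 */
static int unwind_added_disk(void)
{
	struct fake_disk *disk = calloc(1, sizeof(*disk));
	struct driver_data *dd = calloc(1, sizeof(*dd));
	bool done;

	if (!disk || !dd) {
		free(disk);
		free(dd);
		return -1;
	}
	disk->refcount = 1;
	disk->dd = dd;
	disk->free_disk = fake_free_disk;
	dd_frees = 0;

	dd->tags_freed = true;	/* stands in for the tag set teardown */
	dd->hw_exited = true;	/* stands in for the hardware teardown */
	done = dd->tags_freed && dd->hw_exited;

	fake_put_disk(disk);	/* dd and disk must not be used after this */
	return done ? dd_frees : -1;
}
```

In this model the caller must not free dd again once unwind_added_disk() has returned, which mirrors the double free the description attributes to mtip_pci_probe().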
^ permalink raw reply
* Re: [PATCH] block: remove blkdev_write_begin() and blkdev_write_end()
From: Jens Axboe @ 2026-05-26 16:42 UTC (permalink / raw)
To: Christoph Hellwig, Tal Zussman; +Cc: linux-block, linux-kernel
In-Reply-To: <20260525-blk-write-cleanup-v1-1-391c073e3831@columbia.edu>
On Mon, 25 May 2026 14:25:55 -0400, Tal Zussman wrote:
> Remove blkdev_write_begin(), blkdev_write_end(), and their entries in
> def_blk_aops. These have been unreachable since commit 487c607df790
> ("block: use iomap for writes to block devices") switched block device
> buffered writes from generic_perform_write() to
> iomap_file_buffered_write(), which bypasses aops->write_begin/end.
>
>
> [...]
Applied, thanks!
[1/1] block: remove blkdev_write_begin() and blkdev_write_end()
(no commit info)
Best regards,
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH] block: blk-mq: fix ws_active refcount leak in blk_mq_mark_tag_wait()
From: Bart Van Assche @ 2026-05-26 16:58 UTC (permalink / raw)
To: Wentao Liang, Jens Axboe; +Cc: linux-block, linux-kernel, stable
In-Reply-To: <20260526103722.2287587-1-vulab@iscas.ac.cn>
On 5/26/26 3:37 AM, Wentao Liang wrote:
> blk_mq_mark_tag_wait() calls sbitmap_queue_get()
I don't see any sbitmap_queue_get() calls in blk_mq_mark_tag_wait().
Additionally, I don't see any other code above the modified code in
blk_mq_mark_tag_wait() that modifies sbq->ws_active directly or
indirectly. What am I missing?
> Fix this by calling sbitmap_queue_clear() to properly release the
> ws_active reference before returning on the error path.
This patch doesn't add a sbitmap_queue_clear() call. It seems like
there is a mismatch between the patch description and the code changes?
Bart.
^ permalink raw reply
* Re: [PATCH] block: blk-mq: fix ws_active refcount leak in blk_mq_mark_tag_wait()
From: Keith Busch @ 2026-05-26 17:05 UTC (permalink / raw)
To: Wentao Liang; +Cc: Jens Axboe, linux-block, linux-kernel, stable
In-Reply-To: <20260526103722.2287587-1-vulab@iscas.ac.cn>
On Tue, May 26, 2026 at 10:37:22AM +0000, Wentao Liang wrote:
> blk_mq_mark_tag_wait() calls sbitmap_queue_get() which increments
> sbq->ws_active. On the error path where the waitqueue_active() check
> fails and the function returns early, sbq->ws_active is not decremented,
> leaking the reference.
I must be confused as I'm not making sense of this. Not only does
blk_mq_mark_tag_wait not call sbitmap_queue_get, sbitmap_queue_get does
not increment sbq->ws_active either. Could you clarify the actual
sequence?
> Fix this by calling sbitmap_queue_clear() to properly release the
> ws_active reference before returning on the error path.
And same here, I don't see sbitmap_queue_clear() called anywhere in this
path, nor does sbitmap_queue_clear() release ws_active anyway. What is
the actual sequence that gets there?
> Fixes: c27d53fb445f ("blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
> ---
> block/blk-mq.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d0c37daf568f..e1c2ac416693 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1952,6 +1952,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
> spin_lock_irq(&wq->lock);
> spin_lock(&hctx->dispatch_wait_lock);
> if (!list_empty(&wait->entry)) {
> + list_del_init(&wait->entry);
> + atomic_dec(&sbq->ws_active);
As far as I can tell, sbq->ws_active is incremented from three places:
- blk_mq_mark_tag_wait() itself, but just below this line. So your
change decrements before the local increment happened, no?
- sbitmap_prepare_to_wait() / sbitmap_add_wait_queue() in
lib/sbitmap.c, which are unrelated helpers not used here
What am I missing?
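
To make the ordering question concrete, here is a small userspace model, written under the assumption stated above that the counter is only incremented after the "already queued" check. struct fake_sbq, struct fake_wait, mark_wait() and finish_wait() are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model; not the kernel's sbitmap_queue or wait entry. */
struct fake_sbq {
	atomic_int ws_active;
};

struct fake_wait {
	bool queued;
};

/*
 * If the waiter is already queued, return early. Nothing was
 * incremented on this call, so there is nothing to undo here.
 */
static bool mark_wait(struct fake_sbq *sbq, struct fake_wait *wait)
{
	if (wait->queued)
		return false;
	atomic_fetch_add(&sbq->ws_active, 1);
	wait->queued = true;
	return true;
}

/* The single matching decrement, taken when the waiter is removed. */
static void finish_wait(struct fake_sbq *sbq, struct fake_wait *wait)
{
	if (wait->queued) {
		wait->queued = false;
		atomic_fetch_sub(&sbq->ws_active, 1);
	}
}

/* Returns the final counter value after a repeated mark and one finish. */
static int demo_balanced(void)
{
	struct fake_sbq sbq;
	struct fake_wait wait = { .queued = false };

	atomic_init(&sbq.ws_active, 0);
	mark_wait(&sbq, &wait);		/* 0 -> 1 */
	mark_wait(&sbq, &wait);		/* early return, stays at 1 */
	finish_wait(&sbq, &wait);	/* 1 -> 0 */
	return atomic_load(&sbq.ws_active);
}
```

With an extra decrement added to the early-return branch of mark_wait(), the same sequence would end at -1 in this model rather than 0.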
^ permalink raw reply
* Re: [PATCHv2] blk-mq: reinsert cached request to the list
From: Chaitanya Kulkarni @ 2026-05-26 17:06 UTC (permalink / raw)
To: Keith Busch, linux-block@vger.kernel.org, axboe@kernel.dk
Cc: Keith Busch, Ming Lei, Christoph Hellwig
In-Reply-To: <20260526153531.2365935-1-kbusch@meta.com>
On 5/26/26 08:35, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> A previous commit removed an optimization out of caution for a scenario
> that turns out not to be real: all the "queue_exit" goto's are safe to
> reinsert the request into the cached_rq's plug list as they are either
> from a non-blocking path, or a successful merge that already holds the
> queue reference. This optimization is most needed for small sequential
> workloads that successfully merge into larger requests.
>
> Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable")
> Suggested-by: Ming Lei <tom.leiming@gmail.com>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
Looks good.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
^ permalink raw reply
* Re: [PATCH V4 0/3] md/nvme: Enable PCI P2PDMA support for RAID0 and NVMe Multipath
From: Chaitanya Kulkarni @ 2026-05-26 17:09 UTC (permalink / raw)
To: axboe@kernel.dk
Cc: song@kernel.org, yukuai@fnnas.com, Christoph Hellwig,
linan122@huawei.com, kbusch@kernel.org, sagi@grimberg.me,
linux-block@vger.kernel.org, linux-raid@vger.kernel.org,
linux-nvme@lists.infradead.org, Kiran Modukuri
In-Reply-To: <4ed83782-04cf-45b5-93a0-05a08e61b82e@nvidia.com>
Jens,
On 5/19/26 17:11, Chaitanya Kulkarni wrote:
> Jens,
>
>
> On 5/14/26 9:35 PM, Christoph Hellwig wrote:
>> Still looks good to me as per the reviews.
>>
> If there is no objection, can we merge this?
>
> -Chaitanya
>
>
I have outstanding work based on this series that I want to send out.
Could you please merge this patch series?
-ck
^ permalink raw reply