Linux block layer

* Re: [PATCH] blk-mq: reinsert cached request to the list
From: Keith Busch @ 2026-05-26 14:02 UTC (permalink / raw)
  To: Ming Lei; +Cc: Keith Busch, linux-block, axboe, Christoph Hellwig
In-Reply-To: <ahT656Dazfz5oc8r@fedora>

On Mon, May 25, 2026 at 08:44:07PM -0500, Ming Lei wrote:
> On Mon, May 25, 2026 at 09:07:44AM -0700, Keith Busch wrote:
> > +		rq_list_push(&plug->cached_rqs, rq);
> 
> rq_list_add_head()?

Yes indeed. Serves me right for trying to squeeze this in over a
holiday. Thanks.

* [PATCH] block: blk-zoned: fix zwplug refcount leak on write error path
From: Wentao Liang @ 2026-05-26 14:18 UTC (permalink / raw)
  To: Jens Axboe, Damien Le Moal
  Cc: linux-block, linux-kernel, Wentao Liang, stable

blk_zone_wplug_handle_write() increments zwplug->ref via kref_get()
when preparing to handle a zone write. On the error path where
blk_zone_wplug_handle_write_noalloc() fails, the function returns
without calling kref_put() on zwplug->ref, leaking the reference.

Add kref_put(&zwplug->ref, ...) on the error path to properly release
the reference.

Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
 block/blk-zoned.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 42ef830054dc..24b899663a48 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1503,6 +1503,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 
 	if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
 		spin_unlock_irqrestore(&zwplug->lock, flags);
+		disk_put_zone_wplug(zwplug);
 		bio_io_error(bio);
 		return true;
 	}
@@ -1511,6 +1512,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
 
 	spin_unlock_irqrestore(&zwplug->lock, flags);
+	disk_put_zone_wplug(zwplug);
 
 	return false;
 
-- 
2.34.1
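
The rule this patch relies on (every exit path after taking a reference must drop it) can be sketched in userspace. This is only a toy model: `struct obj`, `obj_get()`, `obj_put()` and `handle_write()` are hypothetical names, not the kernel's zwplug API, and the counter is a plain int rather than a real refcount type.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a refcounted object; not the kernel's zwplug. */
struct obj {
	int ref;
};

static void obj_get(struct obj *o)
{
	o->ref++;
}

static void obj_put(struct obj *o)
{
	assert(o->ref > 0);	/* catch an unbalanced put early */
	o->ref--;
}

/*
 * Takes a temporary reference, then either fails (returns true, mirroring
 * the "bio was handled" return in the diff) or succeeds (returns false).
 * Both paths drop the temporary reference before returning.
 */
static bool handle_write(struct obj *o, bool prepare_ok)
{
	obj_get(o);

	if (!prepare_ok) {
		obj_put(o);	/* without this, the error path leaks a ref */
		return true;
	}

	/* ... work that needs the reference would go here ... */

	obj_put(o);
	return false;
}
```

In this model the caller's reference count is unchanged after either outcome, which is the property the two added put calls are meant to restore.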


* Re: [PATCH] block: partitions: replace __get_free_page() with kmalloc()
From: Matthew Wilcox @ 2026-05-26 14:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mike Rapoport, Christoph Hellwig, Jens Axboe, linux-block,
	linux-kernel, linux-mm
In-Reply-To: <1dea9df9-18c2-46f5-bf47-abb3f088574b@suse.com>

On Tue, May 26, 2026 at 02:07:36PM +0200, Vlastimil Babka wrote:
> The main reasons for switching AFAIU would be related with the
> folio/memdesc conversions? If one needs just a kernel memory buffer,
> kmalloc() it is, even if it happens to be page size. Page allocator
> should be only used if you need e.g. the refcounting or anything else
> that struct page provides. But then in some cases the memdesc conversion
> would need adjustments at some point. With kmalloc() we can forget about
> this user.

No, I think this is unrelated to memdescs.

I've seen a few people say slightly wrong things about
folios/pages/memdescs recently, so let me try to clarify the end state.

I do not intend to get rid of the ability to allocate a bare page of
memory with something like alloc_pages() or get_free_page().  It's
just that the struct page associated with it will contain far less
information (because it's smaller).

https://kernelnewbies.org/MatthewWilcox/Memdescs has a bit more
information, but to distill it:

You get a u64 worth of data (technically one per page, but if you
allocate multiple pages, they're all going to be the same).
Bits 0-3 will be type 0 (to indicate that it has no memdesc).  
Bits 4-10 will be subtype 2 (to indicate no information about owner).
Bit 11 will be clear to indicate that this page should not be mappable
to userspace.
Bits 12-17 will store the allocation order.
The top few bits will encode zone/node/section like page->flags
do today.

That doesn't leave many free bits for the user, but that's OK because
most allocations don't actually need any bits in struct page.  If you do
want something like a refcount or list_head, see the "Managed memory"
section on that page.  If you actually want a full-fat folio, well,
allocate a folio, not a page.
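
To make the bit layout above concrete, here is a small userspace sketch that packs and unpacks only the fields listed (type, subtype, the userspace-mappable bit, and order). All macro and function names are hypothetical, the zone/node/section bits are left out, and treating a set bit 11 as "mappable" is an inference from the statement that it is clear for non-mappable pages.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; positions come only from the bit ranges listed above. */
#define MD_TYPE_SHIFT		0
#define MD_TYPE_MASK		0xfULL	/* bits 0-3 */
#define MD_SUBTYPE_SHIFT	4
#define MD_SUBTYPE_MASK		0x7fULL	/* bits 4-10 */
#define MD_MAPPABLE_SHIFT	11	/* bit 11, assumed set == mappable */
#define MD_ORDER_SHIFT		12
#define MD_ORDER_MASK		0x3fULL	/* bits 12-17 */

static inline uint64_t md_pack(unsigned int type, unsigned int subtype,
			       int mappable, unsigned int order)
{
	return ((uint64_t)(type & MD_TYPE_MASK) << MD_TYPE_SHIFT) |
	       ((uint64_t)(subtype & MD_SUBTYPE_MASK) << MD_SUBTYPE_SHIFT) |
	       ((uint64_t)(mappable ? 1 : 0) << MD_MAPPABLE_SHIFT) |
	       ((uint64_t)(order & MD_ORDER_MASK) << MD_ORDER_SHIFT);
}

static inline unsigned int md_type(uint64_t md)
{
	return (unsigned int)((md >> MD_TYPE_SHIFT) & MD_TYPE_MASK);
}

static inline unsigned int md_subtype(uint64_t md)
{
	return (unsigned int)((md >> MD_SUBTYPE_SHIFT) & MD_SUBTYPE_MASK);
}

static inline int md_user_mappable(uint64_t md)
{
	return (int)((md >> MD_MAPPABLE_SHIFT) & 1);
}

static inline unsigned int md_order(uint64_t md)
{
	return (unsigned int)((md >> MD_ORDER_SHIFT) & MD_ORDER_MASK);
}
```

For a bare order-3 allocation as described (type 0, subtype 2, not mappable), `md_pack(0, 2, 0, 3)` yields a value whose low 18 bits carry just the subtype and the order.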

* Re: [PATCH] block: partitions: fix of_node refcount leak in of_partition()
From: Haris Iqbal @ 2026-05-26 14:59 UTC (permalink / raw)
  To: Wentao Liang, Jens Axboe, stable
  Cc: Josh Law, Kees Cook, linux-block, linux-kernel
In-Reply-To: <20260526102124.2283846-1-vulab@iscas.ac.cn>



On 5/26/26 12:21, Wentao Liang wrote:
> of_partition() calls of_node_get() on the parent device node at the
> beginning of the function, storing the reference in 'partitions_np'.
> This reference is leaked in two paths:
> 
> 1. The compatibility check at the top of the function returns 0
>     without releasing partitions_np when the node exists but is not
>     "fixed-partitions" compatible.
> 
> 2. The function returns 1 at the end after successfully processing
>     all partitions without releasing partitions_np.
> 
> Fix both leaks by adding of_node_put(partitions_np) on each path.
> 
> Fixes: 2e3a191e89f9 ("block: add support for partition table defined in OF")
> Cc: stable@vger.kernel.org
> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>

Looks good:

Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev>

> ---
>   block/partitions/of.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/block/partitions/of.c b/block/partitions/of.c
> index c22b60661098..53664ea06b65 100644
> --- a/block/partitions/of.c
> +++ b/block/partitions/of.c
> @@ -74,8 +74,10 @@ int of_partition(struct parsed_partitions *state)
>   	struct device_node *partitions_np = of_node_get(ddev->of_node);
>   
>   	if (!partitions_np ||
> -	    !of_device_is_compatible(partitions_np, "fixed-partitions"))
> +	    !of_device_is_compatible(partitions_np, "fixed-partitions")) {
> +		of_node_put(partitions_np);
>   		return 0;
> +	}
>   
>   	slot = 1;
>   	/* Validate parition offset and size */
> @@ -104,5 +106,6 @@ int of_partition(struct parsed_partitions *state)
>   
>   	seq_buf_puts(&state->pp_buf, "\n");
>   
> +	of_node_put(partitions_np);
>   	return 1;
>   }


* Re: [PATCH] blk-mq: reinsert cached request to the list
From: kernel test robot @ 2026-05-26 15:02 UTC (permalink / raw)
  To: Keith Busch, linux-block, axboe
  Cc: llvm, oe-kbuild-all, Keith Busch, Ming Lei, Christoph Hellwig
In-Reply-To: <20260525160744.896047-1-kbusch@meta.com>

Hi Keith,

kernel test robot noticed the following build errors:

[auto build test ERROR on axboe/for-next]
[also build test ERROR on next-20260525]
[cannot apply to linus/master v6.16-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Keith-Busch/blk-mq-reinsert-cached-request-to-the-list/20260526-000916
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git for-next
patch link:    https://lore.kernel.org/r/20260525160744.896047-1-kbusch%40meta.com
patch subject: [PATCH] blk-mq: reinsert cached request to the list
config: x86_64-rhel-9.4-rust (https://download.01.org/0day-ci/archive/20260526/202605261716.TITwjvlB-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260526/202605261716.TITwjvlB-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605261716.TITwjvlB-lkp@intel.com/

All errors (new ones prefixed by >>):

>> block/blk-mq.c:3249:3: error: call to undeclared function 'rq_list_push'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    3249 |                 rq_list_push(&plug->cached_rqs, rq);
         |                 ^
   1 error generated.


vim +/rq_list_push +3249 block/blk-mq.c

  3110	
  3111	/**
  3112	 * blk_mq_submit_bio - Create and send a request to block device.
  3113	 * @bio: Bio pointer.
  3114	 *
  3115	 * Builds up a request structure from @q and @bio and send to the device. The
  3116	 * request may not be queued directly to hardware if:
  3117	 * * This request can be merged with another one
  3118	 * * We want to place request at plug queue for possible future merging
  3119	 * * There is an IO scheduler active at this queue
  3120	 *
  3121	 * It will not queue the request if there is an error with the bio, or at the
  3122	 * request creation.
  3123	 */
  3124	void blk_mq_submit_bio(struct bio *bio)
  3125	{
  3126		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
  3127		struct blk_plug *plug = current->plug;
  3128		const int is_sync = op_is_sync(bio->bi_opf);
  3129		unsigned int integrity_action;
  3130		struct blk_mq_hw_ctx *hctx;
  3131		unsigned int nr_segs;
  3132		struct request *rq;
  3133		blk_status_t ret;
  3134	
  3135		/*
  3136		 * If the plug has a cached request for this queue, try to use it.
  3137		 */
  3138		rq = blk_mq_get_cached_request(plug, q, bio->bi_opf);
  3139	
  3140		/*
  3141		 * A BIO that was released from a zone write plug has already been
  3142		 * through the preparation in this function, already holds a reference
  3143		 * on the queue usage counter, and is the only write BIO in-flight for
  3144		 * the target zone. Go straight to preparing a request for it.
  3145		 */
  3146		if (bio_zone_write_plugging(bio)) {
  3147			nr_segs = bio->__bi_nr_segments;
  3148			if (rq)
  3149				blk_queue_exit(q);
  3150			goto new_request;
  3151		}
  3152	
  3153		/*
  3154		 * The cached request already holds a q_usage_counter reference and we
  3155		 * don't have to acquire a new one if we use it.
  3156		 */
  3157		if (!rq) {
  3158			if (unlikely(bio_queue_enter(bio)))
  3159				return;
  3160		}
  3161	
  3162		/*
  3163		 * Device reconfiguration may change logical block size or reduce the
  3164		 * number of poll queues, so the checks for alignment and poll support
  3165		 * have to be done with queue usage counter held.
  3166		 */
  3167		if (unlikely(bio_unaligned(bio, q))) {
  3168			bio_io_error(bio);
  3169			goto queue_exit;
  3170		}
  3171	
  3172		if ((bio->bi_opf & REQ_POLLED) && !blk_mq_can_poll(q)) {
  3173			bio->bi_status = BLK_STS_NOTSUPP;
  3174			bio_endio(bio);
  3175			goto queue_exit;
  3176		}
  3177	
  3178		bio = __bio_split_to_limits(bio, &q->limits, &nr_segs);
  3179		if (!bio)
  3180			goto queue_exit;
  3181	
  3182		integrity_action = bio_integrity_action(bio);
  3183		if (integrity_action)
  3184			bio_integrity_prep(bio, integrity_action);
  3185	
  3186		blk_mq_bio_issue_init(q, bio);
  3187		if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
  3188			goto queue_exit;
  3189	
  3190		if (bio_needs_zone_write_plugging(bio)) {
  3191			if (blk_zone_plug_bio(bio, nr_segs))
  3192				goto queue_exit;
  3193		}
  3194	
  3195	new_request:
  3196		if (rq) {
  3197			rq_qos_throttle(rq->q, bio);
  3198			blk_mq_rq_time_init(rq, blk_time_get_ns());
  3199			rq->cmd_flags = bio->bi_opf;
  3200			INIT_LIST_HEAD(&rq->queuelist);
  3201		} else {
  3202			rq = blk_mq_get_new_requests(q, plug, bio);
  3203			if (unlikely(!rq)) {
  3204				if (bio->bi_opf & REQ_NOWAIT)
  3205					bio_wouldblock_error(bio);
  3206				goto queue_exit;
  3207			}
  3208		}
  3209	
  3210		trace_block_getrq(bio);
  3211	
  3212		rq_qos_track(q, rq, bio);
  3213	
  3214		blk_mq_bio_to_request(rq, bio, nr_segs);
  3215	
  3216		ret = blk_crypto_rq_get_keyslot(rq);
  3217		if (ret != BLK_STS_OK) {
  3218			bio->bi_status = ret;
  3219			bio_endio(bio);
  3220			blk_mq_free_request(rq);
  3221			return;
  3222		}
  3223	
  3224		if (bio_zone_write_plugging(bio))
  3225			blk_zone_write_plug_init_request(rq);
  3226	
  3227		if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
  3228			return;
  3229	
  3230		if (plug) {
  3231			blk_add_rq_to_plug(plug, rq);
  3232			return;
  3233		}
  3234	
  3235		hctx = rq->mq_hctx;
  3236		if ((rq->rq_flags & RQF_USE_SCHED) ||
  3237		    (hctx->dispatch_busy && (q->nr_hw_queues == 1 || !is_sync))) {
  3238			blk_mq_insert_request(rq, 0);
  3239			blk_mq_run_hw_queue(hctx, true);
  3240		} else {
  3241			blk_mq_run_dispatch_ops(q, blk_mq_try_issue_directly(hctx, rq));
  3242		}
  3243		return;
  3244	
  3245	queue_exit:
  3246		if (!rq)
  3247			blk_queue_exit(q);
  3248		else
> 3249			rq_list_push(&plug->cached_rqs, rq);
  3250	}
  3251	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: Observing higher CPU utilization during random IO fio testing
From: Wen Xiong @ 2026-05-26 15:28 UTC (permalink / raw)
  To: yukuai; +Cc: Jens Axboe, linux-block, tom.leiming, jmoyer, Gjoyce, wenxiong
In-Reply-To: <043e357f-5b37-4e05-9433-271504fc1d30@fygo.io>

On 2026-05-25 00:28, Yu Kuai wrote:
> Hi,
> 
> On 2026/5/22 5:52, Jens Axboe wrote:
>>> - More IO scheduler interaction, forces requests through scheduler 
>>> instead of direct dispatch(direct dispatch to hardware queue)
> I don't understand this point. Can you explain more? I think plug
> should not matter if request go through scheduler or not.

My understanding is:
Random IO tests are more CPU intensive.
Plugging delays the direct dispatch of IOs to the hardware queue (the
quick path).
A plug collects multiple IO requests into a batch and defers submitting
them until blk_flush_plug() is called (which dispatches to the hardware
queue) or the task gets scheduled out.
> 

> And I assume you're testing raw disk, because filesystems should
> always enable plug.
> 
Yes. FIO random IO tests over raw disks.

> Yes, perf data will be helpful. And please show your test in details 
> and I'll
> check if I can reproduce it.

System config:
47 dedicated cores
120 GB memory
PCIe4 2-Port 64Gb FC Adapter
64Gb FC switch
FlashSystem: FS9500, 12 LUNs/FC port

Below is fio config for rwmixread=100:
[global]
randrepeat=0
buffered=0
direct=1
norandommap=1
group_reporting=1
size=80g
ioengine=libaio
rw=randrw
bs=4k
iodepth=1
rwmixread=100
runtime=600
ramp_time=5
time_based=1
numjobs=20

[job1]
filename=/dev/dm-2

[job2]
filename=/dev/dm-3
...
24 jobs in total.

We collected some perf data. What kind of perf data do you want? Let me
know.
Thanks,
Wendy



* [PATCHv2] blk-mq: reinsert cached request to the list
From: Keith Busch @ 2026-05-26 15:35 UTC (permalink / raw)
  To: linux-block, axboe; +Cc: Keith Busch, Ming Lei, Christoph Hellwig

From: Keith Busch <kbusch@kernel.org>

A previous commit removed an optimization out of caution for a scenario
that turns out not to be real: all the "queue_exit" goto's are safe to
reinsert the request into the cached_rq's plug list as they are either
from a non-blocking path, or a successful merge that already holds the
queue reference. This optimization is most needed for small sequential
workloads that successfully merge into larger requests.

Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable")
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
v1->v2:

  Actually use the correct rq_list function to return the rq to the
  list.

 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 28c2d931e75ea..a24175441380e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3246,7 +3246,7 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (!rq)
 		blk_queue_exit(q);
 	else
-		blk_mq_free_request(rq);
+		rq_list_add_head(&plug->cached_rqs, rq);
 }

 #ifdef CONFIG_BLK_MQ_STACKING
-- 
2.53.0-Meta
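
The pop-then-reinsert behaviour this patch restores can be illustrated with a toy singly linked list that tracks head and tail. This is a sketch under assumptions, not the kernel's rq_list implementation: `struct node`, `struct list`, `list_pop()` and `list_add_head()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
	int id;
};

struct list {
	struct node *head;
	struct node *tail;
};

/* Remove and return the first node, or NULL if the list is empty. */
static struct node *list_pop(struct list *l)
{
	struct node *n = l->head;

	if (n) {
		l->head = n->next;
		if (!l->head)
			l->tail = NULL;
		n->next = NULL;
	}
	return n;
}

/* Put a node back at the front, fixing up the tail for an empty list. */
static void list_add_head(struct list *l, struct node *n)
{
	n->next = l->head;
	l->head = n;
	if (!l->tail)
		l->tail = n;
}
```

Popping an entry and adding it back at the head leaves the list in its original order, so an entry that ends up unused is available again for the next caller instead of being freed.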

* [PATCHv2 0/2] block, nvme: enable passthrough iostats
From: Keith Busch @ 2026-05-26 15:39 UTC (permalink / raw)
  To: linux-block, linux-nvme; +Cc: axboe, hch, nilay, Keith Busch

From: Keith Busch <kbusch@kernel.org>

v1->v2:

  Split the block and nvme parts into separate patches.

  Fixed up the nvme driver to ensure passthrough commands go through the
  multipath start/end functions.

  Have the helper function take a request_queue parameter since the
  queue accumulating the stats for multipath isn't the same as the
  request's queue. 

Keith Busch (2):
  block: export passthrough stats enabled
  nvme: add support multipath passthrough iostats

 block/blk-mq.c                | 32 +-------------------------------
 drivers/nvme/host/ioctl.c     |  4 ++++
 drivers/nvme/host/multipath.c |  5 ++++-
 include/linux/blk-mq.h        | 30 ++++++++++++++++++++++++++++++
 4 files changed, 39 insertions(+), 32 deletions(-)

-- 
2.53.0-Meta

* [PATCHv2 1/2] block: export passthrough stats enabled
From: Keith Busch @ 2026-05-26 15:39 UTC (permalink / raw)
  To: linux-block, linux-nvme; +Cc: axboe, hch, nilay, Keith Busch
In-Reply-To: <20260526153921.2402015-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

A user can enable io accounting for passthrough requests, so export the
helper that checks if the request should be tracked. This will enable
stacking drivers to report iostats for passthrough workloads.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/blk-mq.c         | 32 +-------------------------------
 include/linux/blk-mq.h | 30 ++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 28c2d931e75ea..48115e1d9d6a8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1088,43 +1088,13 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 	}
 }
 
-static inline bool blk_rq_passthrough_stats(struct request *req)
-{
-	struct bio *bio = req->bio;
-
-	if (!blk_queue_passthrough_stat(req->q))
-		return false;
-
-	/* Requests without a bio do not transfer data. */
-	if (!bio)
-		return false;
-
-	/*
-	 * Stats are accumulated in the bdev, so must have one attached to a
-	 * bio to track stats. Most drivers do not set the bdev for passthrough
-	 * requests, but nvme is one that will set it.
-	 */
-	if (!bio->bi_bdev)
-		return false;
-
-	/*
-	 * We don't know what a passthrough command does, but we know the
-	 * payload size and data direction. Ensuring the size is aligned to the
-	 * block size filters out most commands with payloads that don't
-	 * represent sector access.
-	 */
-	if (blk_rq_bytes(req) & (bdev_logical_block_size(bio->bi_bdev) - 1))
-		return false;
-	return true;
-}
-
 static inline void blk_account_io_start(struct request *req)
 {
 	trace_block_io_start(req);
 
 	if (!blk_queue_io_stat(req->q))
 		return;
-	if (blk_rq_is_passthrough(req) && !blk_rq_passthrough_stats(req))
+	if (blk_rq_is_passthrough(req) && !blk_rq_passthrough_stats(req, req->q))
 		return;
 
 	req->rq_flags |= RQF_IO_STAT;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581d..25931a8076d2a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1243,4 +1243,34 @@ static inline int blk_rq_map_sg(struct request *rq, struct scatterlist *sglist)
 }
 void blk_dump_rq_flags(struct request *, char *);
 
+static inline bool blk_rq_passthrough_stats(struct request *req,
+					    struct request_queue *q)
+{
+	struct bio *bio = req->bio;
+
+	if (!blk_queue_passthrough_stat(q))
+		return false;
+
+	/* Requests without a bio do not transfer data. */
+	if (!bio)
+		return false;
+
+	/*
+	 * Stats are accumulated in the bdev, so must have one attached to a
+	 * bio to track stats. Most drivers do not set the bdev for passthrough
+	 * requests, but nvme is one that will set it.
+	 */
+	if (!bio->bi_bdev)
+		return false;
+
+	/*
+	 * We don't know what a passthrough command does, but we know the
+	 * payload size and data direction. Ensuring the size is aligned to the
+	 * block size filters out most commands with payloads that don't
+	 * represent sector access.
+	 */
+	if (blk_rq_bytes(req) & (bdev_logical_block_size(bio->bi_bdev) - 1))
+		return false;
+	return true;
+}
 #endif /* BLK_MQ_H */
-- 
2.53.0-Meta
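
The size filter in the helper is a standard power-of-two alignment test: masking the length with (block size - 1) leaves the remainder, which is zero only for whole-block payloads. A minimal standalone sketch follows; `is_block_aligned()` is a hypothetical name, and it assumes the block size is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when len is a whole multiple of block_size.
 * Assumes block_size is a non-zero power of two, so that
 * (block_size - 1) is a mask of the low-order remainder bits.
 */
static bool is_block_aligned(uint32_t len, uint32_t block_size)
{
	return (len & (block_size - 1)) == 0;
}
```

With a 512-byte block, a 4096-byte payload passes and a 4100-byte payload does not, which is how payloads that do not look like sector access get filtered out.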


* [PATCHv2 2/2] nvme: add support multipath passthrough iostats
From: Keith Busch @ 2026-05-26 15:39 UTC (permalink / raw)
  To: linux-block, linux-nvme; +Cc: axboe, hch, nilay, Keith Busch
In-Reply-To: <20260526153921.2402015-1-kbusch@meta.com>

From: Keith Busch <kbusch@kernel.org>

Don't skip io accounting for passthrough commands if the user enabled
tracking these.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/ioctl.c     | 4 ++++
 drivers/nvme/host/multipath.c | 5 ++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 08889b20e5d8c..38ca04567406a 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -102,8 +102,12 @@ static struct request *nvme_alloc_user_request(struct request_queue *q,
 		struct nvme_command *cmd, blk_opf_t rq_flags,
 		blk_mq_req_flags_t blk_flags)
 {
+	struct nvme_ns *ns = q->queuedata;
 	struct request *req;
 
+	if (ns && nvme_ns_head_multipath(ns->head))
+		rq_flags |= REQ_NVME_MPATH;
+
 	req = blk_mq_alloc_request(q, nvme_req_op(cmd) | rq_flags, blk_flags);
 	if (IS_ERR(req))
 		return req;
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 263161cb8ac06..d0a95cde181c4 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -175,9 +175,12 @@ void nvme_mpath_start_request(struct request *rq)
 		nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
 	}
 
-	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq) ||
+	if (!blk_queue_io_stat(disk->queue) ||
 	    (nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
 		return;
+	if (blk_rq_is_passthrough(rq) &&
+	    !blk_rq_passthrough_stats(rq, disk->queue))
+		return;
 
 	nvme_req(rq)->flags |= NVME_MPATH_IO_STATS;
 	nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0, req_op(rq),
-- 
2.53.0-Meta


* Re: [PATCH] block: rename need_dispatch to cautious_dispatch in blk-mq sched
From: Jens Axboe @ 2026-05-26 15:54 UTC (permalink / raw)
  To: Guixin Liu, Christoph Hellwig, Keith Busch
  Cc: linux-block, xlpang, oliver.yang
In-Reply-To: <20260526131103.3105411-1-kanie@linux.alibaba.com>

On 5/26/26 7:11 AM, Guixin Liu wrote:
> The local boolean in __blk_mq_sched_dispatch_requests() decides whether
> to fall back to the per-ctx round-robin path (blk_mq_do_dispatch_ctx())
> instead of the batch flush path (blk_mq_flush_busy_ctxs()).  The whole
> function is about dispatching anyway, so the name "need_dispatch" is
> not particularly informative and can mislead readers into thinking that
> a false value means "skip dispatching".
> 
> Rename it to "cautious_dispatch" to match the comment right above the
> check ("dequeue request one by one from sw queue if queue is busy")
> and to convey the actual intent: take the cautious, fair, one-at-a-time
> path either when we just drained hctx->dispatch (so the device has
> recently pushed back) or when the dispatch_busy EWMA still indicates
> congestion.  The fast batch path is only taken when neither signal
> suggests recent backpressure.

If we're going to do churn like that, it should at least be an
improvement. 'cautious_dispatch' tells the reader nothing about
what kind of behavior this modifies. 'piecemeal_dispatch' would
be much better, as it actually accurately describes what it
does.

-- 
Jens Axboe


* Re: [PATCH v15 0/8] blk: honor isolcpus configuration
From: Daniel Wagner @ 2026-05-26 16:05 UTC (permalink / raw)
  To: Aaron Tomlin
  Cc: axboe, kbusch, hch, sagi, mst, aacraid, James.Bottomley,
	martin.petersen, liyihang9, kashyap.desai, sumit.saxena,
	shivasharan.srikanteshwara, chandrakanth.patil, sathya.prakash,
	sreekanth.reddy, suganath-prabu.subramani, ranjan.kumar,
	jinpu.wang, tglx, mingo, peterz, juri.lelli, vincent.guittot,
	akpm, maz, ruanjinjie, bigeasy, yphbchou0911, wagi, frederic,
	longman, chenridong, hare, kch, ming.lei, tom.leiming, steve,
	sean, chjohnst, neelx, mproche, nick.lange, marco.crivellari,
	rishil1999, linux-block, linux-kernel
In-Reply-To: <20260521232956.553287-1-atomlin@atomlin.com>

Hi Aaron,

On Thu, May 21, 2026 at 07:29:48PM -0400, Aaron Tomlin wrote:
> Please let me know your thoughts.
> 
> 
> Changes since v14:

You’re moving fast with these updates! It’s great energy, but it’s
actually moving a bit faster than the review process can keep up with.
I’ve heard from some folks in the CC that they are waiting for a 'final'
version.

Is this latest version ready for a full, deep-dive review, or are there
still a few 'knacks' you’re looking to iron out first?

Thanks,
Daniel

* Re: [PATCH] block: partitions: fix of_node refcount leak in of_partition()
From: Jens Axboe @ 2026-05-26 16:36 UTC (permalink / raw)
  To: stable, Wentao Liang; +Cc: Josh Law, Kees Cook, linux-block, linux-kernel
In-Reply-To: <20260526102124.2283846-1-vulab@iscas.ac.cn>


On Tue, 26 May 2026 10:21:24 +0000, Wentao Liang wrote:
> of_partition() calls of_node_get() on the parent device node at the
> beginning of the function, storing the reference in 'partitions_np'.
> This reference is leaked in two paths:
> 
> 1. The compatibility check at the top of the function returns 0
>    without releasing partitions_np when the node exists but is not
>    "fixed-partitions" compatible.
> 
> [...]

Applied, thanks!

[1/1] block: partitions: fix of_node refcount leak in of_partition()
      commit: 148cd4873115feb266c002d4d4618ea7f14342d9

Best regards,
-- 
Jens Axboe




* Re: [PATCH] block: blk-mq: fix ws_active refcount leak in blk_mq_mark_tag_wait()
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
  To: Wentao Liang; +Cc: linux-block, linux-kernel, stable
In-Reply-To: <20260526103722.2287587-1-vulab@iscas.ac.cn>


On Tue, 26 May 2026 10:37:22 +0000, Wentao Liang wrote:
> blk_mq_mark_tag_wait() calls sbitmap_queue_get() which increments
> sbq->ws_active. On the error path where the waitqueue_active() check
> fails and the function returns early, sbq->ws_active is not decremented,
> leaking the reference.
> 
> Fix this by calling sbitmap_queue_clear() to properly release the
> ws_active reference before returning on the error path.
> 
> [...]

Applied, thanks!

[1/1] block: blk-mq: fix ws_active refcount leak in blk_mq_mark_tag_wait()
      commit: 94028f339610f5d39d101449dc27156aea03b3cb

Best regards,
-- 
Jens Axboe




* Re: [PATCH v9] blk-mq: add tracepoint block_rq_tag_wait
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, Aaron Tomlin
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	john.g.garry, loberman, neelx, sean, mproche, chjohnst,
	linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <20260525005123.722277-1-atomlin@atomlin.com>


On Sun, 24 May 2026 20:51:23 -0400, Aaron Tomlin wrote:
> In high-performance storage environments, particularly when utilising
> RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> latency spikes can occur when fast devices (SSDs) are starved of hardware
> tags when sharing the same blk_mq_tag_set.
> 
> Currently, diagnosing this specific hardware queue contention is
> difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
> forces the current thread to block uninterruptible via io_schedule().
> While this can be inferred via sched:sched_switch or dynamically
> traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
> dedicated, out-of-the-box observability for this event.
> 
> [...]

Applied, thanks!

[1/1] blk-mq: add tracepoint block_rq_tag_wait
      commit: 9ece10778f8931630f86e802f94dc71115de0c8c

Best regards,
-- 
Jens Axboe




* Re: [PATCH] block: skip sync_blockdev() on surprise removal in bdev_mark_dead()
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
  To: Chao Shi
  Cc: Christoph Hellwig, Christian Brauner, Josef Bacik, linux-block,
	linux-kernel, Sungwoo Kim, Dave Tian, Weidong Zhu
In-Reply-To: <20260522220025.1770388-1-coshi036@gmail.com>


On Fri, 22 May 2026 18:00:25 -0400, Chao Shi wrote:
> bdev_mark_dead()'s @surprise == true means the device is already gone.
> The filesystem callback fs_bdev_mark_dead() honours this and skips
> sync_filesystem(), but the bare block device path (no ->mark_dead op)
> lost its !surprise guard when the holder ->mark_dead callback was wired
> up (see Fixes), and now calls sync_blockdev() unconditionally, which can
> hang forever waiting on writeback that can no longer complete.
> 
> [...]

Applied, thanks!

[1/1] block: skip sync_blockdev() on surprise removal in bdev_mark_dead()
      commit: 304f384f34af98a205086ce67331cad4fea6504d

Best regards,
-- 
Jens Axboe




* Re: [PATCH v1] block: switch numa_node to int in blk_mq_hw_ctx and init_request
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
  To: Mateusz Nowicki
  Cc: Caleb Sander Mateos, Sung-woo Kim, Josef Bacik, Alasdair Kergon,
	Mike Snitzer, Mikulas Patocka, Benjamin Marzinski, Ulf Hansson,
	Richard Weinberger, Zhihao Cheng, Miquel Raynal,
	Vignesh Raghavendra, Sven Peter, Janne Grunau, Neal Gompa,
	Keith Busch, Christoph Hellwig, Sagi Grimberg, Justin Tee,
	Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	James E.J. Bottomley, Martin K. Petersen, Thomas Fourier, Al Viro,
	Luke Wang, Kees Cook, linux-block, linux-kernel, nbd, dm-devel,
	linux-mmc, linux-mtd, asahi, linux-arm-kernel, linux-nvme,
	linux-scsi
In-Reply-To: <20260523125210.272274-1-mateusz.nowicki@posteo.net>


On Sat, 23 May 2026 12:52:35 +0000, Mateusz Nowicki wrote:
> numa_node in blk_mq_hw_ctx and the matching argument of
> blk_mq_ops::init_request can be NUMA_NO_NODE (-1).  Declared as
> unsigned int, NUMA_NO_NODE becomes UINT_MAX and walks off
> nvme_dev::descriptor_pools[] on CONFIG_NUMA=n [1].
> 
> Switch the field and the callback prototype to int and update all
> in-tree init_request implementations.  No functional change:
> cpu_to_node(), kmalloc_node() and blk_alloc_flush_queue() already
> take int.
> 
> [...]

Applied, thanks!

[1/1] block: switch numa_node to int in blk_mq_hw_ctx and init_request
      commit: 65e1c8f96ad1a1f3b72e8a91d1341d570f91d985

Best regards,
-- 
Jens Axboe
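
The signedness problem described in the quoted commit message is easy to reproduce in userspace: converting -1 to unsigned int gives UINT_MAX, which then fails any bounds check against a small array. The sketch below is only an illustration; `NO_NODE`, `NR_POOLS` and the helper names are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NO_NODE		(-1)	/* same value the message gives for NUMA_NO_NODE */
#define NR_POOLS	1	/* toy array size, e.g. a single-node build */

/* What an unsigned field ends up storing when handed a signed node id. */
static unsigned int as_unsigned_node(int node)
{
	return (unsigned int)node;
}

/* Bounds check an unsigned index against the toy pool array. */
static bool index_in_range(unsigned int node)
{
	return node < NR_POOLS;
}

/* With a signed type, the sentinel can be tested for directly. */
static bool is_no_node(int node)
{
	return node == NO_NODE;
}
```

Keeping the value in an int preserves the sentinel, so code can check for it before using the node id as an index.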




* Re: [PATCH] block: Avoid mounting the bdev pseudo-filesystem in userspace
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
  To: Denis Arefev; +Cc: linux-block, linux-kernel, lvc-project, stable
In-Reply-To: <20260521072857.5078-1-arefev@swemel.ru>


On Thu, 21 May 2026 10:28:56 +0300, Denis Arefev wrote:
> The bdev pseudo-filesystem is an internal kernel filesystem with which
> userspace should not interfere. Unregister it so that userspace cannot
> even attempt to mount it.
> 
> This fixes a bug [1] that occurs when attempting to access files,
> because the system call move_mount() uses pointers declared in the
> inode_operations structure, which for the bdev pseudo-filesystem
> are always equal to 0. `inode->i_op = &empty_iops;`
> 
> [...]

Applied, thanks!

[1/1] block: Avoid mounting the bdev pseudo-filesystem in userspace
      commit: b518ae170f6c411cac2d5f320278c27d902bc628

Best regards,
-- 
Jens Axboe




* Re: REQ_NOWAIT cleanups
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-block
In-Reply-To: <20260518063336.507369-1-hch@lst.de>


On Mon, 18 May 2026 08:33:28 +0200, Christoph Hellwig wrote:
> this series cleans up spurious code related to REQ_NOWAIT handling.
> 
> I have block layer work depending on this pending, so merging it through
> the block tree would be helpful.
> 
> Diffstat:
>  fs/direct-io.c      |   15 ++++-----------
>  include/linux/bio.h |    1 -
>  2 files changed, 4 insertions(+), 12 deletions(-)
> 
> [...]

Applied, thanks!

[1/2] direct-io: remove IOCB_NOWAIT support
      commit: ef9049ec8b9fd6c508832d9f7ab12029f3355102
[2/2] block: don't set BIO_QUIET for BLK_STS_AGAIN
      commit: 481105a949c8d11f7aa770b45fc4c8efcc53f205

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH v1] mtip32xx: fix use-after-free on service thread failure
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
  To: Yuho Choi
  Cc: Thomas Fourier, Martin K . Petersen, Andy Shevchenko, Al Viro,
	linux-block, linux-kernel
In-Reply-To: <20260525162531.1406677-1-dbgh9129@gmail.com>


On Mon, 25 May 2026 12:25:31 -0400, Yuho Choi wrote:
> If service thread creation fails after device_add_disk() succeeds,
> mtip_block_initialize() calls del_gendisk() and then falls through to
> put_disk(). Since mtip32xx uses .free_disk to free struct driver_data,
> put_disk() can release dd on the added-disk path.
> 
> The same unwind then continues to use dd for blk_mq_free_tag_set() and
> mtip_hw_exit(), and mtip_pci_probe() can later free dd again. This can
> cause a use-after-free and double free.
> 
> [...]

Applied, thanks!

[1/1] mtip32xx: fix use-after-free on service thread failure
      commit: 6b24446bee489e90f7ea843fbc0473393c73cbf9

Best regards,
-- 
Jens Axboe
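
An editorial aside on the quoted description: the hazard is an ownership
one. Once the final reference drop can free the driver data through a
release callback, the unwind path must neither touch nor free that data
again. The sketch below is a userspace model under that assumption, not the
applied patch; struct model_disk, struct model_dd and every function name
are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct model_dd {		/* hypothetical stand-in for the driver data */
	int id;
};

struct model_disk {		/* hypothetical refcounted disk object */
	int refs;
	struct model_dd *private_data;
	void (*free_disk)(struct model_disk *disk);
};

int freed_count;		/* how many times the driver data was freed */

/* Release callback: the disk, not the probe path, owns the driver data. */
void model_free_disk(struct model_disk *disk)
{
	free(disk->private_data);
	disk->private_data = NULL;
	freed_count++;
}

/* Drop one reference; the last put runs the callback and frees the disk. */
void model_put_disk(struct model_disk *disk)
{
	assert(disk->refs > 0);
	if (--disk->refs == 0) {
		if (disk->free_disk)
			disk->free_disk(disk);
		free(disk);
	}
}

/* Unwind ordering: finish every use of dd first, put the disk last. */
int model_probe_unwind(void)
{
	struct model_dd *dd = malloc(sizeof(*dd));
	struct model_disk *disk = malloc(sizeof(*disk));
	int last_seen;

	if (!dd || !disk) {
		free(dd);
		free(disk);
		return -1;
	}
	dd->id = 7;
	disk->refs = 1;
	disk->private_data = dd;
	disk->free_disk = model_free_disk;

	last_seen = dd->id;	/* last legitimate use of dd */
	model_put_disk(disk);	/* may free dd: no use and no free after this */
	return last_seen;
}
```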




^ permalink raw reply

* Re: [PATCH] block: remove blkdev_write_begin() and blkdev_write_end()
From: Jens Axboe @ 2026-05-26 16:42 UTC (permalink / raw)
  To: Christoph Hellwig, Tal Zussman; +Cc: linux-block, linux-kernel
In-Reply-To: <20260525-blk-write-cleanup-v1-1-391c073e3831@columbia.edu>


On Mon, 25 May 2026 14:25:55 -0400, Tal Zussman wrote:
> Remove blkdev_write_begin(), blkdev_write_end(), and their entries in
> def_blk_aops. These have been unreachable since commit 487c607df790
> ("block: use iomap for writes to block devices") switched block device
> buffered writes from generic_perform_write() to
> iomap_file_buffered_write(), which bypasses aops->write_begin/end.
> 
> 
> [...]

Applied, thanks!

[1/1] block: remove blkdev_write_begin() and blkdev_write_end()
      (no commit info)

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH] block: blk-mq: fix ws_active refcount leak in blk_mq_mark_tag_wait()
From: Bart Van Assche @ 2026-05-26 16:58 UTC (permalink / raw)
  To: Wentao Liang, Jens Axboe; +Cc: linux-block, linux-kernel, stable
In-Reply-To: <20260526103722.2287587-1-vulab@iscas.ac.cn>

On 5/26/26 3:37 AM, Wentao Liang wrote:
> blk_mq_mark_tag_wait() calls sbitmap_queue_get()

I don't see any sbitmap_queue_get() calls in blk_mq_mark_tag_wait().
Additionally, I don't see any other code above the modified code in
blk_mq_mark_tag_wait() that modifies sbq->ws_active directly or
indirectly. What am I missing?

> Fix this by calling sbitmap_queue_clear() to properly release the
> ws_active reference before returning on the error path.

This patch doesn't add a sbitmap_queue_clear() call. It seems like
there is a mismatch between the patch description and the code changes?

Bart.

^ permalink raw reply

* Re: [PATCH] block: blk-mq: fix ws_active refcount leak in blk_mq_mark_tag_wait()
From: Keith Busch @ 2026-05-26 17:05 UTC (permalink / raw)
  To: Wentao Liang; +Cc: Jens Axboe, linux-block, linux-kernel, stable
In-Reply-To: <20260526103722.2287587-1-vulab@iscas.ac.cn>

On Tue, May 26, 2026 at 10:37:22AM +0000, Wentao Liang wrote:
> blk_mq_mark_tag_wait() calls sbitmap_queue_get() which increments
> sbq->ws_active. On the error path where the waitqueue_active() check
> fails and the function returns early, sbq->ws_active is not decremented,
> leaking the reference.

I must be confused as I'm not making sense of this. Not only does
blk_mq_mark_tag_wait not call sbitmap_queue_get, sbitmap_queue_get does
not increment sbq->ws_active either. Could you clarify the actual
sequence?
 
> Fix this by calling sbitmap_queue_clear() to properly release the
> ws_active reference before returning on the error path.

And same here, I don't see sbitmap_queue_clear() called anywhere in this
path, nor does sbitmap_queue_clear() release ws_active anyway. What is
the actual sequence that gets there? 

> Fixes: c27d53fb445f ("blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
> ---
>  block/blk-mq.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d0c37daf568f..e1c2ac416693 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1952,6 +1952,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
>  	spin_lock_irq(&wq->lock);
>  	spin_lock(&hctx->dispatch_wait_lock);
>  	if (!list_empty(&wait->entry)) {
> +		list_del_init(&wait->entry);
> +		atomic_dec(&sbq->ws_active);

As far as I can tell, sbq->ws_active is incremented from three places:

  - blk_mq_mark_tag_wait() itself, but just below this line. So your
    change decrements before the local increment happened, no?
  - sbitmap_prepare_to_wait() / sbitmap_add_wait_queue() in
    lib/sbitmap.c, which are unrelated helpers not used here

What am I missing?
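
An editorial aside on the question above: the reviewers are probing whether
each decrement of the counter pairs with an earlier increment. The sketch
below is a toy userspace model of that invariant, not kernel code; struct
wait_model and its helpers are hypothetical and make no claim about the
actual call sequence in blk_mq_mark_tag_wait().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy model: `queued` plays the role of a non-empty wait entry. */
struct wait_model {
	atomic_int active;	/* stand-in for a ws_active-style counter */
	int queued;
};

/* Queue the waiter once and account for it exactly once. */
void model_add_wait(struct wait_model *m)
{
	assert(m != NULL);
	if (!m->queued) {
		m->queued = 1;
		atomic_fetch_add(&m->active, 1);
	}
}

/* Undo model_add_wait(); a no-op if the waiter was never queued. */
void model_del_wait(struct wait_model *m)
{
	assert(m != NULL);
	if (m->queued) {
		m->queued = 0;
		atomic_fetch_sub(&m->active, 1);
	}
}

/* Decrement with no matching increment; returns the new counter value. */
int model_unpaired_dec(struct wait_model *m)
{
	assert(m != NULL);
	return atomic_fetch_sub(&m->active, 1) - 1;
}
```

In this model a paired add/del leaves the counter where it started, while
an unpaired decrement drives it below zero.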

^ permalink raw reply

* Re: [PATCHv2] blk-mq: reinsert cached request to the list
From: Chaitanya Kulkarni @ 2026-05-26 17:06 UTC (permalink / raw)
  To: Keith Busch, linux-block@vger.kernel.org, axboe@kernel.dk
  Cc: Keith Busch, Ming Lei, Christoph Hellwig
In-Reply-To: <20260526153531.2365935-1-kbusch@meta.com>

On 5/26/26 08:35, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> A previous commit removed an optimization out of caution for a scenario
> that turns out not to be real: all the "queue_exit" goto's are safe to
> reinsert the request into the cached_rq's plug list as they are either
> from a non-blocking path, or a successful merge that already holds the
> queue reference. This optimization is most needed for small sequential
> workloads that successfully merge into larger requests.
>
> Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable")
> Suggested-by: Ming Lei <tom.leiming@gmail.com>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Keith Busch <kbusch@kernel.org>

Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck
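
An editorial aside on the quoted commit message: reinserting a popped
request means putting it back at the head of the cached list so the next
allocation finds it first. The sketch below models that with a plain singly
linked list. It is a userspace illustration only; struct req, struct
req_list and both helpers are hypothetical and are not the kernel's rq_list
API.

```c
#include <assert.h>
#include <stddef.h>

struct req {			/* hypothetical stand-in for a cached request */
	int tag;
	struct req *next;
};

struct req_list {		/* hypothetical singly linked cache list */
	struct req *head;
};

/* Detach and return the first cached request, or NULL if the list is empty. */
struct req *req_list_pop(struct req_list *list)
{
	struct req *rq = list->head;

	if (rq) {
		list->head = rq->next;
		rq->next = NULL;
	}
	return rq;
}

/* Put a request back at the front so it is the next one popped. */
void req_list_add_head(struct req_list *list, struct req *rq)
{
	assert(rq != NULL);
	rq->next = list->head;
	list->head = rq;
}
```

Popping a request and then adding it back at the head restores the list to
its earlier order, so an unused request is not lost from the cache.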



^ permalink raw reply

* Re: [PATCH V4 0/3] md/nvme: Enable PCI P2PDMA support for RAID0 and NVMe Multipath
From: Chaitanya Kulkarni @ 2026-05-26 17:09 UTC (permalink / raw)
  To: axboe@kernel.dk
  Cc: song@kernel.org, yukuai@fnnas.com, Christoph Hellwig,
	linan122@huawei.com, kbusch@kernel.org, sagi@grimberg.me,
	linux-block@vger.kernel.org, linux-raid@vger.kernel.org,
	linux-nvme@lists.infradead.org, Kiran Modukuri
In-Reply-To: <4ed83782-04cf-45b5-93a0-05a08e61b82e@nvidia.com>

Jens,

On 5/19/26 17:11, Chaitanya Kulkarni wrote:
> Jens,
>
>
> On 5/14/26 9:35 PM, Christoph Hellwig wrote:
>> Still looks good to me as per the reviews.
>>
> If there is no objection, can we merge this?
>
> -Chaitanya
>
>
There is outstanding work I want to send out based on this one.

May I please ask you to merge this patch series?

-ck



^ permalink raw reply
