* [PATCH 0/5] blk-mq: Replace tags->lock with SRCU for tag iterators
@ 2025-08-01 11:44 Ming Lei
  2025-08-01 11:44 ` [PATCH 1/5] blk-mq: Move flush queue allocation into blk_mq_init_hctx() Ming Lei
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Ming Lei @ 2025-08-01 11:44 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Yu Kuai, John Garry, Sathya Prakash Veerichetty, Ming Lei

Hello Jens,

Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
around the tag iterators.

This avoids a scsi_host_busy() lockup while the SCSI host is blocked on
systems with many CPU cores and deep queue depth.

It also becomes possible to use blk_mq_in_driver_rw() for IO accounting now.

Take the following approach:

- clear the rq references in tags->rqs[] and defer freeing of scheduler
  requests to an SRCU callback

- replace tags->lock with an SRCU read lock in the tag iterators.
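
For reference, the core pattern is roughly the sketch below. It is
illustrative only: the example_* names are made up, the real code keeps
the srcu_struct in the tag_set and lives in blk-mq-tag.c / blk-mq.c.

#include <linux/kernel.h>
#include <linux/srcu.h>
#include <linux/slab.h>

struct example_tags {
	void **rqs;			/* stands in for tags->rqs[] */
	unsigned int depth;
	struct rcu_head rcu_head;
};

DEFINE_STATIC_SRCU(example_srcu);	/* stands in for set->tags_srcu */

/* Reader side: walk the tag table under the SRCU read lock. */
static void example_iter(struct example_tags *tags)
{
	int idx = srcu_read_lock(&example_srcu);
	unsigned int i;

	for (i = 0; i < tags->depth; i++) {
		void *rq = tags->rqs[i];

		/* rq memory can't be freed before srcu_read_unlock() */
		if (rq)
			pr_debug("tag %u is in flight\n", i);
	}
	srcu_read_unlock(&example_srcu, idx);
}

/* Free side: defer the actual freeing to an SRCU callback. */
static void example_free_cb(struct rcu_head *head)
{
	struct example_tags *tags = container_of(head, struct example_tags,
						 rcu_head);

	kfree(tags->rqs);
	kfree(tags);
}

static void example_free(struct example_tags *tags)
{
	call_srcu(&example_srcu, &tags->rcu_head, example_free_cb);
}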


Ming Lei (5):
  blk-mq: Move flush queue allocation into blk_mq_init_hctx()
  blk-mq: Pass tag_set to blk_mq_free_rq_map/tags
  blk-mq: Defer freeing of tags page_list to SRCU callback
  blk-mq: Defer freeing flush queue to SRCU callback
  blk-mq: Replace tags->lock with SRCU for tag iterators

 block/blk-mq-sched.c   |  4 +-
 block/blk-mq-sysfs.c   |  1 -
 block/blk-mq-tag.c     | 38 +++++++++++++++---
 block/blk-mq.c         | 87 +++++++++++++++++++++---------------------
 block/blk-mq.h         |  4 +-
 block/blk.h            |  1 +
 include/linux/blk-mq.h |  2 +
 7 files changed, 82 insertions(+), 55 deletions(-)

-- 
2.47.0



* [PATCH 1/5] blk-mq: Move flush queue allocation into blk_mq_init_hctx()
  2025-08-01 11:44 [PATCH 0/5] blk-mq: Replace tags->lock with SRCU for tag iterators Ming Lei
@ 2025-08-01 11:44 ` Ming Lei
  2025-08-04  6:06   ` Yu Kuai
  2025-08-04  7:07   ` Hannes Reinecke
  2025-08-01 11:44 ` [PATCH 2/5] blk-mq: Pass tag_set to blk_mq_free_rq_map/tags Ming Lei
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 30+ messages in thread
From: Ming Lei @ 2025-08-01 11:44 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Yu Kuai, John Garry, Sathya Prakash Veerichetty, Ming Lei

Move flush queue allocation into blk_mq_init_hctx() and its release into
blk_mq_exit_hctx(), preparing for replacing tags->lock with SRCU to drain
in-flight request walking. blk_mq_exit_hctx() is the last chance for us
to obtain a valid `tag_set` reference, and we need to add an SRCU instance
to `tag_set` for freeing the flush request via call_srcu().

It is safe to move the flush queue & request release into blk_mq_exit_hctx(),
because blk_mq_clear_flush_rq_mapping() clears the flush request
reference in the driver tags' in-flight request table, and meanwhile
in-flight request walking is drained.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sysfs.c |  1 -
 block/blk-mq.c       | 20 +++++++++++++-------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 24656980f443..c8dfed6c1c96 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -34,7 +34,6 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
 						  kobj);
 
-	blk_free_flush_queue(hctx->fq);
 	sbitmap_free(&hctx->ctx_map);
 	free_cpumask_var(hctx->cpumask);
 	kfree(hctx->ctxs);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9692fa4c3ef2..c6a1366dbe77 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3939,6 +3939,9 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
+	blk_free_flush_queue(hctx->fq);
+	hctx->fq = NULL;
+
 	xa_erase(&q->hctx_table, hctx_idx);
 
 	spin_lock(&q->unused_hctx_lock);
@@ -3964,13 +3967,19 @@ static int blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
 {
+	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
+
+	hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
+	if (!hctx->fq)
+		goto fail;
+
 	hctx->queue_num = hctx_idx;
 
 	hctx->tags = set->tags[hctx_idx];
 
 	if (set->ops->init_hctx &&
 	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
-		goto fail;
+		goto fail_free_fq;
 
 	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
 				hctx->numa_node))
@@ -3987,6 +3996,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
  exit_hctx:
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
+ fail_free_fq:
+	blk_free_flush_queue(hctx->fq);
+	hctx->fq = NULL;
  fail:
 	return -1;
 }
@@ -4038,16 +4050,10 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
 	INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
 
-	hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
-	if (!hctx->fq)
-		goto free_bitmap;
-
 	blk_mq_hctx_kobj_init(hctx);
 
 	return hctx;
 
- free_bitmap:
-	sbitmap_free(&hctx->ctx_map);
  free_ctxs:
 	kfree(hctx->ctxs);
  free_cpumask:
-- 
2.47.0



* [PATCH 2/5] blk-mq: Pass tag_set to blk_mq_free_rq_map/tags
  2025-08-01 11:44 [PATCH 0/5] blk-mq: Replace tags->lock with SRCU for tag iterators Ming Lei
  2025-08-01 11:44 ` [PATCH 1/5] blk-mq: Move flush queue allocation into blk_mq_init_hctx() Ming Lei
@ 2025-08-01 11:44 ` Ming Lei
  2025-08-04  7:08   ` Hannes Reinecke
  2025-08-05  7:48   ` Yu Kuai
  2025-08-01 11:44 ` [PATCH 3/5] blk-mq: Defer freeing of tags page_list to SRCU callback Ming Lei
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 30+ messages in thread
From: Ming Lei @ 2025-08-01 11:44 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Yu Kuai, John Garry, Sathya Prakash Veerichetty, Ming Lei

To prepare for converting the tags->rqs freeing to be SRCU-based, the
tag_set is needed in the freeing helper functions.

This patch adds 'struct blk_mq_tag_set *' as the first parameter to
blk_mq_free_rq_map() and blk_mq_free_tags(), and updates all their call
sites.

This allows access to the tag_set's SRCU structure in the next step,
which will be used to free the tag maps after a grace period.

No functional change is intended in this patch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c |  4 ++--
 block/blk-mq-tag.c   |  2 +-
 block/blk-mq.c       | 10 +++++-----
 block/blk-mq.h       |  4 ++--
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55a0fd105147..8b5ad3b3b071 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -393,7 +393,7 @@ static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
 
 static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
 {
-	blk_mq_free_rq_map(queue->sched_shared_tags);
+	blk_mq_free_rq_map(queue->tag_set, queue->sched_shared_tags);
 	queue->sched_shared_tags = NULL;
 }
 
@@ -406,7 +406,7 @@ static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int fla
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (hctx->sched_tags) {
 			if (!blk_mq_is_shared_tags(flags))
-				blk_mq_free_rq_map(hctx->sched_tags);
+				blk_mq_free_rq_map(q->tag_set, hctx->sched_tags);
 			hctx->sched_tags = NULL;
 		}
 	}
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d880c50629d6..6fce42851f03 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -576,7 +576,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	return NULL;
 }
 
-void blk_mq_free_tags(struct blk_mq_tags *tags)
+void blk_mq_free_tags(struct blk_mq_tag_set *set, struct blk_mq_tags *tags)
 {
 	sbitmap_queue_free(&tags->bitmap_tags);
 	sbitmap_queue_free(&tags->breserved_tags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c6a1366dbe77..673cb534bc62 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3491,14 +3491,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	}
 }
 
-void blk_mq_free_rq_map(struct blk_mq_tags *tags)
+void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags)
 {
 	kfree(tags->rqs);
 	tags->rqs = NULL;
 	kfree(tags->static_rqs);
 	tags->static_rqs = NULL;
 
-	blk_mq_free_tags(tags);
+	blk_mq_free_tags(set, tags);
 }
 
 static enum hctx_type hctx_idx_to_type(struct blk_mq_tag_set *set,
@@ -3560,7 +3560,7 @@ static struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 err_free_rqs:
 	kfree(tags->rqs);
 err_free_tags:
-	blk_mq_free_tags(tags);
+	blk_mq_free_tags(set, tags);
 	return NULL;
 }
 
@@ -4107,7 +4107,7 @@ struct blk_mq_tags *blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set,
 
 	ret = blk_mq_alloc_rqs(set, tags, hctx_idx, depth);
 	if (ret) {
-		blk_mq_free_rq_map(tags);
+		blk_mq_free_rq_map(set, tags);
 		return NULL;
 	}
 
@@ -4135,7 +4135,7 @@ void blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set,
 {
 	if (tags) {
 		blk_mq_free_rqs(set, tags, hctx_idx);
-		blk_mq_free_rq_map(tags);
+		blk_mq_free_rq_map(set, tags);
 	}
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index affb2e14b56e..b96a753809ab 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -59,7 +59,7 @@ void blk_mq_put_rq_ref(struct request *rq);
  */
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx);
-void blk_mq_free_rq_map(struct blk_mq_tags *tags);
+void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags);
 struct blk_mq_tags *blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set,
 				unsigned int hctx_idx, unsigned int depth);
 void blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set,
@@ -162,7 +162,7 @@ struct blk_mq_alloc_data {
 
 struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
 		unsigned int reserved_tags, unsigned int flags, int node);
-void blk_mq_free_tags(struct blk_mq_tags *tags);
+void blk_mq_free_tags(struct blk_mq_tag_set *set, struct blk_mq_tags *tags);
 
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
 unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
-- 
2.47.0



* [PATCH 3/5] blk-mq: Defer freeing of tags page_list to SRCU callback
  2025-08-01 11:44 [PATCH 0/5] blk-mq: Replace tags->lock with SRCU for tag iterators Ming Lei
  2025-08-01 11:44 ` [PATCH 1/5] blk-mq: Move flush queue allocation into blk_mq_init_hctx() Ming Lei
  2025-08-01 11:44 ` [PATCH 2/5] blk-mq: Pass tag_set to blk_mq_free_rq_map/tags Ming Lei
@ 2025-08-01 11:44 ` Ming Lei
  2025-08-04  7:09   ` Hannes Reinecke
  2025-08-06  9:15   ` Yu Kuai
  2025-08-01 11:44 ` [PATCH 4/5] blk-mq: Defer freeing flush queue " Ming Lei
  2025-08-01 11:44 ` [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators Ming Lei
  4 siblings, 2 replies; 30+ messages in thread
From: Ming Lei @ 2025-08-01 11:44 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Yu Kuai, John Garry, Sathya Prakash Veerichetty, Ming Lei

Tag iterators can race with the freeing of the request pages
(tags->page_list), potentially leading to use-after-free issues.

Defer the freeing of the page list and the tags structure itself until
after an SRCU grace period has passed. This ensures that any concurrent
tag iterators have completed before the memory is released, and allows
the big tags->lock in the tag iterator code path to be replaced with
SRCU in a later patch.

This is achieved by:
- Adding a new `srcu_struct tags_srcu` to `blk_mq_tag_set` to protect
  tag map iteration.
- Adding an `rcu_head` to `struct blk_mq_tags` to be used with
  `call_srcu`.
- Moving the page list freeing logic and the `kfree(tags)` call into a
  new callback function, `blk_mq_free_tags_callback`.
- In `blk_mq_free_tags`, invoking `call_srcu` to schedule the new
  callback for deferred execution.

The read-side protection for the tag iterators will be added in a
subsequent patch.
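
Because the freeing is now asynchronous, tag_set teardown has to wait for
any pending callbacks before destroying the srcu_struct, which is why the
hunk below adds srcu_barrier() ahead of cleanup_srcu_struct() in
blk_mq_free_tag_set(). A minimal sketch of that ordering, with a
hypothetical example_* name:

#include <linux/srcu.h>

/*
 * Once frees go through call_srcu(), every callback queued so far must
 * have run before the srcu_struct itself may be cleaned up.
 */
static void example_teardown(struct srcu_struct *ssp)
{
	srcu_barrier(ssp);	/* wait for pending call_srcu() callbacks */
	cleanup_srcu_struct(ssp);
}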

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c     | 24 +++++++++++++++++++++++-
 block/blk-mq.c         | 26 +++++++++++++-------------
 include/linux/blk-mq.h |  2 ++
 3 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 6fce42851f03..6c2f5881e0de 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -8,6 +8,9 @@
  */
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/kmemleak.h>
 
 #include <linux/delay.h>
 #include "blk.h"
@@ -576,11 +579,30 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	return NULL;
 }
 
+static void blk_mq_free_tags_callback(struct rcu_head *head)
+{
+	struct blk_mq_tags *tags = container_of(head, struct blk_mq_tags,
+						rcu_head);
+	struct page *page;
+
+	while (!list_empty(&tags->page_list)) {
+		page = list_first_entry(&tags->page_list, struct page, lru);
+		list_del_init(&page->lru);
+		/*
+		 * Remove kmemleak object previously allocated in
+		 * blk_mq_alloc_rqs().
+		 */
+		kmemleak_free(page_address(page));
+		__free_pages(page, page->private);
+	}
+	kfree(tags);
+}
+
 void blk_mq_free_tags(struct blk_mq_tag_set *set, struct blk_mq_tags *tags)
 {
 	sbitmap_queue_free(&tags->bitmap_tags);
 	sbitmap_queue_free(&tags->breserved_tags);
-	kfree(tags);
+	call_srcu(&set->tags_srcu, &tags->rcu_head, blk_mq_free_tags_callback);
 }
 
 int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 673cb534bc62..f14aa0a19ef0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3454,7 +3454,6 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx)
 {
 	struct blk_mq_tags *drv_tags;
-	struct page *page;
 
 	if (list_empty(&tags->page_list))
 		return;
@@ -3478,17 +3477,10 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	}
 
 	blk_mq_clear_rq_mapping(drv_tags, tags);
-
-	while (!list_empty(&tags->page_list)) {
-		page = list_first_entry(&tags->page_list, struct page, lru);
-		list_del_init(&page->lru);
-		/*
-		 * Remove kmemleak object previously allocated in
-		 * blk_mq_alloc_rqs().
-		 */
-		kmemleak_free(page_address(page));
-		__free_pages(page, page->private);
-	}
+	/*
+	 * Free request pages in SRCU callback, which is called from
+	 * blk_mq_free_tags().
+	 */
 }
 
 void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags)
@@ -4834,6 +4826,9 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 		if (ret)
 			goto out_free_srcu;
 	}
+	ret = init_srcu_struct(&set->tags_srcu);
+	if (ret)
+		goto out_cleanup_srcu;
 
 	init_rwsem(&set->update_nr_hwq_lock);
 
@@ -4842,7 +4837,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 				 sizeof(struct blk_mq_tags *), GFP_KERNEL,
 				 set->numa_node);
 	if (!set->tags)
-		goto out_cleanup_srcu;
+		goto out_cleanup_tags_srcu;
 
 	for (i = 0; i < set->nr_maps; i++) {
 		set->map[i].mq_map = kcalloc_node(nr_cpu_ids,
@@ -4871,6 +4866,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	}
 	kfree(set->tags);
 	set->tags = NULL;
+out_cleanup_tags_srcu:
+	cleanup_srcu_struct(&set->tags_srcu);
 out_cleanup_srcu:
 	if (set->flags & BLK_MQ_F_BLOCKING)
 		cleanup_srcu_struct(set->srcu);
@@ -4916,6 +4913,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 
 	kfree(set->tags);
 	set->tags = NULL;
+
+	srcu_barrier(&set->tags_srcu);
+	cleanup_srcu_struct(&set->tags_srcu);
 	if (set->flags & BLK_MQ_F_BLOCKING) {
 		cleanup_srcu_struct(set->srcu);
 		kfree(set->srcu);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2a5a828f19a0..1325ceeb743a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -531,6 +531,7 @@ struct blk_mq_tag_set {
 	struct mutex		tag_list_lock;
 	struct list_head	tag_list;
 	struct srcu_struct	*srcu;
+	struct srcu_struct	tags_srcu;
 
 	struct rw_semaphore	update_nr_hwq_lock;
 };
@@ -767,6 +768,7 @@ struct blk_mq_tags {
 	 * request pool
 	 */
 	spinlock_t lock;
+	struct rcu_head rcu_head;
 };
 
 static inline struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags,
-- 
2.47.0



* [PATCH 4/5] blk-mq: Defer freeing flush queue to SRCU callback
  2025-08-01 11:44 [PATCH 0/5] blk-mq: Replace tags->lock with SRCU for tag iterators Ming Lei
                   ` (2 preceding siblings ...)
  2025-08-01 11:44 ` [PATCH 3/5] blk-mq: Defer freeing of tags page_list to SRCU callback Ming Lei
@ 2025-08-01 11:44 ` Ming Lei
  2025-08-04  7:11   ` Hannes Reinecke
  2025-08-06  9:17   ` Yu Kuai
  2025-08-01 11:44 ` [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators Ming Lei
  4 siblings, 2 replies; 30+ messages in thread
From: Ming Lei @ 2025-08-01 11:44 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Yu Kuai, John Garry, Sathya Prakash Veerichetty, Ming Lei

The freeing of the flush queue/request in blk_mq_exit_hctx() can race with
tag iterators that may still be accessing it. To prevent a potential
use-after-free, the deallocation should be deferred until after a grace
period. This also allows the big tags->lock in the tag iterator code path
to be replaced with SRCU.

This patch introduces an SRCU-based deferred freeing mechanism for the
flush queue.

The changes include:
- Adding a `rcu_head` to `struct blk_flush_queue`.
- Creating a new callback function, `blk_free_flush_queue_callback`,
  to handle the actual freeing.
- Replacing the direct call to `blk_free_flush_queue()` in
  `blk_mq_exit_hctx()` with `call_srcu()`, using the `tags_srcu`
  instance to ensure synchronization with tag iterators.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 11 ++++++++++-
 block/blk.h    |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f14aa0a19ef0..7b4ab8e398b6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3912,6 +3912,14 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
 	spin_unlock_irqrestore(&tags->lock, flags);
 }
 
+static void blk_free_flush_queue_callback(struct rcu_head *head)
+{
+	struct blk_flush_queue *fq =
+		container_of(head, struct blk_flush_queue, rcu_head);
+
+	blk_free_flush_queue(fq);
+}
+
 /* hctx->ctxs will be freed in queue's release handler */
 static void blk_mq_exit_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
@@ -3931,7 +3939,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
-	blk_free_flush_queue(hctx->fq);
+	call_srcu(&set->tags_srcu, &hctx->fq->rcu_head,
+			blk_free_flush_queue_callback);
 	hctx->fq = NULL;
 
 	xa_erase(&q->hctx_table, hctx_idx);
diff --git a/block/blk.h b/block/blk.h
index 76901a39997f..c7e52552290f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -39,6 +39,7 @@ struct blk_flush_queue {
 	struct list_head	flush_queue[2];
 	unsigned long		flush_data_in_flight;
 	struct request		*flush_rq;
+	struct rcu_head		rcu_head;
 };
 
 bool is_flush_rq(struct request *req);
-- 
2.47.0



* [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-01 11:44 [PATCH 0/5] blk-mq: Replace tags->lock with SRCU for tag iterators Ming Lei
                   ` (3 preceding siblings ...)
  2025-08-01 11:44 ` [PATCH 4/5] blk-mq: Defer freeing flush queue " Ming Lei
@ 2025-08-01 11:44 ` Ming Lei
  2025-08-04  6:30   ` Yu Kuai
                     ` (2 more replies)
  4 siblings, 3 replies; 30+ messages in thread
From: Ming Lei @ 2025-08-01 11:44 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Yu Kuai, John Garry, Sathya Prakash Veerichetty, Ming Lei

Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
around the tag iterators.

This is done by:

- Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().

- Removing the now-redundant tags->lock from blk_mq_find_and_get_req().

This change improves performance by replacing a spinlock with a more
scalable SRCU lock, and fixes a lockup issue in scsi_host_busy() when
shost->host_blocked is set.

It also becomes possible to use blk_mq_in_driver_rw() for IO
accounting.
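
The shape of the change, in simplified form (hypothetical example_*
names, not the actual blk-mq helpers): the SRCU read lock is taken once
at the iterator entry point, so the per-tag lookup no longer needs
tags->lock.

#include <linux/kernel.h>
#include <linux/srcu.h>

/*
 * Caller holds srcu_read_lock(); a slot may be cleared concurrently,
 * but the request memory it pointed to cannot be freed until the reader
 * exits, so the real code can still validate it with
 * req_ref_inc_not_zero() without taking tags->lock.
 */
static void *example_find(void **slots, unsigned int bitnr)
{
	return slots[bitnr];
}

static void example_busy_iter(struct srcu_struct *ssp, void **slots,
			      unsigned int depth)
{
	int idx = srcu_read_lock(ssp);
	unsigned int i;

	for (i = 0; i < depth; i++)
		if (example_find(slots, i))
			pr_debug("tag %u busy\n", i);

	srcu_read_unlock(ssp, idx);
}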

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c | 12 ++++++++----
 block/blk-mq.c     | 24 ++++--------------------
 2 files changed, 12 insertions(+), 24 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 6c2f5881e0de..7ae431077a32 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -256,13 +256,10 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
 		unsigned int bitnr)
 {
 	struct request *rq;
-	unsigned long flags;
 
-	spin_lock_irqsave(&tags->lock, flags);
 	rq = tags->rqs[bitnr];
 	if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
 		rq = NULL;
-	spin_unlock_irqrestore(&tags->lock, flags);
 	return rq;
 }
 
@@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv)
 {
 	unsigned int flags = tagset->flags;
-	int i, nr_tags;
+	int i, nr_tags, srcu_idx;
+
+	srcu_idx = srcu_read_lock(&tagset->tags_srcu);
 
 	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
 
@@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
 					      BT_TAG_ITER_STARTED);
 	}
+	srcu_read_unlock(&tagset->tags_srcu, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
@@ -499,6 +499,8 @@ EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
 		void *priv)
 {
+	int srcu_idx;
+
 	/*
 	 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and hctx_table
 	 * while the queue is frozen. So we can use q_usage_counter to avoid
@@ -507,6 +509,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
+	srcu_idx = srcu_read_lock(&q->tag_set->tags_srcu);
 	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
 		struct blk_mq_tags *tags = q->tag_set->shared_tags;
 		struct sbitmap_queue *bresv = &tags->breserved_tags;
@@ -536,6 +539,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
 			bt_for_each(hctx, q, btags, fn, priv, false);
 		}
 	}
+	srcu_read_unlock(&q->tag_set->tags_srcu, srcu_idx);
 	blk_queue_exit(q);
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7b4ab8e398b6..43b15e58ffe1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3415,7 +3415,6 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
 				    struct blk_mq_tags *tags)
 {
 	struct page *page;
-	unsigned long flags;
 
 	/*
 	 * There is no need to clear mapping if driver tags is not initialized
@@ -3439,15 +3438,6 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
 			}
 		}
 	}
-
-	/*
-	 * Wait until all pending iteration is done.
-	 *
-	 * Request reference is cleared and it is guaranteed to be observed
-	 * after the ->lock is released.
-	 */
-	spin_lock_irqsave(&drv_tags->lock, flags);
-	spin_unlock_irqrestore(&drv_tags->lock, flags);
 }
 
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
@@ -3670,8 +3660,12 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
 	struct rq_iter_data data = {
 		.hctx	= hctx,
 	};
+	int srcu_idx;
 
+	srcu_idx = srcu_read_lock(&hctx->queue->tag_set->tags_srcu);
 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
+	srcu_read_unlock(&hctx->queue->tag_set->tags_srcu, srcu_idx);
+
 	return data.has_rq;
 }
 
@@ -3891,7 +3885,6 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
 		unsigned int queue_depth, struct request *flush_rq)
 {
 	int i;
-	unsigned long flags;
 
 	/* The hw queue may not be mapped yet */
 	if (!tags)
@@ -3901,15 +3894,6 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
 
 	for (i = 0; i < queue_depth; i++)
 		cmpxchg(&tags->rqs[i], flush_rq, NULL);
-
-	/*
-	 * Wait until all pending iteration is done.
-	 *
-	 * Request reference is cleared and it is guaranteed to be observed
-	 * after the ->lock is released.
-	 */
-	spin_lock_irqsave(&tags->lock, flags);
-	spin_unlock_irqrestore(&tags->lock, flags);
 }
 
 static void blk_free_flush_queue_callback(struct rcu_head *head)
-- 
2.47.0



* Re: [PATCH 1/5] blk-mq: Move flush queue allocation into blk_mq_init_hctx()
  2025-08-01 11:44 ` [PATCH 1/5] blk-mq: Move flush queue allocation into blk_mq_init_hctx() Ming Lei
@ 2025-08-04  6:06   ` Yu Kuai
  2025-08-04  7:07   ` Hannes Reinecke
  1 sibling, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-04  6:06 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: John Garry, Sathya Prakash Veerichetty, yukuai (C)

On 2025/08/01 19:44, Ming Lei wrote:
> Move flush queue allocation into blk_mq_init_hctx() and its release into
> blk_mq_exit_hctx(), and prepare for replacing tags->lock with SRCU to
> draining inflight request walking. blk_mq_exit_hctx() is the last chance
> for us to get valid `tag_set` reference, and we need to add one SRCU to
> `tag_set` for freeing flush request via call_srcu().
> 
> It is safe to move flush queue & request release into blk_mq_exit_hctx(),
> because blk_mq_clear_flush_rq_mapping() clears the flush request
> reference int driver tags inflight request table, meantime inflight
> request walking is drained.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-sysfs.c |  1 -
>   block/blk-mq.c       | 20 +++++++++++++-------
>   2 files changed, 13 insertions(+), 8 deletions(-)
> 
LGTM
Reviewed-by: Yu Kuai <yukuai3@huawei.com>

> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 24656980f443..c8dfed6c1c96 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -34,7 +34,6 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
>   	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
>   						  kobj);
>   
> -	blk_free_flush_queue(hctx->fq);
>   	sbitmap_free(&hctx->ctx_map);
>   	free_cpumask_var(hctx->cpumask);
>   	kfree(hctx->ctxs);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 9692fa4c3ef2..c6a1366dbe77 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3939,6 +3939,9 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>   	if (set->ops->exit_hctx)
>   		set->ops->exit_hctx(hctx, hctx_idx);
>   
> +	blk_free_flush_queue(hctx->fq);
> +	hctx->fq = NULL;
> +
>   	xa_erase(&q->hctx_table, hctx_idx);
>   
>   	spin_lock(&q->unused_hctx_lock);
> @@ -3964,13 +3967,19 @@ static int blk_mq_init_hctx(struct request_queue *q,
>   		struct blk_mq_tag_set *set,
>   		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
>   {
> +	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
> +
> +	hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
> +	if (!hctx->fq)
> +		goto fail;
> +
>   	hctx->queue_num = hctx_idx;
>   
>   	hctx->tags = set->tags[hctx_idx];
>   
>   	if (set->ops->init_hctx &&
>   	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
> -		goto fail;
> +		goto fail_free_fq;
>   
>   	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
>   				hctx->numa_node))
> @@ -3987,6 +3996,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
>    exit_hctx:
>   	if (set->ops->exit_hctx)
>   		set->ops->exit_hctx(hctx, hctx_idx);
> + fail_free_fq:
> +	blk_free_flush_queue(hctx->fq);
> +	hctx->fq = NULL;
>    fail:
>   	return -1;
>   }
> @@ -4038,16 +4050,10 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
>   	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
>   	INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
>   
> -	hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
> -	if (!hctx->fq)
> -		goto free_bitmap;
> -
>   	blk_mq_hctx_kobj_init(hctx);
>   
>   	return hctx;
>   
> - free_bitmap:
> -	sbitmap_free(&hctx->ctx_map);
>    free_ctxs:
>   	kfree(hctx->ctxs);
>    free_cpumask:
> 



* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-01 11:44 ` [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators Ming Lei
@ 2025-08-04  6:30   ` Yu Kuai
  2025-08-04 11:32     ` Ming Lei
  2025-08-04  7:13   ` Hannes Reinecke
  2025-08-06  9:21   ` Yu Kuai
  2 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2025-08-04  6:30 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: John Garry, Sathya Prakash Veerichetty, yukuai (C)

Hi,

On 2025/08/01 19:44, Ming Lei wrote:
> Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
> around the tag iterators.
> 
> This is done by:
> 
> - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
> blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
> 
> - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
> 
> This change improves performance by replacing a spinlock with a more
> scalable SRCU lock, and fixes lockup issue in scsi_host_busy() in case of
> shost->host_blocked.
> 
> Meantime it becomes possible to use blk_mq_in_driver_rw() for io
> accounting.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-tag.c | 12 ++++++++----
>   block/blk-mq.c     | 24 ++++--------------------
>   2 files changed, 12 insertions(+), 24 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 6c2f5881e0de..7ae431077a32 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -256,13 +256,10 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
>   		unsigned int bitnr)
>   {
>   	struct request *rq;
> -	unsigned long flags;
>   
> -	spin_lock_irqsave(&tags->lock, flags);
>   	rq = tags->rqs[bitnr];
>   	if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
>   		rq = NULL;
> -	spin_unlock_irqrestore(&tags->lock, flags);
>   	return rq;
>   }
>
Just wondering, is the lockup problem due to tags->lock contention from
concurrent scsi_host_busy() calls?

> @@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>   		busy_tag_iter_fn *fn, void *priv)
>   {
>   	unsigned int flags = tagset->flags;
> -	int i, nr_tags;
> +	int i, nr_tags, srcu_idx;
> +
> +	srcu_idx = srcu_read_lock(&tagset->tags_srcu);
>   
>   	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
>   
> @@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>   			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>   					      BT_TAG_ITER_STARTED);
>   	}
> +	srcu_read_unlock(&tagset->tags_srcu, srcu_idx);

And should we add cond_resched() after finishing iteration of one tags?
Even with the SRCU change, it looks like a lockup is still possible with
many CPU cores & deep queue depth.

Thanks,
Kuai

>   }
>   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
>   
> @@ -499,6 +499,8 @@ EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
>   void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
>   		void *priv)
>   {
> +	int srcu_idx;
> +
>   	/*
>   	 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and hctx_table
>   	 * while the queue is frozen. So we can use q_usage_counter to avoid
> @@ -507,6 +509,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
>   	if (!percpu_ref_tryget(&q->q_usage_counter))
>   		return;
>   
> +	srcu_idx = srcu_read_lock(&q->tag_set->tags_srcu);
>   	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
>   		struct blk_mq_tags *tags = q->tag_set->shared_tags;
>   		struct sbitmap_queue *bresv = &tags->breserved_tags;
> @@ -536,6 +539,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
>   			bt_for_each(hctx, q, btags, fn, priv, false);
>   		}
>   	}
> +	srcu_read_unlock(&q->tag_set->tags_srcu, srcu_idx);
>   	blk_queue_exit(q);
>   }
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7b4ab8e398b6..43b15e58ffe1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3415,7 +3415,6 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
>   				    struct blk_mq_tags *tags)
>   {
>   	struct page *page;
> -	unsigned long flags;
>   
>   	/*
>   	 * There is no need to clear mapping if driver tags is not initialized
> @@ -3439,15 +3438,6 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
>   			}
>   		}
>   	}
> -
> -	/*
> -	 * Wait until all pending iteration is done.
> -	 *
> -	 * Request reference is cleared and it is guaranteed to be observed
> -	 * after the ->lock is released.
> -	 */
> -	spin_lock_irqsave(&drv_tags->lock, flags);
> -	spin_unlock_irqrestore(&drv_tags->lock, flags);
>   }
>   
>   void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> @@ -3670,8 +3660,12 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
>   	struct rq_iter_data data = {
>   		.hctx	= hctx,
>   	};
> +	int srcu_idx;
>   
> +	srcu_idx = srcu_read_lock(&hctx->queue->tag_set->tags_srcu);
>   	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> +	srcu_read_unlock(&hctx->queue->tag_set->tags_srcu, srcu_idx);
> +
>   	return data.has_rq;
>   }
>   
> @@ -3891,7 +3885,6 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
>   		unsigned int queue_depth, struct request *flush_rq)
>   {
>   	int i;
> -	unsigned long flags;
>   
>   	/* The hw queue may not be mapped yet */
>   	if (!tags)
> @@ -3901,15 +3894,6 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
>   
>   	for (i = 0; i < queue_depth; i++)
>   		cmpxchg(&tags->rqs[i], flush_rq, NULL);
> -
> -	/*
> -	 * Wait until all pending iteration is done.
> -	 *
> -	 * Request reference is cleared and it is guaranteed to be observed
> -	 * after the ->lock is released.
> -	 */
> -	spin_lock_irqsave(&tags->lock, flags);
> -	spin_unlock_irqrestore(&tags->lock, flags);
>   }
>   
>   static void blk_free_flush_queue_callback(struct rcu_head *head)
> 



* Re: [PATCH 1/5] blk-mq: Move flush queue allocation into blk_mq_init_hctx()
  2025-08-01 11:44 ` [PATCH 1/5] blk-mq: Move flush queue allocation into blk_mq_init_hctx() Ming Lei
  2025-08-04  6:06   ` Yu Kuai
@ 2025-08-04  7:07   ` Hannes Reinecke
  1 sibling, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2025-08-04  7:07 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Yu Kuai, John Garry, Sathya Prakash Veerichetty

On 8/1/25 13:44, Ming Lei wrote:
> Move flush queue allocation into blk_mq_init_hctx() and its release into
> blk_mq_exit_hctx(), and prepare for replacing tags->lock with SRCU to
> draining inflight request walking. blk_mq_exit_hctx() is the last chance
> for us to get valid `tag_set` reference, and we need to add one SRCU to
> `tag_set` for freeing flush request via call_srcu().
> 
> It is safe to move flush queue & request release into blk_mq_exit_hctx(),
> because blk_mq_clear_flush_rq_mapping() clears the flush request
> reference int driver tags inflight request table, meantime inflight
> request walking is drained.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-sysfs.c |  1 -
>   block/blk-mq.c       | 20 +++++++++++++-------
>   2 files changed, 13 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH 2/5] blk-mq: Pass tag_set to blk_mq_free_rq_map/tags
  2025-08-01 11:44 ` [PATCH 2/5] blk-mq: Pass tag_set to blk_mq_free_rq_map/tags Ming Lei
@ 2025-08-04  7:08   ` Hannes Reinecke
  2025-08-05  7:48   ` Yu Kuai
  1 sibling, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2025-08-04  7:08 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Yu Kuai, John Garry, Sathya Prakash Veerichetty

On 8/1/25 13:44, Ming Lei wrote:
> To prepare for converting the tag->rqs freeing to be SRCU-based, the
> tag_set is needed in the freeing helper functions.
> 
> This patch adds 'struct blk_mq_tag_set *' as the first parameter to
> blk_mq_free_rq_map() and blk_mq_free_tags(), and updates all their call
> sites.
> 
> This allows access to the tag_set's SRCU structure in the next step,
> which will be used to free the tag maps after a grace period.
> 
> No functional change is intended in this patch.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-sched.c |  4 ++--
>   block/blk-mq-tag.c   |  2 +-
>   block/blk-mq.c       | 10 +++++-----
>   block/blk-mq.h       |  4 ++--
>   4 files changed, 10 insertions(+), 10 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH 3/5] blk-mq: Defer freeing of tags page_list to SRCU callback
  2025-08-01 11:44 ` [PATCH 3/5] blk-mq: Defer freeing of tags page_list to SRCU callback Ming Lei
@ 2025-08-04  7:09   ` Hannes Reinecke
  2025-08-06  9:15   ` Yu Kuai
  1 sibling, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2025-08-04  7:09 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Yu Kuai, John Garry, Sathya Prakash Veerichetty

On 8/1/25 13:44, Ming Lei wrote:
> Tag iterators can race with the freeing of the request pages(tags->page_list),
> potentially leading to use-after-free issues.
> 
> Defer the freeing of the page list and the tags structure itself until
> after an SRCU grace period has passed. This ensures that any concurrent
> tag iterators have completed before the memory is released. With this
> way, we can replace the big tags->lock in tags iterator code path with
> srcu for solving the issue.
> 
> This is achieved by:
> - Adding a new `srcu_struct tags_srcu` to `blk_mq_tag_set` to protect
>    tag map iteration.
> - Adding an `rcu_head` to `struct blk_mq_tags` to be used with
>    `call_srcu`.
> - Moving the page list freeing logic and the `kfree(tags)` call into a
>    new callback function, `blk_mq_free_tags_callback`.
> - In `blk_mq_free_tags`, invoking `call_srcu` to schedule the new
>    callback for deferred execution.
> 
> The read-side protection for the tag iterators will be added in a
> subsequent patch.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-tag.c     | 24 +++++++++++++++++++++++-
>   block/blk-mq.c         | 26 +++++++++++++-------------
>   include/linux/blk-mq.h |  2 ++
>   3 files changed, 38 insertions(+), 14 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH 4/5] blk-mq: Defer freeing flush queue to SRCU callback
  2025-08-01 11:44 ` [PATCH 4/5] blk-mq: Defer freeing flush queue " Ming Lei
@ 2025-08-04  7:11   ` Hannes Reinecke
  2025-08-06  9:17   ` Yu Kuai
  1 sibling, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2025-08-04  7:11 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Yu Kuai, John Garry, Sathya Prakash Veerichetty

On 8/1/25 13:44, Ming Lei wrote:
> The freeing of the flush queue/request in blk_mq_exit_hctx() can race with
> tag iterators that may still be accessing it. To prevent a potential
> use-after-free, the deallocation should be deferred until after a grace
> period. With this way, we can replace the big tags->lock in tags iterator
> code path with srcu for solving the issue.
> 
> This patch introduces an SRCU-based deferred freeing mechanism for the
> flush queue.
> 
> The changes include:
> - Adding a `rcu_head` to `struct blk_flush_queue`.
> - Creating a new callback function, `blk_free_flush_queue_callback`,
>    to handle the actual freeing.
> - Replacing the direct call to `blk_free_flush_queue()` in
>    `blk_mq_exit_hctx()` with `call_srcu()`, using the `tags_srcu`
>    instance to ensure synchronization with tag iterators.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 11 ++++++++++-
>   block/blk.h    |  1 +
>   2 files changed, 11 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-01 11:44 ` [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators Ming Lei
  2025-08-04  6:30   ` Yu Kuai
@ 2025-08-04  7:13   ` Hannes Reinecke
  2025-08-04 11:35     ` Ming Lei
  2025-08-06  9:21   ` Yu Kuai
  2 siblings, 1 reply; 30+ messages in thread
From: Hannes Reinecke @ 2025-08-04  7:13 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Yu Kuai, John Garry, Sathya Prakash Veerichetty

On 8/1/25 13:44, Ming Lei wrote:
> Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
> around the tag iterators.
> 
> This is done by:
> 
> - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
> blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
> 
> - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
> 
> This change improves performance by replacing a spinlock with a more
> scalable SRCU lock, and fixes lockup issue in scsi_host_busy() in case of
> shost->host_blocked.
> 
> Meantime it becomes possible to use blk_mq_in_driver_rw() for io
> accounting.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-tag.c | 12 ++++++++----
>   block/blk-mq.c     | 24 ++++--------------------
>   2 files changed, 12 insertions(+), 24 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 6c2f5881e0de..7ae431077a32 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -256,13 +256,10 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
>   		unsigned int bitnr)
>   {
>   	struct request *rq;
> -	unsigned long flags;
>   
> -	spin_lock_irqsave(&tags->lock, flags);
>   	rq = tags->rqs[bitnr];
>   	if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
>   		rq = NULL;
> -	spin_unlock_irqrestore(&tags->lock, flags);
>   	return rq;
>   }
>   
> @@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>   		busy_tag_iter_fn *fn, void *priv)
>   {
>   	unsigned int flags = tagset->flags;
> -	int i, nr_tags;
> +	int i, nr_tags, srcu_idx;
> +
> +	srcu_idx = srcu_read_lock(&tagset->tags_srcu);
>   
>   	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
>   
> @@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>   			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>   					      BT_TAG_ITER_STARTED);
>   	}
> +	srcu_read_unlock(&tagset->tags_srcu, srcu_idx);
>   }
>   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
>   
> @@ -499,6 +499,8 @@ EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
>   void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
>   		void *priv)
>   {
> +	int srcu_idx;
> +
>   	/*
>   	 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and hctx_table
>   	 * while the queue is frozen. So we can use q_usage_counter to avoid
> @@ -507,6 +509,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
>   	if (!percpu_ref_tryget(&q->q_usage_counter))
>   		return;
>   
> +	srcu_idx = srcu_read_lock(&q->tag_set->tags_srcu);
>   	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
>   		struct blk_mq_tags *tags = q->tag_set->shared_tags;
>   		struct sbitmap_queue *bresv = &tags->breserved_tags;
> @@ -536,6 +539,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
>   			bt_for_each(hctx, q, btags, fn, priv, false);
>   		}
>   	}
> +	srcu_read_unlock(&q->tag_set->tags_srcu, srcu_idx);
>   	blk_queue_exit(q);
>   }
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7b4ab8e398b6..43b15e58ffe1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3415,7 +3415,6 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
>   				    struct blk_mq_tags *tags)
>   {
>   	struct page *page;
> -	unsigned long flags;
>   
>   	/*
>   	 * There is no need to clear mapping if driver tags is not initialized
> @@ -3439,15 +3438,6 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
>   			}
>   		}
>   	}
> -
> -	/*
> -	 * Wait until all pending iteration is done.
> -	 *
> -	 * Request reference is cleared and it is guaranteed to be observed
> -	 * after the ->lock is released.
> -	 */
> -	spin_lock_irqsave(&drv_tags->lock, flags);
> -	spin_unlock_irqrestore(&drv_tags->lock, flags);
>   }
>   
>   void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> @@ -3670,8 +3660,12 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
>   	struct rq_iter_data data = {
>   		.hctx	= hctx,
>   	};
> +	int srcu_idx;
>   
> +	srcu_idx = srcu_read_lock(&hctx->queue->tag_set->tags_srcu);
>   	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> +	srcu_read_unlock(&hctx->queue->tag_set->tags_srcu, srcu_idx);
> +
>   	return data.has_rq;
>   }
>   
> @@ -3891,7 +3885,6 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
>   		unsigned int queue_depth, struct request *flush_rq)
>   {
>   	int i;
> -	unsigned long flags;
>   
>   	/* The hw queue may not be mapped yet */
>   	if (!tags)
> @@ -3901,15 +3894,6 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
>   
>   	for (i = 0; i < queue_depth; i++)
>   		cmpxchg(&tags->rqs[i], flush_rq, NULL);
> -
> -	/*
> -	 * Wait until all pending iteration is done.
> -	 *
> -	 * Request reference is cleared and it is guaranteed to be observed
> -	 * after the ->lock is released.
> -	 */
> -	spin_lock_irqsave(&tags->lock, flags);
> -	spin_unlock_irqrestore(&tags->lock, flags);
>   }
>   
>   static void blk_free_flush_queue_callback(struct rcu_head *head)

While this looks good, I do wonder what happened to the 'fq' srcu.
Don't we need to insert an srcu_read_lock() when we're trying to
access it?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-04  6:30   ` Yu Kuai
@ 2025-08-04 11:32     ` Ming Lei
  2025-08-05  8:33       ` Yu Kuai
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-08-04 11:32 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Jens Axboe, linux-block, John Garry, Sathya Prakash Veerichetty,
	yukuai (C)

On Mon, Aug 04, 2025 at 02:30:43PM +0800, Yu Kuai wrote:
> Hi,
> 
> On 2025/08/01 19:44, Ming Lei wrote:
> > Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
> > around the tag iterators.
> > 
> > This is done by:
> > 
> > - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
> > blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
> > 
> > - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
> > 
> > This change improves performance by replacing a spinlock with a more
> > scalable SRCU lock, and fixes lockup issue in scsi_host_busy() in case of
> > shost->host_blocked.
> > 
> > Meantime it becomes possible to use blk_mq_in_driver_rw() for io
> > accounting.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq-tag.c | 12 ++++++++----
> >   block/blk-mq.c     | 24 ++++--------------------
> >   2 files changed, 12 insertions(+), 24 deletions(-)
> > 
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index 6c2f5881e0de..7ae431077a32 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -256,13 +256,10 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
> >   		unsigned int bitnr)
> >   {
> >   	struct request *rq;
> > -	unsigned long flags;
> > -	spin_lock_irqsave(&tags->lock, flags);
> >   	rq = tags->rqs[bitnr];
> >   	if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
> >   		rq = NULL;
> > -	spin_unlock_irqrestore(&tags->lock, flags);
> >   	return rq;
> >   }
> > 
> Just wonder, does the lockup problem due to the tags->lock contention by
> concurrent scsi_host_busy?

Yes.

> 
> > @@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> >   		busy_tag_iter_fn *fn, void *priv)
> >   {
> >   	unsigned int flags = tagset->flags;
> > -	int i, nr_tags;
> > +	int i, nr_tags, srcu_idx;
> > +
> > +	srcu_idx = srcu_read_lock(&tagset->tags_srcu);
> >   	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
> > @@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> >   			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> >   					      BT_TAG_ITER_STARTED);
> >   	}
> > +	srcu_read_unlock(&tagset->tags_srcu, srcu_idx);
> 
> And should we add cond_resched() after finish interating one tags, even
> with the srcu change, looks like it's still possible to lockup with
> big cpu cores & deep queue depth.

The main trouble is from the big tags->lock.

IMO it isn't needed, because the max queue depth is just 10K, which is
already much bigger than any actual queue depth. We can add it when
someone shows it is really needed.



Thanks,
Ming



* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-04  7:13   ` Hannes Reinecke
@ 2025-08-04 11:35     ` Ming Lei
  2025-08-04 11:45       ` Hannes Reinecke
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-08-04 11:35 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Yu Kuai, John Garry,
	Sathya Prakash Veerichetty

On Mon, Aug 04, 2025 at 09:13:08AM +0200, Hannes Reinecke wrote:
> On 8/1/25 13:44, Ming Lei wrote:
> > Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
> > around the tag iterators.
> > 
> > This is done by:
> > 
> > - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
> > blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
> > 
> > - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
> > 
> > This change improves performance by replacing a spinlock with a more
> > scalable SRCU lock, and fixes lockup issue in scsi_host_busy() in case of
> > shost->host_blocked.
> > 
> > Meantime it becomes possible to use blk_mq_in_driver_rw() for io
> > accounting.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq-tag.c | 12 ++++++++----
> >   block/blk-mq.c     | 24 ++++--------------------
> >   2 files changed, 12 insertions(+), 24 deletions(-)
> > 
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index 6c2f5881e0de..7ae431077a32 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -256,13 +256,10 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
> >   		unsigned int bitnr)
> >   {
> >   	struct request *rq;
> > -	unsigned long flags;
> > -	spin_lock_irqsave(&tags->lock, flags);
> >   	rq = tags->rqs[bitnr];
> >   	if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
> >   		rq = NULL;
> > -	spin_unlock_irqrestore(&tags->lock, flags);
> >   	return rq;
> >   }
> > @@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> >   		busy_tag_iter_fn *fn, void *priv)
> >   {
> >   	unsigned int flags = tagset->flags;
> > -	int i, nr_tags;
> > +	int i, nr_tags, srcu_idx;
> > +
> > +	srcu_idx = srcu_read_lock(&tagset->tags_srcu);
> >   	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
> > @@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> >   			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> >   					      BT_TAG_ITER_STARTED);
> >   	}
> > +	srcu_read_unlock(&tagset->tags_srcu, srcu_idx);
> >   }
> >   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
> > @@ -499,6 +499,8 @@ EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
> >   void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
> >   		void *priv)
> >   {
> > +	int srcu_idx;
> > +
> >   	/*
> >   	 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and hctx_table
> >   	 * while the queue is frozen. So we can use q_usage_counter to avoid
> > @@ -507,6 +509,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
> >   	if (!percpu_ref_tryget(&q->q_usage_counter))
> >   		return;
> > +	srcu_idx = srcu_read_lock(&q->tag_set->tags_srcu);
> >   	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
> >   		struct blk_mq_tags *tags = q->tag_set->shared_tags;
> >   		struct sbitmap_queue *bresv = &tags->breserved_tags;
> > @@ -536,6 +539,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
> >   			bt_for_each(hctx, q, btags, fn, priv, false);
> >   		}
> >   	}
> > +	srcu_read_unlock(&q->tag_set->tags_srcu, srcu_idx);
> >   	blk_queue_exit(q);
> >   }
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 7b4ab8e398b6..43b15e58ffe1 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -3415,7 +3415,6 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
> >   				    struct blk_mq_tags *tags)
> >   {
> >   	struct page *page;
> > -	unsigned long flags;
> >   	/*
> >   	 * There is no need to clear mapping if driver tags is not initialized
> > @@ -3439,15 +3438,6 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
> >   			}
> >   		}
> >   	}
> > -
> > -	/*
> > -	 * Wait until all pending iteration is done.
> > -	 *
> > -	 * Request reference is cleared and it is guaranteed to be observed
> > -	 * after the ->lock is released.
> > -	 */
> > -	spin_lock_irqsave(&drv_tags->lock, flags);
> > -	spin_unlock_irqrestore(&drv_tags->lock, flags);
> >   }
> >   void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> > @@ -3670,8 +3660,12 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> >   	struct rq_iter_data data = {
> >   		.hctx	= hctx,
> >   	};
> > +	int srcu_idx;
> > +	srcu_idx = srcu_read_lock(&hctx->queue->tag_set->tags_srcu);
> >   	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> > +	srcu_read_unlock(&hctx->queue->tag_set->tags_srcu, srcu_idx);
> > +
> >   	return data.has_rq;
> >   }
> > @@ -3891,7 +3885,6 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
> >   		unsigned int queue_depth, struct request *flush_rq)
> >   {
> >   	int i;
> > -	unsigned long flags;
> >   	/* The hw queue may not be mapped yet */
> >   	if (!tags)
> > @@ -3901,15 +3894,6 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
> >   	for (i = 0; i < queue_depth; i++)
> >   		cmpxchg(&tags->rqs[i], flush_rq, NULL);
> > -
> > -	/*
> > -	 * Wait until all pending iteration is done.
> > -	 *
> > -	 * Request reference is cleared and it is guaranteed to be observed
> > -	 * after the ->lock is released.
> > -	 */
> > -	spin_lock_irqsave(&tags->lock, flags);
> > -	spin_unlock_irqrestore(&tags->lock, flags);
> >   }
> >   static void blk_free_flush_queue_callback(struct rcu_head *head)
> 
> While this looks good, I do wonder what happened to the 'fq' srcu.
> Don't we need to insert an srcu_read_lock() when we're trying to
> access it?

That is exactly the srcu read lock added in this patch.
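
To spell out the pairing (rough sketch only, using the names this
series adds; not a verbatim copy of the patches):

	/* read side (this patch): any walk that may still see the flush rq */
	srcu_idx = srcu_read_lock(&hctx->queue->tag_set->tags_srcu);
	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
	srcu_read_unlock(&hctx->queue->tag_set->tags_srcu, srcu_idx);

	/* write side (patch 4/5): blk_mq_exit_hctx() frees the flush queue
	 * only after a tags_srcu grace period, so the walk above cannot
	 * touch freed memory */
	call_srcu(&set->tags_srcu, &hctx->fq->rcu_head,
		  blk_free_flush_queue_callback);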

Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-04 11:35     ` Ming Lei
@ 2025-08-04 11:45       ` Hannes Reinecke
  0 siblings, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2025-08-04 11:45 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Yu Kuai, Sathya Prakash Veerichetty

On 8/4/25 13:35, Ming Lei wrote:
> On Mon, Aug 04, 2025 at 09:13:08AM +0200, Hannes Reinecke wrote:
>> On 8/1/25 13:44, Ming Lei wrote:
>>> Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
>>> around the tag iterators.
>>>
>>> This is done by:
>>>
>>> - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
>>> blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
>>>
>>> - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
>>>
>>> This change improves performance by replacing a spinlock with a more
>>> scalable SRCU lock, and fixes lockup issue in scsi_host_busy() in case of
>>> shost->host_blocked.
>>>
>>> Meantime it becomes possible to use blk_mq_in_driver_rw() for io
>>> accounting.
>>>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>    block/blk-mq-tag.c | 12 ++++++++----
>>>    block/blk-mq.c     | 24 ++++--------------------
>>>    2 files changed, 12 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>> index 6c2f5881e0de..7ae431077a32 100644
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -256,13 +256,10 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
>>>    		unsigned int bitnr)
>>>    {
>>>    	struct request *rq;
>>> -	unsigned long flags;
>>> -	spin_lock_irqsave(&tags->lock, flags);
>>>    	rq = tags->rqs[bitnr];
>>>    	if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
>>>    		rq = NULL;
>>> -	spin_unlock_irqrestore(&tags->lock, flags);
>>>    	return rq;
>>>    }
>>> @@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>>    		busy_tag_iter_fn *fn, void *priv)
>>>    {
>>>    	unsigned int flags = tagset->flags;
>>> -	int i, nr_tags;
>>> +	int i, nr_tags, srcu_idx;
>>> +
>>> +	srcu_idx = srcu_read_lock(&tagset->tags_srcu);
>>>    	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
>>> @@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>>    			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>>>    					      BT_TAG_ITER_STARTED);
>>>    	}
>>> +	srcu_read_unlock(&tagset->tags_srcu, srcu_idx);
>>>    }
>>>    EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
>>> @@ -499,6 +499,8 @@ EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
>>>    void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
>>>    		void *priv)
>>>    {
>>> +	int srcu_idx;
>>> +
>>>    	/*
>>>    	 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and hctx_table
>>>    	 * while the queue is frozen. So we can use q_usage_counter to avoid
>>> @@ -507,6 +509,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
>>>    	if (!percpu_ref_tryget(&q->q_usage_counter))
>>>    		return;
>>> +	srcu_idx = srcu_read_lock(&q->tag_set->tags_srcu);
>>>    	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
>>>    		struct blk_mq_tags *tags = q->tag_set->shared_tags;
>>>    		struct sbitmap_queue *bresv = &tags->breserved_tags;
>>> @@ -536,6 +539,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
>>>    			bt_for_each(hctx, q, btags, fn, priv, false);
>>>    		}
>>>    	}
>>> +	srcu_read_unlock(&q->tag_set->tags_srcu, srcu_idx);
>>>    	blk_queue_exit(q);
>>>    }
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 7b4ab8e398b6..43b15e58ffe1 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -3415,7 +3415,6 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
>>>    				    struct blk_mq_tags *tags)
>>>    {
>>>    	struct page *page;
>>> -	unsigned long flags;
>>>    	/*
>>>    	 * There is no need to clear mapping if driver tags is not initialized
>>> @@ -3439,15 +3438,6 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
>>>    			}
>>>    		}
>>>    	}
>>> -
>>> -	/*
>>> -	 * Wait until all pending iteration is done.
>>> -	 *
>>> -	 * Request reference is cleared and it is guaranteed to be observed
>>> -	 * after the ->lock is released.
>>> -	 */
>>> -	spin_lock_irqsave(&drv_tags->lock, flags);
>>> -	spin_unlock_irqrestore(&drv_tags->lock, flags);
>>>    }
>>>    void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>> @@ -3670,8 +3660,12 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
>>>    	struct rq_iter_data data = {
>>>    		.hctx	= hctx,
>>>    	};
>>> +	int srcu_idx;
>>> +	srcu_idx = srcu_read_lock(&hctx->queue->tag_set->tags_srcu);
>>>    	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
>>> +	srcu_read_unlock(&hctx->queue->tag_set->tags_srcu, srcu_idx);
>>> +
>>>    	return data.has_rq;
>>>    }
>>> @@ -3891,7 +3885,6 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
>>>    		unsigned int queue_depth, struct request *flush_rq)
>>>    {
>>>    	int i;
>>> -	unsigned long flags;
>>>    	/* The hw queue may not be mapped yet */
>>>    	if (!tags)
>>> @@ -3901,15 +3894,6 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
>>>    	for (i = 0; i < queue_depth; i++)
>>>    		cmpxchg(&tags->rqs[i], flush_rq, NULL);
>>> -
>>> -	/*
>>> -	 * Wait until all pending iteration is done.
>>> -	 *
>>> -	 * Request reference is cleared and it is guaranteed to be observed
>>> -	 * after the ->lock is released.
>>> -	 */
>>> -	spin_lock_irqsave(&tags->lock, flags);
>>> -	spin_unlock_irqrestore(&tags->lock, flags);
>>>    }
>>>    static void blk_free_flush_queue_callback(struct rcu_head *head)
>>
>> While this looks good, I do wonder what happened to the 'fq' srcu.
>> Don't we need to insert an srcu_read_lock() when we're trying to
>> access it?
> 
> That is exactly the srcu read lock added in this patch.
> 

Ah, indeed. Misread the patch.

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] blk-mq: Pass tag_set to blk_mq_free_rq_map/tags
  2025-08-01 11:44 ` [PATCH 2/5] blk-mq: Pass tag_set to blk_mq_free_rq_map/tags Ming Lei
  2025-08-04  7:08   ` Hannes Reinecke
@ 2025-08-05  7:48   ` Yu Kuai
  1 sibling, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-05  7:48 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: John Garry, Sathya Prakash Veerichetty, yukuai (C)

On 2025/08/01 19:44, Ming Lei wrote:
> To prepare for converting the tag->rqs freeing to be SRCU-based, the
> tag_set is needed in the freeing helper functions.
> 
> This patch adds 'struct blk_mq_tag_set *' as the first parameter to
> blk_mq_free_rq_map() and blk_mq_free_tags(), and updates all their call
> sites.
> 
> This allows access to the tag_set's SRCU structure in the next step,
> which will be used to free the tag maps after a grace period.
> 
> No functional change is intended in this patch.
> 
> Signed-off-by: Ming Lei<ming.lei@redhat.com>
> ---
>   block/blk-mq-sched.c |  4 ++--
>   block/blk-mq-tag.c   |  2 +-
>   block/blk-mq.c       | 10 +++++-----
>   block/blk-mq.h       |  4 ++--
>   4 files changed, 10 insertions(+), 10 deletions(-)

LGTM
Reviewed-by: Yu Kuai <yukuai3@huawei.com>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-04 11:32     ` Ming Lei
@ 2025-08-05  8:33       ` Yu Kuai
  2025-08-05  8:38         ` Yu Kuai
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2025-08-05  8:33 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: Jens Axboe, linux-block, John Garry, Sathya Prakash Veerichetty,
	yukuai (C)

Hi,

On 2025/08/04 19:32, Ming Lei wrote:
> On Mon, Aug 04, 2025 at 02:30:43PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/08/01 19:44, Ming Lei 写道:
>>> Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
>>> around the tag iterators.
>>>
>>> This is done by:
>>>
>>> - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
>>> blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
>>>
>>> - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
>>>
>>> This change improves performance by replacing a spinlock with a more
>>> scalable SRCU lock, and fixes lockup issue in scsi_host_busy() in case of
>>> shost->host_blocked.
>>>
>>> Meantime it becomes possible to use blk_mq_in_driver_rw() for io
>>> accounting.
>>>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>    block/blk-mq-tag.c | 12 ++++++++----
>>>    block/blk-mq.c     | 24 ++++--------------------
>>>    2 files changed, 12 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>> index 6c2f5881e0de..7ae431077a32 100644
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -256,13 +256,10 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
>>>    		unsigned int bitnr)
>>>    {
>>>    	struct request *rq;
>>> -	unsigned long flags;
>>> -	spin_lock_irqsave(&tags->lock, flags);
>>>    	rq = tags->rqs[bitnr];
>>>    	if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
>>>    		rq = NULL;
>>> -	spin_unlock_irqrestore(&tags->lock, flags);
>>>    	return rq;
>>>    }
>>>
>> Just wonder, does the lockup problem due to the tags->lock contention by
>> concurrent scsi_host_busy?
> 
> Yes.
> 
>>
>>> @@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>>    		busy_tag_iter_fn *fn, void *priv)
>>>    {
>>>    	unsigned int flags = tagset->flags;
>>> -	int i, nr_tags;
>>> +	int i, nr_tags, srcu_idx;
>>> +
>>> +	srcu_idx = srcu_read_lock(&tagset->tags_srcu);
>>>    	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
>>> @@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>>    			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>>>    					      BT_TAG_ITER_STARTED);
>>>    	}
>>> +	srcu_read_unlock(&tagset->tags_srcu, srcu_idx);
>>
>> And should we add cond_resched() after finish interating one tags, even
>> with the srcu change, looks like it's still possible to lockup with
>> big cpu cores & deep queue depth.
> 
> The main trouble is from the big tags->lock.
> 
> IMO it isn't needed, because max queue depth is just 10K, which is much
> bigger than actual queue depth. We can add it when someone shows it is
> really needed.

If we don't want this, is srcu really needed here? It looks like just
using rcu_read_lock()/rcu_read_unlock() to protect
blk_mq_find_and_get_req() would be enough.

Thanks,
Kuai

> 
> 
> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-05  8:33       ` Yu Kuai
@ 2025-08-05  8:38         ` Yu Kuai
  2025-08-05  8:48           ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2025-08-05  8:38 UTC (permalink / raw)
  To: Yu Kuai, Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Sathya Prakash Veerichetty,
	yukuai (C)

Hi,

On 2025/08/05 16:33, Yu Kuai wrote:
> Hi,
> 
> 在 2025/08/04 19:32, Ming Lei 写道:
>> On Mon, Aug 04, 2025 at 02:30:43PM +0800, Yu Kuai wrote:
>>> Hi,
>>>
>>> 在 2025/08/01 19:44, Ming Lei 写道:
>>>> Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read 
>>>> lock
>>>> around the tag iterators.
>>>>
>>>> This is done by:
>>>>
>>>> - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
>>>> blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
>>>>
>>>> - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
>>>>
>>>> This change improves performance by replacing a spinlock with a more
>>>> scalable SRCU lock, and fixes lockup issue in scsi_host_busy() in 
>>>> case of
>>>> shost->host_blocked.
>>>>
>>>> Meantime it becomes possible to use blk_mq_in_driver_rw() for io
>>>> accounting.
>>>>
>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>> ---
>>>>    block/blk-mq-tag.c | 12 ++++++++----
>>>>    block/blk-mq.c     | 24 ++++--------------------
>>>>    2 files changed, 12 insertions(+), 24 deletions(-)
>>>>
>>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>>> index 6c2f5881e0de..7ae431077a32 100644
>>>> --- a/block/blk-mq-tag.c
>>>> +++ b/block/blk-mq-tag.c
>>>> @@ -256,13 +256,10 @@ static struct request 
>>>> *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
>>>>            unsigned int bitnr)
>>>>    {
>>>>        struct request *rq;
>>>> -    unsigned long flags;
>>>> -    spin_lock_irqsave(&tags->lock, flags);
>>>>        rq = tags->rqs[bitnr];
>>>>        if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
>>>>            rq = NULL;
>>>> -    spin_unlock_irqrestore(&tags->lock, flags);
>>>>        return rq;
>>>>    }
>>>>
>>> Just wonder, does the lockup problem due to the tags->lock contention by
>>> concurrent scsi_host_busy?
>>
>> Yes.
>>
>>>
>>>> @@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct 
>>>> blk_mq_tag_set *tagset,
>>>>            busy_tag_iter_fn *fn, void *priv)
>>>>    {
>>>>        unsigned int flags = tagset->flags;
>>>> -    int i, nr_tags;
>>>> +    int i, nr_tags, srcu_idx;
>>>> +
>>>> +    srcu_idx = srcu_read_lock(&tagset->tags_srcu);
>>>>        nr_tags = blk_mq_is_shared_tags(flags) ? 1 : 
>>>> tagset->nr_hw_queues;
>>>> @@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct 
>>>> blk_mq_tag_set *tagset,
>>>>                __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>>>>                              BT_TAG_ITER_STARTED);
>>>>        }
>>>> +    srcu_read_unlock(&tagset->tags_srcu, srcu_idx);
>>>
>>> And should we add cond_resched() after finish interating one tags, even
>>> with the srcu change, looks like it's still possible to lockup with
>>> big cpu cores & deep queue depth.
>>
>> The main trouble is from the big tags->lock.
>>
>> IMO it isn't needed, because max queue depth is just 10K, which is much
>> bigger than actual queue depth. We can add it when someone shows it is
>> really needed.
> 
> If we don't want this, why not using srcu here? Looks like just use
> rcu_read_lock and rcu_read_unlock to protect blk_mq_find_and_get_req()
> will be enough.

Like the following patch:

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d880c50629d6..e2381ee9747d 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -255,11 +255,11 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
         struct request *rq;
         unsigned long flags;

-       spin_lock_irqsave(&tags->lock, flags);
+       rcu_read_lock();
         rq = tags->rqs[bitnr];
         if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
                 rq = NULL;
-       spin_unlock_irqrestore(&tags->lock, flags);
+       rcu_read_unlock();
         return rq;
  }

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b1d81839679f..a70959cad692 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3442,12 +3442,8 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,

         /*
          * Wait until all pending iteration is done.
-        *
-        * Request reference is cleared and it is guaranteed to be observed
-        * after the ->lock is released.
          */
-       spin_lock_irqsave(&drv_tags->lock, flags);
-       spin_unlock_irqrestore(&drv_tags->lock, flags);
+       synchronize_rcu();
  }

  void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
@@ -3912,12 +3908,8 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,

         /*
          * Wait until all pending iteration is done.
-        *
-        * Request reference is cleared and it is guaranteed to be observed
-        * after the ->lock is released.
          */
-       spin_lock_irqsave(&tags->lock, flags);
-       spin_unlock_irqrestore(&tags->lock, flags);
+       synchronize_rcu();
  }

  /* hctx->ctxs will be freed in queue's release handler */

> 
> Thanks,
> Kuai
> 
>>
>>
>>
>> Thanks,
>> Ming
>>
>>
>> .
>>
> 
> .
> 


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-05  8:38         ` Yu Kuai
@ 2025-08-05  8:48           ` Ming Lei
  2025-08-06  1:06             ` Yu Kuai
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-08-05  8:48 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Jens Axboe, linux-block, John Garry, Sathya Prakash Veerichetty,
	yukuai (C)

On Tue, Aug 05, 2025 at 04:38:56PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/08/05 16:33, Yu Kuai 写道:
> > Hi,
> > 
> > 在 2025/08/04 19:32, Ming Lei 写道:
> > > On Mon, Aug 04, 2025 at 02:30:43PM +0800, Yu Kuai wrote:
> > > > Hi,
> > > > 
> > > > 在 2025/08/01 19:44, Ming Lei 写道:
> > > > > Replace the spinlock in blk_mq_find_and_get_req() with an
> > > > > SRCU read lock
> > > > > around the tag iterators.
> > > > > 
> > > > > This is done by:
> > > > > 
> > > > > - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
> > > > > blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
> > > > > 
> > > > > - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
> > > > > 
> > > > > This change improves performance by replacing a spinlock with a more
> > > > > scalable SRCU lock, and fixes lockup issue in
> > > > > scsi_host_busy() in case of
> > > > > shost->host_blocked.
> > > > > 
> > > > > Meantime it becomes possible to use blk_mq_in_driver_rw() for io
> > > > > accounting.
> > > > > 
> > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > ---
> > > > >    block/blk-mq-tag.c | 12 ++++++++----
> > > > >    block/blk-mq.c     | 24 ++++--------------------
> > > > >    2 files changed, 12 insertions(+), 24 deletions(-)
> > > > > 
> > > > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > > > > index 6c2f5881e0de..7ae431077a32 100644
> > > > > --- a/block/blk-mq-tag.c
> > > > > +++ b/block/blk-mq-tag.c
> > > > > @@ -256,13 +256,10 @@ static struct request
> > > > > *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
> > > > >            unsigned int bitnr)
> > > > >    {
> > > > >        struct request *rq;
> > > > > -    unsigned long flags;
> > > > > -    spin_lock_irqsave(&tags->lock, flags);
> > > > >        rq = tags->rqs[bitnr];
> > > > >        if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
> > > > >            rq = NULL;
> > > > > -    spin_unlock_irqrestore(&tags->lock, flags);
> > > > >        return rq;
> > > > >    }
> > > > > 
> > > > Just wonder, does the lockup problem due to the tags->lock contention by
> > > > concurrent scsi_host_busy?
> > > 
> > > Yes.
> > > 
> > > > 
> > > > > @@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct
> > > > > blk_mq_tag_set *tagset,
> > > > >            busy_tag_iter_fn *fn, void *priv)
> > > > >    {
> > > > >        unsigned int flags = tagset->flags;
> > > > > -    int i, nr_tags;
> > > > > +    int i, nr_tags, srcu_idx;
> > > > > +
> > > > > +    srcu_idx = srcu_read_lock(&tagset->tags_srcu);
> > > > >        nr_tags = blk_mq_is_shared_tags(flags) ? 1 :
> > > > > tagset->nr_hw_queues;
> > > > > @@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct
> > > > > blk_mq_tag_set *tagset,
> > > > >                __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> > > > >                              BT_TAG_ITER_STARTED);
> > > > >        }
> > > > > +    srcu_read_unlock(&tagset->tags_srcu, srcu_idx);
> > > > 
> > > > And should we add cond_resched() after finish interating one tags, even
> > > > with the srcu change, looks like it's still possible to lockup with
> > > > big cpu cores & deep queue depth.
> > > 
> > > The main trouble is from the big tags->lock.
> > > 
> > > IMO it isn't needed, because max queue depth is just 10K, which is much
> > > bigger than actual queue depth. We can add it when someone shows it is
> > > really needed.
> > 
> > If we don't want this, why not using srcu here? Looks like just use
> > rcu_read_lock and rcu_read_unlock to protect blk_mq_find_and_get_req()
> > will be enough.
> 
> Like following patch:
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index d880c50629d6..e2381ee9747d 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -255,11 +255,11 @@ static struct request *blk_mq_find_and_get_req(struct
> blk_mq_tags *tags,
>         struct request *rq;
>         unsigned long flags;
> 
> -       spin_lock_irqsave(&tags->lock, flags);
> +       rcu_read_lock();
>         rq = tags->rqs[bitnr];
>         if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
>                 rq = NULL;
> -       spin_unlock_irqrestore(&tags->lock, flags);
> +       rcu_read_unlock();
>         return rq;
>  }

The srcu read lock has to be held while the request reference is being
accessed, so the above change is wrong; otherwise plain rcu would be
enough.
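
For reference, this is roughly the shape the series ends up with
(sketch only, condensed from the diffs above): the whole walk,
including the reference grab, sits inside one tags_srcu read-side
section, and that is the section the call_srcu() freeing side waits
for.

	idx = srcu_read_lock(&tagset->tags_srcu);
	rq = tags->rqs[bitnr];
	if (rq && rq->tag == bitnr && req_ref_inc_not_zero(rq))
		fn(rq, priv);	/* rq and its pages cannot be freed here */
	srcu_read_unlock(&tagset->tags_srcu, idx);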

> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b1d81839679f..a70959cad692 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3442,12 +3442,8 @@ static void blk_mq_clear_rq_mapping(struct
> blk_mq_tags *drv_tags,
> 
>         /*
>          * Wait until all pending iteration is done.
> -        *
> -        * Request reference is cleared and it is guaranteed to be observed
> -        * after the ->lock is released.
>          */
> -       spin_lock_irqsave(&drv_tags->lock, flags);
> -       spin_unlock_irqrestore(&drv_tags->lock, flags);
> +       synchronize_rcu();

We do want to avoid a big delay in this code path, so call_srcu() is
much better.
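
Roughly what patch 3 does (sketch only; the real callback also frees
tags->page_list, and the srcu_struct/rcu_head names are the ones added
by this series):

	static void blk_mq_free_tags_callback(struct rcu_head *head)
	{
		struct blk_mq_tags *tags = container_of(head,
					struct blk_mq_tags, rcu_head);

		/* free the tags->page_list pages here, then the structure */
		kfree(tags);
	}

	/* blk_mq_free_tags(): no grace-period wait in the caller's context */
	call_srcu(&set->tags_srcu, &tags->rcu_head,
		  blk_mq_free_tags_callback);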

Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-05  8:48           ` Ming Lei
@ 2025-08-06  1:06             ` Yu Kuai
  2025-08-06  8:59               ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2025-08-06  1:06 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: Jens Axboe, linux-block, John Garry, Sathya Prakash Veerichetty,
	yukuai (C)

Hi,

On 2025/08/05 16:48, Ming Lei wrote:
> On Tue, Aug 05, 2025 at 04:38:56PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/08/05 16:33, Yu Kuai 写道:
>>> Hi,
>>>
>>> 在 2025/08/04 19:32, Ming Lei 写道:
>>>> On Mon, Aug 04, 2025 at 02:30:43PM +0800, Yu Kuai wrote:
>>>>> Hi,
>>>>>
>>>>> 在 2025/08/01 19:44, Ming Lei 写道:
>>>>>> Replace the spinlock in blk_mq_find_and_get_req() with an
>>>>>> SRCU read lock
>>>>>> around the tag iterators.
>>>>>>
>>>>>> This is done by:
>>>>>>
>>>>>> - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
>>>>>> blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
>>>>>>
>>>>>> - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
>>>>>>
>>>>>> This change improves performance by replacing a spinlock with a more
>>>>>> scalable SRCU lock, and fixes lockup issue in
>>>>>> scsi_host_busy() in case of
>>>>>> shost->host_blocked.
>>>>>>
>>>>>> Meantime it becomes possible to use blk_mq_in_driver_rw() for io
>>>>>> accounting.
>>>>>>
>>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>>>> ---
>>>>>>     block/blk-mq-tag.c | 12 ++++++++----
>>>>>>     block/blk-mq.c     | 24 ++++--------------------
>>>>>>     2 files changed, 12 insertions(+), 24 deletions(-)
>>>>>>
>>>>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>>>>> index 6c2f5881e0de..7ae431077a32 100644
>>>>>> --- a/block/blk-mq-tag.c
>>>>>> +++ b/block/blk-mq-tag.c
>>>>>> @@ -256,13 +256,10 @@ static struct request
>>>>>> *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
>>>>>>             unsigned int bitnr)
>>>>>>     {
>>>>>>         struct request *rq;
>>>>>> -    unsigned long flags;
>>>>>> -    spin_lock_irqsave(&tags->lock, flags);
>>>>>>         rq = tags->rqs[bitnr];
>>>>>>         if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
>>>>>>             rq = NULL;
>>>>>> -    spin_unlock_irqrestore(&tags->lock, flags);
>>>>>>         return rq;
>>>>>>     }
>>>>>>
>>>>> Just wonder, does the lockup problem due to the tags->lock contention by
>>>>> concurrent scsi_host_busy?
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>>> @@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct
>>>>>> blk_mq_tag_set *tagset,
>>>>>>             busy_tag_iter_fn *fn, void *priv)
>>>>>>     {
>>>>>>         unsigned int flags = tagset->flags;
>>>>>> -    int i, nr_tags;
>>>>>> +    int i, nr_tags, srcu_idx;
>>>>>> +
>>>>>> +    srcu_idx = srcu_read_lock(&tagset->tags_srcu);
>>>>>>         nr_tags = blk_mq_is_shared_tags(flags) ? 1 :
>>>>>> tagset->nr_hw_queues;
>>>>>> @@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct
>>>>>> blk_mq_tag_set *tagset,
>>>>>>                 __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>>>>>>                               BT_TAG_ITER_STARTED);
>>>>>>         }
>>>>>> +    srcu_read_unlock(&tagset->tags_srcu, srcu_idx);
>>>>>
>>>>> And should we add cond_resched() after finish interating one tags, even
>>>>> with the srcu change, looks like it's still possible to lockup with
>>>>> big cpu cores & deep queue depth.
>>>>
>>>> The main trouble is from the big tags->lock.
>>>>
>>>> IMO it isn't needed, because max queue depth is just 10K, which is much
>>>> bigger than actual queue depth. We can add it when someone shows it is
>>>> really needed.
>>>
>>> If we don't want this, why not using srcu here? Looks like just use
>>> rcu_read_lock and rcu_read_unlock to protect blk_mq_find_and_get_req()
>>> will be enough.
>>
>> Like following patch:
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index d880c50629d6..e2381ee9747d 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -255,11 +255,11 @@ static struct request *blk_mq_find_and_get_req(struct
>> blk_mq_tags *tags,
>>          struct request *rq;
>>          unsigned long flags;
>>
>> -       spin_lock_irqsave(&tags->lock, flags);
>> +       rcu_read_lock();
>>          rq = tags->rqs[bitnr];
>>          if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
>>                  rq = NULL;
>> -       spin_unlock_irqrestore(&tags->lock, flags);
>> +       rcu_read_unlock();
>>          return rq;
>>   }
> 
> srcu read lock has to be grabbed when request reference is being accessed,
> so the above change is wrong, otherwise plain rcu is enough.
> 
I don't quite understand; I think it's enough to protect grabbing the
req reference, because the IO issue path grabs q_usage_counter before
setting the req reference to 1, and the IO completion path drops the
req reference to 0 before dropping q_usage_counter.

>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index b1d81839679f..a70959cad692 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -3442,12 +3442,8 @@ static void blk_mq_clear_rq_mapping(struct
>> blk_mq_tags *drv_tags,
>>
>>          /*
>>           * Wait until all pending iteration is done.
>> -        *
>> -        * Request reference is cleared and it is guaranteed to be observed
>> -        * after the ->lock is released.
>>           */
>> -       spin_lock_irqsave(&drv_tags->lock, flags);
>> -       spin_unlock_irqrestore(&drv_tags->lock, flags);
>> +       synchronize_rcu();
> 
> We do want to avoid big delay in this code path, so call_srcu() is much
> better.

Agreed; however, there is an rcu version of the helper as well,
call_rcu().

Thanks,
Kuai

> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-06  1:06             ` Yu Kuai
@ 2025-08-06  8:59               ` Ming Lei
  2025-08-06  9:06                 ` Yu Kuai
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-08-06  8:59 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Jens Axboe, linux-block, John Garry, Sathya Prakash Veerichetty,
	yukuai (C)

On Wed, Aug 06, 2025 at 09:06:28AM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/08/05 16:48, Ming Lei 写道:
> > On Tue, Aug 05, 2025 at 04:38:56PM +0800, Yu Kuai wrote:
> > > Hi,
> > > 
> > > 在 2025/08/05 16:33, Yu Kuai 写道:
> > > > Hi,
> > > > 
> > > > 在 2025/08/04 19:32, Ming Lei 写道:
> > > > > On Mon, Aug 04, 2025 at 02:30:43PM +0800, Yu Kuai wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > 在 2025/08/01 19:44, Ming Lei 写道:
> > > > > > > Replace the spinlock in blk_mq_find_and_get_req() with an
> > > > > > > SRCU read lock
> > > > > > > around the tag iterators.
> > > > > > > 
> > > > > > > This is done by:
> > > > > > > 
> > > > > > > - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
> > > > > > > blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
> > > > > > > 
> > > > > > > - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
> > > > > > > 
> > > > > > > This change improves performance by replacing a spinlock with a more
> > > > > > > scalable SRCU lock, and fixes lockup issue in
> > > > > > > scsi_host_busy() in case of
> > > > > > > shost->host_blocked.
> > > > > > > 
> > > > > > > Meantime it becomes possible to use blk_mq_in_driver_rw() for io
> > > > > > > accounting.
> > > > > > > 
> > > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > > > ---
> > > > > > >     block/blk-mq-tag.c | 12 ++++++++----
> > > > > > >     block/blk-mq.c     | 24 ++++--------------------
> > > > > > >     2 files changed, 12 insertions(+), 24 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > > > > > > index 6c2f5881e0de..7ae431077a32 100644
> > > > > > > --- a/block/blk-mq-tag.c
> > > > > > > +++ b/block/blk-mq-tag.c
> > > > > > > @@ -256,13 +256,10 @@ static struct request
> > > > > > > *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
> > > > > > >             unsigned int bitnr)
> > > > > > >     {
> > > > > > >         struct request *rq;
> > > > > > > -    unsigned long flags;
> > > > > > > -    spin_lock_irqsave(&tags->lock, flags);
> > > > > > >         rq = tags->rqs[bitnr];
> > > > > > >         if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
> > > > > > >             rq = NULL;
> > > > > > > -    spin_unlock_irqrestore(&tags->lock, flags);
> > > > > > >         return rq;
> > > > > > >     }
> > > > > > > 
> > > > > > Just wonder, does the lockup problem due to the tags->lock contention by
> > > > > > concurrent scsi_host_busy?
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > 
> > > > > > > @@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct
> > > > > > > blk_mq_tag_set *tagset,
> > > > > > >             busy_tag_iter_fn *fn, void *priv)
> > > > > > >     {
> > > > > > >         unsigned int flags = tagset->flags;
> > > > > > > -    int i, nr_tags;
> > > > > > > +    int i, nr_tags, srcu_idx;
> > > > > > > +
> > > > > > > +    srcu_idx = srcu_read_lock(&tagset->tags_srcu);
> > > > > > >         nr_tags = blk_mq_is_shared_tags(flags) ? 1 :
> > > > > > > tagset->nr_hw_queues;
> > > > > > > @@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct
> > > > > > > blk_mq_tag_set *tagset,
> > > > > > >                 __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> > > > > > >                               BT_TAG_ITER_STARTED);
> > > > > > >         }
> > > > > > > +    srcu_read_unlock(&tagset->tags_srcu, srcu_idx);
> > > > > > 
> > > > > > And should we add cond_resched() after finish interating one tags, even
> > > > > > with the srcu change, looks like it's still possible to lockup with
> > > > > > big cpu cores & deep queue depth.
> > > > > 
> > > > > The main trouble is from the big tags->lock.
> > > > > 
> > > > > IMO it isn't needed, because max queue depth is just 10K, which is much
> > > > > bigger than actual queue depth. We can add it when someone shows it is
> > > > > really needed.
> > > > 
> > > > If we don't want this, why not using srcu here? Looks like just use
> > > > rcu_read_lock and rcu_read_unlock to protect blk_mq_find_and_get_req()
> > > > will be enough.
> > > 
> > > Like following patch:
> > > 
> > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > > index d880c50629d6..e2381ee9747d 100644
> > > --- a/block/blk-mq-tag.c
> > > +++ b/block/blk-mq-tag.c
> > > @@ -255,11 +255,11 @@ static struct request *blk_mq_find_and_get_req(struct
> > > blk_mq_tags *tags,
> > >          struct request *rq;
> > >          unsigned long flags;
> > > 
> > > -       spin_lock_irqsave(&tags->lock, flags);
> > > +       rcu_read_lock();
> > >          rq = tags->rqs[bitnr];
> > >          if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
> > >                  rq = NULL;
> > > -       spin_unlock_irqrestore(&tags->lock, flags);
> > > +       rcu_read_unlock();
> > >          return rq;
> > >   }
> > 
> > srcu read lock has to be grabbed when request reference is being accessed,
> > so the above change is wrong, otherwise plain rcu is enough.
> > 
> I don't quite understand, I think it's enough to protect grabbing req
> reference, because IO issue path grab q_usage_counter before setting
> req reference to 1, and IO complete path decrease req reference to 0
> before dropping q_usage_counter.

In theory it is true, but the implementation is pretty fragile, because
correctness relies on the implied (undocumented) memory barrier in
blk_mq_get_tag() between blk_try_enter_queue() and req_ref_set(rq, 1).
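
To make that dependency explicit (pseudo-ordering only, not actual
code), the plain-rcu variant would have to guarantee:

	/* issue path */
	blk_try_enter_queue(q);		/* q_usage_counter elevated */
	/* the ordering below depends on an implied, undocumented barrier
	 * somewhere inside blk_mq_get_tag() */
	req_ref_set(rq, 1);		/* request becomes live */

	/* iterator trusting q_usage_counter instead of tags_srcu */
	percpu_ref_tryget(&q->q_usage_counter);
	rq = tags->rqs[bitnr];
	req_ref_inc_not_zero(rq);	/* must not act on a stale view */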

> 
> > > 
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index b1d81839679f..a70959cad692 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -3442,12 +3442,8 @@ static void blk_mq_clear_rq_mapping(struct
> > > blk_mq_tags *drv_tags,
> > > 
> > >          /*
> > >           * Wait until all pending iteration is done.
> > > -        *
> > > -        * Request reference is cleared and it is guaranteed to be observed
> > > -        * after the ->lock is released.
> > >           */
> > > -       spin_lock_irqsave(&drv_tags->lock, flags);
> > > -       spin_unlock_irqrestore(&drv_tags->lock, flags);
> > > +       synchronize_rcu();
> > 
> > We do want to avoid big delay in this code path, so call_srcu() is much
> > better.
> 
> Agreed, however, there is rcu verion helper as well, call_rcu().

I prefer srcu, which is simple and straightforward, especially since
the background of this issue has been tough enough already.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-06  8:59               ` Ming Lei
@ 2025-08-06  9:06                 ` Yu Kuai
  0 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-06  9:06 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: Jens Axboe, linux-block, John Garry, Sathya Prakash Veerichetty,
	yukuai (C)

Hi,

On 2025/08/06 16:59, Ming Lei wrote:
> On Wed, Aug 06, 2025 at 09:06:28AM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/08/05 16:48, Ming Lei 写道:
>>> On Tue, Aug 05, 2025 at 04:38:56PM +0800, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> 在 2025/08/05 16:33, Yu Kuai 写道:
>>>>> Hi,
>>>>>
>>>>> 在 2025/08/04 19:32, Ming Lei 写道:
>>>>>> On Mon, Aug 04, 2025 at 02:30:43PM +0800, Yu Kuai wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> 在 2025/08/01 19:44, Ming Lei 写道:
>>>>>>>> Replace the spinlock in blk_mq_find_and_get_req() with an
>>>>>>>> SRCU read lock
>>>>>>>> around the tag iterators.
>>>>>>>>
>>>>>>>> This is done by:
>>>>>>>>
>>>>>>>> - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
>>>>>>>> blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
>>>>>>>>
>>>>>>>> - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
>>>>>>>>
>>>>>>>> This change improves performance by replacing a spinlock with a more
>>>>>>>> scalable SRCU lock, and fixes lockup issue in
>>>>>>>> scsi_host_busy() in case of
>>>>>>>> shost->host_blocked.
>>>>>>>>
>>>>>>>> Meantime it becomes possible to use blk_mq_in_driver_rw() for io
>>>>>>>> accounting.
>>>>>>>>
>>>>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>>>>>> ---
>>>>>>>>      block/blk-mq-tag.c | 12 ++++++++----
>>>>>>>>      block/blk-mq.c     | 24 ++++--------------------
>>>>>>>>      2 files changed, 12 insertions(+), 24 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>>>>>>> index 6c2f5881e0de..7ae431077a32 100644
>>>>>>>> --- a/block/blk-mq-tag.c
>>>>>>>> +++ b/block/blk-mq-tag.c
>>>>>>>> @@ -256,13 +256,10 @@ static struct request
>>>>>>>> *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
>>>>>>>>              unsigned int bitnr)
>>>>>>>>      {
>>>>>>>>          struct request *rq;
>>>>>>>> -    unsigned long flags;
>>>>>>>> -    spin_lock_irqsave(&tags->lock, flags);
>>>>>>>>          rq = tags->rqs[bitnr];
>>>>>>>>          if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
>>>>>>>>              rq = NULL;
>>>>>>>> -    spin_unlock_irqrestore(&tags->lock, flags);
>>>>>>>>          return rq;
>>>>>>>>      }
>>>>>>>>
>>>>>>> Just wonder, does the lockup problem due to the tags->lock contention by
>>>>>>> concurrent scsi_host_busy?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>>> @@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct
>>>>>>>> blk_mq_tag_set *tagset,
>>>>>>>>              busy_tag_iter_fn *fn, void *priv)
>>>>>>>>      {
>>>>>>>>          unsigned int flags = tagset->flags;
>>>>>>>> -    int i, nr_tags;
>>>>>>>> +    int i, nr_tags, srcu_idx;
>>>>>>>> +
>>>>>>>> +    srcu_idx = srcu_read_lock(&tagset->tags_srcu);
>>>>>>>>          nr_tags = blk_mq_is_shared_tags(flags) ? 1 :
>>>>>>>> tagset->nr_hw_queues;
>>>>>>>> @@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct
>>>>>>>> blk_mq_tag_set *tagset,
>>>>>>>>                  __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>>>>>>>>                                BT_TAG_ITER_STARTED);
>>>>>>>>          }
>>>>>>>> +    srcu_read_unlock(&tagset->tags_srcu, srcu_idx);
>>>>>>>
>>>>>>> And should we add cond_resched() after finish interating one tags, even
>>>>>>> with the srcu change, looks like it's still possible to lockup with
>>>>>>> big cpu cores & deep queue depth.
>>>>>>
>>>>>> The main trouble is from the big tags->lock.
>>>>>>
>>>>>> IMO it isn't needed, because max queue depth is just 10K, which is much
>>>>>> bigger than actual queue depth. We can add it when someone shows it is
>>>>>> really needed.
>>>>>
>>>>> If we don't want this, why not using srcu here? Looks like just use
>>>>> rcu_read_lock and rcu_read_unlock to protect blk_mq_find_and_get_req()
>>>>> will be enough.
>>>>
>>>> Like following patch:
>>>>
>>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>>> index d880c50629d6..e2381ee9747d 100644
>>>> --- a/block/blk-mq-tag.c
>>>> +++ b/block/blk-mq-tag.c
>>>> @@ -255,11 +255,11 @@ static struct request *blk_mq_find_and_get_req(struct
>>>> blk_mq_tags *tags,
>>>>           struct request *rq;
>>>>           unsigned long flags;
>>>>
>>>> -       spin_lock_irqsave(&tags->lock, flags);
>>>> +       rcu_read_lock();
>>>>           rq = tags->rqs[bitnr];
>>>>           if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
>>>>                   rq = NULL;
>>>> -       spin_unlock_irqrestore(&tags->lock, flags);
>>>> +       rcu_read_unlock();
>>>>           return rq;
>>>>    }
>>>
>>> srcu read lock has to be grabbed when request reference is being accessed,
>>> so the above change is wrong, otherwise plain rcu is enough.
>>>
>> I don't quite understand, I think it's enough to protect grabbing req
>> reference, because IO issue path grab q_usage_counter before setting
>> req reference to 1, and IO complete path decrease req reference to 0
>> before dropping q_usage_counter.
> 
> In theory it is true, but the implementation is pretty fragile, because the
> correctness replies on the implied memory barrier(un-documented) in blk_mq_get_tag()
> between blk_try_enter_queue() and req_ref_set(rq, 1).
> 
>>
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index b1d81839679f..a70959cad692 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -3442,12 +3442,8 @@ static void blk_mq_clear_rq_mapping(struct
>>>> blk_mq_tags *drv_tags,
>>>>
>>>>           /*
>>>>            * Wait until all pending iteration is done.
>>>> -        *
>>>> -        * Request reference is cleared and it is guaranteed to be observed
>>>> -        * after the ->lock is released.
>>>>            */
>>>> -       spin_lock_irqsave(&drv_tags->lock, flags);
>>>> -       spin_unlock_irqrestore(&drv_tags->lock, flags);
>>>> +       synchronize_rcu();
>>>
>>> We do want to avoid big delay in this code path, so call_srcu() is much
>>> better.
>>
>> Agreed, however, there is rcu verion helper as well, call_rcu().
> 
> I prefer to srcu, which is simple & straight-forward, especially the background
> of this issue has been tough enough.
> 
Thanks for the explanation, I'm good with srcu now :)
Kuai

> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/5] blk-mq: Defer freeing of tags page_list to SRCU callback
  2025-08-01 11:44 ` [PATCH 3/5] blk-mq: Defer freeing of tags page_list to SRCU callback Ming Lei
  2025-08-04  7:09   ` Hannes Reinecke
@ 2025-08-06  9:15   ` Yu Kuai
  1 sibling, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-06  9:15 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: John Garry, Sathya Prakash Veerichetty, yukuai (C)

On 2025/08/01 19:44, Ming Lei wrote:
> Tag iterators can race with the freeing of the request pages(tags->page_list),
> potentially leading to use-after-free issues.
> 
> Defer the freeing of the page list and the tags structure itself until
> after an SRCU grace period has passed. This ensures that any concurrent
> tag iterators have completed before the memory is released. With this
> way, we can replace the big tags->lock in tags iterator code path with
> srcu for solving the issue.
> 
> This is achieved by:
> - Adding a new `srcu_struct tags_srcu` to `blk_mq_tag_set` to protect
>    tag map iteration.
> - Adding an `rcu_head` to `struct blk_mq_tags` to be used with
>    `call_srcu`.
> - Moving the page list freeing logic and the `kfree(tags)` call into a
>    new callback function, `blk_mq_free_tags_callback`.
> - In `blk_mq_free_tags`, invoking `call_srcu` to schedule the new
>    callback for deferred execution.
> 
> The read-side protection for the tag iterators will be added in a
> subsequent patch.
> 
> Signed-off-by: Ming Lei<ming.lei@redhat.com>
> ---
>   block/blk-mq-tag.c     | 24 +++++++++++++++++++++++-
>   block/blk-mq.c         | 26 +++++++++++++-------------
>   include/linux/blk-mq.h |  2 ++
>   3 files changed, 38 insertions(+), 14 deletions(-)

LGTM
Reviewed-by: Yu Kuai <yukuai3@huawei.com>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] blk-mq: Defer freeing flush queue to SRCU callback
  2025-08-01 11:44 ` [PATCH 4/5] blk-mq: Defer freeing flush queue " Ming Lei
  2025-08-04  7:11   ` Hannes Reinecke
@ 2025-08-06  9:17   ` Yu Kuai
  1 sibling, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-06  9:17 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: John Garry, Sathya Prakash Veerichetty, yukuai (C)

On 2025/08/01 19:44, Ming Lei wrote:
> The freeing of the flush queue/request in blk_mq_exit_hctx() can race with
> tag iterators that may still be accessing it. To prevent a potential
> use-after-free, the deallocation should be deferred until after a grace
> period. With this way, we can replace the big tags->lock in tags iterator
> code path with srcu for solving the issue.
> 
> This patch introduces an SRCU-based deferred freeing mechanism for the
> flush queue.
> 
> The changes include:
> - Adding a `rcu_head` to `struct blk_flush_queue`.
> - Creating a new callback function, `blk_free_flush_queue_callback`,
>    to handle the actual freeing.
> - Replacing the direct call to `blk_free_flush_queue()` in
>    `blk_mq_exit_hctx()` with `call_srcu()`, using the `tags_srcu`
>    instance to ensure synchronization with tag iterators.
> 
> Signed-off-by: Ming Lei<ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 11 ++++++++++-
>   block/blk.h    |  1 +
>   2 files changed, 11 insertions(+), 1 deletion(-)

LGTM
Reviewed-by: Yu Kuai <yukuai3@huawei.com>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-01 11:44 ` [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators Ming Lei
  2025-08-04  6:30   ` Yu Kuai
  2025-08-04  7:13   ` Hannes Reinecke
@ 2025-08-06  9:21   ` Yu Kuai
  2025-08-06 13:28     ` Ming Lei
  2 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2025-08-06  9:21 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: John Garry, Sathya Prakash Veerichetty, yukuai (C)

Hi,

On 2025/08/01 19:44, Ming Lei wrote:
> Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
> around the tag iterators.
> 
> This is done by:
> 
> - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
> blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
> 
> - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
> 
> This change improves performance by replacing a spinlock with a more
> scalable SRCU lock, and fixes lockup issue in scsi_host_busy() in case of
> shost->host_blocked.
> 
> Meantime it becomes possible to use blk_mq_in_driver_rw() for io
> accounting.
> 
> Signed-off-by: Ming Lei<ming.lei@redhat.com>
> ---
>   block/blk-mq-tag.c | 12 ++++++++----
>   block/blk-mq.c     | 24 ++++--------------------
>   2 files changed, 12 insertions(+), 24 deletions(-)

I think it's not good to use blk_mq_in_driver_rw() for io accounting;
we start io accounting from blk_account_io_start(), where such an io is
not in the driver yet.

Otherwise, LGTM
Reviewed-by: Yu Kuai <yukuai3@huawei.com>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-06  9:21   ` Yu Kuai
@ 2025-08-06 13:28     ` Ming Lei
  2025-08-07  1:23       ` Yu Kuai
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-08-06 13:28 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Jens Axboe, linux-block, John Garry, Sathya Prakash Veerichetty,
	yukuai (C)

On Wed, Aug 06, 2025 at 05:21:32PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/08/01 19:44, Ming Lei 写道:
> > Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
> > around the tag iterators.
> > 
> > This is done by:
> > 
> > - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
> > blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
> > 
> > - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
> > 
> > This change improves performance by replacing a spinlock with a more
> > scalable SRCU lock, and fixes lockup issue in scsi_host_busy() in case of
> > shost->host_blocked.
> > 
> > Meantime it becomes possible to use blk_mq_in_driver_rw() for io
> > accounting.
> > 
> > Signed-off-by: Ming Lei<ming.lei@redhat.com>
> > ---
> >   block/blk-mq-tag.c | 12 ++++++++----
> >   block/blk-mq.c     | 24 ++++--------------------
> >   2 files changed, 12 insertions(+), 24 deletions(-)
> 
> I think it's not good to use blk_mq_in_driver_rw() for io accounting, we
> start io accounting from blk_account_io_start(), where such io is not in
> driver yet.

In blk_account_io_start(), the current IO is _not_ taken into account in
update_io_ticks() yet.

Also please look at 'man iostat':

    %util  Percentage  of  elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device
           saturation occurs when this value is close to 100% for devices serving requests serially.  But for devices serving requests  in
           parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.

which defines %util at the device level, not from queueing an IO to
completing it at the device.

That said, the current approach of counting inflight IO via a percpu
counter from blk_account_io_start() is not correct from the %util
viewpoint for request-based drivers.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-06 13:28     ` Ming Lei
@ 2025-08-07  1:23       ` Yu Kuai
  2025-08-07  2:12         ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2025-08-07  1:23 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: Jens Axboe, linux-block, John Garry, Sathya Prakash Veerichetty,
	yukuai (C)

Hi,

On 2025/08/06 21:28, Ming Lei wrote:
> On Wed, Aug 06, 2025 at 05:21:32PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/08/01 19:44, Ming Lei 写道:
>>> Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
>>> around the tag iterators.
>>>
>>> This is done by:
>>>
>>> - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
>>> blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
>>>
>>> - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
>>>
>>> This change improves performance by replacing a spinlock with a more
>>> scalable SRCU lock, and fixes lockup issue in scsi_host_busy() in case of
>>> shost->host_blocked.
>>>
>>> Meantime it becomes possible to use blk_mq_in_driver_rw() for io
>>> accounting.
>>>
>>> Signed-off-by: Ming Lei<ming.lei@redhat.com>
>>> ---
>>>    block/blk-mq-tag.c | 12 ++++++++----
>>>    block/blk-mq.c     | 24 ++++--------------------
>>>    2 files changed, 12 insertions(+), 24 deletions(-)
>>
>> I think it's not good to use blk_mq_in_driver_rw() for io accounting, we
>> start io accounting from blk_account_io_start(), where such io is not in
>> driver yet.
> 
> In blk_account_io_start(), the current IO is _not_ taken into account in
> update_io_ticks() yet.

However, this is exactly what the code has done for a long time. For
example, consider just one IO:

t1: blk_account_io_start
t2: blk_mq_start_request
t3: blk_account_io_done

update_io_ticks() is called from blk_account_io_start() and
blk_account_io_done(), so the time (t3 - t1) will be accounted into
io_ticks.
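
(For example, with t1 = 0ms, t2 = 4ms and t3 = 10ms, io_ticks grows by
10ms even though the request only sat in the driver for the last 6ms.)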

And if we consider more IOs, it becomes a mess:

t1: IO a: blk_account_io_start
t2: IO b: blk_account_io_start
t3: IO a: blk_mq_start_request
t4: IO b: blk_mq_start_request
t5: IO a: blk_account_io_done
t6: IO b: blk_account_io_done

In the old scheme, where an IO only counts as inflight from
blk_mq_start_request(), the io_ticks accounted is t6 - t2, which is
weird. That's the reason I changed the IO accounting to consider an IO
inflight from blk_account_io_start(), which unifies all the fields in
iostat.
> 
> Also please look at 'man iostat':
> 
>      %util  Percentage  of  elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device
>             saturation occurs when this value is close to 100% for devices serving requests serially.  But for devices serving requests  in
>             parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
> 
> which shows %util in device level, not from queuing IO to complete io from device.
> 
> That said the current approach for counting inflight IO via percpu counter
> from blk_account_io_start() is not correct from %util viewpoint from
> request based driver.
> 

I'd prefer to update the documents. On the one hand, %util has not been
accurate for a long time, up to today; on the other hand, AFAIK iostat
is not really suitable for the raw disk, as IO latency, aqu-sz, etc.
are all accounted based on the request lifetime, which includes the
elevator and the hctx.

In downstream kernels we receive lots of reports about slow disks. A
typical scenario is that there are multiple disks in one scsi host and
one of the disks has a problem: all IOs to the host get stuck in
hctx->dispatch until host recovery is done, and after the recovery
iostat will observe huge latency, aqu-sz and util for the other disks
while those disks are just fine.

BTW, we do add IO accounting for the bio lifetime and the device
lifetime in downstream kernels, and that really helps with slow disk
detection.

Thanks,
Kuai

> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-07  1:23       ` Yu Kuai
@ 2025-08-07  2:12         ` Ming Lei
  2025-08-07  3:44           ` Yu Kuai
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-08-07  2:12 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Jens Axboe, linux-block, John Garry, Sathya Prakash Veerichetty,
	yukuai (C)

On Thu, Aug 07, 2025 at 09:23:24AM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/08/06 21:28, Ming Lei 写道:
> > On Wed, Aug 06, 2025 at 05:21:32PM +0800, Yu Kuai wrote:
> > > Hi,
> > > 
> > > 在 2025/08/01 19:44, Ming Lei 写道:
> > > > Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
> > > > around the tag iterators.
> > > > 
> > > > This is done by:
> > > > 
> > > > - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
> > > > blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
> > > > 
> > > > - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
> > > > 
> > > > This change improves performance by replacing a spinlock with a more
> > > > scalable SRCU lock, and fixes lockup issue in scsi_host_busy() in case of
> > > > shost->host_blocked.
> > > > 
> > > > Meantime it becomes possible to use blk_mq_in_driver_rw() for io
> > > > accounting.
> > > > 
> > > > Signed-off-by: Ming Lei<ming.lei@redhat.com>
> > > > ---
> > > >    block/blk-mq-tag.c | 12 ++++++++----
> > > >    block/blk-mq.c     | 24 ++++--------------------
> > > >    2 files changed, 12 insertions(+), 24 deletions(-)
> > > 
> > > I think it's not good to use blk_mq_in_driver_rw() for io accounting, we
> > > start io accounting from blk_account_io_start(), where such io is not in
> > > driver yet.
> > 
> > In blk_account_io_start(), the current IO is _not_ taken into account in
> > update_io_ticks() yet.
> 
> However, this is exactly what we did from coding for a long time, for
> example, consider just one IO:
> 
> t1: blk_account_io_start
> t2: blk_mq_start_request
> t3: blk_account_io_done
> 
> The update_io_ticks() is called from blk_account_io_start() and
> blk_account_io_done(), the time (t3 - t1) will be accounted into
> io_ticks.

That still may not be correct; please see "Documentation/block/stat.rst":

```
io_ticks        milliseconds  total time this block device has been active
```

What I meant is that it doesn't matter wrt. "io accounting from
blk_account_io_start()", because the current io is excluded from the
`inflight ios` check in update_io_ticks() when it is called from
blk_account_io_start().
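
The relevant check is roughly the following (paraphrasing
update_io_ticks(), not a verbatim quote of the source):

	if (end || part_in_flight(part))
		__part_stat_add(part, io_ticks, now - stamp);

so on the start path (end == false) the io being started does not yet
contribute to part_in_flight().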

> 
> And if we consider more IO, there will be a mess:
> 
> t1: IO a: blk_account_io_start
> t2: IO b: blk_account_io_start
> t3: IO a: blk_mq_start_request
> t4: IO b: blk_mq_start_request
> t5: IO a: blk_account_io_done
> t6: IO b: blk_account_io_done
> 
> In the old cases that IO is inflight untill blk_mq_start_request, the
> io_ticks accounted is t6 - t2, which is werid. That's the reason I
> changed the IO accouting, and consider IO inflight from
> blk_account_io_start, and this will unify all the fields in iostat.

In reality the implementation may include odd things, but the top
question is what/how 'io_ticks' should be defined in theory. The same
goes for %util.

> > 
> > Also please look at 'man iostat':
> > 
> >      %util  Percentage  of  elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device
> >             saturation occurs when this value is close to 100% for devices serving requests serially.  But for devices serving requests  in
> >             parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
> > 
> > which shows %util in device level, not from queuing IO to complete io from device.
> > 
> > That said the current approach for counting inflight IO via percpu counter
> > from blk_account_io_start() is not correct from %util viewpoint from
> > request based driver.
> > 
> 
> I'll prefer to update the documents, on the one hand, util is not

Can we update every distributed iostat man page? And we can't refresh
every user's understanding of %util's definition in a short time.

Also, what should we update the document to say, and how? That is a
serious question.

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
  2025-08-07  2:12         ` Ming Lei
@ 2025-08-07  3:44           ` Yu Kuai
  0 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-07  3:44 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: Jens Axboe, linux-block, John Garry, Sathya Prakash Veerichetty,
	yukuai (C)

Hi,

On 2025/08/07 10:12, Ming Lei wrote:
> On Thu, Aug 07, 2025 at 09:23:24AM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/08/06 21:28, Ming Lei wrote:
>>> On Wed, Aug 06, 2025 at 05:21:32PM +0800, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> On 2025/08/01 19:44, Ming Lei wrote:
>>>>> Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
>>>>> around the tag iterators.
>>>>>
>>>>> This is done by:
>>>>>
>>>>> - Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
>>>>> blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
>>>>>
>>>>> - Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
>>>>>
>>>>> This change improves performance by replacing a spinlock with a more
>>>>> scalable SRCU lock, and fixes a lockup issue in scsi_host_busy() in case of
>>>>> shost->host_blocked.
>>>>>
>>>>> Meanwhile, it becomes possible to use blk_mq_in_driver_rw() for io
>>>>> accounting.
>>>>>
>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>>> ---
>>>>>     block/blk-mq-tag.c | 12 ++++++++----
>>>>>     block/blk-mq.c     | 24 ++++--------------------
>>>>>     2 files changed, 12 insertions(+), 24 deletions(-)
>>>>
>>>> I think it's not good to use blk_mq_in_driver_rw() for io accounting; we
>>>> start io accounting from blk_account_io_start(), where such io is not in
>>>> the driver yet.
>>>
>>> In blk_account_io_start(), the current IO is _not_ taken into account in
>>> update_io_ticks() yet.
>>
>> However, this is exactly what the code has done for a long time. For
>> example, consider just one IO:
>>
>> t1: blk_account_io_start
>> t2: blk_mq_start_request
>> t3: blk_account_io_done
>>
>> The update_io_ticks() is called from blk_account_io_start() and
>> blk_account_io_done(), so the time (t3 - t1) will be accounted into
>> io_ticks.
> 
> That still may not be correct; please see "Documentation/block/stat.rst":
> 
> ```
> io_ticks        milliseconds  total time this block device has been active
> ```
> 
> What I meant is that it doesn't matter wrt. "io accounting from
> blk_account_io_start()", because the current io is excluded from `inflight ios` in
> update_io_ticks() from blk_account_io_start().

So, unless we move update_io_ticks() to blk_mq_start_request(),
blk_account_io_start() is always where we start accounting from; no matter
whether we use the percpu counter or blk_mq_in_driver_rw(), the inflight ios
should be 0 for the first io.
> 
>>
>> And if we consider more IO, there will be a mess:
>>
>> t1: IO a: blk_account_io_start
>> t2: IO b: blk_account_io_start
>> t3: IO a: blk_mq_start_request
>> t4: IO b: blk_mq_start_request
>> t5: IO a: blk_account_io_done
>> t6: IO b: blk_account_io_done
>>
>> In the old case, where an IO is not considered inflight until
>> blk_mq_start_request, the io_ticks accounted is t6 - t2, which is weird.
>> That's the reason I changed the IO accounting to consider IO inflight from
>> blk_account_io_start, and this unifies all the fields in iostat.
> 
> In reality the implementation may include odd things, but the key question is
> how 'io_ticks' should be defined in theory; the same applies to %util.

Yes, for now I prefer the current definition: let iostat start
everything from blk_account_io_start.

However, I'm fine if we decide to move update_io_ticks to
blk_mq_start_request(). One concern is whether the overhead of request
walking per jiffies is acceptable.
> 
>>>
>>> Also please look at 'man iostat':
>>>
>>>       %util  Percentage  of  elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device
>>>              saturation occurs when this value is close to 100% for devices serving requests serially.  But for devices serving requests  in
>>>              parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
>>>
>>> which shows %util at the device level, not measured from queueing an IO until it completes from the device.
>>>
>>> That said, the current approach of counting inflight IO via a percpu counter
>>> from blk_account_io_start() is not correct from the %util viewpoint for
>>> request-based drivers.
>>>
>>
>> I'd prefer to update the documents; on the one hand, util is not
> 
> Can we update every distribution's iostat man page? And we can't refresh
> every user's understanding of %util's definition in a short time.
> 
> Also, what exactly should we update the document to say? That is a serious question in itself.

I've never done this before; however, I believe there is a way :) I can dig
deeper after we are in agreement.

Thanks,
Kuai

> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-08-07  3:44 UTC | newest]

Thread overview: 30+ messages
2025-08-01 11:44 [PATCH 0/5] blk-mq: Replace tags->lock with SRCU for tag iterators Ming Lei
2025-08-01 11:44 ` [PATCH 1/5] blk-mq: Move flush queue allocation into blk_mq_init_hctx() Ming Lei
2025-08-04  6:06   ` Yu Kuai
2025-08-04  7:07   ` Hannes Reinecke
2025-08-01 11:44 ` [PATCH 2/5] blk-mq: Pass tag_set to blk_mq_free_rq_map/tags Ming Lei
2025-08-04  7:08   ` Hannes Reinecke
2025-08-05  7:48   ` Yu Kuai
2025-08-01 11:44 ` [PATCH 3/5] blk-mq: Defer freeing of tags page_list to SRCU callback Ming Lei
2025-08-04  7:09   ` Hannes Reinecke
2025-08-06  9:15   ` Yu Kuai
2025-08-01 11:44 ` [PATCH 4/5] blk-mq: Defer freeing flush queue " Ming Lei
2025-08-04  7:11   ` Hannes Reinecke
2025-08-06  9:17   ` Yu Kuai
2025-08-01 11:44 ` [PATCH 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators Ming Lei
2025-08-04  6:30   ` Yu Kuai
2025-08-04 11:32     ` Ming Lei
2025-08-05  8:33       ` Yu Kuai
2025-08-05  8:38         ` Yu Kuai
2025-08-05  8:48           ` Ming Lei
2025-08-06  1:06             ` Yu Kuai
2025-08-06  8:59               ` Ming Lei
2025-08-06  9:06                 ` Yu Kuai
2025-08-04  7:13   ` Hannes Reinecke
2025-08-04 11:35     ` Ming Lei
2025-08-04 11:45       ` Hannes Reinecke
2025-08-06  9:21   ` Yu Kuai
2025-08-06 13:28     ` Ming Lei
2025-08-07  1:23       ` Yu Kuai
2025-08-07  2:12         ` Ming Lei
2025-08-07  3:44           ` Yu Kuai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).