* [PATCH V2 0/5] blk-mq: Replace tags->lock with SRCU for tag iterators
From: Ming Lei @ 2025-08-30 2:18 UTC
To: Jens Axboe, linux-block; +Cc: Hannes Reinecke, Yu Kuai, Ming Lei
Hello Jens,
Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
around the tag iterators.
This avoids the scsi_host_busy() lockup seen while the SCSI host is
blocked on systems with many CPU cores and a deep queue depth.
The big tags->lock is also no longer needed when reading the disk sysfs
attribute 'inflight'.
Take the following approach (sketched below):
- clear the request reference in tags->rqs[] and defer freeing of
  scheduler requests to an SRCU callback
- replace tags->lock with an SRCU read lock in the tag iterators
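As a rough illustration, the pairing between the two sides looks like
this (a minimal generic sketch, not the actual blk-mq code; `item`,
`slot` and `use()` are illustrative):

  struct item {
          struct rcu_head rcu_head;
          /* payload ... */
  };

  static void item_free_cb(struct rcu_head *head)
  {
          kfree(container_of(head, struct item, rcu_head));
  }

  /* writer: clear the slot, then defer the free past all readers */
  WRITE_ONCE(*slot, NULL);
  call_srcu(&srcu, &it->rcu_head, item_free_cb);

  /* reader (tag iterator): anything seen in the slot stays valid
   * until srcu_read_unlock() */
  idx = srcu_read_lock(&srcu);
  it = READ_ONCE(*slot);
  if (it)
          use(it);
  srcu_read_unlock(&srcu, idx);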
V2:
- rebase on for-6.18/block
- add review tags
Thanks,
Ming
Ming Lei (5):
blk-mq: Move flush queue allocation into blk_mq_init_hctx()
blk-mq: Pass tag_set to blk_mq_free_rq_map/tags
blk-mq: Defer freeing of tags page_list to SRCU callback
blk-mq: Defer freeing flush queue to SRCU callback
blk-mq: Replace tags->lock with SRCU for tag iterators
block/blk-mq-sysfs.c | 1 -
block/blk-mq-tag.c | 38 +++++++++++++++---
block/blk-mq.c | 87 +++++++++++++++++++++---------------------
block/blk-mq.h | 4 +-
block/blk.h | 1 +
include/linux/blk-mq.h | 2 +
6 files changed, 80 insertions(+), 53 deletions(-)
--
2.47.0
* [PATCH V2 1/5] blk-mq: Move flush queue allocation into blk_mq_init_hctx()
From: Ming Lei @ 2025-08-30 2:18 UTC
To: Jens Axboe, linux-block; +Cc: Hannes Reinecke, Yu Kuai, Ming Lei
Move flush queue allocation into blk_mq_init_hctx() and its release into
blk_mq_exit_hctx(), preparing to replace tags->lock with SRCU for
draining inflight request walking. blk_mq_exit_hctx() is the last chance
for us to get a valid `tag_set` reference, and we need to add one SRCU
instance to `tag_set` for freeing the flush request via call_srcu().
It is safe to move flush queue & request release into blk_mq_exit_hctx(),
because blk_mq_clear_flush_rq_mapping() clears the flush request
reference in the driver tags inflight request table, and meanwhile
inflight request walking is drained.
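In other words, the flush queue lifetime now pairs with hctx init/exit
rather than with hctx alloc/release (condensed from the diff below):

  /* before: allocated in blk_mq_alloc_hctx(), freed in the hctx kobject
   * release handler blk_mq_hw_sysfs_release() */

  /* after: */
  blk_mq_init_hctx():
          hctx->fq = blk_alloc_flush_queue(...);

  blk_mq_exit_hctx():
          blk_free_flush_queue(hctx->fq);
          hctx->fq = NULL;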
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-sysfs.c | 1 -
block/blk-mq.c | 20 +++++++++++++-------
2 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 5c399ac562ea..58ec293373c6 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -34,7 +34,6 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
kobj);
- blk_free_flush_queue(hctx->fq);
sbitmap_free(&hctx->ctx_map);
free_cpumask_var(hctx->cpumask);
kfree(hctx->ctxs);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ba3a4b77f578..cfd4bbc161ac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3939,6 +3939,9 @@ static void blk_mq_exit_hctx(struct request_queue *q,
if (set->ops->exit_hctx)
set->ops->exit_hctx(hctx, hctx_idx);
+ blk_free_flush_queue(hctx->fq);
+ hctx->fq = NULL;
+
xa_erase(&q->hctx_table, hctx_idx);
spin_lock(&q->unused_hctx_lock);
@@ -3964,13 +3967,19 @@ static int blk_mq_init_hctx(struct request_queue *q,
struct blk_mq_tag_set *set,
struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
{
+ gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
+
+ hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
+ if (!hctx->fq)
+ goto fail;
+
hctx->queue_num = hctx_idx;
hctx->tags = set->tags[hctx_idx];
if (set->ops->init_hctx &&
set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
- goto fail;
+ goto fail_free_fq;
if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
hctx->numa_node))
@@ -3987,6 +3996,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
exit_hctx:
if (set->ops->exit_hctx)
set->ops->exit_hctx(hctx, hctx_idx);
+ fail_free_fq:
+ blk_free_flush_queue(hctx->fq);
+ hctx->fq = NULL;
fail:
return -1;
}
@@ -4038,16 +4050,10 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
- hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
- if (!hctx->fq)
- goto free_bitmap;
-
blk_mq_hctx_kobj_init(hctx);
return hctx;
- free_bitmap:
- sbitmap_free(&hctx->ctx_map);
free_ctxs:
kfree(hctx->ctxs);
free_cpumask:
--
2.47.0
* [PATCH V2 2/5] blk-mq: Pass tag_set to blk_mq_free_rq_map/tags
From: Ming Lei @ 2025-08-30 2:18 UTC
To: Jens Axboe, linux-block; +Cc: Hannes Reinecke, Yu Kuai, Ming Lei
To prepare for converting the freeing of tags->rqs to be SRCU-based, the
tag_set is needed in the freeing helper functions.
This patch adds 'struct blk_mq_tag_set *' as the first parameter to
blk_mq_free_rq_map() and blk_mq_free_tags(), and updates all their call
sites.
This allows access to the tag_set's SRCU structure in the next step,
which will be used to free the tag maps after a grace period.
No functional change is intended in this patch.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-tag.c | 2 +-
block/blk-mq.c | 10 +++++-----
block/blk-mq.h | 4 ++--
3 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d880c50629d6..6fce42851f03 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -576,7 +576,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
return NULL;
}
-void blk_mq_free_tags(struct blk_mq_tags *tags)
+void blk_mq_free_tags(struct blk_mq_tag_set *set, struct blk_mq_tags *tags)
{
sbitmap_queue_free(&tags->bitmap_tags);
sbitmap_queue_free(&tags->breserved_tags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index cfd4bbc161ac..b8f13e321516 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3491,14 +3491,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
}
}
-void blk_mq_free_rq_map(struct blk_mq_tags *tags)
+void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags)
{
kfree(tags->rqs);
tags->rqs = NULL;
kfree(tags->static_rqs);
tags->static_rqs = NULL;
- blk_mq_free_tags(tags);
+ blk_mq_free_tags(set, tags);
}
static enum hctx_type hctx_idx_to_type(struct blk_mq_tag_set *set,
@@ -3560,7 +3560,7 @@ static struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
err_free_rqs:
kfree(tags->rqs);
err_free_tags:
- blk_mq_free_tags(tags);
+ blk_mq_free_tags(set, tags);
return NULL;
}
@@ -4107,7 +4107,7 @@ struct blk_mq_tags *blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set,
ret = blk_mq_alloc_rqs(set, tags, hctx_idx, depth);
if (ret) {
- blk_mq_free_rq_map(tags);
+ blk_mq_free_rq_map(set, tags);
return NULL;
}
@@ -4135,7 +4135,7 @@ void blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set,
{
if (tags) {
blk_mq_free_rqs(set, tags, hctx_idx);
- blk_mq_free_rq_map(tags);
+ blk_mq_free_rq_map(set, tags);
}
}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index affb2e14b56e..b96a753809ab 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -59,7 +59,7 @@ void blk_mq_put_rq_ref(struct request *rq);
*/
void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
unsigned int hctx_idx);
-void blk_mq_free_rq_map(struct blk_mq_tags *tags);
+void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags);
struct blk_mq_tags *blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set,
unsigned int hctx_idx, unsigned int depth);
void blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set,
@@ -162,7 +162,7 @@ struct blk_mq_alloc_data {
struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
unsigned int reserved_tags, unsigned int flags, int node);
-void blk_mq_free_tags(struct blk_mq_tags *tags);
+void blk_mq_free_tags(struct blk_mq_tag_set *set, struct blk_mq_tags *tags);
unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
--
2.47.0
* [PATCH V2 3/5] blk-mq: Defer freeing of tags page_list to SRCU callback
From: Ming Lei @ 2025-08-30 2:18 UTC
To: Jens Axboe, linux-block; +Cc: Hannes Reinecke, Yu Kuai, Ming Lei
Tag iterators can race with the freeing of the request pages
(tags->page_list), potentially leading to use-after-free issues.
Defer the freeing of the page list and the tags structure itself until
after an SRCU grace period has passed. This ensures that any concurrent
tag iterators have completed before the memory is released, and lets us
replace the big tags->lock in the tag iterator code path with SRCU.
This is achieved by:
- Adding a new `srcu_struct tags_srcu` to `blk_mq_tag_set` to protect
tag map iteration.
- Adding an `rcu_head` to `struct blk_mq_tags` to be used with
`call_srcu`.
- Moving the page list freeing logic and the `kfree(tags)` call into a
new callback function, `blk_mq_free_tags_callback`.
- In `blk_mq_free_tags`, invoking `call_srcu` to schedule the new
callback for deferred execution.
The read-side protection for the tag iterators will be added in a
subsequent patch.
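One ordering detail is worth calling out (visible in the
blk_mq_free_tag_set() hunk below): pending free callbacks must be
flushed before the SRCU domain itself is torn down:

  /* in blk_mq_free_tag_set(): */
  srcu_barrier(&set->tags_srcu);        /* wait for queued blk_mq_free_tags_callback() */
  cleanup_srcu_struct(&set->tags_srcu); /* only then tear down the SRCU domain */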
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-tag.c | 24 +++++++++++++++++++++++-
block/blk-mq.c | 26 +++++++++++++-------------
include/linux/blk-mq.h | 2 ++
3 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 6fce42851f03..6c2f5881e0de 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -8,6 +8,9 @@
*/
#include <linux/kernel.h>
#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/kmemleak.h>
#include <linux/delay.h>
#include "blk.h"
@@ -576,11 +579,30 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
return NULL;
}
+static void blk_mq_free_tags_callback(struct rcu_head *head)
+{
+ struct blk_mq_tags *tags = container_of(head, struct blk_mq_tags,
+ rcu_head);
+ struct page *page;
+
+ while (!list_empty(&tags->page_list)) {
+ page = list_first_entry(&tags->page_list, struct page, lru);
+ list_del_init(&page->lru);
+ /*
+ * Remove kmemleak object previously allocated in
+ * blk_mq_alloc_rqs().
+ */
+ kmemleak_free(page_address(page));
+ __free_pages(page, page->private);
+ }
+ kfree(tags);
+}
+
void blk_mq_free_tags(struct blk_mq_tag_set *set, struct blk_mq_tags *tags)
{
sbitmap_queue_free(&tags->bitmap_tags);
sbitmap_queue_free(&tags->breserved_tags);
- kfree(tags);
+ call_srcu(&set->tags_srcu, &tags->rcu_head, blk_mq_free_tags_callback);
}
int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b8f13e321516..14bfdc6eadce 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3454,7 +3454,6 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
unsigned int hctx_idx)
{
struct blk_mq_tags *drv_tags;
- struct page *page;
if (list_empty(&tags->page_list))
return;
@@ -3478,17 +3477,10 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
}
blk_mq_clear_rq_mapping(drv_tags, tags);
-
- while (!list_empty(&tags->page_list)) {
- page = list_first_entry(&tags->page_list, struct page, lru);
- list_del_init(&page->lru);
- /*
- * Remove kmemleak object previously allocated in
- * blk_mq_alloc_rqs().
- */
- kmemleak_free(page_address(page));
- __free_pages(page, page->private);
- }
+ /*
+ * Free request pages in SRCU callback, which is called from
+ * blk_mq_free_tags().
+ */
}
void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags)
@@ -4834,6 +4826,9 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
if (ret)
goto out_free_srcu;
}
+ ret = init_srcu_struct(&set->tags_srcu);
+ if (ret)
+ goto out_cleanup_srcu;
init_rwsem(&set->update_nr_hwq_lock);
@@ -4842,7 +4837,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
sizeof(struct blk_mq_tags *), GFP_KERNEL,
set->numa_node);
if (!set->tags)
- goto out_cleanup_srcu;
+ goto out_cleanup_tags_srcu;
for (i = 0; i < set->nr_maps; i++) {
set->map[i].mq_map = kcalloc_node(nr_cpu_ids,
@@ -4871,6 +4866,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
}
kfree(set->tags);
set->tags = NULL;
+out_cleanup_tags_srcu:
+ cleanup_srcu_struct(&set->tags_srcu);
out_cleanup_srcu:
if (set->flags & BLK_MQ_F_BLOCKING)
cleanup_srcu_struct(set->srcu);
@@ -4916,6 +4913,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
kfree(set->tags);
set->tags = NULL;
+
+ srcu_barrier(&set->tags_srcu);
+ cleanup_srcu_struct(&set->tags_srcu);
if (set->flags & BLK_MQ_F_BLOCKING) {
cleanup_srcu_struct(set->srcu);
kfree(set->srcu);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2a5a828f19a0..1325ceeb743a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -531,6 +531,7 @@ struct blk_mq_tag_set {
struct mutex tag_list_lock;
struct list_head tag_list;
struct srcu_struct *srcu;
+ struct srcu_struct tags_srcu;
struct rw_semaphore update_nr_hwq_lock;
};
@@ -767,6 +768,7 @@ struct blk_mq_tags {
* request pool
*/
spinlock_t lock;
+ struct rcu_head rcu_head;
};
static inline struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags,
--
2.47.0
* [PATCH V2 4/5] blk-mq: Defer freeing flush queue to SRCU callback
From: Ming Lei @ 2025-08-30 2:18 UTC
To: Jens Axboe, linux-block; +Cc: Hannes Reinecke, Yu Kuai, Ming Lei
The freeing of the flush queue/request in blk_mq_exit_hctx() can race with
tag iterators that may still be accessing it. To prevent a potential
use-after-free, defer the deallocation until after a grace period has
passed; this lets us replace the big tags->lock in the tag iterator
code path with SRCU.
This patch introduces an SRCU-based deferred freeing mechanism for the
flush queue.
The changes include:
- Adding a `rcu_head` to `struct blk_flush_queue`.
- Creating a new callback function, `blk_free_flush_queue_callback`,
to handle the actual freeing.
- Replacing the direct call to `blk_free_flush_queue()` in
`blk_mq_exit_hctx()` with `call_srcu()`, using the `tags_srcu`
instance to ensure synchronization with tag iterators.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 11 ++++++++++-
block/blk.h | 1 +
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 14bfdc6eadce..c9c6e954bfbc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3912,6 +3912,14 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
spin_unlock_irqrestore(&tags->lock, flags);
}
+static void blk_free_flush_queue_callback(struct rcu_head *head)
+{
+ struct blk_flush_queue *fq =
+ container_of(head, struct blk_flush_queue, rcu_head);
+
+ blk_free_flush_queue(fq);
+}
+
/* hctx->ctxs will be freed in queue's release handler */
static void blk_mq_exit_hctx(struct request_queue *q,
struct blk_mq_tag_set *set,
@@ -3931,7 +3939,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
if (set->ops->exit_hctx)
set->ops->exit_hctx(hctx, hctx_idx);
- blk_free_flush_queue(hctx->fq);
+ call_srcu(&set->tags_srcu, &hctx->fq->rcu_head,
+ blk_free_flush_queue_callback);
hctx->fq = NULL;
xa_erase(&q->hctx_table, hctx_idx);
diff --git a/block/blk.h b/block/blk.h
index 46f566f9b126..7d420c247d81 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -41,6 +41,7 @@ struct blk_flush_queue {
struct list_head flush_queue[2];
unsigned long flush_data_in_flight;
struct request *flush_rq;
+ struct rcu_head rcu_head;
};
bool is_flush_rq(struct request *req);
--
2.47.0
* [PATCH V2 5/5] blk-mq: Replace tags->lock with SRCU for tag iterators
From: Ming Lei @ 2025-08-30 2:18 UTC
To: Jens Axboe, linux-block; +Cc: Hannes Reinecke, Yu Kuai, Ming Lei
Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
around the tag iterators.
This is done by:
- Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().
- Removing the now-redundant tags->lock from blk_mq_find_and_get_req().
This change fixes a lockup in scsi_host_busy() when shost->host_blocked
is set, and avoids taking the big tags->lock when reading the disk sysfs
attribute `inflight`.
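The resulting reader side, condensed from the diff below: the iterator
holds the SRCU read lock, so the tags->rqs[] slot and the request
memory stay valid while a reference is taken:

  int srcu_idx = srcu_read_lock(&tagset->tags_srcu);

  rq = tags->rqs[bitnr];
  if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
          rq = NULL;
  /* rq, if non-NULL, now holds its own reference */

  srcu_read_unlock(&tagset->tags_srcu, srcu_idx);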
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-tag.c | 12 ++++++++----
block/blk-mq.c | 24 ++++--------------------
2 files changed, 12 insertions(+), 24 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 6c2f5881e0de..7ae431077a32 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -256,13 +256,10 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
unsigned int bitnr)
{
struct request *rq;
- unsigned long flags;
- spin_lock_irqsave(&tags->lock, flags);
rq = tags->rqs[bitnr];
if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq))
rq = NULL;
- spin_unlock_irqrestore(&tags->lock, flags);
return rq;
}
@@ -440,7 +437,9 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
busy_tag_iter_fn *fn, void *priv)
{
unsigned int flags = tagset->flags;
- int i, nr_tags;
+ int i, nr_tags, srcu_idx;
+
+ srcu_idx = srcu_read_lock(&tagset->tags_srcu);
nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
@@ -449,6 +448,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
BT_TAG_ITER_STARTED);
}
+ srcu_read_unlock(&tagset->tags_srcu, srcu_idx);
}
EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
@@ -499,6 +499,8 @@ EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
void *priv)
{
+ int srcu_idx;
+
/*
* __blk_mq_update_nr_hw_queues() updates nr_hw_queues and hctx_table
* while the queue is frozen. So we can use q_usage_counter to avoid
@@ -507,6 +509,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
if (!percpu_ref_tryget(&q->q_usage_counter))
return;
+ srcu_idx = srcu_read_lock(&q->tag_set->tags_srcu);
if (blk_mq_is_shared_tags(q->tag_set->flags)) {
struct blk_mq_tags *tags = q->tag_set->shared_tags;
struct sbitmap_queue *bresv = &tags->breserved_tags;
@@ -536,6 +539,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn,
bt_for_each(hctx, q, btags, fn, priv, false);
}
}
+ srcu_read_unlock(&q->tag_set->tags_srcu, srcu_idx);
blk_queue_exit(q);
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c9c6e954bfbc..8191ffac998e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3415,7 +3415,6 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
struct blk_mq_tags *tags)
{
struct page *page;
- unsigned long flags;
/*
* There is no need to clear mapping if driver tags is not initialized
@@ -3439,15 +3438,6 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
}
}
}
-
- /*
- * Wait until all pending iteration is done.
- *
- * Request reference is cleared and it is guaranteed to be observed
- * after the ->lock is released.
- */
- spin_lock_irqsave(&drv_tags->lock, flags);
- spin_unlock_irqrestore(&drv_tags->lock, flags);
}
void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
@@ -3670,8 +3660,12 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
struct rq_iter_data data = {
.hctx = hctx,
};
+ int srcu_idx;
+ srcu_idx = srcu_read_lock(&hctx->queue->tag_set->tags_srcu);
blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
+ srcu_read_unlock(&hctx->queue->tag_set->tags_srcu, srcu_idx);
+
return data.has_rq;
}
@@ -3891,7 +3885,6 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
unsigned int queue_depth, struct request *flush_rq)
{
int i;
- unsigned long flags;
/* The hw queue may not be mapped yet */
if (!tags)
@@ -3901,15 +3894,6 @@ static void blk_mq_clear_flush_rq_mapping(struct blk_mq_tags *tags,
for (i = 0; i < queue_depth; i++)
cmpxchg(&tags->rqs[i], flush_rq, NULL);
-
- /*
- * Wait until all pending iteration is done.
- *
- * Request reference is cleared and it is guaranteed to be observed
- * after the ->lock is released.
- */
- spin_lock_irqsave(&tags->lock, flags);
- spin_unlock_irqrestore(&tags->lock, flags);
}
static void blk_free_flush_queue_callback(struct rcu_head *head)
--
2.47.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH V2 0/5] blk-mq: Replace tags->lock with SRCU for tag iterators
From: Martin K. Petersen @ 2025-08-31 0:35 UTC
To: Ming Lei; +Cc: Jens Axboe, linux-block, Hannes Reinecke, Yu Kuai
Ming,
> Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read
> lock around the tag iterators.
>
> This avoids the scsi_host_busy() lockup seen while the SCSI host is
> blocked on systems with many CPU cores and a deep queue depth.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
--
Martin K. Petersen