From: Mohamed Khalfella <mkhalfella@purestorage.com>
To: Chaitanya Kulkarni <kch@nvidia.com>,
Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
Keith Busch <kbusch@kernel.org>, Sagi Grimberg <sagi@grimberg.me>
Cc: Casey Chen <cachen@purestorage.com>,
Yuanyuan Zhong <yzhong@purestorage.com>,
Hannes Reinecke <hare@suse.de>, Ming Lei <ming.lei@redhat.com>,
Waiman Long <llong@redhat.com>, Hillf Danton <hdanton@sina.com>,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org,
Mohamed Khalfella <mkhalfella@purestorage.com>
Subject: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
Date: Thu, 4 Dec 2025 10:11:53 -0800
Message-ID: <20251204181212.1484066-2-mkhalfella@purestorage.com>
In-Reply-To: <20251204181212.1484066-1-mkhalfella@purestorage.com>
blk_mq_{add,del}_queue_tag_set() add and remove queues from a tagset.
They make sure that the tagset and its queues are marked as shared when
two or more queues are attached to the same tagset. A tagset starts as
unshared; when the number of added queues reaches two,
blk_mq_add_queue_tag_set() marks it as shared along with all the queues
attached to it. When the number of attached queues drops to one,
blk_mq_del_queue_tag_set() needs to mark both the tagset and the
remaining queue as unshared.
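For illustration only, the shared/unshared transition can be modelled in
userspace roughly as below. This is a sketch, not the kernel code: struct
model_tagset, model_add_queue() and model_del_queue() are hypothetical
names, a plain counter stands in for set->tag_list, and a bool stands in
for BLK_MQ_F_TAG_QUEUE_SHARED. Locking and queue freezing are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's data structures. */
struct model_tagset {
	int nr_queues;	/* queues currently attached to the set */
	bool shared;	/* stands in for BLK_MQ_F_TAG_QUEUE_SHARED */
};

static void model_add_queue(struct model_tagset *set)
{
	set->nr_queues++;
	/* Second queue attached: the set and all its queues become shared. */
	if (set->nr_queues >= 2)
		set->shared = true;
}

static void model_del_queue(struct model_tagset *set)
{
	assert(set->nr_queues > 0);
	set->nr_queues--;
	/* Back to one queue: the set and the remaining queue become unshared. */
	if (set->nr_queues == 1)
		set->shared = false;
}
```

In the model, the flag flips only on the 1 -> 2 and 2 -> 1 transitions;
adding a third queue or removing one of three leaves it unchanged.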
Both functions need to freeze the current queues in the tagset before
setting or unsetting the BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so,
both functions hold the set->tag_list_lock mutex, which makes sense as we
do not want queues to be added or deleted in the process. This used to
work fine until commit 98d81f0df70c ("nvme: use
blk_mq_[un]quiesce_tagset") made the nvme driver quiesce the tagset
instead of quiescing individual queues. blk_mq_quiesce_tagset() quiesces
the queues in set->tag_list while also holding set->tag_list_lock.
This results in a deadlock between two threads with these stacktraces:
__schedule+0x48e/0xed0
schedule+0x5a/0xc0
schedule_preempt_disabled+0x11/0x20
__mutex_lock.constprop.0+0x3cc/0x760
blk_mq_quiesce_tagset+0x26/0xd0
nvme_dev_disable_locked+0x77/0x280 [nvme]
nvme_timeout+0x268/0x320 [nvme]
blk_mq_handle_expired+0x5d/0x90
bt_iter+0x7e/0x90
blk_mq_queue_tag_busy_iter+0x2b2/0x590
? __blk_mq_complete_request_remote+0x10/0x10
? __blk_mq_complete_request_remote+0x10/0x10
blk_mq_timeout_work+0x15b/0x1a0
process_one_work+0x133/0x2f0
? mod_delayed_work_on+0x90/0x90
worker_thread+0x2ec/0x400
? mod_delayed_work_on+0x90/0x90
kthread+0xe2/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x2d/0x50
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20
__schedule+0x48e/0xed0
schedule+0x5a/0xc0
blk_mq_freeze_queue_wait+0x62/0x90
? destroy_sched_domains_rcu+0x30/0x30
blk_mq_exit_queue+0x151/0x180
disk_release+0xe3/0xf0
device_release+0x31/0x90
kobject_put+0x6d/0x180
nvme_scan_ns+0x858/0xc90 [nvme_core]
? nvme_scan_work+0x281/0x560 [nvme_core]
nvme_scan_work+0x281/0x560 [nvme_core]
process_one_work+0x133/0x2f0
? mod_delayed_work_on+0x90/0x90
worker_thread+0x2ec/0x400
? mod_delayed_work_on+0x90/0x90
kthread+0xe2/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x2d/0x50
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20
The top stacktrace shows nvme_timeout() called to handle an nvme
command timeout. The timeout handler is trying to disable the controller
and, as a first step, it needs to call blk_mq_quiesce_tagset() to tell
blk-mq not to call queue callback handlers. The thread is stuck waiting
for set->tag_list_lock as it tries to walk the queues in set->tag_list.
The lock is held by the second thread, shown in the bottom stacktrace,
which is waiting for one of the queues to be frozen. The queue usage
counter will drop to zero only after nvme_timeout() finishes, and this
will not happen because the first thread will wait for the mutex forever.
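The circular wait described above can be pictured as a two-node wait-for
graph. The following is a userspace sketch under stated assumptions, not
kernel code: TIMEOUT_WORK, SCAN_WORK, NOBODY and model_has_deadlock() are
hypothetical names, and each task is assumed to wait on at most one other
task.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_TASKS 2
#define NOBODY   (-1)

/* Hypothetical labels for the two threads in the stacktraces. */
enum { TIMEOUT_WORK = 0, SCAN_WORK = 1 };

/*
 * waits_for[t] is the task that must make progress before task t can
 * continue, or NOBODY if t is not blocked on another task.
 * Returns true if 'start' is part of a wait-for cycle.
 */
static bool model_has_deadlock(const int waits_for[NR_TASKS], int start)
{
	int t = start;

	for (int steps = 0; steps < NR_TASKS; steps++) {
		t = waits_for[t];
		if (t == NOBODY)
			return false;	/* chain ends in a runnable task */
		assert(t >= 0 && t < NR_TASKS);
		if (t == start)
			return true;	/* walked back to where we began */
	}
	return false;
}
```

With the mutex in blk_mq_quiesce_tagset(), the model is
waits_for = { SCAN_WORK, TIMEOUT_WORK }: the timeout work waits for the
mutex held by the scan work, and the scan work waits for the freeze that
only completes once the timeout work finishes. If the timeout work no
longer waits on the mutex, its entry becomes NOBODY and the cycle is gone.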
Given that [un]quiescing a queue is an operation that does not need to
sleep, update blk_mq_[un]quiesce_tagset() to use RCU instead of taking
set->tag_list_lock. Also update blk_mq_{add,del}_queue_tag_set() to use
RCU-safe list operations. This should help avoid the deadlock seen above.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
block/blk-mq.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d626d32f6e57..ceb176ac154b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
 {
 	struct request_queue *q;

-	mutex_lock(&set->tag_list_lock);
-	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list) {
 		if (!blk_queue_skip_tagset_quiesce(q))
 			blk_mq_quiesce_queue_nowait(q);
 	}
-	mutex_unlock(&set->tag_list_lock);
+	rcu_read_unlock();

 	blk_mq_wait_quiesce_done(set);
 }
@@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
 {
 	struct request_queue *q;

-	mutex_lock(&set->tag_list_lock);
-	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list) {
 		if (!blk_queue_skip_tagset_quiesce(q))
 			blk_mq_unquiesce_queue(q);
 	}
-	mutex_unlock(&set->tag_list_lock);
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
@@ -4294,7 +4294,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 	struct blk_mq_tag_set *set = q->tag_set;

 	mutex_lock(&set->tag_list_lock);
-	list_del(&q->tag_set_list);
+	list_del_rcu(&q->tag_set_list);
 	if (list_is_singular(&set->tag_list)) {
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
@@ -4302,6 +4302,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		blk_mq_update_tag_set_shared(set, false);
 	}
 	mutex_unlock(&set->tag_list_lock);
+
+	synchronize_rcu();
 	INIT_LIST_HEAD(&q->tag_set_list);
 }
@@ -4321,7 +4323,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	}
 	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 		queue_set_hctx_shared(q, true);
-	list_add_tail(&q->tag_set_list, &set->tag_list);
+	list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
 	mutex_unlock(&set->tag_list_lock);
 }
--
2.51.2