public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
From: Keith Busch <kbusch@kernel.org>
To: Bart Van Assche <bvanassche@acm.org>
Cc: Mohamed Khalfella <mkhalfella@purestorage.com>,
	Chaitanya Kulkarni <kch@nvidia.com>,
	Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
	Sagi Grimberg <sagi@grimberg.me>,
	Casey Chen <cachen@purestorage.com>,
	Yuanyuan Zhong <yzhong@purestorage.com>,
	Hannes Reinecke <hare@suse.de>, Ming Lei <ming.lei@redhat.com>,
	Waiman Long <llong@redhat.com>, Hillf Danton <hdanton@sina.com>,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
Date: Thu, 4 Dec 2025 18:32:31 -0700	[thread overview]
Message-ID: <aTI2L6j50VWjp7aW@kbusch-mbp> (raw)
In-Reply-To: <201a7e9e-4782-4f71-a73b-9d58a51ee8ec@acm.org>

On Thu, Dec 04, 2025 at 01:22:49PM -1000, Bart Van Assche wrote:
> 
> On 12/4/25 11:26 AM, Keith Busch wrote:
> > On Thu, Dec 04, 2025 at 10:24:03AM -1000, Bart Van Assche wrote:>> Hence, the deadlock can be
> > > solved by removing the blk_mq_quiesce_tagset() call from nvme_timeout()
> > > and by failing I/O from inside nvme_timeout(). If nvme_timeout() fails
> > > I/O and does not call blk_mq_quiesce_tagset() then the
> > > blk_mq_freeze_queue_wait() call will finish instead of triggering a
> > > deadlock. However, I do not know whether this proposal seems acceptable
> > > to the NVMe maintainers.
> > 
> > You periodically make this suggestion, but there's never a reason
> > offered to introduce yet another work queue for the driver to
> > synchronize with at various points. The whole point of making blk-mq
> > timeout handler in a work queue (it used to be a timer) was so that we
> > could do blocking actions like this.
> Hi Keith,
> 
> The blk_mq_quiesce_tagset() call from the NVMe timeout handler is
> unfortunate because it triggers a deadlock with
> blk_mq_update_tag_set_shared().

So in this scenario, the driver is modifying a tagset list from two
queues to one, which causes blk-mq to clear the "shared" flags. The
remaining one just so happens to have hit a timeout at the same time,
which runs in a context with an elevated "q_usage_counter". The current
rule, then, is you can not take the tag_list_lock from any context using
any queue in the tag list.

> I proposed to modify the NVMe driver because I think that's a better
> approach than introducing a new synchronize_rcu() call in the block
> layer core.

I'm not interested in introducing rcu synchronize here either. I guess I
would make it so you can quiesce a tagset from a context that entered
the queue. So quick shot at that here:

---

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e96bb2462475..20450017b9512 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4262,11 +4262,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
  * Caller needs to ensure that we're either frozen/quiesced, or that
  * the queue isn't live yet.
  */
-static void queue_set_hctx_shared(struct request_queue *q, bool shared)
+static void queue_set_hctx_shared_locked(struct request_queue *q)
 {
+	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
+	bool shared;
+
+	lockdep_assert_held(&set->tag_list_lock);
 
+	shared = set->flags & BLK_MQ_F_TAG_QUEUE_SHARED;
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (shared) {
 			hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
@@ -4277,24 +4282,22 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	}
 }
 
-static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
-					 bool shared)
+static void queue_set_hctx_shared(struct request_queue *q)
 {
-	struct request_queue *q;
+	struct blk_mq_tag_set *set = q->tag_set;
 	unsigned int memflags;
 
-	lockdep_assert_held(&set->tag_list_lock);
-
-	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		memflags = blk_mq_freeze_queue(q);
-		queue_set_hctx_shared(q, shared);
-		blk_mq_unfreeze_queue(q, memflags);
-	}
+	memflags = blk_mq_freeze_queue(q);
+	mutex_lock(&set->tag_list_lock);
+	queue_set_hctx_shared_locked(q);
+	mutex_unlock(&set->tag_list_lock);
+	blk_mq_unfreeze_queue(q, memflags);
 }
 
 static void blk_mq_del_queue_tag_set(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
+	struct request_queue *shared = NULL;
 
 	mutex_lock(&set->tag_list_lock);
 	list_del(&q->tag_set_list);
@@ -4302,15 +4305,25 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_shared(set, false);
+		shared = list_first_entry(&set->tag_list, struct request_queue,
+					  tag_set_list);
+		if (!blk_get_queue(shared))
+			shared = NULL;
 	}
 	mutex_unlock(&set->tag_list_lock);
 	INIT_LIST_HEAD(&q->tag_set_list);
+
+	if (shared) {
+		queue_set_hctx_shared(shared);
+		blk_put_queue(shared);
+	}
 }
 
 static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 				     struct request_queue *q)
 {
+	struct request_queue *shared = NULL;
+
 	mutex_lock(&set->tag_list_lock);
 
 	/*
@@ -4318,15 +4331,24 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	 */
 	if (!list_empty(&set->tag_list) &&
 	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
-		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_shared(set, true);
+		shared = list_first_entry(&set->tag_list, struct request_queue,
+					  tag_set_list);
+		if (!blk_get_queue(shared))
+			shared = NULL;
+		else
+			set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 	}
 	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		queue_set_hctx_shared(q, true);
+		queue_set_hctx_shared_locked(q);
 	list_add_tail(&q->tag_set_list, &set->tag_list);
 
 	mutex_unlock(&set->tag_list_lock);
+
+	if (shared) {
+		queue_set_hctx_shared(shared);
+		blk_put_queue(shared);
+	}
 }
 
 /* All allocations will be freed in release handler of q->mq_kobj */
--

  reply	other threads:[~2025-12-05  1:32 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-04 18:11 [PATCH 0/1] Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock Mohamed Khalfella
2025-12-04 18:11 ` [PATCH 1/1] block: " Mohamed Khalfella
2025-12-04 18:22   ` Bart Van Assche
2025-12-04 18:42     ` Mohamed Khalfella
2025-12-04 19:06       ` Bart Van Assche
2025-12-04 19:15         ` Mohamed Khalfella
2025-12-04 19:31           ` Bart Van Assche
2025-12-04 19:57             ` Mohamed Khalfella
2025-12-04 20:24               ` Bart Van Assche
2025-12-04 21:26                 ` Keith Busch
2025-12-04 23:22                   ` Bart Van Assche
2025-12-05  1:32                     ` Keith Busch [this message]
2025-12-05  2:52                       ` Bart Van Assche
2025-12-05 16:39                       ` Mohamed Khalfella
2025-12-05 18:11                         ` Keith Busch
2025-12-08 19:22                       ` Bart Van Assche
2025-12-04 19:02   ` Mohamed Khalfella

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aTI2L6j50VWjp7aW@kbusch-mbp \
    --to=kbusch@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=bvanassche@acm.org \
    --cc=cachen@purestorage.com \
    --cc=hare@suse.de \
    --cc=hch@lst.de \
    --cc=hdanton@sina.com \
    --cc=kch@nvidia.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=llong@redhat.com \
    --cc=ming.lei@redhat.com \
    --cc=mkhalfella@purestorage.com \
    --cc=sagi@grimberg.me \
    --cc=yzhong@purestorage.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox