All of lore.kernel.org
 help / color / mirror / Atom feed
From: Keith Busch <kbusch@kernel.org>
To: Bart Van Assche <bvanassche@acm.org>
Cc: Mohamed Khalfella <mkhalfella@purestorage.com>,
	Chaitanya Kulkarni <kch@nvidia.com>,
	Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
	Sagi Grimberg <sagi@grimberg.me>,
	Casey Chen <cachen@purestorage.com>,
	Yuanyuan Zhong <yzhong@purestorage.com>,
	Hannes Reinecke <hare@suse.de>, Ming Lei <ming.lei@redhat.com>,
	Waiman Long <llong@redhat.com>, Hillf Danton <hdanton@sina.com>,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
Date: Thu, 4 Dec 2025 18:32:31 -0700	[thread overview]
Message-ID: <aTI2L6j50VWjp7aW@kbusch-mbp> (raw)
In-Reply-To: <201a7e9e-4782-4f71-a73b-9d58a51ee8ec@acm.org>

On Thu, Dec 04, 2025 at 01:22:49PM -1000, Bart Van Assche wrote:
> 
> On 12/4/25 11:26 AM, Keith Busch wrote:
> > On Thu, Dec 04, 2025 at 10:24:03AM -1000, Bart Van Assche wrote:>> Hence, the deadlock can be
> > > solved by removing the blk_mq_quiesce_tagset() call from nvme_timeout()
> > > and by failing I/O from inside nvme_timeout(). If nvme_timeout() fails
> > > I/O and does not call blk_mq_quiesce_tagset() then the
> > > blk_mq_freeze_queue_wait() call will finish instead of triggering a
> > > deadlock. However, I do not know whether this proposal seems acceptable
> > > to the NVMe maintainers.
> > 
> > You periodically make this suggestion, but there's never a reason
> > offered to introduce yet another work queue for the driver to
> > synchronize with at various points. The whole point of making blk-mq
> > timeout handler in a work queue (it used to be a timer) was so that we
> > could do blocking actions like this.
> Hi Keith,
> 
> The blk_mq_quiesce_tagset() call from the NVMe timeout handler is
> unfortunate because it triggers a deadlock with
> blk_mq_update_tag_set_shared().

So in this scenario, the driver is modifying a tagset list from two
queues to one, which causes blk-mq to clear the "shared" flags. The
remaining one just so happens to have hit a timeout at the same time,
which runs in a context with an elevated "q_usage_counter". The current
rule, then, is you can not take the tag_list_lock from any context using
any queue in the tag list.

> I proposed to modify the NVMe driver because I think that's a better
> approach than introducing a new synchronize_rcu() call in the block
> layer core.

I'm not interested in introducing rcu synchronize here either. I guess I
would make it so you can quiesce a tagset from a context that entered
the queue. So quick shot at that here:

---

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e96bb2462475..20450017b9512 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4262,11 +4262,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
  * Caller needs to ensure that we're either frozen/quiesced, or that
  * the queue isn't live yet.
  */
-static void queue_set_hctx_shared(struct request_queue *q, bool shared)
+static void queue_set_hctx_shared_locked(struct request_queue *q)
 {
+	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
+	bool shared;
+
+	lockdep_assert_held(&set->tag_list_lock);
 
+	shared = set->flags & BLK_MQ_F_TAG_QUEUE_SHARED;
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (shared) {
 			hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
@@ -4277,24 +4282,22 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	}
 }
 
-static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
-					 bool shared)
+static void queue_set_hctx_shared(struct request_queue *q)
 {
-	struct request_queue *q;
+	struct blk_mq_tag_set *set = q->tag_set;
 	unsigned int memflags;
 
-	lockdep_assert_held(&set->tag_list_lock);
-
-	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		memflags = blk_mq_freeze_queue(q);
-		queue_set_hctx_shared(q, shared);
-		blk_mq_unfreeze_queue(q, memflags);
-	}
+	memflags = blk_mq_freeze_queue(q);
+	mutex_lock(&set->tag_list_lock);
+	queue_set_hctx_shared_locked(q);
+	mutex_unlock(&set->tag_list_lock);
+	blk_mq_unfreeze_queue(q, memflags);
 }
 
 static void blk_mq_del_queue_tag_set(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
+	struct request_queue *shared = NULL;
 
 	mutex_lock(&set->tag_list_lock);
 	list_del(&q->tag_set_list);
@@ -4302,15 +4305,25 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_shared(set, false);
+		shared = list_first_entry(&set->tag_list, struct request_queue,
+					  tag_set_list);
+		if (!blk_get_queue(shared))
+			shared = NULL;
 	}
 	mutex_unlock(&set->tag_list_lock);
 	INIT_LIST_HEAD(&q->tag_set_list);
+
+	if (shared) {
+		queue_set_hctx_shared(shared);
+		blk_put_queue(shared);
+	}
 }
 
 static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 				     struct request_queue *q)
 {
+	struct request_queue *shared = NULL;
+
 	mutex_lock(&set->tag_list_lock);
 
 	/*
@@ -4318,15 +4331,24 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	 */
 	if (!list_empty(&set->tag_list) &&
 	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
-		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_shared(set, true);
+		shared = list_first_entry(&set->tag_list, struct request_queue,
+					  tag_set_list);
+		if (!blk_get_queue(shared))
+			shared = NULL;
+		else
+			set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 	}
 	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		queue_set_hctx_shared(q, true);
+		queue_set_hctx_shared_locked(q);
 	list_add_tail(&q->tag_set_list, &set->tag_list);
 
 	mutex_unlock(&set->tag_list_lock);
+
+	if (shared) {
+		queue_set_hctx_shared(shared);
+		blk_put_queue(shared);
+	}
 }
 
 /* All allocations will be freed in release handler of q->mq_kobj */
--

  reply	other threads:[~2025-12-05  1:32 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-04 18:11 [PATCH 0/1] Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock Mohamed Khalfella
2025-12-04 18:11 ` [PATCH 1/1] block: " Mohamed Khalfella
2025-12-04 18:22   ` Bart Van Assche
2025-12-04 18:42     ` Mohamed Khalfella
2025-12-04 19:06       ` Bart Van Assche
2025-12-04 19:15         ` Mohamed Khalfella
2025-12-04 19:31           ` Bart Van Assche
2025-12-04 19:57             ` Mohamed Khalfella
2025-12-04 20:24               ` Bart Van Assche
2025-12-04 21:26                 ` Keith Busch
2025-12-04 23:22                   ` Bart Van Assche
2025-12-05  1:32                     ` Keith Busch [this message]
2025-12-05  2:52                       ` Bart Van Assche
2025-12-05 16:39                       ` Mohamed Khalfella
2025-12-05 18:11                         ` Keith Busch
2025-12-08 19:22                       ` Bart Van Assche
2025-12-04 19:02   ` Mohamed Khalfella

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aTI2L6j50VWjp7aW@kbusch-mbp \
    --to=kbusch@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=bvanassche@acm.org \
    --cc=cachen@purestorage.com \
    --cc=hare@suse.de \
    --cc=hch@lst.de \
    --cc=hdanton@sina.com \
    --cc=kch@nvidia.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=llong@redhat.com \
    --cc=ming.lei@redhat.com \
    --cc=mkhalfella@purestorage.com \
    --cc=sagi@grimberg.me \
    --cc=yzhong@purestorage.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.