From: Mohamed Khalfella <mkhalfella@purestorage.com>
To: Keith Busch <kbusch@kernel.org>
Cc: Bart Van Assche <bvanassche@acm.org>,
Chaitanya Kulkarni <kch@nvidia.com>,
Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
Sagi Grimberg <sagi@grimberg.me>,
Casey Chen <cachen@purestorage.com>,
Yuanyuan Zhong <yzhong@purestorage.com>,
Hannes Reinecke <hare@suse.de>, Ming Lei <ming.lei@redhat.com>,
Waiman Long <llong@redhat.com>, Hillf Danton <hdanton@sina.com>,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
Date: Fri, 5 Dec 2025 08:39:26 -0800 [thread overview]
Message-ID: <20251205163926.GI2376676-mkhalfella@purestorage.com> (raw)
In-Reply-To: <aTI2L6j50VWjp7aW@kbusch-mbp>
On Thu 2025-12-04 18:32:31 -0700, Keith Busch wrote:
> On Thu, Dec 04, 2025 at 01:22:49PM -1000, Bart Van Assche wrote:
> >
> > On 12/4/25 11:26 AM, Keith Busch wrote:
> > > On Thu, Dec 04, 2025 at 10:24:03AM -1000, Bart Van Assche wrote:>> Hence, the deadlock can be
> > > > solved by removing the blk_mq_quiesce_tagset() call from nvme_timeout()
> > > > and by failing I/O from inside nvme_timeout(). If nvme_timeout() fails
> > > > I/O and does not call blk_mq_quiesce_tagset() then the
> > > > blk_mq_freeze_queue_wait() call will finish instead of triggering a
> > > > deadlock. However, I do not know whether this proposal seems acceptable
> > > > to the NVMe maintainers.
> > >
> > > You periodically make this suggestion, but there's never a reason
> > > offered to introduce yet another work queue for the driver to
> > > synchronize with at various points. The whole point of making blk-mq
> > > timeout handler in a work queue (it used to be a timer) was so that we
> > > could do blocking actions like this.
> > Hi Keith,
> >
> > The blk_mq_quiesce_tagset() call from the NVMe timeout handler is
> > unfortunate because it triggers a deadlock with
> > blk_mq_update_tag_set_shared().
>
> So in this scenario, the driver is modifying a tagset list from two
> queues to one, which causes blk-mq to clear the "shared" flags. The
> remaining one just so happens to have hit a timeout at the same time,
> which runs in a context with an elevated "q_usage_counter". The current
> rule, then, is you can not take the tag_list_lock from any context using
> any queue in the tag list.
>
> > I proposed to modify the NVMe driver because I think that's a better
> > approach than introducing a new synchronize_rcu() call in the block
> > layer core.
>
> I'm not interested in introducing rcu synchronize here either. I guess I
> would make it so you can quiesce a tagset from a context that entered
> the queue. So quick shot at that here:
Why sychronize_rcu() is intolerable in this blk_mq_del_queue_tag_set()?
This code is not performance sensitive, right?
Looking at the code again, I _think_ synchronize_rcu() along with
INIT_LIST_HEAD(&q->tag_set_list) can be deleted. I do not see usecase
where "q" is re-added to a tagset after it is deleted from one.
Also, "q" is freed in blk_free_queue() after end of RCU grace period.
Are there any other concerns with this approach other than
synchronize_rcu()? If not, I will delete it and submit v3.
>
> ---
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4e96bb2462475..20450017b9512 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4262,11 +4262,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
> * Caller needs to ensure that we're either frozen/quiesced, or that
> * the queue isn't live yet.
> */
> -static void queue_set_hctx_shared(struct request_queue *q, bool shared)
> +static void queue_set_hctx_shared_locked(struct request_queue *q)
> {
> + struct blk_mq_tag_set *set = q->tag_set;
> struct blk_mq_hw_ctx *hctx;
> unsigned long i;
> + bool shared;
> +
> + lockdep_assert_held(&set->tag_list_lock);
>
> + shared = set->flags & BLK_MQ_F_TAG_QUEUE_SHARED;
> queue_for_each_hw_ctx(q, hctx, i) {
> if (shared) {
> hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> @@ -4277,24 +4282,22 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
> }
> }
>
> -static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
> - bool shared)
> +static void queue_set_hctx_shared(struct request_queue *q)
> {
> - struct request_queue *q;
> + struct blk_mq_tag_set *set = q->tag_set;
> unsigned int memflags;
>
> - lockdep_assert_held(&set->tag_list_lock);
> -
> - list_for_each_entry(q, &set->tag_list, tag_set_list) {
> - memflags = blk_mq_freeze_queue(q);
> - queue_set_hctx_shared(q, shared);
> - blk_mq_unfreeze_queue(q, memflags);
> - }
> + memflags = blk_mq_freeze_queue(q);
> + mutex_lock(&set->tag_list_lock);
> + queue_set_hctx_shared_locked(q);
> + mutex_unlock(&set->tag_list_lock);
> + blk_mq_unfreeze_queue(q, memflags);
> }
>
> static void blk_mq_del_queue_tag_set(struct request_queue *q)
> {
> struct blk_mq_tag_set *set = q->tag_set;
> + struct request_queue *shared = NULL;
>
> mutex_lock(&set->tag_list_lock);
> list_del(&q->tag_set_list);
> @@ -4302,15 +4305,25 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
> /* just transitioned to unshared */
> set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> /* update existing queue */
> - blk_mq_update_tag_set_shared(set, false);
> + shared = list_first_entry(&set->tag_list, struct request_queue,
> + tag_set_list);
> + if (!blk_get_queue(shared))
> + shared = NULL;
> }
> mutex_unlock(&set->tag_list_lock);
> INIT_LIST_HEAD(&q->tag_set_list);
> +
> + if (shared) {
> + queue_set_hctx_shared(shared);
> + blk_put_queue(shared);
> + }
> }
>
> static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
> struct request_queue *q)
> {
> + struct request_queue *shared = NULL;
> +
> mutex_lock(&set->tag_list_lock);
>
> /*
> @@ -4318,15 +4331,24 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
> */
> if (!list_empty(&set->tag_list) &&
> !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
> - set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> /* update existing queue */
> - blk_mq_update_tag_set_shared(set, true);
> + shared = list_first_entry(&set->tag_list, struct request_queue,
> + tag_set_list);
> + if (!blk_get_queue(shared))
> + shared = NULL;
> + else
> + set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
> }
> if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
> - queue_set_hctx_shared(q, true);
> + queue_set_hctx_shared_locked(q);
> list_add_tail(&q->tag_set_list, &set->tag_list);
>
> mutex_unlock(&set->tag_list_lock);
> +
> + if (shared) {
> + queue_set_hctx_shared(shared);
> + blk_put_queue(shared);
> + }
> }
>
> /* All allocations will be freed in release handler of q->mq_kobj */
> --
__blk_mq_update_nr_hw_queues() freezes the queues while holding
set->tag_list_lock. Can this cause the same deadlock?
next prev parent reply other threads:[~2025-12-05 16:39 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-04 18:11 [PATCH 0/1] Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock Mohamed Khalfella
2025-12-04 18:11 ` [PATCH 1/1] block: " Mohamed Khalfella
2025-12-04 18:22 ` Bart Van Assche
2025-12-04 18:42 ` Mohamed Khalfella
2025-12-04 19:06 ` Bart Van Assche
2025-12-04 19:15 ` Mohamed Khalfella
2025-12-04 19:31 ` Bart Van Assche
2025-12-04 19:57 ` Mohamed Khalfella
2025-12-04 20:24 ` Bart Van Assche
2025-12-04 21:26 ` Keith Busch
2025-12-04 23:22 ` Bart Van Assche
2025-12-05 1:32 ` Keith Busch
2025-12-05 2:52 ` Bart Van Assche
2025-12-05 16:39 ` Mohamed Khalfella [this message]
2025-12-05 18:11 ` Keith Busch
2025-12-08 19:22 ` Bart Van Assche
2025-12-04 19:02 ` Mohamed Khalfella
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251205163926.GI2376676-mkhalfella@purestorage.com \
--to=mkhalfella@purestorage.com \
--cc=axboe@kernel.dk \
--cc=bvanassche@acm.org \
--cc=cachen@purestorage.com \
--cc=hare@suse.de \
--cc=hch@lst.de \
--cc=hdanton@sina.com \
--cc=kbusch@kernel.org \
--cc=kch@nvidia.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nvme@lists.infradead.org \
--cc=llong@redhat.com \
--cc=ming.lei@redhat.com \
--cc=sagi@grimberg.me \
--cc=yzhong@purestorage.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox