All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mohamed Khalfella <mkhalfella@purestorage.com>
To: Hannes Reinecke <hare@suse.de>
Cc: Chaitanya Kulkarni <kch@nvidia.com>,
	Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
	Keith Busch <kbusch@kernel.org>, Sagi Grimberg <sagi@grimberg.me>,
	Casey Chen <cachen@purestorage.com>,
	Yuanyuan Zhong <yzhong@purestorage.com>,
	Ming Lei <ming.lei@redhat.com>, Waiman Long <llong@redhat.com>,
	Hillf Danton <hdanton@sina.com>,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 1/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock
Date: Tue, 9 Dec 2025 09:49:20 -0800	[thread overview]
Message-ID: <20251209174920.GF337106-mkhalfella@purestorage.com> (raw)
In-Reply-To: <eb03af5f-6371-4e3b-acfc-9c3d75403d18@suse.de>

On Tue 2025-12-09 08:30:23 +0100, Hannes Reinecke wrote:
> On 12/5/25 22:17, Mohamed Khalfella wrote:
> > blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
> > tagset, the functions make sure that tagset and queues are marked as
> > shared when two or more queues are attached to the same tagset.
> > Initially a tagset starts as unshared and when the number of added
> > queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> > with all the queues attached to it. When the number of attached queues
> > drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
> > the remaining queues as unshared.
> > 
> > Both functions need to freeze current queues in tagset before setting on
> > unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
> > hold set->tag_list_lock mutex, which makes sense as we do not want
> > queues to be added or deleted in the process. This used to work fine
> > until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > made the nvme driver quiesce tagset instead of quiscing individual
> > queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
> > set->tag_list while holding set->tag_list_lock also.
> > 
> > This results in deadlock between two threads with these stacktraces:
> > 
> >    __schedule+0x47c/0xbb0
> >    ? timerqueue_add+0x66/0xb0
> >    schedule+0x1c/0xa0
> >    schedule_preempt_disabled+0xa/0x10
> >    __mutex_lock.constprop.0+0x271/0x600
> >    blk_mq_quiesce_tagset+0x25/0xc0
> >    nvme_dev_disable+0x9c/0x250
> >    nvme_timeout+0x1fc/0x520
> >    blk_mq_handle_expired+0x5c/0x90
> >    bt_iter+0x7e/0x90
> >    blk_mq_queue_tag_busy_iter+0x27e/0x550
> >    ? __blk_mq_complete_request_remote+0x10/0x10
> >    ? __blk_mq_complete_request_remote+0x10/0x10
> >    ? __call_rcu_common.constprop.0+0x1c0/0x210
> >    blk_mq_timeout_work+0x12d/0x170
> >    process_one_work+0x12e/0x2d0
> >    worker_thread+0x288/0x3a0
> >    ? rescuer_thread+0x480/0x480
> >    kthread+0xb8/0xe0
> >    ? kthread_park+0x80/0x80
> >    ret_from_fork+0x2d/0x50
> >    ? kthread_park+0x80/0x80
> >    ret_from_fork_asm+0x11/0x20
> > 
> >    __schedule+0x47c/0xbb0
> >    ? xas_find+0x161/0x1a0
> >    schedule+0x1c/0xa0
> >    blk_mq_freeze_queue_wait+0x3d/0x70
> >    ? destroy_sched_domains_rcu+0x30/0x30
> >    blk_mq_update_tag_set_shared+0x44/0x80
> >    blk_mq_exit_queue+0x141/0x150
> >    del_gendisk+0x25a/0x2d0
> >    nvme_ns_remove+0xc9/0x170
> >    nvme_remove_namespaces+0xc7/0x100
> >    nvme_remove+0x62/0x150
> >    pci_device_remove+0x23/0x60
> >    device_release_driver_internal+0x159/0x200
> >    unbind_store+0x99/0xa0
> >    kernfs_fop_write_iter+0x112/0x1e0
> >    vfs_write+0x2b1/0x3d0
> >    ksys_write+0x4e/0xb0
> >    do_syscall_64+0x5b/0x160
> >    entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > 
> > The top stacktrace is showing nvme_timeout() called to handle nvme
> > command timeout. timeout handler is trying to disable the controller and
> > as a first step, it needs to blk_mq_quiesce_tagset() to tell blk-mq not
> > to call queue callback handlers. The thread is stuck waiting for
> > set->tag_list_lock as it tries to walk the queues in set->tag_list.
> > 
> > The lock is held by the second thread in the bottom stack which is
> > waiting for one of queues to be frozen. The queue usage counter will
> > drop to zero after nvme_timeout() finishes, and this will not happen
> > because the thread will wait for this mutex forever.
> > 
> > Given that [un]quiescing queue is an operation that does not need to
> > sleep, update blk_mq_[un]quiesce_tagset() to use RCU instead of taking
> > set->tag_list_lock, update blk_mq_{add,del}_queue_tag_set() to use RCU
> > safe list operations. Also, delete INIT_LIST_HEAD(&q->tag_set_list)
> > in blk_mq_del_queue_tag_set() because we can not re-initialize it while
> > the list is being traversed under RCU. The deleted queue will not be
> > added/deleted to/from a tagset and it will be freed in blk_free_queue()
> > after the end of RCU grace period.
> > 
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > ---
> >   block/blk-mq.c | 17 ++++++++---------
> >   1 file changed, 8 insertions(+), 9 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index d626d32f6e57..05db3d20783f 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
> >   {
> >   	struct request_queue *q;
> >   
> > -	mutex_lock(&set->tag_list_lock);
> > -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > +	rcu_read_lock();
> > +	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list) {
> >   		if (!blk_queue_skip_tagset_quiesce(q))
> >   			blk_mq_quiesce_queue_nowait(q);
> >   	}
> > -	mutex_unlock(&set->tag_list_lock);
> > +	rcu_read_unlock();
> >   
> >   	blk_mq_wait_quiesce_done(set);
> >   }
> > @@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
> >   {
> >   	struct request_queue *q;
> >   
> > -	mutex_lock(&set->tag_list_lock);
> > -	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > +	rcu_read_lock();
> > +	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list) {
> >   		if (!blk_queue_skip_tagset_quiesce(q))
> >   			blk_mq_unquiesce_queue(q);
> >   	}
> > -	mutex_unlock(&set->tag_list_lock);
> > +	rcu_read_unlock();
> >   }
> >   EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
> >   
> > @@ -4294,7 +4294,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
> >   	struct blk_mq_tag_set *set = q->tag_set;
> >   
> >   	mutex_lock(&set->tag_list_lock);
> > -	list_del(&q->tag_set_list);
> > +	list_del_rcu(&q->tag_set_list);
> >   	if (list_is_singular(&set->tag_list)) {
> >   		/* just transitioned to unshared */
> >   		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
> > @@ -4302,7 +4302,6 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
> >   		blk_mq_update_tag_set_shared(set, false);
> >   	}
> >   	mutex_unlock(&set->tag_list_lock);
> > -	INIT_LIST_HEAD(&q->tag_set_list);
> >   }
> >   
> I'm ever so sceptical whether we can remove the INIT_LIST_HEAD() here.
> If we can it was pointless to begin with, but I somehow doubt that.
> Do you have a rationale for that (except from the fact that you
> are moving to RCU, and hence the 'q' pointer might not be valid then).
> 
I think it was pointless to begin with. 'q' is on its way to be freed.
q->tag_set_list is not going to be used again.

  parent reply	other threads:[~2025-12-09 17:49 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-05 21:17 [PATCH v4 0/1] block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock Mohamed Khalfella
2025-12-05 21:17 ` [PATCH v4 1/1] " Mohamed Khalfella
2025-12-08 19:28   ` Bart Van Assche
2025-12-09  4:04   ` Ming Lei
2025-12-09  7:30   ` Hannes Reinecke
2025-12-09 17:00     ` Bart Van Assche
2025-12-09 17:49     ` Mohamed Khalfella [this message]
2025-12-09 17:34 ` [PATCH v4 0/1] " Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251209174920.GF337106-mkhalfella@purestorage.com \
    --to=mkhalfella@purestorage.com \
    --cc=axboe@kernel.dk \
    --cc=cachen@purestorage.com \
    --cc=hare@suse.de \
    --cc=hch@lst.de \
    --cc=hdanton@sina.com \
    --cc=kbusch@kernel.org \
    --cc=kch@nvidia.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=llong@redhat.com \
    --cc=ming.lei@redhat.com \
    --cc=sagi@grimberg.me \
    --cc=yzhong@purestorage.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.