Re: [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ming Lei <ming.lei@redhat.com>
To: Mohamed Khalfella <mkhalfella@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>, Keith Busch <kbusch@kernel.org>,
	Sagi Grimberg <sagi@grimberg.me>,
	Chaitanya Kulkarni <kch@nvidia.com>,
	Casey Chen <cachen@purestorage.com>,
	Vikas Manocha <vmanocha@purestorage.com>,
	Yuanyuan Zhong <yzhong@purestorage.com>,
	Hannes Reinecke <hare@suse.de>,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
Date: Sun, 16 Nov 2025 23:15:38 +0800	[thread overview]
Message-ID: <aRnqmiFmWAI01zJq@fedora> (raw)
In-Reply-To: <20251114173419.GA2197103-mkhalfella@purestorage.com>

On Fri, Nov 14, 2025 at 09:34:19AM -0800, Mohamed Khalfella wrote:
> On Fri 2025-11-14 19:41:14 +0800, Ming Lei wrote:
> > On Thu, Nov 13, 2025 at 12:23:20PM -0800, Mohamed Khalfella wrote:
> > > blk_mq_{add,del}_queue_tag_set() functions add and remove queues from
> > > tagset, the functions make sure that tagset and queues are marked as
> > > shared when two or more queues are attached to the same tagset.
> > > Initially a tagset starts as unshared and when the number of added
> > > queues reaches two, blk_mq_add_queue_tag_set() marks it as shared along
> > > with all the queues attached to it. When the number of attached queues
> > > drops to 1 blk_mq_del_queue_tag_set() need to mark both the tagset and
> > > the remaining queues as unshared.
> > > 
> > > Both functions need to freeze current queues in tagset before setting on
> > > unsetting BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so, both functions
> > > hold set->tag_list_lock mutex, which makes sense as we do not want
> > > queues to be added or deleted in the process. This used to work fine
> > > until commit 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> > > made the nvme driver quiesce tagset instead of quiscing individual
> > > queues. blk_mq_quiesce_tagset() does the job and quiesce the queues in
> > > set->tag_list while holding set->tag_list_lock also.
> > > 
> > > This results in deadlock between two threads with these stacktraces:
> > > 
> > >   __schedule+0x48e/0xed0
> > >   schedule+0x5a/0xc0
> > >   schedule_preempt_disabled+0x11/0x20
> > >   __mutex_lock.constprop.0+0x3cc/0x760
> > >   blk_mq_quiesce_tagset+0x26/0xd0
> > >   nvme_dev_disable_locked+0x77/0x280 [nvme]
> > >   nvme_timeout+0x268/0x320 [nvme]
> > >   blk_mq_handle_expired+0x5d/0x90
> > >   bt_iter+0x7e/0x90
> > >   blk_mq_queue_tag_busy_iter+0x2b2/0x590
> > >   ? __blk_mq_complete_request_remote+0x10/0x10
> > >   ? __blk_mq_complete_request_remote+0x10/0x10
> > >   blk_mq_timeout_work+0x15b/0x1a0
> > >   process_one_work+0x133/0x2f0
> > >   ? mod_delayed_work_on+0x90/0x90
> > >   worker_thread+0x2ec/0x400
> > >   ? mod_delayed_work_on+0x90/0x90
> > >   kthread+0xe2/0x110
> > >   ? kthread_complete_and_exit+0x20/0x20
> > >   ret_from_fork+0x2d/0x50
> > >   ? kthread_complete_and_exit+0x20/0x20
> > >   ret_from_fork_asm+0x11/0x20
> > > 
> > >   __schedule+0x48e/0xed0
> > >   schedule+0x5a/0xc0
> > >   blk_mq_freeze_queue_wait+0x62/0x90
> > >   ? destroy_sched_domains_rcu+0x30/0x30
> > >   blk_mq_exit_queue+0x151/0x180
> > >   disk_release+0xe3/0xf0
> > >   device_release+0x31/0x90
> > >   kobject_put+0x6d/0x180
> > >   nvme_scan_ns+0x858/0xc90 [nvme_core]
> > >   ? nvme_scan_work+0x281/0x560 [nvme_core]
> > >   nvme_scan_work+0x281/0x560 [nvme_core]
> > >   process_one_work+0x133/0x2f0
> > >   ? mod_delayed_work_on+0x90/0x90
> > >   worker_thread+0x2ec/0x400
> > >   ? mod_delayed_work_on+0x90/0x90
> > >   kthread+0xe2/0x110
> > >   ? kthread_complete_and_exit+0x20/0x20
> > >   ret_from_fork+0x2d/0x50
> > >   ? kthread_complete_and_exit+0x20/0x20
> > >   ret_from_fork_asm+0x11/0x20
> > 
> > It is one AB-BA deadlock, lockdep should have complained it, but nvme doesn't
> > support owned freeze queue.
> > 
> > Maybe the following change can avoid it?
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index c916176bd9f0..9967c4a7e72d 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -3004,6 +3004,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >         bool dead;
> > 
> >         mutex_lock(&dev->shutdown_lock);
> > +       nvme_quiesce_io_queues(&dev->ctrl);
> >         dead = nvme_pci_ctrl_is_dead(dev);
> >         if (state == NVME_CTRL_LIVE || state == NVME_CTRL_RESETTING) {
> >                 if (pci_is_enabled(pdev))
> > @@ -3016,8 +3017,6 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >                         nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
> >         }
> > 
> > -       nvme_quiesce_io_queues(&dev->ctrl);
> > -
> >         if (!dead && dev->ctrl.queue_count > 0) {
> >                 nvme_delete_io_queues(dev);
> >                 nvme_disable_ctrl(&dev->ctrl, shutdown);
> > 
> > 
> 
> Interesting. Can you elaborate more on why this diff can help us avoid
> the problem?

In nvme_dev_disable(), NS queues are frozen first, then call
nvme_quiesce_io_queues().

queue freeze can be thought as one lock, so q->freeze_lock -> tag_set->tag_list_lock
in timeout code path.

However, in blk_mq_exit_queue(), the lock order becomes
tag_set->tag_list_lock -> q->freeze_lock.

That is why I call it AB-BA lock.

However, that looks not the reason in your deadlock, so my patch shouldn't
work here, in which nvme_dev_disable() doesn't provide forward-progress
because of ->tag_list_lock.

> 
> If the thread doing nvme_scan_work() is waiting for queue to be frozen
> while holding set->tag_list_lock, then nvme_quiesce_io_queues() moved up
> will cause the deadlock, no?

 __schedule+0x48e/0xed0
  schedule+0x5a/0xc0
  blk_mq_freeze_queue_wait+0x62/0x90
  ? destroy_sched_domains_rcu+0x30/0x30
  blk_mq_exit_queue+0x151/0x180
  disk_release+0xe3/0xf0
  device_release+0x31/0x90
  kobject_put+0x6d/0x180

Here blk_mq_exit_queue() is called from disk release, and the disk is not
added actually, so question is why blk_mq_freeze_queue_wait() doesn't
return? Who holds queue usage counter here?  Can you investigate a bit
and figure out the reason?

For one released NS/disk, no one should grab the queue usage counter,
right?


Thanks,
Ming

next prev parent reply	other threads:[~2025-11-16 15:16 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-13 20:23 [PATCH] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock Mohamed Khalfella
2025-11-14  4:56 ` Chaitanya Kulkarni
2025-11-14  5:34   ` Mohamed Khalfella
2025-11-14 11:41 ` Ming Lei
2025-11-14 17:34   ` Mohamed Khalfella
2025-11-16 15:15     ` Ming Lei [this message]
2025-11-16 16:46       ` Mohamed Khalfella
2025-11-17  2:13         ` Ming Lei
2025-11-17  2:30 ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aRnqmiFmWAI01zJq@fedora \
    --to=ming.lei@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=cachen@purestorage.com \
    --cc=hare@suse.de \
    --cc=kbusch@kernel.org \
    --cc=kch@nvidia.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=mkhalfella@purestorage.com \
    --cc=sagi@grimberg.me \
    --cc=vmanocha@purestorage.com \
    --cc=yzhong@purestorage.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.