From mboxrd@z Thu Jan  1 00:00:00 1970
From: keith.busch@intel.com (Keith Busch)
Date: Thu, 10 Aug 2017 15:34:23 -0400
Subject: Bug Report: can't unload nvme module in case of disabled device
In-Reply-To: <20170810191717.GA1604@localhost.localdomain>
References: <f28ebc80-0e67-8af7-79b2-a4b368e4bd83@mellanox.com>
 <20170810164536.GC14634@localhost.localdomain>
 <20170810191717.GA1604@localhost.localdomain>
Message-ID: <20170810193423.GA1180@localhost.localdomain>

On Thu, Aug 10, 2017@03:17:17PM -0400, Keith Busch wrote:
> On Thu, Aug 10, 2017@12:45:36PM -0400, Keith Busch wrote:
> > On Tue, Aug 01, 2017@03:58:10PM +0300, Max Gurtovoy wrote:
> > > Hi all,
> > > 
> > > I would like to report a bug that reproduced by the following steps (I'm
> > > using 4.13.0-rc3+):
> > > 
> > > 1. modprobe nvme
> > > 2. echo 0 >  /sys/block/nvme0n1/device/device/enable
> > > 3. nvme list (stuck for more than 1-2 mins)
> > > 4. modprobe -r nvme (stuck forever)
> > > 
> > > log:
> > > 
> > > [ 1342.388888] nvme nvme0: controller is down; will reset: CSTS=0x3,
> > > PCI_STATUS=0x10
> > > [ 1476.021392] INFO: task kworker/u98:1:436 blocked for more than 120
> > > seconds.
> > > [ 1476.029072]       Not tainted 4.13.0-rc3+ #19
> > > [ 1476.033878] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > > this message.
> > > [ 1476.042505] kworker/u98:1   D    0   436      2 0x00000000
> > > [ 1476.048569] Workqueue: nvme-wq nvme_reset_work [nvme]
> > > [ 1476.054133] Call Trace:
> > > [ 1476.056862]  __schedule+0x1dc/0x780
> > > [ 1476.060706]  schedule+0x36/0x80
> > > [ 1476.064180]  blk_mq_freeze_queue_wait+0x4b/0xb0
> > > [ 1476.069175]  ? remove_wait_queue+0x60/0x60
> > > [ 1476.073693]  nvme_wait_freeze+0x33/0x50 [nvme_core]
> > > [ 1476.079068]  nvme_reset_work+0x6b9/0xc40 [nvme]
> > > [ 1476.084075]  ? __switch_to+0x23e/0x4a0
> > > [ 1476.088209]  process_one_work+0x149/0x360
> > > [ 1476.092625]  worker_thread+0x4d/0x3c0
> > > [ 1476.096692]  kthread+0x109/0x140
> > > [ 1476.100247]  ? rescuer_thread+0x380/0x380
> > > [ 1476.104664]  ? kthread_park+0x60/0x60
> > > [ 1476.108698]  ret_from_fork+0x25/0x30
> > 
> > This looks like a path does not pair the freeze start with the reset's
> > freeze wait. I'll have to see what the pci 'enable' sysfs entry does.
> 
> I see how the freeze start/stops are not paired in this scenario:
> nvme_dev_disable doesn't start the freeze if the pci device isn't
> disabled. It uses this to know if it is disabling the device twice.
> 
> In this test, though, you are disabling the pci device without the
> driver's knowledge, so that breaks that logic. In light of that, we'll
> need different criteria to know when the driver should start a freeze.
> I'll test some things out and send a patch.

This should fix it for your scenario, but I am not completely sure this
can't get the freeze depth higher than we need.

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cd888a4..ca03980 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2006,12 +2006,14 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
 	mutex_lock(&dev->shutdown_lock);
+
+	if (dev->ctrl.state == NVME_CTRL_LIVE ||
+	    dev->ctrl.state == NVME_CTRL_RESETTING)
+		nvme_start_freeze(&dev->ctrl);
+
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
-			nvme_start_freeze(&dev->ctrl);
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state  != pci_channel_io_normal);
 	}
--