Linux-NVME Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHv2] nvme-pci: fix stuck reset on concurrent DPC and HP
@ 2025-03-07 23:26 Keith Busch
  2025-03-08  7:51 ` Nilay Shroff
  2025-03-10 13:23 ` Christoph Hellwig
  0 siblings, 2 replies; 3+ messages in thread
From: Keith Busch @ 2025-03-07 23:26 UTC (permalink / raw)
  To: linux-nvme; +Cc: hch, sagi, nilay, Keith Busch

From: Keith Busch <kbusch@kernel.org>

The PCIe DPC handling has the nvme driver quiesce the device, attempt to
restart it, then wait for that restart to complete.

The DPC event also toggles the PCIe link. If the slot doesn't have
out-of-band presence detection, this will trigger a pciehp
re-enumeration.

The DPC's error handling that calls nvme_error_resume is holding the
device lock while this happens. This lock prevents pciehp's request to
disconnect the driver from proceeding.

Meanwhile the nvme's reset_work can't make forward progress because its
device isn't there anymore with admin IO, and the timeout handler won't
do anything to fix it because the device is undergoing error handling.

End result: deadlocked.

Fix this by having the timeout handler disable the nvme queueus for a
disconnected PCIe device. We're relying on an IO timeout to unblock
this, which is a minute by default.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
v1->v2:

  Leveraged the state machine to make the patch that much simpler.

 drivers/nvme/host/pci.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 640590b217282..710f3dfef3663 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1411,9 +1411,12 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	struct nvme_dev *dev = nvmeq->dev;
 	struct request *abort_req;
 	struct nvme_command cmd = { };
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	u32 csts = readl(dev->bar + NVME_REG_CSTS);
 	u8 opcode;
 
+	if (pci_dev_is_disconnected(pdev))
+		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
 	if (nvme_state_terminal(&dev->ctrl))
 		goto disable;
 
@@ -1421,7 +1424,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	 * the recovery mechanism will surely fail.
 	 */
 	mb();
-	if (pci_channel_offline(to_pci_dev(dev->dev)))
+	if (pci_channel_offline(pdev))
 		return BLK_EH_RESET_TIMER;
 
 	/*
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-03-10 14:49 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-07 23:26 [PATCHv2] nvme-pci: fix stuck reset on concurrent DPC and HP Keith Busch
2025-03-08  7:51 ` Nilay Shroff
2025-03-10 13:23 ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox