* [PATCH v1 0/1] nvme: Fix problem when booting from NVMe drive was leading to a hang.
@ 2024-02-21 8:43 Michael Kropaczek
2024-02-22 16:27 ` Keith Busch
0 siblings, 1 reply; 4+ messages in thread
From: Michael Kropaczek @ 2024-02-21 8:43 UTC (permalink / raw)
To: linux-nvme
Cc: Michael Kropaczek, Keith Busch, Jens Axboe, Christoph Hellwig,
Sagi Grimberg
Description:
During endurance test, when a system was rebooted from NVMe drive, boot
process hung occasionally. The number of reboot cycles was set to 1000,
with interval of 120s. Hang occurred after ~300 reboot cycles.
After investigating the cause, it was established that NVMe driver
did not disable host memory during shutdown leaving NVMe controller
in a state preventing proper initialization in BIOS pre-boot stage.
Adding of the nvme_free_host_mem(dev) call fixed the issue.
Michael Kropaczek (1):
nvme: Fix problem when booting from NVMe drive was leading to a hang.
drivers/nvme/host/pci.c | 10 ++++++++++
1 file changed, 10 insertions(+)
base-commit: 8d30528a170905ede9ab6ab81f229e441808590b
--
2.34.1
From 1506c5099def8ecdd489d681033230492fa65bb2 Mon Sep 17 00:00:00 2001
From: Michael Kropaczek <michael.kropaczek@solidigm.com>
Date: Tue, 20 Feb 2024 23:40:36 -0800
Subject: [PATCH v1 1/1] nvme: Fix problem when booting from NVMe drive was
leading to a hang.
To: linux-nvme@lists.infradead.org
Cc: Keith Busch <kbusch@kernel.org>,
Jens Axboe <axboe@fb.com>,
Christoph Hellwig <hch@lst.de>,
Sagi Grimberg <sagi@grimberg.me>,
Michael Kropaczek <michael.kropaczek@solidigm.com>
On certain host architectures/HW, DRAM was keeping memory contents over reboot
cycles. Certain NVMe controllers were accessing host memory after startup which
led to undefined state, preventing proper initialization in BIOS boot stage.
Freeing host memory during host's shutdown prevents the problem from occurring.
Signed-off-by: Michael Kropaczek <michael.kropaczek@solidigm.com>
---
drivers/nvme/host/pci.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e6267a6aa380..ccddb7c379e3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2593,6 +2593,16 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
}
+ /*
+ * On certain host architectures/HW, DRAM was keeping memory contents over reboot-cycles.
+ * It was observed that certain controllers were accessing host memory after
+ * resetting which led to undefined state preventing proper initialization.
+ */
+ if (dev->hmb)
+ nvme_set_host_mem(dev, 0);
+
+ nvme_free_host_mem(dev);
+
nvme_quiesce_io_queues(&dev->ctrl);
if (!dead && dev->ctrl.queue_count > 0) {
--
2.34.1
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH v1 0/1] nvme: Fix problem when booting from NVMe drive was leading to a hang.
2024-02-21 8:43 [PATCH v1 0/1] nvme: Fix problem when booting from NVMe drive was leading to a hang Michael Kropaczek
@ 2024-02-22 16:27 ` Keith Busch
2024-02-22 20:02 ` Michael Kropaczek
0 siblings, 1 reply; 4+ messages in thread
From: Keith Busch @ 2024-02-22 16:27 UTC (permalink / raw)
To: Michael Kropaczek
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg
On Wed, Feb 21, 2024 at 12:43:40AM -0800, Michael Kropaczek wrote:
> @@ -2593,6 +2593,16 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
> }
>
> + /*
> + * On certain host architectures/HW, DRAM was keeping memory contents over reboot-cycles.
> + * It was observed that certain controllers were accessing host memory after
> + * resetting which led to undefined state preventing proper initialization.
> + */
> + if (dev->hmb)
> + nvme_set_host_mem(dev, 0);
> +
> + nvme_free_host_mem(dev);
> +
Shouldn't this go in nvme_shutdown() instead? I don't think we want to
tear down the controller's host memory on every controller reset.
^ permalink raw reply [flat|nested] 4+ messages in thread
* RE: [PATCH v1 0/1] nvme: Fix problem when booting from NVMe drive was leading to a hang.
2024-02-22 16:27 ` Keith Busch
@ 2024-02-22 20:02 ` Michael Kropaczek
2024-02-22 20:36 ` Keith Busch
0 siblings, 1 reply; 4+ messages in thread
From: Michael Kropaczek @ 2024-02-22 20:02 UTC (permalink / raw)
To: Keith Busch
Cc: linux-nvme@lists.infradead.org, Jens Axboe, Christoph Hellwig,
Sagi Grimberg
> -----Original Message-----
> From: Keith Busch <kbusch@kernel.org>
> Sent: Thursday, February 22, 2024 8:28 AM
> To: Michael Kropaczek <Michael.Kropaczek@solidigm.com>
> Cc: linux-nvme@lists.infradead.org; Jens Axboe <axboe@fb.com>; Christoph Hellwig <hch@lst.de>; Sagi Grimberg <sagi@grimberg.me>
> Subject: Re: [PATCH v1 0/1] nvme: Fix problem when booting from NVMe drive was leading to a hang.
>
>
> On Wed, Feb 21, 2024 at 12:43:40AM -0800, Michael Kropaczek wrote:
> > @@ -2593,6 +2593,16 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> > nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
> > }
> >
> > + /*
> > + * On certain host architectures/HW, DRAM was keeping memory contents over reboot-cycles.
> > + * It was observed that certain controllers were accessing host memory after
> > + * resetting which led to undefined state preventing proper initialization.
> > + */
> > + if (dev->hmb)
> > + nvme_set_host_mem(dev, 0);
> > +
> > + nvme_free_host_mem(dev);
> > +
>
> Shouldn't this go in nvme_shutdown() instead? I don't think we want to tear down the controller's host memory on every controller reset.
Thank you, Keith, this is a valid point.
The function nvme_dev_disable() changed here is called from the nvme_shutdown() subsequently through nvme_disable_prepare_reset() with the flag shutdown being set as true.
Considering the re-usability of the code, would it be correct to wrap up the change into if(shutdown) statement?
The reason for asking is that disabling of host memory should follow other calls as well and placing it within the body of nvme_shutdown() would require more adjustments.
The call to nvme_disable_prepare_reset() with the flag shutdown set as true is made whenever the NVMe controller should be shut down anyway.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH v1 0/1] nvme: Fix problem when booting from NVMe drive was leading to a hang.
2024-02-22 20:02 ` Michael Kropaczek
@ 2024-02-22 20:36 ` Keith Busch
0 siblings, 0 replies; 4+ messages in thread
From: Keith Busch @ 2024-02-22 20:36 UTC (permalink / raw)
To: Michael Kropaczek
Cc: linux-nvme@lists.infradead.org, Jens Axboe, Christoph Hellwig,
Sagi Grimberg
On Thu, Feb 22, 2024 at 08:02:52PM +0000, Michael Kropaczek wrote:
> The function nvme_dev_disable() changed here is called from the nvme_shutdown() subsequently through nvme_disable_prepare_reset() with the flag shutdown being set as true.
>
> Considering the re-usability of the code, would it be correct to wrap up the change into if(shutdown) statement?
>
> The reason for asking is that disabling of host memory should follow other calls as well and placing it within the body of nvme_shutdown() would require more adjustments.
>
> The call to nvme_disable_prepare_reset() with the flag shutdown set as true is made whenever the NVMe controller should be shut down anyway.
Just calling nvme_set_host_mem() should be sufficient; no need to free
the memory here as we may just be giving it back to the device on the
reset or resume side.
And per spec, the device is supposed to automatically clear its host
memory enabling on a reset or shutdown, so this shouldn't be necessary.
Sounds like something isn't spec complaint in your environment.
Going through the git histroy, we use to disable host mem here, but
timeout handling doesn't work within nvme_dev_disable(). All the
operations done from here need to guarantee forward progress, and this
set function command doesn't do that. The only example of admin commands
that handle their own timeout is the queue deletion sequence.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2024-02-22 20:36 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-21 8:43 [PATCH v1 0/1] nvme: Fix problem when booting from NVMe drive was leading to a hang Michael Kropaczek
2024-02-22 16:27 ` Keith Busch
2024-02-22 20:02 ` Michael Kropaczek
2024-02-22 20:36 ` Keith Busch
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox