* [PATCHv2] nvme: ensure disabling pairs with unquiesce
@ 2023-07-12 15:40 Keith Busch
2023-07-12 16:13 ` Christoph Hellwig
2023-07-13 2:54 ` Ming Lei
0 siblings, 2 replies; 4+ messages in thread
From: Keith Busch @ 2023-07-12 15:40 UTC (permalink / raw)
To: linux-nvme, hch; +Cc: sagi, Keith Busch, Ming Lei
From: Keith Busch <kbusch@kernel.org>
If any error handling that disables the controller fails to queue the
reset work, like if the state changed to disconnected inbetween, then
the failed teardown needs to unquiesce the queues since it's no longer
paired with reset_work. Just make sure that the controller can be put
into a resetting state prior to starting the disable so that no other
handling can change with the queue states while recovery is happening.
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
v1->v2:
Don't wait for a resetting state as that could deadlock. Just abort
the disabling if the error handling can't set the state to resetting.
drivers/nvme/host/pci.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8754b4a5c6844..8e7dbe0ab8904 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1298,9 +1298,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
*/
if (nvme_should_reset(dev, csts)) {
nvme_warn_reset(dev, csts);
- nvme_dev_disable(dev, false);
- nvme_reset_ctrl(&dev->ctrl);
- return BLK_EH_DONE;
+ goto disable;
}
/*
@@ -1351,10 +1349,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
"I/O %d QID %d timeout, reset controller\n",
req->tag, nvmeq->qid);
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
- nvme_dev_disable(dev, false);
- nvme_reset_ctrl(&dev->ctrl);
-
- return BLK_EH_DONE;
+ goto disable;
}
if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -1391,6 +1386,15 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
* as the device then is in a faulty state.
*/
return BLK_EH_RESET_TIMER;
+
+disable:
+ if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
+ return BLK_EH_DONE;
+
+ nvme_dev_disable(dev, false);
+ if (nvme_try_sched_reset(&dev->ctrl))
+ nvme_unquiesce_io_queues(&dev->ctrl);
+ return BLK_EH_DONE;
}
static void nvme_free_queue(struct nvme_queue *nvmeq)
@@ -3278,6 +3282,10 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
case pci_channel_io_frozen:
dev_warn(dev->ctrl.device,
"frozen state error detected, reset controller\n");
+ if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+ nvme_dev_disable(dev, true);
+ return PCI_ERS_RESULT_DISCONNECT;
+ }
nvme_dev_disable(dev, false);
return PCI_ERS_RESULT_NEED_RESET;
case pci_channel_io_perm_failure:
@@ -3294,7 +3302,8 @@ static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
dev_info(dev->ctrl.device, "restart after slot reset\n");
pci_restore_state(pdev);
- nvme_reset_ctrl(&dev->ctrl);
+ if (!nvme_try_sched_reset(&dev->ctrl))
+ nvme_unquiesce_io_queues(&dev->ctrl);
return PCI_ERS_RESULT_RECOVERED;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCHv2] nvme: ensure disabling pairs with unquiesce
2023-07-12 15:40 [PATCHv2] nvme: ensure disabling pairs with unquiesce Keith Busch
@ 2023-07-12 16:13 ` Christoph Hellwig
2023-07-12 16:24 ` Keith Busch
2023-07-13 2:54 ` Ming Lei
1 sibling, 1 reply; 4+ messages in thread
From: Christoph Hellwig @ 2023-07-12 16:13 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, hch, sagi, Keith Busch, Ming Lei
On Wed, Jul 12, 2023 at 08:40:04AM -0700, Keith Busch wrote:
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1298,9 +1298,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
> */
> if (nvme_should_reset(dev, csts)) {
> nvme_warn_reset(dev, csts);
> - nvme_dev_disable(dev, false);
> - nvme_reset_ctrl(&dev->ctrl);
> - return BLK_EH_DONE;
> + goto disable;
> }
>
> /*
> @@ -1351,10 +1349,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
> "I/O %d QID %d timeout, reset controller\n",
> req->tag, nvmeq->qid);
> nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> - nvme_dev_disable(dev, false);
> - nvme_reset_ctrl(&dev->ctrl);
> -
> - return BLK_EH_DONE;
> + goto disable;
> }
>
> if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
> @@ -1391,6 +1386,15 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
> * as the device then is in a faulty state.
> */
> return BLK_EH_RESET_TIMER;
> +
> +disable:
> + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
> + return BLK_EH_DONE;
> +
> + nvme_dev_disable(dev, false);
> + if (nvme_try_sched_reset(&dev->ctrl))
> + nvme_unquiesce_io_queues(&dev->ctrl);
> + return BLK_EH_DONE;
Just curious, do you remember why we do an explicitnvme_dev_disable
here instead of relying on the one at the beginning of
nvme_dev_disable? This isn't new, and the patch should not be blocked
on it, but it looks very odd to me an we'll need to either document
it or kill it eventually..
Otherwise looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCHv2] nvme: ensure disabling pairs with unquiesce
2023-07-12 16:13 ` Christoph Hellwig
@ 2023-07-12 16:24 ` Keith Busch
0 siblings, 0 replies; 4+ messages in thread
From: Keith Busch @ 2023-07-12 16:24 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, linux-nvme, sagi, Ming Lei
On Wed, Jul 12, 2023 at 06:13:27PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 12, 2023 at 08:40:04AM -0700, Keith Busch wrote:
> > +disable:
> > + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
> > + return BLK_EH_DONE;
> > +
> > + nvme_dev_disable(dev, false);
> > + if (nvme_try_sched_reset(&dev->ctrl))
> > + nvme_unquiesce_io_queues(&dev->ctrl);
> > + return BLK_EH_DONE;
>
> Just curious, do you remember why we do an explicitnvme_dev_disable
> here instead of relying on the one at the beginning of
> nvme_dev_disable? This isn't new, and the patch should not be blocked
> on it, but it looks very odd to me an we'll need to either document
> it or kill it eventually..
You mean why not rely on the disable in nvme_reset_work()? Because we're
in our timeout callback and can reclaim everything inline with the
notification. If we defer the disabling to the reset_work, the timeout
workqueue will keep running for potentially 1000's of outstanding IOs
for not particular reason. Not that that's a horrible thing, but it's
just inefficient. This was one of the nice benefits of switching the
timeout handling from a timer to workqueue.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCHv2] nvme: ensure disabling pairs with unquiesce
2023-07-12 15:40 [PATCHv2] nvme: ensure disabling pairs with unquiesce Keith Busch
2023-07-12 16:13 ` Christoph Hellwig
@ 2023-07-13 2:54 ` Ming Lei
1 sibling, 0 replies; 4+ messages in thread
From: Ming Lei @ 2023-07-13 2:54 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, hch, sagi, Keith Busch
On Wed, Jul 12, 2023 at 08:40:04AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> If any error handling that disables the controller fails to queue the
> reset work, like if the state changed to disconnected inbetween, then
> the failed teardown needs to unquiesce the queues since it's no longer
> paired with reset_work. Just make sure that the controller can be put
> into a resetting state prior to starting the disable so that no other
> handling can change with the queue states while recovery is happening.
>
> Reported-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> v1->v2:
>
> Don't wait for a resetting state as that could deadlock. Just abort
> the disabling if the error handling can't set the state to resetting.
Looks fine:
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Thanks,
Ming
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2023-07-13 2:55 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-12 15:40 [PATCHv2] nvme: ensure disabling pairs with unquiesce Keith Busch
2023-07-12 16:13 ` Christoph Hellwig
2023-07-12 16:24 ` Keith Busch
2023-07-13 2:54 ` Ming Lei
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox