Linux-NVME Archive on lore.kernel.org
From: Nilay Shroff <nilay@linux.ibm.com>
To: kbusch@kernel.org, axboe@fb.com, hch@lst.de, sagi@grimberg.me
Cc: linux-nvme@lists.infradead.org, gjoyce@linux.ibm.com,
	nilay@linux.ibm.com
Subject: [PATCH] nvme-pci: Fix EEH failure on ppc after subsystem reset
Date: Wed, 31 Jan 2024 11:35:53 +0530
Message-ID: <20240131060725.740426-1-nilay@linux.ibm.com>

If an nvme subsystem reset causes loss of communication with the nvme
adapter, EEH could potentially recover the adapter. However, the loss of
communication is only detected when the nvme driver attempts to read an
MMIO register.

The nvme subsystem reset command writes 0x4E564D65 to the NSSR register
and schedules an adapter reset. If the subsystem reset causes loss of
communication with the nvme adapter, then either the IO timeout event or
the adapter reset handler could detect it. If the IO timeout event
detects the loss of communication, the EEH handler is able to recover
communication with the adapter. That change was implemented in commit
651438bb0af5213 ("nvme-pci: Fix EEH failure on ppc"). However, if the
loss of communication is detected in the nvme reset work handler, EEH
is unable to complete the adapter recovery.

This patch ensures that:
- the nvme driver reset handler observes that the pci channel is offline
  after a failed MMIO read and avoids marking the controller state as
  DEAD, which gives the EEH handler a fair chance to recover the nvme
  adapter.

- if the nvme controller is already in the RESETTING state and a pci
  channel frozen error is detected, the nvme driver pci-error-handler
  code sends the correct error code (PCI_ERS_RESULT_NEED_RESET) back to
  the EEH handler so that the EEH handler can proceed with the pci slot
  reset.

Without this patch, EEH recovery fails as shown in the log below:

[  131.415601] EEH: Recovering PHB#40-PE#10000
[  131.415619] EEH: PE location: N/A, PHB location: N/A
[  131.415623] EEH: Frozen PHB#40-PE#10000 detected
[  131.415627] EEH: Call Trace:
[  131.415629] EEH: [c000000000051078] __eeh_send_failure_event+0x7c/0x15c
[  131.415782] EEH: [c000000000049bdc] eeh_dev_check_failure.part.0+0x27c/0x6b0
[  131.415789] EEH: [c000000000cb665c] nvme_pci_reg_read32+0x78/0x9c
[  131.415802] EEH: [c000000000ca07f8] nvme_wait_ready+0xa8/0x18c
[  131.415814] EEH: [c000000000cb7070] nvme_dev_disable+0x368/0x40c
[  131.415823] EEH: [c000000000cb9970] nvme_reset_work+0x198/0x348
[  131.415830] EEH: [c00000000017b76c] process_one_work+0x1f0/0x4f4
[  131.415841] EEH: [c00000000017be2c] worker_thread+0x3bc/0x590
[  131.415846] EEH: [c00000000018a46c] kthread+0x138/0x140
[  131.415854] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18
[  131.415864] EEH: This PCI device has failed 1 times in the last hour
[  131.415874] EEH: Notify device drivers to shutdown
[  131.415882] EEH: Beginning: 'error_detected(IO frozen)'
[  131.415888] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected
[  131.415891] nvme nvme1: frozen state error detected, reset controller
[  131.515358] nvme 0040:01:00.0: enabling device (0000 -> 0002)
[  131.515778] nvme nvme1: Disabling device after reset failure: -19
[  131.555336] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
[  131.555343] EEH: Finished:'error_detected(IO frozen)'
[  131.555371] EEH: Unable to recover from failure from PHB#40-PE#10000.
[  131.555371] Please try reseating or replacing it

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/pci.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c1d6357ec98a..a6ba46e727ba 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
  out_unlock:
 	mutex_unlock(&dev->shutdown_lock);
  out:
+	/*
+	 * If PCI recovery is ongoing then let it finish first
+	 */
+	if (pci_channel_offline(to_pci_dev(dev->dev))) {
+		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
+		return;
+	}
+
 	/*
 	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
 	 * may be holding this pci_dev's device lock.
@@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
 	case pci_channel_io_frozen:
 		dev_warn(dev->ctrl.device,
 			"frozen state error detected, reset controller\n");
-		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
-			nvme_dev_disable(dev, true);
-			return PCI_ERS_RESULT_DISCONNECT;
+		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
+			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+				nvme_dev_disable(dev, true);
+				return PCI_ERS_RESULT_DISCONNECT;
+			}
 		}
 		nvme_dev_disable(dev, false);
 		return PCI_ERS_RESULT_NEED_RESET;
-- 
2.43.0



