public inbox for linux-kernel@vger.kernel.org
From: Michael Kelley <mikelley@microsoft.com>
To: kbusch@kernel.org, axboe@fb.com, hch@lst.de, sagi@grimberg.me,
	linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Cc: mikelley@microsoft.com, caroline.subramoney@microsoft.com,
	riwurd@microsoft.com, nathan.obr@microsoft.com
Subject: [PATCH v3 2/2] nvme: handle persistent internal error AER from NVMe controller
Date: Mon,  6 Jun 2022 17:15:15 -0700
Message-ID: <1654560915-22449-2-git-send-email-mikelley@microsoft.com>
In-Reply-To: <1654560915-22449-1-git-send-email-mikelley@microsoft.com>

In the NVM Express Revision 1.4 spec, Figure 145 describes possible
values for an AER with event type "Error" (value 000b). For a
Persistent Internal Error (value 03h), the host should perform a
controller reset.

Add support for this error using code that already exists for
doing a controller reset. As part of this support, introduce
two utility functions for parsing the AER type and subtype.

This new support was tested in a lab environment where we can
generate the persistent internal error on demand, and observe
both the Linux side and NVMe controller side to see that the
controller reset has been done.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
---

Changes since v2:
* Instead of reading CSTS, use a constant value as input to
  nvme_should_reset() [Keith Busch]
* Introduce helper functions for parsing the AER result fields
  [Keith Busch]


 drivers/nvme/host/core.c | 41 +++++++++++++++++++++++++++++++++++++++--
 include/linux/nvme.h     |  4 ++++
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ea9ed04..1ca8e1f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4521,9 +4521,19 @@ static void nvme_fw_act_work(struct work_struct *work)
 	nvme_get_fw_slot_info(ctrl);
 }
 
+static u32 nvme_aer_type(u32 result)
+{
+	return result & 0x7;
+}
+
+static u32 nvme_aer_subtype(u32 result)
+{
+	return (result & 0xff00) >> 8;
+}
+
 static void nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
 {
-	u32 aer_notice_type = (result & 0xff00) >> 8;
+	u32 aer_notice_type = nvme_aer_subtype(result);
 
 	trace_nvme_async_event(ctrl, aer_notice_type);
 
@@ -4556,11 +4566,29 @@ static void nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
 	}
 }
 
+static void nvme_handle_aer_persistent_error(struct nvme_ctrl *ctrl)
+{
+	trace_nvme_async_event(ctrl, NVME_AER_ERROR);
+
+	/*
+	 * We can't read the CSTS here because we're in an atomic context on
+	 * some transports and the read may require submitting a request to
+	 * the controller and getting a response. Such a sequence isn't
+	 * likely to be successful anyway if the controller is reporting a
+	 * persistent internal error. So assume CSTS.CFS is set.
+	 */
+	if (nvme_should_reset(ctrl, NVME_CSTS_CFS)) {
+		dev_warn(ctrl->device, "resetting controller due to AER\n");
+		nvme_reset_ctrl(ctrl);
+	}
+}
+
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		volatile union nvme_result *res)
 {
 	u32 result = le32_to_cpu(res->u32);
-	u32 aer_type = result & 0x07;
+	u32 aer_type = nvme_aer_type(result);
+	u32 aer_subtype = nvme_aer_subtype(result);
 
 	if (le16_to_cpu(status) >> 1 != NVME_SC_SUCCESS)
 		return;
@@ -4570,6 +4598,15 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		nvme_handle_aen_notice(ctrl, result);
 		break;
 	case NVME_AER_ERROR:
+		/*
+		 * For a persistent internal error, don't run async_event_work
+		 * to submit a new AER. The controller reset will do it.
+		 */
+		if (aer_subtype == NVME_AER_ERROR_PERSIST_INT_ERR) {
+			nvme_handle_aer_persistent_error(ctrl);
+			return;
+		}
+		fallthrough;
 	case NVME_AER_SMART:
 	case NVME_AER_CSS:
 	case NVME_AER_VS:
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 29ec3e3..8ced243 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -712,6 +712,10 @@ enum {
 };
 
 enum {
+	NVME_AER_ERROR_PERSIST_INT_ERR	= 0x03,
+};
+
+enum {
 	NVME_AER_NOTICE_NS_CHANGED	= 0x00,
 	NVME_AER_NOTICE_FW_ACT_STARTING = 0x01,
 	NVME_AER_NOTICE_ANA		= 0x03,
-- 
1.8.3.1


Thread overview: 4+ messages
2022-06-07  0:15 [PATCH v3 1/2] nvme-pci: move nvme_should_reset() to core code Michael Kelley
2022-06-07  0:15 ` Michael Kelley [this message]
2022-06-07 10:35   ` [PATCH v3 2/2] nvme: handle persistent internal error AER from NVMe controller Christoph Hellwig
2022-06-08  3:59     ` Michael Kelley (LINUX)
