linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors
@ 2025-01-23  8:44 Smita Koralahalli
  2025-01-23  8:44 ` [PATCH v6 1/6] efi/cper, cxl: Prefix protocol error struct and function names with cxl_ Smita Koralahalli
                   ` (7 more replies)
  0 siblings, 8 replies; 39+ messages in thread
From: Smita Koralahalli @ 2025-01-23  8:44 UTC (permalink / raw)
  To: linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Smita Koralahalli

This patchset adds logging support for CXL CPER endpoint and port protocol
errors.

The first 3 patches update the existing codebase to support CXL CPER
Protocol error reporting.

The last 3 patches introduce recognizing and reporting CXL CPER Protocol
errors.

Link to v5:
https://lore.kernel.org/linux-cxl/20250114120427.149260-1-Smita.KoralahalliChannabasappa@amd.com

Changes in v5 -> v6:
[Dave, Jonathan, Ira]: Reviewed-by tags.
[Dave]: Check for cxlds before assigning fe.
Merge one of the patches (Port error trace logging) from Terry's Port
error handling.
Rename host -> parent.

Changes in v4 -> v5:
[Dave]: Reviewed-by tags.
[Jonathan]: Remove blank line.
[Jonathan, Ira]: Change CXL -> "CXL".
[Ira]: Fix build error for CONFIG_ACPI_APEI_PCIEAER.

Changes in v3 -> v4:
[Ira]: Use memcpy() for RAS Cap struct.
[Jonathan]: Commit description edits.
[Jonathan]: Use separate work registration functions for protocol and
component errors.
[Jonathan, Ira]: Replace flags with separate functions for port and
device errors.
[Jonathan]: Use goto for register and unregister calls.

Changes in v2 -> v3:
[Dan]: Define a new workqueue for CXL CPER Protocol errors and avoid
reusing existing workqueue which handles CXL CPER events.
[Dan] Update function and struct names.
[Ira] Don't define common function get_cxl_devstate().
[Dan] Use switch cases rather than defining array of structures.
[Dan] Pass the entire cxl_cper_prot_err struct for CXL subsystem.
[Dan] Use pr_err_ratelimited().
[Dan] Use AER_ severities directly. Don't define CXL_ severities.
[Dan] Limit either to Device ID or Agent Info check.
[Dan] Validate size of RAS field matches expectations.

Changes in v2 -> v1:
[Jonathan] Refactor code for trace support. Rename get_cxl_dev()
to get_cxl_devstate().
[Jonathan] Cleanups for get_cxl_devstate().
[Alison, Jonathan]: Define array of structures for Device ID and Serial
number comparison.
[Dave] p_err -> rec/p_rec.
[Jonathan] Remove pr_warn.

Smita Koralahalli (6):
  efi/cper, cxl: Prefix protocol error struct and function names with
    cxl_
  efi/cper, cxl: Make definitions and structures global
  efi/cper, cxl: Remove cper_cxl.h
  acpi/ghes, cper: Recognize and cache CXL Protocol errors
  acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors
  cxl/pci: Add trace logging for CXL PCIe Port RAS errors

 drivers/acpi/apei/ghes.c        | 103 ++++++++++++++++++++++++++++++++
 drivers/cxl/core/pci.c          |  62 +++++++++++++++++++
 drivers/cxl/core/trace.h        |  47 +++++++++++++++
 drivers/cxl/cxlpci.h            |   9 +++
 drivers/cxl/pci.c               |  59 +++++++++++++++++-
 drivers/firmware/efi/cper.c     |   6 +-
 drivers/firmware/efi/cper_cxl.c |  39 +-----------
 drivers/firmware/efi/cper_cxl.h |  66 --------------------
 include/cxl/event.h             | 101 +++++++++++++++++++++++++++++++
 include/linux/cper.h            |   8 +++
 10 files changed, 394 insertions(+), 106 deletions(-)
 delete mode 100644 drivers/firmware/efi/cper_cxl.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v6 1/6] efi/cper, cxl: Prefix protocol error struct and function names with cxl_
  2025-01-23  8:44 [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors Smita Koralahalli
@ 2025-01-23  8:44 ` Smita Koralahalli
  2025-02-04  0:12   ` Fan Ni
  2025-02-05 19:17   ` Gregory Price
  2025-01-23  8:44 ` [PATCH v6 2/6] efi/cper, cxl: Make definitions and structures global Smita Koralahalli
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Smita Koralahalli @ 2025-01-23  8:44 UTC (permalink / raw)
  To: linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Smita Koralahalli

Rename the protocol error struct from struct cper_sec_prot_err to
struct cxl_cper_sec_prot_err and cper_print_prot_err() to
cxl_cper_print_prot_err() to maintain naming consistency. No
functional changes.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/firmware/efi/cper.c     | 4 ++--
 drivers/firmware/efi/cper_cxl.c | 3 ++-
 drivers/firmware/efi/cper_cxl.h | 5 +++--
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index b69e68ef3f02..8e5762f7ef2e 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -624,11 +624,11 @@ cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata
 		else
 			goto err_section_too_small;
 	} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
-		struct cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
+		struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
 
 		printk("%ssection_type: CXL Protocol Error\n", newpfx);
 		if (gdata->error_data_length >= sizeof(*prot_err))
-			cper_print_prot_err(newpfx, prot_err);
+			cxl_cper_print_prot_err(newpfx, prot_err);
 		else
 			goto err_section_too_small;
 	} else {
diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c
index a55771b99a97..cbaabcb7382d 100644
--- a/drivers/firmware/efi/cper_cxl.c
+++ b/drivers/firmware/efi/cper_cxl.c
@@ -55,7 +55,8 @@ enum {
 	USP,	/* CXL Upstream Switch Port */
 };
 
-void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_err)
+void cxl_cper_print_prot_err(const char *pfx,
+			     const struct cxl_cper_sec_prot_err *prot_err)
 {
 	if (prot_err->valid_bits & PROT_ERR_VALID_AGENT_TYPE)
 		pr_info("%s agent_type: %d, %s\n", pfx, prot_err->agent_type,
diff --git a/drivers/firmware/efi/cper_cxl.h b/drivers/firmware/efi/cper_cxl.h
index 86bfcf7909ec..0e3ab0ba17c3 100644
--- a/drivers/firmware/efi/cper_cxl.h
+++ b/drivers/firmware/efi/cper_cxl.h
@@ -18,7 +18,7 @@
 #pragma pack(1)
 
 /* Compute Express Link Protocol Error Section, UEFI v2.10 sec N.2.13 */
-struct cper_sec_prot_err {
+struct cxl_cper_sec_prot_err {
 	u64 valid_bits;
 	u8 agent_type;
 	u8 reserved[7];
@@ -61,6 +61,7 @@ struct cper_sec_prot_err {
 
 #pragma pack()
 
-void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_err);
+void cxl_cper_print_prot_err(const char *pfx,
+			     const struct cxl_cper_sec_prot_err *prot_err);
 
 #endif //__CPER_CXL_
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v6 2/6] efi/cper, cxl: Make definitions and structures global
  2025-01-23  8:44 [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors Smita Koralahalli
  2025-01-23  8:44 ` [PATCH v6 1/6] efi/cper, cxl: Prefix protocol error struct and function names with cxl_ Smita Koralahalli
@ 2025-01-23  8:44 ` Smita Koralahalli
  2025-02-04  0:16   ` Fan Ni
  2025-02-05 19:16   ` Gregory Price
  2025-01-23  8:44 ` [PATCH v6 3/6] efi/cper, cxl: Remove cper_cxl.h Smita Koralahalli
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Smita Koralahalli @ 2025-01-23  8:44 UTC (permalink / raw)
  To: linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Smita Koralahalli

In preparation to add tracepoint support, move protocol error UUID
definition to a common location, Also, make struct CXL RAS capability,
cxl_cper_sec_prot_err and CPER validation flags global for use across
different modules.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/firmware/efi/cper.c     |  1 +
 drivers/firmware/efi/cper_cxl.c | 35 +--------------
 drivers/firmware/efi/cper_cxl.h | 51 ---------------------
 include/cxl/event.h             | 80 +++++++++++++++++++++++++++++++++
 include/linux/cper.h            |  4 ++
 5 files changed, 86 insertions(+), 85 deletions(-)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 8e5762f7ef2e..ae1953e2b214 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -24,6 +24,7 @@
 #include <linux/bcd.h>
 #include <acpi/ghes.h>
 #include <ras/ras_event.h>
+#include <cxl/event.h>
 #include "cper_cxl.h"
 
 /*
diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c
index cbaabcb7382d..64c0dd27be6e 100644
--- a/drivers/firmware/efi/cper_cxl.c
+++ b/drivers/firmware/efi/cper_cxl.c
@@ -8,27 +8,9 @@
  */
 
 #include <linux/cper.h>
+#include <cxl/event.h>
 #include "cper_cxl.h"
 
-#define PROT_ERR_VALID_AGENT_TYPE		BIT_ULL(0)
-#define PROT_ERR_VALID_AGENT_ADDRESS		BIT_ULL(1)
-#define PROT_ERR_VALID_DEVICE_ID		BIT_ULL(2)
-#define PROT_ERR_VALID_SERIAL_NUMBER		BIT_ULL(3)
-#define PROT_ERR_VALID_CAPABILITY		BIT_ULL(4)
-#define PROT_ERR_VALID_DVSEC			BIT_ULL(5)
-#define PROT_ERR_VALID_ERROR_LOG		BIT_ULL(6)
-
-/* CXL RAS Capability Structure, CXL v3.0 sec 8.2.4.16 */
-struct cxl_ras_capability_regs {
-	u32 uncor_status;
-	u32 uncor_mask;
-	u32 uncor_severity;
-	u32 cor_status;
-	u32 cor_mask;
-	u32 cap_control;
-	u32 header_log[16];
-};
-
 static const char * const prot_err_agent_type_strs[] = {
 	"Restricted CXL Device",
 	"Restricted CXL Host Downstream Port",
@@ -40,21 +22,6 @@ static const char * const prot_err_agent_type_strs[] = {
 	"CXL Upstream Switch Port",
 };
 
-/*
- * The layout of the enumeration and the values matches CXL Agent Type
- * field in the UEFI 2.10 Section N.2.13,
- */
-enum {
-	RCD,	/* Restricted CXL Device */
-	RCH_DP,	/* Restricted CXL Host Downstream Port */
-	DEVICE,	/* CXL Device */
-	LD,	/* CXL Logical Device */
-	FMLD,	/* CXL Fabric Manager managed Logical Device */
-	RP,	/* CXL Root Port */
-	DSP,	/* CXL Downstream Switch Port */
-	USP,	/* CXL Upstream Switch Port */
-};
-
 void cxl_cper_print_prot_err(const char *pfx,
 			     const struct cxl_cper_sec_prot_err *prot_err)
 {
diff --git a/drivers/firmware/efi/cper_cxl.h b/drivers/firmware/efi/cper_cxl.h
index 0e3ab0ba17c3..5ce1401ee17a 100644
--- a/drivers/firmware/efi/cper_cxl.h
+++ b/drivers/firmware/efi/cper_cxl.h
@@ -10,57 +10,6 @@
 #ifndef LINUX_CPER_CXL_H
 #define LINUX_CPER_CXL_H
 
-/* CXL Protocol Error Section */
-#define CPER_SEC_CXL_PROT_ERR						\
-	GUID_INIT(0x80B9EFB4, 0x52B5, 0x4DE3, 0xA7, 0x77, 0x68, 0x78,	\
-		  0x4B, 0x77, 0x10, 0x48)
-
-#pragma pack(1)
-
-/* Compute Express Link Protocol Error Section, UEFI v2.10 sec N.2.13 */
-struct cxl_cper_sec_prot_err {
-	u64 valid_bits;
-	u8 agent_type;
-	u8 reserved[7];
-
-	/*
-	 * Except for RCH Downstream Port, all the remaining CXL Agent
-	 * types are uniquely identified by the PCIe compatible SBDF number.
-	 */
-	union {
-		u64 rcrb_base_addr;
-		struct {
-			u8 function;
-			u8 device;
-			u8 bus;
-			u16 segment;
-			u8 reserved_1[3];
-		};
-	} agent_addr;
-
-	struct {
-		u16 vendor_id;
-		u16 device_id;
-		u16 subsystem_vendor_id;
-		u16 subsystem_id;
-		u8 class_code[2];
-		u16 slot;
-		u8 reserved_1[4];
-	} device_id;
-
-	struct {
-		u32 lower_dw;
-		u32 upper_dw;
-	} dev_serial_num;
-
-	u8 capability[60];
-	u16 dvsec_len;
-	u16 err_len;
-	u8 reserved_2[4];
-};
-
-#pragma pack()
-
 void cxl_cper_print_prot_err(const char *pfx,
 			     const struct cxl_cper_sec_prot_err *prot_err);
 
diff --git a/include/cxl/event.h b/include/cxl/event.h
index 0bea1afbd747..66d85fc87701 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -152,6 +152,86 @@ struct cxl_cper_work_data {
 	struct cxl_cper_event_rec rec;
 };
 
+#define PROT_ERR_VALID_AGENT_TYPE		BIT_ULL(0)
+#define PROT_ERR_VALID_AGENT_ADDRESS		BIT_ULL(1)
+#define PROT_ERR_VALID_DEVICE_ID		BIT_ULL(2)
+#define PROT_ERR_VALID_SERIAL_NUMBER		BIT_ULL(3)
+#define PROT_ERR_VALID_CAPABILITY		BIT_ULL(4)
+#define PROT_ERR_VALID_DVSEC			BIT_ULL(5)
+#define PROT_ERR_VALID_ERROR_LOG		BIT_ULL(6)
+
+/*
+ * The layout of the enumeration and the values matches CXL Agent Type
+ * field in the UEFI 2.10 Section N.2.13,
+ */
+enum {
+	RCD,	/* Restricted CXL Device */
+	RCH_DP,	/* Restricted CXL Host Downstream Port */
+	DEVICE,	/* CXL Device */
+	LD,	/* CXL Logical Device */
+	FMLD,	/* CXL Fabric Manager managed Logical Device */
+	RP,	/* CXL Root Port */
+	DSP,	/* CXL Downstream Switch Port */
+	USP,	/* CXL Upstream Switch Port */
+};
+
+#pragma pack(1)
+
+/* Compute Express Link Protocol Error Section, UEFI v2.10 sec N.2.13 */
+struct cxl_cper_sec_prot_err {
+	u64 valid_bits;
+	u8 agent_type;
+	u8 reserved[7];
+
+	/*
+	 * Except for RCH Downstream Port, all the remaining CXL Agent
+	 * types are uniquely identified by the PCIe compatible SBDF number.
+	 */
+	union {
+		u64 rcrb_base_addr;
+		struct {
+			u8 function;
+			u8 device;
+			u8 bus;
+			u16 segment;
+			u8 reserved_1[3];
+		};
+	} agent_addr;
+
+	struct {
+		u16 vendor_id;
+		u16 device_id;
+		u16 subsystem_vendor_id;
+		u16 subsystem_id;
+		u8 class_code[2];
+		u16 slot;
+		u8 reserved_1[4];
+	} device_id;
+
+	struct {
+		u32 lower_dw;
+		u32 upper_dw;
+	} dev_serial_num;
+
+	u8 capability[60];
+	u16 dvsec_len;
+	u16 err_len;
+	u8 reserved_2[4];
+};
+
+#pragma pack()
+
+/* CXL RAS Capability Structure, CXL v3.0 sec 8.2.4.16 */
+struct cxl_ras_capability_regs {
+	u32 uncor_status;
+	u32 uncor_mask;
+	u32 uncor_severity;
+	u32 cor_status;
+	u32 cor_mask;
+	u32 cap_control;
+	u32 header_log[16];
+};
+
 #ifdef CONFIG_ACPI_APEI_GHES
 int cxl_cper_register_work(struct work_struct *work);
 int cxl_cper_unregister_work(struct work_struct *work);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 265b0f8fc0b3..5c6d4d5b9975 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -89,6 +89,10 @@ enum {
 #define CPER_NOTIFY_DMAR						\
 	GUID_INIT(0x667DD791, 0xC6B3, 0x4c27, 0x8A, 0x6B, 0x0F, 0x8E,	\
 		  0x72, 0x2D, 0xEB, 0x41)
+/* CXL Protocol Error Section */
+#define CPER_SEC_CXL_PROT_ERR						\
+	GUID_INIT(0x80B9EFB4, 0x52B5, 0x4DE3, 0xA7, 0x77, 0x68, 0x78,	\
+		  0x4B, 0x77, 0x10, 0x48)
 
 /* CXL Event record UUIDs are formatted as GUIDs and reported in section type */
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v6 3/6] efi/cper, cxl: Remove cper_cxl.h
  2025-01-23  8:44 [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors Smita Koralahalli
  2025-01-23  8:44 ` [PATCH v6 1/6] efi/cper, cxl: Prefix protocol error struct and function names with cxl_ Smita Koralahalli
  2025-01-23  8:44 ` [PATCH v6 2/6] efi/cper, cxl: Make definitions and structures global Smita Koralahalli
@ 2025-01-23  8:44 ` Smita Koralahalli
  2025-02-04  0:20   ` Fan Ni
  2025-02-05 19:18   ` Gregory Price
  2025-01-23  8:44 ` [PATCH v6 4/6] acpi/ghes, cper: Recognize and cache CXL Protocol errors Smita Koralahalli
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Smita Koralahalli @ 2025-01-23  8:44 UTC (permalink / raw)
  To: linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Smita Koralahalli

Move the declaration of cxl_cper_print_prot_err() to include/linux/cper.h
to avoid maintaining a separate header file just for this function
declaration. Remove drivers/firmware/efi/cper_cxl.h as its contents have
been reorganized.

No functional changes.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/firmware/efi/cper.c     |  1 -
 drivers/firmware/efi/cper_cxl.c |  1 -
 drivers/firmware/efi/cper_cxl.h | 16 ----------------
 include/linux/cper.h            |  4 ++++
 4 files changed, 4 insertions(+), 18 deletions(-)
 delete mode 100644 drivers/firmware/efi/cper_cxl.h

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index ae1953e2b214..928409199a1a 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -25,7 +25,6 @@
 #include <acpi/ghes.h>
 #include <ras/ras_event.h>
 #include <cxl/event.h>
-#include "cper_cxl.h"
 
 /*
  * CPER record ID need to be unique even after reboot, because record
diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c
index 64c0dd27be6e..8a7667faf953 100644
--- a/drivers/firmware/efi/cper_cxl.c
+++ b/drivers/firmware/efi/cper_cxl.c
@@ -9,7 +9,6 @@
 
 #include <linux/cper.h>
 #include <cxl/event.h>
-#include "cper_cxl.h"
 
 static const char * const prot_err_agent_type_strs[] = {
 	"Restricted CXL Device",
diff --git a/drivers/firmware/efi/cper_cxl.h b/drivers/firmware/efi/cper_cxl.h
deleted file mode 100644
index 5ce1401ee17a..000000000000
--- a/drivers/firmware/efi/cper_cxl.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * UEFI Common Platform Error Record (CPER) support for CXL Section.
- *
- * Copyright (C) 2022 Advanced Micro Devices, Inc.
- *
- * Author: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
- */
-
-#ifndef LINUX_CPER_CXL_H
-#define LINUX_CPER_CXL_H
-
-void cxl_cper_print_prot_err(const char *pfx,
-			     const struct cxl_cper_sec_prot_err *prot_err);
-
-#endif //__CPER_CXL_
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 5c6d4d5b9975..0ed60a91eca9 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -605,4 +605,8 @@ void cper_estatus_print(const char *pfx,
 int cper_estatus_check_header(const struct acpi_hest_generic_status *estatus);
 int cper_estatus_check(const struct acpi_hest_generic_status *estatus);
 
+struct cxl_cper_sec_prot_err;
+void cxl_cper_print_prot_err(const char *pfx,
+			     const struct cxl_cper_sec_prot_err *prot_err);
+
 #endif
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v6 4/6] acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-01-23  8:44 [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors Smita Koralahalli
                   ` (2 preceding siblings ...)
  2025-01-23  8:44 ` [PATCH v6 3/6] efi/cper, cxl: Remove cper_cxl.h Smita Koralahalli
@ 2025-01-23  8:44 ` Smita Koralahalli
  2025-02-03 18:59   ` Luck, Tony
                     ` (2 more replies)
  2025-01-23  8:44 ` [PATCH v6 5/6] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors Smita Koralahalli
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 39+ messages in thread
From: Smita Koralahalli @ 2025-01-23  8:44 UTC (permalink / raw)
  To: linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Smita Koralahalli

Add support in GHES to detect and process CXL CPER Protocol errors, as
defined in UEFI v2.10, section N.2.13.

Define struct cxl_cper_prot_err_work_data to cache CXL protocol error
information, including RAS capabilities and severity, for further
handling.

These cached CXL CPER records will later be processed by workqueues
within the CXL subsystem.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/acpi/apei/ghes.c | 54 ++++++++++++++++++++++++++++++++++++++++
 include/cxl/event.h      |  6 +++++
 2 files changed, 60 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index b72772494655..4d725d988c43 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -674,6 +674,56 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 	schedule_work(&entry->work);
 }
 
+static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
+				   int severity)
+{
+#ifdef CONFIG_ACPI_APEI_PCIEAER
+	struct cxl_cper_prot_err_work_data wd;
+	u8 *dvsec_start, *cap_start;
+
+	if (!(prot_err->valid_bits & PROT_ERR_VALID_AGENT_ADDRESS)) {
+		pr_err_ratelimited("CXL CPER invalid agent type\n");
+		return;
+	}
+
+	if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) {
+		pr_err_ratelimited("CXL CPER invalid protocol error log\n");
+		return;
+	}
+
+	if (prot_err->err_len != sizeof(struct cxl_ras_capability_regs)) {
+		pr_err_ratelimited("CXL CPER invalid RAS Cap size (%u)\n",
+				   prot_err->err_len);
+		return;
+	}
+
+	if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
+		pr_warn(FW_WARN "CXL CPER no device serial number\n");
+
+	switch (prot_err->agent_type) {
+	case RCD:
+	case DEVICE:
+	case LD:
+	case FMLD:
+	case RP:
+	case DSP:
+	case USP:
+		memcpy(&wd.prot_err, prot_err, sizeof(wd.prot_err));
+
+		dvsec_start = (u8 *)(prot_err + 1);
+		cap_start = dvsec_start + prot_err->dvsec_len;
+
+		memcpy(&wd.ras_cap, cap_start, sizeof(wd.ras_cap));
+		wd.severity = cper_severity_to_aer(severity);
+		break;
+	default:
+		pr_err_ratelimited("CXL CPER invalid agent type: %d\n",
+				   prot_err->agent_type);
+		return;
+	}
+#endif
+}
+
 /* Room for 8 entries for each of the 4 event log queues */
 #define CXL_CPER_FIFO_DEPTH 32
 DEFINE_KFIFO(cxl_cper_fifo, struct cxl_cper_work_data, CXL_CPER_FIFO_DEPTH);
@@ -777,6 +827,10 @@ static bool ghes_do_proc(struct ghes *ghes,
 		}
 		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
 			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
+		} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
+			struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
+
+			cxl_cper_post_prot_err(prot_err, gdata->error_severity);
 		} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
 			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
 
diff --git a/include/cxl/event.h b/include/cxl/event.h
index 66d85fc87701..ee1c3dec62fa 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -232,6 +232,12 @@ struct cxl_ras_capability_regs {
 	u32 header_log[16];
 };
 
+struct cxl_cper_prot_err_work_data {
+	struct cxl_cper_sec_prot_err prot_err;
+	struct cxl_ras_capability_regs ras_cap;
+	int severity;
+};
+
 #ifdef CONFIG_ACPI_APEI_GHES
 int cxl_cper_register_work(struct work_struct *work);
 int cxl_cper_unregister_work(struct work_struct *work);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v6 5/6] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors
  2025-01-23  8:44 [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors Smita Koralahalli
                   ` (3 preceding siblings ...)
  2025-01-23  8:44 ` [PATCH v6 4/6] acpi/ghes, cper: Recognize and cache CXL Protocol errors Smita Koralahalli
@ 2025-01-23  8:44 ` Smita Koralahalli
  2025-02-03 19:03   ` Luck, Tony
                     ` (2 more replies)
  2025-01-23  8:44 ` [PATCH v6 6/6] cxl/pci: Add trace logging for CXL PCIe Port RAS errors Smita Koralahalli
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 39+ messages in thread
From: Smita Koralahalli @ 2025-01-23  8:44 UTC (permalink / raw)
  To: linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Smita Koralahalli

When PCIe AER is in FW-First, OS should process CXL Protocol errors from
CPER records. Introduce support for handling and logging CXL Protocol
errors.

The defined trace events cxl_aer_uncorrectable_error and
cxl_aer_correctable_error trace native CXL AER endpoint errors. Reuse them
to trace FW-First Protocol errors.

Since the CXL code is required to be called from process context and
GHES is in interrupt context, use workqueues for processing.

Similar to CXL CPER event handling, use kfifo to handle errors as it
simplifies queue processing by providing lock free fifo operations.

Add the ability for the CXL sub-system to register a workqueue to
process CXL CPER protocol errors.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/acpi/apei/ghes.c | 49 ++++++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/pci.c   | 36 +++++++++++++++++++++++++++++
 drivers/cxl/cxlpci.h     |  5 ++++
 drivers/cxl/pci.c        | 46 ++++++++++++++++++++++++++++++++++++-
 include/cxl/event.h      | 15 ++++++++++++
 5 files changed, 150 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 4d725d988c43..289e365f84b2 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -674,6 +674,15 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 	schedule_work(&entry->work);
 }
 
+/* Room for 8 entries */
+#define CXL_CPER_PROT_ERR_FIFO_DEPTH 8
+static DEFINE_KFIFO(cxl_cper_prot_err_fifo, struct cxl_cper_prot_err_work_data,
+		    CXL_CPER_PROT_ERR_FIFO_DEPTH);
+
+/* Synchronize schedule_work() with cxl_cper_prot_err_work changes */
+static DEFINE_SPINLOCK(cxl_cper_prot_err_work_lock);
+struct work_struct *cxl_cper_prot_err_work;
+
 static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
 				   int severity)
 {
@@ -700,6 +709,11 @@ static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
 	if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
 		pr_warn(FW_WARN "CXL CPER no device serial number\n");
 
+	guard(spinlock_irqsave)(&cxl_cper_prot_err_work_lock);
+
+	if (!cxl_cper_prot_err_work)
+		return;
+
 	switch (prot_err->agent_type) {
 	case RCD:
 	case DEVICE:
@@ -721,9 +735,44 @@ static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
 				   prot_err->agent_type);
 		return;
 	}
+
+	if (!kfifo_put(&cxl_cper_prot_err_fifo, wd)) {
+		pr_err_ratelimited("CXL CPER kfifo overflow\n");
+		return;
+	}
+
+	schedule_work(cxl_cper_prot_err_work);
 #endif
 }
 
+int cxl_cper_register_prot_err_work(struct work_struct *work)
+{
+	if (cxl_cper_prot_err_work)
+		return -EINVAL;
+
+	guard(spinlock)(&cxl_cper_prot_err_work_lock);
+	cxl_cper_prot_err_work = work;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_register_prot_err_work, "CXL");
+
+int cxl_cper_unregister_prot_err_work(struct work_struct *work)
+{
+	if (cxl_cper_prot_err_work != work)
+		return -EINVAL;
+
+	guard(spinlock)(&cxl_cper_prot_err_work_lock);
+	cxl_cper_prot_err_work = NULL;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_prot_err_work, "CXL");
+
+int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
+{
+	return kfifo_get(&cxl_cper_prot_err_fifo, wd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_prot_err_kfifo_get, "CXL");
+
 /* Room for 8 entries for each of the 4 event log queues */
 #define CXL_CPER_FIFO_DEPTH 32
 DEFINE_KFIFO(cxl_cper_fifo, struct cxl_cper_work_data, CXL_CPER_FIFO_DEPTH);
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 9d58ab9d33c5..5840056bb9a3 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -650,6 +650,42 @@ void read_cdat_data(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
 
+void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
+				  struct cxl_ras_capability_regs ras_cap)
+{
+	u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
+	struct cxl_dev_state *cxlds;
+
+	cxlds = pci_get_drvdata(pdev);
+	if (!cxlds)
+		return;
+
+	trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_trace_corr_prot_err, "CXL");
+
+void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
+				    struct cxl_ras_capability_regs ras_cap)
+{
+	u32 status = ras_cap.uncor_status & ~ras_cap.uncor_mask;
+	struct cxl_dev_state *cxlds;
+	u32 fe;
+
+	cxlds = pci_get_drvdata(pdev);
+	if (!cxlds)
+		return;
+
+	if (hweight32(status) > 1)
+		fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
+				   ras_cap.cap_control));
+	else
+		fe = status;
+
+	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe,
+					  ras_cap.header_log);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_trace_uncorr_prot_err, "CXL");
+
 static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
 				 void __iomem *ras_base)
 {
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 4da07727ab9c..e457616373ed 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -129,4 +129,9 @@ void read_cdat_data(struct cxl_port *port);
 void cxl_cor_error_detected(struct pci_dev *pdev);
 pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 				    pci_channel_state_t state);
+struct cxl_ras_capability_regs;
+void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
+				  struct cxl_ras_capability_regs ras_cap);
+void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
+				    struct cxl_ras_capability_regs ras_cap);
 #endif /* __CXL_PCI_H__ */
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 6d94ff4a4f1a..9d4b5f39b21a 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -1160,6 +1160,37 @@ static void cxl_cper_work_fn(struct work_struct *work)
 }
 static DECLARE_WORK(cxl_cper_work, cxl_cper_work_fn);
 
+static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
+{
+	unsigned int devfn = PCI_DEVFN(data->prot_err.agent_addr.device,
+				       data->prot_err.agent_addr.function);
+	struct pci_dev *pdev __free(pci_dev_put) =
+		pci_get_domain_bus_and_slot(data->prot_err.agent_addr.segment,
+					    data->prot_err.agent_addr.bus,
+					    devfn);
+
+	if (!pdev)
+		return;
+
+	guard(device)(&pdev->dev);
+	if (pdev->driver != &cxl_pci_driver)
+		return;
+
+	if (data->severity == AER_CORRECTABLE)
+		cxl_cper_trace_corr_prot_err(pdev, data->ras_cap);
+	else
+		cxl_cper_trace_uncorr_prot_err(pdev, data->ras_cap);
+}
+
+static void cxl_cper_prot_err_work_fn(struct work_struct *work)
+{
+	struct cxl_cper_prot_err_work_data wd;
+
+	while (cxl_cper_prot_err_kfifo_get(&wd))
+		cxl_cper_handle_prot_err(&wd);
+}
+static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
+
 static int __init cxl_pci_driver_init(void)
 {
 	int rc;
@@ -1170,7 +1201,18 @@ static int __init cxl_pci_driver_init(void)
 
 	rc = cxl_cper_register_work(&cxl_cper_work);
 	if (rc)
-		pci_unregister_driver(&cxl_pci_driver);
+		goto err_unreg;
+
+	rc = cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
+	if (rc)
+		goto err_unregister_work;
+
+	return 0;
+
+err_unregister_work:
+	cxl_cper_unregister_work(&cxl_cper_work);
+err_unreg:
+	pci_unregister_driver(&cxl_pci_driver);
 
 	return rc;
 }
@@ -1178,7 +1220,9 @@ static int __init cxl_pci_driver_init(void)
 static void __exit cxl_pci_driver_exit(void)
 {
 	cxl_cper_unregister_work(&cxl_cper_work);
+	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
 	cancel_work_sync(&cxl_cper_work);
+	cancel_work_sync(&cxl_cper_prot_err_work);
 	pci_unregister_driver(&cxl_pci_driver);
 }
 
diff --git a/include/cxl/event.h b/include/cxl/event.h
index ee1c3dec62fa..359a8f44a2e0 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -242,6 +242,9 @@ struct cxl_cper_prot_err_work_data {
 int cxl_cper_register_work(struct work_struct *work);
 int cxl_cper_unregister_work(struct work_struct *work);
 int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd);
+int cxl_cper_register_prot_err_work(struct work_struct *work);
+int cxl_cper_unregister_prot_err_work(struct work_struct *work);
+int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd);
 #else
 static inline int cxl_cper_register_work(struct work_struct *work)
 {
@@ -256,6 +259,18 @@ static inline int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
 {
 	return 0;
 }
+static inline int cxl_cper_register_prot_err_work(struct work_struct *work)
+{
+	return 0;
+}
+static inline int cxl_cper_unregister_prot_err_work(struct work_struct *work)
+{
+	return 0;
+}
+static inline int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
+{
+	return 0;
+}
 #endif
 
 #endif /* _LINUX_CXL_EVENT_H */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v6 6/6] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
  2025-01-23  8:44 [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors Smita Koralahalli
                   ` (4 preceding siblings ...)
  2025-01-23  8:44 ` [PATCH v6 5/6] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors Smita Koralahalli
@ 2025-01-23  8:44 ` Smita Koralahalli
  2025-01-24 16:36   ` Ira Weiny
                     ` (2 more replies)
  2025-02-03 17:09 ` [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors Dave Jiang
  2025-02-06 18:38 ` Dave Jiang
  7 siblings, 3 replies; 39+ messages in thread
From: Smita Koralahalli @ 2025-01-23  8:44 UTC (permalink / raw)
  To: linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Smita Koralahalli

The CXL drivers use kernel trace functions for logging endpoint and
Restricted CXL host (RCH) Downstream Port RAS errors. Similar functionality
is required for CXL Root Ports, CXL Downstream Switch Ports, and CXL
Upstream Switch Ports.

Introduce trace logging functions for both RAS correctable and
uncorrectable errors specific to CXL PCIe Ports. Use them to trace
FW-First Protocol errors.

Co-developed-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
 drivers/cxl/core/pci.c   | 26 ++++++++++++++++++++++
 drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxlpci.h     |  4 ++++
 drivers/cxl/pci.c        | 13 +++++++++++
 4 files changed, 90 insertions(+)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 5840056bb9a3..b535da901dec 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -686,6 +686,32 @@ void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cper_trace_uncorr_prot_err, "CXL");
 
+void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
+				       struct cxl_ras_capability_regs ras_cap)
+{
+	u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
+
+	trace_cxl_port_aer_correctable_error(&pdev->dev, status);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_trace_corr_port_prot_err, "CXL");
+
+void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
+					 struct cxl_ras_capability_regs ras_cap)
+{
+	u32 status = ras_cap.uncor_status & ~ras_cap.uncor_mask;
+	u32 fe;
+
+	if (hweight32(status) > 1)
+		fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
+				   ras_cap.cap_control));
+	else
+		fe = status;
+
+	trace_cxl_port_aer_uncorrectable_error(&pdev->dev, status, fe,
+					       ras_cap.header_log);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_trace_uncorr_port_prot_err, "CXL");
+
 static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
 				 void __iomem *ras_base)
 {
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 8389a94adb1a..f684a2ae14e8 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -48,6 +48,34 @@
 	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
 )
 
+TRACE_EVENT(cxl_port_aer_uncorrectable_error,
+	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
+	TP_ARGS(dev, status, fe, hl),
+	TP_STRUCT__entry(
+		__string(devname, dev_name(dev))
+		__string(parent, dev_name(dev->parent))
+		__field(u32, status)
+		__field(u32, first_error)
+		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
+	),
+	TP_fast_assign(
+		__assign_str(devname);
+		__assign_str(parent);
+		__entry->status = status;
+		__entry->first_error = fe;
+		/*
+		 * Embed the 512B headerlog data for user app retrieval and
+		 * parsing, but no need to print this in the trace buffer.
+		 */
+		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
+	),
+	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
+		  __get_str(devname), __get_str(parent),
+		  show_uc_errs(__entry->status),
+		  show_uc_errs(__entry->first_error)
+	)
+);
+
 TRACE_EVENT(cxl_aer_uncorrectable_error,
 	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
 	TP_ARGS(cxlmd, status, fe, hl),
@@ -96,6 +124,25 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
 	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
 )
 
+TRACE_EVENT(cxl_port_aer_correctable_error,
+	TP_PROTO(struct device *dev, u32 status),
+	TP_ARGS(dev, status),
+	TP_STRUCT__entry(
+		__string(devname, dev_name(dev))
+		__string(parent, dev_name(dev->parent))
+		__field(u32, status)
+	),
+	TP_fast_assign(
+		__assign_str(devname);
+		__assign_str(parent);
+		__entry->status = status;
+	),
+	TP_printk("device=%s host=%s status='%s'",
+		  __get_str(devname), __get_str(parent),
+		  show_ce_errs(__entry->status)
+	)
+);
+
 TRACE_EVENT(cxl_aer_correctable_error,
 	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
 	TP_ARGS(cxlmd, status),
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index e457616373ed..23f2b1c9bd13 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -134,4 +134,8 @@ void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
 				  struct cxl_ras_capability_regs ras_cap);
 void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
 				    struct cxl_ras_capability_regs ras_cap);
+void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
+				       struct cxl_ras_capability_regs ras_cap);
+void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
+					 struct cxl_ras_capability_regs ras_cap);
 #endif /* __CXL_PCI_H__ */
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 9d4b5f39b21a..766447c169c8 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -1168,6 +1168,7 @@ static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
 		pci_get_domain_bus_and_slot(data->prot_err.agent_addr.segment,
 					    data->prot_err.agent_addr.bus,
 					    devfn);
+	int port_type;
 
 	if (!pdev)
 		return;
@@ -1176,6 +1177,18 @@ static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
 	if (pdev->driver != &cxl_pci_driver)
 		return;
 
+	port_type = pci_pcie_type(pdev);
+	if (port_type == PCI_EXP_TYPE_ROOT_PORT ||
+	    port_type == PCI_EXP_TYPE_DOWNSTREAM ||
+	    port_type == PCI_EXP_TYPE_UPSTREAM) {
+		if (data->severity == AER_CORRECTABLE)
+			cxl_cper_trace_corr_port_prot_err(pdev, data->ras_cap);
+		else
+			cxl_cper_trace_uncorr_port_prot_err(pdev, data->ras_cap);
+
+		return;
+	}
+
 	if (data->severity == AER_CORRECTABLE)
 		cxl_cper_trace_corr_prot_err(pdev, data->ras_cap);
 	else
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 6/6] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
  2025-01-23  8:44 ` [PATCH v6 6/6] cxl/pci: Add trace logging for CXL PCIe Port RAS errors Smita Koralahalli
@ 2025-01-24 16:36   ` Ira Weiny
  2025-02-05 20:01   ` Gregory Price
  2025-02-05 23:06   ` Dan Williams
  2 siblings, 0 replies; 39+ messages in thread
From: Ira Weiny @ 2025-01-24 16:36 UTC (permalink / raw)
  To: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Smita Koralahalli

Smita Koralahalli wrote:
> The CXL drivers use kernel trace functions for logging endpoint and
> Restricted CXL host (RCH) Downstream Port RAS errors. Similar functionality
> is required for CXL Root Ports, CXL Downstream Switch Ports, and CXL
> Upstream Switch Ports.
> 
> Introduce trace logging functions for both RAS correctable and
> uncorrectable errors specific to CXL PCIe Ports. Use them to trace
> FW-First Protocol errors.
> 
> Co-developed-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors
  2025-01-23  8:44 [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors Smita Koralahalli
                   ` (5 preceding siblings ...)
  2025-01-23  8:44 ` [PATCH v6 6/6] cxl/pci: Add trace logging for CXL PCIe Port RAS errors Smita Koralahalli
@ 2025-02-03 17:09 ` Dave Jiang
  2025-02-06 18:38 ` Dave Jiang
  7 siblings, 0 replies; 39+ messages in thread
From: Dave Jiang @ 2025-02-03 17:09 UTC (permalink / raw)
  To: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman



On 1/23/25 1:44 AM, Smita Koralahalli wrote:
> This patchset adds logging support for CXL CPER endpoint and port protocol
> errors.

Hi Ard,
I'd like to apply this series to cxl/next. If the EFI bits look ok to you, can you please ack the relevant patches? Thank you!
 
> 
> The first 3 patches update the existing codebase to support CXL CPER
> Protocol error reporting.
> 
> The last 3 patches introduce recognizing and reporting CXL CPER Protocol
> errors.
> 
> Link to v5:
> https://lore.kernel.org/linux-cxl/20250114120427.149260-1-Smita.KoralahalliChannabasappa@amd.com
> 
> Changes in v5 -> v6:
> [Dave, Jonathan, Ira]: Reviewed-by tags.
> [Dave]: Check for cxlds before assigning fe.
> Merge one of the patches (Port error trace logging) from Terry's Port
> error handling.
> Rename host -> parent.
> 
> Changes in v4 -> v5:
> [Dave]: Reviewed-by tags.
> [Jonathan]: Remove blank line.
> [Jonathan, Ira]: Change CXL -> "CXL".
> [Ira]: Fix build error for CONFIG_ACPI_APEI_PCIEAER.
> 
> Changes in v3 -> v4:
> [Ira]: Use memcpy() for RAS Cap struct.
> [Jonathan]: Commit description edits.
> [Jonathan]: Use separate work registration functions for protocol and
> component errors.
> [Jonathan, Ira]: Replace flags with separate functions for port and
> device errors.
> [Jonathan]: Use goto for register and unregister calls.
> 
> Changes in v2 -> v3:
> [Dan]: Define a new workqueue for CXL CPER Protocol errors and avoid
> reusing existing workqueue which handles CXL CPER events.
> [Dan] Update function and struct names.
> [Ira] Don't define common function get_cxl_devstate().
> [Dan] Use switch cases rather than defining array of structures.
> [Dan] Pass the entire cxl_cper_prot_err struct for CXL subsystem.
> [Dan] Use pr_err_ratelimited().
> [Dan] Use AER_ severities directly. Don't define CXL_ severities.
> [Dan] Limit either to Device ID or Agent Info check.
> [Dan] Validate size of RAS field matches expectations.
> 
> Changes in v2 -> v1:
> [Jonathan] Refactor code for trace support. Rename get_cxl_dev()
> to get_cxl_devstate().
> [Jonathan] Cleanups for get_cxl_devstate().
> [Alison, Jonathan]: Define array of structures for Device ID and Serial
> number comparison.
> [Dave] p_err -> rec/p_rec.
> [Jonathan] Remove pr_warn.
> 
> Smita Koralahalli (6):
>   efi/cper, cxl: Prefix protocol error struct and function names with
>     cxl_
>   efi/cper, cxl: Make definitions and structures global
>   efi/cper, cxl: Remove cper_cxl.h
>   acpi/ghes, cper: Recognize and cache CXL Protocol errors
>   acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors
>   cxl/pci: Add trace logging for CXL PCIe Port RAS errors
> 
>  drivers/acpi/apei/ghes.c        | 103 ++++++++++++++++++++++++++++++++
>  drivers/cxl/core/pci.c          |  62 +++++++++++++++++++
>  drivers/cxl/core/trace.h        |  47 +++++++++++++++
>  drivers/cxl/cxlpci.h            |   9 +++
>  drivers/cxl/pci.c               |  59 +++++++++++++++++-
>  drivers/firmware/efi/cper.c     |   6 +-
>  drivers/firmware/efi/cper_cxl.c |  39 +-----------
>  drivers/firmware/efi/cper_cxl.h |  66 --------------------
>  include/cxl/event.h             | 101 +++++++++++++++++++++++++++++++
>  include/linux/cper.h            |   8 +++
>  10 files changed, 394 insertions(+), 106 deletions(-)
>  delete mode 100644 drivers/firmware/efi/cper_cxl.h
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 4/6] acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-01-23  8:44 ` [PATCH v6 4/6] acpi/ghes, cper: Recognize and cache CXL Protocol errors Smita Koralahalli
@ 2025-02-03 18:59   ` Luck, Tony
  2025-02-05 19:35     ` Gregory Price
  2025-02-05 22:21   ` Dan Williams
  2025-07-22 19:24   ` "invalid agent type: 1" in " Marc Herbert
  2 siblings, 1 reply; 39+ messages in thread
From: Luck, Tony @ 2025-02-03 18:59 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman

On Thu, Jan 23, 2025 at 08:44:19AM +0000, Smita Koralahalli wrote:
> Add support in GHES to detect and process CXL CPER Protocol errors, as
> defined in UEFI v2.10, section N.2.13.
> 
> Define struct cxl_cper_prot_err_work_data to cache CXL protocol error
> information, including RAS capabilities and severity, for further
> handling.
> 
> These cached CXL CPER records will later be processed by workqueues
> within the CXL subsystem.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  drivers/acpi/apei/ghes.c | 54 ++++++++++++++++++++++++++++++++++++++++
>  include/cxl/event.h      |  6 +++++
>  2 files changed, 60 insertions(+)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index b72772494655..4d725d988c43 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -674,6 +674,56 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
>  	schedule_work(&entry->work);
>  }
>  
> +static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
> +				   int severity)
> +{
> +#ifdef CONFIG_ACPI_APEI_PCIEAER

#ifdef in ".c" code is ugly. But I don't see a less ugly way to deal
with this. Moving this elsewhere and adding an empty stub function
for when CONFIG_ACPI_APEI_PCIEAER isn't configured doesn't make things
any better.

> +	struct cxl_cper_prot_err_work_data wd;
> +	u8 *dvsec_start, *cap_start;
> +
> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_AGENT_ADDRESS)) {
> +		pr_err_ratelimited("CXL CPER invalid agent type\n");
> +		return;
> +	}
> +
> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) {
> +		pr_err_ratelimited("CXL CPER invalid protocol error log\n");
> +		return;
> +	}
> +
> +	if (prot_err->err_len != sizeof(struct cxl_ras_capability_regs)) {
> +		pr_err_ratelimited("CXL CPER invalid RAS Cap size (%u)\n",
> +				   prot_err->err_len);
> +		return;
> +	}
> +
> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
> +		pr_warn(FW_WARN "CXL CPER no device serial number\n");

Should this be rate limited too? It's not like this device will somehow
acquire a serial number if there is another error.

> +
> +	switch (prot_err->agent_type) {
> +	case RCD:
> +	case DEVICE:
> +	case LD:
> +	case FMLD:
> +	case RP:
> +	case DSP:
> +	case USP:
> +		memcpy(&wd.prot_err, prot_err, sizeof(wd.prot_err));
> +
> +		dvsec_start = (u8 *)(prot_err + 1);
> +		cap_start = dvsec_start + prot_err->dvsec_len;
> +
> +		memcpy(&wd.ras_cap, cap_start, sizeof(wd.ras_cap));
> +		wd.severity = cper_severity_to_aer(severity);
> +		break;
> +	default:
> +		pr_err_ratelimited("CXL CPER invalid agent type: %d\n",
> +				   prot_err->agent_type);
> +		return;
> +	}
> +#endif
> +}
> +
>  /* Room for 8 entries for each of the 4 event log queues */
>  #define CXL_CPER_FIFO_DEPTH 32
>  DEFINE_KFIFO(cxl_cper_fifo, struct cxl_cper_work_data, CXL_CPER_FIFO_DEPTH);
> @@ -777,6 +827,10 @@ static bool ghes_do_proc(struct ghes *ghes,
>  		}
>  		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>  			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
> +		} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
> +			struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
> +
> +			cxl_cper_post_prot_err(prot_err, gdata->error_severity);
>  		} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
>  			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
>  
> diff --git a/include/cxl/event.h b/include/cxl/event.h
> index 66d85fc87701..ee1c3dec62fa 100644
> --- a/include/cxl/event.h
> +++ b/include/cxl/event.h
> @@ -232,6 +232,12 @@ struct cxl_ras_capability_regs {
>  	u32 header_log[16];
>  };
>  
> +struct cxl_cper_prot_err_work_data {
> +	struct cxl_cper_sec_prot_err prot_err;
> +	struct cxl_ras_capability_regs ras_cap;
> +	int severity;
> +};
> +
>  #ifdef CONFIG_ACPI_APEI_GHES
>  int cxl_cper_register_work(struct work_struct *work);
>  int cxl_cper_unregister_work(struct work_struct *work);
> -- 
> 2.17.1
> 

With added ratelimit mentioned above:

Reviewed-by: Tony Luck <tony.luck@intel.com>

-Tony

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 5/6] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors
  2025-01-23  8:44 ` [PATCH v6 5/6] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors Smita Koralahalli
@ 2025-02-03 19:03   ` Luck, Tony
  2025-02-12 21:04     ` Koralahalli Channabasappa, Smita
  2025-02-05 19:50   ` Gregory Price
  2025-02-05 22:58   ` Dan Williams
  2 siblings, 1 reply; 39+ messages in thread
From: Luck, Tony @ 2025-02-03 19:03 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman

On Thu, Jan 23, 2025 at 08:44:20AM +0000, Smita Koralahalli wrote:
> When PCIe AER is in FW-First, OS should process CXL Protocol errors from
> CPER records. Introduce support for handling and logging CXL Protocol
> errors.
> 
> The defined trace events cxl_aer_uncorrectable_error and
> cxl_aer_correctable_error trace native CXL AER endpoint errors. Reuse them
> to trace FW-First Protocol errors.
> 
> Since the CXL code is required to be called from process context and
> GHES is in interrupt context, use workqueues for processing.
> 
> Similar to CXL CPER event handling, use kfifo to handle errors as it
> simplifies queue processing by providing lock free fifo operations.
> 
> Add the ability for the CXL sub-system to register a workqueue to
> process CXL CPER protocol errors.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  drivers/acpi/apei/ghes.c | 49 ++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/pci.c   | 36 +++++++++++++++++++++++++++++
>  drivers/cxl/cxlpci.h     |  5 ++++
>  drivers/cxl/pci.c        | 46 ++++++++++++++++++++++++++++++++++++-
>  include/cxl/event.h      | 15 ++++++++++++
>  5 files changed, 150 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 4d725d988c43..289e365f84b2 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -674,6 +674,15 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
>  	schedule_work(&entry->work);
>  }
>  
> +/* Room for 8 entries */

Any science behind the choice of "8" here? This comment is merely
stating what the #define is used for, not why 8 was chosen.

> +#define CXL_CPER_PROT_ERR_FIFO_DEPTH 8
> +static DEFINE_KFIFO(cxl_cper_prot_err_fifo, struct cxl_cper_prot_err_work_data,
> +		    CXL_CPER_PROT_ERR_FIFO_DEPTH);
> +
> +/* Synchronize schedule_work() with cxl_cper_prot_err_work changes */
> +static DEFINE_SPINLOCK(cxl_cper_prot_err_work_lock);
> +struct work_struct *cxl_cper_prot_err_work;
> +
>  static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
>  				   int severity)
>  {
> @@ -700,6 +709,11 @@ static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
>  	if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
>  		pr_warn(FW_WARN "CXL CPER no device serial number\n");
>  
> +	guard(spinlock_irqsave)(&cxl_cper_prot_err_work_lock);
> +
> +	if (!cxl_cper_prot_err_work)
> +		return;
> +
>  	switch (prot_err->agent_type) {
>  	case RCD:
>  	case DEVICE:
> @@ -721,9 +735,44 @@ static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
>  				   prot_err->agent_type);
>  		return;
>  	}
> +
> +	if (!kfifo_put(&cxl_cper_prot_err_fifo, wd)) {
> +		pr_err_ratelimited("CXL CPER kfifo overflow\n");
> +		return;
> +	}
> +
> +	schedule_work(cxl_cper_prot_err_work);
>  #endif
>  }
>  
> +int cxl_cper_register_prot_err_work(struct work_struct *work)
> +{
> +	if (cxl_cper_prot_err_work)
> +		return -EINVAL;
> +
> +	guard(spinlock)(&cxl_cper_prot_err_work_lock);
> +	cxl_cper_prot_err_work = work;
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_cper_register_prot_err_work, "CXL");
> +
> +int cxl_cper_unregister_prot_err_work(struct work_struct *work)
> +{
> +	if (cxl_cper_prot_err_work != work)
> +		return -EINVAL;
> +
> +	guard(spinlock)(&cxl_cper_prot_err_work_lock);
> +	cxl_cper_prot_err_work = NULL;
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_prot_err_work, "CXL");
> +
> +int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
> +{
> +	return kfifo_get(&cxl_cper_prot_err_fifo, wd);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_cper_prot_err_kfifo_get, "CXL");
> +
>  /* Room for 8 entries for each of the 4 event log queues */
>  #define CXL_CPER_FIFO_DEPTH 32
>  DEFINE_KFIFO(cxl_cper_fifo, struct cxl_cper_work_data, CXL_CPER_FIFO_DEPTH);
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 9d58ab9d33c5..5840056bb9a3 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -650,6 +650,42 @@ void read_cdat_data(struct cxl_port *port)
>  }
>  EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
>  
> +void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
> +				  struct cxl_ras_capability_regs ras_cap)
> +{
> +	u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
> +	struct cxl_dev_state *cxlds;
> +
> +	cxlds = pci_get_drvdata(pdev);
> +	if (!cxlds)
> +		return;
> +
> +	trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_cper_trace_corr_prot_err, "CXL");
> +
> +void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
> +				    struct cxl_ras_capability_regs ras_cap)
> +{
> +	u32 status = ras_cap.uncor_status & ~ras_cap.uncor_mask;
> +	struct cxl_dev_state *cxlds;
> +	u32 fe;
> +
> +	cxlds = pci_get_drvdata(pdev);
> +	if (!cxlds)
> +		return;
> +
> +	if (hweight32(status) > 1)
> +		fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
> +				   ras_cap.cap_control));
> +	else
> +		fe = status;
> +
> +	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe,
> +					  ras_cap.header_log);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_cper_trace_uncorr_prot_err, "CXL");
> +
>  static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
>  				 void __iomem *ras_base)
>  {
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 4da07727ab9c..e457616373ed 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -129,4 +129,9 @@ void read_cdat_data(struct cxl_port *port);
>  void cxl_cor_error_detected(struct pci_dev *pdev);
>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  				    pci_channel_state_t state);
> +struct cxl_ras_capability_regs;
> +void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
> +				  struct cxl_ras_capability_regs ras_cap);
> +void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
> +				    struct cxl_ras_capability_regs ras_cap);
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 6d94ff4a4f1a..9d4b5f39b21a 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1160,6 +1160,37 @@ static void cxl_cper_work_fn(struct work_struct *work)
>  }
>  static DECLARE_WORK(cxl_cper_work, cxl_cper_work_fn);
>  
> +static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
> +{
> +	unsigned int devfn = PCI_DEVFN(data->prot_err.agent_addr.device,
> +				       data->prot_err.agent_addr.function);
> +	struct pci_dev *pdev __free(pci_dev_put) =
> +		pci_get_domain_bus_and_slot(data->prot_err.agent_addr.segment,
> +					    data->prot_err.agent_addr.bus,
> +					    devfn);
> +
> +	if (!pdev)
> +		return;
> +
> +	guard(device)(&pdev->dev);
> +	if (pdev->driver != &cxl_pci_driver)
> +		return;
> +
> +	if (data->severity == AER_CORRECTABLE)
> +		cxl_cper_trace_corr_prot_err(pdev, data->ras_cap);
> +	else
> +		cxl_cper_trace_uncorr_prot_err(pdev, data->ras_cap);
> +}
> +
> +static void cxl_cper_prot_err_work_fn(struct work_struct *work)
> +{
> +	struct cxl_cper_prot_err_work_data wd;
> +
> +	while (cxl_cper_prot_err_kfifo_get(&wd))
> +		cxl_cper_handle_prot_err(&wd);
> +}
> +static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
> +
>  static int __init cxl_pci_driver_init(void)
>  {
>  	int rc;
> @@ -1170,7 +1201,18 @@ static int __init cxl_pci_driver_init(void)
>  
>  	rc = cxl_cper_register_work(&cxl_cper_work);
>  	if (rc)
> -		pci_unregister_driver(&cxl_pci_driver);
> +		goto err_unreg;
> +
> +	rc = cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> +	if (rc)
> +		goto err_unregister_work;
> +
> +	return 0;
> +
> +err_unregister_work:
> +	cxl_cper_unregister_work(&cxl_cper_work);
> +err_unreg:
> +	pci_unregister_driver(&cxl_pci_driver);
>  
>  	return rc;
>  }
> @@ -1178,7 +1220,9 @@ static int __init cxl_pci_driver_init(void)
>  static void __exit cxl_pci_driver_exit(void)
>  {
>  	cxl_cper_unregister_work(&cxl_cper_work);
> +	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>  	cancel_work_sync(&cxl_cper_work);
> +	cancel_work_sync(&cxl_cper_prot_err_work);
>  	pci_unregister_driver(&cxl_pci_driver);
>  }
>  
> diff --git a/include/cxl/event.h b/include/cxl/event.h
> index ee1c3dec62fa..359a8f44a2e0 100644
> --- a/include/cxl/event.h
> +++ b/include/cxl/event.h
> @@ -242,6 +242,9 @@ struct cxl_cper_prot_err_work_data {
>  int cxl_cper_register_work(struct work_struct *work);
>  int cxl_cper_unregister_work(struct work_struct *work);
>  int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd);
> +int cxl_cper_register_prot_err_work(struct work_struct *work);
> +int cxl_cper_unregister_prot_err_work(struct work_struct *work);
> +int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd);
>  #else
>  static inline int cxl_cper_register_work(struct work_struct *work)
>  {
> @@ -256,6 +259,18 @@ static inline int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
>  {
>  	return 0;
>  }
> +static inline int cxl_cper_register_prot_err_work(struct work_struct *work)
> +{
> +	return 0;
> +}
> +static inline int cxl_cper_unregister_prot_err_work(struct work_struct *work)
> +{
> +	return 0;
> +}
> +static inline int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
> +{
> +	return 0;
> +}
>  #endif
>  
>  #endif /* _LINUX_CXL_EVENT_H */
> -- 
> 2.17.1
> 

Nitpick about the "8" value, but patch can go in like this
and the value changed if there is later evidence to justify
some other value.

Reviewed-by: Tony Luck <tony.luck@intel.com>

-Tony

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 1/6] efi/cper, cxl: Prefix protocol error struct and function names with cxl_
  2025-01-23  8:44 ` [PATCH v6 1/6] efi/cper, cxl: Prefix protocol error struct and function names with cxl_ Smita Koralahalli
@ 2025-02-04  0:12   ` Fan Ni
  2025-02-05 19:17   ` Gregory Price
  1 sibling, 0 replies; 39+ messages in thread
From: Fan Ni @ 2025-02-04  0:12 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman

On Thu, Jan 23, 2025 at 08:44:16AM +0000, Smita Koralahalli wrote:
> Rename the protocol error struct from struct cper_sec_prot_err to
> struct cxl_cper_sec_prot_err and cper_print_prot_err() to
> cxl_cper_print_prot_err() to maintain naming consistency. No
> functional changes.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Reviewed-by: Fan Ni <fan.ni@samsung.com>

> ---
>  drivers/firmware/efi/cper.c     | 4 ++--
>  drivers/firmware/efi/cper_cxl.c | 3 ++-
>  drivers/firmware/efi/cper_cxl.h | 5 +++--
>  3 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index b69e68ef3f02..8e5762f7ef2e 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -624,11 +624,11 @@ cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata
>  		else
>  			goto err_section_too_small;
>  	} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
> -		struct cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
> +		struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
>  
>  		printk("%ssection_type: CXL Protocol Error\n", newpfx);
>  		if (gdata->error_data_length >= sizeof(*prot_err))
> -			cper_print_prot_err(newpfx, prot_err);
> +			cxl_cper_print_prot_err(newpfx, prot_err);
>  		else
>  			goto err_section_too_small;
>  	} else {
> diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c
> index a55771b99a97..cbaabcb7382d 100644
> --- a/drivers/firmware/efi/cper_cxl.c
> +++ b/drivers/firmware/efi/cper_cxl.c
> @@ -55,7 +55,8 @@ enum {
>  	USP,	/* CXL Upstream Switch Port */
>  };
>  
> -void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_err)
> +void cxl_cper_print_prot_err(const char *pfx,
> +			     const struct cxl_cper_sec_prot_err *prot_err)
>  {
>  	if (prot_err->valid_bits & PROT_ERR_VALID_AGENT_TYPE)
>  		pr_info("%s agent_type: %d, %s\n", pfx, prot_err->agent_type,
> diff --git a/drivers/firmware/efi/cper_cxl.h b/drivers/firmware/efi/cper_cxl.h
> index 86bfcf7909ec..0e3ab0ba17c3 100644
> --- a/drivers/firmware/efi/cper_cxl.h
> +++ b/drivers/firmware/efi/cper_cxl.h
> @@ -18,7 +18,7 @@
>  #pragma pack(1)
>  
>  /* Compute Express Link Protocol Error Section, UEFI v2.10 sec N.2.13 */
> -struct cper_sec_prot_err {
> +struct cxl_cper_sec_prot_err {
>  	u64 valid_bits;
>  	u8 agent_type;
>  	u8 reserved[7];
> @@ -61,6 +61,7 @@ struct cper_sec_prot_err {
>  
>  #pragma pack()
>  
> -void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_err);
> +void cxl_cper_print_prot_err(const char *pfx,
> +			     const struct cxl_cper_sec_prot_err *prot_err);
>  
>  #endif //__CPER_CXL_
> -- 
> 2.17.1
> 

-- 
Fan Ni

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 2/6] efi/cper, cxl: Make definitions and structures global
  2025-01-23  8:44 ` [PATCH v6 2/6] efi/cper, cxl: Make definitions and structures global Smita Koralahalli
@ 2025-02-04  0:16   ` Fan Ni
  2025-02-05 19:16   ` Gregory Price
  1 sibling, 0 replies; 39+ messages in thread
From: Fan Ni @ 2025-02-04  0:16 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman

On Thu, Jan 23, 2025 at 08:44:17AM +0000, Smita Koralahalli wrote:
> In preparation to add tracepoint support, move protocol error UUID
> definition to a common location, Also, make struct CXL RAS capability,
s/, Also/. Also/

Reviewed-by: Fan Ni <fan.ni@samsung.com>

> cxl_cper_sec_prot_err and CPER validation flags global for use across
> different modules.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/firmware/efi/cper.c     |  1 +
>  drivers/firmware/efi/cper_cxl.c | 35 +--------------
>  drivers/firmware/efi/cper_cxl.h | 51 ---------------------
>  include/cxl/event.h             | 80 +++++++++++++++++++++++++++++++++
>  include/linux/cper.h            |  4 ++
>  5 files changed, 86 insertions(+), 85 deletions(-)
> 
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 8e5762f7ef2e..ae1953e2b214 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -24,6 +24,7 @@
>  #include <linux/bcd.h>
>  #include <acpi/ghes.h>
>  #include <ras/ras_event.h>
> +#include <cxl/event.h>
>  #include "cper_cxl.h"
>  
>  /*
> diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c
> index cbaabcb7382d..64c0dd27be6e 100644
> --- a/drivers/firmware/efi/cper_cxl.c
> +++ b/drivers/firmware/efi/cper_cxl.c
> @@ -8,27 +8,9 @@
>   */
>  
>  #include <linux/cper.h>
> +#include <cxl/event.h>
>  #include "cper_cxl.h"
>  
> -#define PROT_ERR_VALID_AGENT_TYPE		BIT_ULL(0)
> -#define PROT_ERR_VALID_AGENT_ADDRESS		BIT_ULL(1)
> -#define PROT_ERR_VALID_DEVICE_ID		BIT_ULL(2)
> -#define PROT_ERR_VALID_SERIAL_NUMBER		BIT_ULL(3)
> -#define PROT_ERR_VALID_CAPABILITY		BIT_ULL(4)
> -#define PROT_ERR_VALID_DVSEC			BIT_ULL(5)
> -#define PROT_ERR_VALID_ERROR_LOG		BIT_ULL(6)
> -
> -/* CXL RAS Capability Structure, CXL v3.0 sec 8.2.4.16 */
> -struct cxl_ras_capability_regs {
> -	u32 uncor_status;
> -	u32 uncor_mask;
> -	u32 uncor_severity;
> -	u32 cor_status;
> -	u32 cor_mask;
> -	u32 cap_control;
> -	u32 header_log[16];
> -};
> -
>  static const char * const prot_err_agent_type_strs[] = {
>  	"Restricted CXL Device",
>  	"Restricted CXL Host Downstream Port",
> @@ -40,21 +22,6 @@ static const char * const prot_err_agent_type_strs[] = {
>  	"CXL Upstream Switch Port",
>  };
>  
> -/*
> - * The layout of the enumeration and the values matches CXL Agent Type
> - * field in the UEFI 2.10 Section N.2.13,
> - */
> -enum {
> -	RCD,	/* Restricted CXL Device */
> -	RCH_DP,	/* Restricted CXL Host Downstream Port */
> -	DEVICE,	/* CXL Device */
> -	LD,	/* CXL Logical Device */
> -	FMLD,	/* CXL Fabric Manager managed Logical Device */
> -	RP,	/* CXL Root Port */
> -	DSP,	/* CXL Downstream Switch Port */
> -	USP,	/* CXL Upstream Switch Port */
> -};
> -
>  void cxl_cper_print_prot_err(const char *pfx,
>  			     const struct cxl_cper_sec_prot_err *prot_err)
>  {
> diff --git a/drivers/firmware/efi/cper_cxl.h b/drivers/firmware/efi/cper_cxl.h
> index 0e3ab0ba17c3..5ce1401ee17a 100644
> --- a/drivers/firmware/efi/cper_cxl.h
> +++ b/drivers/firmware/efi/cper_cxl.h
> @@ -10,57 +10,6 @@
>  #ifndef LINUX_CPER_CXL_H
>  #define LINUX_CPER_CXL_H
>  
> -/* CXL Protocol Error Section */
> -#define CPER_SEC_CXL_PROT_ERR						\
> -	GUID_INIT(0x80B9EFB4, 0x52B5, 0x4DE3, 0xA7, 0x77, 0x68, 0x78,	\
> -		  0x4B, 0x77, 0x10, 0x48)
> -
> -#pragma pack(1)
> -
> -/* Compute Express Link Protocol Error Section, UEFI v2.10 sec N.2.13 */
> -struct cxl_cper_sec_prot_err {
> -	u64 valid_bits;
> -	u8 agent_type;
> -	u8 reserved[7];
> -
> -	/*
> -	 * Except for RCH Downstream Port, all the remaining CXL Agent
> -	 * types are uniquely identified by the PCIe compatible SBDF number.
> -	 */
> -	union {
> -		u64 rcrb_base_addr;
> -		struct {
> -			u8 function;
> -			u8 device;
> -			u8 bus;
> -			u16 segment;
> -			u8 reserved_1[3];
> -		};
> -	} agent_addr;
> -
> -	struct {
> -		u16 vendor_id;
> -		u16 device_id;
> -		u16 subsystem_vendor_id;
> -		u16 subsystem_id;
> -		u8 class_code[2];
> -		u16 slot;
> -		u8 reserved_1[4];
> -	} device_id;
> -
> -	struct {
> -		u32 lower_dw;
> -		u32 upper_dw;
> -	} dev_serial_num;
> -
> -	u8 capability[60];
> -	u16 dvsec_len;
> -	u16 err_len;
> -	u8 reserved_2[4];
> -};
> -
> -#pragma pack()
> -
>  void cxl_cper_print_prot_err(const char *pfx,
>  			     const struct cxl_cper_sec_prot_err *prot_err);
>  
> diff --git a/include/cxl/event.h b/include/cxl/event.h
> index 0bea1afbd747..66d85fc87701 100644
> --- a/include/cxl/event.h
> +++ b/include/cxl/event.h
> @@ -152,6 +152,86 @@ struct cxl_cper_work_data {
>  	struct cxl_cper_event_rec rec;
>  };
>  
> +#define PROT_ERR_VALID_AGENT_TYPE		BIT_ULL(0)
> +#define PROT_ERR_VALID_AGENT_ADDRESS		BIT_ULL(1)
> +#define PROT_ERR_VALID_DEVICE_ID		BIT_ULL(2)
> +#define PROT_ERR_VALID_SERIAL_NUMBER		BIT_ULL(3)
> +#define PROT_ERR_VALID_CAPABILITY		BIT_ULL(4)
> +#define PROT_ERR_VALID_DVSEC			BIT_ULL(5)
> +#define PROT_ERR_VALID_ERROR_LOG		BIT_ULL(6)
> +
> +/*
> + * The layout of the enumeration and the values matches CXL Agent Type
> + * field in the UEFI 2.10 Section N.2.13,
> + */
> +enum {
> +	RCD,	/* Restricted CXL Device */
> +	RCH_DP,	/* Restricted CXL Host Downstream Port */
> +	DEVICE,	/* CXL Device */
> +	LD,	/* CXL Logical Device */
> +	FMLD,	/* CXL Fabric Manager managed Logical Device */
> +	RP,	/* CXL Root Port */
> +	DSP,	/* CXL Downstream Switch Port */
> +	USP,	/* CXL Upstream Switch Port */
> +};
> +
> +#pragma pack(1)
> +
> +/* Compute Express Link Protocol Error Section, UEFI v2.10 sec N.2.13 */
> +struct cxl_cper_sec_prot_err {
> +	u64 valid_bits;
> +	u8 agent_type;
> +	u8 reserved[7];
> +
> +	/*
> +	 * Except for RCH Downstream Port, all the remaining CXL Agent
> +	 * types are uniquely identified by the PCIe compatible SBDF number.
> +	 */
> +	union {
> +		u64 rcrb_base_addr;
> +		struct {
> +			u8 function;
> +			u8 device;
> +			u8 bus;
> +			u16 segment;
> +			u8 reserved_1[3];
> +		};
> +	} agent_addr;
> +
> +	struct {
> +		u16 vendor_id;
> +		u16 device_id;
> +		u16 subsystem_vendor_id;
> +		u16 subsystem_id;
> +		u8 class_code[2];
> +		u16 slot;
> +		u8 reserved_1[4];
> +	} device_id;
> +
> +	struct {
> +		u32 lower_dw;
> +		u32 upper_dw;
> +	} dev_serial_num;
> +
> +	u8 capability[60];
> +	u16 dvsec_len;
> +	u16 err_len;
> +	u8 reserved_2[4];
> +};
> +
> +#pragma pack()
> +
> +/* CXL RAS Capability Structure, CXL v3.0 sec 8.2.4.16 */
> +struct cxl_ras_capability_regs {
> +	u32 uncor_status;
> +	u32 uncor_mask;
> +	u32 uncor_severity;
> +	u32 cor_status;
> +	u32 cor_mask;
> +	u32 cap_control;
> +	u32 header_log[16];
> +};
> +
>  #ifdef CONFIG_ACPI_APEI_GHES
>  int cxl_cper_register_work(struct work_struct *work);
>  int cxl_cper_unregister_work(struct work_struct *work);
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index 265b0f8fc0b3..5c6d4d5b9975 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -89,6 +89,10 @@ enum {
>  #define CPER_NOTIFY_DMAR						\
>  	GUID_INIT(0x667DD791, 0xC6B3, 0x4c27, 0x8A, 0x6B, 0x0F, 0x8E,	\
>  		  0x72, 0x2D, 0xEB, 0x41)
> +/* CXL Protocol Error Section */
> +#define CPER_SEC_CXL_PROT_ERR						\
> +	GUID_INIT(0x80B9EFB4, 0x52B5, 0x4DE3, 0xA7, 0x77, 0x68, 0x78,	\
> +		  0x4B, 0x77, 0x10, 0x48)
>  
>  /* CXL Event record UUIDs are formatted as GUIDs and reported in section type */
>  /*
> -- 
> 2.17.1
> 

-- 
Fan Ni

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 3/6] efi/cper, cxl: Remove cper_cxl.h
  2025-01-23  8:44 ` [PATCH v6 3/6] efi/cper, cxl: Remove cper_cxl.h Smita Koralahalli
@ 2025-02-04  0:20   ` Fan Ni
  2025-02-05 19:18   ` Gregory Price
  1 sibling, 0 replies; 39+ messages in thread
From: Fan Ni @ 2025-02-04  0:20 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman

On Thu, Jan 23, 2025 at 08:44:18AM +0000, Smita Koralahalli wrote:
> Move the declaration of cxl_cper_print_prot_err() to include/linux/cper.h
> to avoid maintaining a separate header file just for this function
> declaration. Remove drivers/firmware/efi/cper_cxl.h as its contents have
> been reorganized.
> 
> No functional changes.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Reviewed-by: Fan Ni <fan.ni@samsung.com>

> ---
>  drivers/firmware/efi/cper.c     |  1 -
>  drivers/firmware/efi/cper_cxl.c |  1 -
>  drivers/firmware/efi/cper_cxl.h | 16 ----------------
>  include/linux/cper.h            |  4 ++++
>  4 files changed, 4 insertions(+), 18 deletions(-)
>  delete mode 100644 drivers/firmware/efi/cper_cxl.h
> 
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index ae1953e2b214..928409199a1a 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -25,7 +25,6 @@
>  #include <acpi/ghes.h>
>  #include <ras/ras_event.h>
>  #include <cxl/event.h>
> -#include "cper_cxl.h"
>  
>  /*
>   * CPER record ID need to be unique even after reboot, because record
> diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c
> index 64c0dd27be6e..8a7667faf953 100644
> --- a/drivers/firmware/efi/cper_cxl.c
> +++ b/drivers/firmware/efi/cper_cxl.c
> @@ -9,7 +9,6 @@
>  
>  #include <linux/cper.h>
>  #include <cxl/event.h>
> -#include "cper_cxl.h"
>  
>  static const char * const prot_err_agent_type_strs[] = {
>  	"Restricted CXL Device",
> diff --git a/drivers/firmware/efi/cper_cxl.h b/drivers/firmware/efi/cper_cxl.h
> deleted file mode 100644
> index 5ce1401ee17a..000000000000
> --- a/drivers/firmware/efi/cper_cxl.h
> +++ /dev/null
> @@ -1,16 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -/*
> - * UEFI Common Platform Error Record (CPER) support for CXL Section.
> - *
> - * Copyright (C) 2022 Advanced Micro Devices, Inc.
> - *
> - * Author: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> - */
> -
> -#ifndef LINUX_CPER_CXL_H
> -#define LINUX_CPER_CXL_H
> -
> -void cxl_cper_print_prot_err(const char *pfx,
> -			     const struct cxl_cper_sec_prot_err *prot_err);
> -
> -#endif //__CPER_CXL_
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index 5c6d4d5b9975..0ed60a91eca9 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -605,4 +605,8 @@ void cper_estatus_print(const char *pfx,
>  int cper_estatus_check_header(const struct acpi_hest_generic_status *estatus);
>  int cper_estatus_check(const struct acpi_hest_generic_status *estatus);
>  
> +struct cxl_cper_sec_prot_err;
> +void cxl_cper_print_prot_err(const char *pfx,
> +			     const struct cxl_cper_sec_prot_err *prot_err);
> +
>  #endif
> -- 
> 2.17.1
> 

-- 
Fan Ni

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 2/6] efi/cper, cxl: Make definitions and structures global
  2025-01-23  8:44 ` [PATCH v6 2/6] efi/cper, cxl: Make definitions and structures global Smita Koralahalli
  2025-02-04  0:16   ` Fan Ni
@ 2025-02-05 19:16   ` Gregory Price
  2025-02-06 10:54     ` Jonathan Cameron
  1 sibling, 1 reply; 39+ messages in thread
From: Gregory Price @ 2025-02-05 19:16 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman

On Thu, Jan 23, 2025 at 08:44:17AM +0000, Smita Koralahalli wrote:
> In preparation to add tracepoint support, move protocol error UUID
> definition to a common location, Also, make struct CXL RAS capability,
> cxl_cper_sec_prot_err and CPER validation flags global for use across
> different modules.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Reveiwed-by: Gregory Price <gourry@gourry.net>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 1/6] efi/cper, cxl: Prefix protocol error struct and function names with cxl_
  2025-01-23  8:44 ` [PATCH v6 1/6] efi/cper, cxl: Prefix protocol error struct and function names with cxl_ Smita Koralahalli
  2025-02-04  0:12   ` Fan Ni
@ 2025-02-05 19:17   ` Gregory Price
  1 sibling, 0 replies; 39+ messages in thread
From: Gregory Price @ 2025-02-05 19:17 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman

On Thu, Jan 23, 2025 at 08:44:16AM +0000, Smita Koralahalli wrote:
> Rename the protocol error struct from struct cper_sec_prot_err to
> struct cxl_cper_sec_prot_err and cper_print_prot_err() to
> cxl_cper_print_prot_err() to maintain naming consistency. No
> functional changes.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Reviewed-by: Gregory Price <gourry@gourry.net>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 3/6] efi/cper, cxl: Remove cper_cxl.h
  2025-01-23  8:44 ` [PATCH v6 3/6] efi/cper, cxl: Remove cper_cxl.h Smita Koralahalli
  2025-02-04  0:20   ` Fan Ni
@ 2025-02-05 19:18   ` Gregory Price
  1 sibling, 0 replies; 39+ messages in thread
From: Gregory Price @ 2025-02-05 19:18 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman

On Thu, Jan 23, 2025 at 08:44:18AM +0000, Smita Koralahalli wrote:
> Move the declaration of cxl_cper_print_prot_err() to include/linux/cper.h
> to avoid maintaining a separate header file just for this function
> declaration. Remove drivers/firmware/efi/cper_cxl.h as its contents have
> been reorganized.
> 
> No functional changes.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Reviewed-by: Gregory Price <gourry@gourry.net>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 4/6] acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-02-03 18:59   ` Luck, Tony
@ 2025-02-05 19:35     ` Gregory Price
  0 siblings, 0 replies; 39+ messages in thread
From: Gregory Price @ 2025-02-05 19:35 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl,
	Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman

On Mon, Feb 03, 2025 at 10:59:14AM -0800, Luck, Tony wrote:
> On Thu, Jan 23, 2025 at 08:44:19AM +0000, Smita Koralahalli wrote:
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index b72772494655..4d725d988c43 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -674,6 +674,56 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
> >  	schedule_work(&entry->work);
> >  }
> >  
> > +static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
> > +				   int severity)
> > +{
> > +#ifdef CONFIG_ACPI_APEI_PCIEAER
> 
> #ifdef in ".c" code is ugly. But I don't see a less ugly way to deal
> with this. Moving this elsewhere and adding an empty stub function
> for when CONFIG_ACPI_APEI_PCIEAER isn't configured doesn't make things
> any better.
>

The generally accepted ways I've seen for static funcs in .c is:

#ifdef CONFIG_ACPI_APEI_PCIEAER
static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
{
	...
}
#else
static void ghes_handle_aer(struct acpi_hest_generic_data *gdata) { }
#endif

This at least makes the function that does stuff easier to read and
gives you a spot to throw out a "Config not enabled" error if its
beneficial.

More of a style nit than anything else.


But either way,

Reviewed-by: Gregory Price <gourry@gourry.net>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 5/6] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors
  2025-01-23  8:44 ` [PATCH v6 5/6] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors Smita Koralahalli
  2025-02-03 19:03   ` Luck, Tony
@ 2025-02-05 19:50   ` Gregory Price
  2025-02-05 22:58   ` Dan Williams
  2 siblings, 0 replies; 39+ messages in thread
From: Gregory Price @ 2025-02-05 19:50 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman

On Thu, Jan 23, 2025 at 08:44:20AM +0000, Smita Koralahalli wrote:
> When PCIe AER is in FW-First, OS should process CXL Protocol errors from
> CPER records. Introduce support for handling and logging CXL Protocol
> errors.
> 
> The defined trace events cxl_aer_uncorrectable_error and
> cxl_aer_correctable_error trace native CXL AER endpoint errors. Reuse them
> to trace FW-First Protocol errors.
> 
> Since the CXL code is required to be called from process context and
> GHES is in interrupt context, use workqueues for processing.
> 
> Similar to CXL CPER event handling, use kfifo to handle errors as it
> simplifies queue processing by providing lock free fifo operations.
> 
> Add the ability for the CXL sub-system to register a workqueue to
> process CXL CPER protocol errors.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>

Reviewed-by: Gregory Price <gourry@gourry.net>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 6/6] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
  2025-01-23  8:44 ` [PATCH v6 6/6] cxl/pci: Add trace logging for CXL PCIe Port RAS errors Smita Koralahalli
  2025-01-24 16:36   ` Ira Weiny
@ 2025-02-05 20:01   ` Gregory Price
  2025-02-05 23:06   ` Dan Williams
  2 siblings, 0 replies; 39+ messages in thread
From: Gregory Price @ 2025-02-05 20:01 UTC (permalink / raw)
  To: Smita Koralahalli
  Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman

On Thu, Jan 23, 2025 at 08:44:21AM +0000, Smita Koralahalli wrote:
> The CXL drivers use kernel trace functions for logging endpoint and
> Restricted CXL host (RCH) Downstream Port RAS errors. Similar functionality
> is required for CXL Root Ports, CXL Downstream Switch Ports, and CXL
> Upstream Switch Ports.
> 
> Introduce trace logging functions for both RAS correctable and
> uncorrectable errors specific to CXL PCIe Ports. Use them to trace
> FW-First Protocol errors.
> 
> Co-developed-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>

Reviewed-by: Gregory Price <gourry@gourry.net>

> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 9d4b5f39b21a..766447c169c8 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1168,6 +1168,7 @@ static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
>  		pci_get_domain_bus_and_slot(data->prot_err.agent_addr.segment,
>  					    data->prot_err.agent_addr.bus,
>  					    devfn);
> +	int port_type;
>  
>  	if (!pdev)
>  		return;
> @@ -1176,6 +1177,18 @@ static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
>  	if (pdev->driver != &cxl_pci_driver)
>  		return;
>  
> +	port_type = pci_pcie_type(pdev);
> +	if (port_type == PCI_EXP_TYPE_ROOT_PORT ||
> +	    port_type == PCI_EXP_TYPE_DOWNSTREAM ||
> +	    port_type == PCI_EXP_TYPE_UPSTREAM) {

Almost wish this was a macro for the sake of style, but not worth it.

"corr_prot_err" and "port_prot_err" kind of blend at first glance. Not
worth holding anything up.

> +		if (data->severity == AER_CORRECTABLE)
> +			cxl_cper_trace_corr_port_prot_err(pdev, data->ras_cap);
> +		else
> +			cxl_cper_trace_uncorr_port_prot_err(pdev, data->ras_cap);
> +
> +		return;
> +	}
> +
>  	if (data->severity == AER_CORRECTABLE)
>  		cxl_cper_trace_corr_prot_err(pdev, data->ras_cap);
>  	else
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 4/6] acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-01-23  8:44 ` [PATCH v6 4/6] acpi/ghes, cper: Recognize and cache CXL Protocol errors Smita Koralahalli
  2025-02-03 18:59   ` Luck, Tony
@ 2025-02-05 22:21   ` Dan Williams
  2025-07-22 19:24   ` "invalid agent type: 1" in " Marc Herbert
  2 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2025-02-05 22:21 UTC (permalink / raw)
  To: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Smita Koralahalli

Smita Koralahalli wrote:
> Add support in GHES to detect and process CXL CPER Protocol errors, as
> defined in UEFI v2.10, section N.2.13.
> 
> Define struct cxl_cper_prot_err_work_data to cache CXL protocol error
> information, including RAS capabilities and severity, for further
> handling.
> 
> These cached CXL CPER records will later be processed by workqueues
> within the CXL subsystem.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  drivers/acpi/apei/ghes.c | 54 ++++++++++++++++++++++++++++++++++++++++
>  include/cxl/event.h      |  6 +++++
>  2 files changed, 60 insertions(+)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index b72772494655..4d725d988c43 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -674,6 +674,56 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
>  	schedule_work(&entry->work);
>  }
>  
> +static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
> +				   int severity)
> +{
> +#ifdef CONFIG_ACPI_APEI_PCIEAER

This is ok for now, but now that there are two of these functions I
would not say "no" to a follow on patch that moves ghes_handle_aer() and
cxl_cper_post_prot_err() to a new:

   drivers/acpi/apei/aer.c

...that gets conditionally compiled instead of this minor coding-style
nit. Certainly if a third instance like this arrives, a new aer.c
becomes warranted.

> +	struct cxl_cper_prot_err_work_data wd;
> +	u8 *dvsec_start, *cap_start;
> +
> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_AGENT_ADDRESS)) {
> +		pr_err_ratelimited("CXL CPER invalid agent type\n");
> +		return;
> +	}

I notice that the component event prints are not ratelimited, so a
follow-on patch to add rate limiting to cxl_cper_post_event() would be
welcome to match what you have here.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 5/6] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors
  2025-01-23  8:44 ` [PATCH v6 5/6] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors Smita Koralahalli
  2025-02-03 19:03   ` Luck, Tony
  2025-02-05 19:50   ` Gregory Price
@ 2025-02-05 22:58   ` Dan Williams
  2025-02-12 20:57     ` Koralahalli Channabasappa, Smita
  2 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2025-02-05 22:58 UTC (permalink / raw)
  To: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Smita Koralahalli

Smita Koralahalli wrote:
> When PCIe AER is in FW-First, OS should process CXL Protocol errors from
> CPER records. Introduce support for handling and logging CXL Protocol
> errors.
> 
> The defined trace events cxl_aer_uncorrectable_error and
> cxl_aer_correctable_error trace native CXL AER endpoint errors. Reuse them
> to trace FW-First Protocol errors.
> 
> Since the CXL code is required to be called from process context and
> GHES is in interrupt context, use workqueues for processing.
> 
> Similar to CXL CPER event handling, use kfifo to handle errors as it
> simplifies queue processing by providing lock free fifo operations.
> 
> Add the ability for the CXL sub-system to register a workqueue to
> process CXL CPER protocol errors.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> ---

This patch confuses me. The plumbing to route CXL component error
records back to the cxl_pci driver was motivated by the driver having a
significant amount of context about component state and code to handle
OS first reporting of component errors from the device mailbox.

Protocol errors are different. They implicate various ports where the
cxl_pci driver may not have any additional information to add.

I feel like this patch makes more sense after CXL protocol errors become
a first class citizen in the core, and that generic CXL protocol error
tracing lives in the core, not a cxl_pci driver callback.

So, similar to how aer_recover_queue() traces all PCIe protocol errors
and optionally lets endpoint drivers recover the link via
pcie_do_recovery(), a cxl_recover_queue() is needed. That would be the
place to land general CXL protocol error prints and optionally call back
into drivers to add more device specific color if necessary.

I am ok with the CXL core centralizing all protocol error processing
like the built-in PCI core, but the generic CXL memory expander driver,
cxl_pci, is the wrong place to handle system wide protocol errors across
all device types.

I expect this is new infrastructure that we will get from Terry's
patches, but please do challenge me if you think I am missing something.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 6/6] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
  2025-01-23  8:44 ` [PATCH v6 6/6] cxl/pci: Add trace logging for CXL PCIe Port RAS errors Smita Koralahalli
  2025-01-24 16:36   ` Ira Weiny
  2025-02-05 20:01   ` Gregory Price
@ 2025-02-05 23:06   ` Dan Williams
  2 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2025-02-05 23:06 UTC (permalink / raw)
  To: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Smita Koralahalli

Smita Koralahalli wrote:
> The CXL drivers use kernel trace functions for logging endpoint and
> Restricted CXL host (RCH) Downstream Port RAS errors. Similar functionality
> is required for CXL Root Ports, CXL Downstream Switch Ports, and CXL
> Upstream Switch Ports.
> 
> Introduce trace logging functions for both RAS correctable and
> uncorrectable errors specific to CXL PCIe Ports. Use them to trace
> FW-First Protocol errors.
> 
> Co-developed-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>

I think this functionality moves to a central / non-cxl_pci location
once we have a formal CXL AER path established.

So, for this series you can add my Reviewed-by: to patches 1-4, but I am
not yet convinced cxl_pci should play a role in emitting protocol errors
compared to a centralized place in the core.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 2/6] efi/cper, cxl: Make definitions and structures global
  2025-02-05 19:16   ` Gregory Price
@ 2025-02-06 10:54     ` Jonathan Cameron
  2025-02-06 16:14       ` Gregory Price
  0 siblings, 1 reply; 39+ messages in thread
From: Jonathan Cameron @ 2025-02-06 10:54 UTC (permalink / raw)
  To: Gregory Price
  Cc: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl,
	Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Yazen Ghannam, Terry Bowman

On Wed, 5 Feb 2025 14:16:54 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Thu, Jan 23, 2025 at 08:44:17AM +0000, Smita Koralahalli wrote:
> > In preparation to add tracepoint support, move protocol error UUID
> > definition to a common location, Also, make struct CXL RAS capability,
> > cxl_cper_sec_prot_err and CPER validation flags global for use across
> > different modules.
> > 
> > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>  
> 
> Reveiwed-by: Gregory Price <gourry@gourry.net>

Reviewed...

If Dave picks this up will need to tweak that.
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 2/6] efi/cper, cxl: Make definitions and structures global
  2025-02-06 10:54     ` Jonathan Cameron
@ 2025-02-06 16:14       ` Gregory Price
  2025-02-06 17:14         ` Konstantin Ryabitsev
  0 siblings, 1 reply; 39+ messages in thread
From: Gregory Price @ 2025-02-06 16:14 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl,
	Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Yazen Ghannam, Terry Bowman

On Thu, Feb 06, 2025 at 10:54:03AM +0000, Jonathan Cameron wrote:
> On Wed, 5 Feb 2025 14:16:54 -0500
> Gregory Price <gourry@gourry.net> wrote:
> 
> > On Thu, Jan 23, 2025 at 08:44:17AM +0000, Smita Koralahalli wrote:
> > > In preparation to add tracepoint support, move protocol error UUID
> > > definition to a common location, Also, make struct CXL RAS capability,
> > > cxl_cper_sec_prot_err and CPER validation flags global for use across
> > > different modules.
> > > 
> > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> > > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> > > Reviewed-by: Dave Jiang <dave.jiang@intel.com>  
> > 
> > Reveiwed-by: Gregory Price <gourry@gourry.net>
> 
> Reviewed...
> 
> If Dave picks this up will need to tweak that.
> > 
> 

that egg on your face when someone has to spellcheck your tags
:facepalm:

trivia: The last upstream `Reveiwed` patch was 2017

Funny enough b4 takes the tag as-is.

~Gregory

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 2/6] efi/cper, cxl: Make definitions and structures global
  2025-02-06 16:14       ` Gregory Price
@ 2025-02-06 17:14         ` Konstantin Ryabitsev
  2025-02-06 17:32           ` Gregory Price
  0 siblings, 1 reply; 39+ messages in thread
From: Konstantin Ryabitsev @ 2025-02-06 17:14 UTC (permalink / raw)
  To: Gregory Price
  Cc: Jonathan Cameron, Smita Koralahalli, linux-efi, linux-kernel,
	linux-cxl, Ard Biesheuvel, Alison Schofield, Vishal Verma,
	Ira Weiny, Dan Williams, Yazen Ghannam, Terry Bowman

On Thu, Feb 06, 2025 at 11:14:09AM -0500, Gregory Price wrote:
> > Reviewed...
> > 
> > If Dave picks this up will need to tweak that.
> > > 
> > 
> 
> that egg on your face when someone has to spellcheck your tags
> :facepalm:
> 
> trivia: The last upstream `Reveiwed` patch was 2017
> 
> Funny enough b4 takes the tag as-is.

Yeah, we can't possibly keep track of all the tags people use, with some
projects getting creative with "Reviewed-and-edited-by:" or similar combined
trailers. So, we just blanket accept any person-tags (that have a "Person Name
<adddress@example.com>" format).

-K

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 2/6] efi/cper, cxl: Make definitions and structures global
  2025-02-06 17:14         ` Konstantin Ryabitsev
@ 2025-02-06 17:32           ` Gregory Price
  0 siblings, 0 replies; 39+ messages in thread
From: Gregory Price @ 2025-02-06 17:32 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Jonathan Cameron, Smita Koralahalli, linux-efi, linux-kernel,
	linux-cxl, Ard Biesheuvel, Alison Schofield, Vishal Verma,
	Ira Weiny, Dan Williams, Yazen Ghannam, Terry Bowman

On Thu, Feb 06, 2025 at 12:14:03PM -0500, Konstantin Ryabitsev wrote:
> On Thu, Feb 06, 2025 at 11:14:09AM -0500, Gregory Price wrote:
> > trivia: The last upstream `Reveiwed` patch was 2017
> > 
> > Funny enough b4 takes the tag as-is.
> 
> Yeah, we can't possibly keep track of all the tags people use, with some
> projects getting creative with "Reviewed-and-edited-by:" or similar combined
> trailers. So, we just blanket accept any person-tags (that have a "Person Name
> <adddress@example.com>" format).
> 
> -K

Wasn't suggesting a change, I just found it kinda funny.

Should suggest tag-search on lwn.net to find all the fun typos.

~Gregory

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors
  2025-01-23  8:44 [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors Smita Koralahalli
                   ` (6 preceding siblings ...)
  2025-02-03 17:09 ` [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors Dave Jiang
@ 2025-02-06 18:38 ` Dave Jiang
  7 siblings, 0 replies; 39+ messages in thread
From: Dave Jiang @ 2025-02-06 18:38 UTC (permalink / raw)
  To: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Luck, Tony



On 1/23/25 1:44 AM, Smita Koralahalli wrote:
> This patchset adds logging support for CXL CPER endpoint and port protocol
> errors.
> 
> The first 3 patches update the existing codebase to support CXL CPER
> Protocol error reporting.
> 
> The last 3 patches introduce recognizing and reporting CXL CPER Protocol
> errors.

Patches 1-4 applied to cxl-next. I fixed up Gregory's review tag. :)
Patches 5 and 6 needs to address comments raised by Dan.

> 
> Link to v5:
> https://lore.kernel.org/linux-cxl/20250114120427.149260-1-Smita.KoralahalliChannabasappa@amd.com
> 
> Changes in v5 -> v6:
> [Dave, Jonathan, Ira]: Reviewed-by tags.
> [Dave]: Check for cxlds before assigning fe.
> Merge one of the patches (Port error trace logging) from Terry's Port
> error handling.
> Rename host -> parent.
> 
> Changes in v4 -> v5:
> [Dave]: Reviewed-by tags.
> [Jonathan]: Remove blank line.
> [Jonathan, Ira]: Change CXL -> "CXL".
> [Ira]: Fix build error for CONFIG_ACPI_APEI_PCIEAER.
> 
> Changes in v3 -> v4:
> [Ira]: Use memcpy() for RAS Cap struct.
> [Jonathan]: Commit description edits.
> [Jonathan]: Use separate work registration functions for protocol and
> component errors.
> [Jonathan, Ira]: Replace flags with separate functions for port and
> device errors.
> [Jonathan]: Use goto for register and unregister calls.
> 
> Changes in v2 -> v3:
> [Dan]: Define a new workqueue for CXL CPER Protocol errors and avoid
> reusing existing workqueue which handles CXL CPER events.
> [Dan] Update function and struct names.
> [Ira] Don't define common function get_cxl_devstate().
> [Dan] Use switch cases rather than defining array of structures.
> [Dan] Pass the entire cxl_cper_prot_err struct for CXL subsystem.
> [Dan] Use pr_err_ratelimited().
> [Dan] Use AER_ severities directly. Don't define CXL_ severities.
> [Dan] Limit either to Device ID or Agent Info check.
> [Dan] Validate size of RAS field matches expectations.
> 
> Changes in v2 -> v1:
> [Jonathan] Refactor code for trace support. Rename get_cxl_dev()
> to get_cxl_devstate().
> [Jonathan] Cleanups for get_cxl_devstate().
> [Alison, Jonathan]: Define array of structures for Device ID and Serial
> number comparison.
> [Dave] p_err -> rec/p_rec.
> [Jonathan] Remove pr_warn.
> 
> Smita Koralahalli (6):
>   efi/cper, cxl: Prefix protocol error struct and function names with
>     cxl_
>   efi/cper, cxl: Make definitions and structures global
>   efi/cper, cxl: Remove cper_cxl.h
>   acpi/ghes, cper: Recognize and cache CXL Protocol errors
>   acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors
>   cxl/pci: Add trace logging for CXL PCIe Port RAS errors
> 
>  drivers/acpi/apei/ghes.c        | 103 ++++++++++++++++++++++++++++++++
>  drivers/cxl/core/pci.c          |  62 +++++++++++++++++++
>  drivers/cxl/core/trace.h        |  47 +++++++++++++++
>  drivers/cxl/cxlpci.h            |   9 +++
>  drivers/cxl/pci.c               |  59 +++++++++++++++++-
>  drivers/firmware/efi/cper.c     |   6 +-
>  drivers/firmware/efi/cper_cxl.c |  39 +-----------
>  drivers/firmware/efi/cper_cxl.h |  66 --------------------
>  include/cxl/event.h             | 101 +++++++++++++++++++++++++++++++
>  include/linux/cper.h            |   8 +++
>  10 files changed, 394 insertions(+), 106 deletions(-)
>  delete mode 100644 drivers/firmware/efi/cper_cxl.h
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 5/6] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors
  2025-02-05 22:58   ` Dan Williams
@ 2025-02-12 20:57     ` Koralahalli Channabasappa, Smita
  0 siblings, 0 replies; 39+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-02-12 20:57 UTC (permalink / raw)
  To: Dan Williams, linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman

Hi Dan,

On 2/5/2025 2:58 PM, Dan Williams wrote:
> Smita Koralahalli wrote:
>> When PCIe AER is in FW-First, OS should process CXL Protocol errors from
>> CPER records. Introduce support for handling and logging CXL Protocol
>> errors.
>>
>> The defined trace events cxl_aer_uncorrectable_error and
>> cxl_aer_correctable_error trace native CXL AER endpoint errors. Reuse them
>> to trace FW-First Protocol errors.
>>
>> Since the CXL code is required to be called from process context and
>> GHES is in interrupt context, use workqueues for processing.
>>
>> Similar to CXL CPER event handling, use kfifo to handle errors as it
>> simplifies queue processing by providing lock free fifo operations.
>>
>> Add the ability for the CXL sub-system to register a workqueue to
>> process CXL CPER protocol errors.
>>
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>> ---
> 
> This patch confuses me. The plumbing to route CXL component error
> records back to the cxl_pci driver was motivated by the driver having a
> significant amount of context about component state and code to handle
> OS first reporting of component errors from the device mailbox.
> 
> Protocol errors are different. They implicate various ports where the
> cxl_pci driver may not have any additional information to add.
> 
> I feel like this patch makes more sense after CXL protocol errors become
> a first class citizen in the core, and that generic CXL protocol error
> tracing lives in the core, not a cxl_pci driver callback.
> 
> So, similar to how aer_recover_queue() traces all PCIe protocol errors
> and optionally lets endpoint drivers recover the link via
> pcie_do_recovery(), a cxl_recover_queue() is needed. That would be the
> place to land general CXL protocol error prints and optionally call back
> into drivers to add more device specific color if necessary.
> 
> I am ok with the CXL core centralizing all protocol error processing
> like the built-in PCI core, but the generic CXL memory expander driver,
> cxl_pci, is the wrong place to handle system wide protocol errors across
> all device types.
> 
> I expect this is new infrastructure that we will get from Terry's
> patches, but please do challenge me if you think I am missing something.

I agree. I don't have any specific reason to place all of this in 
cxl_pci driver. Given that CXL protocol errors affect multiple ports and 
should be handled centrally, should their processing be placed in 
pcie/aer.c or pcie/err.c?

Or would it be better to introduce a new pcie/cxl_err.c file to handle 
all CXL protocol errors separately?

Additionally, since the patch involves calling CXL tracing routines, 
some integration with the CXL driver is required. Would you suggest 
handling this plumbing within the PCIe error handling core, or should 
there be a specific callback mechanism into the CXL driver?

Thanks
Smita


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v6 5/6] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors
  2025-02-03 19:03   ` Luck, Tony
@ 2025-02-12 21:04     ` Koralahalli Channabasappa, Smita
  0 siblings, 0 replies; 39+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-02-12 21:04 UTC (permalink / raw)
  To: Luck, Tony
  Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman

On 2/3/2025 11:03 AM, Luck, Tony wrote:
> On Thu, Jan 23, 2025 at 08:44:20AM +0000, Smita Koralahalli wrote:
>> When PCIe AER is in FW-First, OS should process CXL Protocol errors from
>> CPER records. Introduce support for handling and logging CXL Protocol
>> errors.
>>
>> The defined trace events cxl_aer_uncorrectable_error and
>> cxl_aer_correctable_error trace native CXL AER endpoint errors. Reuse them
>> to trace FW-First Protocol errors.
>>
>> Since the CXL code is required to be called from process context and
>> GHES is in interrupt context, use workqueues for processing.
>>
>> Similar to CXL CPER event handling, use kfifo to handle errors as it
>> simplifies queue processing by providing lock free fifo operations.
>>
>> Add the ability for the CXL sub-system to register a workqueue to
>> process CXL CPER protocol errors.
>>
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>> ---
>>   drivers/acpi/apei/ghes.c | 49 ++++++++++++++++++++++++++++++++++++++++
>>   drivers/cxl/core/pci.c   | 36 +++++++++++++++++++++++++++++
>>   drivers/cxl/cxlpci.h     |  5 ++++
>>   drivers/cxl/pci.c        | 46 ++++++++++++++++++++++++++++++++++++-
>>   include/cxl/event.h      | 15 ++++++++++++
>>   5 files changed, 150 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 4d725d988c43..289e365f84b2 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -674,6 +674,15 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
>>   	schedule_work(&entry->work);
>>   }
>>   
>> +/* Room for 8 entries */
> 
> Any science behind the choice of "8" here? This comment is merely
> stating what the #define is used for, not why 8 was chosen.
> 

The choice of "8" was arbitrary and not based on a specific rationale. 
If there are better heuristics or considerations for determining the 
optimal number of entries, I’d appreciate any suggestions.

Thanks
Smita

^ permalink raw reply	[flat|nested] 39+ messages in thread

* "invalid agent type: 1" in acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-01-23  8:44 ` [PATCH v6 4/6] acpi/ghes, cper: Recognize and cache CXL Protocol errors Smita Koralahalli
  2025-02-03 18:59   ` Luck, Tony
  2025-02-05 22:21   ` Dan Williams
@ 2025-07-22 19:24   ` Marc Herbert
  2025-07-23  7:13     ` Marc Herbert
  2025-07-28 16:25     ` Koralahalli Channabasappa, Smita
  2 siblings, 2 replies; 39+ messages in thread
From: Marc Herbert @ 2025-07-22 19:24 UTC (permalink / raw)
  To: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Dave Jiang, tony.luck, Gregory Price

Hi Smita,

  The code below triggers the error "invalid agent type: 1" in Intel
validation (internal issue 15018133056)

It's not clear to anyone we asked why you did not include RCH_DP in
the `switch (prot_err->agent_type)` in cxl_cper_post_prot_err() below.

I can see how RCH_DP is special in cxl_cper_PRINT_prot_err() and I can
even understand (despite my near-zero CPER knowledge) some of the
special cases there. But in cxl_cper_post_prot_err() here, it's not
clear why RCH_DP would be rejected. Could this be an oversight? If not,
a comment with a short explanation would not hurt.

Marc

PS: the newer cxl_cper_post_prot_err() code is longer and does
something with `wd`. That's irrelevant for this test case since the
function errors and returns earlier anyway.


On 2025-01-23 00:44, Smita Koralahalli wrote:
> Add support in GHES to detect and process CXL CPER Protocol errors, as
> defined in UEFI v2.10, section N.2.13.
> 
> Define struct cxl_cper_prot_err_work_data to cache CXL protocol error
> information, including RAS capabilities and severity, for further
> handling.
> 
> These cached CXL CPER records will later be processed by workqueues
> within the CXL subsystem.
> 
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  drivers/acpi/apei/ghes.c | 54 ++++++++++++++++++++++++++++++++++++++++
>  include/cxl/event.h      |  6 +++++
>  2 files changed, 60 insertions(+)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index b72772494655..4d725d988c43 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -674,6 +674,56 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
>  	schedule_work(&entry->work);
>  }
>  
> +static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
> +				   int severity)
> +{
> +#ifdef CONFIG_ACPI_APEI_PCIEAER
> +	struct cxl_cper_prot_err_work_data wd;
> +	u8 *dvsec_start, *cap_start;
> +
> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_AGENT_ADDRESS)) {
> +		pr_err_ratelimited("CXL CPER invalid agent type\n");
> +		return;
> +	}
> +
> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) {
> +		pr_err_ratelimited("CXL CPER invalid protocol error log\n");
> +		return;
> +	}
> +
> +	if (prot_err->err_len != sizeof(struct cxl_ras_capability_regs)) {
> +		pr_err_ratelimited("CXL CPER invalid RAS Cap size (%u)\n",
> +				   prot_err->err_len);
> +		return;
> +	}
> +
> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
> +		pr_warn(FW_WARN "CXL CPER no device serial number\n");
> +
> +	switch (prot_err->agent_type) {
> +	case RCD:
> +	case DEVICE:
> +	case LD:
> +	case FMLD:
> +	case RP:
> +	case DSP:
> +	case USP:
> +		memcpy(&wd.prot_err, prot_err, sizeof(wd.prot_err));
> +
> +		dvsec_start = (u8 *)(prot_err + 1);
> +		cap_start = dvsec_start + prot_err->dvsec_len;
> +
> +		memcpy(&wd.ras_cap, cap_start, sizeof(wd.ras_cap));
> +		wd.severity = cper_severity_to_aer(severity);
> +		break;
> +	default:
> +		pr_err_ratelimited("CXL CPER invalid agent type: %d\n",
> +				   prot_err->agent_type);
> +		return;
> +	}
> +#endif
> +}
> +
>  /* Room for 8 entries for each of the 4 event log queues */
>  #define CXL_CPER_FIFO_DEPTH 32
>  DEFINE_KFIFO(cxl_cper_fifo, struct cxl_cper_work_data, CXL_CPER_FIFO_DEPTH);
> @@ -777,6 +827,10 @@ static bool ghes_do_proc(struct ghes *ghes,
>  		}
>  		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>  			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
> +		} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
> +			struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
> +
> +			cxl_cper_post_prot_err(prot_err, gdata->error_severity);
>  		} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
>  			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
>  
> diff --git a/include/cxl/event.h b/include/cxl/event.h
> index 66d85fc87701..ee1c3dec62fa 100644
> --- a/include/cxl/event.h
> +++ b/include/cxl/event.h
> @@ -232,6 +232,12 @@ struct cxl_ras_capability_regs {
>  	u32 header_log[16];
>  };
>  
> +struct cxl_cper_prot_err_work_data {
> +	struct cxl_cper_sec_prot_err prot_err;
> +	struct cxl_ras_capability_regs ras_cap;
> +	int severity;
> +};
> +
>  #ifdef CONFIG_ACPI_APEI_GHES
>  int cxl_cper_register_work(struct work_struct *work);
>  int cxl_cper_unregister_work(struct work_struct *work);

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: "invalid agent type: 1" in acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-07-22 19:24   ` "invalid agent type: 1" in " Marc Herbert
@ 2025-07-23  7:13     ` Marc Herbert
  2025-07-24 14:49       ` Fabio M. De Francesco
  2025-07-28 15:01       ` dan.j.williams
  2025-07-28 16:25     ` Koralahalli Channabasappa, Smita
  1 sibling, 2 replies; 39+ messages in thread
From: Marc Herbert @ 2025-07-23  7:13 UTC (permalink / raw)
  To: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Dave Jiang, tony.luck, Gregory Price



On 2025-07-22 12:24, Marc Herbert wrote:
> Hi Smita,
> 
>   The code below triggers the error "invalid agent type: 1" in Intel
> validation (internal issue 15018133056)

The same test case also triggers the other, warning message "CXL CPER no
device serial number".

I heard that "device" serial numbers are only for... devices and that
even then it's not always mandatory. So maybe that other message should
be downgraded from warning to the "info" level?

Marc


> On 2025-01-23 00:44, Smita Koralahalli wrote:
>> Add support in GHES to detect and process CXL CPER Protocol errors, as
>> defined in UEFI v2.10, section N.2.13.
>>
>> Define struct cxl_cper_prot_err_work_data to cache CXL protocol error
>> information, including RAS capabilities and severity, for further
>> handling.
>>
>> These cached CXL CPER records will later be processed by workqueues
>> within the CXL subsystem.
>>
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>> ---
>>  drivers/acpi/apei/ghes.c | 54 ++++++++++++++++++++++++++++++++++++++++
>>  include/cxl/event.h      |  6 +++++
>>  2 files changed, 60 insertions(+)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index b72772494655..4d725d988c43 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -674,6 +674,56 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
>>  	schedule_work(&entry->work);
>>  }
>>  
>> +static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
>> +				   int severity)
>> +{
>> +#ifdef CONFIG_ACPI_APEI_PCIEAER
>> +	struct cxl_cper_prot_err_work_data wd;
>> +	u8 *dvsec_start, *cap_start;
>> +
>> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_AGENT_ADDRESS)) {
>> +		pr_err_ratelimited("CXL CPER invalid agent type\n");
>> +		return;
>> +	}
>> +
>> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) {
>> +		pr_err_ratelimited("CXL CPER invalid protocol error log\n");
>> +		return;
>> +	}
>> +
>> +	if (prot_err->err_len != sizeof(struct cxl_ras_capability_regs)) {
>> +		pr_err_ratelimited("CXL CPER invalid RAS Cap size (%u)\n",
>> +				   prot_err->err_len);
>> +		return;
>> +	}
>> +
>> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
>> +		pr_warn(FW_WARN "CXL CPER no device serial number\n");
>> +
>> +	switch (prot_err->agent_type) {
>> +	case RCD:
>> +	case DEVICE:
>> +	case LD:
>> +	case FMLD:
>> +	case RP:
>> +	case DSP:
>> +	case USP:
>> +		memcpy(&wd.prot_err, prot_err, sizeof(wd.prot_err));
>> +
>> +		dvsec_start = (u8 *)(prot_err + 1);
>> +		cap_start = dvsec_start + prot_err->dvsec_len;
>> +
>> +		memcpy(&wd.ras_cap, cap_start, sizeof(wd.ras_cap));
>> +		wd.severity = cper_severity_to_aer(severity);
>> +		break;
>> +	default:
>> +		pr_err_ratelimited("CXL CPER invalid agent type: %d\n",
>> +				   prot_err->agent_type);
>> +		return;
>> +	}
>> +#endif
>> +}
>> +
>>  /* Room for 8 entries for each of the 4 event log queues */
>>  #define CXL_CPER_FIFO_DEPTH 32
>>  DEFINE_KFIFO(cxl_cper_fifo, struct cxl_cper_work_data, CXL_CPER_FIFO_DEPTH);
>> @@ -777,6 +827,10 @@ static bool ghes_do_proc(struct ghes *ghes,
>>  		}
>>  		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>>  			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
>> +		} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
>> +			struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
>> +
>> +			cxl_cper_post_prot_err(prot_err, gdata->error_severity);
>>  		} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
>>  			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
>>  
>> diff --git a/include/cxl/event.h b/include/cxl/event.h
>> index 66d85fc87701..ee1c3dec62fa 100644
>> --- a/include/cxl/event.h
>> +++ b/include/cxl/event.h
>> @@ -232,6 +232,12 @@ struct cxl_ras_capability_regs {
>>  	u32 header_log[16];
>>  };
>>  
>> +struct cxl_cper_prot_err_work_data {
>> +	struct cxl_cper_sec_prot_err prot_err;
>> +	struct cxl_ras_capability_regs ras_cap;
>> +	int severity;
>> +};
>> +
>>  #ifdef CONFIG_ACPI_APEI_GHES
>>  int cxl_cper_register_work(struct work_struct *work);
>>  int cxl_cper_unregister_work(struct work_struct *work);

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: "invalid agent type: 1" in acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-07-23  7:13     ` Marc Herbert
@ 2025-07-24 14:49       ` Fabio M. De Francesco
  2025-07-25 11:04         ` Jonathan Cameron
  2025-07-28 15:01       ` dan.j.williams
  1 sibling, 1 reply; 39+ messages in thread
From: Fabio M. De Francesco @ 2025-07-24 14:49 UTC (permalink / raw)
  To: Smita Koralahalli, Marc Herbert
  Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel,
	Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
	Jonathan Cameron, Yazen Ghannam, Terry Bowman, Dave Jiang,
	tony.luck, Gregory Price

Hi Marc, Smita,

On Wednesday, July 23, 2025 9:13:34 AM Central European Summer Time Marc Herbert wrote:
> 
> On 2025-07-22 12:24, Marc Herbert wrote:
> > Hi Smita,
> > 
> >   The code below triggers the error "invalid agent type: 1" in Intel
> > validation (internal issue 15018133056)
> 
> The same test case also triggers the other, warning message "CXL CPER no
> device serial number".
> 
> I heard that "device" serial numbers are only for... devices and that
> even then it's not always mandatory. So maybe that other message should
> be downgraded from warning to the "info" level?
> 
> Marc
>

[skip]
 
> >> +
> >> +	if (prot_err->err_len != sizeof(struct cxl_ras_capability_regs)) {
> >> +		pr_err_ratelimited("CXL CPER invalid RAS Cap size (%u)\n",
> >> +				   prot_err->err_len);
> >> +		return;
> >> +	}
> >> +
> >> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
> >> +		pr_warn(FW_WARN "CXL CPER no device serial number\n");
> >> +

Maybe this test should be written on the line of the following snippet taken 
out from "ACPI: extlog: Trace CPER CXL Protocol Error Section".[1]

+
+	if ((prot_err->agent_type == RCD || prot_err->agent_type == DEVICE ||
+	     prot_err->agent_type == LD || prot_err->agent_type == FMLD) &&
+	    !(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
+		pr_warn_ratelimited(FW_WARN
+				    "CXL CPER no device serial number\n");
+

Thanks,

Fabio

[1] https://lore.kernel.org/linux-cxl/20250623145453.1046660-4-fabio.m.de.francesco@linux.intel.com/




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: "invalid agent type: 1" in acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-07-24 14:49       ` Fabio M. De Francesco
@ 2025-07-25 11:04         ` Jonathan Cameron
  0 siblings, 0 replies; 39+ messages in thread
From: Jonathan Cameron @ 2025-07-25 11:04 UTC (permalink / raw)
  To: Fabio M. De Francesco
  Cc: Smita Koralahalli, Marc Herbert, linux-efi, linux-kernel,
	linux-cxl, Ard Biesheuvel, Alison Schofield, Vishal Verma,
	Ira Weiny, Dan Williams, Yazen Ghannam, Terry Bowman, Dave Jiang,
	tony.luck, Gregory Price

On Thu, 24 Jul 2025 16:49:00 +0200
"Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> wrote:

> Hi Marc, Smita,
> 
> On Wednesday, July 23, 2025 9:13:34 AM Central European Summer Time Marc Herbert wrote:
> > 
> > On 2025-07-22 12:24, Marc Herbert wrote:  
> > > Hi Smita,
> > > 
> > >   The code below triggers the error "invalid agent type: 1" in Intel
> > > validation (internal issue 15018133056)  
> > 
> > The same test case also triggers the other, warning message "CXL CPER no
> > device serial number".
> > 
> > I heard that "device" serial numbers are only for... devices and that
> > even then it's not always mandatory. So maybe that other message should
> > be downgraded from warning to the "info" level?
> > 
> > Marc
> >  
> 
> [skip]
>  
> > >> +
> > >> +	if (prot_err->err_len != sizeof(struct cxl_ras_capability_regs)) {
> > >> +		pr_err_ratelimited("CXL CPER invalid RAS Cap size (%u)\n",
> > >> +				   prot_err->err_len);
> > >> +		return;
> > >> +	}
> > >> +
> > >> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
> > >> +		pr_warn(FW_WARN "CXL CPER no device serial number\n");
> > >> +  
> 
> Maybe this test should be written on the line of the following snippet taken 
> out from "ACPI: extlog: Trace CPER CXL Protocol Error Section".[1]
> 
> +
> +	if ((prot_err->agent_type == RCD || prot_err->agent_type == DEVICE ||
> +	     prot_err->agent_type == LD || prot_err->agent_type == FMLD) &&
> +	    !(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
> +		pr_warn_ratelimited(FW_WARN
> +				    "CXL CPER no device serial number\n");

They are mandatory for CXL type 3 class code devices (and so the LDs here I think)
Device and RCD might not be type 3 class code so it may be optional?

> +
> 
> Thanks,
> 
> Fabio
> 
> [1] https://lore.kernel.org/linux-cxl/20250623145453.1046660-4-fabio.m.de.francesco@linux.intel.com/
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: "invalid agent type: 1" in acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-07-23  7:13     ` Marc Herbert
  2025-07-24 14:49       ` Fabio M. De Francesco
@ 2025-07-28 15:01       ` dan.j.williams
  1 sibling, 0 replies; 39+ messages in thread
From: dan.j.williams @ 2025-07-28 15:01 UTC (permalink / raw)
  To: Marc Herbert, Smita Koralahalli, linux-efi, linux-kernel,
	linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Dave Jiang, tony.luck, Gregory Price

Marc Herbert wrote:
> On 2025-07-22 12:24, Marc Herbert wrote:
> > Hi Smita,
> > 
> >   The code below triggers the error "invalid agent type: 1" in Intel
> > validation (internal issue 15018133056)
> 
> The same test case also triggers the other, warning message "CXL CPER no
> device serial number".
> 
> I heard that "device" serial numbers are only for... devices and that
> even then it's not always mandatory. So maybe that other message should
> be downgraded from warning to the "info" level?

Agree. Care to send a fixup series?

This is a benefit / endorsement of your test updates [1] to catch
unexpected warn/err.

[1]: http://lore.kernel.org/20250611235256.3866724-1-marc.herbert@linux.intel.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: "invalid agent type: 1" in acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-07-22 19:24   ` "invalid agent type: 1" in " Marc Herbert
  2025-07-23  7:13     ` Marc Herbert
@ 2025-07-28 16:25     ` Koralahalli Channabasappa, Smita
  2025-07-29  5:41       ` Marc Herbert
  1 sibling, 1 reply; 39+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-07-28 16:25 UTC (permalink / raw)
  To: Marc Herbert, linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Dave Jiang, tony.luck, Gregory Price

Hi Marc,

On 7/22/2025 12:24 PM, Marc Herbert wrote:
> Hi Smita,
> 
>    The code below triggers the error "invalid agent type: 1" in Intel
> validation (internal issue 15018133056)
> 
> It's not clear to anyone we asked why you did not include RCH_DP in
> the `switch (prot_err->agent_type)` in cxl_cper_post_prot_err() below.
> 
> I can see how RCH_DP is special in cxl_cper_PRINT_prot_err() and I can
> even understand (despite my near-zero CPER knowledge) some of the
> special cases there. But in cxl_cper_post_prot_err() here, it's not
> clear why RCH_DP would be rejected. Could this be an oversight? If not,
> a comment with a short explanation would not hurt.
> 
> Marc
> 
> PS: the newer cxl_cper_post_prot_err() code is longer and does
> something with `wd`. That's irrelevant for this test case since the
> function errors and returns earlier anyway.

You're right. RCH_DP was excluded because it doesn’t report a valid SBDF 
in the CPER record. Instead, it provides only the RCRB base address.

I haven't thoroughly investigated whether SBDF can be reliably derived 
from the RCRB base. There might be a platform-specific mechanism for 
that, but at the time, it seemed non-trivial to implement. Introducing 
additional infrastructure solely to support RCH_DP felt like it was 
adding more complexity.

I agree that a brief comment explaining this rationale would help. I'm 
okay if you plan to include a fixup for this along with the one for the 
device serial number.

Thanks
Smita

> 
> 
> On 2025-01-23 00:44, Smita Koralahalli wrote:
>> Add support in GHES to detect and process CXL CPER Protocol errors, as
>> defined in UEFI v2.10, section N.2.13.
>>
>> Define struct cxl_cper_prot_err_work_data to cache CXL protocol error
>> information, including RAS capabilities and severity, for further
>> handling.
>>
>> These cached CXL CPER records will later be processed by workqueues
>> within the CXL subsystem.
>>
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>> ---
>>   drivers/acpi/apei/ghes.c | 54 ++++++++++++++++++++++++++++++++++++++++
>>   include/cxl/event.h      |  6 +++++
>>   2 files changed, 60 insertions(+)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index b72772494655..4d725d988c43 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -674,6 +674,56 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
>>   	schedule_work(&entry->work);
>>   }
>>   
>> +static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
>> +				   int severity)
>> +{
>> +#ifdef CONFIG_ACPI_APEI_PCIEAER
>> +	struct cxl_cper_prot_err_work_data wd;
>> +	u8 *dvsec_start, *cap_start;
>> +
>> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_AGENT_ADDRESS)) {
>> +		pr_err_ratelimited("CXL CPER invalid agent type\n");
>> +		return;
>> +	}
>> +
>> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) {
>> +		pr_err_ratelimited("CXL CPER invalid protocol error log\n");
>> +		return;
>> +	}
>> +
>> +	if (prot_err->err_len != sizeof(struct cxl_ras_capability_regs)) {
>> +		pr_err_ratelimited("CXL CPER invalid RAS Cap size (%u)\n",
>> +				   prot_err->err_len);
>> +		return;
>> +	}
>> +
>> +	if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
>> +		pr_warn(FW_WARN "CXL CPER no device serial number\n");
>> +
>> +	switch (prot_err->agent_type) {
>> +	case RCD:
>> +	case DEVICE:
>> +	case LD:
>> +	case FMLD:
>> +	case RP:
>> +	case DSP:
>> +	case USP:
>> +		memcpy(&wd.prot_err, prot_err, sizeof(wd.prot_err));
>> +
>> +		dvsec_start = (u8 *)(prot_err + 1);
>> +		cap_start = dvsec_start + prot_err->dvsec_len;
>> +
>> +		memcpy(&wd.ras_cap, cap_start, sizeof(wd.ras_cap));
>> +		wd.severity = cper_severity_to_aer(severity);
>> +		break;
>> +	default:
>> +		pr_err_ratelimited("CXL CPER invalid agent type: %d\n",
>> +				   prot_err->agent_type);
>> +		return;
>> +	}
>> +#endif
>> +}
>> +
>>   /* Room for 8 entries for each of the 4 event log queues */
>>   #define CXL_CPER_FIFO_DEPTH 32
>>   DEFINE_KFIFO(cxl_cper_fifo, struct cxl_cper_work_data, CXL_CPER_FIFO_DEPTH);
>> @@ -777,6 +827,10 @@ static bool ghes_do_proc(struct ghes *ghes,
>>   		}
>>   		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>>   			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
>> +		} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
>> +			struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
>> +
>> +			cxl_cper_post_prot_err(prot_err, gdata->error_severity);
>>   		} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
>>   			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
>>   
>> diff --git a/include/cxl/event.h b/include/cxl/event.h
>> index 66d85fc87701..ee1c3dec62fa 100644
>> --- a/include/cxl/event.h
>> +++ b/include/cxl/event.h
>> @@ -232,6 +232,12 @@ struct cxl_ras_capability_regs {
>>   	u32 header_log[16];
>>   };
>>   
>> +struct cxl_cper_prot_err_work_data {
>> +	struct cxl_cper_sec_prot_err prot_err;
>> +	struct cxl_ras_capability_regs ras_cap;
>> +	int severity;
>> +};
>> +
>>   #ifdef CONFIG_ACPI_APEI_GHES
>>   int cxl_cper_register_work(struct work_struct *work);
>>   int cxl_cper_unregister_work(struct work_struct *work);
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: "invalid agent type: 1" in acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-07-28 16:25     ` Koralahalli Channabasappa, Smita
@ 2025-07-29  5:41       ` Marc Herbert
  2025-07-29 15:52         ` Koralahalli Channabasappa, Smita
  0 siblings, 1 reply; 39+ messages in thread
From: Marc Herbert @ 2025-07-29  5:41 UTC (permalink / raw)
  To: Koralahalli Channabasappa, Smita, linux-efi, linux-kernel,
	linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Dave Jiang, tony.luck, Gregory Price

On 2025-07-28 09:25, Koralahalli Channabasappa, Smita wrote:

> On 7/22/2025 12:24 PM, Marc Herbert wrote:

>>    The code below triggers the error "invalid agent type: 1" in Intel
>> validation (internal issue 15018133056)
>>
>> It's not clear to anyone we asked why you did not include RCH_DP in
>> the `switch (prot_err->agent_type)` in cxl_cper_post_prot_err() below.
>>
>> I can see how RCH_DP is special in cxl_cper_PRINT_prot_err() and I can
>> even understand (despite my near-zero CPER knowledge) some of the
>> special cases there. But in cxl_cper_post_prot_err() here, it's not
>> clear why RCH_DP would be rejected. Could this be an oversight? If not,
>> a comment with a short explanation would not hurt.
>>
> 
> You're right. RCH_DP was excluded because it doesn’t report a valid
> SBDF in the CPER record. Instead, it provides only the RCRB base
> address.
> 
> I haven't thoroughly investigated whether SBDF can be reliably derived
> from the RCRB base. There might be a platform-specific mechanism for
> that, but at the time, it seemed non-trivial to implement. Introducing
> additional infrastructure solely to support RCH_DP felt like it was
> adding more complexity.
> 
> I agree that a brief comment explaining this rationale would help. I'm
> okay if you plan to include a fixup for this along with the one for
> the device serial number.

If I understood you correctly, I think a different error message
would be much better than a comment. Like this?

--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -730,6 +730,9 @@ static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
 		memcpy(&wd.ras_cap, cap_start, sizeof(wd.ras_cap));
 		wd.severity = cper_severity_to_aer(severity);
 		break;
+	case RCH_DP:
+		pr_err_ratelimited("CXL CPER agent type unsupported: RCH_DP\n");
+		return;
 	default:
 		pr_err_ratelimited("CXL CPER invalid agent type: %d\n",
 				   prot_err->agent_type);

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: "invalid agent type: 1" in acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-07-29  5:41       ` Marc Herbert
@ 2025-07-29 15:52         ` Koralahalli Channabasappa, Smita
  2025-07-29 17:39           ` dan.j.williams
  0 siblings, 1 reply; 39+ messages in thread
From: Koralahalli Channabasappa, Smita @ 2025-07-29 15:52 UTC (permalink / raw)
  To: Marc Herbert, linux-efi, linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Dave Jiang, tony.luck, Gregory Price

On 7/28/2025 10:41 PM, Marc Herbert wrote:
> On 2025-07-28 09:25, Koralahalli Channabasappa, Smita wrote:
> 
>> On 7/22/2025 12:24 PM, Marc Herbert wrote:
> 
>>>     The code below triggers the error "invalid agent type: 1" in Intel
>>> validation (internal issue 15018133056)
>>>
>>> It's not clear to anyone we asked why you did not include RCH_DP in
>>> the `switch (prot_err->agent_type)` in cxl_cper_post_prot_err() below.
>>>
>>> I can see how RCH_DP is special in cxl_cper_PRINT_prot_err() and I can
>>> even understand (despite my near-zero CPER knowledge) some of the
>>> special cases there. But in cxl_cper_post_prot_err() here, it's not
>>> clear why RCH_DP would be rejected. Could this be an oversight? If not,
>>> a comment with a short explanation would not hurt.
>>>
>>
>> You're right. RCH_DP was excluded because it doesn’t report a valid
>> SBDF in the CPER record. Instead, it provides only the RCRB base
>> address.
>>
>> I haven't thoroughly investigated whether SBDF can be reliably derived
>> from the RCRB base. There might be a platform-specific mechanism for
>> that, but at the time, it seemed non-trivial to implement. Introducing
>> additional infrastructure solely to support RCH_DP felt like it was
>> adding more complexity.
>>
>> I agree that a brief comment explaining this rationale would help. I'm
>> okay if you plan to include a fixup for this along with the one for
>> the device serial number.
> 
> If I understood you correctly, I think a different error message
> would be much better than a comment. Like this?

Yeah looks good to me. Thanks for fixing this!

Thanks
Smita

> 
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -730,6 +730,9 @@ static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
>   		memcpy(&wd.ras_cap, cap_start, sizeof(wd.ras_cap));
>   		wd.severity = cper_severity_to_aer(severity);
>   		break;
> +	case RCH_DP:
> +		pr_err_ratelimited("CXL CPER agent type unsupported: RCH_DP\n");
> +		return;
>   	default:
>   		pr_err_ratelimited("CXL CPER invalid agent type: %d\n",
>   				   prot_err->agent_type);


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: "invalid agent type: 1" in acpi/ghes, cper: Recognize and cache CXL Protocol errors
  2025-07-29 15:52         ` Koralahalli Channabasappa, Smita
@ 2025-07-29 17:39           ` dan.j.williams
  0 siblings, 0 replies; 39+ messages in thread
From: dan.j.williams @ 2025-07-29 17:39 UTC (permalink / raw)
  To: Koralahalli Channabasappa, Smita, Marc Herbert, linux-efi,
	linux-kernel, linux-cxl
  Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Jonathan Cameron, Yazen Ghannam, Terry Bowman,
	Dave Jiang, tony.luck, Gregory Price

Koralahalli Channabasappa, Smita wrote:
[..]
> >> I agree that a brief comment explaining this rationale would help. I'm
> >> okay if you plan to include a fixup for this along with the one for
> >> the device serial number.
> > 
> > If I understood you correctly, I think a different error message
> > would be much better than a comment. Like this?
> 
> Yeah looks good to me. Thanks for fixing this!

Wait, I was thinking to delete these messages. The rationale being, why
continue to harass the system owner indefinetly with the fact that their
BIOS is emitting kernel unsupported CPER records? The value of that
unactionable report is look. This feels like dev_dbg_ratelimited(), or
dev_warn_once() at most.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2025-07-29 17:40 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-23  8:44 [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors Smita Koralahalli
2025-01-23  8:44 ` [PATCH v6 1/6] efi/cper, cxl: Prefix protocol error struct and function names with cxl_ Smita Koralahalli
2025-02-04  0:12   ` Fan Ni
2025-02-05 19:17   ` Gregory Price
2025-01-23  8:44 ` [PATCH v6 2/6] efi/cper, cxl: Make definitions and structures global Smita Koralahalli
2025-02-04  0:16   ` Fan Ni
2025-02-05 19:16   ` Gregory Price
2025-02-06 10:54     ` Jonathan Cameron
2025-02-06 16:14       ` Gregory Price
2025-02-06 17:14         ` Konstantin Ryabitsev
2025-02-06 17:32           ` Gregory Price
2025-01-23  8:44 ` [PATCH v6 3/6] efi/cper, cxl: Remove cper_cxl.h Smita Koralahalli
2025-02-04  0:20   ` Fan Ni
2025-02-05 19:18   ` Gregory Price
2025-01-23  8:44 ` [PATCH v6 4/6] acpi/ghes, cper: Recognize and cache CXL Protocol errors Smita Koralahalli
2025-02-03 18:59   ` Luck, Tony
2025-02-05 19:35     ` Gregory Price
2025-02-05 22:21   ` Dan Williams
2025-07-22 19:24   ` "invalid agent type: 1" in " Marc Herbert
2025-07-23  7:13     ` Marc Herbert
2025-07-24 14:49       ` Fabio M. De Francesco
2025-07-25 11:04         ` Jonathan Cameron
2025-07-28 15:01       ` dan.j.williams
2025-07-28 16:25     ` Koralahalli Channabasappa, Smita
2025-07-29  5:41       ` Marc Herbert
2025-07-29 15:52         ` Koralahalli Channabasappa, Smita
2025-07-29 17:39           ` dan.j.williams
2025-01-23  8:44 ` [PATCH v6 5/6] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors Smita Koralahalli
2025-02-03 19:03   ` Luck, Tony
2025-02-12 21:04     ` Koralahalli Channabasappa, Smita
2025-02-05 19:50   ` Gregory Price
2025-02-05 22:58   ` Dan Williams
2025-02-12 20:57     ` Koralahalli Channabasappa, Smita
2025-01-23  8:44 ` [PATCH v6 6/6] cxl/pci: Add trace logging for CXL PCIe Port RAS errors Smita Koralahalli
2025-01-24 16:36   ` Ira Weiny
2025-02-05 20:01   ` Gregory Price
2025-02-05 23:06   ` Dan Williams
2025-02-03 17:09 ` [PATCH v6 0/6] acpi/ghes, cper, cxl: Process CXL CPER Protocol errors Dave Jiang
2025-02-06 18:38 ` Dave Jiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).