* [PATCH v2 0/4] acpi/ghes, cper, cxl: Trace FW-First CXL Protocol Errors
@ 2024-01-09 3:47 Smita Koralahalli
2024-01-09 3:47 ` [PATCH v2 1/4] acpi/ghes, cxl: Create a common CXL struct to handle different CXL CPER records Smita Koralahalli
` (4 more replies)
0 siblings, 5 replies; 19+ messages in thread
From: Smita Koralahalli @ 2024-01-09 3:47 UTC (permalink / raw)
To: linux-efi, linux-kernel, linux-cxl
Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny,
Dan Williams, Jonathan Cameron, Yazen Ghannam, Smita Koralahalli
This patchset adds trace event support for FW-First Protocol Errors.
This series depends on:
https://lore.kernel.org/linux-cxl/20231220-cxl-cper-v5-0-1bb8a4ca2c7a@intel.com
Link to v1:
https://lore.kernel.org/all/20240102150933.161009-1-Smita.KoralahalliChannabasappa@amd.com
Smita Koralahalli (4):
acpi/ghes, cxl: Create a common CXL struct to handle different CXL
CPER records
efi/cper, cxl: Make definitions and structures global
acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors.
acpi/ghes, cxl/pci: Trace FW-First CXL Protocol Errors
drivers/acpi/apei/ghes.c | 22 ++++++++-
drivers/cxl/core/pci.c | 29 ++++++++++++
drivers/cxl/cxlpci.h | 3 ++
drivers/cxl/pci.c | 13 ++++--
drivers/firmware/efi/cper_cxl.c | 79 ++++++++++++++++++++++++++++-----
drivers/firmware/efi/cper_cxl.h | 7 +--
include/linux/cper.h | 4 ++
include/linux/cxl-event.h | 31 ++++++++++++-
8 files changed, 166 insertions(+), 22 deletions(-)
--
2.17.1
^ permalink raw reply [flat|nested] 19+ messages in thread* [PATCH v2 1/4] acpi/ghes, cxl: Create a common CXL struct to handle different CXL CPER records 2024-01-09 3:47 [PATCH v2 0/4] acpi/ghes, cper, cxl: Trace FW-First CXL Protocol Errors Smita Koralahalli @ 2024-01-09 3:47 ` Smita Koralahalli 2024-02-15 11:56 ` Jonathan Cameron 2024-01-09 3:47 ` [PATCH v2 2/4] efi/cper, cxl: Make definitions and structures global Smita Koralahalli ` (3 subsequent siblings) 4 siblings, 1 reply; 19+ messages in thread From: Smita Koralahalli @ 2024-01-09 3:47 UTC (permalink / raw) To: linux-efi, linux-kernel, linux-cxl Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam, Smita Koralahalli Currently defined cxl_cper_callback interface between CXL subsystem and GHES module is just confined to handling CXL Component errors only. Extend this callback to process CXL Protocol errors as well. Achieve by defining a new struct cxl_cper_event_info to include cxl_cper_event_rec and other fields of CXL protocol errors which will be defined in future patches. Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> --- v2: cxl_cper_rec_data -> cxl_cper_event_info data -> info --- drivers/acpi/apei/ghes.c | 6 +++++- drivers/cxl/pci.c | 8 ++++---- include/linux/cxl-event.h | 6 +++++- 3 files changed, 14 insertions(+), 6 deletions(-) diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index aed465d2fd68..60b615d361d3 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -693,6 +693,10 @@ static cxl_cper_callback cper_callback; static void cxl_cper_post_event(enum cxl_event_type event_type, struct cxl_cper_event_rec *rec) { + struct cxl_cper_event_info info; + + info.rec = *(struct cxl_cper_event_rec *)rec; + if (rec->hdr.length <= sizeof(rec->hdr) || rec->hdr.length > sizeof(*rec)) { pr_err(FW_WARN "CXL CPER Invalid section length (%u)\n", @@ -707,7 +711,7 @@ static void cxl_cper_post_event(enum cxl_event_type event_type, guard(rwsem_read)(&cxl_cper_rw_sem); if (cper_callback) - cper_callback(event_type, rec); + cper_callback(event_type, &info); } int cxl_cper_register_callback(cxl_cper_callback callback) diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index b14237f824cf..1ad240ead4fd 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -972,9 +972,9 @@ static struct pci_driver cxl_pci_driver = { #define CXL_EVENT_HDR_FLAGS_REC_SEVERITY GENMASK(1, 0) static void cxl_cper_event_call(enum cxl_event_type ev_type, - struct cxl_cper_event_rec *rec) + struct cxl_cper_event_info *info) { - struct cper_cxl_event_devid *device_id = &rec->hdr.device_id; + struct cper_cxl_event_devid *device_id = &info->rec.hdr.device_id; struct pci_dev *pdev __free(pci_dev_put) = NULL; enum cxl_event_log_type log_type; struct cxl_dev_state *cxlds; @@ -996,11 +996,11 @@ static void cxl_cper_event_call(enum cxl_event_type ev_type, return; /* Fabricate a log type */ - hdr_flags = get_unaligned_le24(rec->event.generic.hdr.flags); + hdr_flags = get_unaligned_le24(info->rec.event.generic.hdr.flags); log_type = FIELD_GET(CXL_EVENT_HDR_FLAGS_REC_SEVERITY, hdr_flags); cxl_event_trace_record(cxlds->cxlmd, log_type, ev_type, - &uuid_null, &rec->event); + &uuid_null, &info->rec.event); } static int __init cxl_pci_driver_init(void) diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h index 17eadee819b6..6ce839c59749 100644 --- a/include/linux/cxl-event.h +++ b/include/linux/cxl-event.h @@ -141,8 +141,12 @@ struct cxl_cper_event_rec { union cxl_event event; } __packed; +struct cxl_cper_event_info { + struct cxl_cper_event_rec rec; +}; + typedef void (*cxl_cper_callback)(enum cxl_event_type type, - struct cxl_cper_event_rec *rec); + struct cxl_cper_event_info *info); #ifdef CONFIG_ACPI_APEI_GHES int cxl_cper_register_callback(cxl_cper_callback callback); -- 2.17.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v2 1/4] acpi/ghes, cxl: Create a common CXL struct to handle different CXL CPER records 2024-01-09 3:47 ` [PATCH v2 1/4] acpi/ghes, cxl: Create a common CXL struct to handle different CXL CPER records Smita Koralahalli @ 2024-02-15 11:56 ` Jonathan Cameron 0 siblings, 0 replies; 19+ messages in thread From: Jonathan Cameron @ 2024-02-15 11:56 UTC (permalink / raw) To: Smita Koralahalli Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Yazen Ghannam On Tue, 9 Jan 2024 03:47:52 +0000 Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote: > Currently defined cxl_cper_callback interface between CXL subsystem and > GHES module is just confined to handling CXL Component errors only. > > Extend this callback to process CXL Protocol errors as well. Achieve > by defining a new struct cxl_cper_event_info to include cxl_cper_event_rec > and other fields of CXL protocol errors which will be defined in future > patches. > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Hi Smita, I guess this will get effected by the mess around the reporting that Ira is fixing but in meantime some comments on the current code. > --- > v2: > cxl_cper_rec_data -> cxl_cper_event_info > data -> info > --- > drivers/acpi/apei/ghes.c | 6 +++++- > drivers/cxl/pci.c | 8 ++++---- > include/linux/cxl-event.h | 6 +++++- > 3 files changed, 14 insertions(+), 6 deletions(-) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index aed465d2fd68..60b615d361d3 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -693,6 +693,10 @@ static cxl_cper_callback cper_callback; > static void cxl_cper_post_event(enum cxl_event_type event_type, > struct cxl_cper_event_rec *rec) > { > + struct cxl_cper_event_info info; > + > + info.rec = *(struct cxl_cper_event_rec *)rec; Why cast? > + > if (rec->hdr.length <= sizeof(rec->hdr) || > rec->hdr.length > sizeof(*rec)) { > pr_err(FW_WARN "CXL CPER Invalid section length (%u)\n", > @@ -707,7 +711,7 @@ static void cxl_cper_post_event(enum cxl_event_type event_type, > > guard(rwsem_read)(&cxl_cper_rw_sem); > if (cper_callback) > - cper_callback(event_type, rec); > + cper_callback(event_type, &info); > } > > int cxl_cper_register_callback(cxl_cper_callback callback) > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c > index b14237f824cf..1ad240ead4fd 100644 > --- a/drivers/cxl/pci.c > +++ b/drivers/cxl/pci.c > @@ -972,9 +972,9 @@ static struct pci_driver cxl_pci_driver = { > > #define CXL_EVENT_HDR_FLAGS_REC_SEVERITY GENMASK(1, 0) > static void cxl_cper_event_call(enum cxl_event_type ev_type, > - struct cxl_cper_event_rec *rec) > + struct cxl_cper_event_info *info) > { > - struct cper_cxl_event_devid *device_id = &rec->hdr.device_id; > + struct cper_cxl_event_devid *device_id = &info->rec.hdr.device_id; > struct pci_dev *pdev __free(pci_dev_put) = NULL; > enum cxl_event_log_type log_type; > struct cxl_dev_state *cxlds; > @@ -996,11 +996,11 @@ static void cxl_cper_event_call(enum cxl_event_type ev_type, > return; > > /* Fabricate a log type */ > - hdr_flags = get_unaligned_le24(rec->event.generic.hdr.flags); > + hdr_flags = get_unaligned_le24(info->rec.event.generic.hdr.flags); > log_type = FIELD_GET(CXL_EVENT_HDR_FLAGS_REC_SEVERITY, hdr_flags); > > cxl_event_trace_record(cxlds->cxlmd, log_type, ev_type, > - &uuid_null, &rec->event); > + &uuid_null, &info->rec.event); > } > > static int __init cxl_pci_driver_init(void) > diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h > index 17eadee819b6..6ce839c59749 100644 > --- a/include/linux/cxl-event.h > +++ b/include/linux/cxl-event.h > @@ -141,8 +141,12 @@ struct cxl_cper_event_rec { > union cxl_event event; > } __packed; > > +struct cxl_cper_event_info { > + struct cxl_cper_event_rec rec; Only parts of this will be relevant to the protocol errors. Maybe worth doing a union with the first part of rec in both structures but not the union cxl_event in the protocol error. Keep it all anonymous to avoid yet another structure in the reads/and writes though. > +}; > + > typedef void (*cxl_cper_callback)(enum cxl_event_type type, > - struct cxl_cper_event_rec *rec); > + struct cxl_cper_event_info *info); > > #ifdef CONFIG_ACPI_APEI_GHES > int cxl_cper_register_callback(cxl_cper_callback callback); ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v2 2/4] efi/cper, cxl: Make definitions and structures global 2024-01-09 3:47 [PATCH v2 0/4] acpi/ghes, cper, cxl: Trace FW-First CXL Protocol Errors Smita Koralahalli 2024-01-09 3:47 ` [PATCH v2 1/4] acpi/ghes, cxl: Create a common CXL struct to handle different CXL CPER records Smita Koralahalli @ 2024-01-09 3:47 ` Smita Koralahalli 2024-02-15 11:58 ` Jonathan Cameron 2024-02-15 14:47 ` Ard Biesheuvel 2024-01-09 3:47 ` [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors Smita Koralahalli ` (2 subsequent siblings) 4 siblings, 2 replies; 19+ messages in thread From: Smita Koralahalli @ 2024-01-09 3:47 UTC (permalink / raw) To: linux-efi, linux-kernel, linux-cxl Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam, Smita Koralahalli In preparation to add tracepoint support, move protocol error UUID definition to a common location and make CXL RAS capability struct global for use across different modules. Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> --- v2: No change. --- drivers/firmware/efi/cper_cxl.c | 11 ----------- drivers/firmware/efi/cper_cxl.h | 7 ++----- include/linux/cper.h | 4 ++++ include/linux/cxl-event.h | 11 +++++++++++ 4 files changed, 17 insertions(+), 16 deletions(-) diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c index a55771b99a97..4fd8d783993e 100644 --- a/drivers/firmware/efi/cper_cxl.c +++ b/drivers/firmware/efi/cper_cxl.c @@ -18,17 +18,6 @@ #define PROT_ERR_VALID_DVSEC BIT_ULL(5) #define PROT_ERR_VALID_ERROR_LOG BIT_ULL(6) -/* CXL RAS Capability Structure, CXL v3.0 sec 8.2.4.16 */ -struct cxl_ras_capability_regs { - u32 uncor_status; - u32 uncor_mask; - u32 uncor_severity; - u32 cor_status; - u32 cor_mask; - u32 cap_control; - u32 header_log[16]; -}; - static const char * const prot_err_agent_type_strs[] = { "Restricted CXL Device", "Restricted CXL Host Downstream Port", diff --git a/drivers/firmware/efi/cper_cxl.h b/drivers/firmware/efi/cper_cxl.h index 86bfcf7909ec..6f8c00495708 100644 --- a/drivers/firmware/efi/cper_cxl.h +++ b/drivers/firmware/efi/cper_cxl.h @@ -7,14 +7,11 @@ * Author: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> */ +#include <linux/cxl-event.h> + #ifndef LINUX_CPER_CXL_H #define LINUX_CPER_CXL_H -/* CXL Protocol Error Section */ -#define CPER_SEC_CXL_PROT_ERR \ - GUID_INIT(0x80B9EFB4, 0x52B5, 0x4DE3, 0xA7, 0x77, 0x68, 0x78, \ - 0x4B, 0x77, 0x10, 0x48) - #pragma pack(1) /* Compute Express Link Protocol Error Section, UEFI v2.10 sec N.2.13 */ diff --git a/include/linux/cper.h b/include/linux/cper.h index c1a7dc325121..2cbf0a93785a 100644 --- a/include/linux/cper.h +++ b/include/linux/cper.h @@ -89,6 +89,10 @@ enum { #define CPER_NOTIFY_DMAR \ GUID_INIT(0x667DD791, 0xC6B3, 0x4c27, 0x8A, 0x6B, 0x0F, 0x8E, \ 0x72, 0x2D, 0xEB, 0x41) +/* CXL Protocol Error Section */ +#define CPER_SEC_CXL_PROT_ERR \ + GUID_INIT(0x80B9EFB4, 0x52B5, 0x4DE3, 0xA7, 0x77, 0x68, 0x78, \ + 0x4B, 0x77, 0x10, 0x48) /* * Flags bits definitions for flags in struct cper_record_header diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h index 6ce839c59749..3a41dd5723e8 100644 --- a/include/linux/cxl-event.h +++ b/include/linux/cxl-event.h @@ -141,6 +141,17 @@ struct cxl_cper_event_rec { union cxl_event event; } __packed; +/* CXL RAS Capability Structure, CXL v3.0 sec 8.2.4.16 */ +struct cxl_ras_capability_regs { + u32 uncor_status; + u32 uncor_mask; + u32 uncor_severity; + u32 cor_status; + u32 cor_mask; + u32 cap_control; + u32 header_log[16]; +}; + struct cxl_cper_event_info { struct cxl_cper_event_rec rec; }; -- 2.17.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/4] efi/cper, cxl: Make definitions and structures global 2024-01-09 3:47 ` [PATCH v2 2/4] efi/cper, cxl: Make definitions and structures global Smita Koralahalli @ 2024-02-15 11:58 ` Jonathan Cameron 2024-02-15 14:47 ` Ard Biesheuvel 1 sibling, 0 replies; 19+ messages in thread From: Jonathan Cameron @ 2024-02-15 11:58 UTC (permalink / raw) To: Smita Koralahalli Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Yazen Ghannam On Tue, 9 Jan 2024 03:47:53 +0000 Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote: > In preparation to add tracepoint support, move protocol error UUID > definition to a common location and make CXL RAS capability struct > global for use across different modules. > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > --- > v2: > No change. > --- > drivers/firmware/efi/cper_cxl.c | 11 ----------- > drivers/firmware/efi/cper_cxl.h | 7 ++----- > include/linux/cper.h | 4 ++++ > include/linux/cxl-event.h | 11 +++++++++++ > 4 files changed, 17 insertions(+), 16 deletions(-) > > diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c > index a55771b99a97..4fd8d783993e 100644 > --- a/drivers/firmware/efi/cper_cxl.c > +++ b/drivers/firmware/efi/cper_cxl.c > @@ -18,17 +18,6 @@ > #define PROT_ERR_VALID_DVSEC BIT_ULL(5) > #define PROT_ERR_VALID_ERROR_LOG BIT_ULL(6) > > -/* CXL RAS Capability Structure, CXL v3.0 sec 8.2.4.16 */ > -struct cxl_ras_capability_regs { > - u32 uncor_status; > - u32 uncor_mask; > - u32 uncor_severity; > - u32 cor_status; > - u32 cor_mask; > - u32 cap_control; > - u32 header_log[16]; > -}; > - > static const char * const prot_err_agent_type_strs[] = { > "Restricted CXL Device", > "Restricted CXL Host Downstream Port", > diff --git a/drivers/firmware/efi/cper_cxl.h b/drivers/firmware/efi/cper_cxl.h > index 86bfcf7909ec..6f8c00495708 100644 > --- a/drivers/firmware/efi/cper_cxl.h > +++ b/drivers/firmware/efi/cper_cxl.h > @@ -7,14 +7,11 @@ > * Author: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> > */ > > +#include <linux/cxl-event.h> > + > #ifndef LINUX_CPER_CXL_H > #define LINUX_CPER_CXL_H > > -/* CXL Protocol Error Section */ > -#define CPER_SEC_CXL_PROT_ERR \ > - GUID_INIT(0x80B9EFB4, 0x52B5, 0x4DE3, 0xA7, 0x77, 0x68, 0x78, \ > - 0x4B, 0x77, 0x10, 0x48) > - > #pragma pack(1) > > /* Compute Express Link Protocol Error Section, UEFI v2.10 sec N.2.13 */ > diff --git a/include/linux/cper.h b/include/linux/cper.h > index c1a7dc325121..2cbf0a93785a 100644 > --- a/include/linux/cper.h > +++ b/include/linux/cper.h > @@ -89,6 +89,10 @@ enum { > #define CPER_NOTIFY_DMAR \ > GUID_INIT(0x667DD791, 0xC6B3, 0x4c27, 0x8A, 0x6B, 0x0F, 0x8E, \ > 0x72, 0x2D, 0xEB, 0x41) > +/* CXL Protocol Error Section */ > +#define CPER_SEC_CXL_PROT_ERR \ > + GUID_INIT(0x80B9EFB4, 0x52B5, 0x4DE3, 0xA7, 0x77, 0x68, 0x78, \ > + 0x4B, 0x77, 0x10, 0x48) > > /* > * Flags bits definitions for flags in struct cper_record_header > diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h > index 6ce839c59749..3a41dd5723e8 100644 > --- a/include/linux/cxl-event.h > +++ b/include/linux/cxl-event.h > @@ -141,6 +141,17 @@ struct cxl_cper_event_rec { > union cxl_event event; > } __packed; > > +/* CXL RAS Capability Structure, CXL v3.0 sec 8.2.4.16 */ > +struct cxl_ras_capability_regs { > + u32 uncor_status; > + u32 uncor_mask; > + u32 uncor_severity; > + u32 cor_status; > + u32 cor_mask; > + u32 cap_control; > + u32 header_log[16]; > +}; > + > struct cxl_cper_event_info { > struct cxl_cper_event_rec rec; > }; ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 2/4] efi/cper, cxl: Make definitions and structures global 2024-01-09 3:47 ` [PATCH v2 2/4] efi/cper, cxl: Make definitions and structures global Smita Koralahalli 2024-02-15 11:58 ` Jonathan Cameron @ 2024-02-15 14:47 ` Ard Biesheuvel 1 sibling, 0 replies; 19+ messages in thread From: Ard Biesheuvel @ 2024-02-15 14:47 UTC (permalink / raw) To: Smita Koralahalli Cc: linux-efi, linux-kernel, linux-cxl, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam On Tue, 9 Jan 2024 at 04:48, Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote: > > In preparation to add tracepoint support, move protocol error UUID > definition to a common location and make CXL RAS capability struct > global for use across different modules. > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> > --- > v2: > No change. > --- > drivers/firmware/efi/cper_cxl.c | 11 ----------- > drivers/firmware/efi/cper_cxl.h | 7 ++----- > include/linux/cper.h | 4 ++++ > include/linux/cxl-event.h | 11 +++++++++++ > 4 files changed, 17 insertions(+), 16 deletions(-) > > diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c > index a55771b99a97..4fd8d783993e 100644 > --- a/drivers/firmware/efi/cper_cxl.c > +++ b/drivers/firmware/efi/cper_cxl.c > @@ -18,17 +18,6 @@ > #define PROT_ERR_VALID_DVSEC BIT_ULL(5) > #define PROT_ERR_VALID_ERROR_LOG BIT_ULL(6) > > -/* CXL RAS Capability Structure, CXL v3.0 sec 8.2.4.16 */ > -struct cxl_ras_capability_regs { > - u32 uncor_status; > - u32 uncor_mask; > - u32 uncor_severity; > - u32 cor_status; > - u32 cor_mask; > - u32 cap_control; > - u32 header_log[16]; > -}; > - > static const char * const prot_err_agent_type_strs[] = { > "Restricted CXL Device", > "Restricted CXL Host Downstream Port", > diff --git a/drivers/firmware/efi/cper_cxl.h b/drivers/firmware/efi/cper_cxl.h > index 86bfcf7909ec..6f8c00495708 100644 > --- a/drivers/firmware/efi/cper_cxl.h > +++ b/drivers/firmware/efi/cper_cxl.h > @@ -7,14 +7,11 @@ > * Author: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> > */ > > +#include <linux/cxl-event.h> > + > #ifndef LINUX_CPER_CXL_H > #define LINUX_CPER_CXL_H > > -/* CXL Protocol Error Section */ > -#define CPER_SEC_CXL_PROT_ERR \ > - GUID_INIT(0x80B9EFB4, 0x52B5, 0x4DE3, 0xA7, 0x77, 0x68, 0x78, \ > - 0x4B, 0x77, 0x10, 0x48) > - > #pragma pack(1) > > /* Compute Express Link Protocol Error Section, UEFI v2.10 sec N.2.13 */ > diff --git a/include/linux/cper.h b/include/linux/cper.h > index c1a7dc325121..2cbf0a93785a 100644 > --- a/include/linux/cper.h > +++ b/include/linux/cper.h > @@ -89,6 +89,10 @@ enum { > #define CPER_NOTIFY_DMAR \ > GUID_INIT(0x667DD791, 0xC6B3, 0x4c27, 0x8A, 0x6B, 0x0F, 0x8E, \ > 0x72, 0x2D, 0xEB, 0x41) > +/* CXL Protocol Error Section */ > +#define CPER_SEC_CXL_PROT_ERR \ > + GUID_INIT(0x80B9EFB4, 0x52B5, 0x4DE3, 0xA7, 0x77, 0x68, 0x78, \ > + 0x4B, 0x77, 0x10, 0x48) > > /* > * Flags bits definitions for flags in struct cper_record_header > diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h > index 6ce839c59749..3a41dd5723e8 100644 > --- a/include/linux/cxl-event.h > +++ b/include/linux/cxl-event.h > @@ -141,6 +141,17 @@ struct cxl_cper_event_rec { > union cxl_event event; > } __packed; > > +/* CXL RAS Capability Structure, CXL v3.0 sec 8.2.4.16 */ > +struct cxl_ras_capability_regs { > + u32 uncor_status; > + u32 uncor_mask; > + u32 uncor_severity; > + u32 cor_status; > + u32 cor_mask; > + u32 cap_control; > + u32 header_log[16]; > +}; > + > struct cxl_cper_event_info { > struct cxl_cper_event_rec rec; > }; > -- > 2.17.1 > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors. 2024-01-09 3:47 [PATCH v2 0/4] acpi/ghes, cper, cxl: Trace FW-First CXL Protocol Errors Smita Koralahalli 2024-01-09 3:47 ` [PATCH v2 1/4] acpi/ghes, cxl: Create a common CXL struct to handle different CXL CPER records Smita Koralahalli 2024-01-09 3:47 ` [PATCH v2 2/4] efi/cper, cxl: Make definitions and structures global Smita Koralahalli @ 2024-01-09 3:47 ` Smita Koralahalli 2024-02-15 12:17 ` Jonathan Cameron 2024-01-09 3:47 ` [PATCH v2 4/4] acpi/ghes, cxl/pci: Trace FW-First " Smita Koralahalli 2024-05-07 9:35 ` [PATCH v2 0/4] acpi/ghes, cper, cxl: " Fabio M. De Francesco 4 siblings, 1 reply; 19+ messages in thread From: Smita Koralahalli @ 2024-01-09 3:47 UTC (permalink / raw) To: linux-efi, linux-kernel, linux-cxl Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam, Smita Koralahalli UEFI v2.10 section N.2.13 defines a CPER record for CXL Protocol errors. Add GHES support to detect CXL CPER Protocol record and cache error severity, device_id, serial number and CXL RAS capability struct in struct cxl_cper_event_info. Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> --- v2: Change to sub-struct for protocol error specific elemenets. Set serial number unconditionally. Copy entire cxl_ras_capability_regs struct rather than pointer. Calculate error severity in efi/cper and change to enum. --- drivers/acpi/apei/ghes.c | 11 ++++++ drivers/firmware/efi/cper_cxl.c | 68 +++++++++++++++++++++++++++++++++ include/linux/cxl-event.h | 13 +++++++ 3 files changed, 92 insertions(+) diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index 60b615d361d3..1d4f3d68a0bc 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -714,6 +714,14 @@ static void cxl_cper_post_event(enum cxl_event_type event_type, cper_callback(event_type, &info); } +void cxl_cper_handle_prot_err(struct acpi_hest_generic_data *gdata) +{ + struct cxl_cper_event_info info; + + if (cxl_cper_handle_prot_err_info(gdata, &info)) + return; +} + int cxl_cper_register_callback(cxl_cper_callback callback) { guard(rwsem_write)(&cxl_cper_rw_sem); @@ -768,6 +776,9 @@ static bool ghes_do_proc(struct ghes *ghes, else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) { queued = ghes_handle_arm_hw_error(gdata, sev); } + else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) { + cxl_cper_handle_prot_err(gdata); + } else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) { struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata); diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c index 4fd8d783993e..9b9b8c8f1157 100644 --- a/drivers/firmware/efi/cper_cxl.c +++ b/drivers/firmware/efi/cper_cxl.c @@ -8,6 +8,7 @@ */ #include <linux/cper.h> +#include <acpi/ghes.h> #include "cper_cxl.h" #define PROT_ERR_VALID_AGENT_TYPE BIT_ULL(0) @@ -44,6 +45,17 @@ enum { USP, /* CXL Upstream Switch Port */ }; +static enum cxl_aer_err_type cper_severity_cxl_aer(int cper_severity) +{ + switch (cper_severity) { + case CPER_SEV_RECOVERABLE: + case CPER_SEV_FATAL: + return CXL_AER_UNCORRECTABLE; + default: + return CXL_AER_CORRECTABLE; + } +} + void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_err) { if (prot_err->valid_bits & PROT_ERR_VALID_AGENT_TYPE) @@ -176,3 +188,59 @@ void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_e sizeof(cxl_ras->header_log), 0); } } + +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, + struct cxl_cper_event_info *info) +{ + struct cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata); + struct cper_cxl_event_devid *device_id = &info->rec.hdr.device_id; + struct cper_cxl_event_sn *dev_serial_num = &info->rec.hdr.dev_serial_num; + size_t size = sizeof(*prot_err) + prot_err->dvsec_len; + + if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) { + pr_err(FW_WARN "Not a valid protocol error log\n"); + return -EINVAL; + } + + if (!(prot_err->valid_bits & PROT_ERR_VALID_DEVICE_ID)) { + pr_err(FW_WARN "Not a valid Device ID\n"); + return -EINVAL; + } + + /* + * Set device serial number unconditionally. + * + * Print a warning message if it is not valid. The device serial + * number is considered valid for CXL 1.1 device, CXL 2.0 device, + * CXL 2.0 Logical device, or CXL 2.0 Fabric Manager Managed + * Logical Device. + */ + if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER) || + prot_err->agent_type > 0x4 || prot_err->agent_type == RCH_DP) + pr_warn(FW_WARN "Not a valid serial number\n"); + + dev_serial_num->lower_dw = prot_err->dev_serial_num.lower_dw; + dev_serial_num->upper_dw = prot_err->dev_serial_num.upper_dw; + + /* + * The device ID or agent address is only valid for CXL 1.1 device, + * CXL 2.0 device, CXL 2.0 Logical device, CXL 2.0 Fabric Manager + * Managed Logical Device, CXL Root Port, CXL Downstream Switch + * Port, or CXL Upstream Switch Port. + */ + if (prot_err->agent_type <= 0x7 && prot_err->agent_type != RCH_DP) { + device_id->segment_num = prot_err->agent_addr.segment; + device_id->bus_num = prot_err->agent_addr.bus; + device_id->device_num = prot_err->agent_addr.device; + device_id->func_num = prot_err->agent_addr.function; + } else { + pr_err(FW_WARN "Not a valid agent type\n"); + return -EINVAL; + } + + info->p_err.cxl_ras = *(struct cxl_ras_capability_regs *)((long)prot_err + size); + + info->p_err.severity = cper_severity_cxl_aer(gdata->error_severity); + + return 0; +} diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h index 3a41dd5723e8..08e3979de9a3 100644 --- a/include/linux/cxl-event.h +++ b/include/linux/cxl-event.h @@ -152,13 +152,26 @@ struct cxl_ras_capability_regs { u32 header_log[16]; }; +enum cxl_aer_err_type { + CXL_AER_UNCORRECTABLE, + CXL_AER_CORRECTABLE, +}; + struct cxl_cper_event_info { struct cxl_cper_event_rec rec; + struct cxl_cper_prot_err { + struct cxl_ras_capability_regs cxl_ras; + int severity; + } p_err; }; typedef void (*cxl_cper_callback)(enum cxl_event_type type, struct cxl_cper_event_info *info); +struct acpi_hest_generic_data; +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, + struct cxl_cper_event_info *info); + #ifdef CONFIG_ACPI_APEI_GHES int cxl_cper_register_callback(cxl_cper_callback callback); int cxl_cper_unregister_callback(cxl_cper_callback callback); -- 2.17.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors. 2024-01-09 3:47 ` [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors Smita Koralahalli @ 2024-02-15 12:17 ` Jonathan Cameron 0 siblings, 0 replies; 19+ messages in thread From: Jonathan Cameron @ 2024-02-15 12:17 UTC (permalink / raw) To: Smita Koralahalli Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Yazen Ghannam On Tue, 9 Jan 2024 03:47:54 +0000 Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote: > UEFI v2.10 section N.2.13 defines a CPER record for CXL Protocol errors. > > Add GHES support to detect CXL CPER Protocol record and cache error > severity, device_id, serial number and CXL RAS capability struct in > struct cxl_cper_event_info. > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> > --- > v2: > Change to sub-struct for protocol error specific elemenets. > Set serial number unconditionally. > Copy entire cxl_ras_capability_regs struct rather than pointer. > Calculate error severity in efi/cper and change to enum. > --- > drivers/acpi/apei/ghes.c | 11 ++++++ > drivers/firmware/efi/cper_cxl.c | 68 +++++++++++++++++++++++++++++++++ > include/linux/cxl-event.h | 13 +++++++ > 3 files changed, 92 insertions(+) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index 60b615d361d3..1d4f3d68a0bc 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -714,6 +714,14 @@ static void cxl_cper_post_event(enum cxl_event_type event_type, > cper_callback(event_type, &info); > } > > +void cxl_cper_handle_prot_err(struct acpi_hest_generic_data *gdata) > +{ > + struct cxl_cper_event_info info; > + > + if (cxl_cper_handle_prot_err_info(gdata, &info)) > + return; > +} > + > int cxl_cper_register_callback(cxl_cper_callback callback) > { > guard(rwsem_write)(&cxl_cper_rw_sem); > @@ -768,6 +776,9 @@ static bool ghes_do_proc(struct ghes *ghes, > else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) { > queued = ghes_handle_arm_hw_error(gdata, sev); > } > + else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) { > + cxl_cper_handle_prot_err(gdata); > + } > else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) { > struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata); > > diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c > index 4fd8d783993e..9b9b8c8f1157 100644 > --- a/drivers/firmware/efi/cper_cxl.c > +++ b/drivers/firmware/efi/cper_cxl.c > @@ -8,6 +8,7 @@ > */ > > #include <linux/cper.h> > +#include <acpi/ghes.h> > #include "cper_cxl.h" > > #define PROT_ERR_VALID_AGENT_TYPE BIT_ULL(0) > @@ -44,6 +45,17 @@ enum { > USP, /* CXL Upstream Switch Port */ > }; > > +static enum cxl_aer_err_type cper_severity_cxl_aer(int cper_severity) > +{ > + switch (cper_severity) { > + case CPER_SEV_RECOVERABLE: > + case CPER_SEV_FATAL: > + return CXL_AER_UNCORRECTABLE; > + default: > + return CXL_AER_CORRECTABLE; > + } > +} > + > void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_err) > { > if (prot_err->valid_bits & PROT_ERR_VALID_AGENT_TYPE) > @@ -176,3 +188,59 @@ void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_e > sizeof(cxl_ras->header_log), 0); > } > } > + > +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, > + struct cxl_cper_event_info *info) > +{ > + struct cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata); > + struct cper_cxl_event_devid *device_id = &info->rec.hdr.device_id; > + struct cper_cxl_event_sn *dev_serial_num = &info->rec.hdr.dev_serial_num; > + size_t size = sizeof(*prot_err) + prot_err->dvsec_len; Not obvious what this is size of. I'd rename it to reflect that's only the distance to the end of the dvsec copy. Or just compute the pointer below directly by putting this maths inline. > + > + if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) { > + pr_err(FW_WARN "Not a valid protocol error log\n"); > + return -EINVAL; > + } > + > + if (!(prot_err->valid_bits & PROT_ERR_VALID_DEVICE_ID)) { > + pr_err(FW_WARN "Not a valid Device ID\n"); "No device ID\n" is more accurate description. I'd move this down to next to where we check the data is valid. So keep each validity check next to where it matters rather than a bunch of checks up here. (mostly because I started writing you didn't check it was valid down there before remembering this earlier code :) > + return -EINVAL; > + } > + > + /* > + * Set device serial number unconditionally. > + * > + * Print a warning message if it is not valid. The device serial > + * number is considered valid for CXL 1.1 device, CXL 2.0 device, is required for perhaps? These all got renamed in the CXL spec. We should use that naming because it deliberately avoids limiting to particular spec versions. CXL RCD, CXL SLD, CXL LD, > + * CXL 2.0 Logical device, or CXL 2.0 Fabric Manager Managed > + * Logical Device. Not sure what this is now called.. :( > + */ > + if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER) || > + prot_err->agent_type > 0x4 || prot_err->agent_type == RCH_DP) > + pr_warn(FW_WARN "Not a valid serial number\n"); > + > + dev_serial_num->lower_dw = prot_err->dev_serial_num.lower_dw; > + dev_serial_num->upper_dw = prot_err->dev_serial_num.upper_dw; > + > + /* > + * The device ID or agent address is only valid for CXL 1.1 device, > + * CXL 2.0 device, CXL 2.0 Logical device, CXL 2.0 Fabric Manager > + * Managed Logical Device, CXL Root Port, CXL Downstream Switch > + * Port, or CXL Upstream Switch Port. > + */ > + if (prot_err->agent_type <= 0x7 && prot_err->agent_type != RCH_DP) { > + device_id->segment_num = prot_err->agent_addr.segment; > + device_id->bus_num = prot_err->agent_addr.bus; > + device_id->device_num = prot_err->agent_addr.device; > + device_id->func_num = prot_err->agent_addr.function; > + } else { > + pr_err(FW_WARN "Not a valid agent type\n"); > + return -EINVAL; > + } > + > + info->p_err.cxl_ras = *(struct cxl_ras_capability_regs *)((long)prot_err + size); Casting to a long isn't nice. Keep it as a pointer for this maths a u8 * or void * would work. Particularly if you did it as something a bit more self documenting like u8 *dvsec_start = (u8 *)(prot_err + 1); u8 *cap_start = dvsec_start + prot_err->dvsec_length; info->p_err.cxl_ras = *(struct cxl_ras_capability_regs *)cap_start; > + > + info->p_err.severity = cper_severity_cxl_aer(gdata->error_severity); > + > + return 0; > +} ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v2 4/4] acpi/ghes, cxl/pci: Trace FW-First CXL Protocol Errors 2024-01-09 3:47 [PATCH v2 0/4] acpi/ghes, cper, cxl: Trace FW-First CXL Protocol Errors Smita Koralahalli ` (2 preceding siblings ...) 2024-01-09 3:47 ` [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors Smita Koralahalli @ 2024-01-09 3:47 ` Smita Koralahalli 2024-02-15 12:22 ` Jonathan Cameron 2024-05-07 9:35 ` [PATCH v2 0/4] acpi/ghes, cper, cxl: " Fabio M. De Francesco 4 siblings, 1 reply; 19+ messages in thread From: Smita Koralahalli @ 2024-01-09 3:47 UTC (permalink / raw) To: linux-efi, linux-kernel, linux-cxl Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam, Smita Koralahalli When PCIe AER is in FW-First, OS should process CXL Protocol errors from CPER records. These CPER records obtained from GHES module, will rely on a registered callback to be notified to the CXL subsystem in order to be processed. Call the existing cxl_cper_callback to notify the CXL subsystem on a Protocol error. The defined trace events cxl_aer_uncorrectable_error and cxl_aer_correctable_error currently trace native CXL AER errors. Reuse them to trace FW-First Protocol Errors. Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> --- v2: Added warning for serial number check. Moved severity determination to previous patch. --- drivers/acpi/apei/ghes.c | 5 +++++ drivers/cxl/core/pci.c | 29 +++++++++++++++++++++++++++++ drivers/cxl/cxlpci.h | 3 +++ drivers/cxl/pci.c | 5 +++++ include/linux/cxl-event.h | 1 + 5 files changed, 43 insertions(+) diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index 1d4f3d68a0bc..4318b602e797 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -716,10 +716,15 @@ static void cxl_cper_post_event(enum cxl_event_type event_type, void cxl_cper_handle_prot_err(struct acpi_hest_generic_data *gdata) { + enum cxl_event_type event_type = CXL_CPER_EVENT_PROT_ERR; struct cxl_cper_event_info info; if (cxl_cper_handle_prot_err_info(gdata, &info)) return; + + guard(rwsem_read)(&cxl_cper_rw_sem); + if (cper_callback) + cper_callback(event_type, &info); } int cxl_cper_register_callback(cxl_cper_callback callback) diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c index 37e1652afbc7..bde8ebf5e4b3 100644 --- a/drivers/cxl/core/pci.c +++ b/drivers/cxl/core/pci.c @@ -836,6 +836,35 @@ void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport) } EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL); +void cxl_prot_err_trace_record(struct cxl_dev_state *cxlds, + struct cxl_cper_event_info *info) +{ + struct cper_cxl_event_sn *dev_serial_num = &info->rec.hdr.dev_serial_num; + u32 status, fe; + + if (((u64)dev_serial_num->upper_dw << 32 | + dev_serial_num->lower_dw) != cxlds->serial) + pr_warn("The device serial number in CPER differs or isn't valid\n"); + + if (info->p_err.severity == CXL_AER_CORRECTABLE) { + status = info->p_err.cxl_ras.cor_status & ~info->p_err.cxl_ras.cor_mask; + + trace_cxl_aer_correctable_error(cxlds->cxlmd, status); + } else { + status = info->p_err.cxl_ras.uncor_status & ~info->p_err.cxl_ras.uncor_mask; + + if (hweight32(status) > 1) + fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK, + info->p_err.cxl_ras.cap_control)); + else + fe = status; + + trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, + info->p_err.cxl_ras.header_log); + } +} +EXPORT_SYMBOL_NS_GPL(cxl_prot_err_trace_record, CXL); + static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds, struct cxl_dport *dport) { diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h index 0fa4799ea316..216003d4aec1 100644 --- a/drivers/cxl/cxlpci.h +++ b/drivers/cxl/cxlpci.h @@ -93,4 +93,7 @@ void read_cdat_data(struct cxl_port *port); void cxl_cor_error_detected(struct pci_dev *pdev); pci_ers_result_t cxl_error_detected(struct pci_dev *pdev, pci_channel_state_t state); +struct cxl_cper_event_info; +void cxl_prot_err_trace_record(struct cxl_dev_state *cxlds, + struct cxl_cper_event_info *info); #endif /* __CXL_PCI_H__ */ diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c index 1ad240ead4fd..515983e7df10 100644 --- a/drivers/cxl/pci.c +++ b/drivers/cxl/pci.c @@ -995,6 +995,11 @@ static void cxl_cper_event_call(enum cxl_event_type ev_type, if (!cxlds) return; + if (ev_type == CXL_CPER_EVENT_PROT_ERR) { + cxl_prot_err_trace_record(cxlds, info); + return; + } + /* Fabricate a log type */ hdr_flags = get_unaligned_le24(info->rec.event.generic.hdr.flags); log_type = FIELD_GET(CXL_EVENT_HDR_FLAGS_REC_SEVERITY, hdr_flags); diff --git a/include/linux/cxl-event.h b/include/linux/cxl-event.h index 08e3979de9a3..96cc88aa04f3 100644 --- a/include/linux/cxl-event.h +++ b/include/linux/cxl-event.h @@ -113,6 +113,7 @@ enum cxl_event_type { CXL_CPER_EVENT_GEN_MEDIA, CXL_CPER_EVENT_DRAM, CXL_CPER_EVENT_MEM_MODULE, + CXL_CPER_EVENT_PROT_ERR, }; #define CPER_CXL_DEVICE_ID_VALID BIT(0) -- 2.17.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v2 4/4] acpi/ghes, cxl/pci: Trace FW-First CXL Protocol Errors 2024-01-09 3:47 ` [PATCH v2 4/4] acpi/ghes, cxl/pci: Trace FW-First " Smita Koralahalli @ 2024-02-15 12:22 ` Jonathan Cameron 0 siblings, 0 replies; 19+ messages in thread From: Jonathan Cameron @ 2024-02-15 12:22 UTC (permalink / raw) To: Smita Koralahalli Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Yazen Ghannam On Tue, 9 Jan 2024 03:47:55 +0000 Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote: > When PCIe AER is in FW-First, OS should process CXL Protocol errors from > CPER records. These CPER records obtained from GHES module, will rely on > a registered callback to be notified to the CXL subsystem in order to be > processed. > > Call the existing cxl_cper_callback to notify the CXL subsystem on a > Protocol error. > > The defined trace events cxl_aer_uncorrectable_error and > cxl_aer_correctable_error currently trace native CXL AER errors. Reuse > them to trace FW-First Protocol Errors. > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Other than the bits of this that will change due to Ira's look good to me. Jonathan ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 0/4] acpi/ghes, cper, cxl: Trace FW-First CXL Protocol Errors 2024-01-09 3:47 [PATCH v2 0/4] acpi/ghes, cper, cxl: Trace FW-First CXL Protocol Errors Smita Koralahalli ` (3 preceding siblings ...) 2024-01-09 3:47 ` [PATCH v2 4/4] acpi/ghes, cxl/pci: Trace FW-First " Smita Koralahalli @ 2024-05-07 9:35 ` Fabio M. De Francesco 2024-05-16 17:59 ` Smita Koralahalli 4 siblings, 1 reply; 19+ messages in thread From: Fabio M. De Francesco @ 2024-05-07 9:35 UTC (permalink / raw) To: Smita Koralahalli Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam, Smita Koralahalli On Tuesday, January 9, 2024 4:47:51 AM GMT+2 Smita Koralahalli wrote: > This patchset adds trace event support for FW-First Protocol Errors. > > This series depends on: > https://lore.kernel.org/linux-cxl/20231220-cxl-cper-v5-0-1bb8a4ca2c7a@intel. > com > Hello Smita, I'm working on a small series of enhancement and additions to kernel logs and trace events in extlog_print() (ELOG). I'm interested to reuse from ELOG the infrastructure which you made to trace the CXL Protocol Errors. I'm aware that this series depends on an old one from Ira which has been superseded by a different implementation and, furthermore, that this needs to be reworked because the execution of this code may sleep while in atomic context. I'd like to ask if you plan to fix and rebase this series to a current cxl branch. If so, I'll wait for the next version to integrate in ELOG. Otherwise, if you are not anymore actively working on this series, please notice that I'd be glad to help by making the necessary changes and by porting your code to a current cxl branch (of course, retaining your authorship). Thanks, Fabio M. De Francesco > > Link to v1: > https://lore.kernel.org/all/20240102150933.161009-1-Smita.KoralahalliChannab > asappa@amd.com > > Smita Koralahalli (4): > acpi/ghes, cxl: Create a common CXL struct to handle different CXL > CPER records > efi/cper, cxl: Make definitions and structures global > acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors. > acpi/ghes, cxl/pci: Trace FW-First CXL Protocol Errors > > drivers/acpi/apei/ghes.c | 22 ++++++++- > drivers/cxl/core/pci.c | 29 ++++++++++++ > drivers/cxl/cxlpci.h | 3 ++ > drivers/cxl/pci.c | 13 ++++-- > drivers/firmware/efi/cper_cxl.c | 79 ++++++++++++++++++++++++++++----- > drivers/firmware/efi/cper_cxl.h | 7 +-- > include/linux/cper.h | 4 ++ > include/linux/cxl-event.h | 31 ++++++++++++- > 8 files changed, 166 insertions(+), 22 deletions(-) ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 0/4] acpi/ghes, cper, cxl: Trace FW-First CXL Protocol Errors 2024-05-07 9:35 ` [PATCH v2 0/4] acpi/ghes, cper, cxl: " Fabio M. De Francesco @ 2024-05-16 17:59 ` Smita Koralahalli 0 siblings, 0 replies; 19+ messages in thread From: Smita Koralahalli @ 2024-05-16 17:59 UTC (permalink / raw) To: Fabio M. De Francesco Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam Hi Fabio, On 5/7/2024 2:35 AM, Fabio M. De Francesco wrote: > On Tuesday, January 9, 2024 4:47:51 AM GMT+2 Smita Koralahalli wrote: >> This patchset adds trace event support for FW-First Protocol Errors. >> >> This series depends on: >> https://lore.kernel.org/linux-cxl/20231220-cxl-cper-v5-0-1bb8a4ca2c7a@intel. >> com >> > Hello Smita, > > I'm working on a small series of enhancement and additions to kernel logs and > trace events in extlog_print() (ELOG). I'm interested to reuse from ELOG the > infrastructure which you made to trace the CXL Protocol Errors. > > I'm aware that this series depends on an old one from Ira which has been > superseded by a different implementation and, furthermore, that this needs to > be reworked because the execution of this code may sleep while in atomic > context. > > I'd like to ask if you plan to fix and rebase this series to a current cxl > branch. If so, I'll wait for the next version to integrate in ELOG. > > Otherwise, if you are not anymore actively working on this series, please > notice that I'd be glad to help by making the necessary changes and by porting > your code to a current cxl branch (of course, retaining your authorship). > > Thanks, > > Fabio M. De Francesco Sorry missed your message on my inbox. Yes, I will get back working on v3 on this early next week. Couldn't get a chance to continue due to other pcie work. Hopefully, I can get something out next week. Thanks Smita >> >> Link to v1: >> https://lore.kernel.org/all/20240102150933.161009-1-Smita.KoralahalliChannab >> asappa@amd.com >> >> Smita Koralahalli (4): >> acpi/ghes, cxl: Create a common CXL struct to handle different CXL >> CPER records >> efi/cper, cxl: Make definitions and structures global >> acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors. >> acpi/ghes, cxl/pci: Trace FW-First CXL Protocol Errors >> >> drivers/acpi/apei/ghes.c | 22 ++++++++- >> drivers/cxl/core/pci.c | 29 ++++++++++++ >> drivers/cxl/cxlpci.h | 3 ++ >> drivers/cxl/pci.c | 13 ++++-- >> drivers/firmware/efi/cper_cxl.c | 79 ++++++++++++++++++++++++++++----- >> drivers/firmware/efi/cper_cxl.h | 7 +-- >> include/linux/cper.h | 4 ++ >> include/linux/cxl-event.h | 31 ++++++++++++- >> 8 files changed, 166 insertions(+), 22 deletions(-) > > > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v2 0/4] acpi/ghes, cper, cxl: Trace FW-First CXL Protocol Errors @ 2024-10-01 0:52 Smita Koralahalli 2024-10-01 0:52 ` [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process " Smita Koralahalli 0 siblings, 1 reply; 19+ messages in thread From: Smita Koralahalli @ 2024-10-01 0:52 UTC (permalink / raw) To: linux-efi, linux-kernel, linux-cxl Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam, Bowman Terry This patchset adds trace event support for FW-First Protocol Errors. Reworked the patchset to reuse the work item written by Ira for handling CPER events. I understand this patchset also schedules the cxl_pci driver to look at potentially "non-device" errors as Terry's port error handling patches are under review. However, I'm also fine to just confine it until FMLD as of now. Looking for feedbacks here. Link to v1: https://lore.kernel.org/all/20240522150839.27578-1-Smita.KoralahalliChannabasappa@amd.com Smita Koralahalli (4): efi/cper, cxl: Make definitions and structures global cxl/pci: Define a common function get_cxl_devstate() acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors. acpi/ghes, cxl/pci: Trace FW-First CXL Protocol Errors drivers/acpi/apei/ghes.c | 24 ++++++ drivers/cxl/core/pci.c | 24 ++++++ drivers/cxl/cxlpci.h | 3 + drivers/cxl/pci.c | 52 +++++++++---- drivers/firmware/efi/cper_cxl.c | 126 +++++++++++++++++++++++++++++--- drivers/firmware/efi/cper_cxl.h | 7 +- include/cxl/event.h | 38 ++++++++++ include/linux/cper.h | 4 + 8 files changed, 247 insertions(+), 31 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors. 2024-10-01 0:52 Smita Koralahalli @ 2024-10-01 0:52 ` Smita Koralahalli 2024-10-01 15:47 ` Ira Weiny ` (2 more replies) 0 siblings, 3 replies; 19+ messages in thread From: Smita Koralahalli @ 2024-10-01 0:52 UTC (permalink / raw) To: linux-efi, linux-kernel, linux-cxl Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam, Bowman Terry UEFI v2.10 section N.2.13 defines a CPER record for CXL Protocol errors. Add GHES support to detect CXL CPER Protocol Error Record and Cache Error Severity, Device ID, Device Serial number and CXL RAS capability struct in struct cxl_cper_prot_err. Include this struct as a member of struct cxl_cper_work_data. Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> --- v2: Defined array of structures for Device ID and Serial number comparison. p_err -> rec/p_rec. --- drivers/acpi/apei/ghes.c | 10 +++ drivers/firmware/efi/cper_cxl.c | 115 ++++++++++++++++++++++++++++++++ include/cxl/event.h | 26 ++++++++ 3 files changed, 151 insertions(+) diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index ada93cfde9ba..9dcf0f78458f 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -717,6 +717,14 @@ static void cxl_cper_post_event(enum cxl_event_type event_type, schedule_work(cxl_cper_work); } +static void cxl_cper_handle_prot_err(struct acpi_hest_generic_data *gdata) +{ + struct cxl_cper_work_data wd; + + if (cxl_cper_handle_prot_err_info(gdata, &wd.p_rec)) + return; +} + int cxl_cper_register_work(struct work_struct *work) { if (cxl_cper_work) @@ -791,6 +799,8 @@ static bool ghes_do_proc(struct ghes *ghes, struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata); cxl_cper_post_event(CXL_CPER_EVENT_MEM_MODULE, rec); + } else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) { + cxl_cper_handle_prot_err(gdata); } else { void *err = acpi_hest_get_payload(gdata); diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c index 4fd8d783993e..08da7764c066 100644 --- a/drivers/firmware/efi/cper_cxl.c +++ b/drivers/firmware/efi/cper_cxl.c @@ -8,6 +8,7 @@ */ #include <linux/cper.h> +#include <acpi/ghes.h> #include "cper_cxl.h" #define PROT_ERR_VALID_AGENT_TYPE BIT_ULL(0) @@ -44,6 +45,66 @@ enum { USP, /* CXL Upstream Switch Port */ }; +struct agent_info { + const char *string; + bool req_sn; + bool req_sbdf; +}; + +static const struct agent_info agent_info[] = { + [RCD] = { + .string = "Restricted CXL Device", + .req_sbdf = true, + .req_sn = true, + }, + [RCH_DP] = { + .string = "Restricted CXL Host Downstream Port", + .req_sbdf = false, + .req_sn = false, + }, + [DEVICE] = { + .string = "CXL Device", + .req_sbdf = true, + .req_sn = true, + }, + [LD] = { + .string = "CXL Logical Device", + .req_sbdf = true, + .req_sn = true, + }, + [FMLD] = { + .string = "CXL Fabric Manager managed Logical Device", + .req_sbdf = true, + .req_sn = true, + }, + [RP] = { + .string = "CXL Root Port", + .req_sbdf = true, + .req_sn = false, + }, + [DSP] = { + .string = "CXL Downstream Switch Port", + .req_sbdf = true, + .req_sn = false, + }, + [USP] = { + .string = "CXL Upstream Switch Port", + .req_sbdf = true, + .req_sn = false, + }, +}; + +static enum cxl_aer_err_type cper_severity_cxl_aer(int cper_severity) +{ + switch (cper_severity) { + case CPER_SEV_RECOVERABLE: + case CPER_SEV_FATAL: + return CXL_AER_UNCORRECTABLE; + default: + return CXL_AER_CORRECTABLE; + } +} + void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_err) { if (prot_err->valid_bits & PROT_ERR_VALID_AGENT_TYPE) @@ -176,3 +237,57 @@ void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_e sizeof(cxl_ras->header_log), 0); } } + +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, + struct cxl_cper_prot_err *rec) +{ + struct cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata); + u8 *dvsec_start, *cap_start; + + if (!(prot_err->valid_bits & PROT_ERR_VALID_DEVICE_ID)) { + pr_err(FW_WARN "No Device ID\n"); + return -EINVAL; + } + + /* + * The device ID or agent address is required for CXL RCD, CXL + * SLD, CXL LD, CXL Fabric Manager Managed LD, CXL Root Port, + * CXL Downstream Switch Port and CXL Upstream Switch Port. + */ + if (!(agent_info[prot_err->agent_type].req_sbdf)) { + pr_err(FW_WARN "Invalid agent type\n"); + return -EINVAL; + } + + rec->segment = prot_err->agent_addr.segment; + rec->bus = prot_err->agent_addr.bus; + rec->device = prot_err->agent_addr.device; + rec->function = prot_err->agent_addr.function; + + if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) { + pr_err(FW_WARN "Invalid Protocol Error log\n"); + return -EINVAL; + } + + dvsec_start = (u8 *)(prot_err + 1); + cap_start = dvsec_start + prot_err->dvsec_len; + rec->cxl_ras = *(struct cxl_ras_capability_regs *)cap_start; + + /* + * Set device serial number unconditionally. + * + * Print a warning message if it is not valid. The device serial + * number is required for CXL RCD, CXL SLD, CXL LD and CXL Fabric + * Manager Managed LD. + */ + if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER) || + !(agent_info[prot_err->agent_type].req_sn)) + pr_warn(FW_WARN "No Device Serial number\n"); + + rec->lower_dw = prot_err->dev_serial_num.lower_dw; + rec->upper_dw = prot_err->dev_serial_num.upper_dw; + + rec->severity = cper_severity_cxl_aer(gdata->error_severity); + + return 0; +} diff --git a/include/cxl/event.h b/include/cxl/event.h index 57b4630568f6..5b316150556a 100644 --- a/include/cxl/event.h +++ b/include/cxl/event.h @@ -158,11 +158,37 @@ struct cxl_ras_capability_regs { u32 header_log[16]; }; +enum cxl_aer_err_type { + CXL_AER_UNCORRECTABLE, + CXL_AER_CORRECTABLE, +}; + +struct cxl_cper_prot_err { + struct cxl_ras_capability_regs cxl_ras; + + /* Device ID */ + u8 function; + u8 device; + u8 bus; + u16 segment; + + /* Device Serial Number */ + u32 lower_dw; + u32 upper_dw; + + int severity; +}; + struct cxl_cper_work_data { enum cxl_event_type event_type; struct cxl_cper_event_rec rec; + struct cxl_cper_prot_err p_rec; }; +struct acpi_hest_generic_data; +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, + struct cxl_cper_prot_err *rec); + #ifdef CONFIG_ACPI_APEI_GHES int cxl_cper_register_work(struct work_struct *work); int cxl_cper_unregister_work(struct work_struct *work); -- 2.17.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors. 2024-10-01 0:52 ` [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process " Smita Koralahalli @ 2024-10-01 15:47 ` Ira Weiny 2024-10-01 17:41 ` Fan Ni 2024-10-02 23:47 ` Dan Williams 2 siblings, 0 replies; 19+ messages in thread From: Ira Weiny @ 2024-10-01 15:47 UTC (permalink / raw) To: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam, Bowman Terry Smita Koralahalli wrote: > UEFI v2.10 section N.2.13 defines a CPER record for CXL Protocol errors. > > Add GHES support to detect CXL CPER Protocol Error Record and Cache Error > Severity, Device ID, Device Serial number and CXL RAS capability struct in > struct cxl_cper_prot_err. Include this struct as a member of struct > cxl_cper_work_data. > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> > --- > v2: > Defined array of structures for Device ID and Serial number > comparison. > p_err -> rec/p_rec. > --- > drivers/acpi/apei/ghes.c | 10 +++ > drivers/firmware/efi/cper_cxl.c | 115 ++++++++++++++++++++++++++++++++ > include/cxl/event.h | 26 ++++++++ > 3 files changed, 151 insertions(+) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index ada93cfde9ba..9dcf0f78458f 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -717,6 +717,14 @@ static void cxl_cper_post_event(enum cxl_event_type event_type, > schedule_work(cxl_cper_work); > } > > +static void cxl_cper_handle_prot_err(struct acpi_hest_generic_data *gdata) > +{ > + struct cxl_cper_work_data wd; > + > + if (cxl_cper_handle_prot_err_info(gdata, &wd.p_rec)) > + return; > +} This is a bit odd. One could call cxl_cper_handle_prot_err_info() directly from ghes_do_proc() in this patch. Then add cxl_cper_handle_prot_err() in patch 4/4 but either way is fine with me. Reviewed-by: Ira Weiny <ira.weiny@intel.com> > + > int cxl_cper_register_work(struct work_struct *work) > { > if (cxl_cper_work) > @@ -791,6 +799,8 @@ static bool ghes_do_proc(struct ghes *ghes, > struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata); > > cxl_cper_post_event(CXL_CPER_EVENT_MEM_MODULE, rec); > + } else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) { > + cxl_cper_handle_prot_err(gdata); > } else { > void *err = acpi_hest_get_payload(gdata); > > diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c > index 4fd8d783993e..08da7764c066 100644 > --- a/drivers/firmware/efi/cper_cxl.c > +++ b/drivers/firmware/efi/cper_cxl.c > @@ -8,6 +8,7 @@ > */ > > #include <linux/cper.h> > +#include <acpi/ghes.h> > #include "cper_cxl.h" > > #define PROT_ERR_VALID_AGENT_TYPE BIT_ULL(0) > @@ -44,6 +45,66 @@ enum { > USP, /* CXL Upstream Switch Port */ > }; > > +struct agent_info { > + const char *string; > + bool req_sn; > + bool req_sbdf; > +}; > + > +static const struct agent_info agent_info[] = { > + [RCD] = { > + .string = "Restricted CXL Device", > + .req_sbdf = true, > + .req_sn = true, > + }, > + [RCH_DP] = { > + .string = "Restricted CXL Host Downstream Port", > + .req_sbdf = false, > + .req_sn = false, > + }, > + [DEVICE] = { > + .string = "CXL Device", > + .req_sbdf = true, > + .req_sn = true, > + }, > + [LD] = { > + .string = "CXL Logical Device", > + .req_sbdf = true, > + .req_sn = true, > + }, > + [FMLD] = { > + .string = "CXL Fabric Manager managed Logical Device", > + .req_sbdf = true, > + .req_sn = true, > + }, > + [RP] = { > + .string = "CXL Root Port", > + .req_sbdf = true, > + .req_sn = false, > + }, > + [DSP] = { > + .string = "CXL Downstream Switch Port", > + .req_sbdf = true, > + .req_sn = false, > + }, > + [USP] = { > + .string = "CXL Upstream Switch Port", > + .req_sbdf = true, > + .req_sn = false, > + }, > +}; > + > +static enum cxl_aer_err_type cper_severity_cxl_aer(int cper_severity) > +{ > + switch (cper_severity) { > + case CPER_SEV_RECOVERABLE: > + case CPER_SEV_FATAL: > + return CXL_AER_UNCORRECTABLE; > + default: > + return CXL_AER_CORRECTABLE; > + } > +} > + > void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_err) > { > if (prot_err->valid_bits & PROT_ERR_VALID_AGENT_TYPE) > @@ -176,3 +237,57 @@ void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_e > sizeof(cxl_ras->header_log), 0); > } > } > + > +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, > + struct cxl_cper_prot_err *rec) > +{ > + struct cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata); > + u8 *dvsec_start, *cap_start; > + > + if (!(prot_err->valid_bits & PROT_ERR_VALID_DEVICE_ID)) { > + pr_err(FW_WARN "No Device ID\n"); > + return -EINVAL; > + } > + > + /* > + * The device ID or agent address is required for CXL RCD, CXL > + * SLD, CXL LD, CXL Fabric Manager Managed LD, CXL Root Port, > + * CXL Downstream Switch Port and CXL Upstream Switch Port. > + */ > + if (!(agent_info[prot_err->agent_type].req_sbdf)) { > + pr_err(FW_WARN "Invalid agent type\n"); > + return -EINVAL; > + } > + > + rec->segment = prot_err->agent_addr.segment; > + rec->bus = prot_err->agent_addr.bus; > + rec->device = prot_err->agent_addr.device; > + rec->function = prot_err->agent_addr.function; > + > + if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) { > + pr_err(FW_WARN "Invalid Protocol Error log\n"); > + return -EINVAL; > + } > + > + dvsec_start = (u8 *)(prot_err + 1); > + cap_start = dvsec_start + prot_err->dvsec_len; > + rec->cxl_ras = *(struct cxl_ras_capability_regs *)cap_start; > + > + /* > + * Set device serial number unconditionally. > + * > + * Print a warning message if it is not valid. The device serial > + * number is required for CXL RCD, CXL SLD, CXL LD and CXL Fabric > + * Manager Managed LD. > + */ > + if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER) || > + !(agent_info[prot_err->agent_type].req_sn)) > + pr_warn(FW_WARN "No Device Serial number\n"); > + > + rec->lower_dw = prot_err->dev_serial_num.lower_dw; > + rec->upper_dw = prot_err->dev_serial_num.upper_dw; > + > + rec->severity = cper_severity_cxl_aer(gdata->error_severity); > + > + return 0; > +} > diff --git a/include/cxl/event.h b/include/cxl/event.h > index 57b4630568f6..5b316150556a 100644 > --- a/include/cxl/event.h > +++ b/include/cxl/event.h > @@ -158,11 +158,37 @@ struct cxl_ras_capability_regs { > u32 header_log[16]; > }; > > +enum cxl_aer_err_type { > + CXL_AER_UNCORRECTABLE, > + CXL_AER_CORRECTABLE, > +}; > + > +struct cxl_cper_prot_err { > + struct cxl_ras_capability_regs cxl_ras; > + > + /* Device ID */ > + u8 function; > + u8 device; > + u8 bus; > + u16 segment; > + > + /* Device Serial Number */ > + u32 lower_dw; > + u32 upper_dw; > + > + int severity; > +}; > + > struct cxl_cper_work_data { > enum cxl_event_type event_type; > struct cxl_cper_event_rec rec; > + struct cxl_cper_prot_err p_rec; > }; > > +struct acpi_hest_generic_data; > +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, > + struct cxl_cper_prot_err *rec); > + > #ifdef CONFIG_ACPI_APEI_GHES > int cxl_cper_register_work(struct work_struct *work); > int cxl_cper_unregister_work(struct work_struct *work); > -- > 2.17.1 > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors. 2024-10-01 0:52 ` [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process " Smita Koralahalli 2024-10-01 15:47 ` Ira Weiny @ 2024-10-01 17:41 ` Fan Ni 2024-10-03 19:19 ` Smita Koralahalli 2024-10-02 23:47 ` Dan Williams 2 siblings, 1 reply; 19+ messages in thread From: Fan Ni @ 2024-10-01 17:41 UTC (permalink / raw) To: Smita Koralahalli Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam, Bowman Terry On Tue, Oct 01, 2024 at 12:52:33AM +0000, Smita Koralahalli wrote: > UEFI v2.10 section N.2.13 defines a CPER record for CXL Protocol errors. > > Add GHES support to detect CXL CPER Protocol Error Record and Cache Error > Severity, Device ID, Device Serial number and CXL RAS capability struct in > struct cxl_cper_prot_err. Include this struct as a member of struct > cxl_cper_work_data. > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> > --- > v2: > Defined array of structures for Device ID and Serial number > comparison. > p_err -> rec/p_rec. > --- > drivers/acpi/apei/ghes.c | 10 +++ > drivers/firmware/efi/cper_cxl.c | 115 ++++++++++++++++++++++++++++++++ > include/cxl/event.h | 26 ++++++++ > 3 files changed, 151 insertions(+) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index ada93cfde9ba..9dcf0f78458f 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -717,6 +717,14 @@ static void cxl_cper_post_event(enum cxl_event_type event_type, > schedule_work(cxl_cper_work); > } > > +static void cxl_cper_handle_prot_err(struct acpi_hest_generic_data *gdata) > +{ > + struct cxl_cper_work_data wd; > + > + if (cxl_cper_handle_prot_err_info(gdata, &wd.p_rec)) > + return; > +} Why we need a if here? It seems the function will return anyway. Fan > + > int cxl_cper_register_work(struct work_struct *work) > { > if (cxl_cper_work) > @@ -791,6 +799,8 @@ static bool ghes_do_proc(struct ghes *ghes, > struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata); > > cxl_cper_post_event(CXL_CPER_EVENT_MEM_MODULE, rec); > + } else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) { > + cxl_cper_handle_prot_err(gdata); > } else { > void *err = acpi_hest_get_payload(gdata); > > diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c > index 4fd8d783993e..08da7764c066 100644 > --- a/drivers/firmware/efi/cper_cxl.c > +++ b/drivers/firmware/efi/cper_cxl.c > @@ -8,6 +8,7 @@ > */ > > #include <linux/cper.h> > +#include <acpi/ghes.h> > #include "cper_cxl.h" > > #define PROT_ERR_VALID_AGENT_TYPE BIT_ULL(0) > @@ -44,6 +45,66 @@ enum { > USP, /* CXL Upstream Switch Port */ > }; > > +struct agent_info { > + const char *string; > + bool req_sn; > + bool req_sbdf; > +}; > + > +static const struct agent_info agent_info[] = { > + [RCD] = { > + .string = "Restricted CXL Device", > + .req_sbdf = true, > + .req_sn = true, > + }, > + [RCH_DP] = { > + .string = "Restricted CXL Host Downstream Port", > + .req_sbdf = false, > + .req_sn = false, > + }, > + [DEVICE] = { > + .string = "CXL Device", > + .req_sbdf = true, > + .req_sn = true, > + }, > + [LD] = { > + .string = "CXL Logical Device", > + .req_sbdf = true, > + .req_sn = true, > + }, > + [FMLD] = { > + .string = "CXL Fabric Manager managed Logical Device", > + .req_sbdf = true, > + .req_sn = true, > + }, > + [RP] = { > + .string = "CXL Root Port", > + .req_sbdf = true, > + .req_sn = false, > + }, > + [DSP] = { > + .string = "CXL Downstream Switch Port", > + .req_sbdf = true, > + .req_sn = false, > + }, > + [USP] = { > + .string = "CXL Upstream Switch Port", > + .req_sbdf = true, > + .req_sn = false, > + }, > +}; > + > +static enum cxl_aer_err_type cper_severity_cxl_aer(int cper_severity) > +{ > + switch (cper_severity) { > + case CPER_SEV_RECOVERABLE: > + case CPER_SEV_FATAL: > + return CXL_AER_UNCORRECTABLE; > + default: > + return CXL_AER_CORRECTABLE; > + } > +} > + > void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_err) > { > if (prot_err->valid_bits & PROT_ERR_VALID_AGENT_TYPE) > @@ -176,3 +237,57 @@ void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_e > sizeof(cxl_ras->header_log), 0); > } > } > + > +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, > + struct cxl_cper_prot_err *rec) > +{ > + struct cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata); > + u8 *dvsec_start, *cap_start; > + > + if (!(prot_err->valid_bits & PROT_ERR_VALID_DEVICE_ID)) { > + pr_err(FW_WARN "No Device ID\n"); > + return -EINVAL; > + } > + > + /* > + * The device ID or agent address is required for CXL RCD, CXL > + * SLD, CXL LD, CXL Fabric Manager Managed LD, CXL Root Port, > + * CXL Downstream Switch Port and CXL Upstream Switch Port. > + */ > + if (!(agent_info[prot_err->agent_type].req_sbdf)) { > + pr_err(FW_WARN "Invalid agent type\n"); > + return -EINVAL; > + } > + > + rec->segment = prot_err->agent_addr.segment; > + rec->bus = prot_err->agent_addr.bus; > + rec->device = prot_err->agent_addr.device; > + rec->function = prot_err->agent_addr.function; > + > + if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) { > + pr_err(FW_WARN "Invalid Protocol Error log\n"); > + return -EINVAL; > + } > + > + dvsec_start = (u8 *)(prot_err + 1); > + cap_start = dvsec_start + prot_err->dvsec_len; > + rec->cxl_ras = *(struct cxl_ras_capability_regs *)cap_start; > + > + /* > + * Set device serial number unconditionally. > + * > + * Print a warning message if it is not valid. The device serial > + * number is required for CXL RCD, CXL SLD, CXL LD and CXL Fabric > + * Manager Managed LD. > + */ > + if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER) || > + !(agent_info[prot_err->agent_type].req_sn)) > + pr_warn(FW_WARN "No Device Serial number\n"); > + > + rec->lower_dw = prot_err->dev_serial_num.lower_dw; > + rec->upper_dw = prot_err->dev_serial_num.upper_dw; > + > + rec->severity = cper_severity_cxl_aer(gdata->error_severity); > + > + return 0; > +} > diff --git a/include/cxl/event.h b/include/cxl/event.h > index 57b4630568f6..5b316150556a 100644 > --- a/include/cxl/event.h > +++ b/include/cxl/event.h > @@ -158,11 +158,37 @@ struct cxl_ras_capability_regs { > u32 header_log[16]; > }; > > +enum cxl_aer_err_type { > + CXL_AER_UNCORRECTABLE, > + CXL_AER_CORRECTABLE, > +}; > + > +struct cxl_cper_prot_err { > + struct cxl_ras_capability_regs cxl_ras; > + > + /* Device ID */ > + u8 function; > + u8 device; > + u8 bus; > + u16 segment; > + > + /* Device Serial Number */ > + u32 lower_dw; > + u32 upper_dw; > + > + int severity; > +}; > + > struct cxl_cper_work_data { > enum cxl_event_type event_type; > struct cxl_cper_event_rec rec; > + struct cxl_cper_prot_err p_rec; > }; > > +struct acpi_hest_generic_data; > +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, > + struct cxl_cper_prot_err *rec); > + > #ifdef CONFIG_ACPI_APEI_GHES > int cxl_cper_register_work(struct work_struct *work); > int cxl_cper_unregister_work(struct work_struct *work); > -- > 2.17.1 > -- Fan Ni ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors. 2024-10-01 17:41 ` Fan Ni @ 2024-10-03 19:19 ` Smita Koralahalli 0 siblings, 0 replies; 19+ messages in thread From: Smita Koralahalli @ 2024-10-03 19:19 UTC (permalink / raw) To: Fan Ni Cc: linux-efi, linux-kernel, linux-cxl, Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam, Bowman Terry On 10/1/2024 10:41 AM, Fan Ni wrote: > On Tue, Oct 01, 2024 at 12:52:33AM +0000, Smita Koralahalli wrote: >> UEFI v2.10 section N.2.13 defines a CPER record for CXL Protocol errors. >> >> Add GHES support to detect CXL CPER Protocol Error Record and Cache Error >> Severity, Device ID, Device Serial number and CXL RAS capability struct in >> struct cxl_cper_prot_err. Include this struct as a member of struct >> cxl_cper_work_data. >> >> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> >> --- >> v2: >> Defined array of structures for Device ID and Serial number >> comparison. >> p_err -> rec/p_rec. >> --- >> drivers/acpi/apei/ghes.c | 10 +++ >> drivers/firmware/efi/cper_cxl.c | 115 ++++++++++++++++++++++++++++++++ >> include/cxl/event.h | 26 ++++++++ >> 3 files changed, 151 insertions(+) >> >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c >> index ada93cfde9ba..9dcf0f78458f 100644 >> --- a/drivers/acpi/apei/ghes.c >> +++ b/drivers/acpi/apei/ghes.c >> @@ -717,6 +717,14 @@ static void cxl_cper_post_event(enum cxl_event_type event_type, >> schedule_work(cxl_cper_work); >> } >> >> +static void cxl_cper_handle_prot_err(struct acpi_hest_generic_data *gdata) >> +{ >> + struct cxl_cper_work_data wd; >> + >> + if (cxl_cper_handle_prot_err_info(gdata, &wd.p_rec)) >> + return; >> +} > > Why we need a if here? It seems the function will return anyway. > > Fan Yeah, it doesn't make sense here without 4/4. Thanks Smita [snip] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors. 2024-10-01 0:52 ` [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process " Smita Koralahalli 2024-10-01 15:47 ` Ira Weiny 2024-10-01 17:41 ` Fan Ni @ 2024-10-02 23:47 ` Dan Williams 2024-10-03 19:15 ` Smita Koralahalli 2 siblings, 1 reply; 19+ messages in thread From: Dan Williams @ 2024-10-02 23:47 UTC (permalink / raw) To: Smita Koralahalli, linux-efi, linux-kernel, linux-cxl Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams, Jonathan Cameron, Yazen Ghannam, Bowman Terry Smita Koralahalli wrote: > UEFI v2.10 section N.2.13 defines a CPER record for CXL Protocol errors. > > Add GHES support to detect CXL CPER Protocol Error Record and Cache Error > Severity, Device ID, Device Serial number and CXL RAS capability struct in > struct cxl_cper_prot_err. Include this struct as a member of struct > cxl_cper_work_data. > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> > --- > v2: > Defined array of structures for Device ID and Serial number > comparison. > p_err -> rec/p_rec. > --- > drivers/acpi/apei/ghes.c | 10 +++ > drivers/firmware/efi/cper_cxl.c | 115 ++++++++++++++++++++++++++++++++ > include/cxl/event.h | 26 ++++++++ > 3 files changed, 151 insertions(+) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index ada93cfde9ba..9dcf0f78458f 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -717,6 +717,14 @@ static void cxl_cper_post_event(enum cxl_event_type event_type, > schedule_work(cxl_cper_work); > } > > +static void cxl_cper_handle_prot_err(struct acpi_hest_generic_data *gdata) > +{ > + struct cxl_cper_work_data wd; > + > + if (cxl_cper_handle_prot_err_info(gdata, &wd.p_rec)) > + return; > +} > + > int cxl_cper_register_work(struct work_struct *work) > { > if (cxl_cper_work) > @@ -791,6 +799,8 @@ static bool ghes_do_proc(struct ghes *ghes, > struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata); > > cxl_cper_post_event(CXL_CPER_EVENT_MEM_MODULE, rec); > + } else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) { > + cxl_cper_handle_prot_err(gdata); I would prefer this follow the format of cxl_cper_post_event and pass a 'struct cxl_cper_sec_prot_err *' directly. > } else { > void *err = acpi_hest_get_payload(gdata); > > diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c > index 4fd8d783993e..08da7764c066 100644 > --- a/drivers/firmware/efi/cper_cxl.c > +++ b/drivers/firmware/efi/cper_cxl.c > @@ -8,6 +8,7 @@ > */ > > #include <linux/cper.h> > +#include <acpi/ghes.h> > #include "cper_cxl.h" > > #define PROT_ERR_VALID_AGENT_TYPE BIT_ULL(0) > @@ -44,6 +45,66 @@ enum { > USP, /* CXL Upstream Switch Port */ > }; > > +struct agent_info { > + const char *string; > + bool req_sn; > + bool req_sbdf; > +}; > + > +static const struct agent_info agent_info[] = { > + [RCD] = { > + .string = "Restricted CXL Device", > + .req_sbdf = true, > + .req_sn = true, > + }, > + [RCH_DP] = { > + .string = "Restricted CXL Host Downstream Port", > + .req_sbdf = false, > + .req_sn = false, > + }, > + [DEVICE] = { > + .string = "CXL Device", > + .req_sbdf = true, > + .req_sn = true, > + }, > + [LD] = { > + .string = "CXL Logical Device", > + .req_sbdf = true, > + .req_sn = true, > + }, > + [FMLD] = { > + .string = "CXL Fabric Manager managed Logical Device", > + .req_sbdf = true, > + .req_sn = true, > + }, > + [RP] = { > + .string = "CXL Root Port", > + .req_sbdf = true, > + .req_sn = false, > + }, > + [DSP] = { > + .string = "CXL Downstream Switch Port", > + .req_sbdf = true, > + .req_sn = false, > + }, > + [USP] = { > + .string = "CXL Upstream Switch Port", > + .req_sbdf = true, > + .req_sn = false, > + }, > +}; > + > +static enum cxl_aer_err_type cper_severity_cxl_aer(int cper_severity) > +{ > + switch (cper_severity) { > + case CPER_SEV_RECOVERABLE: > + case CPER_SEV_FATAL: > + return CXL_AER_UNCORRECTABLE; > + default: > + return CXL_AER_CORRECTABLE; > + } Why does the CPER severity need to be converted to a new CXL specific enum value? > +} > + > void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_err) > { > if (prot_err->valid_bits & PROT_ERR_VALID_AGENT_TYPE) > @@ -176,3 +237,57 @@ void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_e > sizeof(cxl_ras->header_log), 0); > } > } > + > +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, > + struct cxl_cper_prot_err *rec) > +{ > + struct cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata); Late feedback, but cper_sec_prot_err is too generic of a name. Lets make if cxl_cper_sec_prot_err similar to cxl_cper_event_rec. > + u8 *dvsec_start, *cap_start; > + > + if (!(prot_err->valid_bits & PROT_ERR_VALID_DEVICE_ID)) { > + pr_err(FW_WARN "No Device ID\n"); This should be pr_err_ratelimited(). This feedback likely also applies to the existing support, but I think protocol errors are even more likely than component errors to be bursty and persistent. This error message and all the others should clarify that they are coming from the CXL CPER code with something like: #define pr_fmt(fmt) "cxl/cper: " fmt ...at the top of the file. > + return -EINVAL; > + } > + > + /* > + * The device ID or agent address is required for CXL RCD, CXL > + * SLD, CXL LD, CXL Fabric Manager Managed LD, CXL Root Port, > + * CXL Downstream Switch Port and CXL Upstream Switch Port. > + */ > + if (!(agent_info[prot_err->agent_type].req_sbdf)) { > + pr_err(FW_WARN "Invalid agent type\n"); > + return -EINVAL; > + } All CPER records without a device-id have already been dropped above, so why reject agent-types that do not require a device-id here? I think this agent_info[] scheme makes the code more difficult to read especially since agent_info() is only consulted a couple times. Just put a "switch (prot_err->agent_type)" in the code directly and skip the indirection. > + > + rec->segment = prot_err->agent_addr.segment; > + rec->bus = prot_err->agent_addr.bus; > + rec->device = prot_err->agent_addr.device; > + rec->function = prot_err->agent_addr.function; > + > + if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) { > + pr_err(FW_WARN "Invalid Protocol Error log\n"); > + return -EINVAL; > + } > + > + dvsec_start = (u8 *)(prot_err + 1); > + cap_start = dvsec_start + prot_err->dvsec_len; > + rec->cxl_ras = *(struct cxl_ras_capability_regs *)cap_start; Where is the validation that the size of the RAS field matches expectations? I.e. what if the BIOS builds a bad error record? > + > + /* > + * Set device serial number unconditionally. > + * > + * Print a warning message if it is not valid. The device serial > + * number is required for CXL RCD, CXL SLD, CXL LD and CXL Fabric > + * Manager Managed LD. > + */ > + if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER) || > + !(agent_info[prot_err->agent_type].req_sn)) > + pr_warn(FW_WARN "No Device Serial number\n"); > + > + rec->lower_dw = prot_err->dev_serial_num.lower_dw; > + rec->upper_dw = prot_err->dev_serial_num.upper_dw; Serial numbers are u64s, so if any conversion is to be done here it should be from upper+lower to a u64, but then again see below on my question about why a new cxl_cper_prot_err is being added. > + > + rec->severity = cper_severity_cxl_aer(gdata->error_severity); > + > + return 0; > +} > diff --git a/include/cxl/event.h b/include/cxl/event.h > index 57b4630568f6..5b316150556a 100644 > --- a/include/cxl/event.h > +++ b/include/cxl/event.h > @@ -158,11 +158,37 @@ struct cxl_ras_capability_regs { > u32 header_log[16]; > }; > > +enum cxl_aer_err_type { > + CXL_AER_UNCORRECTABLE, > + CXL_AER_CORRECTABLE, > +}; > + > +struct cxl_cper_prot_err { > + struct cxl_ras_capability_regs cxl_ras; > + > + /* Device ID */ > + u8 function; > + u8 device; > + u8 bus; > + u16 segment; > + > + /* Device Serial Number */ > + u32 lower_dw; > + u32 upper_dw; > + > + int severity; > +}; Hmm, 'struct cxl_cper_event_rec' follows the raw format of the record from the specification, and 'struct cxl_cper_sec_prot_err' (formerly cper_sec_prot_err) already exists, so why is this new intermediate data structure needed? > + > struct cxl_cper_work_data { > enum cxl_event_type event_type; > struct cxl_cper_event_rec rec; > + struct cxl_cper_prot_err p_rec; > }; > > +struct acpi_hest_generic_data; > +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, > + struct cxl_cper_prot_err *rec); > + > #ifdef CONFIG_ACPI_APEI_GHES > int cxl_cper_register_work(struct work_struct *work); > int cxl_cper_unregister_work(struct work_struct *work); > -- > 2.17.1 > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors. 2024-10-02 23:47 ` Dan Williams @ 2024-10-03 19:15 ` Smita Koralahalli 2024-10-03 23:21 ` Dan Williams 0 siblings, 1 reply; 19+ messages in thread From: Smita Koralahalli @ 2024-10-03 19:15 UTC (permalink / raw) To: Dan Williams, linux-efi, linux-kernel, linux-cxl Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Jonathan Cameron, Yazen Ghannam, Bowman Terry On 10/2/2024 4:47 PM, Dan Williams wrote: > Smita Koralahalli wrote: >> UEFI v2.10 section N.2.13 defines a CPER record for CXL Protocol errors. >> >> Add GHES support to detect CXL CPER Protocol Error Record and Cache Error >> Severity, Device ID, Device Serial number and CXL RAS capability struct in >> struct cxl_cper_prot_err. Include this struct as a member of struct >> cxl_cper_work_data. >> >> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> >> --- >> v2: >> Defined array of structures for Device ID and Serial number >> comparison. >> p_err -> rec/p_rec. >> --- >> drivers/acpi/apei/ghes.c | 10 +++ >> drivers/firmware/efi/cper_cxl.c | 115 ++++++++++++++++++++++++++++++++ >> include/cxl/event.h | 26 ++++++++ >> 3 files changed, 151 insertions(+) >> >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c >> index ada93cfde9ba..9dcf0f78458f 100644 >> --- a/drivers/acpi/apei/ghes.c >> +++ b/drivers/acpi/apei/ghes.c >> @@ -717,6 +717,14 @@ static void cxl_cper_post_event(enum cxl_event_type event_type, >> schedule_work(cxl_cper_work); >> } >> >> +static void cxl_cper_handle_prot_err(struct acpi_hest_generic_data *gdata) >> +{ >> + struct cxl_cper_work_data wd; >> + >> + if (cxl_cper_handle_prot_err_info(gdata, &wd.p_rec)) >> + return; >> +} >> + >> int cxl_cper_register_work(struct work_struct *work) >> { >> if (cxl_cper_work) >> @@ -791,6 +799,8 @@ static bool ghes_do_proc(struct ghes *ghes, >> struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata); >> >> cxl_cper_post_event(CXL_CPER_EVENT_MEM_MODULE, rec); >> + } else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) { >> + cxl_cper_handle_prot_err(gdata); > > I would prefer this follow the format of cxl_cper_post_event and pass a > 'struct cxl_cper_sec_prot_err *' directly. Sure. > >> } else { >> void *err = acpi_hest_get_payload(gdata); >> >> diff --git a/drivers/firmware/efi/cper_cxl.c b/drivers/firmware/efi/cper_cxl.c >> index 4fd8d783993e..08da7764c066 100644 >> --- a/drivers/firmware/efi/cper_cxl.c >> +++ b/drivers/firmware/efi/cper_cxl.c >> @@ -8,6 +8,7 @@ >> */ >> >> #include <linux/cper.h> >> +#include <acpi/ghes.h> >> #include "cper_cxl.h" >> >> #define PROT_ERR_VALID_AGENT_TYPE BIT_ULL(0) >> @@ -44,6 +45,66 @@ enum { >> USP, /* CXL Upstream Switch Port */ >> }; >> >> +struct agent_info { >> + const char *string; >> + bool req_sn; >> + bool req_sbdf; >> +}; >> + >> +static const struct agent_info agent_info[] = { >> + [RCD] = { >> + .string = "Restricted CXL Device", >> + .req_sbdf = true, >> + .req_sn = true, >> + }, >> + [RCH_DP] = { >> + .string = "Restricted CXL Host Downstream Port", >> + .req_sbdf = false, >> + .req_sn = false, >> + }, >> + [DEVICE] = { >> + .string = "CXL Device", >> + .req_sbdf = true, >> + .req_sn = true, >> + }, >> + [LD] = { >> + .string = "CXL Logical Device", >> + .req_sbdf = true, >> + .req_sn = true, >> + }, >> + [FMLD] = { >> + .string = "CXL Fabric Manager managed Logical Device", >> + .req_sbdf = true, >> + .req_sn = true, >> + }, >> + [RP] = { >> + .string = "CXL Root Port", >> + .req_sbdf = true, >> + .req_sn = false, >> + }, >> + [DSP] = { >> + .string = "CXL Downstream Switch Port", >> + .req_sbdf = true, >> + .req_sn = false, >> + }, >> + [USP] = { >> + .string = "CXL Upstream Switch Port", >> + .req_sbdf = true, >> + .req_sn = false, >> + }, >> +}; >> + >> +static enum cxl_aer_err_type cper_severity_cxl_aer(int cper_severity) >> +{ >> + switch (cper_severity) { >> + case CPER_SEV_RECOVERABLE: >> + case CPER_SEV_FATAL: >> + return CXL_AER_UNCORRECTABLE; >> + default: >> + return CXL_AER_CORRECTABLE; >> + } > > Why does the CPER severity need to be converted to a new CXL specific > enum value? I was just following up the same convention here as done for AER.. cper_severity_to_aer(). I can change if there is no value in doing it. > >> +} >> + >> void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_err) >> { >> if (prot_err->valid_bits & PROT_ERR_VALID_AGENT_TYPE) >> @@ -176,3 +237,57 @@ void cper_print_prot_err(const char *pfx, const struct cper_sec_prot_err *prot_e >> sizeof(cxl_ras->header_log), 0); >> } >> } >> + >> +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, >> + struct cxl_cper_prot_err *rec) >> +{ >> + struct cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata); > > Late feedback, but cper_sec_prot_err is too generic of a name. Lets make > if cxl_cper_sec_prot_err similar to cxl_cper_event_rec. okay > > >> + u8 *dvsec_start, *cap_start; >> + >> + if (!(prot_err->valid_bits & PROT_ERR_VALID_DEVICE_ID)) { >> + pr_err(FW_WARN "No Device ID\n"); > > This should be pr_err_ratelimited(). > > This feedback likely also applies to the existing support, but I think > protocol errors are even more likely than component errors to be bursty > and persistent. > > This error message and all the others should clarify that they are > coming from the CXL CPER code with something like: > > #define pr_fmt(fmt) "cxl/cper: " fmt > > ...at the top of the file. > okay >> + return -EINVAL; >> + } >> + >> + /* >> + * The device ID or agent address is required for CXL RCD, CXL >> + * SLD, CXL LD, CXL Fabric Manager Managed LD, CXL Root Port, >> + * CXL Downstream Switch Port and CXL Upstream Switch Port. >> + */ >> + if (!(agent_info[prot_err->agent_type].req_sbdf)) { >> + pr_err(FW_WARN "Invalid agent type\n"); >> + return -EINVAL; >> + } > > All CPER records without a device-id have already been dropped above, so > why reject agent-types that do not require a device-id here? > > I think this agent_info[] scheme makes the code more difficult to read > especially since agent_info() is only consulted a couple times. Just put > a "switch (prot_err->agent_type)" in the code directly and skip the > indirection. Hmm, I initially thought I would do switch case and then changed it to if-else thinking that would look cleaner. What would you suggest? Just incorporate switch case similar to cper_print_prot_err() here as well or clean up switch case in cper_print_prot_err() and reuse the agent_info[] there? This agent_info[] would include 3 fields then, two as above and another for valid_cap. Maybe unify sbdf with valid_cap.. > >> + >> + rec->segment = prot_err->agent_addr.segment; >> + rec->bus = prot_err->agent_addr.bus; >> + rec->device = prot_err->agent_addr.device; >> + rec->function = prot_err->agent_addr.function; >> + >> + if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) { >> + pr_err(FW_WARN "Invalid Protocol Error log\n"); >> + return -EINVAL; >> + } >> + >> + dvsec_start = (u8 *)(prot_err + 1); >> + cap_start = dvsec_start + prot_err->dvsec_len; >> + rec->cxl_ras = *(struct cxl_ras_capability_regs *)cap_start; > > Where is the validation that the size of the RAS field matches > expectations? I.e. what if the BIOS builds a bad error record? Will include check for size. > >> + >> + /* >> + * Set device serial number unconditionally. >> + * >> + * Print a warning message if it is not valid. The device serial >> + * number is required for CXL RCD, CXL SLD, CXL LD and CXL Fabric >> + * Manager Managed LD. >> + */ >> + if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER) || >> + !(agent_info[prot_err->agent_type].req_sn)) >> + pr_warn(FW_WARN "No Device Serial number\n"); >> + >> + rec->lower_dw = prot_err->dev_serial_num.lower_dw; >> + rec->upper_dw = prot_err->dev_serial_num.upper_dw; > > Serial numbers are u64s, so if any conversion is to be done here it > should be from upper+lower to a u64, but then again see below on my > question about why a new cxl_cper_prot_err is being added. > >> + >> + rec->severity = cper_severity_cxl_aer(gdata->error_severity); >> + >> + return 0; >> +} >> diff --git a/include/cxl/event.h b/include/cxl/event.h >> index 57b4630568f6..5b316150556a 100644 >> --- a/include/cxl/event.h >> +++ b/include/cxl/event.h >> @@ -158,11 +158,37 @@ struct cxl_ras_capability_regs { >> u32 header_log[16]; >> }; >> >> +enum cxl_aer_err_type { >> + CXL_AER_UNCORRECTABLE, >> + CXL_AER_CORRECTABLE, >> +}; >> + >> +struct cxl_cper_prot_err { >> + struct cxl_ras_capability_regs cxl_ras; >> + >> + /* Device ID */ >> + u8 function; >> + u8 device; >> + u8 bus; >> + u16 segment; >> + >> + /* Device Serial Number */ >> + u32 lower_dw; >> + u32 upper_dw; >> + >> + int severity; >> +}; > > Hmm, 'struct cxl_cper_event_rec' follows the raw format of the record > from the specification, and 'struct cxl_cper_sec_prot_err' (formerly > cper_sec_prot_err) already exists, so why is this new intermediate data > structure needed? Yeah, the intention was to extract only necessary info to be consumed by cxl_pci driver and to be passed to cxl_cper_fifo. Going forward, while handling protocol errors separately, I can send the entire cxl_cper_sec_prot_err. Thanks Smita > >> + >> struct cxl_cper_work_data { >> enum cxl_event_type event_type; >> struct cxl_cper_event_rec rec; >> + struct cxl_cper_prot_err p_rec; > >> }; >> >> +struct acpi_hest_generic_data; >> +int cxl_cper_handle_prot_err_info(struct acpi_hest_generic_data *gdata, >> + struct cxl_cper_prot_err *rec); >> + >> #ifdef CONFIG_ACPI_APEI_GHES >> int cxl_cper_register_work(struct work_struct *work); >> int cxl_cper_unregister_work(struct work_struct *work); >> -- >> 2.17.1 >> > > > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors. 2024-10-03 19:15 ` Smita Koralahalli @ 2024-10-03 23:21 ` Dan Williams 0 siblings, 0 replies; 19+ messages in thread From: Dan Williams @ 2024-10-03 23:21 UTC (permalink / raw) To: Smita Koralahalli, Dan Williams, linux-efi, linux-kernel, linux-cxl Cc: Ard Biesheuvel, Alison Schofield, Vishal Verma, Ira Weiny, Jonathan Cameron, Yazen Ghannam, Bowman Terry Smita Koralahalli wrote: > On 10/2/2024 4:47 PM, Dan Williams wrote: > > Smita Koralahalli wrote: > >> UEFI v2.10 section N.2.13 defines a CPER record for CXL Protocol errors. > >> > >> Add GHES support to detect CXL CPER Protocol Error Record and Cache Error > >> Severity, Device ID, Device Serial number and CXL RAS capability struct in > >> struct cxl_cper_prot_err. Include this struct as a member of struct > >> cxl_cper_work_data. > >> > >> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> > >> --- > >> v2: > >> Defined array of structures for Device ID and Serial number > >> comparison. > >> p_err -> rec/p_rec. > >> --- > >> drivers/acpi/apei/ghes.c | 10 +++ > >> drivers/firmware/efi/cper_cxl.c | 115 ++++++++++++++++++++++++++++++++ > >> include/cxl/event.h | 26 ++++++++ > >> 3 files changed, 151 insertions(+) > >> [..] > >> +static enum cxl_aer_err_type cper_severity_cxl_aer(int cper_severity) > >> +{ > >> + switch (cper_severity) { > >> + case CPER_SEV_RECOVERABLE: > >> + case CPER_SEV_FATAL: > >> + return CXL_AER_UNCORRECTABLE; > >> + default: > >> + return CXL_AER_CORRECTABLE; > >> + } > > > > Why does the CPER severity need to be converted to a new CXL specific > > enum value? > > I was just following up the same convention here as done for AER.. > cper_severity_to_aer(). I can change if there is no value in doing it. I think because PCIe and CXL are both using AER as the transport they can both use cper_severity_to_aer(), or at a minimum do not introduce 'enum cxl_aer_err_type' and instead use the existing AER_ values. [..] > > All CPER records without a device-id have already been dropped above, so > > why reject agent-types that do not require a device-id here? > > > > I think this agent_info[] scheme makes the code more difficult to read > > especially since agent_info() is only consulted a couple times. Just put > > a "switch (prot_err->agent_type)" in the code directly and skip the > > indirection. > > Hmm, I initially thought I would do switch case and then changed it to > if-else thinking that would look cleaner. > > What would you suggest? Just incorporate switch case similar to > cper_print_prot_err() here as well or clean up switch case in > cper_print_prot_err() and reuse the agent_info[] there? Even though it is more lines, I find cper_print_prot_err() easier to read because I do not need to switch back and forth to the agent_info definition. > This agent_info[] would include 3 fields then, two as above and another > for valid_cap. Maybe unify sbdf with valid_cap.. agent_info[] ends up being code-logic masquerading as a data-structure. Keep the logic all in one coherent readable block. [..] > > Hmm, 'struct cxl_cper_event_rec' follows the raw format of the record > > from the specification, and 'struct cxl_cper_sec_prot_err' (formerly > > cper_sec_prot_err) already exists, so why is this new intermediate data > > structure needed? > > Yeah, the intention was to extract only necessary info to be consumed by > cxl_pci driver and to be passed to cxl_cper_fifo. > > Going forward, while handling protocol errors separately, I can send the > entire cxl_cper_sec_prot_err. I think it defensible to send all of it. Recall that the real consumer is not cxl_pci it is rasdaemon in userspace. The kernel is not in a good position to filter what rasdaemon expects to find in the record based on the EFI specification. ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-10-03 23:21 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-01-09 3:47 [PATCH v2 0/4] acpi/ghes, cper, cxl: Trace FW-First CXL Protocol Errors Smita Koralahalli 2024-01-09 3:47 ` [PATCH v2 1/4] acpi/ghes, cxl: Create a common CXL struct to handle different CXL CPER records Smita Koralahalli 2024-02-15 11:56 ` Jonathan Cameron 2024-01-09 3:47 ` [PATCH v2 2/4] efi/cper, cxl: Make definitions and structures global Smita Koralahalli 2024-02-15 11:58 ` Jonathan Cameron 2024-02-15 14:47 ` Ard Biesheuvel 2024-01-09 3:47 ` [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process CXL Protocol Errors Smita Koralahalli 2024-02-15 12:17 ` Jonathan Cameron 2024-01-09 3:47 ` [PATCH v2 4/4] acpi/ghes, cxl/pci: Trace FW-First " Smita Koralahalli 2024-02-15 12:22 ` Jonathan Cameron 2024-05-07 9:35 ` [PATCH v2 0/4] acpi/ghes, cper, cxl: " Fabio M. De Francesco 2024-05-16 17:59 ` Smita Koralahalli -- strict thread matches above, loose matches on Subject: below -- 2024-10-01 0:52 Smita Koralahalli 2024-10-01 0:52 ` [PATCH v2 3/4] acpi/ghes, efi/cper: Recognize and process " Smita Koralahalli 2024-10-01 15:47 ` Ira Weiny 2024-10-01 17:41 ` Fan Ni 2024-10-03 19:19 ` Smita Koralahalli 2024-10-02 23:47 ` Dan Williams 2024-10-03 19:15 ` Smita Koralahalli 2024-10-03 23:21 ` Dan Williams
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox