* [PATCH 0/4 v2] Make ELOG log and trace consistently with GHES
@ 2025-04-29 17:21 Fabio M. De Francesco
2025-04-29 17:21 ` [PATCH 1/4 v2] ACPI: extlog: Trace CPER Non-standard Section Body Fabio M. De Francesco
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Fabio M. De Francesco @ 2025-04-29 17:21 UTC (permalink / raw)
To: Rafael J . Wysocki, Len Brown, Davidlohr Bueso, Jonathan Cameron,
Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny,
Dan Williams, Mahesh J Salgaonkar, Oliver O'Halloran,
Bjorn Helgaas, Tony Luck, Borislav Petkov, linux-kernel,
linux-acpi, linux-cxl, linuxppc-dev, linux-pci, linux-edac
Cc: Fabio M. De Francesco
When Firmware First is enabled, BIOS handles errors first and then it
makes them available to the kernel via the Common Platform Error Record
(CPER) sections (UEFI 2.10 Appendix N). Linux parses the CPER sections
via one of two similar paths, either ELOG or GHES.
Currently, ELOG and GHES show some inconsistencies in how they print to
the kernel log as well as in how they report to userspace via trace
events.
Make the two mentioned paths act similarly for what relates to logging
and tracing.
--- Changes for v2 ---
- Add a patch to pass log levels to pci_print_aer() (Dan)
- Add a patch to trace CPER CXL Protocol Errors
- Rework commit messages (Dan)
- Use log_non_standard_event() (Bjorn)
--- Changes for v1 ---
- Drop the RFC prefix and restart from PATCH v1
- Drop patch 3/3 because a discussion on it has not yet been
settled
- Drop namespacing in export of pci_print_aer while() (Dan)
- Don't use '#ifdef' in *.c files (Dan)
- Drop a reference on pdev after operation is complete (Dan)
- Don't log an error message if pdev is NULL (Dan)
--- Changes for RFC v2 ---
- 0/3: rework the subject line and the letter.
- 1/3: no changes.
- 2/3: trace CPER PCIe Section only if CONFIG_ACPI_APEI_PCIEAER
is defined; the kernel test robot reported the use of two
undefined symbols because the test for the config option was
missing; rewrite the subject line and part of commit message.
- 3/3: no changes.
Fabio M. De Francesco (4):
ACPI: extlog: Trace CPER Non-standard Section Body
PCI/AER: Modify pci_print_aer() to take log level
ACPI: extlog: Trace CPER PCI Express Error Section
ACPI: extlog: Trace CPER CXL Protocol Errors
drivers/acpi/acpi_extlog.c | 96 ++++++++++++++++++++++++++++++++++++++
drivers/cxl/core/pci.c | 2 +-
drivers/cxl/core/ras.c | 6 +++
drivers/pci/pcie/aer.c | 18 +++----
drivers/ras/ras.c | 1 +
include/cxl/event.h | 2 +
include/linux/aer.h | 13 +++++-
7 files changed, 126 insertions(+), 12 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 1/4 v2] ACPI: extlog: Trace CPER Non-standard Section Body
2025-04-29 17:21 [PATCH 0/4 v2] Make ELOG log and trace consistently with GHES Fabio M. De Francesco
@ 2025-04-29 17:21 ` Fabio M. De Francesco
2025-05-20 11:08 ` Jonathan Cameron
2025-04-29 17:21 ` [PATCH 2/4 v2] PCI/AER: Modify pci_print_aer() to take log level Fabio M. De Francesco
` (2 subsequent siblings)
3 siblings, 1 reply; 11+ messages in thread
From: Fabio M. De Francesco @ 2025-04-29 17:21 UTC (permalink / raw)
To: Rafael J . Wysocki, Len Brown, Davidlohr Bueso, Jonathan Cameron,
Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny,
Dan Williams, Mahesh J Salgaonkar, Oliver O'Halloran,
Bjorn Helgaas, Tony Luck, Borislav Petkov, linux-kernel,
linux-acpi, linux-cxl, linuxppc-dev, linux-pci, linux-edac
Cc: Fabio M. De Francesco
ghes_do_proc() has a catch-all for unknown or unhandled CPER formats
(UEFI v2.10 Appendix N 2.3), extlog_print() does not. This gap was
noticed by a RAS test that injected CXL protocol errors which were
notified to extlog_print() via the IOMCA (I/O Machine Check
Architecture) mechanism. Bring parity to the extlog_print() path by
including a similar log_non_standard_event().
Cc: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
---
drivers/acpi/acpi_extlog.c | 6 ++++++
drivers/ras/ras.c | 1 +
2 files changed, 7 insertions(+)
diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
index f7fb7205028d..caca6ccd6e99 100644
--- a/drivers/acpi/acpi_extlog.c
+++ b/drivers/acpi/acpi_extlog.c
@@ -182,6 +182,12 @@ static int extlog_print(struct notifier_block *nb, unsigned long val,
if (gdata->error_data_length >= sizeof(*mem))
trace_extlog_mem_event(mem, err_seq, fru_id, fru_text,
(u8)gdata->error_severity);
+ } else {
+ void *err = acpi_hest_get_payload(gdata);
+
+ log_non_standard_event(sec_type, fru_id, fru_text,
+ gdata->error_severity, err,
+ gdata->error_data_length);
}
}
diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index a6e4792a1b2e..ac0e132ccc3e 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -51,6 +51,7 @@ void log_non_standard_event(const guid_t *sec_type, const guid_t *fru_id,
{
trace_non_standard_event(sec_type, fru_id, fru_text, sev, err, len);
}
+EXPORT_SYMBOL_GPL(log_non_standard_event);
void log_arm_hw_error(struct cper_sec_proc_arm *err)
{
--
2.48.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/4 v2] PCI/AER: Modify pci_print_aer() to take log level
2025-04-29 17:21 [PATCH 0/4 v2] Make ELOG log and trace consistently with GHES Fabio M. De Francesco
2025-04-29 17:21 ` [PATCH 1/4 v2] ACPI: extlog: Trace CPER Non-standard Section Body Fabio M. De Francesco
@ 2025-04-29 17:21 ` Fabio M. De Francesco
2025-05-20 11:10 ` Jonathan Cameron
2025-04-29 17:21 ` [PATCH 3/4 v2] ACPI: extlog: Trace CPER PCI Express Error Section Fabio M. De Francesco
2025-04-29 17:21 ` [PATCH 4/4 v2] ACPI: extlog: Trace CPER CXL Protocol Errors Fabio M. De Francesco
3 siblings, 1 reply; 11+ messages in thread
From: Fabio M. De Francesco @ 2025-04-29 17:21 UTC (permalink / raw)
To: Rafael J . Wysocki, Len Brown, Davidlohr Bueso, Jonathan Cameron,
Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny,
Dan Williams, Mahesh J Salgaonkar, Oliver O'Halloran,
Bjorn Helgaas, Tony Luck, Borislav Petkov, linux-kernel,
linux-acpi, linux-cxl, linuxppc-dev, linux-pci, linux-edac
Cc: Fabio M. De Francesco
Modify pci_print_aer() to take a printk() log level in preparation of a
patch that logs PCIe Components and Link errors from ELOG.
Cc: Dan Williams <dan.j.williams@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
---
drivers/cxl/core/pci.c | 2 +-
drivers/pci/pcie/aer.c | 16 ++++++++--------
include/linux/aer.h | 4 ++--
3 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 3b80e9a76ba8..ad8d7939c2e1 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -885,7 +885,7 @@ static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
if (!cxl_rch_get_aer_severity(&aer_regs, &severity))
return;
- pci_print_aer(pdev, severity, &aer_regs);
+ pci_print_aer(KERN_ERR, pdev, severity, &aer_regs);
if (severity == AER_CORRECTABLE)
cxl_handle_rdport_cor_ras(cxlds, dport);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a1cf8c7ef628..d0ebf7c15afa 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -760,7 +760,7 @@ int cper_severity_to_aer(int cper_severity)
EXPORT_SYMBOL_GPL(cper_severity_to_aer);
#endif
-void pci_print_aer(struct pci_dev *dev, int aer_severity,
+void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
struct aer_capability_regs *aer)
{
int layer, agent, tlp_header_valid = 0;
@@ -785,14 +785,15 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
info.mask = mask;
info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
- pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+ pci_printk(level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n",
+ status, mask);
__aer_print_error(dev, &info);
- pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
- aer_error_layer[layer], aer_agent_string[agent]);
+ pci_printk(level, dev, "aer_layer=%s, aer_agent=%s\n",
+ aer_error_layer[layer], aer_agent_string[agent]);
if (aer_severity != AER_CORRECTABLE)
- pci_err(dev, "aer_uncor_severity: 0x%08x\n",
- aer->uncor_severity);
+ pci_printk(level, dev, "aer_uncor_severity: 0x%08x\n",
+ aer->uncor_severity);
if (tlp_header_valid)
pcie_print_tlp_log(dev, &aer->header_log, dev_fmt(" "));
@@ -1146,8 +1147,7 @@ static void aer_recover_work_func(struct work_struct *work)
PCI_SLOT(entry.devfn), PCI_FUNC(entry.devfn));
continue;
}
- pci_print_aer(pdev, entry.severity, entry.regs);
-
+ pci_print_aer(KERN_ERR, pdev, entry.severity, entry.regs);
/*
* Memory for aer_capability_regs(entry.regs) is being
* allocated from the ghes_estatus_pool to protect it from
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 02940be66324..45d0fb2e2e75 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -64,8 +64,8 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
#endif
-void pci_print_aer(struct pci_dev *dev, int aer_severity,
- struct aer_capability_regs *aer);
+void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
+ struct aer_capability_regs *aer);
int cper_severity_to_aer(int cper_severity);
void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
int severity, struct aer_capability_regs *aer_regs);
--
2.48.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 3/4 v2] ACPI: extlog: Trace CPER PCI Express Error Section
2025-04-29 17:21 [PATCH 0/4 v2] Make ELOG log and trace consistently with GHES Fabio M. De Francesco
2025-04-29 17:21 ` [PATCH 1/4 v2] ACPI: extlog: Trace CPER Non-standard Section Body Fabio M. De Francesco
2025-04-29 17:21 ` [PATCH 2/4 v2] PCI/AER: Modify pci_print_aer() to take log level Fabio M. De Francesco
@ 2025-04-29 17:21 ` Fabio M. De Francesco
2025-04-29 18:02 ` Yazen Ghannam
2025-04-29 17:21 ` [PATCH 4/4 v2] ACPI: extlog: Trace CPER CXL Protocol Errors Fabio M. De Francesco
3 siblings, 1 reply; 11+ messages in thread
From: Fabio M. De Francesco @ 2025-04-29 17:21 UTC (permalink / raw)
To: Rafael J . Wysocki, Len Brown, Davidlohr Bueso, Jonathan Cameron,
Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny,
Dan Williams, Mahesh J Salgaonkar, Oliver O'Halloran,
Bjorn Helgaas, Tony Luck, Borislav Petkov, linux-kernel,
linux-acpi, linux-cxl, linuxppc-dev, linux-pci, linux-edac
Cc: Fabio M. De Francesco
I/O Machine Check Arcitecture events may signal failing PCIe components
or links. The AER event contains details on what was happening on the wire
when the error was signaled.
Trace the CPER PCIe Error section (UEFI v2.10, Appendix N.2.7) reported
by the I/O MCA.
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
---
drivers/acpi/acpi_extlog.c | 30 ++++++++++++++++++++++++++++++
drivers/pci/pcie/aer.c | 2 +-
include/linux/aer.h | 13 +++++++++++--
3 files changed, 42 insertions(+), 3 deletions(-)
diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
index caca6ccd6e99..7d7a813169f1 100644
--- a/drivers/acpi/acpi_extlog.c
+++ b/drivers/acpi/acpi_extlog.c
@@ -131,6 +131,32 @@ static int print_extlog_rcd(const char *pfx,
return 1;
}
+static void extlog_print_pcie(struct cper_sec_pcie *pcie_err,
+ int severity)
+{
+ struct aer_capability_regs *aer;
+ struct pci_dev *pdev;
+ unsigned int devfn;
+ unsigned int bus;
+ int aer_severity;
+ int domain;
+
+ if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+ pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
+ aer_severity = cper_severity_to_aer(severity);
+ aer = (struct aer_capability_regs *)pcie_err->aer_info;
+ domain = pcie_err->device_id.segment;
+ bus = pcie_err->device_id.bus;
+ devfn = PCI_DEVFN(pcie_err->device_id.device,
+ pcie_err->device_id.function);
+ pdev = pci_get_domain_bus_and_slot(domain, bus, devfn);
+ if (!pdev)
+ return;
+ pci_print_aer(KERN_DEBUG, pdev, aer_severity, aer);
+ pci_dev_put(pdev);
+ }
+}
+
static int extlog_print(struct notifier_block *nb, unsigned long val,
void *data)
{
@@ -182,6 +208,10 @@ static int extlog_print(struct notifier_block *nb, unsigned long val,
if (gdata->error_data_length >= sizeof(*mem))
trace_extlog_mem_event(mem, err_seq, fru_id, fru_text,
(u8)gdata->error_severity);
+ } else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
+ struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+ extlog_print_pcie(pcie_err, gdata->error_severity);
} else {
void *err = acpi_hest_get_payload(gdata);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d0ebf7c15afa..627fcf434698 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -801,7 +801,7 @@ void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
trace_aer_event(dev_name(&dev->dev), (status & ~mask),
aer_severity, tlp_header_valid, &aer->header_log);
}
-EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
+EXPORT_SYMBOL_GPL(pci_print_aer);
/**
* add_error_device - list device to be handled
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 45d0fb2e2e75..737db92e6570 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -56,17 +56,26 @@ struct aer_capability_regs {
#if defined(CONFIG_PCIEAER)
int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
int pcie_aer_is_native(struct pci_dev *dev);
+void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
+ struct aer_capability_regs *aer);
#else
static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
{
return -EINVAL;
}
static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
+static inline void pci_print_aer(char *level, struct pci_dev *dev,
+ int aer_severity,
+ struct aer_capability_regs *aer)
+{ }
#endif
-void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
- struct aer_capability_regs *aer);
+#if defined(CONFIG_ACPI_APEI_PCIEAER)
int cper_severity_to_aer(int cper_severity);
+#else
+static inline int cper_severity_to_aer(int cper_severity) { return 0; }
+#endif
+
void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
int severity, struct aer_capability_regs *aer_regs);
#endif //_AER_H_
--
2.48.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 4/4 v2] ACPI: extlog: Trace CPER CXL Protocol Errors
2025-04-29 17:21 [PATCH 0/4 v2] Make ELOG log and trace consistently with GHES Fabio M. De Francesco
` (2 preceding siblings ...)
2025-04-29 17:21 ` [PATCH 3/4 v2] ACPI: extlog: Trace CPER PCI Express Error Section Fabio M. De Francesco
@ 2025-04-29 17:21 ` Fabio M. De Francesco
2025-04-29 18:20 ` Yazen Ghannam
3 siblings, 1 reply; 11+ messages in thread
From: Fabio M. De Francesco @ 2025-04-29 17:21 UTC (permalink / raw)
To: Rafael J . Wysocki, Len Brown, Davidlohr Bueso, Jonathan Cameron,
Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny,
Dan Williams, Mahesh J Salgaonkar, Oliver O'Halloran,
Bjorn Helgaas, Tony Luck, Borislav Petkov, linux-kernel,
linux-acpi, linux-cxl, linuxppc-dev, linux-pci, linux-edac
Cc: Fabio M. De Francesco
When Firmware First is enabled, BIOS handles errors first and then it
makes them available to the kernel via the Common Platform Error Record
(CPER) sections (UEFI 2.10 Appendix N). Linux parses the CPER sections
via one of two similar paths, either ELOG or GHES.
Currently, ELOG and GHES show some inconsistencies in how they report to
userspace via trace events.
Therfore make the two mentioned paths act similarly by tracing the CPER
CXL Protocol Error Section (UEFI v2.10, Appendix N.2.13) signaled by the
I/O Machine Check Architecture and reported by BIOS in FW-First.
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
---
drivers/acpi/acpi_extlog.c | 60 ++++++++++++++++++++++++++++++++++++++
drivers/cxl/core/ras.c | 6 ++++
include/cxl/event.h | 2 ++
3 files changed, 68 insertions(+)
diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
index 7d7a813169f1..8f2ff3505d47 100644
--- a/drivers/acpi/acpi_extlog.c
+++ b/drivers/acpi/acpi_extlog.c
@@ -12,6 +12,7 @@
#include <linux/ratelimit.h>
#include <linux/edac.h>
#include <linux/ras.h>
+#include <cxl/event.h>
#include <acpi/ghes.h>
#include <asm/cpu.h>
#include <asm/mce.h>
@@ -157,6 +158,60 @@ static void extlog_print_pcie(struct cper_sec_pcie *pcie_err,
}
}
+static void
+extlog_cxl_cper_handle_prot_err(struct cxl_cper_sec_prot_err *prot_err,
+ int severity)
+{
+#ifdef CONFIG_ACPI_APEI_PCIEAER
+ struct cxl_cper_prot_err_work_data wd;
+ u8 *dvsec_start, *cap_start;
+
+ if (!(prot_err->valid_bits & PROT_ERR_VALID_AGENT_ADDRESS)) {
+ pr_err_ratelimited("CXL CPER invalid agent type\n");
+ return;
+ }
+
+ if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) {
+ pr_err_ratelimited("CXL CPER invalid protocol error log\n");
+ return;
+ }
+
+ if (prot_err->err_len != sizeof(struct cxl_ras_capability_regs)) {
+ pr_err_ratelimited("CXL CPER invalid RAS Cap size (%u)\n",
+ prot_err->err_len);
+ return;
+ }
+
+ if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
+ pr_warn(FW_WARN "CXL CPER no device serial number\n");
+
+ switch (prot_err->agent_type) {
+ case RCD:
+ case DEVICE:
+ case LD:
+ case FMLD:
+ case RP:
+ case DSP:
+ case USP:
+ memcpy(&wd.prot_err, prot_err, sizeof(wd.prot_err));
+
+ dvsec_start = (u8 *)(prot_err + 1);
+ cap_start = dvsec_start + prot_err->dvsec_len;
+
+ memcpy(&wd.ras_cap, cap_start, sizeof(wd.ras_cap));
+ wd.severity = cper_severity_to_aer(severity);
+ break;
+ default:
+ pr_err_ratelimited("CXL CPER invalid agent type: %d\n",
+ prot_err->agent_type);
+ return;
+ }
+
+ cxl_cper_ras_handle_prot_err(&wd);
+
+#endif
+}
+
static int extlog_print(struct notifier_block *nb, unsigned long val,
void *data)
{
@@ -208,6 +263,10 @@ static int extlog_print(struct notifier_block *nb, unsigned long val,
if (gdata->error_data_length >= sizeof(*mem))
trace_extlog_mem_event(mem, err_seq, fru_id, fru_text,
(u8)gdata->error_severity);
+ } else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
+ struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
+
+ extlog_cxl_cper_handle_prot_err(prot_err, gdata->error_severity);
} else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
@@ -375,3 +434,4 @@ module_exit(extlog_exit);
MODULE_AUTHOR("Chen, Gong <gong.chen@intel.com>");
MODULE_DESCRIPTION("Extended MCA Error Log Driver");
MODULE_LICENSE("GPL");
+MODULE_IMPORT_NS("CXL");
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 485a831695c7..56db290c88d3 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -98,6 +98,12 @@ static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
cxl_cper_trace_uncorr_prot_err(pdev, data->ras_cap);
}
+void cxl_cper_ras_handle_prot_err(struct cxl_cper_prot_err_work_data *wd)
+{
+ cxl_cper_handle_prot_err(wd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_ras_handle_prot_err, "CXL");
+
static void cxl_cper_prot_err_work_fn(struct work_struct *work)
{
struct cxl_cper_prot_err_work_data wd;
diff --git a/include/cxl/event.h b/include/cxl/event.h
index f9ae1796da85..aef906e26033 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -285,4 +285,6 @@ static inline int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data
}
#endif
+void cxl_cper_ras_handle_prot_err(struct cxl_cper_prot_err_work_data *wd);
+
#endif /* _LINUX_CXL_EVENT_H */
--
2.48.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 3/4 v2] ACPI: extlog: Trace CPER PCI Express Error Section
2025-04-29 17:21 ` [PATCH 3/4 v2] ACPI: extlog: Trace CPER PCI Express Error Section Fabio M. De Francesco
@ 2025-04-29 18:02 ` Yazen Ghannam
2025-06-02 16:59 ` Fabio M. De Francesco
0 siblings, 1 reply; 11+ messages in thread
From: Yazen Ghannam @ 2025-04-29 18:02 UTC (permalink / raw)
To: Fabio M. De Francesco
Cc: Rafael J . Wysocki, Len Brown, Davidlohr Bueso, Jonathan Cameron,
Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny,
Dan Williams, Mahesh J Salgaonkar, Oliver O'Halloran,
Bjorn Helgaas, Tony Luck, Borislav Petkov, linux-kernel,
linux-acpi, linux-cxl, linuxppc-dev, linux-pci, linux-edac
On Tue, Apr 29, 2025 at 07:21:08PM +0200, Fabio M. De Francesco wrote:
> I/O Machine Check Arcitecture events may signal failing PCIe components
> or links. The AER event contains details on what was happening on the wire
> when the error was signaled.
>
> Trace the CPER PCIe Error section (UEFI v2.10, Appendix N.2.7) reported
> by the I/O MCA.
>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
> ---
> drivers/acpi/acpi_extlog.c | 30 ++++++++++++++++++++++++++++++
> drivers/pci/pcie/aer.c | 2 +-
> include/linux/aer.h | 13 +++++++++++--
> 3 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
> index caca6ccd6e99..7d7a813169f1 100644
> --- a/drivers/acpi/acpi_extlog.c
> +++ b/drivers/acpi/acpi_extlog.c
> @@ -131,6 +131,32 @@ static int print_extlog_rcd(const char *pfx,
> return 1;
> }
>
> +static void extlog_print_pcie(struct cper_sec_pcie *pcie_err,
> + int severity)
> +{
> + struct aer_capability_regs *aer;
> + struct pci_dev *pdev;
> + unsigned int devfn;
> + unsigned int bus;
> + int aer_severity;
> + int domain;
> +
> + if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> + pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
You can save an indentation level by inverting this check and returning
early.
> + aer_severity = cper_severity_to_aer(severity);
I think it would help with clarity if all these lines were aligned on
the "=".
> + aer = (struct aer_capability_regs *)pcie_err->aer_info;
> + domain = pcie_err->device_id.segment;
> + bus = pcie_err->device_id.bus;
Many of these variables are passed unchanged to a single function below.
Why not pass them directly to the function?
Even if you split the function parameters across multiple lines, you
will still have fewer lines. Plus you will not need to allocate the
variables.
> + devfn = PCI_DEVFN(pcie_err->device_id.device,
> + pcie_err->device_id.function);
> + pdev = pci_get_domain_bus_and_slot(domain, bus, devfn);
> + if (!pdev)
> + return;
Newline here, please.
> + pci_print_aer(KERN_DEBUG, pdev, aer_severity, aer);
Why use a debug log level?
> + pci_dev_put(pdev);
> + }
> +}
> +
> static int extlog_print(struct notifier_block *nb, unsigned long val,
> void *data)
> {
> @@ -182,6 +208,10 @@ static int extlog_print(struct notifier_block *nb, unsigned long val,
> if (gdata->error_data_length >= sizeof(*mem))
> trace_extlog_mem_event(mem, err_seq, fru_id, fru_text,
> (u8)gdata->error_severity);
> + } else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
> + struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> + extlog_print_pcie(pcie_err, gdata->error_severity);
> } else {
> void *err = acpi_hest_get_payload(gdata);
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index d0ebf7c15afa..627fcf434698 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -801,7 +801,7 @@ void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
> trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> aer_severity, tlp_header_valid, &aer->header_log);
> }
> -EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
> +EXPORT_SYMBOL_GPL(pci_print_aer);
>
> /**
> * add_error_device - list device to be handled
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 45d0fb2e2e75..737db92e6570 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -56,17 +56,26 @@ struct aer_capability_regs {
> #if defined(CONFIG_PCIEAER)
> int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
> int pcie_aer_is_native(struct pci_dev *dev);
> +void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
> + struct aer_capability_regs *aer);
> #else
> static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
> {
> return -EINVAL;
> }
> static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
> +static inline void pci_print_aer(char *level, struct pci_dev *dev,
> + int aer_severity,
> + struct aer_capability_regs *aer)
> +{ }
I think the "{ }" can just go at the end of the parameters.
> #endif
>
> -void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
> - struct aer_capability_regs *aer);
> +#if defined(CONFIG_ACPI_APEI_PCIEAER)
> int cper_severity_to_aer(int cper_severity);
> +#else
> +static inline int cper_severity_to_aer(int cper_severity) { return 0; }
This may have an unintentional side effect.
'0' means AER_NONFATAL.
So the function will return that the error is an uncorrectable AER error
that is potentially recoverable. At a minimum, this will incorrectly
classify the error for data collection, and it could cause incorrect
handling.
I guess the risk is minimal, since CONFIG_ACPI_APEI_PCIEAER will likely
be enabled on systems that would use this.
Thanks,
Yazen
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 4/4 v2] ACPI: extlog: Trace CPER CXL Protocol Errors
2025-04-29 17:21 ` [PATCH 4/4 v2] ACPI: extlog: Trace CPER CXL Protocol Errors Fabio M. De Francesco
@ 2025-04-29 18:20 ` Yazen Ghannam
2025-06-02 18:06 ` Fabio M. De Francesco
0 siblings, 1 reply; 11+ messages in thread
From: Yazen Ghannam @ 2025-04-29 18:20 UTC (permalink / raw)
To: Fabio M. De Francesco
Cc: Rafael J . Wysocki, Len Brown, Davidlohr Bueso, Jonathan Cameron,
Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny,
Dan Williams, Mahesh J Salgaonkar, Oliver O'Halloran,
Bjorn Helgaas, Tony Luck, Borislav Petkov, linux-kernel,
linux-acpi, linux-cxl, linuxppc-dev, linux-pci, linux-edac
On Tue, Apr 29, 2025 at 07:21:09PM +0200, Fabio M. De Francesco wrote:
> When Firmware First is enabled, BIOS handles errors first and then it
> makes them available to the kernel via the Common Platform Error Record
> (CPER) sections (UEFI 2.10 Appendix N). Linux parses the CPER sections
> via one of two similar paths, either ELOG or GHES.
>
> Currently, ELOG and GHES show some inconsistencies in how they report to
> userspace via trace events.
>
> Therfore make the two mentioned paths act similarly by tracing the CPER
> CXL Protocol Error Section (UEFI v2.10, Appendix N.2.13) signaled by the
> I/O Machine Check Architecture and reported by BIOS in FW-First.
>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
> ---
> drivers/acpi/acpi_extlog.c | 60 ++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/ras.c | 6 ++++
> include/cxl/event.h | 2 ++
> 3 files changed, 68 insertions(+)
>
> diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
> index 7d7a813169f1..8f2ff3505d47 100644
> --- a/drivers/acpi/acpi_extlog.c
> +++ b/drivers/acpi/acpi_extlog.c
> @@ -12,6 +12,7 @@
> #include <linux/ratelimit.h>
> #include <linux/edac.h>
> #include <linux/ras.h>
> +#include <cxl/event.h>
> #include <acpi/ghes.h>
> #include <asm/cpu.h>
> #include <asm/mce.h>
> @@ -157,6 +158,60 @@ static void extlog_print_pcie(struct cper_sec_pcie *pcie_err,
> }
> }
>
> +static void
> +extlog_cxl_cper_handle_prot_err(struct cxl_cper_sec_prot_err *prot_err,
> + int severity)
> +{
> +#ifdef CONFIG_ACPI_APEI_PCIEAER
Why not apply this check on the function prototype?
Reference: Documentation/process/coding-style.rst
Section 21) Conditional Compilation
> + struct cxl_cper_prot_err_work_data wd;
> + u8 *dvsec_start, *cap_start;
> +
> + if (!(prot_err->valid_bits & PROT_ERR_VALID_AGENT_ADDRESS)) {
> + pr_err_ratelimited("CXL CPER invalid agent type\n");
> + return;
> + }
> +
> + if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) {
> + pr_err_ratelimited("CXL CPER invalid protocol error log\n");
> + return;
> + }
> +
> + if (prot_err->err_len != sizeof(struct cxl_ras_capability_regs)) {
> + pr_err_ratelimited("CXL CPER invalid RAS Cap size (%u)\n",
> + prot_err->err_len);
> + return;
> + }
> +
> + if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
> + pr_warn(FW_WARN "CXL CPER no device serial number\n");
Is this a requirement (in the spec) that we should warn users about?
The UEFI spec says that serial number is only used if "CXL agent" is a
"CXL device".
"CXL ports" won't have serial numbers. So this will be a false warning
for port errors.
> +
> + switch (prot_err->agent_type) {
> + case RCD:
> + case DEVICE:
> + case LD:
> + case FMLD:
> + case RP:
> + case DSP:
> + case USP:
> + memcpy(&wd.prot_err, prot_err, sizeof(wd.prot_err));
> +
> + dvsec_start = (u8 *)(prot_err + 1);
> + cap_start = dvsec_start + prot_err->dvsec_len;
> +
> + memcpy(&wd.ras_cap, cap_start, sizeof(wd.ras_cap));
> + wd.severity = cper_severity_to_aer(severity);
> + break;
> + default:
> + pr_err_ratelimited("CXL CPER invalid agent type: %d\n",
"invalid" is too harsh given that the specs may be updated. Maybe say
"reserved" or "unknown" or "unrecognized" instead.
Hopefully things will settle down to where a user will be able to have a
system with newer CXL "agents" without *requiring* a kernel update. :)
Thanks,
Yazen
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/4 v2] ACPI: extlog: Trace CPER Non-standard Section Body
2025-04-29 17:21 ` [PATCH 1/4 v2] ACPI: extlog: Trace CPER Non-standard Section Body Fabio M. De Francesco
@ 2025-05-20 11:08 ` Jonathan Cameron
0 siblings, 0 replies; 11+ messages in thread
From: Jonathan Cameron @ 2025-05-20 11:08 UTC (permalink / raw)
To: Fabio M. De Francesco
Cc: Rafael J . Wysocki, Len Brown, Davidlohr Bueso, Dave Jiang,
Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
Tony Luck, Borislav Petkov, linux-kernel, linux-acpi, linux-cxl,
linuxppc-dev, linux-pci, linux-edac
On Tue, 29 Apr 2025 19:21:06 +0200
"Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> wrote:
> ghes_do_proc() has a catch-all for unknown or unhandled CPER formats
> (UEFI v2.10 Appendix N 2.3), extlog_print() does not. This gap was
> noticed by a RAS test that injected CXL protocol errors which were
> notified to extlog_print() via the IOMCA (I/O Machine Check
> Architecture) mechanism. Bring parity to the extlog_print() path by
> including a similar log_non_standard_event().
>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
Makes sense to me.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
> drivers/acpi/acpi_extlog.c | 6 ++++++
> drivers/ras/ras.c | 1 +
> 2 files changed, 7 insertions(+)
>
> diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
> index f7fb7205028d..caca6ccd6e99 100644
> --- a/drivers/acpi/acpi_extlog.c
> +++ b/drivers/acpi/acpi_extlog.c
> @@ -182,6 +182,12 @@ static int extlog_print(struct notifier_block *nb, unsigned long val,
> if (gdata->error_data_length >= sizeof(*mem))
> trace_extlog_mem_event(mem, err_seq, fru_id, fru_text,
> (u8)gdata->error_severity);
> + } else {
> + void *err = acpi_hest_get_payload(gdata);
> +
> + log_non_standard_event(sec_type, fru_id, fru_text,
> + gdata->error_severity, err,
> + gdata->error_data_length);
> }
> }
>
> diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
> index a6e4792a1b2e..ac0e132ccc3e 100644
> --- a/drivers/ras/ras.c
> +++ b/drivers/ras/ras.c
> @@ -51,6 +51,7 @@ void log_non_standard_event(const guid_t *sec_type, const guid_t *fru_id,
> {
> trace_non_standard_event(sec_type, fru_id, fru_text, sev, err, len);
> }
> +EXPORT_SYMBOL_GPL(log_non_standard_event);
>
> void log_arm_hw_error(struct cper_sec_proc_arm *err)
> {
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/4 v2] PCI/AER: Modify pci_print_aer() to take log level
2025-04-29 17:21 ` [PATCH 2/4 v2] PCI/AER: Modify pci_print_aer() to take log level Fabio M. De Francesco
@ 2025-05-20 11:10 ` Jonathan Cameron
0 siblings, 0 replies; 11+ messages in thread
From: Jonathan Cameron @ 2025-05-20 11:10 UTC (permalink / raw)
To: Fabio M. De Francesco
Cc: Rafael J . Wysocki, Len Brown, Davidlohr Bueso, Dave Jiang,
Alison Schofield, Vishal Verma, Ira Weiny, Dan Williams,
Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
Tony Luck, Borislav Petkov, linux-kernel, linux-acpi, linux-cxl,
linuxppc-dev, linux-pci, linux-edac
On Tue, 29 Apr 2025 19:21:07 +0200
"Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com> wrote:
> Modify pci_print_aer() to take a printk() log level in preparation of a
> patch that logs PCIe Components and Link errors from ELOG.
>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
Fair enough on reasoning and it does what you describe so
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/4 v2] ACPI: extlog: Trace CPER PCI Express Error Section
2025-04-29 18:02 ` Yazen Ghannam
@ 2025-06-02 16:59 ` Fabio M. De Francesco
0 siblings, 0 replies; 11+ messages in thread
From: Fabio M. De Francesco @ 2025-06-02 16:59 UTC (permalink / raw)
To: Yazen Ghannam
Cc: Rafael J . Wysocki, Len Brown, Davidlohr Bueso, Jonathan Cameron,
Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny,
Dan Williams, Mahesh J Salgaonkar, Oliver O'Halloran,
Bjorn Helgaas, Tony Luck, Borislav Petkov, linux-kernel,
linux-acpi, linux-cxl, linuxppc-dev, linux-pci, linux-edac
On Tuesday, April 29, 2025 8:02:42 PM Central European Summer Time Yazen Ghannam wrote:
> On Tue, Apr 29, 2025 at 07:21:08PM +0200, Fabio M. De Francesco wrote:
> > I/O Machine Check Arcitecture events may signal failing PCIe components
> > or links. The AER event contains details on what was happening on the wire
> > when the error was signaled.
> >
> > Trace the CPER PCIe Error section (UEFI v2.10, Appendix N.2.7) reported
> > by the I/O MCA.
> >
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
> > ---
> > drivers/acpi/acpi_extlog.c | 30 ++++++++++++++++++++++++++++++
> > drivers/pci/pcie/aer.c | 2 +-
> > include/linux/aer.h | 13 +++++++++++--
> > 3 files changed, 42 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
> > index caca6ccd6e99..7d7a813169f1 100644
> > --- a/drivers/acpi/acpi_extlog.c
> > +++ b/drivers/acpi/acpi_extlog.c
> > @@ -131,6 +131,32 @@ static int print_extlog_rcd(const char *pfx,
> > return 1;
> > }
> >
> > +static void extlog_print_pcie(struct cper_sec_pcie *pcie_err,
> > + int severity)
> > +{
> > + struct aer_capability_regs *aer;
> > + struct pci_dev *pdev;
> > + unsigned int devfn;
> > + unsigned int bus;
> > + int aer_severity;
> > + int domain;
> > +
> > + if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> > + pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
>
> You can save an indentation level by inverting this check and returning
> early.
>
Nice idea, I'll do it.
>
> > + aer_severity = cper_severity_to_aer(severity);
>
> I think it would help with clarity if all these lines were aligned on
> the "=".
>
> > + aer = (struct aer_capability_regs *)pcie_err->aer_info;
> > + domain = pcie_err->device_id.segment;
> > + bus = pcie_err->device_id.bus;
>
> Many of these variables are passed unchanged to a single function below.
>
> Why not pass them directly to the function?
>
> Even if you split the function parameters across multiple lines, you
> will still have fewer lines. Plus you will not need to allocate the
> variables.
>
I think that the cost is minimal and readability is improved.
>
> > + devfn = PCI_DEVFN(pcie_err->device_id.device,
> > + pcie_err->device_id.function);
> > + pdev = pci_get_domain_bus_and_slot(domain, bus, devfn);
> > + if (!pdev)
> > + return;
>
> Newline here, please.
>
Sure.
>
> > + pci_print_aer(KERN_DEBUG, pdev, aer_severity, aer);
>
> Why use a debug log level?
>
Dan Williams suggested the debug log level commenting v1.
>
> > + pci_dev_put(pdev);
> > + }
> > +}
> > +
> > static int extlog_print(struct notifier_block *nb, unsigned long val,
> > void *data)
> > {
> > @@ -182,6 +208,10 @@ static int extlog_print(struct notifier_block *nb, unsigned long val,
> > if (gdata->error_data_length >= sizeof(*mem))
> > trace_extlog_mem_event(mem, err_seq, fru_id, fru_text,
> > (u8)gdata->error_severity);
> > + } else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
> > + struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> > +
> > + extlog_print_pcie(pcie_err, gdata->error_severity);
> > } else {
> > void *err = acpi_hest_get_payload(gdata);
> >
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index d0ebf7c15afa..627fcf434698 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -801,7 +801,7 @@ void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
> > trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> > aer_severity, tlp_header_valid, &aer->header_log);
> > }
> > -EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
> > +EXPORT_SYMBOL_GPL(pci_print_aer);
> >
> > /**
> > * add_error_device - list device to be handled
> > diff --git a/include/linux/aer.h b/include/linux/aer.h
> > index 45d0fb2e2e75..737db92e6570 100644
> > --- a/include/linux/aer.h
> > +++ b/include/linux/aer.h
> > @@ -56,17 +56,26 @@ struct aer_capability_regs {
> > #if defined(CONFIG_PCIEAER)
> > int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
> > int pcie_aer_is_native(struct pci_dev *dev);
> > +void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
> > + struct aer_capability_regs *aer);
> > #else
> > static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
> > {
> > return -EINVAL;
> > }
> > static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
> > +static inline void pci_print_aer(char *level, struct pci_dev *dev,
> > + int aer_severity,
> > + struct aer_capability_regs *aer)
> > +{ }
>
> I think the "{ }" can just go at the end of the parameters.
>
> > #endif
> >
> > -void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
> > - struct aer_capability_regs *aer);
> > +#if defined(CONFIG_ACPI_APEI_PCIEAER)
> > int cper_severity_to_aer(int cper_severity);
> > +#else
> > +static inline int cper_severity_to_aer(int cper_severity) { return 0; }
>
> This may have an unintentional side effect.
>
> '0' means AER_NONFATAL.
>
> So the function will return that the error is an uncorrectable AER error
> that is potentially recoverable. At a minimum, this will incorrectly
> classify the error for data collection, and it could cause incorrect
> handling.
>
> I guess the risk is minimal, since CONFIG_ACPI_APEI_PCIEAER will likely
> be enabled on systems that would use this.
>
Noted. Kconfig will select CONFIG_ACPI_APEI_PCIEAER.
>
> Thanks,
> Yazen
>
Thanks,
Fabio
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 4/4 v2] ACPI: extlog: Trace CPER CXL Protocol Errors
2025-04-29 18:20 ` Yazen Ghannam
@ 2025-06-02 18:06 ` Fabio M. De Francesco
0 siblings, 0 replies; 11+ messages in thread
From: Fabio M. De Francesco @ 2025-06-02 18:06 UTC (permalink / raw)
To: Yazen Ghannam
Cc: Rafael J . Wysocki, Len Brown, Davidlohr Bueso, Jonathan Cameron,
Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny,
Dan Williams, Mahesh J Salgaonkar, Oliver O'Halloran,
Bjorn Helgaas, Tony Luck, Borislav Petkov, linux-kernel,
linux-acpi, linux-cxl, linuxppc-dev, linux-pci, linux-edac
On Tuesday, April 29, 2025 8:20:55 PM Central European Summer Time Yazen Ghannam wrote:
> On Tue, Apr 29, 2025 at 07:21:09PM +0200, Fabio M. De Francesco wrote:
> > When Firmware First is enabled, BIOS handles errors first and then it
> > makes them available to the kernel via the Common Platform Error Record
> > (CPER) sections (UEFI 2.10 Appendix N). Linux parses the CPER sections
> > via one of two similar paths, either ELOG or GHES.
> >
> > Currently, ELOG and GHES show some inconsistencies in how they report to
> > userspace via trace events.
> >
> > Therfore make the two mentioned paths act similarly by tracing the CPER
> > CXL Protocol Error Section (UEFI v2.10, Appendix N.2.13) signaled by the
> > I/O Machine Check Architecture and reported by BIOS in FW-First.
> >
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
> > ---
> > drivers/acpi/acpi_extlog.c | 60 ++++++++++++++++++++++++++++++++++++++
> > drivers/cxl/core/ras.c | 6 ++++
> > include/cxl/event.h | 2 ++
> > 3 files changed, 68 insertions(+)
> >
> > diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
> > index 7d7a813169f1..8f2ff3505d47 100644
> > --- a/drivers/acpi/acpi_extlog.c
> > +++ b/drivers/acpi/acpi_extlog.c
> > @@ -12,6 +12,7 @@
> > #include <linux/ratelimit.h>
> > #include <linux/edac.h>
> > #include <linux/ras.h>
> > +#include <cxl/event.h>
> > #include <acpi/ghes.h>
> > #include <asm/cpu.h>
> > #include <asm/mce.h>
> > @@ -157,6 +158,60 @@ static void extlog_print_pcie(struct cper_sec_pcie *pcie_err,
> > }
> > }
> >
> > +static void
> > +extlog_cxl_cper_handle_prot_err(struct cxl_cper_sec_prot_err *prot_err,
> > + int severity)
> > +{
> > +#ifdef CONFIG_ACPI_APEI_PCIEAER
>
> Why not apply this check on the function prototype?
>
This function is static.
>
> Reference: Documentation/process/coding-style.rst
> Section 21) Conditional Compilation
>
> > + struct cxl_cper_prot_err_work_data wd;
> > + u8 *dvsec_start, *cap_start;
> > +
> > + if (!(prot_err->valid_bits & PROT_ERR_VALID_AGENT_ADDRESS)) {
> > + pr_err_ratelimited("CXL CPER invalid agent type\n");
> > + return;
> > + }
> > +
> > + if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) {
> > + pr_err_ratelimited("CXL CPER invalid protocol error log\n");
> > + return;
> > + }
> > +
> > + if (prot_err->err_len != sizeof(struct cxl_ras_capability_regs)) {
> > + pr_err_ratelimited("CXL CPER invalid RAS Cap size (%u)\n",
> > + prot_err->err_len);
> > + return;
> > + }
> > +
> > + if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
> > + pr_warn(FW_WARN "CXL CPER no device serial number\n");
>
> Is this a requirement (in the spec) that we should warn users about?
>
> The UEFI spec says that serial number is only used if "CXL agent" is a
> "CXL device".
>
> "CXL ports" won't have serial numbers. So this will be a false warning
> for port errors.
>
I'll add a test and print that warning only if agent is a device (RCD,
DEVICE, LD, FMLD).
>
> > +
> > + switch (prot_err->agent_type) {
> > + case RCD:
> > + case DEVICE:
> > + case LD:
> > + case FMLD:
> > + case RP:
> > + case DSP:
> > + case USP:
> > + memcpy(&wd.prot_err, prot_err, sizeof(wd.prot_err));
> > +
> > + dvsec_start = (u8 *)(prot_err + 1);
> > + cap_start = dvsec_start + prot_err->dvsec_len;
> > +
> > + memcpy(&wd.ras_cap, cap_start, sizeof(wd.ras_cap));
> > + wd.severity = cper_severity_to_aer(severity);
> > + break;
> > + default:
> > + pr_err_ratelimited("CXL CPER invalid agent type: %d\n",
>
> "invalid" is too harsh given that the specs may be updated. Maybe say
> "reserved" or "unknown" or "unrecognized" instead.
>
> Hopefully things will settle down to where a user will be able to have a
> system with newer CXL "agents" without *requiring* a kernel update. :)
>
I'll replace "invalid" with "unknown".
>
> Thanks,
> Yazen
>
Thanks,
Fabio
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2025-06-02 18:11 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-29 17:21 [PATCH 0/4 v2] Make ELOG log and trace consistently with GHES Fabio M. De Francesco
2025-04-29 17:21 ` [PATCH 1/4 v2] ACPI: extlog: Trace CPER Non-standard Section Body Fabio M. De Francesco
2025-05-20 11:08 ` Jonathan Cameron
2025-04-29 17:21 ` [PATCH 2/4 v2] PCI/AER: Modify pci_print_aer() to take log level Fabio M. De Francesco
2025-05-20 11:10 ` Jonathan Cameron
2025-04-29 17:21 ` [PATCH 3/4 v2] ACPI: extlog: Trace CPER PCI Express Error Section Fabio M. De Francesco
2025-04-29 18:02 ` Yazen Ghannam
2025-06-02 16:59 ` Fabio M. De Francesco
2025-04-29 17:21 ` [PATCH 4/4 v2] ACPI: extlog: Trace CPER CXL Protocol Errors Fabio M. De Francesco
2025-04-29 18:20 ` Yazen Ghannam
2025-06-02 18:06 ` Fabio M. De Francesco
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).